Re: [FFmpeg-devel] [RFC] New swscale internal design prototype

From: Niklas Haas <ffmpeg@haasn.xyz>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
Date: Wed, 12 Mar 2025 12:27:02 +0100
Message-ID: <20250312122702.GB6345@haasn.xyz>
In-Reply-To: <a0321e3-acb5-8d72-755c-8e74824e1f9a@martin.st>

On Wed, 12 Mar 2025 09:15:22 +0200 Martin Storsjö <martin@martin.st> wrote:
> On Wed, 12 Mar 2025, Niklas Haas wrote:
>
> > On Sun, 09 Mar 2025 20:45:23 +0100 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> >> On Sun, 09 Mar 2025 18:11:54 +0200 Martin Storsjö <martin@martin.st> wrote:
> >> > On Sat, 8 Mar 2025, Niklas Haas wrote:
> >> >
> >> > > What are the thoughts on the float-first approach?
> >> >
> >> > In general, for modern architectures, relying on floats probably is
> >> > reasonable. (On architectures that aren't of quite as widespread interest,
> >> > it might not be so clear cut though.)
> >> >
> >> > However with the benchmark example you provided a couple of weeks ago, we
> >> > concluded that even on x86 on modern HW, floats were faster than int16
> >> > only in one case: When using Clang, not GCC, and when compiling with
> >> > -mavx2, not without it. In all the other cases, int16 was faster than
> >> > float.
> >>
> >> Hi Martin,
> >>
> >> I should preface that this particular benchmark was a very specific test for
> >> floating point *filtering*, which is considerably more punishing than the
> >> conversion pipeline I have implemented here, and I think it's partly the
> >> fault of compilers generating very suboptimal filtering code.
> >>
> >> I think it would be better to re-assess using the current prototype on actual
> >> hardware. I threw up a quick NEON test branch: (untested, should hopefully work)
> >> https://github.com/haasn/FFmpeg/commits/swscale3-neon
> >>
> >> # adjust the benchmark iters count as needed based on the HW perf
> >> make libswscale/tests/swscale && libswscale/tests/swscale -unscaled 1 -bench 50
> >>
> >> If this differs significantly from the ~1.8x speedup I measure on x86, I
> >> will be far more concerned about the new approach.
>
> Sorry, I haven't had time to try this out myself yet...

No worries. I think I have gathered enough performance figures myself to
conclude that this approach won't work unmodified. The problem is not so much
the use of floats as the load/store overhead, which in very simple scenarios
is expensive enough to outweigh the benefits.
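
As a rough illustration of the load/store point (a sketch under assumptions,
not the prototype's actual code): suppose each operation is a separate kernel
that reads and writes an intermediate float chunk buffer. For a trivial
conversion, most of the work is then moving data in and out of that buffer.
The names read_u8, scale_f32, write_u8, convert_chained, convert_fused and the
CHUNK size are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 32 /* hypothetical chunk size, in pixels */

/* Chained design: every kernel loads from and stores to a chunk buffer. */
static void read_u8(float *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (float) src[i];
}

static void scale_f32(float *buf, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] *= k;
}

static void write_u8(uint8_t *dst, const float *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t) (buf[i] + 0.5f); /* round to nearest */
}

/* Three passes over the chunk buffer per chunk of input. */
void convert_chained(uint8_t *dst, const uint8_t *src, size_t n, float k)
{
    float tmp[CHUNK];
    assert(dst && src);
    for (size_t pos = 0; pos < n; pos += CHUNK) {
        size_t len = n - pos < CHUNK ? n - pos : CHUNK;
        read_u8(tmp, src + pos, len);
        scale_f32(tmp, k, len);
        write_u8(dst + pos, tmp, len);
    }
}

/* Fused equivalent: one load and one store per pixel, no intermediate. */
void convert_fused(uint8_t *dst, const uint8_t *src, size_t n, float k)
{
    assert(dst && src);
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t) ((float) src[i] * k + 0.5f);
}
```

Both functions compute the same result; the chained version just touches the
intermediate buffer three times per chunk, which is the kind of fixed cost
that dominates when the operation itself is this cheap. It assumes k is chosen
so that the scaled values stay within the 0..255 range.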

I think my plan for now is the following:

1. Delete all of the "optimized" variants of the C templates, and keep only the
   general purpose base case merely as a fallback / reference code.

2. Instead, make the architectural split at a higher level, and allow
   arch-specific implementations to choose their own preferred chunk size, or
   even do something wildly different like runtime code generation or custom
   calling conventions.

3. Merge the new code for now, guarded behind an explicit opt-in flag, so we
   can continue to develop it alongside the existing approach until
   arch-specific optimized variants are available and sufficiently fast in
   _all_ cases.
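
One possible shape for the higher-level split in point 2, purely as a sketch:
each backend advertises its own preferred chunk size and entry point, and a
generic C reference remains as the fallback. Everything here is hypothetical
(the Backend struct, CPU_FLAG_WIDE, pick_backend, run, and the byte-inversion
operation used as a stand-in); it is not the actual swscale interface.

```c
#include <stddef.h>
#include <stdint.h>

#define CPU_FLAG_WIDE 1u /* hypothetical CPU feature bit */

typedef struct Backend {
    const char *name;
    unsigned required_flags; /* CPU features this backend needs */
    size_t chunk;            /* preferred pixels per call, backend's choice */
    void (*process)(uint8_t *dst, const uint8_t *src, size_t n);
} Backend;

/* Generic reference implementation: works for any n. */
static void invert_c(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = (uint8_t) (255 - src[i]);
}

/* Stand-in for a hand-written SIMD kernel; expects n to be a multiple of 16. */
static void invert_wide(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16)
        for (size_t j = 0; j < 16; j++)
            dst[i + j] = (uint8_t) (255 - src[i + j]);
}

/* Ordered by preference; the last entry has no requirements. */
static const Backend backends[] = {
    { "wide", CPU_FLAG_WIDE, 16, invert_wide },
    { "c",    0,              1, invert_c    },
};

/* Return the first backend whose requirements are met by cpu_flags. */
const Backend *pick_backend(unsigned cpu_flags)
{
    size_t count = sizeof(backends) / sizeof(backends[0]);
    for (size_t i = 0; i < count; i++) {
        if ((backends[i].required_flags & cpu_flags) == backends[i].required_flags)
            return &backends[i];
    }
    return &backends[count - 1]; /* not reached: "c" always matches */
}

/* Run whole chunks through the chosen backend, the tail through the C code. */
void run(uint8_t *dst, const uint8_t *src, size_t n, unsigned cpu_flags)
{
    const Backend *b = pick_backend(cpu_flags);
    size_t body = n - n % b->chunk;
    if (body)
        b->process(dst, src, body);
    if (n > body)
        invert_c(dst + body, src + body, n - body);
}
```

With this layout the C code never needs "optimized" template variants: a
backend that wants a different chunk size, or a completely different strategy
behind its process pointer, just registers itself with different values.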

>
> > I gave it a try. So, the result of a naive/blind run on a Cortex-X1 using clang
> > version 20.0.0 (from the latest Android NDK v29) is:
> >
> > Overall speedup=1.688x faster, min=0.141x max=45.898x
> >
> > This has quite a lot more significant speed regressions compared to x86 though.
> >
> > In particular, clang/LLVM refuses to vectorize packed reads of 2 or 3 elements,
> > so any sort of operation involving rgb24 or bgr24 suffers horribly:
>
> So, if the performance of this relies on compiler autovectorization,
> what's the plan wrt GCC? We blanket disable autovectorization when
> compiling with GCC - see fd6dbc53855fbfc9a782095d0ffe11dd3a98905f for when
> it was disabled last time. Building and running fate with
> autovectorization in GCC does succeed at least on modern GCC on x86_64,
> but it's of course possible that it still can cause issues in various more
> tricky configurations.

See https://github.com/haasn/FFmpeg/blob/swscale3/libswscale/ops_internal.h#L28

>
> // Martin
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".