From: Niklas Haas <ffmpeg@haasn.xyz>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
Date: Wed, 12 Mar 2025 00:56:51 +0100
Message-ID: <20250312005651.GC800168@haasn.xyz>
In-Reply-To: <20250309204523.GC683063@haasn.xyz>

On Sun, 09 Mar 2025 20:45:23 +0100 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> On Sun, 09 Mar 2025 18:11:54 +0200 Martin Storsjö <martin@martin.st> wrote:
> > On Sat, 8 Mar 2025, Niklas Haas wrote:
> > 
> > > What are the thoughts on the float-first approach?
> > 
> > In general, for modern architectures, relying on floats probably is 
> > reasonable. (On architectures that aren't of quite as widespread interest, 
> > it might not be so clear cut though.)
> > 
> > However with the benchmark example you provided a couple of weeks ago, we 
> > concluded that even on x86 on modern HW, floats were faster than int16 
> > only in one case: When using Clang, not GCC, and when compiling with 
> > -mavx2, not without it. In all the other cases, int16 was faster than 
> > float.
> 
> Hi Martin,
> 
> I should preface that this particular benchmark was a very specific test of
> floating point *filtering*, which is considerably more punishing than the
> conversion pipeline I have implemented here; I think the result is partly the
> fault of compilers generating very suboptimal filtering code.
> 
> I think it would be better to re-assess using the current prototype on actual
> hardware. I threw up a quick NEON test branch (untested, but it should
> hopefully work):
> https://github.com/haasn/FFmpeg/commits/swscale3-neon
> 
> # adjust the benchmark iters count as needed based on the HW perf
> make libswscale/tests/swscale && libswscale/tests/swscale -unscaled 1 -bench 50
> 
> If this differs significantly from the ~1.8x speedup I measure on x86, I
> will be far more concerned about the new approach.

I gave it a try. So, the result of a naive/blind run on a Cortex-X1 using clang
version 20.0.0 (from the latest Android NDK v29) is:

Overall speedup=1.688x faster, min=0.141x max=45.898x

There are quite a few more significant speed regressions here than on x86, though.

In particular, clang/LLVM refuses to vectorize packed reads of 2 or 3 elements,
so any sort of operation involving rgb24 or bgr24 suffers horribly:

  Conversion pass for rgb24 -> rgba:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
    [ u8 ...X -> ++++] SWS_OP_CLEAR        : {_ _ _ 255}
    [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) packed >> 0
      (X = unused, + = exact, 0 = zero)
  rgb24 1920x1080 -> rgba 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=2856 us, ref=387 us, speedup=0.136x slower
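
For reference, something like the following is roughly the shape of
hand-written NEON code this case needs (an untested sketch, not taken from
the prototype): vld3/vst4 perform exactly the packed (de)interleaving that
clang refuses to autovectorize.

  /* Hypothetical rgb24 -> rgba kernel; assumes width is a multiple of 16,
   * a real kernel also needs a scalar tail loop. */
  #include <arm_neon.h>
  #include <stdint.h>

  static void rgb24_to_rgba_neon(const uint8_t *src, uint8_t *dst, int width)
  {
      uint8x16x4_t out;
      out.val[3] = vdupq_n_u8(255);                /* SWS_OP_CLEAR: A = 255 */
      for (int x = 0; x < width; x += 16) {
          uint8x16x3_t in = vld3q_u8(src + 3 * x); /* packed read, 3 elems */
          out.val[0] = in.val[0];
          out.val[1] = in.val[1];
          out.val[2] = in.val[2];
          vst4q_u8(dst + 4 * x, out);              /* packed write, 4 elems */
      }
  }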

Another thing LLVM seemingly does not optimize at all is integer shifts; these
also end up as horribly inefficient scalar code:

  Conversion pass for yuv444p -> yuv444p16le:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) planar >> 0
    [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> u16
    [u16 ...X -> +++X] SWS_OP_LSHIFT       : << 8
    [u16 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
      (X = unused, + = exact, 0 = zero)
  yuv444p 1920x1080 -> yuv444p16le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=1564 us, ref=590 us, speedup=0.377x slower
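
Again just as an illustration (untested, not from the prototype), the whole
inner loop here boils down to a widen plus a shift with intrinsics:

  /* Hypothetical yuv444p -> yuv444p16le per-plane loop; assumes width is
   * a multiple of 8, a real kernel needs a scalar tail. */
  #include <arm_neon.h>
  #include <stdint.h>

  static void expand_8to16_neon(const uint8_t *src, uint16_t *dst, int width)
  {
      for (int x = 0; x < width; x += 8) {
          uint16x8_t v = vmovl_u8(vld1_u8(src + x)); /* CONVERT u8 -> u16 */
          vst1q_u16(dst + x, vshlq_n_u16(v, 8));     /* LSHIFT << 8 */
      }
  }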

On the other hand, float performance does not seem to be an issue here:

  Conversion pass for rgba -> yuv444p:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 4 elem(s) packed >> 0
    [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> f32
    [f32 ...X -> ...X] SWS_OP_LINEAR       : matrix3+off3 [[0.256788 0.504129 0.097906 0 16] [-0.148223 -0.290993 112/255 0 128] [112/255 -0.367788 -0.071427 0 128] [0 0 0 1 0]]
    [f32 ...X -> ...X] SWS_OP_DITHER       : 16x16 matrix
    [f32 ...X -> ...X] SWS_OP_CLAMP        : 0 <= x <= {255 255 255 _}
    [f32 ...X -> +++X] SWS_OP_CONVERT      : f32 -> u8
    [ u8 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
      (X = unused, + = exact, 0 = zero)
  rgba 1920x1080 -> yuv444p 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=4074 us, ref=6987 us, speedup=1.715x faster
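
To illustrate why, the LINEAR step is just straight-line multiplies and FMAs
over planar floats; e.g. the luma row alone (coefficients taken from the
trace above) looks like this, and clang autovectorizes loops of this shape
nicely:

  /* Illustrative only: one row of the 3x3 matrix + offset from the trace. */
  static void rgb_to_y_float(const float *r, const float *g, const float *b,
                             float *y, int width)
  {
      for (int x = 0; x < width; x++)
          y[x] = 0.256788f * r[x] + 0.504129f * g[x] +
                 0.097906f * b[x] + 16.0f;
  }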

In summary, from what I can tell, on all platforms tested so far the two most
important ASM routines to focus on are:

1. packed reads/writes
2. integer shifts

Compilers seem to have a very hard time generating good code for these. On
the other hand, simple floating point FMAs and planar reads/writes are
handled quite well as is.

> 
> > After doing those benchmarks, my understanding was that you concluded that 
> > we probably need to keep int16 based codepaths still, then.
> 
> This may have been a misunderstanding. While I think we should keep the option
> of using fixed point precision *open*, the main take-away for me was that we
> will definitely need to transition to custom SIMD, since we cannot rely on the
> compiler to generate good code for us.
> 
> > Did something fundamental come up since we did these benchmarks that 
> > changed your conclusion?
> > 
> > // Martin
> > 
