From: Niklas Haas <ffmpeg@haasn.xyz>
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
Date: Sun, 9 Mar 2025 22:13:49 +0100
Message-ID: <20250309221349.GG683063@haasn.xyz>
In-Reply-To: <20250308235342.GB669161@haasn.xyz>

On Sat, 08 Mar 2025 23:53:42 +0100 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> Hi all,
> 
> for the past two months, I have been working on a prototype for a radical
> redesign of the swscale internals, specifically the format handling layer.
> This includes, or will eventually expand to include, all format input/output
> and unscaled special conversion steps.
> 
> I am not yet at a point where the new code can replace the scaling kernels,
> but in theory, we could start using it for the simple unscaled cases right
> away.
> 
> Rather than repeating my entire design document here, I opted to collect my
> notes into a design document on my WIP branch:
> 
> https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
> 
> I have spent the past week or so ironing out the last kinks and extensively
> benchmarking the new design, at least on x86. It is generally a roughly 1.9x
> improvement over the existing unscaled special converters across the board,
> before adding any hand-written assembly. (This speedup comes *just* from the
> less-than-optimal compiler output of my reference C code!)
> 
> In some cases we measure ~3-4x or even ~6x speedups, especially in those
> cases where swscale does not currently have hand-written SIMD. Overall:
> 
> cpu: 16-core AMD Ryzen Threadripper 1950X
> gcc 14.2.1:
>    single thread:
>      Overall speedup=1.887x faster, min=0.250x max=22.578x
>    multi thread:
>      Overall speedup=1.657x faster, min=0.190x max=87.972x

I was asked to substantiate these figures with more practical and relevant
examples. Apologies in advance for the wall of text, but I felt the need to be
thorough. The most important information is up-front.

Methodology: sorting pixel formats by how often they occur inside FFmpeg's
internal code, and excluding subsampled formats, the most popular are:

    347 AV_PIX_FMT_YUV444P
    331 AV_PIX_FMT_GRAY8
    281 AV_PIX_FMT_RGB24
    235 AV_PIX_FMT_BGR24
    232 AV_PIX_FMT_GBRP
    220 AV_PIX_FMT_RGBA
    190 AV_PIX_FMT_YUV444P10
    185 AV_PIX_FMT_BGRA
    184 AV_PIX_FMT_GBRAP
    177 AV_PIX_FMT_YUVJ444P
    172 AV_PIX_FMT_YUVA444P
    162 AV_PIX_FMT_YUV444P12
    150 AV_PIX_FMT_YUV444P16
    150 AV_PIX_FMT_GBRP10
    139 AV_PIX_FMT_GBRP12
    138 AV_PIX_FMT_ARGB
    131 AV_PIX_FMT_GRAY16
    129 AV_PIX_FMT_YUV444P9
    127 AV_PIX_FMT_ABGR
    119 AV_PIX_FMT_YUVA444P10
    115 AV_PIX_FMT_GBRP16
    113 AV_PIX_FMT_GRAY10
    111 AV_PIX_FMT_GBRP9
    109 AV_PIX_FMT_YUV444P14
  (remaining formats are used fewer than 100 times)

Across this reduced set of formats, the overall speedup (on my weaker, older
laptop) was:

  CPU: quad core AMD Ryzen 7 PRO 3700U w/ Radeon Vega Mobile Gfx (-MT MCP-)
  Overall speedup=1.666x faster, min=0.431x max=5.819x

The biggest speedups were seen for anything involving gbrp:

  Conversion pass for bgra -> gbrp16le:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 4 elem(s) packed >> 0
    [ u8 ...X -> +++X] SWS_OP_SWIZZLE      : 1023
    [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> u16 (expand)
    [u16 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
      (X = unused, + = exact, 0 = zero)
  bgra 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1933 us, ref=8216 us, speedup=4.249x faster
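
To make the op chain concrete, here is a minimal scalar sketch of what it
computes per pixel. This is purely illustrative (the function name and
signature are mine, not the actual swscale API); "expand" replicates the
8-bit value into both bytes of the 16-bit result, i.e. multiplies by 257:

  #include <stdint.h>

  /* Illustrative scalar equivalent of READ -> SWIZZLE 1023 ->
   * CONVERT u8->u16 (expand) -> WRITE for one row of bgra -> gbrp16le. */
  static void bgra_to_gbrp16le_row(const uint8_t *src, uint16_t *g,
                                   uint16_t *b, uint16_t *r, int w)
  {
      for (int x = 0; x < w; x++) {
          const uint8_t B = src[4 * x + 0];
          const uint8_t G = src[4 * x + 1];
          const uint8_t R = src[4 * x + 2]; /* alpha byte stays unused (X) */
          g[x] = G * 257; /* 0xAB -> 0xABAB, bit-exact 8->16 expansion */
          b[x] = B * 257;
          r[x] = R * 257;
      }
  }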

  Conversion pass for gray -> gbrp:
    [ u8 XXXX -> +XXX] SWS_OP_READ         : 1 elem(s) packed >> 0
    [ u8 .XXX -> +++X] SWS_OP_SWIZZLE      : 0003
    [ u8 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
      (X = unused, + = exact, 0 = zero)
  gray 1920x1080 -> gbrp 1920x1080, flags=0 dither=1, SSIM {Y=0.999977 U=1.000000 V=1.000000 A=1.000000}
    time=868 us, ref=3510 us, speedup=4.039x faster

  Conversion pass for gbrp -> gbrp16le:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) planar >> 0
    [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> u16 (expand)
    [u16 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
      (X = unused, + = exact, 0 = zero)
  gbrp 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1724 us, ref=9489 us, speedup=5.505x faster

An honorable mention goes to reductions in plane count, which the optimizer
identifies as a no-op and turns into a refcopy / memcpy:

  yuva444p10le 1920x1080 -> yuv444p10le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=0 us, ref=2453 us, speedup=6072.812x faster
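
This can be (nearly) free because dropping the alpha plane involves no
per-pixel work at all: the remaining planes are bit-identical. A hedged
sketch of the idea (not the actual implementation):

  /* Sketch: yuva444p10le -> yuv444p10le just forwards the Y/U/V planes;
   * in practice this becomes a plane refcopy or memcpy, never a pixel loop. */
  for (int i = 0; i < 3; i++) {
      dst->data[i]     = src->data[i];
      dst->linesize[i] = src->linesize[i];
  }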

The worst slowdowns are currently those involving any sort of packed swizzle
for which dedicated MMX functions already exist:

  Conversion pass for bgr24 -> abgr:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
    [ u8 ...X -> X+++] SWS_OP_SWIZZLE      : 0012
    [ u8 X... -> ++++] SWS_OP_CLEAR        : {255 _ _ _}
    [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) packed >> 0
      (X = unused, + = exact, 0 = zero)
  bgr24 1920x1080 -> abgr 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1710 us, ref=826 us, speedup=0.483x slower

I have previously identified these as a particularly weak spot in the compiler
output, since no matter what C code I write, the result will always be roughly
0.5x compared to the existing hand-written MMX. That said, I also plan on taking
that existing MMX code and simply plugging it into the new architecture, which
should get rid of these last few slow cases.
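
For reference, the scalar C that the compiler has to vectorize here boils
down to a per-pixel byte shuffle, which hand-written SIMD handles with a
single pshufb-style instruction. A sketch (src/dst/w are the obvious row
inputs, not actual API names):

  /* Illustrative scalar loop for bgr24 -> abgr: insert alpha, keep BGR
   * order. Compilers rarely turn this into the byte shuffle it really is. */
  for (int x = 0; x < w; x++) {
      dst[4 * x + 0] = 255;            /* SWS_OP_CLEAR {255 _ _ _} */
      dst[4 * x + 1] = src[3 * x + 0]; /* B */
      dst[4 * x + 2] = src[3 * x + 1]; /* G */
      dst[4 * x + 3] = src[3 * x + 2]; /* R */
  }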

On the other hand, the generated code outperforms the existing architecture
in cases where the old code fails to provide a dedicated function, e.g.:

  Conversion pass for bgr24 -> argb:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
    [ u8 ...X -> X+++] SWS_OP_SWIZZLE      : 0210
    [ u8 X... -> ++++] SWS_OP_CLEAR        : {255 _ _ _}
    [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) packed >> 0
      (X = unused, + = exact, 0 = zero)
  bgr24 1920x1080 -> argb 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1685 us, ref=2646 us, speedup=1.570x faster

This demonstrates that the new general-purpose pipeline is faster than the
old general-purpose pipeline.

And lastly, here is a randomly chosen subset of the overall test results:

  bgra 1920x1080 -> yuv444p16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=4243 us, ref=5824 us, speedup=1.372x faster
  yuv444p12le 1920x1080 -> yuva444p 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=2985 us, ref=2813 us, speedup=0.942x slower
  gbrp10le 1920x1080 -> gbrp 1920x1080, flags=0 dither=1, SSIM {Y=0.999998 U=0.999987 V=1.000000 A=1.000000}
    time=4473 us, ref=9638 us, speedup=2.155x faster
  yuv444p10le 1920x1080 -> yuva444p10le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=2040 us, ref=3095 us, speedup=1.517x faster
  yuv444p10le 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=3855 us, ref=7277 us, speedup=1.888x faster
  gbrp 1920x1080 -> gbrap 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1059 us, ref=1032 us, speedup=0.975x slower
  argb 1920x1080 -> gray16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=1.000000 V=1.000000 A=1.000000}
    time=3113 us, ref=3697 us, speedup=1.187x faster
  yuv444p12le 1920x1080 -> gbrap 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=4066 us, ref=7141 us, speedup=1.756x faster
  gbrp 1920x1080 -> rgba 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1384 us, ref=3072 us, speedup=2.220x faster
  yuvj444p 1920x1080 -> argb 1920x1080, flags=0 dither=1, SSIM {Y=0.999980 U=0.999974 V=0.999987 A=1.000000}
    time=4777 us, ref=9294 us, speedup=1.946x faster
  yuvj444p 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999977 U=0.999978 V=0.999978 A=1.000000}
    time=3850 us, ref=7314 us, speedup=1.900x faster
  gray10le 1920x1080 -> gray16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999991 U=1.000000 V=1.000000 A=1.000000}
    time=1269 us, ref=1296 us, speedup=1.021x faster
  argb 1920x1080 -> bgra 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1052 us, ref=1047 us, speedup=0.995x slower
  yuv444p16le 1920x1080 -> yuv444p14le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=2926 us, ref=3618 us, speedup=1.237x faster
  gbrp12le 1920x1080 -> rgba 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999988 V=1.000000 A=1.000000}
    time=4221 us, ref=11934 us, speedup=2.827x faster
  yuvj444p 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999977 U=0.999978 V=0.999978 A=1.000000}
    time=3939 us, ref=7227 us, speedup=1.835x faster
  yuv444p14le 1920x1080 -> gbrap 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=4188 us, ref=7221 us, speedup=1.724x faster
  gbrp10le 1920x1080 -> yuv444p12le 1920x1080, flags=0 dither=1, SSIM {Y=0.999998 U=0.999996 V=1.000000 A=1.000000}
    time=4325 us, ref=10025 us, speedup=2.318x faster
  yuv444p14le 1920x1080 -> gray16le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=1333 us, ref=2065 us, speedup=1.549x faster
  gbrap 1920x1080 -> yuv444p16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=4346 us, ref=7390 us, speedup=1.700x faster

The only two actual slowdowns here were:

Conversion pass for gbrp -> gbrap:
  [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) planar >> 0
  [ u8 ...X -> ++++] SWS_OP_CLEAR        : {_ _ _ 255}
  [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

I neglected to add a dedicated kernel for read-clear-write, so this goes
through the general path with three separate function calls. Even so, it is
only 2.5% slower than the existing dedicated fast path. I imagine we will
want to add a fast path here eventually, unless the custom calling convention
obviates the need for such fast paths.
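
For what it's worth, such a fused read-clear-write kernel would be trivial;
a hypothetical sketch (names and signature are mine, not the actual API):

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical fused kernel for one row of gbrp -> gbrap: copy the three
   * color planes and fill the alpha plane with 255 in a single pass. */
  static void gbrp_to_gbrap_row(uint8_t *const dst[4],
                                const uint8_t *const src[3], int w)
  {
      memcpy(dst[0], src[0], w); /* G */
      memcpy(dst[1], src[1], w); /* B */
      memcpy(dst[2], src[2], w); /* R */
      memset(dst[3], 255, w);    /* SWS_OP_CLEAR {_ _ _ 255} */
  }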

Conversion pass for yuva444p -> yuv444p12le:
  [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) planar >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> u16
  [u16 ...X -> +++X] SWS_OP_LSHIFT       : << 4
  [u16 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

This is another case where the only operation being performed (an expanding
left shift) is so small that the load/store overhead alone causes a
measurable slowdown - 5.8% in this case. As with the previous case, it would
be easy to add a dedicated read-shift-write implementation to make these
conversions faster; I simply opted not to because the slowdown was not
severe.
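
For completeness, the fused read-shift-write kernel in question would amount
to little more than the following per plane; again a hypothetical sketch
rather than the actual API:

  #include <stdint.h>

  /* Hypothetical fused kernel for one row of the u8 -> u16 << 4 case
   * (8-bit samples widened to 12-bit-in-16), with no intermediate buffer. */
  static void u8_to_u16_lshift4_row(const uint8_t *src, uint16_t *dst, int w)
  {
      for (int x = 0; x < w; x++)
          dst[x] = (uint16_t)src[x] << 4;
  }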