Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [RFC] New swscale internal design prototype
@ 2025-03-08 22:53 Niklas Haas
  2025-03-09 16:11 ` Martin Storsjö
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Niklas Haas @ 2025-03-08 22:53 UTC (permalink / raw)
  To: ffmpeg-devel

Hi all,

for the past two months, I have been working on a prototype for a radical
redesign of the swscale internals, specifically the format handling layer.
This includes, or will eventually expand to include, all format input/output
and unscaled special conversion steps.

I am not yet at a point where the new code can replace the scaling kernels,
but for the time being we could, in theory, start using it for the simple
unscaled cases right away.

Rather than repeating my entire design document here, I opted to collect my
notes into a design document on my WIP branch:

https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt

I have spent the past week or so ironing out the last kinks and extensively
benchmarking the new design at least on x86, and it is generally a roughly 1.9x
improvement over the existing unscaled special converters across the board,
before even adding any hand-written ASM. (This speedup is achieved *just* using the
less-than-optimal compiler output from my reference C code!)

In some cases we measure ~3-4x or even ~6x speedups, especially in those
where swscale does not currently have hand-written SIMD. Overall:

cpu: 16-core AMD Ryzen Threadripper 1950X
gcc 14.2.1:
   single thread:
     Overall speedup=1.887x faster, min=0.250x max=22.578x
   multi thread:
     Overall speedup=1.657x faster, min=0.190x max=87.972x

(The 0.2x slowdown cases are for rgb8/gbr8 input, which requires LUT support
 for efficient decoding, but I wanted to focus on the core operations first
 before worrying about adding LUT-based optimizations to the design)
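
As an illustration of what LUT-based decoding could look like here (a
hypothetical sketch, not code from the prototype; the rgb8 bit layout shown
is illustrative only):

    #include <stdint.h>

    /* rgb8 packs R/G/B into one byte; a 256-entry LUT computed once at
     * setup turns the per-pixel shifts, masks and range expansion into
     * a single table load. */
    typedef struct { uint8_t r, g, b; } RGB24;

    static RGB24 rgb8_lut[256];

    static void init_rgb8_lut(void)
    {
        for (int i = 0; i < 256; i++) {
            rgb8_lut[i].r = ((i >> 5) & 7) * 255 / 7; /* 3 red bits   */
            rgb8_lut[i].g = ((i >> 2) & 7) * 255 / 7; /* 3 green bits */
            rgb8_lut[i].b = ( i       & 3) * 255 / 3; /* 2 blue bits  */
        }
    }

    static void decode_rgb8_row(const uint8_t *src, RGB24 *dst, int w)
    {
        for (int i = 0; i < w; i++)
            dst[i] = rgb8_lut[src[i]]; /* one load per pixel */
    }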

I am (almost) ready to begin moving forwards with this design, merging it into
swscale and using it at least for unscaled format conversions, XYZ decoding,
colorspace transformations (subsuming the existing, horribly unoptimized,
3DLUT layer), gamma transformations, and so on.

I wanted to post it here to gather some feedback on the approach. Where does
it fall on the "madness" scale? Are the new operations and optimizer design
comprehensible? Am I trying too hard to reinvent compilers? Are there any
platforms where the high number of function calls per frame would be
prohibitively expensive? What are the thoughts on the float-first approach? See
also the list of limitations and improvement ideas at the bottom of my design
document.

Thanks for your time,
Niklas
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
  2025-03-08 22:53 [FFmpeg-devel] [RFC] New swscale internal design prototype Niklas Haas
@ 2025-03-09 16:11 ` Martin Storsjö
  2025-03-09 19:45   ` Niklas Haas
  2025-03-09 18:18 ` Rémi Denis-Courmont
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Martin Storsjö @ 2025-03-09 16:11 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Sat, 8 Mar 2025, Niklas Haas wrote:

> What are the thoughts on the float-first approach?

In general, for modern architectures, relying on floats probably is 
reasonable. (On architectures that aren't of quite as widespread interest, 
it might not be so clear cut though.)

However, with the benchmark example you provided a couple of weeks ago, we
concluded that even on modern x86 HW, floats were faster than int16
only in one case: when using Clang, not GCC, and when compiling with
-mavx2, not without it. In all the other cases, int16 was faster than
float.

After doing those benchmarks, my understanding was that you concluded that
we probably need to keep int16-based codepaths still, then.

Did something fundamental come up since we did these benchmarks that 
changed your conclusion?

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
  2025-03-08 22:53 [FFmpeg-devel] [RFC] New swscale internal design prototype Niklas Haas
  2025-03-09 16:11 ` Martin Storsjö
@ 2025-03-09 18:18 ` Rémi Denis-Courmont
  2025-03-09 19:57   ` Niklas Haas
  2025-03-09 19:41 ` Michael Niedermayer
  2025-03-09 21:13 ` Niklas Haas
  3 siblings, 1 reply; 9+ messages in thread
From: Rémi Denis-Courmont @ 2025-03-09 18:18 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

Hi,

On 8 March 2025 14:53:42 GMT-08:00, Niklas Haas <ffmpeg@haasn.xyz> wrote:
>https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt

>I have spent the past week or so ironing 
>I wanted to post it here to gather some feedback on the approach. Where does
>it fall on the "madness" scale? Are the new operations and optimizer design
>comprehensible? Am I trying too hard to reinvent compilers? Are there any
>platforms where the high number of function calls per frame would be
>prohibitively expensive? What are the thoughts on the float-first approach? See
>also the list of limitations and improvement ideas at the bottom of my design
>document.

Using floats internally may be fine if there's (almost) never any spillage, but that necessarily implies custom calling conventions. And it won't work with chunks as large as 32 pixels. On RVV 128-bit, you'd have only 4 vectors. On Arm NEON, it would be even worse, as scalars/constants need to be stored in vectors as well.
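
Concretely, a back-of-the-envelope on the register budget (assuming a
32-pixel chunk of f32 values):

    32 pixels x 4 bytes (f32)  = 128 bytes per vector of pixels
    RVV 128-bit: 128 bytes     = one LMUL=8 group of 8 vector registers,
                                 so 32 registers / 8 = only 4 such groups
    Arm NEON: 128 bytes        = 8 of the 32 128-bit registers per vector
                                 of pixels, i.e. 4 vectors use the whole
                                 register file, before any scratch
                                 registers or constants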

Otherwise transferring two or even four times as much data to/from memory at every step is probably going to more than absorb any performance gains from using floats (notably not needing to scale values).
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
  2025-03-08 22:53 [FFmpeg-devel] [RFC] New swscale internal design prototype Niklas Haas
  2025-03-09 16:11 ` Martin Storsjö
  2025-03-09 18:18 ` Rémi Denis-Courmont
@ 2025-03-09 19:41 ` Michael Niedermayer
  2025-03-09 21:13 ` Niklas Haas
  3 siblings, 0 replies; 9+ messages in thread
From: Michael Niedermayer @ 2025-03-09 19:41 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


[-- Attachment #1.1: Type: text/plain, Size: 3252 bytes --]

Hi Niklas

On Sat, Mar 08, 2025 at 11:53:42PM +0100, Niklas Haas wrote:
> Hi all,
> 
> for the past two months, I have been working on a prototype for a radical
> redesign of the swscale internals, specifically the format handling layer.
> This includes, or will eventually expand to include, all format input/output
> and unscaled special conversion steps.
> 
> I am not yet at a point where the new code can replace the scaling kernels,
> but for the time being we could, in theory, start using it for the simple
> unscaled cases right away.
> 
> Rather than repeating my entire design document here, I opted to collect my
> notes into a design document on my WIP branch:
> 
> https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
> 
> I have spent the past week or so ironing out the last kinks and extensively
> benchmarking the new design at least on x86, and it is generally a roughly 1.9x
> improvement over the existing unscaled special converters across the board,
> before even adding any hand-written ASM. (This speedup is achieved *just* using the
> less-than-optimal compiler output from my reference C code!)
> 
> In some cases we measure ~3-4x or even ~6x speedups, especially in those
> where swscale does not currently have hand-written SIMD. Overall:
> 
> cpu: 16-core AMD Ryzen Threadripper 1950X
> gcc 14.2.1:
>    single thread:
>      Overall speedup=1.887x faster, min=0.250x max=22.578x
>    multi thread:
>      Overall speedup=1.657x faster, min=0.190x max=87.972x
> 
> (The 0.2x slowdown cases are for rgb8/gbr8 input, which requires LUT support
>  for efficient decoding, but I wanted to focus on the core operations first
>  before worrying about adding LUT-based optimizations to the design)
> 
> I am (almost) ready to begin moving forwards with this design, merging it into
> swscale and using it at least for unscaled format conversions, XYZ decoding,
> colorspace transformations (subsuming the existing, horribly unoptimized,
> 3DLUT layer), gamma transformations, and so on.
> 
> I wanted to post it here to gather some feedback on the approach. Where does
> it fall on the "madness" scale? Are the new operations and optimizer design
> comprehensible? Am I trying too hard to reinvent compilers? Are there any
> platforms where the high number of function calls per frame would be
> prohibitively expensive? What are the thoughts on the float-first approach? See
> also the list of limitations and improvement ideas at the bottom of my design
> document.

I think a more float-centric design probably makes sense. Floats make things
nicer and cleaner.
It may be necessary to support an integer-only path for architectures that
have a weak FPU, and it may also be needed in some cases to get bit-exact
results.

AVFloating (a rational float type) or AVRational64: both are interesting.
Do we have other places where either could be used?

thx

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Frequently ignored answer#1 FFmpeg bugs should be sent to our bugtracker. User
questions about the command line tools should be sent to the ffmpeg-user ML.
And questions about how to use libav* should be sent to the libav-user ML.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

[-- Attachment #2: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
  2025-03-09 16:11 ` Martin Storsjö
@ 2025-03-09 19:45   ` Niklas Haas
  0 siblings, 0 replies; 9+ messages in thread
From: Niklas Haas @ 2025-03-09 19:45 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Sun, 09 Mar 2025 18:11:54 +0200 Martin Storsjö <martin@martin.st> wrote:
> On Sat, 8 Mar 2025, Niklas Haas wrote:
> 
> > What are the thoughts on the float-first approach?
> 
> In general, for modern architectures, relying on floats probably is 
> reasonable. (On architectures that aren't of quite as widespread interest, 
> it might not be so clear cut though.)
> 
> However, with the benchmark example you provided a couple of weeks ago, we
> concluded that even on modern x86 HW, floats were faster than int16
> only in one case: when using Clang, not GCC, and when compiling with
> -mavx2, not without it. In all the other cases, int16 was faster than
> float.

Hi Martin,

I should preface this by saying that the benchmark in question was a very
specific test of floating point *filtering*, which is considerably more
punishing than the conversion pipeline I have implemented here, and I think
it's partly the fault of compilers generating very suboptimal filtering code.

I think it would be better to re-assess using the current prototype on actual
hardware. I threw up a quick NEON test branch (untested, but it should
hopefully work):
https://github.com/haasn/FFmpeg/commits/swscale3-neon

# adjust the benchmark iters count as needed based on the HW perf
make libswscale/tests/swscale && libswscale/tests/swscale -unscaled 1 -bench 50

If this differs significantly from the ~1.8x speedup I measure on x86, I
will be far more concerned about the new approach.

> After doing those benchmarks, my understanding was that you concluded that
> we probably need to keep int16-based codepaths still, then.

This may have been a misunderstanding. While I think we should keep the option
of using fixed point precision *open*, the main take-away for me was that we
will definitely need to transition to custom SIMD, since we cannot rely on the
compiler to generate good code for us.

> Did something fundamental come up since we did these benchmarks that 
> changed your conclusion?
> 
> // Martin
> 
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
  2025-03-09 18:18 ` Rémi Denis-Courmont
@ 2025-03-09 19:57   ` Niklas Haas
  2025-03-10  0:57     ` Rémi Denis-Courmont
  0 siblings, 1 reply; 9+ messages in thread
From: Niklas Haas @ 2025-03-09 19:57 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Sun, 09 Mar 2025 11:18:04 -0700 Rémi Denis-Courmont <remi@remlab.net> wrote:
> Hi,
> 
> On 8 March 2025 14:53:42 GMT-08:00, Niklas Haas <ffmpeg@haasn.xyz> wrote:
> >https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
> 
> >I have spent the past week or so ironing 
> >I wanted to post it here to gather some feedback on the approach. Where does
> >it fall on the "madness" scale? Are the new operations and optimizer design
> >comprehensible? Am I trying too hard to reinvent compilers? Are there any
> >platforms where the high number of function calls per frame would be
> >prohibitively expensive? What are the thoughts on the float-first approach? See
> >also the list of limitations and improvement ideas at the bottom of my design
> >document.
> 
> Using floats internally may be fine if there's (almost) never any spillage, but that necessarily implies custom calling conventions. And it won't work with chunks as large as 32 pixels. On RVV 128-bit, you'd have only 4 vectors. On Arm NEON, it would be even worse, as scalars/constants need to be stored in vectors as well.

I think that a custom calling convention is not as unreasonable as it may sound,
and will actually be easier to implement than the standard calling convention
since functions will not have to deal with pixel load/store, nor will there be
any need for "fused" versions of operations (whose only purpose is to avoid
the roundtrip through L1).

The pixel chunk size is easily changed; it is a compile time constant and there
are no strict requirements on it. If RISC-V (or any other platform) struggles
with storing 32 floats in vector registers, we could go down to 16 (or even 8);
the number 32 was merely chosen by benchmarking and not through any careful
design consideration.

In my initial prototype, I was using 16-pixel chunks (= 512 bits, or
enough to fit into an m4 register on RVV 128). That still gives you room to
keep 4 vectors in memory (for the custom CC) while having 4 spare vectors to
implement operations. I *think* that should be sufficient, with the current
set of vector operations.
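
To sketch the distinction (hypothetical C, not the prototype's actual API;
SWS_CHUNK and the function names here are made up):

    /* Standard CC: every op loads and stores its pixels through memory. */
    static void op_scale_memcc(float *pix, int n, float scale)
    {
        for (int i = 0; i < n; i++)
            pix[i] *= scale;        /* implicit load + store per pixel */
    }

    /* Custom CC, conceptually: a chunk of pixels stays live and is handed
     * from op to op by value; in hand-written SIMD this would mean fixed
     * vector registers, with loads/stores only at pipeline entry/exit. */
    #define SWS_CHUNK 16

    typedef struct { float v[SWS_CHUNK]; } PixChunk;

    static inline PixChunk op_scale_regcc(PixChunk c, float scale)
    {
        for (int i = 0; i < SWS_CHUNK; i++)
            c.v[i] *= scale;        /* no memory traffic once inlined */
        return c;
    }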

> 
> Otherwise transferring two or even four times as much data to/from memory at every step is probably going to more than absorb any performance gains from using floats (notably not needing to scale values).

It's quite possible. I don't think that there is any major barrier to adding
fixed precision integer support to the SwsOp design *per se*. The main reason
I am hesitant to explore that option is that I don't want to
introduce (or encourage) platform-specific variations in the output. By forcing
all platforms to adhere to a single precision, we can guarantee consistent
output regardless of the optimization decisions.

So it would probably have to involve us switching from floats to fixed16
across the board, even on x86.

In either case, I think I will stick with the float-centric design during the
development of my prototype, if for no other reason than simplicity, unless
there is a very major performance issue associated with them.

Do you have access to anything with decent RVV F32 support that we could use
for testing? It's my understanding that existing RVV implementations have been
rather primitive.

> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
  2025-03-08 22:53 [FFmpeg-devel] [RFC] New swscale internal design prototype Niklas Haas
                   ` (2 preceding siblings ...)
  2025-03-09 19:41 ` Michael Niedermayer
@ 2025-03-09 21:13 ` Niklas Haas
  2025-03-09 21:28   ` Niklas Haas
  3 siblings, 1 reply; 9+ messages in thread
From: Niklas Haas @ 2025-03-09 21:13 UTC (permalink / raw)
  To: ffmpeg-devel

On Sat, 08 Mar 2025 23:53:42 +0100 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> Hi all,
> 
> for the past two months, I have been working on a prototype for a radical
> redesign of the swscale internals, specifically the format handling layer.
> This includes, or will eventually expand to include, all format input/output
> and unscaled special conversion steps.
> 
> I am not yet at a point where the new code can replace the scaling kernels,
> but for the time being we could, in theory, start using it for the simple
> unscaled cases right away.
> 
> Rather than repeating my entire design document here, I opted to collect my
> notes into a design document on my WIP branch:
> 
> https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
> 
> I have spent the past week or so ironing out the last kinks and extensively
> benchmarking the new design at least on x86, and it is generally a roughly 1.9x
> improvement over the existing unscaled special converters across the board,
> before even adding any hand-written ASM. (This speedup is achieved *just* using the
> less-than-optimal compiler output from my reference C code!)
> 
> In some cases we measure ~3-4x or even ~6x speedups, especially in those
> where swscale does not currently have hand-written SIMD. Overall:
> 
> cpu: 16-core AMD Ryzen Threadripper 1950X
> gcc 14.2.1:
>    single thread:
>      Overall speedup=1.887x faster, min=0.250x max=22.578x
>    multi thread:
>      Overall speedup=1.657x faster, min=0.190x max=87.972x

I was asked to substantiate these figures with more practical and relevant
examples. Apologies in advance for the wall of text, but I felt the need to be
thorough. The most important information is up-front.

Methodology: Sorting the most popular pixel formats (by occurrence count inside
FFmpeg internal code), and excluding subsampled formats, we have:

    347 AV_PIX_FMT_YUV444P
    331 AV_PIX_FMT_GRAY8
    281 AV_PIX_FMT_RGB24
    235 AV_PIX_FMT_BGR24
    232 AV_PIX_FMT_GBRP
    220 AV_PIX_FMT_RGBA
    190 AV_PIX_FMT_YUV444P10
    185 AV_PIX_FMT_BGRA
    184 AV_PIX_FMT_GBRAP
    177 AV_PIX_FMT_YUVJ444P
    172 AV_PIX_FMT_YUVA444P
    162 AV_PIX_FMT_YUV444P12
    150 AV_PIX_FMT_YUV444P16
    150 AV_PIX_FMT_GBRP10
    139 AV_PIX_FMT_GBRP12
    138 AV_PIX_FMT_ARGB
    131 AV_PIX_FMT_GRAY16
    129 AV_PIX_FMT_YUV444P9
    127 AV_PIX_FMT_ABGR
    119 AV_PIX_FMT_YUVA444P10
    115 AV_PIX_FMT_GBRP16
    113 AV_PIX_FMT_GRAY10
    111 AV_PIX_FMT_GBRP9
    109 AV_PIX_FMT_YUV444P14
  (remaining formats are used fewer than 100 times)

Across this reduced set of formats, the overall speedup (on my weaker, older
laptop) was:

  CPU: quad core AMD Ryzen 7 PRO 3700U w/ Radeon Vega Mobile Gfx (-MT MCP-)
  Overall speedup=1.666x faster, min=0.431x max=5.819x

The biggest speedups were seen for anything involving gbrp:

  Conversion pass for bgra -> gbrp16le:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 4 elem(s) packed >> 0
    [ u8 ...X -> +++X] SWS_OP_SWIZZLE      : 1023
    [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> u16 (expand)
    [u16 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
      (X = unused, + = exact, 0 = zero)
  bgra 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1933 us, ref=8216 us, speedup=4.249x faster

  Conversion pass for gray -> gbrp:
    [ u8 XXXX -> +XXX] SWS_OP_READ         : 1 elem(s) packed >> 0
    [ u8 .XXX -> +++X] SWS_OP_SWIZZLE      : 0003
    [ u8 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
      (X = unused, + = exact, 0 = zero)
  gray 1920x1080 -> gbrp 1920x1080, flags=0 dither=1, SSIM {Y=0.999977 U=1.000000 V=1.000000 A=1.000000}
    time=868 us, ref=3510 us, speedup=4.039x faster

  Conversion pass for gbrp -> gbrp16le:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) planar >> 0
    [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> u16 (expand)
    [u16 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
      (X = unused, + = exact, 0 = zero)
  gbrp 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1724 us, ref=9489 us, speedup=5.505x faster
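
To help map these dumps onto code, here is a rough sketch of how such an op
list might be represented and executed (all names are hypothetical, inferred
from the dump format above, not the prototype's actual API):

    enum SwsOpKind { OP_READ, OP_SWIZZLE, OP_CONVERT, OP_CLEAR, OP_WRITE };

    typedef struct SwsOp {
        enum SwsOpKind kind;
        int args[4];              /* plane count, swizzle mask, etc. */
    } SwsOp;

    static void dispatch(const SwsOp *op, void *chunk)
    {
        switch (op->kind) {
        case OP_READ:    /* gather pixels from the input planes  */ break;
        case OP_SWIZZLE: /* permute the channel order            */ break;
        case OP_CONVERT: /* change the sample type, e.g. u8->u16 */ break;
        case OP_CLEAR:   /* set a channel to a constant          */ break;
        case OP_WRITE:   /* scatter pixels to the output planes  */ break;
        }
    }

    /* The optimizer rewrites this op list (fusing, reordering, dropping
     * no-ops) before the chain is run once per chunk of pixels: */
    static void run_ops(const SwsOp *ops, int n_ops, void *chunk)
    {
        for (int i = 0; i < n_ops; i++)
            dispatch(&ops[i], chunk); /* one indirect call per op */
    }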

Though an honorable mention goes out to reductions in plane count, which the
optimizer identifies as a noop and optimizes into a refcopy / memcpy:

  yuva444p10le 1920x1080 -> yuv444p10le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=0 us, ref=2453 us, speedup=6072.812x faster
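
A sketch of what this fast path amounts to (hypothetical code; assumes
AVFrame-style data/linesize fields and an op count left over after
optimization):

    #include <libavutil/frame.h>

    /* If the optimizer eliminated every op (e.g. yuva444p10 -> yuv444p10
     * merely drops the alpha plane), the "conversion" degenerates into
     * referencing the input planes directly, with no per-pixel work: */
    static int run_noop_pass(AVFrame *dst, const AVFrame *src,
                             int n_ops, int nb_planes)
    {
        if (n_ops != 0)
            return 1; /* real work remains; fall through to the op chain */
        for (int p = 0; p < nb_planes; p++) {
            dst->data[p]     = src->data[p];
            dst->linesize[p] = src->linesize[p];
        }
        return 0;
    }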

The worst slowdowns are currently those involving any sort of packed swizzle
for which dedicated MMX functions already exist:

  Conversion pass for bgr24 -> abgr:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
    [ u8 ...X -> X+++] SWS_OP_SWIZZLE      : 0012
    [ u8 X... -> ++++] SWS_OP_CLEAR        : {255 _ _ _}
    [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) packed >> 0
      (X = unused, + = exact, 0 = zero)
  bgr24 1920x1080 -> abgr 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1710 us, ref=826 us, speedup=0.483x slower

I have previously identified these as a particularly weak spot in the compiler
output, since no matter what C code I write, the result will always be roughly
0.5x compared to the existing hand-written MMX. That said, I also plan on taking
that existing MMX code and simply plugging it into the new architecture, which
should get rid of these last few slow cases.

On the other hand, the generated code outperforms the existing architecture
in cases where the old code fails to provide a dedicated function, e.g.:

  Conversion pass for bgr24 -> argb:
    [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
    [ u8 ...X -> X+++] SWS_OP_SWIZZLE      : 0210
    [ u8 X... -> ++++] SWS_OP_CLEAR        : {255 _ _ _}
    [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) packed >> 0
      (X = unused, + = exact, 0 = zero)
  bgr24 1920x1080 -> argb 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1685 us, ref=2646 us, speedup=1.570x faster

This demonstrates that the new general-purpose pipeline is faster than the
old general-purpose pipeline.

And lastly, here is a randomly chosen subset of the overall test:

  bgra 1920x1080 -> yuv444p16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=4243 us, ref=5824 us, speedup=1.372x faster
  yuv444p12le 1920x1080 -> yuva444p 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=2985 us, ref=2813 us, speedup=0.942x slower
  gbrp10le 1920x1080 -> gbrp 1920x1080, flags=0 dither=1, SSIM {Y=0.999998 U=0.999987 V=1.000000 A=1.000000}
    time=4473 us, ref=9638 us, speedup=2.155x faster
  yuv444p10le 1920x1080 -> yuva444p10le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=2040 us, ref=3095 us, speedup=1.517x faster
  yuv444p10le 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=3855 us, ref=7277 us, speedup=1.888x faster
  gbrp 1920x1080 -> gbrap 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1059 us, ref=1032 us, speedup=0.975x slower
  argb 1920x1080 -> gray16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=1.000000 V=1.000000 A=1.000000}
    time=3113 us, ref=3697 us, speedup=1.187x faster
  yuv444p12le 1920x1080 -> gbrap 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=4066 us, ref=7141 us, speedup=1.756x faster
  gbrp 1920x1080 -> rgba 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1384 us, ref=3072 us, speedup=2.220x faster
  yuvj444p 1920x1080 -> argb 1920x1080, flags=0 dither=1, SSIM {Y=0.999980 U=0.999974 V=0.999987 A=1.000000}
    time=4777 us, ref=9294 us, speedup=1.946x faster
  yuvj444p 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999977 U=0.999978 V=0.999978 A=1.000000}
    time=3850 us, ref=7314 us, speedup=1.900x faster
  gray10le 1920x1080 -> gray16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999991 U=1.000000 V=1.000000 A=1.000000}
    time=1269 us, ref=1296 us, speedup=1.021x faster
  argb 1920x1080 -> bgra 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=1052 us, ref=1047 us, speedup=0.995x slower
  yuv444p16le 1920x1080 -> yuv444p14le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=2926 us, ref=3618 us, speedup=1.237x faster
  gbrp12le 1920x1080 -> rgba 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999988 V=1.000000 A=1.000000}
    time=4221 us, ref=11934 us, speedup=2.827x faster
  yuvj444p 1920x1080 -> gbrp16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999977 U=0.999978 V=0.999978 A=1.000000}
    time=3939 us, ref=7227 us, speedup=1.835x faster
  yuv444p14le 1920x1080 -> gbrap 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=4188 us, ref=7221 us, speedup=1.724x faster
  gbrp10le 1920x1080 -> yuv444p12le 1920x1080, flags=0 dither=1, SSIM {Y=0.999998 U=0.999996 V=1.000000 A=1.000000}
    time=4325 us, ref=10025 us, speedup=2.318x faster
  yuv444p14le 1920x1080 -> gray16le 1920x1080, flags=0 dither=1, SSIM {Y=1.000000 U=1.000000 V=1.000000 A=1.000000}
    time=1333 us, ref=2065 us, speedup=1.549x faster
  gbrap 1920x1080 -> yuv444p16le 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
    time=4346 us, ref=7390 us, speedup=1.700x faster

The only two actual slowdowns here were:

Conversion pass for gbrp -> gbrap:
  [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) planar >> 0
  [ u8 ...X -> ++++] SWS_OP_CLEAR        : {_ _ _ 255}
  [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

I neglected to add a dedicated kernel for read-clear-write, so this is going
through the general path with three separate function calls. And even so,
it is only 2.5% slower than the existing dedicated fast path. I imagine
that we will want to add a fast path here eventually, unless the custom calling
convention obviates the need for such fast paths.
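
For reference, a minimal sketch of what such a fused read-clear-write kernel
could look like per row (hypothetical, not the prototype's code):

    #include <string.h>
    #include <stdint.h>

    /* gbrp -> gbrap: pass the G/B/R planes through and synthesize an
     * opaque alpha plane, in one call instead of three separate ops. */
    static void gbrp_to_gbrap_row(const uint8_t *g, const uint8_t *b,
                                  const uint8_t *r, uint8_t *dst[4], int w)
    {
        memcpy(dst[0], g, w);
        memcpy(dst[1], b, w);
        memcpy(dst[2], r, w);
        memset(dst[3], 255, w); /* SWS_OP_CLEAR {_ _ _ 255} */
    }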

Conversion pass for yuva444p -> yuv444p12le:
  [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) planar >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> u16
  [u16 ...X -> +++X] SWS_OP_LSHIFT       : << 4
  [u16 ...X -> XXXX] SWS_OP_WRITE        : 3 elem(s) planar >> 0
    (X = unused, + = exact, 0 = zero)

This is another case where the only operation being performed (an expanding
left shift) is so cheap that the load/store overhead is enough to
cause a measurable slowdown, 5.8% in this case. As with the previous case, it
would be easy to add a dedicated read-shift-write implementation to make
these cases faster; I just opted not to because the slowdown was not
massive.
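
Likewise, a dedicated read-shift-write row kernel for this case could be as
simple as (hypothetical sketch):

    #include <stdint.h>

    /* u8 -> u16 with an expanding left shift by 4, i.e. 8-bit samples
     * widened to 12-bit values stored in 16 bits, in a single pass: */
    static void expand_u8_to_u16_lsh4_row(const uint8_t *src,
                                          uint16_t *dst, int w)
    {
        for (int i = 0; i < w; i++)
            dst[i] = (uint16_t)src[i] << 4;
    }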
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
  2025-03-09 21:13 ` Niklas Haas
@ 2025-03-09 21:28   ` Niklas Haas
  0 siblings, 0 replies; 9+ messages in thread
From: Niklas Haas @ 2025-03-09 21:28 UTC (permalink / raw)
  To: ffmpeg-devel

On Sun, 09 Mar 2025 22:13:49 +0100 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> The worst slowdowns are currently those involving any sort of packed swizzle
> for which dedicated MMX functions already exist:
> 
>   Conversion pass for bgr24 -> abgr:
>     [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
>     [ u8 ...X -> X+++] SWS_OP_SWIZZLE      : 0012
>     [ u8 X... -> ++++] SWS_OP_CLEAR        : {255 _ _ _}
>     [ u8 .... -> XXXX] SWS_OP_WRITE        : 4 elem(s) packed >> 0
>       (X = unused, + = exact, 0 = zero)
>   bgr24 1920x1080 -> abgr 1920x1080, flags=0 dither=1, SSIM {Y=0.999997 U=0.999989 V=1.000000 A=1.000000}
>     time=1710 us, ref=826 us, speedup=0.483x slower
> 
> I have previously identified these as a particularly weak spot in the compiler
> output, since no matter what C code I write, the result will always be roughly
> 0.5x compared to the existing hand-written MMX. That said, I also plan on taking
> that existing MMX code and simply plugging it into the new architecture, which
> should get rid of these last few slow cases.

I also wanted to point out that a lot of our conversions are more
*accurate* than the previous implementations. An illustrative example:

  Conversion pass for gray -> gray10le:
    [ u8 XXXX -> +XXX] SWS_OP_READ         : 1 elem(s) packed >> 0
    [ u8 .XXX -> +XXX] SWS_OP_CONVERT      : u8 -> f32
    [f32 .XXX -> .XXX] SWS_OP_SCALE        : * 341/85
    [f32 .XXX -> .XXX] SWS_OP_DITHER       : 16x16 matrix
    [f32 .XXX -> .XXX] SWS_OP_CLAMP        : 0 <= x <= {1023 _ _ _}
    [f32 .XXX -> +XXX] SWS_OP_CONVERT      : f32 -> u16
    [u16 .XXX -> XXXX] SWS_OP_WRITE        : 1 elem(s) packed >> 0
      (X = unused, + = exact, 0 = zero)
  gray 1920x1080 -> gray10le 1920x1080, flags=0 dither=1, SSIM {Y=0.999974 U=1.000000 V=1.000000 A=1.000000}
    time=1317 us, ref=1300 us, speedup=0.987x slower

The reference implementation handles this as a full range shift:

  gray10 = gray << 2 | gray >> 6.

But this is *not* accurate and will therefore introduce round-trip error. For
example, a value of 200 produces 200 << 2 | 200 >> 6 = 803, while the correct
result would be 200 / 255 * 1023 = 802.3529411764706. Our new implementation
handles this conversion accurately in floating point math and dithers the
result down to a ~65%/35% mix of 802 and 803.
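
The arithmetic is easy to verify (illustrative only):

    #include <stdio.h>

    int main(void)
    {
        int gray     = 200;
        int shifted  = (gray << 2) | (gray >> 6); /* bit replication: 803 */
        double exact = gray / 255.0 * 1023.0;     /* 802.3529...          */
        printf("shifted=%d exact=%f\n", shifted, exact);
        /* Dithering 802 and 803 in a ~65%/35% ratio averages out to the
         * exact value, whereas the pure shift is consistently biased. */
        return 0;
    }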
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
  2025-03-09 19:57   ` Niklas Haas
@ 2025-03-10  0:57     ` Rémi Denis-Courmont
  0 siblings, 0 replies; 9+ messages in thread
From: Rémi Denis-Courmont @ 2025-03-10  0:57 UTC (permalink / raw)
  To: FFmpeg development discussions and patches



On 9 March 2025 12:57:47 GMT-07:00, Niklas Haas <ffmpeg@haasn.xyz> wrote:
>On Sun, 09 Mar 2025 11:18:04 -0700 Rémi Denis-Courmont <remi@remlab.net> wrote:
>> Hi,
>> 
>> On 8 March 2025 14:53:42 GMT-08:00, Niklas Haas <ffmpeg@haasn.xyz> wrote:
>> >https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
>> 
>> >I have spent the past week or so ironing 
>> >I wanted to post it here to gather some feedback on the approach. Where does
>> >it fall on the "madness" scale? Are the new operations and optimizer design
>> >comprehensible? Am I trying too hard to reinvent compilers? Are there any
>> >platforms where the high number of function calls per frame would be
>> >prohibitively expensive? What are the thoughts on the float-first approach? See
>> >also the list of limitations and improvement ideas at the bottom of my design
>> >document.
>> 
>> Using floats internally may be fine if there's (almost) never any spillage, but that necessarily implies custom calling conventions. And it won't work with chunks as large as 32 pixels. On RVV 128-bit, you'd have only 4 vectors. On Arm NEON, it would be even worse, as scalars/constants need to be stored in vectors as well.
>
>I think that a custom calling convention is not as unreasonable as it may sound,
>and will actually be easier to implement than the standard calling convention
>since functions will not have to deal with pixel load/store, nor will there be
>any need for "fused" versions of operations (whose only purpose is to avoid
>the roundtrip through L1).
>
>The pixel chunk size is easily changed; it is a compile time constant and there
>are no strict requirements on it. If RISC-V (or any other platform) struggles
>with storing 32 floats in vector registers, we could go down to 16 (or even 8);
>the number 32 was merely chosen by benchmarking and not through any careful
>design consideration.

It can't be a compile-time constant on RVV, nor on SVE (if that's ever introduced), because those are scalable. I doubt that a compile-time constant will work well across all variants of x86 either, though I wouldn't know.

>Do you have access to anything with decent RVV F32 support that we could use
>for testing? It's my understanding that existing RVV implementations have been
>rather primitive.

Float is quite okay on RVV. It is faster than integers on some lavc audio loops already.

That said, I only have access to the TH-C908 (128-bit) and the ST-X60 (256-bit), as before, and I haven't been contacted about getting access to anything better. The X60 is used on FATE.
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-03-10  0:58 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-03-08 22:53 [FFmpeg-devel] [RFC] New swscale internal design prototype Niklas Haas
2025-03-09 16:11 ` Martin Storsjö
2025-03-09 19:45   ` Niklas Haas
2025-03-09 18:18 ` Rémi Denis-Courmont
2025-03-09 19:57   ` Niklas Haas
2025-03-10  0:57     ` Rémi Denis-Courmont
2025-03-09 19:41 ` Michael Niedermayer
2025-03-09 21:13 ` Niklas Haas
2025-03-09 21:28   ` Niklas Haas

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git