* [FFmpeg-devel] [RFC] New swscale internal design prototype
@ 2025-03-08 22:53 Niklas Haas
2025-03-09 16:11 ` Martin Storsjö
` (2 more replies)
0 siblings, 3 replies; 6+ messages in thread
From: Niklas Haas @ 2025-03-08 22:53 UTC (permalink / raw)
To: ffmpeg-devel
Hi all,
for the past two months, I have been working on a prototype for a radical
redesign of the swscale internals, specifically the format handling layer.
This includes, or will eventually expand to include, all format input/output
and unscaled special conversion steps.
I am not yet at a point where the new code can replace the scaling kernels,
but for the time being, we could start using it for the simple unscaled cases,
in theory, right away.
Rather than repeating the entire design here, I have collected my
notes into a design document on my WIP branch:
https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
I have spent the past week or so ironing out the last kinks and extensively
benchmarking the new design at least on x86, and it is generally a roughly 1.9x
improvement over the existing unscaled special converters across the board,
before even adding any hand written ASM. (This speedup is *just* using the
less-than-optimal compiler output from my reference C code!)
In some cases we even measure ~3-4x or even ~6x speedups, especially those
where swscale does not currently have hand written SIMD. Overall:
cpu: 16-core AMD Ryzen Threadripper 1950X
gcc 14.2.1:
single thread:
Overall speedup=1.887x faster, min=0.250x max=22.578x
multi thread:
Overall speedup=1.657x faster, min=0.190x max=87.972x
(The 0.2x slowdown cases are for rgb8/gbr8 input, which requires LUT support
for efficient decoding, but I wanted to focus on the core operations first
before worrying about adding LUT-based optimizations to the design)
I am (almost) ready to begin moving forwards with this design, merging it into
swscale and using it at least for unscaled format conversions, XYZ decoding,
colorspace transformations (subsuming the existing, horribly unoptimized,
3DLUT layer), gamma transformations, and so on.
I wanted to post it here to gather some feedback on the approach. Where does
it fall on the "madness" scale? Is the new operations and optimizer design
comprehensible? Am I trying too hard to reinvent compilers? Are there any
platforms where the high number of function calls per frame would be
prohibitively expensive? What are the thoughts on the float-first approach? See
also the list of limitations and improvement ideas at the bottom of my design
document.
Thanks for your time,
Niklas
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
From: Martin Storsjö @ 2025-03-09 16:11 UTC (permalink / raw)
To: FFmpeg development discussions and patches
On Sat, 8 Mar 2025, Niklas Haas wrote:
> What are the thoughts on the float-first approach?
In general, for modern architectures, relying on floats probably is
reasonable. (On architectures that aren't of quite as widespread interest,
it might not be so clear cut though.)
However with the benchmark example you provided a couple of weeks ago, we
concluded that even on x86 on modern HW, floats were faster than int16
only in one case: When using Clang, not GCC, and when compiling with
-mavx2, not without it. In all the other cases, int16 was faster than
float.
After doing those benchmarks, my understanding was that you concluded that
we probably need to keep int16 based codepaths still, then.
Did something fundamental come up since we did these benchmarks that
changed your conclusion?
// Martin
* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
From: Niklas Haas @ 2025-03-09 19:45 UTC (permalink / raw)
To: FFmpeg development discussions and patches
On Sun, 09 Mar 2025 18:11:54 +0200 Martin Storsjö <martin@martin.st> wrote:
> On Sat, 8 Mar 2025, Niklas Haas wrote:
>
> > What are the thoughts on the float-first approach?
>
> In general, for modern architectures, relying on floats probably is
> reasonable. (On architectures that aren't of quite as widespread interest,
> it might not be so clear cut though.)
>
> However with the benchmark example you provided a couple of weeks ago, we
> concluded that even on x86 on modern HW, floats were faster than int16
> only in one case: When using Clang, not GCC, and when compiling with
> -mavx2, not without it. In all the other cases, int16 was faster than
> float.
Hi Martin,
I should preface that this particular benchmark was a very specific test for
floating point *filtering*, which is considerably more punishing than the
conversion pipeline I have implemented here, and I think it's partly the
fault of compilers generating very suboptimal filtering code.
I think it would be better to re-assess using the current prototype on actual
hardware. I threw up a quick NEON test branch: (untested, should hopefully work)
https://github.com/haasn/FFmpeg/commits/swscale3-neon
# adjust the benchmark iters count as needed based on the HW perf
make libswscale/tests/swscale && libswscale/tests/swscale -unscaled 1 -bench 50
If this differs significantly from the ~1.8x speedup I measure on x86, I
will be far more concerned about the new approach.
> After doing those benchmarks, my understanding was that you concluded that
> we probably need to keep int16 based codepaths still, then.
This may have been a misunderstanding. While I think we should keep the option
of using fixed point precision *open*, the main take-away for me was that we
will definitely need to transition to custom SIMD, since we cannot rely on the
compiler to generate good code for us.
> Did something fundamental come up since we did these benchmarks that
> changed your conclusion?
>
> // Martin
* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
From: Rémi Denis-Courmont @ 2025-03-09 18:18 UTC (permalink / raw)
To: FFmpeg development discussions and patches
Hi,
On 8 March 2025 14:53:42 GMT-08:00, Niklas Haas <ffmpeg@haasn.xyz> wrote:
>https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
>I have spent the past week or so ironing
>I wanted to post it here to gather some feedback on the approach. Where does
>it fall on the "madness" scale? Is the new operations and optimizer design
>comprehensible? Am I trying too hard to reinvent compilers? Are there any
>platforms where the high number of function calls per frame would be
>prohibitively expensive? What are the thoughts on the float-first approach? See
>also the list of limitations and improvement ideas at the bottom of my design
>document.
Using floats internally may be fine if there's (almost) never any spillage, but that necessarily implies custom calling conventions. And won't work with as many as 32 pixels. On RVV 128-bit, you'd have only 4 vectors. On Arm NEON, it would be even worse as scalars/constants need to be stored in vectors as well.
Otherwise transferring two or even four times as much data to/from memory at every step is probably going to more than absorb any performance gains from using floats (notably not needing to scale values).
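The "two or even four times" figure follows from the sample sizes: a 32-bit float intermediate is four times the size of an 8-bit sample and twice the size of a 16-bit one. A minimal sketch of that arithmetic, assuming 4-byte floats; the function names are made up for illustration and are not part of swscale:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes moved when one chunk is stored to memory and loaded back again
 * between two processing steps. */
static size_t roundtrip_bytes(size_t pixels, size_t components,
                              size_t bytes_per_sample)
{
    return 2 * pixels * components * bytes_per_sample;
}

/* How much more data a float32 intermediate moves than an intermediate
 * kept at the native sample size (1, 2 or 4 bytes). */
static size_t float_traffic_ratio(size_t native_bytes_per_sample)
{
    assert(native_bytes_per_sample == 1 || native_bytes_per_sample == 2 ||
           native_bytes_per_sample == 4);
    return sizeof(float) / native_bytes_per_sample;
}
```

For a 32-pixel, 4-component chunk, one round trip moves 1024 bytes as floats, against 256 bytes for 8-bit samples.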
* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
From: Niklas Haas @ 2025-03-09 19:57 UTC (permalink / raw)
To: FFmpeg development discussions and patches
On Sun, 09 Mar 2025 11:18:04 -0700 Rémi Denis-Courmont <remi@remlab.net> wrote:
> Hi,
>
> On 8 March 2025 14:53:42 GMT-08:00, Niklas Haas <ffmpeg@haasn.xyz> wrote:
> >https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
>
> >I have spent the past week or so ironing
> >I wanted to post it here to gather some feedback on the approach. Where does
> >it fall on the "madness" scale? Is the new operations and optimizer design
> >comprehensible? Am I trying too hard to reinvent compilers? Are there any
> >platforms where the high number of function calls per frame would be
> >prohibitively expensive? What are the thoughts on the float-first approach? See
> >also the list of limitations and improvement ideas at the bottom of my design
> >document.
>
> Using floats internally may be fine if there's (almost) never any spillage, but that necessarily implies custom calling conventions. And won't work with as many as 32 pixels. On RVV 128-bit, you'd have only 4 vectors. On Arm NEON, it would be even worse as scalars/constants need to be stored in vectors as well.
I think that a custom calling convention is not as unreasonable as it may sound,
and will actually be easier to implement than the standard calling convention
since functions will not have to deal with pixel load/store, nor will there be
any need for "fused" versions of operations (whose only purpose is to avoid
the roundtrip through L1).
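To illustrate the shape of this, here is a rough sketch in plain C. It is not the actual SwsOp API; every name in it (Block, Op, run_chain, and so on) is hypothetical. Portable C cannot express a custom calling convention, so the pixel block here lives in a struct in memory; in a real SIMD implementation it would stay in vector registers between kernels. The point is only that the middle kernels never touch pixel memory, so only the two ends of the chain load and store:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK 32 /* pixels per chunk; a tunable constant in this sketch */

/* Four components of one pixel chunk. With a custom calling convention
 * these would stay in vector registers between kernels. */
typedef struct Block {
    float c[4][CHUNK];
} Block;

/* Hypothetical kernel type: transforms the block in place, no load/store. */
typedef struct Op {
    void (*run)(const struct Op *op, Block *blk);
    float k; /* per-op constant, unused by some kernels */
} Op;

/* Multiply every component by op->k. */
static void op_scale(const Op *op, Block *blk)
{
    for (int p = 0; p < 4; p++)
        for (int i = 0; i < CHUNK; i++)
            blk->c[p][i] *= op->k;
}

/* Swap the first and third component (e.g. R and B). */
static void op_swap_rb(const Op *op, Block *blk)
{
    (void) op;
    for (int i = 0; i < CHUNK; i++) {
        float t = blk->c[0][i];
        blk->c[0][i] = blk->c[2][i];
        blk->c[2][i] = t;
    }
}

/* Only the first and last step of the chain touch pixel memory. */
static void read_packed_u8(const uint8_t *src, Block *blk)
{
    for (int i = 0; i < CHUNK; i++)
        for (int p = 0; p < 4; p++)
            blk->c[p][i] = src[4 * i + p];
}

static void write_packed_u16(uint16_t *dst, const Block *blk)
{
    for (int i = 0; i < CHUNK; i++)
        for (int p = 0; p < 4; p++)
            dst[4 * i + p] = (uint16_t) (blk->c[p][i] + 0.5f); /* round */
}

/* Convert one chunk of packed 8-bit pixels to packed 16-bit pixels. */
static void run_chain(const Op *ops, size_t n_ops,
                      const uint8_t *src, uint16_t *dst)
{
    Block blk;
    assert(ops || !n_ops);
    read_packed_u8(src, &blk);
    for (size_t i = 0; i < n_ops; i++)
        ops[i].run(&ops[i], &blk);
    write_packed_u16(dst, &blk);
}
```

A chain of { op_swap_rb, op_scale with k = 257 } would turn packed 8-bit RGBA into 16-bit BGRA in this toy model, without any fused "swap and scale" kernel.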
The pixel chunk size is easily changed; it is a compile time constant and there
are no strict requirements on it. If RISC-V (or any other platform) struggles
with storing 32 floats in vector registers, we could go down to 16 (or even 8);
the number 32 was merely chosen by benchmarking and not through any careful
design consideration.
During my initial prototype, I was using 16 pixel chunks (= 512 bits, or
enough to fit into an m4 register on RVV 128). That still gives you room to
keep 4 vectors in memory (for the custom CC) and still have 4 spare vectors to
implement operations. I *think* that should be sufficient, with the current
set of vector operations.
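The register budget behind these numbers can be checked with a little arithmetic. The sketch below assumes 32-bit floats and a register file of 32 vector registers of 128 bits each (as on RVV with VLEN=128); the function names are made up for illustration:

```c
#include <assert.h>

/* Number of native vector registers needed to hold one component of a
 * chunk (rounded up). */
static int regs_per_component(int chunk_pixels, int elem_bits, int vlen_bits)
{
    assert(chunk_pixels > 0 && elem_bits > 0 && vlen_bits > 0);
    return (chunk_pixels * elem_bits + vlen_bits - 1) / vlen_bits;
}

/* Number of whole-chunk component vectors that fit in the register file. */
static int component_slots(int chunk_pixels, int elem_bits,
                           int vlen_bits, int num_regs)
{
    return num_regs / regs_per_component(chunk_pixels, elem_bits, vlen_bits);
}
```

With 32-pixel chunks, component_slots(32, 32, 128, 32) gives 4, which is the "only 4 vectors" case. With 16-pixel chunks it gives 8 (four for the calling convention plus four spare), and with 8-pixel chunks it gives 16.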
>
> Otherwise transferring two or even four times as much data to/from memory at every step is probably going to more than absorb any performance gains from using floats (notably not needing to scale values).
It's quite possible. I don't think that there is any major barrier to adding
fixed precision integer support to the SwsOp design *per se*. The main reason
I am hesitant to explore that option comes from the fact that I don't want to
introduce (or encourage) platform-specific variations in the output. By forcing
all platforms to adhere to a single precision, we can guarantee a consistent
output regardless of the optimization decisions.
So it would probably have to involve us switching from floats to fixed16
across the board, even on x86.
In either case, I think I will stick with the float centric design during the
development of my prototype if for no other reason than simplicity, unless
there is a very major performance issue associated with them.
Do you have access to anything with decent RVV F32 support that we could use
for testing? It's my understanding that existing RVV implementations have been
rather primitive.
* Re: [FFmpeg-devel] [RFC] New swscale internal design prototype
From: Michael Niedermayer @ 2025-03-09 19:41 UTC (permalink / raw)
To: FFmpeg development discussions and patches
Hi Niklas
On Sat, Mar 08, 2025 at 11:53:42PM +0100, Niklas Haas wrote:
> Hi all,
>
> for the past two months, I have been working on a prototype for a radical
> redesign of the swscale internals, specifically the format handling layer.
> This includes, or will eventually expand to include, all format input/output
> and unscaled special conversion steps.
>
> I am not yet at a point where the new code can replace the scaling kernels,
> but for the time being, we could start using it for the simple unscaled cases,
> in theory, right away.
>
> Rather than repeating my entire design document here, I opted to collect my
> notes into a design document on my WIP branch:
>
> https://github.com/haasn/FFmpeg/blob/swscale3/doc/swscale-v2.txt
>
> I have spent the past week or so ironing out the last kinks and extensively
> benchmarking the new design at least on x86, and it is generally a roughly 1.9x
> improvement over the existing unscaled special converters across the board,
> before even adding any hand written ASM. (This speedup is *just* using the
> less-than-optimal compiler output from my reference C code!)
>
> In some cases we even measure ~3-4x or even ~6x speedups, especially those
> where swscale does not currently have hand written SIMD. Overall:
>
> cpu: 16-core AMD Ryzen Threadripper 1950X
> gcc 14.2.1:
> single thread:
> Overall speedup=1.887x faster, min=0.250x max=22.578x
> multi thread:
> Overall speedup=1.657x faster, min=0.190x max=87.972x
>
> (The 0.2x slowdown cases are for rgb8/gbr8 input, which requires LUT support
> for efficient decoding, but I wanted to focus on the core operations first
> before worrying about adding LUT-based optimizations to the design)
>
> I am (almost) ready to begin moving forwards with this design, merging it into
> swscale and using it at least for unscaled format conversions, XYZ decoding,
> colorspace transformations (subsuming the existing, horribly unoptimized,
> 3DLUT layer), gamma transformations, and so on.
>
> I wanted to post it here to gather some feedback on the approach. Where does
> it fall on the "madness" scale? Is the new operations and optimizer design
> comprehensible? Am I trying too hard to reinvent compilers? Are there any
> platforms where the high number of function calls per frame would be
> prohibitively expensive? What are the thoughts on the float-first approach? See
> also the list of limitations and improvement ideas at the bottom of my design
> document.
I think a more float centric design probably makes sense. Floats make things
nicer and cleaner.
An integer-only path may be needed to support architectures that
have a weak FPU, and may also be needed in some cases to get them bit-exact.
AVFloating, a rational float type or AVRational64, are both interesting.
Do we have other places where either could be used?
thx
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
Frequently ignored answer#1 FFmpeg bugs should be sent to our bugtracker. User
questions about the command line tools should be sent to the ffmpeg-user ML.
And questions about how to use libav* should be sent to the libav-user ML.