Re: [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC]

From: Niklas Haas <ffmpeg@haasn.xyz>
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC]
Date: Fri, 2 May 2025 19:51:11 +0200
Message-ID: <20250502195111.GB219301@haasn.xyz>
In-Reply-To: <20250426175603.726924-1-ffmpeg@haasn.xyz>

On Sat, 26 Apr 2025 19:41:04 +0200 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> Hi all,
>
> After extensive amounts of refactoring and iteration on the design and API,
> and the implementation of an x86 SIMD backend, I'm happy to present the
> revised version of my ongoing swscale rewrite. Now with 100% less reliance on
> compiler autovectorization.
>
> As before, I recommend (re)reading the design document to understand the
> motivation, structure and implementation details of this rewrite. At this
> point, I expect the major API and internal organization decisions to remain
> stable.
>
> I will preface with some benchmark figures, on my (new) AMD Ryzen 9 9950X3D:
>
> All formats:
>   - single thread: Overall speedup=2.109x faster, min=0.018x max=40.309x
>   - multi thread:  Overall speedup=2.607x faster, min=0.112x max=254.738x
>
> "Common" formats: (referenced >100 times in FFmpeg source code)
>   - single thread: Overall speedup=2.797x faster, min=0.408x max=16.514x
>   - multi thread:  Overall speedup=2.870x faster, min=0.715x max=21.983x

Small update: I noticed that one code path was accidentally not enabled. I
also implemented asm for the remaining bit-packed formats. After those two
changes, the new numbers are:

All formats:
  - single thread: Overall speedup=4.247x faster, min=0.177x max=224.809x
  - multi thread:  Overall speedup=4.000x faster, min=0.256x max=968.725x

"Common" formats:
  - single thread: Overall speedup=3.174x faster, min=0.596x max=12.616x
  - multi thread:  Overall speedup=3.005x faster, min=0.617x max=14.739x
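
For anyone reading the summary lines above for the first time: each one
condenses many per-conversion speedup ratios (old time / new time) into an
overall figure plus the worst and best case, so a min below 1.0x means at
least one conversion got slower. Below is a minimal sketch of how such a line
could be derived. It assumes the overall figure is a geometric mean of the
ratios, which this mail does not state; the function name `summarize` and the
sample timings are hypothetical and are not measurements from this series.

```python
import math

def summarize(timings):
    """Summarize (old_seconds, new_seconds) pairs, one per format conversion."""
    # Speedup ratio per conversion: > 1.0 is faster, < 1.0 is a slowdown.
    ratios = [old / new for old, new in timings]
    # ASSUMPTION: aggregate with a geometric mean (average of the logs), so
    # that a 2x speedup and a 0.5x slowdown cancel out to exactly 1.0x.
    overall = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return overall, min(ratios), max(ratios)

# Hypothetical timings, for illustration only.
sample = [(4.0, 1.0), (1.0, 2.0), (9.0, 3.0)]
overall, lo, hi = summarize(sample)
print(f"Overall speedup={overall:.3f}x faster, min={lo:.3f}x max={hi:.3f}x")
```

A geometric mean keeps one extreme outlier (such as a single 900x case) from
dominating the overall figure the way an arithmetic mean would.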

>
> However, the main goal of this rewrite is not to improve performance, but to
> improve the maintainability, extensibility and correctness of the code. Most of
> the slowdowns for "common" formats are due to increased correctness (e.g.
> accurate rounding and dithering), and not the result of a regression per se.
>
> All of the remaining slowdowns (notably, the 0.1x cases) are due to incomplete
> coverage of the x86 SIMD. In particular, this currently affects bit-packed
> formats (e.g. rgb8, rgb4). (I also did not yet incorporate any AVX-512 code,
> which some of the existing routines take advantage of.)
>
> While I will continue working on this and expanding coverage to all remaining
> operations, I felt that now is a good point in time to get some code review
> and feedback regardless. I would especially appreciate code review of the x86
> SIMD code inside libswscale/x86/ops_*.asm, as this is my first time writing
> x86 assembly code.
>
>  doc/APIchanges                |   3 +
>  doc/scaler.texi               |   3 +
>  doc/swscale-v2.txt            | 344 +++++++++++++++++++++++++++
>  libswscale/Makefile           |   9 +
>  libswscale/format.c           | 945 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  libswscale/format.h           |  29 ++-
>  libswscale/graph.c            | 151 ++++++++----
>  libswscale/graph.h            |  37 ++-
>  libswscale/ops.c              | 850 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  libswscale/ops.h              | 263 +++++++++++++++++++++
>  libswscale/ops_backend.c      | 101 ++++++++
>  libswscale/ops_backend.h      | 181 ++++++++++++++
>  libswscale/ops_chain.c        | 291 +++++++++++++++++++++++
>  libswscale/ops_chain.h        | 108 +++++++++
>  libswscale/ops_internal.h     | 103 ++++++++
>  libswscale/ops_optimizer.c    | 810 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  libswscale/ops_tmpl_common.c  | 176 ++++++++++++++
>  libswscale/ops_tmpl_float.c   | 255 ++++++++++++++++++++
>  libswscale/ops_tmpl_int.c     | 609 +++++++++++++++++++++++++++++++++++++++++++++++
>  libswscale/options.c          |   1 +
>  libswscale/swscale.h          |   7 +
>  libswscale/tests/swscale.c    |  11 +-
>  libswscale/version.h          |   2 +-
>  libswscale/x86/Makefile       |   3 +
>  libswscale/x86/ops.c          | 735 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  libswscale/x86/ops_common.asm | 208 ++++++++++++++++
>  libswscale/x86/ops_float.asm  | 376 +++++++++++++++++++++++++++++
>  libswscale/x86/ops_int.asm    | 882 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/checkasm/Makefile       |   8 +-
>  tests/checkasm/checkasm.c     |   4 +-
>  tests/checkasm/checkasm.h     |  26 +-
>  tests/checkasm/sw_ops.c       | 748 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  32 files changed, 8206 insertions(+), 73 deletions(-)
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".