Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC]
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel

Hi all,

After extensive amounts of refactoring and iteration on the design and API,
and the implementation of an x86 SIMD backend, I'm happy to present the
revised version of my ongoing swscale rewrite. Now with 100% less reliance on
compiler autovectorization.

As before, I recommend (re)reading the design document to understand the
motivation, structure and implementation details of this rewrite. At this
point, I expect the major API and internal organization decisions to remain
stable.

I will preface this with some benchmark figures, measured on my (new) AMD Ryzen 9 9950X3D:

All formats:
  - single thread: Overall speedup=2.109x faster, min=0.018x max=40.309x
  - multi thread:  Overall speedup=2.607x faster, min=0.112x max=254.738x

"Common" formats: (referenced >100 times in FFmpeg source code)
  - single thread: Overall speedup=2.797x faster, min=0.408x max=16.514x
  - multi thread:  Overall speedup=2.870x faster, min=0.715x max=21.983x

However, the main goal of this rewrite is not to improve performance, but to
improve the maintainability, extensibility and correctness of the code. Most of
the slowdowns for "common" formats are due to increased correctness (e.g.
accurate rounding and dithering), and not the result of a regression per se.

All of the remaining slowdowns (notably, the 0.1x cases) are due to incomplete
coverage by the x86 SIMD. In particular, this currently affects bit-packed
formats (e.g. rgb8, rgb4). (I also have not yet incorporated any AVX-512 code,
which some of the existing routines take advantage of.)

While I will continue working on this and expanding coverage to all remaining
operations, I felt that now is a good time to get some code review and
feedback regardless. I would especially appreciate review of the x86 SIMD
code inside libswscale/x86/ops_*.asm, as this is my first time writing
x86 assembly code.

 doc/APIchanges                |   3 +
 doc/scaler.texi               |   3 +
 doc/swscale-v2.txt            | 344 +++++++++++++++++++++++++++
 libswscale/Makefile           |   9 +
 libswscale/format.c           | 945 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 libswscale/format.h           |  29 ++-
 libswscale/graph.c            | 151 ++++++++----
 libswscale/graph.h            |  37 ++-
 libswscale/ops.c              | 850 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 libswscale/ops.h              | 263 +++++++++++++++++++++
 libswscale/ops_backend.c      | 101 ++++++++
 libswscale/ops_backend.h      | 181 ++++++++++++++
 libswscale/ops_chain.c        | 291 +++++++++++++++++++++++
 libswscale/ops_chain.h        | 108 +++++++++
 libswscale/ops_internal.h     | 103 ++++++++
 libswscale/ops_optimizer.c    | 810 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 libswscale/ops_tmpl_common.c  | 176 ++++++++++++++
 libswscale/ops_tmpl_float.c   | 255 ++++++++++++++++++++
 libswscale/ops_tmpl_int.c     | 609 +++++++++++++++++++++++++++++++++++++++++++++++
 libswscale/options.c          |   1 +
 libswscale/swscale.h          |   7 +
 libswscale/tests/swscale.c    |  11 +-
 libswscale/version.h          |   2 +-
 libswscale/x86/Makefile       |   3 +
 libswscale/x86/ops.c          | 735 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 libswscale/x86/ops_common.asm | 208 ++++++++++++++++
 libswscale/x86/ops_float.asm  | 376 +++++++++++++++++++++++++++++
 libswscale/x86/ops_int.asm    | 882 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 tests/checkasm/Makefile       |   8 +-
 tests/checkasm/checkasm.c     |   4 +-
 tests/checkasm/checkasm.h     |  26 +-
 tests/checkasm/sw_ops.c       | 748 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 32 files changed, 8206 insertions(+), 73 deletions(-)


* [FFmpeg-devel] [PATCH 01/17] tests/swscale: improve colorization of speedup
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

The old limits were a bit too tightly clustered around 1.0. Make the
value range much more generous, and also introduce a new highlight
for speedups above 10.0 (order of magnitude improvement).
---
 libswscale/tests/swscale.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/libswscale/tests/swscale.c b/libswscale/tests/swscale.c
index 7081058130..0f1f8311c9 100644
--- a/libswscale/tests/swscale.c
+++ b/libswscale/tests/swscale.c
@@ -79,11 +79,12 @@ static int speedup_count;
 
 static const char *speedup_color(double ratio)
 {
-    return ratio > 1.10 ? "\033[1;32m" : /* bold green */
-           ratio > 1.02 ? "\033[32m"   : /* green */
-           ratio > 0.98 ? ""           : /* default */
-           ratio > 0.95 ? "\033[33m"   : /* yellow */
-           ratio > 0.90 ? "\033[31m"   : /* red */
+    return ratio > 10.00 ? "\033[1;94m" : /* bold blue */
+           ratio >  2.00 ? "\033[1;32m" : /* bold green */
+           ratio >  1.02 ? "\033[32m"   : /* green */
+           ratio >  0.98 ? ""           : /* default */
+           ratio >  0.90 ? "\033[33m"   : /* yellow */
+           ratio >  0.75 ? "\033[31m"   : /* red */
             "\033[1;31m";  /* bold red */
 }
 
-- 
2.49.0


* [FFmpeg-devel] [PATCH 02/17] swscale/graph: expose ff_sws_graph_add_pass
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

So we can move pass-adding business logic outside of graph.c.
---
 libswscale/graph.c | 41 ++++++++++++++++++++---------------------
 libswscale/graph.h | 18 ++++++++++++++++++
 2 files changed, 38 insertions(+), 21 deletions(-)

diff --git a/libswscale/graph.c b/libswscale/graph.c
index cd56f51f88..6cfcf4f2c6 100644
--- a/libswscale/graph.c
+++ b/libswscale/graph.c
@@ -44,10 +44,9 @@ static int pass_alloc_output(SwsPass *pass)
                           pass->num_slices * pass->slice_h, pass->format, 64);
 }
 
-/* slice_align should be a power of two, or 0 to disable slice threading */
-static SwsPass *pass_add(SwsGraph *graph, void *priv, enum AVPixelFormat fmt,
-                         int w, int h, SwsPass *input, int slice_align,
-                         sws_filter_run_t run)
+SwsPass *ff_sws_graph_add_pass(SwsGraph *graph, enum AVPixelFormat fmt,
+                               int width, int height, SwsPass *input,
+                               int align, void *priv, sws_filter_run_t run)
 {
     int ret;
     SwsPass *pass = av_mallocz(sizeof(*pass));
@@ -58,8 +57,8 @@ static SwsPass *pass_add(SwsGraph *graph, void *priv, enum AVPixelFormat fmt,
     pass->run    = run;
     pass->priv   = priv;
     pass->format = fmt;
-    pass->width  = w;
-    pass->height = h;
+    pass->width  = width;
+    pass->height = height;
     pass->input  = input;
     pass->output.fmt = AV_PIX_FMT_NONE;
 
@@ -69,12 +68,12 @@ static SwsPass *pass_add(SwsGraph *graph, void *priv, enum AVPixelFormat fmt,
         return NULL;
     }
 
-    if (!slice_align) {
+    if (!align) {
         pass->slice_h = pass->height;
         pass->num_slices = 1;
     } else {
         pass->slice_h = (pass->height + graph->num_threads - 1) / graph->num_threads;
-        pass->slice_h = FFALIGN(pass->slice_h, slice_align);
+        pass->slice_h = FFALIGN(pass->slice_h, align);
         pass->num_slices = (pass->height + pass->slice_h - 1) / pass->slice_h;
     }
 
@@ -84,12 +83,11 @@ static SwsPass *pass_add(SwsGraph *graph, void *priv, enum AVPixelFormat fmt,
     return pass;
 }
 
-/* Wrapper around pass_add that chains a pass "in-place" */
-static int pass_append(SwsGraph *graph, void *priv, enum AVPixelFormat fmt,
-                       int w, int h, SwsPass **pass, int slice_align,
-                       sws_filter_run_t run)
+/* Wrapper around ff_sws_graph_add_pass() that chains a pass "in-place" */
+static int pass_append(SwsGraph *graph, enum AVPixelFormat fmt, int w, int h,
+                       SwsPass **pass, int align, void *priv, sws_filter_run_t run)
 {
-    SwsPass *new = pass_add(graph, priv, fmt, w, h, *pass, slice_align, run);
+    SwsPass *new = ff_sws_graph_add_pass(graph, fmt, w, h, *pass, align, priv, run);
     if (!new)
         return AVERROR(ENOMEM);
     *pass = new;
@@ -325,19 +323,19 @@ static int init_legacy_subpass(SwsGraph *graph, SwsContext *sws,
         align = 0; /* disable slice threading */
 
     if (c->src0Alpha && !c->dst0Alpha && isALPHA(sws->dst_format)) {
-        ret = pass_append(graph, c, AV_PIX_FMT_RGBA, src_w, src_h, &input, 1, run_rgb0);
+        ret = pass_append(graph, AV_PIX_FMT_RGBA, src_w, src_h, &input, 1, c, run_rgb0);
         if (ret < 0)
             return ret;
     }
 
     if (c->srcXYZ && !(c->dstXYZ && unscaled)) {
-        ret = pass_append(graph, c, AV_PIX_FMT_RGB48, src_w, src_h, &input, 1, run_xyz2rgb);
+        ret = pass_append(graph, AV_PIX_FMT_RGB48, src_w, src_h, &input, 1, c, run_xyz2rgb);
         if (ret < 0)
             return ret;
     }
 
-    pass = pass_add(graph, sws, sws->dst_format, dst_w, dst_h, input, align,
-                    c->convert_unscaled ? run_legacy_unscaled : run_legacy_swscale);
+    pass = ff_sws_graph_add_pass(graph, sws->dst_format, dst_w, dst_h, input, align, sws,
+                                 c->convert_unscaled ? run_legacy_unscaled : run_legacy_swscale);
     if (!pass)
         return AVERROR(ENOMEM);
     pass->setup = setup_legacy_swscale;
@@ -387,7 +385,7 @@ static int init_legacy_subpass(SwsGraph *graph, SwsContext *sws,
     }
 
     if (c->dstXYZ && !(c->srcXYZ && unscaled)) {
-        ret = pass_append(graph, c, AV_PIX_FMT_RGB48, dst_w, dst_h, &pass, 1, run_rgb2xyz);
+        ret = pass_append(graph, AV_PIX_FMT_RGB48, dst_w, dst_h, &pass, 1, c, run_rgb2xyz);
         if (ret < 0)
             return ret;
     }
@@ -548,8 +546,8 @@ static int adapt_colors(SwsGraph *graph, SwsFormat src, SwsFormat dst,
         return ret;
     }
 
-    pass = pass_add(graph, lut, fmt_out, src.width, src.height,
-                    input, 1, run_lut3d);
+    pass = ff_sws_graph_add_pass(graph, fmt_out, src.width, src.height,
+                                 input, 1, lut, run_lut3d);
     if (!pass) {
         ff_sws_lut3d_free(&lut);
         return AVERROR(ENOMEM);
@@ -589,7 +587,8 @@ static int init_passes(SwsGraph *graph)
         graph->noop = 1;
 
         /* Add threaded memcpy pass */
-        pass = pass_add(graph, NULL, dst.format, dst.width, dst.height, pass, 1, run_copy);
+        pass = ff_sws_graph_add_pass(graph, dst.format, dst.width, dst.height,
+                                     pass, 1, NULL, run_copy);
         if (!pass)
             return AVERROR(ENOMEM);
     }
diff --git a/libswscale/graph.h b/libswscale/graph.h
index b42d54be04..62b622a065 100644
--- a/libswscale/graph.h
+++ b/libswscale/graph.h
@@ -128,6 +128,24 @@ typedef struct SwsGraph {
 int ff_sws_graph_create(SwsContext *ctx, const SwsFormat *dst, const SwsFormat *src,
                         int field, SwsGraph **out_graph);
 
+
+/**
+ * Allocate and add a new pass to the filter graph.
+ *
+ * @param graph  Filter graph to add the pass to.
+ * @param fmt    Pixel format of the output image.
+ * @param width  Width of the output image.
+ * @param height Height of the output image.
+ * @param input  Previous pass to read from, or NULL for the input image.
+ * @param align  Minimum slice alignment for this pass, or 0 for no threading.
+ * @param priv   Private state for the filter run function.
+ * @param run    Filter function to run.
+ * @return The newly created pass, or NULL on error.
+ */
+SwsPass *ff_sws_graph_add_pass(SwsGraph *graph, enum AVPixelFormat fmt,
+                               int width, int height, SwsPass *input,
+                               int align, void *priv, sws_filter_run_t run);
+
 /**
  * Uninitialize any state associate with this filter graph and free it.
  */
-- 
2.49.0


* [FFmpeg-devel] [PATCH 03/17] swscale/graph: make noop loop more robust
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

The current loop only works if the input and output have the same number
of planes. However, with the new scaling logic, we can also optimize the
case where the input has extra unneeded planes into a no-op.

For the memcpy fallback to work in these cases we have to instead check if
the *output* pointer is set, rather than the input pointer.
---
 libswscale/graph.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/libswscale/graph.c b/libswscale/graph.c
index 6cfcf4f2c6..c5a46eb257 100644
--- a/libswscale/graph.c
+++ b/libswscale/graph.c
@@ -115,8 +115,10 @@ static void run_copy(const SwsImg *out_base, const SwsImg *in_base,
     SwsImg in  = shift_img(in_base,  y);
     SwsImg out = shift_img(out_base, y);
 
-    for (int i = 0; i < FF_ARRAY_ELEMS(in.data) && in.data[i]; i++) {
+    for (int i = 0; i < FF_ARRAY_ELEMS(out.data) && out.data[i]; i++) {
         const int lines = h >> vshift(in.fmt, i);
+        av_assert1(in.data[i]);
+
         if (in.linesize[i] == out.linesize[i]) {
             memcpy(out.data[i], in.data[i], lines * out.linesize[i]);
         } else {
-- 
2.49.0


* [FFmpeg-devel] [PATCH 04/17] swscale/graph: move vshift() and shift_img() to shared header
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

I need to reuse these inside `ops.c`.
---
 libswscale/graph.c | 29 +++++++----------------------
 libswscale/graph.h | 13 +++++++++++++
 2 files changed, 20 insertions(+), 22 deletions(-)

diff --git a/libswscale/graph.c b/libswscale/graph.c
index c5a46eb257..b921b7ec02 100644
--- a/libswscale/graph.c
+++ b/libswscale/graph.c
@@ -94,29 +94,14 @@ static int pass_append(SwsGraph *graph, enum AVPixelFormat fmt, int w, int h,
     return 0;
 }
 
-static int vshift(enum AVPixelFormat fmt, int plane)
-{
-    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
-    return (plane == 1 || plane == 2) ? desc->log2_chroma_h : 0;
-}
-
-/* Shift an image vertically by y lines */
-static SwsImg shift_img(const SwsImg *img_base, int y)
-{
-    SwsImg img = *img_base;
-    for (int i = 0; i < 4 && img.data[i]; i++)
-        img.data[i] += (y >> vshift(img.fmt, i)) * img.linesize[i];
-    return img;
-}
-
 static void run_copy(const SwsImg *out_base, const SwsImg *in_base,
                      int y, int h, const SwsPass *pass)
 {
-    SwsImg in  = shift_img(in_base,  y);
-    SwsImg out = shift_img(out_base, y);
+    SwsImg in  = ff_sws_img_shift(*in_base,  y);
+    SwsImg out = ff_sws_img_shift(*out_base, y);
 
     for (int i = 0; i < FF_ARRAY_ELEMS(out.data) && out.data[i]; i++) {
-        const int lines = h >> vshift(in.fmt, i);
+        const int lines = h >> ff_fmt_vshift(in.fmt, i);
         av_assert1(in.data[i]);
 
         if (in.linesize[i] == out.linesize[i]) {
@@ -219,7 +204,7 @@ static void run_legacy_unscaled(const SwsImg *out, const SwsImg *in_base,
 {
     SwsContext *sws = slice_ctx(pass, y);
     SwsInternal *c = sws_internal(sws);
-    const SwsImg in = shift_img(in_base, y);
+    const SwsImg in = ff_sws_img_shift(*in_base, y);
 
     c->convert_unscaled(c, (const uint8_t *const *) in.data, in.linesize, y, h,
                         out->data, out->linesize);
@@ -230,7 +215,7 @@ static void run_legacy_swscale(const SwsImg *out_base, const SwsImg *in,
 {
     SwsContext *sws = slice_ctx(pass, y);
     SwsInternal *c = sws_internal(sws);
-    const SwsImg out = shift_img(out_base, y);
+    const SwsImg out = ff_sws_img_shift(*out_base, y);
 
     ff_swscale(c, (const uint8_t *const *) in->data, in->linesize, 0,
                sws->src_h, out.data, out.linesize, y, h);
@@ -490,8 +475,8 @@ static void run_lut3d(const SwsImg *out_base, const SwsImg *in_base,
                       int y, int h, const SwsPass *pass)
 {
     SwsLut3D *lut = pass->priv;
-    const SwsImg in  = shift_img(in_base,  y);
-    const SwsImg out = shift_img(out_base, y);
+    const SwsImg in  = ff_sws_img_shift(*in_base,  y);
+    const SwsImg out = ff_sws_img_shift(*out_base, y);
 
     ff_sws_lut3d_apply(lut, in.data[0], in.linesize[0], out.data[0],
                        out.linesize[0], pass->width, h);
diff --git a/libswscale/graph.h b/libswscale/graph.h
index 62b622a065..191734b794 100644
--- a/libswscale/graph.h
+++ b/libswscale/graph.h
@@ -34,6 +34,19 @@ typedef struct SwsImg {
     int linesize[4];
 } SwsImg;
 
+static av_always_inline av_const int ff_fmt_vshift(enum AVPixelFormat fmt, int plane)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    return (plane == 1 || plane == 2) ? desc->log2_chroma_h : 0;
+}
+
+static av_const inline SwsImg ff_sws_img_shift(SwsImg img, const int y)
+{
+    for (int i = 0; i < 4 && img.data[i]; i++)
+        img.data[i] += (y >> ff_fmt_vshift(img.fmt, i)) * img.linesize[i];
+    return img;
+}
+
 typedef struct SwsPass  SwsPass;
 typedef struct SwsGraph SwsGraph;
 
-- 
2.49.0


* [FFmpeg-devel] [PATCH 05/17] swscale/graph: prefer bools to ints
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This is more consistent with the rest of the newly added code, which
universally switched to using bools for boolean values.
---
 libswscale/format.c | 2 +-
 libswscale/format.h | 6 ++++--
 libswscale/graph.h  | 6 ++++--
 3 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/libswscale/format.c b/libswscale/format.c
index b859af7b04..e4c1348b90 100644
--- a/libswscale/format.c
+++ b/libswscale/format.c
@@ -483,7 +483,7 @@ static int infer_trc_ref(SwsColor *csp, const SwsColor *ref)
     return 1;
 }
 
-int ff_infer_colors(SwsColor *src, SwsColor *dst)
+bool ff_infer_colors(SwsColor *src, SwsColor *dst)
 {
     int incomplete = 0;
 
diff --git a/libswscale/format.h b/libswscale/format.h
index 11b4345f7c..3b6d745159 100644
--- a/libswscale/format.h
+++ b/libswscale/format.h
@@ -21,6 +21,8 @@
 #ifndef SWSCALE_FORMAT_H
 #define SWSCALE_FORMAT_H
 
+#include <stdbool.h>
+
 #include "libavutil/csp.h"
 #include "libavutil/pixdesc.h"
 
@@ -129,7 +131,7 @@ static inline int ff_fmt_align(enum AVPixelFormat fmt)
 
 int ff_test_fmt(const SwsFormat *fmt, int output);
 
-/* Returns 1 if the formats are incomplete, 0 otherwise */
-int ff_infer_colors(SwsColor *src, SwsColor *dst);
+/* Returns true if the formats are incomplete, false otherwise */
+bool ff_infer_colors(SwsColor *src, SwsColor *dst);
 
 #endif /* SWSCALE_FORMAT_H */
diff --git a/libswscale/graph.h b/libswscale/graph.h
index 191734b794..97378afc72 100644
--- a/libswscale/graph.h
+++ b/libswscale/graph.h
@@ -21,6 +21,8 @@
 #ifndef SWSCALE_GRAPH_H
 #define SWSCALE_GRAPH_H
 
+#include <stdbool.h>
+
 #include "libavutil/slicethread.h"
 #include "swscale.h"
 #include "format.h"
@@ -108,8 +110,8 @@ typedef struct SwsGraph {
     SwsContext *ctx;
     AVSliceThread *slicethread;
     int num_threads; /* resolved at init() time */
-    int incomplete;  /* set during init() if formats had to be inferred */
-    int noop;        /* set during init() if the graph is a no-op */
+    bool incomplete; /* set during init() if formats had to be inferred */
+    bool noop;       /* set during init() if the graph is a no-op */
 
     /** Sorted sequence of filter passes to apply */
     SwsPass **passes;
-- 
2.49.0


* [FFmpeg-devel] [PATCH 06/17] doc: add swscale rewrite design document
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This should hopefully serve as a better introduction to my new swscale
redesign than hunting down random commit message monologues.
---
 doc/swscale-v2.txt | 344 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 344 insertions(+)
 create mode 100644 doc/swscale-v2.txt

diff --git a/doc/swscale-v2.txt b/doc/swscale-v2.txt
new file mode 100644
index 0000000000..3ae2b27036
--- /dev/null
+++ b/doc/swscale-v2.txt
@@ -0,0 +1,344 @@
+New swscale design to change everything (tm)
+============================================
+
+SwsGraph
+--------
+
+The entry point to the new architecture, SwsGraph is what coordinates
+multiple "passes". These can include cascaded scaling passes, error diffusion
+dithering, and so on. Or we could have separate passes for the vertical and
+horizontal scaling. In between each SwsPass lies a fully allocated image buffer.
+Graph passes may have different levels of threading, e.g. we can have a single
+threaded error diffusion pass following a multi-threaded scaling pass.
+
+SwsGraph is internally recreated whenever the image format, dimensions or
+settings change in any way. sws_scale_frame() is itself just a light-weight
+wrapper that runs ff_sws_graph_create() whenever the format changes, splits
+interlaced images into separate fields, and calls ff_sws_graph_run() on each.
+
+From the point of view of SwsGraph itself, all inputs are progressive.
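+
+As a rough sketch of this wrapper logic (illustrative pseudo-code;
+format_changed() and the ff_sws_graph_run() call are simplified
+placeholders, see sws_scale_frame() for the real implementation):
+
+  if (!ctx->graph || format_changed(ctx->graph, &src_fmt, &dst_fmt)) {
+      /* Rebuild the graph whenever formats, dimensions or settings change */
+      ret = ff_sws_graph_create(ctx, &dst_fmt, &src_fmt, field, &ctx->graph);
+      if (ret < 0)
+          return ret;
+  }
+
+  /* Run the graph once per field; each field is treated as progressive */
+  ff_sws_graph_run(ctx->graph, dst_frame, src_frame);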
+
+SwsOp / SwsOpList
+-----------------
+
+This is the newly introduced abstraction layer between the high-level format
+handling logic and the low-level backing implementation. Each SwsOp is designed
+to be as small and atomic as possible, with the possible exception of the
+read / write operations due to their numerous variants.
+
+The basic idea is to split logic between three major components:
+
+1. The high-level format "business logic", which generates in a very
+   naive way a sequence of operations guaranteed to get you from point A
+   to point B. This logic is written with correctness in mind only, and
+   ignoring any performance concerns or low-level implementation decisions.
+   Semantically, everything is always decoded from the input format to
+   normalized (real valued) RGB, and then encoded back to output format.
+
+   This code lives in libswscale/format.c
+
+2. The optimizer. This is where the "magic" happens, so to speak. The
+   optimizer's job is to take the abstract sequence of operations
+   produced by the high-level format analysis code and incrementally
+   optimize it. Each optimization step is designed to be minute and provably
+   lossless, or otherwise guarded behind the BITEXACT flag. This ensures that
+   the resulting output is always identical, no matter how many layers of
+   optimization we add.
+
+   This code lives in libswscale/ops.c
+
+3. The compiler. Once we have a sequence of operations as output by the
+   optimizer, we "compile" this down to a callable function. This is then
+   applied by the dispatch wrapper by striping it over the input image.
+
+   See libswscale/ops_backend.c for the reference backend, or
+   libswscale/x86/ops.c for a more complex SIMD example.
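+
+Putting the three stages together, a conversion conceptually flows like
+this (illustrative pseudo-code; these helper names are invented for the
+example and do not map one-to-one onto the real API):
+
+  SwsOpList *ops = generate_ops(&src_fmt, &dst_fmt); /* 1. format.c      */
+  optimize_ops(ops);                                 /* 2. ops.c         */
+  func = backend->compile(ops);                      /* 3. ops_backend.c */
+  dispatch(func, &dst_image, &src_image);            /* striped over the image */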
+
+This overall approach has a considerable number of benefits:
+
+1. It allows us to verify correctness of logic and spot semantic errors at a
+   very high level, by simply looking at the sequence of operations (available
+   by default at debug / verbose log level), without having to dig through the
+   multiple levels of complicated, interwoven format handling code that is
+   legacy swscale.
+
+2. Because most of the brains live inside the powerful optimizer, we get
+   fast paths "for free" for any suitable format conversion, rather than having
+   to enumerate them one by one. SIMD code itself can be written in a very
+   general way and does not need to be tied to specific pixel formats - subsequent
+   low-level implementations can be strung together without much overhead.
+
+3. We can in the future, with relative ease, compile these operations
+   down to SPIR-V (or even LLVM IR) and generate efficient GPU or
+   target-machine specific implementations. This also opens the door to
+   adding hardware frame support to libswscale, and even transparently using
+   GPU acceleration for CPU frames.
+
+4. Platform-specific SIMD can be reduced down to a comparatively small set of
+   optimized routines, while still providing 100% coverage for all possible
+   pixel formats and operations. (As of writing, the x86 example backend has
+   about 60 unique implementations, of which 20 are trivial swizzles, 10 are
+   read/write ops, 10 are pixel type conversions and the remaining 20 are the
+   various logic/arithmetic ops).
+
+5. Backends hide behind a layer of abstraction offering them a considerable
+   deal of flexibility in how they want to implement their operations. For
+   example, the x86 backend has a dedicated function for compiling compatible
+   operations down to a single in-place pshufb instruction.
+
+   Platform specific low level data is self-contained within its own setup()
+   function and private data structure, eliminating all reads into SwsContext
+   or the possibility of conflicts between platforms.
+
+6. We can compute an exact reference result for each operation with fixed
+   precision (ff_sws_apply_op_q), and use that to e.g. measure the amount of
+   error introduced by dithering, or even catch bugs in the reference C
+   implementation. (In theory - currently checkasm just compares against C)
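+
+   As a hypothetical usage example (assuming the SwsOp field layout from
+   ops.h), an exact clamp could be verified on rational pixel values via:
+
+     SwsOp op = { .op = SWS_OP_MIN, .type = SWS_PIXEL_F32,
+                  .c.q4 = {{255, 1}, {255, 1}, {255, 1}, {255, 1}} };
+     AVRational x[4] = {{300, 1}, {100, 1}, {0, 1}, {255, 1}};
+     ff_sws_apply_op_q(&op, x); /* x[0] is now exactly 255/1 */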
+
+Examples of SwsOp in action
+---------------------------
+
+For illustration, here is the sequence of operations currently generated by
+my prototype, for a conversion from RGB24 to YUV444P:
+
+Unoptimized operation list:
+  [ u8 .... -> ....] SWS_OP_READ         : 3 elem(s) packed >> 0
+  [ u8 .... -> ....] SWS_OP_SWIZZLE      : 0123
+  [ u8 .... -> ....] SWS_OP_RSHIFT       : >> 0
+  [ u8 .... -> ....] SWS_OP_CLEAR        : {_ _ _ 0}
+  [ u8 .... -> ....] SWS_OP_CONVERT      : u8 -> f32
+  [f32 .... -> ....] SWS_OP_LINEAR       : diag3+alpha [[1/255 0 0 0 0] [0 1/255 0 0 0] [0 0 1/255 0 0] [0 0 0 1 1]]
+  [f32 .... -> ....] SWS_OP_LINEAR       : matrix3 [[0.299000 0.587000 0.114000 0 0] [-0.168736 -0.331264 1/2 0 0] [1/2 -0.418688 -57/701 0 0] [0 0 0 1 0]]
+  [f32 .... -> ....] SWS_OP_LINEAR       : diag3+off3 [[219 0 0 0 16] [0 224 0 0 128] [0 0 224 0 128] [0 0 0 1 0]]
+  [f32 .... -> ....] SWS_OP_DITHER       : 16x16 matrix
+  [f32 .... -> ....] SWS_OP_MAX          : {0 0 0 0} <= x
+  [f32 .... -> ....] SWS_OP_MIN          : x <= {255 255 255 _}
+  [f32 .... -> ....] SWS_OP_CONVERT      : f32 -> u8
+  [ u8 .... -> ....] SWS_OP_LSHIFT       : << 0
+  [ u8 .... -> ....] SWS_OP_SWIZZLE      : 0123
+  [ u8 .... -> ....] SWS_OP_WRITE        : 3 elem(s) planar >> 0
+
+This is optimized into the following sequence:
+
+Optimized operation list:
+  [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) packed >> 0
+  [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> f32
+  [f32 ...X -> ...X] SWS_OP_LINEAR       : matrix3+off3 [[0.256788 0.504129 0.097906 0 16] [-0.148223 -0.290993 112/255 0 128] [112/255 -0.367788 -0.071427 0 128] [0 0 0 1 0]]
+  [f32 ...X -> ...X] SWS_OP_DITHER       : 16x16 matrix
+  [f32 ...X -> +++X] SWS_OP_CONVERT      : f32 -> u8
+  [ u8 ...X -> +++X] SWS_OP_WRITE        : 3 elem(s) planar >> 0
+    (X = unused, + = exact, 0 = zero)
+
+The extra metadata on the left of the operation list is just a dump of the
+internal state used by the optimizer during optimization. It keeps track of
+knowledge about the pixel values, such as their value range, whether or not
+they're exact integers, and so on.
+
+In this example, you can see that the input values are exact (except for
+the alpha channel, which is undefined), until the first SWS_OP_LINEAR
+multiplies them by a noninteger constant. They regain their exact integer
+status only after the (truncating) conversion to U8 in the output step.
+
+Example of more aggressive optimization
+---------------------------------------
+
+Conversion pass for gray -> rgb48:
+Unoptimized operation list:
+  [ u8 .... -> ....] SWS_OP_READ         : 1 elem(s) planar >> 0
+  [ u8 .... -> ....] SWS_OP_SWIZZLE      : 0123
+  [ u8 .... -> ....] SWS_OP_RSHIFT       : >> 0
+  [ u8 .... -> ....] SWS_OP_CLEAR        : {_ 0 0 0}
+  [ u8 .... -> ....] SWS_OP_CONVERT      : u8 -> f32
+  [f32 .... -> ....] SWS_OP_LINEAR       : luma+alpha [[1/255 0 0 0 0] [0 1 0 0 0] [0 0 1 0 0] [0 0 0 1 1]]
+  [f32 .... -> ....] SWS_OP_LINEAR       : matrix3 [[1 0 701/500 0 0] [1 -0.344136 -0.714136 0 0] [1 443/250 0 0 0] [0 0 0 1 0]]
+  [f32 .... -> ....] SWS_OP_LINEAR       : diag3 [[65535 0 0 0 0] [0 65535 0 0 0] [0 0 65535 0 0] [0 0 0 1 0]]
+  [f32 .... -> ....] SWS_OP_MAX          : {0 0 0 0} <= x
+  [f32 .... -> ....] SWS_OP_MIN          : x <= {65535 65535 65535 _}
+  [f32 .... -> ....] SWS_OP_CONVERT      : f32 -> u16
+  [u16 .... -> ....] SWS_OP_LSHIFT       : << 0
+  [u16 .... -> ....] SWS_OP_SWIZZLE      : 0123
+  [u16 .... -> ....] SWS_OP_WRITE        : 3 elem(s) packed >> 0
+
+Optimized operation list:
+  [ u8 XXXX -> +XXX] SWS_OP_READ         : 1 elem(s) planar >> 0
+  [ u8 .XXX -> +XXX] SWS_OP_CONVERT      : u8 -> u16 (expand)
+  [u16 .XXX -> +++X] SWS_OP_SWIZZLE      : 0003
+  [u16 ...X -> +++X] SWS_OP_WRITE        : 3 elem(s) packed >> 0
+    (X = unused, + = exact, 0 = zero)
+
+Here, the optimizer has managed to eliminate all of the unnecessary linear
+operations on previously zero'd values, turn the resulting column matrix into
+a swizzle operation, avoid the unnecessary dither (and round trip via float)
+because the pixel values are guaranteed to be bit exact, and finally, turn
+the multiplication by 65535 / 255 = 257 into a simple integer expand operation.
+
+As a final bonus, the x86 backend further optimizes this into a 12-byte shuffle:
+  pshufb = {0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, -1, -1, -1, -1}
+
+time=208 us, ref=4212 us, speedup=20.236x faster (single thread)
+time=57 us, ref=472 us, speedup=8.160x faster (multi thread)
+
+Compiler and underlying implementation layer (SwsOpChain)
+---------------------------------------------------------
+
+While the backend API is flexible enough to permit more exotic implementations
+(e.g. using JIT code generation), we establish a common set of helpers for use
+in "traditional" SIMD implementations.
+
+The basic idea is to have one "kernel" (or implementation) per operation,
+and then just chain a list of these kernels together as separate function
+calls. For best performance, we want to keep data in vector registers in
+between function calls using a custom calling convention, thus avoiding any
+unnecessary memory accesses. Additionally, we want the per-kernel overhead to
+be as low as possible, with each kernel ideally just jumping directly into
+the next kernel.
+
+As a result, we arrive at a design where we first divide the image into small
+chunks, or "blocks", and then dispatch the "chain" of kernels on each chunk in
+sequence. Each kernel processes a fixed number of pixels, with the overall
+entry point taking care of looping. Remaining pixels (the "tail") are handled
+generically by the backend-invariant dispatch code (located in ops.c), using a
+partial memcpy into a suitably sized temporary buffer.
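+
+In pseudo-code, the dispatch logic for one scanline might look like this
+(illustrative placeholders only; the real loop is split between the
+compiled entry point and the generic code in ops.c):
+
+  for (x = 0; x + SWS_BLOCK_SIZE <= width; x += SWS_BLOCK_SIZE)
+      chain->entry(...);      /* run the kernel chain on one whole block */
+
+  if (x < width) {
+      /* tail: copy the remainder into a padded temporary buffer, run a
+       * single block over it, then copy the valid pixels back out */
+      memcpy(tmp, src + x * bpp, (width - x) * bpp);
+      chain->entry(... /* reading from / writing to tmp */);
+      memcpy(dst + x * bpp, tmp, (width - x) * bpp);
+  }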
+
+To minimize the per-kernel function call overhead, we use a "continuation
+passing style" for chaining kernels. Each operation computes its result and
+then directly calls the next operation in the sequence, with the appropriate
+internal function signature.
+
+The C reference backend reads data into the stack and then passes the array
+pointers to the next continuation as regular function arguments:
+
+  void process(GlobalContext *ctx, OpContext *op,
+               block_t x, block_t y, block_t z, block_t w)
+  {
+      for (int i = 0; i < SWS_BLOCK_SIZE; i++)
+          /* do something with x[i], y[i], z[i], w[i] */;
+
+      op->next(ctx, &op[1], x, y, z, w);
+  }
+
+With type conversions pushing the new data onto the stack as well:
+
+  void convert8to16(GlobalContext *ctx, OpContext *op,
+                    block_t x, block_t y, block_t z, block_t w)
+  {
+      /* Pseudo-code */
+      u16block_t x16 = (u16block_t) x;
+      u16block_t y16 = (u16block_t) y;
+      u16block_t z16 = (u16block_t) z;
+      u16block_t w16 = (u16block_t) w;
+
+      op->next(ctx, &op[1], x16, y16, z16, w16);
+  }
+
+By contrast, the x86 backend always keeps the X/Y/Z/W values pinned in specific
+vector registers (ymm0-ymm3 for the lower half, and ymm4-ymm7 for the upper
+half).
+
+Each kernel additionally has access to a 32 byte per-op context storing the
+pointer to the next kernel plus 16 bytes of arbitrary private data. This is
+used during construction of the function chain to place things like small
+constants.
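+
+A per-op context matching this description might look like the following
+(an illustrative sketch assuming 8-byte function pointers; the actual
+definition lives in ops_chain.h):
+
+  typedef struct OpContext {
+      void (*next)(void); /* continuation: address of the next kernel */
+      uint8_t priv[16];   /* 16 bytes of arbitrary per-op private data */
+      uint8_t pad[8];     /* pad the total size to 32 bytes */
+  } OpContext;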
+
+In assembly, the per-kernel overhead looks like this:
+
+  load $tmp, $arg1
+  ...
+  add $arg1, 32
+  jump $tmp
+
+This design gives vastly better performance than the alternative of returning
+out to a central loop or "trampoline". This is partly because the order of
+kernels within a chain is always the same, so the branch predictor can easily
+remember the target address of each "jump" instruction.
+
+The only way to realistically improve on this design would be to directly
+stitch the kernel body together using runtime code generation.
+
+Future considerations and limitations
+-------------------------------------
+
+My current prototype has a number of severe limitations and opportunities
+for improvements:
+
+1. It does not handle scaling at all. I am not yet entirely sure how I want
+   to handle scaling; this includes handling of subsampled content. I have a
+   number of vague ideas in my head, but nothing where I can say with certainty
+   that it will work out well.
+
+   It's possible that we won't come up with a perfect solution here, and will
+   need to decide on which set of compromises we are comfortable accepting:
+
+   1. Do we need the ability to scale YUV -> YUV by handling luma and chroma
+      independently? When downscaling 100x100 4:2:0 to 50x50 4:4:4, should we
+      support the option of reusing the chroma plane directly (even though
+      this would introduce a subpixel shift for typical chroma siting)?
+
+   Looking towards zimg, I am also thinking that we probably want to do
+   scaling on floating point values, since this is best for both performance
+   and accuracy, especially given that we need to go up to 32-bit intermediates
+   during scaling anyway.
+
+   So far, the most promising approach seems to be to handle subsampled
+   input/output as a dedicated read/write operation type; perhaps even with a
+   fixed/static subsampling kernel. To avoid compromising on performance when
+   chroma resampling is not necessary, the optimizer could then relax the
+   pipeline to use non-interpolating read/writes when all intermediate
+   operations are component-independent.
+
+2. Since each operation is conceptually defined on 4-component pixels, we end
+   up defining a lot of variants of each implementation for each possible
+   *subset*. For example, we have four different implementations for
+   SWS_OP_SCALE in my current templates:
+    - op_scale_1000
+    - op_scale_1001
+    - op_scale_1110
+    - op_scale_1111
+
+   This reflects the four different arrangements of pixel components that are
+   typically present (or absent). While best for performance, it does turn into
+   a bit of a chore when implementing these kernels.
+
+   The only real alternative would be to either branch inside the kernel (bad),
+   or to use separate kernels for each individual component and chain them all
+   together. I have not yet tested whether the latter approach would be faster
+   after the latest round of refactors to the kernel glue code.
+
+3. I do not yet have any support for LUTs. But when I add them, something we
+   could do is have the optimizer pass automatically "promote" a sequence of
+   operations to LUTs. For example, any sequence that looks like:
+
+   1. [u8] SWS_OP_CONVERT -> X
+   2. [X] ... // only per-component operations
+   3. [X] SWS_OP_CONVERT -> Y
+   4. [Y] SWS_OP_WRITE
+
+   could be replaced by a LUT with 256 entries. This is especially important
+   for anything involving packed 8-bit input (e.g. rgb8, rgb4_byte).
+
+   We also definitely want to hook this up to the existing CMS code for
+   transformations between different primaries.
+
+4. Because we rely on AVRational math to generate the coefficients for
+   operations, we need to be able to represent all pixel values as an
+   AVRational. However, this presents a challenge for 32-bit formats (e.g.
+   GRAY32, RGBA128), because their maximum pixel value exceeds INT_MAX, the
+   largest value representable by an AVRational.
+
+   It's possible we may want to introduce an AVRational64 for this, or
+   perhaps more flexibly, extend AVRational to an AVFloating type which is
+   represented as { AVRational n; int exp; }, representing n/d * 2^exp. This
+   would preserve our ability to represent all pixel values exactly, while
+   opening up the range arbitrarily.
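+
+   A sketch of such a type (hypothetical, not an existing libavutil API):
+
+     typedef struct AVFloating {
+         AVRational n; /* rational mantissa n/d */
+         int exp;      /* binary exponent: value = n/d * 2^exp */
+     } AVFloating;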
+
+5. Is there ever a situation where the use of floats introduces the risk of
+   non bit-exact output? If so, and given the possible performance advantages,
+   we may want to explore the use of a fixed-point 16 bit path as an alternative
+   to the floating point math.
+
+   So far, I have managed to avoid any bit exactness issues inside the x86
+   backend by ensuring that the order of linear operations is identical
+   between the C backend and the x86 backend, but this may not be practical
+   to guarantee on all backends. The x86 float code is also dramatically
+   faster than the old fixed point code, so I'm tentatively optimistic that
+   a fixed point path will not be needed.
-- 
2.49.0


* [FFmpeg-devel] [PATCH 07/17] swscale: add SWS_EXPERIMENTAL flag
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Give users and developers a way to opt in to the new format conversion code,
and more code from the swscale rewrite in general.
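
For example, opting in from C code (a sketch, assuming the public flags
field of the new SwsContext API):

  SwsContext *sws = sws_alloc_context();
  if (sws)
      sws->flags |= SWS_EXPERIMENTAL;

On the command line, the same flag should be reachable through the usual
flag parsing, e.g. "-sws_flags experimental" (via the named constant added
to options.c below).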
---
 doc/APIchanges       | 3 +++
 doc/scaler.texi      | 3 +++
 libswscale/options.c | 1 +
 libswscale/swscale.h | 7 +++++++
 libswscale/version.h | 2 +-
 5 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/doc/APIchanges b/doc/APIchanges
index 22aa6fa5c7..84bc721569 100644
--- a/doc/APIchanges
+++ b/doc/APIchanges
@@ -2,6 +2,9 @@ The last version increases of all libraries were on 2025-03-28
 
 API changes, most recent first:
 
+2025-04-xx - xxxxxxxxxx - lsws 9.1.100 - swscale.h
+  Add SWS_EXPERIMENTAL flag.
+
 2025-04-16 - c818c67991 - libpostproc 59.1.100 - postprocess.h
   Deprecate PP_CPU_CAPS_3DNOW.
 
diff --git a/doc/scaler.texi b/doc/scaler.texi
index eb045de6b7..519a83b5d3 100644
--- a/doc/scaler.texi
+++ b/doc/scaler.texi
@@ -68,6 +68,9 @@ Select full chroma input.
 
 @item bitexact
 Enable bitexact output.
+
+@item experimental
+Allow the use of experimental new code. For testing only.
 @end table
 
 @item srcw @var{(API only)}
diff --git a/libswscale/options.c b/libswscale/options.c
index feecae8c89..044c7c7f0b 100644
--- a/libswscale/options.c
+++ b/libswscale/options.c
@@ -50,6 +50,7 @@ static const AVOption swscale_options[] = {
         { "full_chroma_inp", "full chroma input",             0,  AV_OPT_TYPE_CONST, { .i64 = SWS_FULL_CHR_H_INP }, .flags = VE, .unit = "sws_flags" },
         { "bitexact",        "bit-exact mode",                0,  AV_OPT_TYPE_CONST, { .i64 = SWS_BITEXACT       }, .flags = VE, .unit = "sws_flags" },
         { "error_diffusion", "error diffusion dither",        0,  AV_OPT_TYPE_CONST, { .i64 = SWS_ERROR_DIFFUSION}, .flags = VE, .unit = "sws_flags" },
+        { "experimental",    "allow experimental new code",   0,  AV_OPT_TYPE_CONST, { .i64 = SWS_EXPERIMENTAL   }, .flags = VE, .unit = "sws_flags" },
 
     { "param0",          "scaler param 0", OFFSET(scaler_params[0]), AV_OPT_TYPE_DOUBLE, { .dbl = SWS_PARAM_DEFAULT  }, INT_MIN, INT_MAX, VE },
     { "param1",          "scaler param 1", OFFSET(scaler_params[1]), AV_OPT_TYPE_DOUBLE, { .dbl = SWS_PARAM_DEFAULT  }, INT_MIN, INT_MAX, VE },
diff --git a/libswscale/swscale.h b/libswscale/swscale.h
index b04aa182d2..82a69e97fc 100644
--- a/libswscale/swscale.h
+++ b/libswscale/swscale.h
@@ -155,6 +155,13 @@ typedef enum SwsFlags {
     SWS_ACCURATE_RND   = 1 << 18,
     SWS_BITEXACT       = 1 << 19,
 
+    /**
+     * Allow using experimental new code paths. This may be faster, slower,
+     * or produce different output, with semantics subject to change at any
+     * point in time. For testing and debugging purposes only.
+     */
+    SWS_EXPERIMENTAL   = 1 << 20,
+
     /**
      * Deprecated flags.
      */
diff --git a/libswscale/version.h b/libswscale/version.h
index 148efd83eb..4e54701aba 100644
--- a/libswscale/version.h
+++ b/libswscale/version.h
@@ -28,7 +28,7 @@
 
 #include "version_major.h"
 
-#define LIBSWSCALE_VERSION_MINOR   0
+#define LIBSWSCALE_VERSION_MINOR   1
 #define LIBSWSCALE_VERSION_MICRO 100
 
 #define LIBSWSCALE_VERSION_INT  AV_VERSION_INT(LIBSWSCALE_VERSION_MAJOR, \
-- 
2.49.0


* [FFmpeg-devel] [PATCH 08/17] swscale/ops: introduce new low level framework
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

See the design document introduced in the previous commit for an
in-depth introduction to the new approach.

This commit merely introduces the common dispatch code, the SwsOpList
boilerplate, and the high-level optimizer. The subsequent commits will
add the underlying implementations.
---
 libswscale/Makefile        |   2 +
 libswscale/ops.c           | 843 +++++++++++++++++++++++++++++++++++++
 libswscale/ops.h           | 265 ++++++++++++
 libswscale/ops_internal.h  | 103 +++++
 libswscale/ops_optimizer.c | 810 +++++++++++++++++++++++++++++++++++
 5 files changed, 2023 insertions(+)
 create mode 100644 libswscale/ops.c
 create mode 100644 libswscale/ops.h
 create mode 100644 libswscale/ops_internal.h
 create mode 100644 libswscale/ops_optimizer.c

diff --git a/libswscale/Makefile b/libswscale/Makefile
index d5e10d17dc..810c9dee78 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -15,6 +15,8 @@ OBJS = alphablend.o                                     \
        graph.o                                          \
        input.o                                          \
        lut3d.o                                          \
+       ops.o                                            \
+       ops_optimizer.o                                  \
        options.o                                        \
        output.o                                         \
        rgb2rgb.o                                        \
diff --git a/libswscale/ops.c b/libswscale/ops.c
new file mode 100644
index 0000000000..6d9a844e06
--- /dev/null
+++ b/libswscale/ops.c
@@ -0,0 +1,843 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/bswap.h"
+#include "libavutil/mem.h"
+#include "libavutil/rational.h"
+#include "libavutil/refstruct.h"
+
+#include "ops.h"
+#include "ops_internal.h"
+
+const SwsOpBackend * const ff_sws_op_backends[] = {
+    NULL
+};
+
+const int ff_sws_num_op_backends = FF_ARRAY_ELEMS(ff_sws_op_backends) - 1;
+
+#define Q(N) ((AVRational) { N, 1 })
+
+#define RET(x)                                                                 \
+    do {                                                                       \
+        if ((ret = (x)) < 0)                                                   \
+            return ret;                                                        \
+    } while (0)
+
+const char *ff_sws_pixel_type_name(SwsPixelType type)
+{
+    switch (type) {
+    case SWS_PIXEL_U8:   return "u8";
+    case SWS_PIXEL_U16:  return "u16";
+    case SWS_PIXEL_U32:  return "u32";
+    case SWS_PIXEL_F32:  return "f32";
+    case SWS_PIXEL_NONE: return "none";
+    case SWS_PIXEL_TYPE_NB: break;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return "ERR";
+}
+
+int ff_sws_pixel_type_size(SwsPixelType type)
+{
+    switch (type) {
+    case SWS_PIXEL_U8:  return sizeof(uint8_t);
+    case SWS_PIXEL_U16: return sizeof(uint16_t);
+    case SWS_PIXEL_U32: return sizeof(uint32_t);
+    case SWS_PIXEL_F32: return sizeof(float);
+    case SWS_PIXEL_NONE: break;
+    case SWS_PIXEL_TYPE_NB: break;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return 0;
+}
+
+bool ff_sws_pixel_type_is_int(SwsPixelType type)
+{
+    switch (type) {
+    case SWS_PIXEL_U8:
+    case SWS_PIXEL_U16:
+    case SWS_PIXEL_U32:
+        return true;
+    case SWS_PIXEL_F32:
+        return false;
+    case SWS_PIXEL_NONE:
+    case SWS_PIXEL_TYPE_NB: break;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return false;
+}
+
+SwsPixelType ff_sws_pixel_type_to_uint(SwsPixelType type)
+{
+    if (!type)
+        return type;
+
+    switch (ff_sws_pixel_type_size(type)) { /* size in bytes */
+    case 1: return SWS_PIXEL_U8;
+    case 2: return SWS_PIXEL_U16;
+    case 4: return SWS_PIXEL_U32;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return SWS_PIXEL_NONE;
+}
+
+/* biased towards `a` */
+static AVRational av_min_q(AVRational a, AVRational b)
+{
+    return av_cmp_q(a, b) == 1 ? b : a;
+}
+
+static AVRational av_max_q(AVRational a, AVRational b)
+{
+    return av_cmp_q(a, b) == -1 ? b : a;
+}
+
+static AVRational expand_factor(SwsPixelType from, SwsPixelType to)
+{
+    const int src = ff_sws_pixel_type_size(from);
+    const int dst = ff_sws_pixel_type_size(to);
+    int scale = 0;
+    for (int i = 0; i < dst / src; i++)
+        scale = scale << src * 8 | 1;
+    return Q(scale);
+}
+
+void ff_sws_apply_op_q(const SwsOp *op, AVRational x[4])
+{
+    switch (op->op) {
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+        return;
+    case SWS_OP_UNPACK: {
+        unsigned val = x[0].num;
+        int shift = ff_sws_pixel_type_size(op->type) * 8;
+        for (int i = 0; i < 4; i++) {
+            const unsigned mask = (1 << op->pack.pattern[i]) - 1;
+            shift -= op->pack.pattern[i];
+            x[i] = Q((val >> shift) & mask);
+        }
+        return;
+    }
+    case SWS_OP_PACK: {
+        unsigned val = 0;
+        int shift = ff_sws_pixel_type_size(op->type) * 8;
+        for (int i = 0; i < 4; i++) {
+            const unsigned mask = (1 << op->pack.pattern[i]) - 1;
+            shift -= op->pack.pattern[i];
+            val |= (x[i].num & mask) << shift;
+        }
+        x[0] = Q(val);
+        return;
+    }
+    case SWS_OP_SWAP_BYTES:
+        switch (ff_sws_pixel_type_size(op->type)) {
+        case 2:
+            for (int i = 0; i < 4; i++)
+                x[i].num = av_bswap16(x[i].num);
+            break;
+        case 4:
+            for (int i = 0; i < 4; i++)
+                x[i].num = av_bswap32(x[i].num);
+            break;
+        }
+        return;
+    case SWS_OP_CLEAR:
+        for (int i = 0; i < 4; i++) {
+            if (op->c.q4[i].den)
+                x[i] = op->c.q4[i];
+        }
+        return;
+    case SWS_OP_LSHIFT: {
+        AVRational mult = Q(1 << op->c.u);
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_mul_q(x[i], mult) : x[i];
+        return;
+    }
+    case SWS_OP_RSHIFT: {
+        AVRational mult = Q(1 << op->c.u);
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_div_q(x[i], mult) : x[i];
+        return;
+    }
+    case SWS_OP_SWIZZLE: {
+        const AVRational orig[4] = { x[0], x[1], x[2], x[3] };
+        for (int i = 0; i < 4; i++)
+            x[i] = orig[op->swizzle.in[i]];
+        return;
+    }
+    case SWS_OP_CONVERT:
+        if (ff_sws_pixel_type_is_int(op->convert.to)) {
+            const AVRational scale = expand_factor(op->type, op->convert.to);
+            for (int i = 0; i < 4; i++) {
+                x[i] = x[i].den ? Q(x[i].num / x[i].den) : x[i];
+                if (op->convert.expand)
+                    x[i] = av_mul_q(x[i], scale);
+            }
+        }
+        return;
+    case SWS_OP_DITHER:
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_add_q(x[i], av_make_q(1, 2)) : x[i];
+        return;
+    case SWS_OP_MIN:
+        for (int i = 0; i < 4; i++)
+            x[i] = av_min_q(x[i], op->c.q4[i]);
+        return;
+    case SWS_OP_MAX:
+        for (int i = 0; i < 4; i++)
+            x[i] = av_max_q(x[i], op->c.q4[i]);
+        return;
+    case SWS_OP_LINEAR: {
+        const AVRational orig[4] = { x[0], x[1], x[2], x[3] };
+        for (int i = 0; i < 4; i++) {
+            AVRational sum = op->lin.m[i][4];
+            for (int j = 0; j < 4; j++)
+                sum = av_add_q(sum, av_mul_q(orig[j], op->lin.m[i][j]));
+            x[i] = sum;
+        }
+        return;
+    }
+    case SWS_OP_SCALE:
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_mul_q(x[i], op->c.q) : x[i];
+        return;
+    }
+
+    av_assert0(!"Invalid operation type!");
+}
+
+static void op_uninit(SwsOp *op)
+{
+    switch (op->op) {
+    case SWS_OP_DITHER:
+        av_refstruct_unref(&op->dither.matrix);
+        break;
+    }
+
+    *op = (SwsOp) {0};
+}
+
+SwsOpList *ff_sws_op_list_alloc(void)
+{
+    return av_mallocz(sizeof(SwsOpList));
+}
+
+void ff_sws_op_list_free(SwsOpList **p_ops)
+{
+    SwsOpList *ops = *p_ops;
+    if (!ops)
+        return;
+
+    for (int i = 0; i < ops->num_ops; i++)
+        op_uninit(&ops->ops[i]);
+
+    av_freep(&ops->ops);
+    av_free(ops);
+    *p_ops = NULL;
+}
+
+SwsOpList *ff_sws_op_list_duplicate(const SwsOpList *ops)
+{
+    SwsOpList *copy = av_malloc(sizeof(*copy));
+    if (!copy)
+        return NULL;
+
+    *copy = *ops;
+    copy->ops = av_memdup(ops->ops, ops->num_ops * sizeof(ops->ops[0]));
+    if (!copy->ops) {
+        av_free(copy);
+        return NULL;
+    }
+
+    for (int i = 0; i < ops->num_ops; i++) {
+        const SwsOp *op = &ops->ops[i];
+        switch (op->op) {
+        case SWS_OP_DITHER:
+            av_refstruct_ref(copy->ops[i].dither.matrix);
+            break;
+        }
+    }
+
+    return copy;
+}
+
+void ff_sws_op_list_remove_at(SwsOpList *ops, int index, int count)
+{
+    const int end = ops->num_ops - count;
+    av_assert2(index >= 0 && count >= 0 && index + count <= ops->num_ops);
+    for (int i = index; i < end; i++)
+        ops->ops[i] = ops->ops[i + count];
+    ops->num_ops = end;
+}
+
+int ff_sws_op_list_insert_at(SwsOpList *ops, int index, SwsOp *op)
+{
+    void *ret;
+    ret = av_dynarray2_add((void **) &ops->ops, &ops->num_ops, sizeof(*op),
+                           (const void *) op);
+    if (!ret) {
+        op_uninit(op);
+        return AVERROR(ENOMEM);
+    }
+
+    for (int i = ops->num_ops - 1; i > index; i--)
+        ops->ops[i] = ops->ops[i - 1];
+    ops->ops[index] = *op;
+    *op = (SwsOp) {0};
+    return 0;
+}
+
+int ff_sws_op_list_append(SwsOpList *ops, SwsOp *op)
+{
+    return ff_sws_op_list_insert_at(ops, ops->num_ops, op);
+}
+
+int ff_sws_op_list_max_size(const SwsOpList *ops)
+{
+    int max_size = 0;
+    for (int i = 0; i < ops->num_ops; i++) {
+        const int size = ff_sws_pixel_type_size(ops->ops[i].type);
+        max_size = FFMAX(max_size, size);
+    }
+
+    return max_size;
+}
+
+uint32_t ff_sws_linear_mask(const SwsLinearOp c)
+{
+    uint32_t mask = 0;
+    for (int i = 0; i < 4; i++) {
+        for (int j = 0; j < 5; j++) {
+            if (av_cmp_q(c.m[i][j], Q(i == j)))
+                mask |= SWS_MASK(i, j);
+        }
+    }
+    return mask;
+}
+
+static const char *describe_lin_mask(uint32_t mask)
+{
+    /* Try to be fairly descriptive without assuming too much */
+    static const struct {
+        const char *name;
+        uint32_t mask;
+    } patterns[] = {
+        { "noop",               0 },
+        { "luma",               SWS_MASK_LUMA },
+        { "alpha",              SWS_MASK_ALPHA },
+        { "luma+alpha",         SWS_MASK_LUMA | SWS_MASK_ALPHA },
+        { "dot3",               0b111 },
+        { "dot4",               0b1111 },
+        { "row0",               SWS_MASK_ROW(0) },
+        { "row0+alpha",         SWS_MASK_ROW(0) | SWS_MASK_ALPHA },
+        { "col0",               SWS_MASK_COL(0) },
+        { "col0+off3",          SWS_MASK_COL(0) | SWS_MASK_OFF3 },
+        { "off3",               SWS_MASK_OFF3 },
+        { "off3+alpha",         SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag3",              SWS_MASK_DIAG3 },
+        { "diag4",              SWS_MASK_DIAG4 },
+        { "diag3+alpha",        SWS_MASK_DIAG3 | SWS_MASK_ALPHA },
+        { "diag3+off3",         SWS_MASK_DIAG3 | SWS_MASK_OFF3 },
+        { "diag3+off3+alpha",   SWS_MASK_DIAG3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag4+off4",         SWS_MASK_DIAG4 | SWS_MASK_OFF4 },
+        { "matrix3",            SWS_MASK_MAT3 },
+        { "matrix3+off3",       SWS_MASK_MAT3 | SWS_MASK_OFF3 },
+        { "matrix3+off3+alpha", SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "matrix4",            SWS_MASK_MAT4 },
+        { "matrix4+off4",       SWS_MASK_MAT4 | SWS_MASK_OFF4 },
+    };
+
+    for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+        if (!(mask & ~patterns[i].mask))
+            return patterns[i].name;
+    }
+
+    return "full";
+}
+
+static char describe_comp_flags(unsigned flags)
+{
+    if (flags & SWS_COMP_GARBAGE)
+        return 'X';
+    else if (flags & SWS_COMP_ZERO)
+        return '0';
+    else if (flags & SWS_COMP_EXACT)
+        return '+';
+    else
+        return '.';
+}
+
+static const char *print_q(const AVRational q, char buf[], int buf_len)
+{
+    if (!q.den) {
+        switch (q.num) {
+        case  1: return "inf";
+        case -1: return "-inf";
+        default: return "nan";
+        }
+    }
+
+    if (q.den == 1) {
+        snprintf(buf, buf_len, "%d", q.num);
+        return buf;
+    }
+
+    if (abs(q.num) > 1000 || abs(q.den) > 1000) {
+        snprintf(buf, buf_len, "%f", av_q2d(q));
+        return buf;
+    }
+
+    snprintf(buf, buf_len, "%d/%d", q.num, q.den);
+    return buf;
+}
+
+#define PRINTQ(q) print_q(q, (char[32]){0}, sizeof(char[32]) - 1)
+
+void ff_sws_op_list_print(void *log, int lev, const SwsOpList *ops)
+{
+    if (!ops->num_ops) {
+        av_log(log, lev, "  (empty)\n");
+        return;
+    }
+
+    for (int i = 0; i < ops->num_ops; i++) {
+        const SwsOp *op = &ops->ops[i];
+        av_log(log, lev, "  [%3s %c%c%c%c -> %c%c%c%c] ",
+               ff_sws_pixel_type_name(op->type),
+               op->comps.unused[0] ? 'X' : '.',
+               op->comps.unused[1] ? 'X' : '.',
+               op->comps.unused[2] ? 'X' : '.',
+               op->comps.unused[3] ? 'X' : '.',
+               describe_comp_flags(op->comps.flags[0]),
+               describe_comp_flags(op->comps.flags[1]),
+               describe_comp_flags(op->comps.flags[2]),
+               describe_comp_flags(op->comps.flags[3]));
+
+        switch (op->op) {
+        case SWS_OP_INVALID:
+            av_log(log, lev, "SWS_OP_INVALID\n");
+            break;
+        case SWS_OP_READ:
+        case SWS_OP_WRITE:
+            av_log(log, lev, "%-20s: %d elem(s) %s >> %d\n",
+                   op->op == SWS_OP_READ ? "SWS_OP_READ"
+                                         : "SWS_OP_WRITE",
+                   op->rw.elems,  op->rw.packed ? "packed" : "planar",
+                   op->rw.frac);
+            break;
+        case SWS_OP_SWAP_BYTES:
+            av_log(log, lev, "SWS_OP_SWAP_BYTES\n");
+            break;
+        case SWS_OP_LSHIFT:
+            av_log(log, lev, "%-20s: << %u\n", "SWS_OP_LSHIFT", op->c.u);
+            break;
+        case SWS_OP_RSHIFT:
+            av_log(log, lev, "%-20s: >> %u\n", "SWS_OP_RSHIFT", op->c.u);
+            break;
+        case SWS_OP_PACK:
+        case SWS_OP_UNPACK:
+            av_log(log, lev, "%-20s: {%d %d %d %d}\n",
+                   op->op == SWS_OP_PACK ? "SWS_OP_PACK"
+                                         : "SWS_OP_UNPACK",
+                   op->pack.pattern[0], op->pack.pattern[1],
+                   op->pack.pattern[2], op->pack.pattern[3]);
+            break;
+        case SWS_OP_CLEAR:
+            av_log(log, lev, "%-20s: {%s %s %s %s}\n", "SWS_OP_CLEAR",
+                   op->c.q4[0].den ? PRINTQ(op->c.q4[0]) : "_",
+                   op->c.q4[1].den ? PRINTQ(op->c.q4[1]) : "_",
+                   op->c.q4[2].den ? PRINTQ(op->c.q4[2]) : "_",
+                   op->c.q4[3].den ? PRINTQ(op->c.q4[3]) : "_");
+            break;
+        case SWS_OP_SWIZZLE:
+            av_log(log, lev, "%-20s: %d%d%d%d\n", "SWS_OP_SWIZZLE",
+                   op->swizzle.x, op->swizzle.y, op->swizzle.z, op->swizzle.w);
+            break;
+        case SWS_OP_CONVERT:
+            av_log(log, lev, "%-20s: %s -> %s%s\n", "SWS_OP_CONVERT",
+                   ff_sws_pixel_type_name(op->type),
+                   ff_sws_pixel_type_name(op->convert.to),
+                   op->convert.expand ? " (expand)" : "");
+            break;
+        case SWS_OP_DITHER:
+            av_log(log, lev, "%-20s: %dx%d matrix\n", "SWS_OP_DITHER",
+                    1 << op->dither.size_log2, 1 << op->dither.size_log2);
+            break;
+        case SWS_OP_MIN:
+            av_log(log, lev, "%-20s: x <= {%s %s %s %s}\n", "SWS_OP_MIN",
+                    op->c.q4[0].den ? PRINTQ(op->c.q4[0]) : "_",
+                    op->c.q4[1].den ? PRINTQ(op->c.q4[1]) : "_",
+                    op->c.q4[2].den ? PRINTQ(op->c.q4[2]) : "_",
+                    op->c.q4[3].den ? PRINTQ(op->c.q4[3]) : "_");
+            break;
+        case SWS_OP_MAX:
+            av_log(log, lev, "%-20s: {%s %s %s %s} <= x\n", "SWS_OP_MAX",
+                    op->c.q4[0].den ? PRINTQ(op->c.q4[0]) : "_",
+                    op->c.q4[1].den ? PRINTQ(op->c.q4[1]) : "_",
+                    op->c.q4[2].den ? PRINTQ(op->c.q4[2]) : "_",
+                    op->c.q4[3].den ? PRINTQ(op->c.q4[3]) : "_");
+            break;
+        case SWS_OP_LINEAR:
+            av_log(log, lev, "%-20s: %s [[%s %s %s %s %s] "
+                                        "[%s %s %s %s %s] "
+                                        "[%s %s %s %s %s] "
+                                        "[%s %s %s %s %s]]\n",
+                   "SWS_OP_LINEAR", describe_lin_mask(op->lin.mask),
+                   PRINTQ(op->lin.m[0][0]), PRINTQ(op->lin.m[0][1]), PRINTQ(op->lin.m[0][2]), PRINTQ(op->lin.m[0][3]), PRINTQ(op->lin.m[0][4]),
+                   PRINTQ(op->lin.m[1][0]), PRINTQ(op->lin.m[1][1]), PRINTQ(op->lin.m[1][2]), PRINTQ(op->lin.m[1][3]), PRINTQ(op->lin.m[1][4]),
+                   PRINTQ(op->lin.m[2][0]), PRINTQ(op->lin.m[2][1]), PRINTQ(op->lin.m[2][2]), PRINTQ(op->lin.m[2][3]), PRINTQ(op->lin.m[2][4]),
+                   PRINTQ(op->lin.m[3][0]), PRINTQ(op->lin.m[3][1]), PRINTQ(op->lin.m[3][2]), PRINTQ(op->lin.m[3][3]), PRINTQ(op->lin.m[3][4]));
+            break;
+        case SWS_OP_SCALE:
+            av_log(log, lev, "%-20s: * %s\n", "SWS_OP_SCALE",
+                   PRINTQ(op->c.q));
+            break;
+        case SWS_OP_TYPE_NB:
+            break;
+        }
+
+        if (op->comps.min[0].den || op->comps.min[1].den ||
+            op->comps.min[2].den || op->comps.min[3].den ||
+            op->comps.max[0].den || op->comps.max[1].den ||
+            op->comps.max[2].den || op->comps.max[3].den)
+        {
+            av_log(log, AV_LOG_TRACE, "    min: {%s, %s, %s, %s}, max: {%s, %s, %s, %s}\n",
+                PRINTQ(op->comps.min[0]), PRINTQ(op->comps.min[1]),
+                PRINTQ(op->comps.min[2]), PRINTQ(op->comps.min[3]),
+                PRINTQ(op->comps.max[0]), PRINTQ(op->comps.max[1]),
+                PRINTQ(op->comps.max[2]), PRINTQ(op->comps.max[3]));
+        }
+
+    }
+
+    av_log(log, lev, "    (X = unused, + = exact, 0 = zero)\n");
+}
+
+typedef struct SwsOpPass {
+    SwsCompiledOp comp;
+    SwsOpExec exec_base;
+    int num_blocks;
+    int tail_off_in;
+    int tail_off_out;
+    int tail_size_in;
+    int tail_size_out;
+    bool memcpy_in;
+    bool memcpy_out;
+} SwsOpPass;
+
+static void op_pass_free(void *ptr)
+{
+    SwsOpPass *p = ptr;
+    if (!p)
+        return;
+
+    if (p->comp.free)
+        p->comp.free(p->comp.priv);
+
+    av_free(p);
+}
+
+static void op_pass_setup(const SwsImg *out, const SwsImg *in, const SwsPass *pass)
+{
+    const AVPixFmtDescriptor *indesc  = av_pix_fmt_desc_get(in->fmt);
+    const AVPixFmtDescriptor *outdesc = av_pix_fmt_desc_get(out->fmt);
+
+    SwsOpPass *p = pass->priv;
+    SwsOpExec *exec = &p->exec_base;
+    const SwsCompiledOp *comp = &p->comp;
+    const int block_size = comp->block_size;
+    p->num_blocks = (pass->width + block_size - 1) / block_size;
+
+    /* Set up main loop parameters */
+    const int aligned_w  = p->num_blocks * block_size;
+    const int safe_width = (p->num_blocks - 1) * block_size;
+    const int tail_size  = pass->width - safe_width;
+    p->tail_off_in   = safe_width * exec->pixel_bits_in  >> 3;
+    p->tail_off_out  = safe_width * exec->pixel_bits_out >> 3;
+    p->tail_size_in  = tail_size  * exec->pixel_bits_in  >> 3;
+    p->tail_size_out = tail_size  * exec->pixel_bits_out >> 3;
+    p->memcpy_in     = false;
+    p->memcpy_out    = false;
+
+    for (int i = 0; i < 4 && in->data[i]; i++) {
+        const int sub_x      = (i == 1 || i == 2) ? indesc->log2_chroma_w : 0;
+        const int plane_w    = (aligned_w + sub_x) >> sub_x;
+        const int plane_pad  = (comp->over_read + sub_x) >> sub_x;
+        const int plane_size = plane_w * exec->pixel_bits_in >> 3;
+        p->memcpy_in |= plane_size + plane_pad > in->linesize[i];
+        exec->in_stride[i] = in->linesize[i];
+    }
+
+    for (int i = 0; i < 4 && out->data[i]; i++) {
+        const int sub_x      = (i == 1 || i == 2) ? outdesc->log2_chroma_w : 0;
+        const int plane_w    = (aligned_w + sub_x) >> sub_x;
+        const int plane_pad  = (comp->over_write + sub_x) >> sub_x;
+        const int plane_size = plane_w * exec->pixel_bits_out >> 3;
+        p->memcpy_out |= plane_size + plane_pad > out->linesize[i];
+        exec->out_stride[i] = out->linesize[i];
+    }
+}
+
+/* Dispatch the kernel over the last column of the image using memcpy */
+static av_always_inline void
+handle_tail(const SwsOpPass *p, SwsOpExec *exec,
+            const SwsImg *out_base, const bool copy_out,
+            const SwsImg *in_base, const bool copy_in,
+            const int y, const int y_end)
+{
+    DECLARE_ALIGNED_64(uint8_t, tmp)[2][4][sizeof(uint32_t[128])];
+
+    const SwsCompiledOp *comp = &p->comp;
+    const int block_size    = comp->block_size;
+    const int tail_size_in  = p->tail_size_in;
+    const int tail_size_out = p->tail_size_out;
+
+    SwsImg in  = ff_sws_img_shift(*in_base,  y);
+    SwsImg out = ff_sws_img_shift(*out_base, y);
+    for (int i = 0; i < 4 && in.data[i]; i++) {
+        in.data[i]  += p->tail_off_in;
+        if (copy_in) {
+            exec->in[i] = (void *) tmp[0][i];
+            exec->in_stride[i] = sizeof(tmp[0][i]);
+        } else {
+            exec->in[i] = in.data[i];
+        }
+    }
+
+    for (int i = 0; i < 4 && out.data[i]; i++) {
+        out.data[i] += p->tail_off_out;
+        if (copy_out) {
+            exec->out[i] = (void *) tmp[1][i];
+            exec->out_stride[i] = sizeof(tmp[1][i]);
+        } else {
+            exec->out[i] = out.data[i];
+        }
+    }
+
+    exec->x = (p->num_blocks - 1) * block_size;
+    for (exec->y = y; exec->y < y_end; exec->y++) {
+        if (copy_in) {
+            for (int i = 0; i < 4 && in.data[i]; i++) {
+                av_assert2(tmp[0][i] + tail_size_in < (uint8_t *) tmp[1]);
+                memcpy(tmp[0][i], in.data[i], tail_size_in);
+                in.data[i] += in.linesize[i];
+            }
+        }
+
+        comp->func(exec, comp->priv, 1);
+
+        if (copy_out) {
+            for (int i = 0; i < 4 && out.data[i]; i++) {
+                av_assert2(tmp[1][i] + tail_size_out < (uint8_t *) tmp[2]);
+                memcpy(out.data[i], tmp[1][i], tail_size_out);
+                out.data[i] += out.linesize[i];
+            }
+        }
+
+        for (int i = 0; i < 4; i++) {
+            if (!copy_in)
+                exec->in[i] += in.linesize[i];
+            if (!copy_out)
+                exec->out[i] += out.linesize[i];
+        }
+    }
+}
+
+static av_always_inline void
+op_pass_run(const SwsImg *out_base, const SwsImg *in_base,
+            const int y, const int h, const SwsPass *pass)
+{
+    const SwsOpPass *p = pass->priv;
+    const SwsCompiledOp *comp = &p->comp;
+
+    /* Fill exec metadata for this slice */
+    const SwsImg in  = ff_sws_img_shift(*in_base,  y);
+    const SwsImg out = ff_sws_img_shift(*out_base, y);
+    SwsOpExec exec = p->exec_base;
+    exec.slice_y = y;
+    exec.slice_h = h;
+    for (int i = 0; i < 4; i++) {
+        exec.in[i]  = in.data[i];
+        exec.out[i] = out.data[i];
+    }
+
+    /**
+     *  To ensure safety, we need to consider the following:
+     *
+     * 1. We can overread the input, unless this is the last line of an
+     *    unpadded buffer. All operation chains must be able to handle
+     *    arbitrary pixel input, so arbitrary overread is fine.
+     *
+     * 2. We can overwrite the output, as long as we don't write more than the
+     *    amount of pixels that fit into one linesize. So we always need to
+     *    memcpy the last column on the output side if unpadded.
+     *
+     * 3. For the last row, we also need to memcpy the remainder of the input,
+     *    to avoid reading past the end of the buffer. Note that since we know
+     *    the run() function is called on stripes of the same buffer, we don't
+     *    need to worry about this for the end of a slice.
+     */
+
+    const int last_slice  = y + h == pass->height;
+    const bool memcpy_in  = last_slice && p->memcpy_in;
+    const bool memcpy_out = p->memcpy_out;
+    const int num_blocks  = p->num_blocks;
+    const int blocks_main = num_blocks - memcpy_out;
+    const int y_end       = y + h - memcpy_in;
+
+    /* Handle main section */
+    for (exec.y = y; exec.y < y_end; exec.y++) {
+        comp->func(&exec, comp->priv, blocks_main);
+        for (int i = 0; i < 4; i++) {
+            exec.in[i]  += in.linesize[i];
+            exec.out[i] += out.linesize[i];
+        }
+    }
+
+    if (memcpy_in)
+        comp->func(&exec, comp->priv, num_blocks - 1); /* safe part of last row */
+
+    /* Handle the last column via memcpy; these take over `exec`, so call them last */
+    if (memcpy_out)
+        handle_tail(p, &exec, out_base, true, in_base, false, y, y_end);
+    if (memcpy_in)
+        handle_tail(p, &exec, out_base, memcpy_out, in_base, true, y_end, y + h);
+}
+
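+/* e.g. packed rgba (4x u8) yields 4 * 1 * 8 = 32 bits per pixel, while
+ * monow (1x u8, frac 3) yields 1 * 1 * (8 >> 3) = 1 bit per pixel */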
+static int rw_pixel_bits(const SwsOp *op)
+{
+    const int elems = op->rw.packed ? op->rw.elems : 1;
+    const int size  = ff_sws_pixel_type_size(op->type);
+    const int bits  = 8 >> op->rw.frac;
+    av_assert1(bits >= 1);
+    return elems * size * bits;
+}
+
+int ff_sws_ops_compile_backend(SwsContext *ctx, const SwsOpBackend *backend,
+                               const SwsOpList *ops, SwsCompiledOp *out)
+{
+    SwsOpList *copy, rest;
+    int ret = 0;
+
+    copy = ff_sws_op_list_duplicate(ops);
+    if (!copy)
+        return AVERROR(ENOMEM);
+
+    /* Ensure these are always set during compilation */
+    ff_sws_op_list_update_comps(copy);
+
+    /* Make an on-stack copy of the duplicated list; the backend may modify
+     * `rest` freely, so this ensures we can still properly free the
+     * duplicate afterwards */
+    rest = *copy;
+
+    ret = backend->compile(ctx, &rest, out);
+    if (ret == AVERROR(ENOTSUP)) {
+        av_log(ctx, AV_LOG_DEBUG, "Backend '%s' does not support operations:\n", backend->name);
+        ff_sws_op_list_print(ctx, AV_LOG_DEBUG, &rest);
+    } else if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to compile operations: %s\n", av_err2str(ret));
+        ff_sws_op_list_print(ctx, AV_LOG_ERROR, &rest);
+    }
+
+    ff_sws_op_list_free(&copy);
+    return ret;
+}
+
+int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out)
+{
+    for (int n = 0; ff_sws_op_backends[n]; n++) {
+        const SwsOpBackend *backend = ff_sws_op_backends[n];
+        if (ff_sws_ops_compile_backend(ctx, backend, ops, out) < 0)
+            continue;
+
+        av_log(ctx, AV_LOG_VERBOSE, "Compiled using backend '%s': "
+               "block size = %d, over-read = %d, over-write = %d\n",
+               backend->name, out->block_size, out->over_read, out->over_write);
+        return 0;
+    }
+
+    av_log(ctx, AV_LOG_WARNING, "No backend found for operations:\n");
+    ff_sws_op_list_print(ctx, AV_LOG_WARNING, ops);
+    return AVERROR(ENOTSUP);
+}
+
+int ff_sws_compile_pass(SwsGraph *graph, SwsOpList *ops, int flags, SwsFormat dst,
+                        SwsPass *input, SwsPass **output)
+{
+    SwsContext *ctx = graph->ctx;
+    SwsOpPass *p = NULL;
+    const SwsOp *read = &ops->ops[0];
+    const SwsOp *write = &ops->ops[ops->num_ops - 1];
+    SwsPass *pass;
+    int ret;
+
+    if (ops->num_ops < 2) {
+        av_log(ctx, AV_LOG_ERROR, "Need at least two operations.\n");
+        return AVERROR(EINVAL);
+    }
+
+    if (read->op != SWS_OP_READ || write->op != SWS_OP_WRITE) {
+        av_log(ctx, AV_LOG_ERROR, "First and last operations must be a read "
+               "and write, respectively.\n");
+        return AVERROR(EINVAL);
+    }
+
+    if (flags & SWS_OP_FLAG_OPTIMIZE)
+        RET(ff_sws_op_list_optimize(ops));
+    else
+        ff_sws_op_list_update_comps(ops);
+
+    p = av_mallocz(sizeof(*p));
+    if (!p)
+        return AVERROR(ENOMEM);
+
+    p->exec_base = (SwsOpExec) {
+        .width  = dst.width,
+        .height = dst.height,
+        .pixel_bits_in  = rw_pixel_bits(read),
+        .pixel_bits_out = rw_pixel_bits(write),
+    };
+
+    ret = ff_sws_ops_compile(ctx, ops, &p->comp);
+    if (ret < 0)
+        goto fail;
+
+    pass = ff_sws_graph_add_pass(graph, dst.format, dst.width, dst.height, input,
+                                 1, p, op_pass_run);
+    if (!pass) {
+        ret = AVERROR(ENOMEM);
+        goto fail;
+    }
+    pass->setup = op_pass_setup;
+    pass->free  = op_pass_free;
+
+    *output = pass;
+    return 0;
+
+fail:
+    op_pass_free(p);
+    return ret;
+}
diff --git a/libswscale/ops.h b/libswscale/ops.h
new file mode 100644
index 0000000000..c9c5706cbf
--- /dev/null
+++ b/libswscale/ops.h
@@ -0,0 +1,265 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_H
+#define SWSCALE_OPS_H
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdalign.h>
+
+#include "graph.h"
+
+typedef enum SwsPixelType {
+    SWS_PIXEL_NONE = 0,
+    SWS_PIXEL_U8,
+    SWS_PIXEL_U16,
+    SWS_PIXEL_U32,
+    SWS_PIXEL_F32,
+    SWS_PIXEL_TYPE_NB
+} SwsPixelType;
+
+const char *ff_sws_pixel_type_name(SwsPixelType type);
+int ff_sws_pixel_type_size(SwsPixelType type) av_const;
+bool ff_sws_pixel_type_is_int(SwsPixelType type) av_const;
+SwsPixelType ff_sws_pixel_type_to_uint(SwsPixelType type) av_const;
+
+typedef enum SwsOpType {
+    SWS_OP_INVALID = 0,
+
+    /* Input/output handling */
+    SWS_OP_READ,            /* gather raw pixels from planes */
+    SWS_OP_WRITE,           /* write raw pixels to planes */
+    SWS_OP_SWAP_BYTES,      /* swap byte order (for differing endianness) */
+    SWS_OP_UNPACK,          /* split tightly packed data into components */
+    SWS_OP_PACK,            /* compress components into tightly packed data */
+
+    /* Pixel manipulation */
+    SWS_OP_CLEAR,           /* clear pixel values */
+    SWS_OP_LSHIFT,          /* logical left shift of raw pixel values by (u8) */
+    SWS_OP_RSHIFT,          /* right shift of raw pixel values by (u8) */
+    SWS_OP_SWIZZLE,         /* rearrange channel order, or duplicate channels */
+    SWS_OP_CONVERT,         /* convert (cast) between formats */
+    SWS_OP_DITHER,          /* add dithering noise */
+
+    /* Arithmetic operations */
+    SWS_OP_LINEAR,          /* generalized linear affine transform */
+    SWS_OP_SCALE,           /* multiplication by scalar (q) */
+    SWS_OP_MIN,             /* numeric minimum (q4) */
+    SWS_OP_MAX,             /* numeric maximum (q4) */
+
+    SWS_OP_TYPE_NB,
+} SwsOpType;
+
+enum SwsCompFlags {
+    SWS_COMP_GARBAGE = 1 << 0, /* contents are undefined / garbage data */
+    SWS_COMP_EXACT   = 1 << 1, /* value is an in-range, exact, integer */
+    SWS_COMP_ZERO    = 1 << 2, /* known to be a constant zero */
+};
+
+typedef union SwsConst {
+    /* Generic constant value */
+    AVRational q;
+    AVRational q4[4];
+    unsigned u;
+} SwsConst;
+
+typedef struct SwsComps {
+    unsigned flags[4]; /* knowledge about (output) component contents */
+    bool unused[4];    /* which input components are definitely unused */
+
+    /* Keeps track of the known possible value range, or {0, 0} for undefined
+     * or (unknown range) floating point inputs */
+    AVRational min[4], max[4];
+} SwsComps;
+
+typedef struct SwsReadWriteOp {
+    /* Note: Unread pixel data is explicitly cleared to {0} for sanity */
+
+    int elems;   /* number of elements (of type `op.type`) to read/write */
+    bool packed; /* read multiple elements from a single plane */
+    int frac;    /* fractional pixel step factor (log2) */
+
+    /** Examples:
+     *    rgba      = 4x u8 packed
+     *    yuv444p   = 3x u8
+     *    rgb565    = 1x u16   <- use SWS_OP_UNPACK to unpack
+     *    monow     = 1x u8 (frac 3)
+     *    rgb4      = 1x u8 (frac 1)
+     */
+} SwsReadWriteOp;
+
+typedef struct SwsPackOp {
+    int pattern[4]; /* bit depth pattern, from MSB to LSB */
+} SwsPackOp;
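+
+/* e.g. rgb565, with R in the highest bits, corresponds to the pattern
+ * {5, 6, 5, 0}; a trailing 0 marks an absent fourth component */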
+
+typedef struct SwsSwizzleOp {
+    /**
+     * Input component for each output component:
+     *   Out[x] := In[swizzle.in[x]]
+     */
+    union {
+        uint32_t mask;
+        uint8_t in[4];
+        struct { uint8_t x, y, z, w; };
+    };
+} SwsSwizzleOp;
+
+#define SWS_SWIZZLE(X,Y,Z,W) ((SwsSwizzleOp) { .in = {X, Y, Z, W} })
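+
+/* e.g. mapping BGRA input to RGBA output corresponds to
+ * SWS_SWIZZLE(2, 1, 0, 3), i.e. Out[0] := In[2] (R), ..., Out[3] := In[3] (A) */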
+
+typedef struct SwsConvertOp {
+    SwsPixelType to; /* type of pixel to convert to */
+    bool expand; /* if true, integers are expanded to the full range */
+} SwsConvertOp;
+
+typedef struct SwsDitherOp {
+    AVRational *matrix; /* tightly packed dither matrix (refstruct) */
+    int size_log2; /* size (in bits) of the dither matrix */
+} SwsDitherOp;
+
+typedef struct SwsLinearOp {
+    /**
+     * Generalized 5x5 affine transformation:
+     *   [ Out.x ] = [ A B C D E ]
+     *   [ Out.y ] = [ F G H I J ] * [ x y z w 1 ]
+     *   [ Out.z ] = [ K L M N O ]
+     *   [ Out.w ] = [ P Q R S T ]
+     *
+     * The mask keeps track of which components differ from an identity matrix.
+     * There may be more efficient implementations of particular subsets; for
+     * example, the common subset of {A, E, G, J, M, O} can be implemented
+     * with just three fused multiply-add operations.
+     */
+    AVRational m[4][5];
+    uint32_t mask; /* m[i][j] <-> 1 << (5 * i + j) */
+} SwsLinearOp;
+
+#define SWS_MASK(I, J)  (1 << (5 * (I) + (J)))
+#define SWS_MASK_OFF(I) SWS_MASK(I, 4)
+#define SWS_MASK_ROW(I) (0b11111 << (5 * (I)))
+#define SWS_MASK_COL(J) (0b1000010000100001 << (J))
+
+enum {
+    SWS_MASK_ALL   = (1 << 20) - 1,
+    SWS_MASK_LUMA  = SWS_MASK(0, 0) | SWS_MASK_OFF(0),
+    SWS_MASK_ALPHA = SWS_MASK(3, 3) | SWS_MASK_OFF(3),
+
+    SWS_MASK_DIAG3 = SWS_MASK(0, 0)  | SWS_MASK(1, 1)  | SWS_MASK(2, 2),
+    SWS_MASK_OFF3  = SWS_MASK_OFF(0) | SWS_MASK_OFF(1) | SWS_MASK_OFF(2),
+    SWS_MASK_MAT3  = SWS_MASK(0, 0)  | SWS_MASK(0, 1)  | SWS_MASK(0, 2) |
+                     SWS_MASK(1, 0)  | SWS_MASK(1, 1)  | SWS_MASK(1, 2) |
+                     SWS_MASK(2, 0)  | SWS_MASK(2, 1)  | SWS_MASK(2, 2),
+
+    SWS_MASK_DIAG4 = SWS_MASK_DIAG3  | SWS_MASK(3, 3),
+    SWS_MASK_OFF4  = SWS_MASK_OFF3   | SWS_MASK_OFF(3),
+    SWS_MASK_MAT4  = SWS_MASK_ALL & ~SWS_MASK_OFF4,
+};
+
+/* Helper function to compute the correct mask */
+uint32_t ff_sws_linear_mask(SwsLinearOp);
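+
+/* Example (illustrative): a scale-and-offset of the three color channels,
+ * i.e. the {A, E, G, J, M, O} subset mentioned above, maps to the mask
+ * SWS_MASK_DIAG3 | SWS_MASK_OFF3:
+ *
+ *   SwsLinearOp lin = { .m = {
+ *       { {2,1}, {0,1}, {0,1}, {0,1}, {1,1} },
+ *       { {0,1}, {2,1}, {0,1}, {0,1}, {1,1} },
+ *       { {0,1}, {0,1}, {2,1}, {0,1}, {1,1} },
+ *       { {0,1}, {0,1}, {0,1}, {1,1}, {0,1} },
+ *   }};
+ *   lin.mask = ff_sws_linear_mask(lin); // == SWS_MASK_DIAG3 | SWS_MASK_OFF3
+ */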
+
+typedef struct SwsOp {
+    SwsOpType op;      /* operation to perform */
+    SwsPixelType type; /* pixel type to operate on */
+    union {
+        SwsReadWriteOp  rw;
+        SwsPackOp       pack;
+        SwsSwizzleOp    swizzle;
+        SwsConvertOp    convert;
+        SwsDitherOp     dither;
+        SwsLinearOp     lin;
+        SwsConst        c;
+    };
+
+    /* For internal use inside ff_sws_*() functions */
+    SwsComps comps;
+} SwsOp;
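+
+/* Example (illustrative): a conversion of all components from U8 to F32
+ * would be expressed as:
+ *
+ *   SwsOp op = {
+ *       .op   = SWS_OP_CONVERT,
+ *       .type = SWS_PIXEL_U8,
+ *       .convert.to = SWS_PIXEL_F32,
+ *   };
+ */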
+
+/**
+ * Frees any allocations associated with an SwsOp and sets it to {0}.
+ */
+void ff_sws_op_uninit(SwsOp *op);
+
+/**
+ * Apply an operation to an AVRational. No-op for read/write operations.
+ */
+void ff_sws_apply_op_q(const SwsOp *op, AVRational x[4]);
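+
+/* e.g. (illustrative) applying { .op = SWS_OP_SCALE, .c.q = {1, 2} } to
+ * x = {1, 1, 1, 1} halves each component, giving {1/2, 1/2, 1/2, 1/2} */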
+
+/**
+ * Helper struct for representing a list of operations.
+ */
+typedef struct SwsOpList {
+    SwsOp *ops;
+    int num_ops;
+} SwsOpList;
+
+SwsOpList *ff_sws_op_list_alloc(void);
+void ff_sws_op_list_free(SwsOpList **ops);
+
+/**
+ * Returns a duplicate of `ops`, or NULL on OOM.
+ */
+SwsOpList *ff_sws_op_list_duplicate(const SwsOpList *ops);
+
+/**
+ * Returns the size of the largest pixel type used in `ops`.
+ */
+int ff_sws_op_list_max_size(const SwsOpList *ops);
+
+/**
+ * These will take over ownership of `op` and set it to {0}, even on failure.
+ */
+int ff_sws_op_list_append(SwsOpList *ops, SwsOp *op);
+int ff_sws_op_list_insert_at(SwsOpList *ops, int index, SwsOp *op);
+
+void ff_sws_op_list_remove_at(SwsOpList *ops, int index, int count);
+
+/**
+ * Print out the contents of an operation list.
+ */
+void ff_sws_op_list_print(void *log_ctx, int log_level, const SwsOpList *ops);
+
+/**
+ * Infer + propagate known information about components. Called automatically
+ * when needed by the optimizer and compiler.
+ */
+void ff_sws_op_list_update_comps(SwsOpList *ops);
+
+/**
+ * Fuse compatible and eliminate redundant operations, as well as replacing
+ * some operations with more efficient alternatives.
+ */
+int ff_sws_op_list_optimize(SwsOpList *ops);
+
+enum SwsOpCompileFlags {
+    /* Automatically optimize the operations when compiling */
+    SWS_OP_FLAG_OPTIMIZE = 1 << 0,
+};
+
+/**
+ * Resolves an operation list to a graph pass. The first and last operations
+ * must be a read and a write, respectively. `flags` is a bitmask of
+ * SwsOpCompileFlags.
+ *
+ * Note: `ops` may be modified by this function.
+ */
+int ff_sws_compile_pass(SwsGraph *graph, SwsOpList *ops, int flags, SwsFormat dst,
+                        SwsPass *input, SwsPass **output);
+
+#endif
diff --git a/libswscale/ops_internal.h b/libswscale/ops_internal.h
new file mode 100644
index 0000000000..ac0319321e
--- /dev/null
+++ b/libswscale/ops_internal.h
@@ -0,0 +1,103 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_INTERNAL_H
+#define SWSCALE_OPS_INTERNAL_H
+
+#include "libavutil/mem_internal.h"
+
+#include "ops.h"
+
+/**
+ * Global execution context for all compiled functions.
+ *
+ * Note: This struct is hard-coded in assembly, so do not change the layout
+ * without updating the corresponding assembly definitions.
+ */
+typedef struct SwsOpExec {
+    /* The data pointers point to the first pixel to process */
+    const uint8_t *in[4];
+    uint8_t *out[4];
+
+    /* Separation between lines in bytes */
+    ptrdiff_t in_stride[4];
+    ptrdiff_t out_stride[4];
+
+    /* Extra metadata, may or may not be useful */
+    int32_t x, y;               /* Starting pixel coordinates */
+    int32_t width, height;      /* Overall image dimensions */
+    int32_t slice_y, slice_h;   /* Start and height of current slice */
+    int32_t pixel_bits_in;      /* Bits per input pixel */
+    int32_t pixel_bits_out;     /* Bits per output pixel */
+} SwsOpExec;
+
+static_assert(sizeof(SwsOpExec) == 16 * sizeof(void *) + 8 * sizeof(int32_t),
+              "SwsOpExec layout mismatch");
+
+/* Process a given number of pixel blocks */
+typedef void (*SwsOpFunc)(const SwsOpExec *exec, const void *priv, int blocks);
+
+#define SWS_DECL_FUNC(NAME) \
+    void NAME(const SwsOpExec *, const void *, int)
+
+typedef struct SwsCompiledOp {
+    SwsOpFunc func;
+
+    int block_size; /* number of pixels processed per iteration */
+    int over_read;  /* implementation over-reads input by this many bytes */
+    int over_write; /* implementation over-writes output by this many bytes */
+
+    /* Arbitrary private data */
+    void *priv;
+    void (*free)(void *priv);
+} SwsCompiledOp;
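+
+/* Example (illustrative): a hypothetical kernel that processes 16 pixels
+ * per call but loads its input in aligned 64-byte chunks would report
+ * block_size = 16 together with an over_read covering the bytes it may
+ * touch past the end of the block */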
+
+typedef struct SwsOpBackend {
+    const char *name; /* Descriptive name for this backend */
+
+    /**
+     * Compile an operation list to an implementation chain. May modify `ops`
+     * freely; the original list will be freed automatically by the caller.
+     *
+     * Returns 0 or a negative error code.
+     */
+    int (*compile)(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out);
+} SwsOpBackend;
+
+/* List of all backends, terminated by NULL */
+extern const SwsOpBackend *const ff_sws_op_backends[];
+extern const int ff_sws_num_op_backends; /* excludes terminating NULL */
+
+/**
+ * Attempt to compile a list of operations using a specific backend.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_ops_compile_backend(SwsContext *ctx, const SwsOpBackend *backend,
+                               const SwsOpList *ops, SwsCompiledOp *out);
+
+/**
+ * Compile a list of operations using the best available backend.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out);
+
+#endif
diff --git a/libswscale/ops_optimizer.c b/libswscale/ops_optimizer.c
new file mode 100644
index 0000000000..9f509085ba
--- /dev/null
+++ b/libswscale/ops_optimizer.c
@@ -0,0 +1,810 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/mem.h"
+#include "libavutil/rational.h"
+
+#include "ops.h"
+#include "ops_internal.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+
+#define RET(x)                                                                 \
+    do {                                                                       \
+        if ((ret = (x)) < 0)                                                   \
+            return ret;                                                        \
+    } while (0)
+
+/* Returns true for operations that are independent per channel. These can
+ * usually be commuted freely with other such operations. */
+static bool op_type_is_independent(SwsOpType op)
+{
+    switch (op) {
+    case SWS_OP_SWAP_BYTES:
+    case SWS_OP_LSHIFT:
+    case SWS_OP_RSHIFT:
+    case SWS_OP_CONVERT:
+    case SWS_OP_DITHER:
+    case SWS_OP_MIN:
+    case SWS_OP_MAX:
+    case SWS_OP_SCALE:
+        return true;
+    case SWS_OP_INVALID:
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+    case SWS_OP_SWIZZLE:
+    case SWS_OP_CLEAR:
+    case SWS_OP_LINEAR:
+    case SWS_OP_PACK:
+    case SWS_OP_UNPACK:
+        return false;
+    case SWS_OP_TYPE_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid operation type!");
+    return false;
+}
+
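+/* e.g. expand_factor(SWS_PIXEL_U8, SWS_PIXEL_U16) == 0x0101, so expanding
+ * the u8 value 0xFF yields the full-range u16 value 0xFFFF; similarly,
+ * u8 -> u32 multiplies by 0x01010101 */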
+static AVRational expand_factor(SwsPixelType from, SwsPixelType to)
+{
+    const int src = ff_sws_pixel_type_size(from);
+    const int dst = ff_sws_pixel_type_size(to);
+    int scale = 0;
+    for (int i = 0; i < dst / src; i++)
+        scale = scale << src * 8 | 1;
+    return Q(scale);
+}
+
+/* merge_comp_flags() forms a monoid with flags_identity as the identity element */
+static const unsigned flags_identity = SWS_COMP_ZERO | SWS_COMP_EXACT;
+static unsigned merge_comp_flags(unsigned a, unsigned b)
+{
+    const unsigned flags_or  = SWS_COMP_GARBAGE;
+    const unsigned flags_and = SWS_COMP_ZERO | SWS_COMP_EXACT;
+    return ((a & b) & flags_and) | ((a | b) & flags_or);
+}
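+
+/* e.g. merging (SWS_COMP_ZERO | SWS_COMP_EXACT) with SWS_COMP_EXACT keeps
+ * only SWS_COMP_EXACT, while merging anything with SWS_COMP_GARBAGE yields
+ * SWS_COMP_GARBAGE */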
+
+/* Infer + propagate known information about components */
+void ff_sws_op_list_update_comps(SwsOpList *ops)
+{
+    SwsComps next = { .unused = {true, true, true, true} };
+    SwsComps prev = { .flags = {
+        SWS_COMP_GARBAGE, SWS_COMP_GARBAGE, SWS_COMP_GARBAGE, SWS_COMP_GARBAGE,
+    }};
+
+    /* Forwards pass, propagates knowledge about the incoming pixel values */
+    for (int n = 0; n < ops->num_ops; n++) {
+        SwsOp *op = &ops->ops[n];
+
+        /* Prefill min/max values automatically; may have to be fixed in
+         * special cases */
+        memcpy(op->comps.min, prev.min, sizeof(prev.min));
+        memcpy(op->comps.max, prev.max, sizeof(prev.max));
+        ff_sws_apply_op_q(op, op->comps.min);
+        ff_sws_apply_op_q(op, op->comps.max);
+
+        switch (op->op) {
+        case SWS_OP_READ:
+            for (int i = 0; i < op->rw.elems; i++) {
+                if (ff_sws_pixel_type_is_int(op->type)) {
+                    const int size = ff_sws_pixel_type_size(op->type);
+                    const uint64_t max_val = (1ULL << 8 * size) - 1;
+                    op->comps.flags[i] |= SWS_COMP_EXACT;
+                    op->comps.min[i] = Q(0);
+                    op->comps.max[i] = Q(max_val);
+                }
+            }
+            for (int i = op->rw.elems; i < 4; i++)
+                op->comps.flags[i] |= prev.flags[i];
+            break;
+        case SWS_OP_WRITE:
+            for (int i = 0; i < op->rw.elems; i++)
+                av_assert1(!(prev.flags[i] & SWS_COMP_GARBAGE));
+            /* fall through */
+        case SWS_OP_SWAP_BYTES:
+        case SWS_OP_LSHIFT:
+        case SWS_OP_RSHIFT:
+        case SWS_OP_MIN:
+        case SWS_OP_MAX:
+            /* Linearly propagate flags per component */
+            for (int i = 0; i < 4; i++)
+                op->comps.flags[i] |= prev.flags[i];
+            break;
+        case SWS_OP_DITHER:
+            /* Strip zero flag because of the nonzero dithering offset */
+            for (int i = 0; i < 4; i++)
+                op->comps.flags[i] |= prev.flags[i] & ~SWS_COMP_ZERO;
+            break;
+        case SWS_OP_UNPACK:
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    op->comps.flags[i] |= prev.flags[0];
+                else
+                    op->comps.flags[i] = SWS_COMP_GARBAGE;
+            }
+            break;
+        case SWS_OP_PACK: {
+            unsigned flags = flags_identity;
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    flags = merge_comp_flags(flags, prev.flags[i]);
+                if (i > 0) /* clear remaining comps for sanity */
+                    op->comps.flags[i] = SWS_COMP_GARBAGE;
+            }
+            op->comps.flags[0] |= flags;
+            break;
+        }
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (op->c.q4[i].den) {
+                    if (op->c.q4[i].num == 0)
+                        op->comps.flags[i] |= SWS_COMP_ZERO | SWS_COMP_EXACT;
+                    if (op->c.q4[i].den == 1)
+                        op->comps.flags[i] |= SWS_COMP_EXACT;
+                }
+                else
+                    op->comps.flags[i] |= prev.flags[i];
+            }
+            break;
+        case SWS_OP_SWIZZLE:
+            for (int i = 0; i < 4; i++)
+                op->comps.flags[i] |= prev.flags[op->swizzle.in[i]];
+            break;
+        case SWS_OP_CONVERT:
+            for (int i = 0; i < 4; i++) {
+                op->comps.flags[i] |= prev.flags[i];
+                if (ff_sws_pixel_type_is_int(op->convert.to))
+                    op->comps.flags[i] |= SWS_COMP_EXACT;
+            }
+            break;
+        case SWS_OP_LINEAR:
+            for (int i = 0; i < 4; i++) {
+                unsigned flags = flags_identity;
+                AVRational min = Q(0), max = Q(0);
+                for (int j = 0; j < 4; j++) {
+                    const AVRational k = op->lin.m[i][j];
+                    AVRational mink = av_mul_q(prev.min[j], k);
+                    AVRational maxk = av_mul_q(prev.max[j], k);
+                    if (k.num) {
+                        flags = merge_comp_flags(flags, prev.flags[j]);
+                        if (k.den != 1) /* fractional coefficient */
+                            flags &= ~SWS_COMP_EXACT;
+                        if (k.num < 0)
+                            FFSWAP(AVRational, mink, maxk);
+                        min = av_add_q(min, mink);
+                        max = av_add_q(max, maxk);
+                    }
+                }
+                if (op->lin.m[i][4].num) { /* nonzero offset */
+                    flags &= ~SWS_COMP_ZERO;
+                    if (op->lin.m[i][4].den != 1) /* fractional offset */
+                        flags &= ~SWS_COMP_EXACT;
+                    min = av_add_q(min, op->lin.m[i][4]);
+                    max = av_add_q(max, op->lin.m[i][4]);
+                }
+                op->comps.flags[i] |= flags;
+                op->comps.min[i] = min;
+                op->comps.max[i] = max;
+            }
+            break;
+        case SWS_OP_SCALE:
+            for (int i = 0; i < 4; i++) {
+                op->comps.flags[i] |= prev.flags[i];
+                if (op->c.q.den != 1) /* fractional scale */
+                    op->comps.flags[i] &= ~SWS_COMP_EXACT;
+                if (op->c.q.num < 0)
+                    FFSWAP(AVRational, op->comps.min[i], op->comps.max[i]);
+            }
+            break;
+
+        case SWS_OP_INVALID:
+        case SWS_OP_TYPE_NB:
+            av_assert0(!"Invalid operation type!");
+        }
+
+        prev = op->comps;
+    }
+
+    /* Backwards pass, solves for component dependencies */
+    for (int n = ops->num_ops - 1; n >= 0; n--) {
+        SwsOp *op = &ops->ops[n];
+
+        switch (op->op) {
+        case SWS_OP_READ:
+        case SWS_OP_WRITE:
+            for (int i = 0; i < op->rw.elems; i++)
+                op->comps.unused[i] = op->op == SWS_OP_READ;
+            for (int i = op->rw.elems; i < 4; i++)
+                op->comps.unused[i] |= next.unused[i];
+            break;
+        case SWS_OP_SWAP_BYTES:
+        case SWS_OP_LSHIFT:
+        case SWS_OP_RSHIFT:
+        case SWS_OP_CONVERT:
+        case SWS_OP_DITHER:
+        case SWS_OP_MIN:
+        case SWS_OP_MAX:
+        case SWS_OP_SCALE:
+            for (int i = 0; i < 4; i++)
+                op->comps.unused[i] |= next.unused[i];
+            break;
+        case SWS_OP_UNPACK: {
+            bool unused = true;
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    unused &= next.unused[i];
+                op->comps.unused[i] |= i > 0;
+            }
+            op->comps.unused[0] = unused;
+            break;
+        }
+        case SWS_OP_PACK:
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    op->comps.unused[i] |= next.unused[0];
+                else
+                    op->comps.unused[i] = true;
+            }
+            break;
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (op->c.q4[i].den)
+                    op->comps.unused[i] = true;
+                else
+                    op->comps.unused[i] |= next.unused[i];
+            }
+            break;
+        case SWS_OP_SWIZZLE: {
+            bool unused[4] = { true, true, true, true };
+            for (int i = 0; i < 4; i++)
+                unused[op->swizzle.in[i]] &= next.unused[i];
+            for (int i = 0; i < 4; i++)
+                op->comps.unused[i] = unused[i];
+            break;
+        }
+        case SWS_OP_LINEAR:
+            for (int j = 0; j < 4; j++) {
+                bool unused = true;
+                for (int i = 0; i < 4; i++) {
+                    if (op->lin.m[i][j].num)
+                        unused &= next.unused[i];
+                }
+                op->comps.unused[j] = unused;
+            }
+            break;
+        }
+
+        next = op->comps;
+    }
+}
+
+/* returns log2(x) only if x is a power of two, or 0 otherwise */
+static int exact_log2(const int x)
+{
+    int p;
+    if (x <= 0)
+        return 0;
+    p = av_log2(x);
+    return (1 << p) == x ? p : 0;
+}
+
+static int exact_log2_q(const AVRational x)
+{
+    if (x.den == 1)
+        return exact_log2(x.num);
+    else if (x.num == 1)
+        return -exact_log2(x.den);
+    else
+        return 0;
+}
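+
+/* e.g. exact_log2_q((AVRational) {16, 1}) ==  4
+ *      exact_log2_q((AVRational) { 1, 8}) == -3
+ *      exact_log2_q((AVRational) { 3, 1}) ==  0 (not a power of two) */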
+
+/**
+ * If a linear operation can be reduced to a scalar multiplication, returns
+ * true and sets `out_scale` to the corresponding scaling factor; returns
+ * false otherwise.
+ */
+static bool extract_scalar(const SwsLinearOp *c, SwsComps prev, SwsComps next,
+                           SwsConst *out_scale)
+{
+    SwsConst scale = {0};
+
+    /* There are components not on the main diagonal */
+    if (c->mask & ~SWS_MASK_DIAG4)
+        return false;
+
+    for (int i = 0; i < 4; i++) {
+        const AVRational s = c->m[i][i];
+        if ((prev.flags[i] & SWS_COMP_ZERO) || next.unused[i])
+            continue;
+        if (scale.q.den && av_cmp_q(s, scale.q))
+            return false;
+        scale.q = s;
+    }
+
+    if (scale.q.den)
+        *out_scale = scale;
+    return scale.q.den;
+}
+
+/* Extracts an integer clear operation (subset) from the given linear op. */
+static bool extract_constant_rows(SwsLinearOp *c, SwsComps prev,
+                                  SwsConst *out_clear)
+{
+    SwsConst clear = {0};
+    bool ret = false;
+
+    for (int i = 0; i < 4; i++) {
+        bool const_row = c->m[i][4].den == 1; /* offset is integer */
+        for (int j = 0; j < 4; j++) {
+            const_row &= c->m[i][j].num == 0 || /* scalar is zero */
+                         (prev.flags[j] & SWS_COMP_ZERO); /* input is zero */
+        }
+        if (const_row && (c->mask & SWS_MASK_ROW(i))) {
+            clear.q4[i] = c->m[i][4];
+            for (int j = 0; j < 5; j++)
+                c->m[i][j] = Q(i == j);
+            c->mask &= ~SWS_MASK_ROW(i);
+            ret = true;
+        }
+    }
+
+    if (ret)
+        *out_clear = clear;
+    return ret;
+}
+
+/* Unswizzle a linear operation by aligning single-input rows with
+ * their corresponding diagonal */
+static bool extract_swizzle(SwsLinearOp *op, SwsComps prev, SwsSwizzleOp *out_swiz)
+{
+    SwsSwizzleOp swiz = SWS_SWIZZLE(0, 1, 2, 3);
+    SwsLinearOp c = *op;
+
+    for (int i = 0; i < 4; i++) {
+        int idx = -1;
+        for (int j = 0; j < 4; j++) {
+            if (!c.m[i][j].num || (prev.flags[j] & SWS_COMP_ZERO))
+                continue;
+            if (idx >= 0)
+                return false; /* multiple inputs */
+            idx = j;
+        }
+
+        if (idx >= 0 && idx != i) {
+            /* Move coefficient to the diagonal */
+            c.m[i][i] = c.m[i][idx];
+            c.m[i][idx] = Q(0);
+            swiz.in[i] = idx;
+        }
+    }
+
+    if (swiz.mask == SWS_SWIZZLE(0, 1, 2, 3).mask)
+        return false; /* no swizzle was identified */
+
+    c.mask = ff_sws_linear_mask(c);
+    *out_swiz = swiz;
+    *op = c;
+    return true;
+}
+
+static void op_copy_flags(SwsOp *op, const SwsOp *op2)
+{
+    for (int i = 0; i < 4; i++)
+        op->comps.flags[i] = op2->comps.flags[i];
+}
+
+/* Should only be used on ops that commute with each other, and only after
+ * applying the necessary adjustments
+ */
+static void swap_ops(SwsOp *op, SwsOp *next)
+{
+    /* Clear all inferred flags */
+    op->comps = next->comps = (SwsComps) {0};
+    FFSWAP(SwsOp, *op, *next);
+}
+
+int ff_sws_op_list_optimize(SwsOpList *ops)
+{
+    int prev_num_ops, ret;
+    bool progress;
+
+    do {
+        prev_num_ops = ops->num_ops;
+        progress = false;
+
+        ff_sws_op_list_update_comps(ops);
+
+        for (int n = 0; n < ops->num_ops;) {
+            SwsOp dummy = {0};
+            SwsOp *op = &ops->ops[n];
+            SwsOp *prev = n ? &ops->ops[n - 1] : &dummy;
+            SwsOp *next = n + 1 < ops->num_ops ? &ops->ops[n + 1] : &dummy;
+
+            /* common helper variables */
+            bool changed = false;
+            bool noop = true;
+
+            switch (op->op) {
+            case SWS_OP_READ:
+                /* Optimized further into refcopy / memcpy */
+                if (next->op == SWS_OP_WRITE &&
+                    next->rw.elems == op->rw.elems &&
+                    next->rw.packed == op->rw.packed &&
+                    next->rw.frac == op->rw.frac)
+                {
+                    ff_sws_op_list_remove_at(ops, n, 2);
+                    av_assert1(ops->num_ops == 0);
+                    return 0;
+                }
+
+                /* Skip reading extra unneeded components */
+                if (!op->rw.packed) {
+                    int needed = op->rw.elems;
+                    while (needed > 0 && next->comps.unused[needed - 1])
+                        needed--;
+                    if (op->rw.elems != needed) {
+                        op->rw.elems = needed;
+                        op->rw.packed &= op->rw.elems > 1;
+                        progress = true;
+                        continue;
+                    }
+                }
+                break;
+
+            case SWS_OP_SWAP_BYTES:
+                /* Redundant (double) swap */
+                if (next->op == SWS_OP_SWAP_BYTES) {
+                    ff_sws_op_list_remove_at(ops, n, 2);
+                    continue;
+                }
+                break;
+
+            case SWS_OP_UNPACK:
+                /* Redundant unpack+pack */
+                if (next->op == SWS_OP_PACK && next->type == op->type &&
+                    next->pack.pattern[0] == op->pack.pattern[0] &&
+                    next->pack.pattern[1] == op->pack.pattern[1] &&
+                    next->pack.pattern[2] == op->pack.pattern[2] &&
+                    next->pack.pattern[3] == op->pack.pattern[3])
+                {
+                    ff_sws_op_list_remove_at(ops, n, 2);
+                    continue;
+                }
+
+                /* Skip unpacking components that are not used */
+                for (int i = 3; i > 0 && next->comps.unused[i]; i--)
+                    op->pack.pattern[i] = 0;
+                break;
+
+            case SWS_OP_PACK:
+                /* Skip packing known-to-be-zero components */
+                for (int i = 3; i > 0; i--) {
+                    if (!(prev->comps.flags[i] & SWS_COMP_ZERO))
+                        break;
+                    op->pack.pattern[i] = 0;
+                }
+                break;
+
+            case SWS_OP_LSHIFT:
+            case SWS_OP_RSHIFT:
+                /* Two shifts in the same direction */
+                if (next->op == op->op) {
+                    op->c.u += next->c.u;
+                    ff_sws_op_list_remove_at(ops, n + 1, 1);
+                    continue;
+                }
+
+                /* No-op shift */
+                if (!op->c.u) {
+                    ff_sws_op_list_remove_at(ops, n, 1);
+                    continue;
+                }
+                break;
+
+            case SWS_OP_CLEAR:
+                for (int i = 0; i < 4; i++) {
+                    if (!op->c.q4[i].den)
+                        continue;
+
+                    if ((prev->comps.flags[i] & SWS_COMP_ZERO) &&
+                        !(prev->comps.flags[i] & SWS_COMP_GARBAGE) &&
+                        op->c.q4[i].num == 0)
+                    {
+                        /* Redundant clear-to-zero of zero component */
+                        op->c.q4[i].den = 0;
+                    } else if (next->comps.unused[i]) {
+                        /* Unnecessary clear of unused component */
+                        op->c.q4[i] = (AVRational) {0, 0};
+                    } else if (op->c.q4[i].den) {
+                        noop = false;
+                    }
+                }
+
+                if (noop) {
+                    ff_sws_op_list_remove_at(ops, n, 1);
+                    continue;
+                }
+
+                /* Transitive clear */
+                if (next->op == SWS_OP_CLEAR) {
+                    for (int i = 0; i < 4; i++) {
+                        if (next->c.q4[i].den)
+                            op->c.q4[i] = next->c.q4[i];
+                    }
+                    ff_sws_op_list_remove_at(ops, n + 1, 1);
+                    continue;
+                }
+
+                /* Prefer to clear as late as possible, to avoid doing
+                 * redundant work */
+                if ((op_type_is_independent(next->op) && next->op != SWS_OP_SWAP_BYTES) ||
+                    next->op == SWS_OP_SWIZZLE)
+                {
+                    if (next->op == SWS_OP_CONVERT)
+                        op->type = next->convert.to;
+                    ff_sws_apply_op_q(next, op->c.q4);
+                    swap_ops(op, next);
+                    progress = true;
+                    continue;
+                }
+                break;
+
+            case SWS_OP_SWIZZLE: {
+                bool seen[4] = {0};
+                bool has_duplicates = false;
+                for (int i = 0; i < 4; i++) {
+                    if (next->comps.unused[i])
+                        continue;
+                    if (op->swizzle.in[i] != i)
+                        noop = false;
+                    has_duplicates |= seen[op->swizzle.in[i]];
+                    seen[op->swizzle.in[i]] = true;
+                }
+
+                /* Identity swizzle */
+                if (noop) {
+                    ff_sws_op_list_remove_at(ops, n, 1);
+                    continue;
+                }
+
+                /* Transitive swizzle */
+                if (next->op == SWS_OP_SWIZZLE) {
+                    const SwsSwizzleOp orig = op->swizzle;
+                    for (int i = 0; i < 4; i++)
+                        op->swizzle.in[i] = orig.in[next->swizzle.in[i]];
+                    op_copy_flags(op, next);
+                    ff_sws_op_list_remove_at(ops, n + 1, 1);
+                    continue;
+                }
+
+                /* Try to push swizzles with duplicates towards the output */
+                if (has_duplicates && op_type_is_independent(next->op)) {
+                    if (next->op == SWS_OP_CONVERT)
+                        op->type = next->convert.to;
+                    if (next->op == SWS_OP_MIN || next->op == SWS_OP_MAX) {
+                        /* Un-swizzle the next operation */
+                        const SwsConst c = next->c;
+                        for (int i = 0; i < 4; i++) {
+                            if (!next->comps.unused[i])
+                                next->c.q4[op->swizzle.in[i]] = c.q4[i];
+                        }
+                    }
+                    swap_ops(op, next);
+                    progress = true;
+                    continue;
+                }
+                break;
+            }
+
+            case SWS_OP_CONVERT:
+                /* No-op conversion */
+                if (op->type == op->convert.to) {
+                    ff_sws_op_list_remove_at(ops, n, 1);
+                    continue;
+                }
+
+                /* Transitive conversion */
+                if (next->op == SWS_OP_CONVERT &&
+                    op->convert.expand == next->convert.expand)
+                {
+                    av_assert1(op->convert.to == next->type);
+                    op->convert.to = next->convert.to;
+                    op_copy_flags(op, next);
+                    ff_sws_op_list_remove_at(ops, n + 1, 1);
+                    continue;
+                }
+
+                /* Conversion followed by integer expansion */
+                if (next->op == SWS_OP_SCALE &&
+                    !av_cmp_q(next->c.q, expand_factor(op->type, op->convert.to)))
+                {
+                    op->convert.expand = true;
+                    ff_sws_op_list_remove_at(ops, n + 1, 1);
+                    continue;
+                }
+                break;
+
+            case SWS_OP_MIN:
+                for (int i = 0; i < 4; i++) {
+                    if (next->comps.unused[i] || !op->c.q4[i].den)
+                        continue;
+                    if (av_cmp_q(op->c.q4[i], prev->comps.max[i]) < 0)
+                        noop = false;
+                }
+
+                if (noop) {
+                    ff_sws_op_list_remove_at(ops, n, 1);
+                    continue;
+                }
+                break;
+
+            case SWS_OP_MAX:
+                for (int i = 0; i < 4; i++) {
+                    if (next->comps.unused[i] || !op->c.q4[i].den)
+                        continue;
+                    if (av_cmp_q(prev->comps.min[i], op->c.q4[i]) < 0)
+                        noop = false;
+                }
+
+                if (noop) {
+                    ff_sws_op_list_remove_at(ops, n, 1);
+                    continue;
+                }
+                break;
+
+            case SWS_OP_DITHER:
+                for (int i = 0; i < 4; i++) {
+                    noop &= (prev->comps.flags[i] & SWS_COMP_EXACT) ||
+                            next->comps.unused[i];
+                }
+
+                if (noop) {
+                    ff_sws_op_list_remove_at(ops, n, 1);
+                    continue;
+                }
+                break;
+
+            case SWS_OP_LINEAR: {
+                SwsSwizzleOp swizzle;
+                SwsConst c;
+
+                /* No-op (identity) linear operation */
+                if (!op->lin.mask) {
+                    ff_sws_op_list_remove_at(ops, n, 1);
+                    continue;
+                }
+
+                if (next->op == SWS_OP_LINEAR) {
+                    /* 5x5 matrix multiplication after appending [ 0 0 0 0 1 ] */
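+                    /* (each SwsLinearOp stores a 4x5 affine matrix; the
+                     * implicit fifth row [0 0 0 0 1] makes both operands
+                     * square, so two consecutive linear ops fold into one) */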
+                    const SwsLinearOp m1 = op->lin;
+                    const SwsLinearOp m2 = next->lin;
+                    for (int i = 0; i < 4; i++) {
+                        for (int j = 0; j < 5; j++) {
+                            AVRational sum = Q(0);
+                            for (int k = 0; k < 4; k++)
+                                sum = av_add_q(sum, av_mul_q(m2.m[i][k], m1.m[k][j]));
+                            if (j == 4) /* m1.m[4][j] == 1 */
+                                sum = av_add_q(sum, m2.m[i][4]);
+                            op->lin.m[i][j] = sum;
+                        }
+                    }
+                    op_copy_flags(op, next);
+                    op->lin.mask = ff_sws_linear_mask(op->lin);
+                    ff_sws_op_list_remove_at(ops, n + 1, 1);
+                    continue;
+                }
+
+                /* Optimize away zero columns */
+                for (int j = 0; j < 4; j++) {
+                    const uint32_t col = SWS_MASK_COL(j);
+                    if (!(prev->comps.flags[j] & SWS_COMP_ZERO) || !(op->lin.mask & col))
+                        continue;
+                    for (int i = 0; i < 4; i++)
+                        op->lin.m[i][j] = Q(i == j);
+                    op->lin.mask &= ~col;
+                    changed = true;
+                }
+
+                /* Optimize away unused rows */
+                for (int i = 0; i < 4; i++) {
+                    const uint32_t row = SWS_MASK_ROW(i);
+                    if (!next->comps.unused[i] || !(op->lin.mask & row))
+                        continue;
+                    for (int j = 0; j < 5; j++)
+                        op->lin.m[i][j] = Q(i == j);
+                    op->lin.mask &= ~row;
+                    changed = true;
+                }
+
+                if (changed) {
+                    progress = true;
+                    continue;
+                }
+
+                /* Convert constant rows to explicit clear instruction */
+                if (extract_constant_rows(&op->lin, prev->comps, &c)) {
+                    RET(ff_sws_op_list_insert_at(ops, n + 1, &(SwsOp) {
+                        .op    = SWS_OP_CLEAR,
+                        .type  = op->type,
+                        .comps = op->comps,
+                        .c     = c,
+                    }));
+                    continue;
+                }
+
+                /* Multiplication by scalar constant */
+                if (extract_scalar(&op->lin, prev->comps, next->comps, &c)) {
+                    op->op = SWS_OP_SCALE;
+                    op->c  = c;
+                    progress = true;
+                    continue;
+                }
+
+                /* Swizzle by fixed pattern */
+                if (extract_swizzle(&op->lin, prev->comps, &swizzle)) {
+                    RET(ff_sws_op_list_insert_at(ops, n, &(SwsOp) {
+                        .op      = SWS_OP_SWIZZLE,
+                        .type    = op->type,
+                        .swizzle = swizzle,
+                    }));
+                    continue;
+                }
+                break;
+            }
+
+            case SWS_OP_SCALE: {
+                const int factor2 = exact_log2_q(op->c.q);
+
+                /* No-op scaling */
+                if (op->c.q.num == 1 && op->c.q.den == 1) {
+                    ff_sws_op_list_remove_at(ops, n, 1);
+                    continue;
+                }
+
+                /* Scaling by integer before conversion to int */
+                if (op->c.q.den == 1 &&
+                    next->op == SWS_OP_CONVERT &&
+                    ff_sws_pixel_type_is_int(next->convert.to))
+                {
+                    op->type = next->convert.to;
+                    swap_ops(op, next);
+                    progress = true;
+                    continue;
+                }
+
+                /* Scaling by exact power of two */
+                if (factor2 && ff_sws_pixel_type_is_int(op->type)) {
+                    op->op = factor2 > 0 ? SWS_OP_LSHIFT : SWS_OP_RSHIFT;
+                    op->c.u = FFABS(factor2);
+                    progress = true;
+                    continue;
+                }
+                break;
+            }
+            }
+
+            /* No optimization triggered, move on to next operation */
+            n++;
+        }
+    } while (prev_num_ops != ops->num_ops || progress);
+
+    return 0;
+}
-- 
2.49.0


* [FFmpeg-devel] [PATCH 09/17] swscale/ops_chain: add internal abstraction for kernel linking
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (7 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 08/17] swscale/ops: introduce new low level framework Niklas Haas
@ 2025-04-26 17:41 ` Niklas Haas
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 10/17] swscale/ops_backend: add reference backend based on C templates Niklas Haas
                   ` (10 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

See doc/swscale-v2.txt for design details.
---
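Illustration for reviewers (a minimal standalone sketch, not part of the
patch; Node, priv and scale_kernel are simplified stand-ins for SwsOpImpl,
SwsOpPriv and a real kernel): every compiled kernel does its own work on the
pixel block, then tail-calls the next kernel. The chain interleaves function
pointers and private data (op N's function in slot N, its private data in
slot N + 1), so a single pointer hands each kernel both its own data
(impl->priv) and the next continuation (impl->cont):

    /* hypothetical, simplified chain node */
    typedef struct Node {
        void (*cont)(const struct Node *impl, float *block, int n);
        float priv; /* stand-in for the 16-byte SwsOpPriv union */
    } Node;

    /* scale kernel: use its own data, then continue with the next slot */
    static void scale_kernel(const Node *impl, float *block, int n)
    {
        for (int i = 0; i < n; i++)
            block[i] *= impl->priv;
        impl->cont(&impl[1], block, n);
    }

The dispatcher enters the chain via node[0].cont(&node[1], ...), and a
terminating kernel (a write) simply does not continue.
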
 libswscale/Makefile    |   1 +
 libswscale/ops_chain.c | 291 +++++++++++++++++++++++++++++++++++++++++
 libswscale/ops_chain.h | 108 +++++++++++++++
 3 files changed, 400 insertions(+)
 create mode 100644 libswscale/ops_chain.c
 create mode 100644 libswscale/ops_chain.h

diff --git a/libswscale/Makefile b/libswscale/Makefile
index 810c9dee78..c9dfa78c89 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -16,6 +16,7 @@ OBJS = alphablend.o                                     \
        input.o                                          \
        lut3d.o                                          \
        ops.o                                            \
+       ops_chain.o                                      \
        ops_optimizer.o                                  \
        options.o                                        \
        output.o                                         \
diff --git a/libswscale/ops_chain.c b/libswscale/ops_chain.c
new file mode 100644
index 0000000000..92e22cb384
--- /dev/null
+++ b/libswscale/ops_chain.c
@@ -0,0 +1,291 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/mem.h"
+#include "libavutil/rational.h"
+
+#include "ops_chain.h"
+
+SwsOpChain *ff_sws_op_chain_alloc(void)
+{
+    return av_mallocz(sizeof(SwsOpChain));
+}
+
+void ff_sws_op_chain_free(SwsOpChain *chain)
+{
+    if (!chain)
+        return;
+
+    for (int i = 0; i < chain->num_impl + 1; i++) {
+        if (chain->free[i])
+            chain->free[i](chain->impl[i].priv.ptr);
+    }
+
+    av_free(chain);
+}
+
+int ff_sws_op_chain_append(SwsOpChain *chain, SwsFuncPtr func,
+                           void (*free)(void *), SwsOpPriv priv)
+{
+    const int idx = chain->num_impl;
+    if (idx == SWS_MAX_OPS)
+        return AVERROR(EINVAL);
+
+    av_assert1(func);
+    chain->impl[idx].cont = func;
+    chain->impl[idx + 1].priv = priv;
+    chain->free[idx + 1] = free;
+    chain->num_impl++;
+    return 0;
+}
+
+/**
+ * Match an operation against a reference operation. Returns a score for how
+ * well the reference matches the operation, or 0 if there is no match.
+ *
+ * If `ref->comps` has any flags set, they must be set in `op` as well.
+ * Likewise, if `ref->comps` has any components marked as unused, they must be
+ * marked as unused in `op` as well.
+ *
+ * For SWS_OP_LINEAR, `ref->lin.mask` must be a strict superset of
+ * `op->lin.mask`, but may not contain any columns explicitly ignored by
+ * `op->comps.unused`.
+ *
+ * For SWS_OP_READ, SWS_OP_WRITE, SWS_OP_SWAP_BYTES and SWS_OP_SWIZZLE, the
+ * exact type is not checked, just the size.
+ *
+ * Components set in `next.unused` are ignored when matching. If `flexible`
+ * is true, the op body is ignored - only the operation, pixel type, and
+ * component masks are checked.
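+ *
+ * Rough scoring sketch, mirroring the code below: every candidate starts
+ * from a base score of 10, gains +1 for each component that both the
+ * reference and the op leave unused, and +popcount(flags) for every
+ * assumption flag shared with the op. Flexible entries match broadly but
+ * at (score - 5), so an exact kernel of the same shape always wins.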
+ */
+static int op_match(const SwsOp *op, const SwsOpEntry *entry, const SwsComps next)
+{
+    const SwsOp *ref = &entry->op;
+    int score = 10;
+    if (op->op != ref->op)
+        return 0;
+
+    switch (op->op) {
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+    case SWS_OP_SWAP_BYTES:
+    case SWS_OP_SWIZZLE:
+        /* Only the size matters for these operations */
+        if (ff_sws_pixel_type_size(op->type) != ff_sws_pixel_type_size(ref->type))
+            return 0;
+        break;
+    default:
+        if (op->type != ref->type)
+            return 0;
+        break;
+    }
+
+    for (int i = 0; i < 4; i++) {
+        if (ref->comps.unused[i]) {
+            if (op->comps.unused[i])
+                score += 1; /* Operating on fewer components is better .. */
+            else
+                return 0; /* .. but not too few! */
+        }
+
+        if (ref->comps.flags[i]) {
+            if (ref->comps.flags[i] & ~op->comps.flags[i]) {
+                return 0; /* Missing required output assumptions */
+            } else {
+                /* Implementation is more specialized */
+                score += av_popcount(ref->comps.flags[i]);
+            }
+        }
+    }
+
+    /* Flexible variants always match, but lower the score to prioritize more
+     * specific implementations if they exist */
+    if (entry->flexible)
+        return score - 5;
+
+    switch (op->op) {
+    case SWS_OP_INVALID:
+        return 0;
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+        if (op->rw.elems  != ref->rw.elems  ||
+            op->rw.packed != ref->rw.packed ||
+            op->rw.frac   != ref->rw.frac)
+            return 0;
+        return score;
+    case SWS_OP_SWAP_BYTES:
+        return score;
+    case SWS_OP_PACK:
+    case SWS_OP_UNPACK:
+        for (int i = 0; i < 4 && op->pack.pattern[i]; i++) {
+            if (op->pack.pattern[i] != ref->pack.pattern[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_CLEAR:
+        for (int i = 0; i < 4; i++) {
+            if (!op->c.q4[i].den)
+                continue;
+            if (av_cmp_q(op->c.q4[i], ref->c.q4[i]) && !next.unused[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_LSHIFT:
+    case SWS_OP_RSHIFT:
+        return op->c.u == ref->c.u ? score : 0;
+    case SWS_OP_SWIZZLE:
+        for (int i = 0; i < 4; i++) {
+            if (op->swizzle.in[i] != ref->swizzle.in[i] && !next.unused[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_CONVERT:
+        if (op->convert.to     != ref->convert.to ||
+            op->convert.expand != ref->convert.expand)
+            return 0;
+        return score;
+    case SWS_OP_DITHER:
+        return op->dither.size_log2 == ref->dither.size_log2 ? score : 0;
+    case SWS_OP_MIN:
+    case SWS_OP_MAX:
+        for (int i = 0; i < 4; i++) {
+            if (av_cmp_q(op->c.q4[i], ref->c.q4[i]) && !next.unused[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_LINEAR:
+        /* All required elements must be present */
+        if (op->lin.mask & ~ref->lin.mask)
+            return 0;
+        /* To avoid operating on possibly undefined memory, filter out
+         * implementations that operate on more input components */
+        for (int i = 0; i < 4; i++) {
+            if ((ref->lin.mask & SWS_MASK_COL(i)) && op->comps.unused[i])
+                return 0;
+        }
+        /* Prioritize smaller implementations */
+        score += av_popcount(SWS_MASK_ALL ^ ref->lin.mask);
+        return score;
+    case SWS_OP_SCALE:
+        return score;
+    case SWS_OP_TYPE_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid operation type!");
+    return 0;
+}
+
+int ff_sws_op_compile_tables(const SwsOpTable *const tables[], int num_tables,
+                             SwsOpList *ops, const int block_size,
+                             SwsOpChain *chain)
+{
+    static const SwsOp dummy = { .comps.unused = { true, true, true, true }};
+    const SwsOp *next = ops->num_ops > 1 ? &ops->ops[1] : &dummy;
+    const unsigned cpu_flags = av_get_cpu_flags();
+    const SwsOpEntry *best = NULL;
+    const SwsOp *op = &ops->ops[0];
+    int ret, best_score = 0;
+    SwsOpPriv priv = {0};
+
+    for (int n = 0; n < num_tables; n++) {
+        const SwsOpTable *table = tables[n];
+        if ((table->block_size && table->block_size != block_size) ||
+            (table->cpu_flags & ~cpu_flags))
+            continue;
+
+        for (int i = 0; table->entries[i].op.op; i++) {
+            const SwsOpEntry *entry = &table->entries[i];
+            int score = op_match(op, entry, next->comps);
+            if (score > best_score) {
+                best_score = score;
+                best = entry;
+            }
+        }
+    }
+
+    if (!best)
+        return AVERROR(ENOTSUP);
+
+    if (best->setup) {
+        ret = best->setup(op, &priv);
+        if (ret < 0)
+            return ret;
+    }
+
+    ret = ff_sws_op_chain_append(chain, best->func, best->free, priv);
+    if (ret < 0) {
+        if (best->free)
+            best->free(&priv);
+        return ret;
+    }
+
+    ops->ops++;
+    ops->num_ops--;
+    return ops->num_ops ? AVERROR(EAGAIN) : 0;
+}
+
+#define q2pixel(type, q) ((q).den ? (type) (q).num / (q).den : 0)
+
+int ff_sws_setup_u8(const SwsOp *op, SwsOpPriv *out)
+{
+    out->u8[0] = op->c.u;
+    return 0;
+}
+
+int ff_sws_setup_u(const SwsOp *op, SwsOpPriv *out)
+{
+    switch (op->type) {
+    case SWS_PIXEL_U8:  out->u8[0]  = op->c.u; return 0;
+    case SWS_PIXEL_U16: out->u16[0] = op->c.u; return 0;
+    case SWS_PIXEL_U32: out->u32[0] = op->c.u; return 0;
+    case SWS_PIXEL_F32: out->f32[0] = op->c.u; return 0;
+    default: return AVERROR(EINVAL);
+    }
+}
+
+int ff_sws_setup_q(const SwsOp *op, SwsOpPriv *out)
+{
+    switch (op->type) {
+    case SWS_PIXEL_U8:  out->u8[0]  = q2pixel(uint8_t,  op->c.q); return 0;
+    case SWS_PIXEL_U16: out->u16[0] = q2pixel(uint16_t, op->c.q); return 0;
+    case SWS_PIXEL_U32: out->u32[0] = q2pixel(uint32_t, op->c.q); return 0;
+    case SWS_PIXEL_F32: out->f32[0] = q2pixel(float,    op->c.q); return 0;
+    default: return AVERROR(EINVAL);
+    }
+
+    return 0;
+}
+
+int ff_sws_setup_q4(const SwsOp *op, SwsOpPriv *out)
+{
+    for (int i = 0; i < 4; i++) {
+        switch (op->type) {
+        case SWS_PIXEL_U8:  out->u8[i]  = q2pixel(uint8_t,  op->c.q4[i]); break;
+        case SWS_PIXEL_U16: out->u16[i] = q2pixel(uint16_t, op->c.q4[i]); break;
+        case SWS_PIXEL_U32: out->u32[i] = q2pixel(uint32_t, op->c.q4[i]); break;
+        case SWS_PIXEL_F32: out->f32[i] = q2pixel(float,    op->c.q4[i]); break;
+        default: return AVERROR(EINVAL);
+        }
+    }
+
+    return 0;
+}
diff --git a/libswscale/ops_chain.h b/libswscale/ops_chain.h
new file mode 100644
index 0000000000..cf13f1a8af
--- /dev/null
+++ b/libswscale/ops_chain.h
@@ -0,0 +1,108 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_CHAIN_H
+#define SWSCALE_OPS_CHAIN_H
+
+#include "libavutil/cpu.h"
+
+#include "ops_internal.h"
+
+/**
+ * Helpers for SIMD implementations based on chained kernels, using a
+ * continuation passing style to link them together.
+ */
+
+/**
+ * Private data for each kernel.
+ */
+typedef union SwsOpPriv {
+    DECLARE_ALIGNED_16(char, data)[16];
+
+    /* Common types */
+    void *ptr;
+    uint8_t   u8[16];
+    uint16_t u16[8];
+    uint32_t u32[4];
+    float    f32[4];
+} SwsOpPriv;
+
+static_assert(sizeof(SwsOpPriv) == 16, "SwsOpPriv size mismatch");
+
+/* Setup helpers */
+int ff_sws_setup_u(const SwsOp *op, SwsOpPriv *out);
+int ff_sws_setup_u8(const SwsOp *op, SwsOpPriv *out);
+int ff_sws_setup_q(const SwsOp *op, SwsOpPriv *out);
+int ff_sws_setup_q4(const SwsOp *op, SwsOpPriv *out);
+
+/**
+ * Per-kernel execution context.
+ *
+ * Note: This struct is hard-coded in assembly, so do not change the layout.
+ */
+typedef void (*SwsFuncPtr)(void);
+typedef struct SwsOpImpl {
+    SwsFuncPtr cont; /* [offset =  0] Continuation for this operation. */
+    SwsOpPriv  priv; /* [offset = 16] Private data for this operation. */
+} SwsOpImpl;
+
+static_assert(sizeof(SwsOpImpl) == 32,         "SwsOpImpl layout mismatch");
+static_assert(offsetof(SwsOpImpl, priv) == 16, "SwsOpImpl layout mismatch");
+
+/* Compiled chain of operations, which can be dispatched efficiently */
+typedef struct SwsOpChain {
+#define SWS_MAX_OPS 16
+    SwsOpImpl impl[SWS_MAX_OPS + 1]; /* reserve extra space for the entrypoint */
+    void (*free[SWS_MAX_OPS + 1])(void *);
+    int num_impl;
+} SwsOpChain;
+
+SwsOpChain *ff_sws_op_chain_alloc(void);
+void ff_sws_op_chain_free(SwsOpChain *chain);
+
+/* Returns 0 on success, or a negative error code. */
+int ff_sws_op_chain_append(SwsOpChain *chain, SwsFuncPtr func,
+                           void (*free)(void *), SwsOpPriv priv);
+
+typedef struct SwsOpEntry {
+    SwsOp op;
+    SwsFuncPtr func;
+    bool flexible; /* if true, only the op, pixel type and component masks are matched */
+    int (*setup)(const SwsOp *op, SwsOpPriv *out); /* optional */
+    void (*free)(void *priv);
+} SwsOpEntry;
+
+typedef struct SwsOpTable {
+    unsigned cpu_flags;   /* required CPU flags for this table */
+    int block_size;       /* fixed block size of this table */
+    SwsOpEntry entries[]; /* terminated by {0} */
+} SwsOpTable;
+
+/**
+ * "Compile" a single op by looking it up in a list of fixed size op tables.
+ * See `op_match` in `ops_chain.c` for details on how the matching works.
+ *
+ * Returns 0, AVERROR(EAGAIN), or a negative error code.
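+ *
+ * A typical caller keeps invoking this until the op list is drained, as
+ * the C backend does:
+ *
+ *     do {
+ *         ret = ff_sws_op_compile_tables(tables, num_tables, ops,
+ *                                        block_size, chain);
+ *     } while (ret == AVERROR(EAGAIN));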
+ */
+int ff_sws_op_compile_tables(const SwsOpTable *const tables[], int num_tables,
+                             SwsOpList *ops, const int block_size,
+                             SwsOpChain *chain);
+
+#endif /* SWSCALE_OPS_CHAIN_H */
-- 
2.49.0


* [FFmpeg-devel] [PATCH 10/17] swscale/ops_backend: add reference backend based on C templates
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (8 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 09/17] swscale/ops_chain: add internal abstraction for kernel linking Niklas Haas
@ 2025-04-26 17:41 ` Niklas Haas
  2025-05-02 15:06   ` Michael Niedermayer
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 11/17] swscale/x86: add SIMD backend Niklas Haas
                   ` (9 subsequent siblings)
  19 siblings, 1 reply; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This will serve as a reference for the SIMD backends to come. That said,
with auto-vectorization enabled, the performance of this is not atrocious, and
can often beat even the old SIMD.

In theory, we can dramatically speed it up by using GCC vectors instead of
arrays, but the performance gains from this are too dependent on exact GCC
versions and flags, so in practice it's not a substitute for a SIMD
implementation.
---
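For reviewers, the trade-off between the two approaches mentioned above, as
a minimal sketch (not part of the patch; names are illustrative and the
second variant assumes GCC/clang vector extension support):

    /* array-based, as used by this backend: a plain loop that the
     * compiler may or may not auto-vectorize */
    typedef float blk_t[32];
    static void scale_block(blk_t x, float k)
    {
        for (int i = 0; i < 32; i++)
            x[i] *= k;
    }

    /* GCC vector extension: vectorized by construction, but the codegen
     * quality depends heavily on compiler version and flags */
    typedef float vec_t __attribute__((vector_size(32 * sizeof(float))));
    static void scale_vec(vec_t *x, float k)
    {
        *x *= k;
    }
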
 libswscale/Makefile          |   6 +
 libswscale/ops.c             |   3 +
 libswscale/ops.h             |   2 -
 libswscale/ops_backend.c     | 101 ++++++
 libswscale/ops_backend.h     | 181 +++++++++++
 libswscale/ops_tmpl_common.c | 176 ++++++++++
 libswscale/ops_tmpl_float.c  | 255 +++++++++++++++
 libswscale/ops_tmpl_int.c    | 609 +++++++++++++++++++++++++++++++++++
 8 files changed, 1331 insertions(+), 2 deletions(-)
 create mode 100644 libswscale/ops_backend.c
 create mode 100644 libswscale/ops_backend.h
 create mode 100644 libswscale/ops_tmpl_common.c
 create mode 100644 libswscale/ops_tmpl_float.c
 create mode 100644 libswscale/ops_tmpl_int.c

diff --git a/libswscale/Makefile b/libswscale/Makefile
index c9dfa78c89..6e5696c5a6 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -16,6 +16,7 @@ OBJS = alphablend.o                                     \
        input.o                                          \
        lut3d.o                                          \
        ops.o                                            \
+       ops_backend.o                                    \
        ops_chain.o                                      \
        ops_optimizer.o                                  \
        options.o                                        \
@@ -29,6 +30,11 @@ OBJS = alphablend.o                                     \
        yuv2rgb.o                                        \
        vscale.o                                         \
 
+OPS-CFLAGS = -Wno-uninitialized \
+             -ffinite-math-only
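+# (the templated kernels intentionally leave unused components uninitialized,
+# hence -Wno-uninitialized; -ffinite-math-only lets the compiler assume the
+# float kernels never see NaN or Inf)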
+
+$(SUBDIR)ops_backend.o: CFLAGS += $(OPS-CFLAGS)
+
 # Objects duplicated from other libraries for shared builds
 SHLIBOBJS                    += log2_tab.o half2float.o
 
diff --git a/libswscale/ops.c b/libswscale/ops.c
index 6d9a844e06..9600e3c9df 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -27,7 +27,10 @@
 #include "ops.h"
 #include "ops_internal.h"
 
+extern SwsOpBackend backend_c;
+
 const SwsOpBackend * const ff_sws_op_backends[] = {
+    &backend_c,
     NULL
 };
 
diff --git a/libswscale/ops.h b/libswscale/ops.h
index c9c5706cbf..b8ab6d8522 100644
--- a/libswscale/ops.h
+++ b/libswscale/ops.h
@@ -91,8 +91,6 @@ typedef struct SwsComps {
 } SwsComps;
 
 typedef struct SwsReadWriteOp {
-    /* Note: Unread pixel data is explicitly cleared to {0} for sanity */
-
     int elems;   /* number of elements (of type `op.type`) to read/write */
     bool packed; /* read multiple elements from a single plane */
     int frac;    /* fractional pixel step factor (log2) */
diff --git a/libswscale/ops_backend.c b/libswscale/ops_backend.c
new file mode 100644
index 0000000000..6cd2b2d9b9
--- /dev/null
+++ b/libswscale/ops_backend.c
@@ -0,0 +1,101 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "ops_backend.h"
+
+/* Array-based reference implementation */
+
+#ifndef SWS_BLOCK_SIZE
+#  define SWS_BLOCK_SIZE 32
+#endif
+
+typedef  uint8_t  u8block_t[SWS_BLOCK_SIZE];
+typedef uint16_t u16block_t[SWS_BLOCK_SIZE];
+typedef uint32_t u32block_t[SWS_BLOCK_SIZE];
+typedef    float f32block_t[SWS_BLOCK_SIZE];
+
+#define BIT_DEPTH 8
+# include "ops_tmpl_int.c"
+#undef BIT_DEPTH
+
+#define BIT_DEPTH 16
+# include "ops_tmpl_int.c"
+#undef BIT_DEPTH
+
+#define BIT_DEPTH 32
+# include "ops_tmpl_int.c"
+# include "ops_tmpl_float.c"
+#undef BIT_DEPTH
+
+static void process(const SwsOpExec *exec, const void *priv, int num_blocks)
+{
+    const SwsOpChain *chain = priv;
+    const SwsOpImpl *impl = chain->impl;
+    SwsOpIter iter;
+
+    iter.y = exec->y;
+    for (int i = 0; i < 4; i++) {
+        iter.in[i]  = exec->in[i];
+        iter.out[i] = exec->out[i];
+    }
+
+    for (iter.x = exec->x; num_blocks-- > 0; iter.x += SWS_BLOCK_SIZE) {
+        ((void (*)(SwsOpIter *, const SwsOpImpl *)) impl->cont)
+            (&iter, &impl[1]);
+    }
+}
+
+static int compile(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out)
+{
+    int ret;
+
+    SwsOpChain *chain = ff_sws_op_chain_alloc();
+    if (!chain)
+        return AVERROR(ENOMEM);
+
+    static const SwsOpTable *const tables[] = {
+        &bitfn(op_table_int,    u8),
+        &bitfn(op_table_int,   u16),
+        &bitfn(op_table_int,   u32),
+        &bitfn(op_table_float, f32),
+    };
+
+    do {
+        ret = ff_sws_op_compile_tables(tables, FF_ARRAY_ELEMS(tables), ops,
+                                       SWS_BLOCK_SIZE, chain);
+    } while (ret == AVERROR(EAGAIN));
+    if (ret < 0) {
+        ff_sws_op_chain_free(chain);
+        return ret;
+    }
+
+    *out = (SwsCompiledOp) {
+        .func = process,
+        .block_size = SWS_BLOCK_SIZE,
+        .priv = chain,
+        .free = (void (*)(void *)) ff_sws_op_chain_free,
+    };
+    return 0;
+}
+
+SwsOpBackend backend_c = {
+    .name       = "c",
+    .compile    = compile,
+};
diff --git a/libswscale/ops_backend.h b/libswscale/ops_backend.h
new file mode 100644
index 0000000000..3d09ba791a
--- /dev/null
+++ b/libswscale/ops_backend.h
@@ -0,0 +1,181 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_BACKEND_H
+#define SWSCALE_OPS_BACKEND_H
+
+/**
+ * Helper macros for the C-based backend.
+ *
+ * To use these macros, the following must be defined:
+ *  - PIXEL_TYPE should be one of SWS_PIXEL_*
+ *  - pixel_t should be the type of pixels
+ *  - block_t should be the type of blocks (groups of pixels)
+ *  - BIT_DEPTH and FMT_CHAR, which together form the function name suffix
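+ *
+ * For example, the 8-bit integer template (ops_tmpl_int.c) defines:
+ *
+ *   #define PIXEL_TYPE SWS_PIXEL_U8
+ *   #define pixel_t    uint8_t
+ *   #define block_t    u8block_t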
+ */
+
+#include <assert.h>
+#include <float.h>
+#include <stdint.h>
+
+#include "libavutil/attributes.h"
+#include "libavutil/mem.h"
+
+#include "ops_chain.h"
+
+/**
+ * Internal context holding per-iter execution data. The data pointers will be
+ * directly incremented by the corresponding read/write functions.
+ */
+typedef struct SwsOpIter {
+    const uint8_t *in[4];
+    uint8_t *out[4];
+    int x, y;
+} SwsOpIter;
+
+#ifdef __clang__
+#  define SWS_FUNC
+#  define SWS_LOOP AV_PRAGMA(clang loop vectorize(assume_safety))
+#elif defined(__GNUC__)
+#  define SWS_FUNC __attribute__((optimize("tree-vectorize")))
+#  define SWS_LOOP AV_PRAGMA(GCC ivdep)
+#else
+#  define SWS_FUNC
+#  define SWS_LOOP
+#endif
+
+#if defined(__clang__)
+#  define SWS_ASSUME(cond) __builtin_assume(cond)
+#elif defined(__GNUC__)
+#  define SWS_ASSUME(cond) { if (!(cond)) __builtin_unreachable(); }
+#else
+#  define SWS_ASSUME(cond) ((void) (cond))
+#endif
+
+#if defined(__clang__) || defined(__GNUC__)
+#  define SWS_ASSUME_ALIGNED(ptr, align)  __builtin_assume_aligned(ptr, align)
+#else
+#  define SWS_ASSUME_ALIGNED(ptr, align) ((void *) (ptr))
+#endif
+
+/* Miscellaneous helpers */
+#define bitfn2(name, ext) name ## _ ## ext
+#define bitfn(name, ext)  bitfn2(name, ext)
+
+#define FN_SUFFIX AV_JOIN(FMT_CHAR, BIT_DEPTH)
+#define fn(name)  bitfn(name, FN_SUFFIX)
+
+#define av_q2pixel(q) ((q).den ? (pixel_t) (q).num / (q).den : 0)
+
+/* Helper macros to make writing common function signatures less painful */
+#define DECL_FUNC(NAME, ...)                                                    \
+    static av_always_inline void fn(NAME)(SwsOpIter *restrict iter,             \
+                                          const SwsOpImpl *restrict impl,       \
+                                          block_t x, block_t y,                 \
+                                          block_t z, block_t w,                 \
+                                          __VA_ARGS__)
+
+#define DECL_READ(NAME, ...)                                                    \
+    static av_always_inline void fn(NAME)(SwsOpIter *restrict iter,             \
+                                          const SwsOpImpl *restrict impl,       \
+                                          const pixel_t *restrict in0,          \
+                                          const pixel_t *restrict in1,          \
+                                          const pixel_t *restrict in2,          \
+                                          const pixel_t *restrict in3,          \
+                                          __VA_ARGS__)
+
+#define DECL_WRITE(NAME, ...)                                                   \
+    DECL_FUNC(NAME, pixel_t *restrict out0, pixel_t *restrict out1,             \
+                    pixel_t *restrict out2, pixel_t *restrict out3,             \
+                    __VA_ARGS__)
+
+/* Helper macros to call into functions declared with DECL_FUNC_* */
+#define CALL(FUNC, ...) \
+    fn(FUNC)(iter, impl, x, y, z, w, __VA_ARGS__)
+
+#define CALL_READ(FUNC, ...)                                                    \
+    fn(FUNC)(iter, impl, (const pixel_t *) iter->in[0],                         \
+                         (const pixel_t *) iter->in[1],                         \
+                         (const pixel_t *) iter->in[2],                         \
+                         (const pixel_t *) iter->in[3], __VA_ARGS__)
+
+#define CALL_WRITE(FUNC, ...)                                                   \
+    CALL(FUNC, (pixel_t *) iter->out[0], (pixel_t *) iter->out[1],              \
+               (pixel_t *) iter->out[2], (pixel_t *) iter->out[3], __VA_ARGS__)
+
+/* Helper macros to declare continuation functions */
+#define DECL_IMPL(NAME)                                                         \
+    static SWS_FUNC void fn(NAME)(SwsOpIter *restrict iter,                     \
+                                  const SwsOpImpl *restrict impl,               \
+                                  block_t x, block_t y,                         \
+                                  block_t z, block_t w)                         \
+
+/* Helper macro to call into the next continuation with a given type */
+#define CONTINUE(TYPE, ...)                                                     \
+    ((void (*)(SwsOpIter *, const SwsOpImpl *,                                  \
+               TYPE x, TYPE y, TYPE z, TYPE w)) impl->cont)                     \
+        (iter, &impl[1], __VA_ARGS__)
+
+/* Helper macros for common op setup code */
+#define DECL_SETUP(NAME)                                                        \
+    static int fn(NAME)(const SwsOp *op, SwsOpPriv *out)
+
+#define SETUP_MEMDUP(c) ff_setup_memdup(&(c), sizeof(c), out)
+static inline int ff_setup_memdup(const void *c, size_t size, SwsOpPriv *out)
+{
+    out->ptr = av_memdup(c, size);
+    return out->ptr ? 0 : AVERROR(ENOMEM);
+}
+
+/* Helper macro for declaring op table entries */
+#define DECL_ENTRY(NAME, ...)                                                   \
+    static const SwsOpEntry fn(op_##NAME) = {                                   \
+        .func = (SwsFuncPtr) fn(NAME),                                          \
+        .op.type = PIXEL_TYPE,                                                  \
+        __VA_ARGS__                                                             \
+    }
+
+/* Helpers to define functions for common subsets of components */
+#define DECL_PATTERN(NAME) \
+    DECL_FUNC(NAME, const bool X, const bool Y, const bool Z, const bool W)
+
+#define WRAP_PATTERN(FUNC, X, Y, Z, W, ...)                                     \
+    DECL_IMPL(FUNC##_##X##Y##Z##W)                                              \
+    {                                                                           \
+        CALL(FUNC, X, Y, Z, W);                                                 \
+    }                                                                           \
+                                                                                \
+    DECL_ENTRY(FUNC##_##X##Y##Z##W,                                             \
+        .op.comps.unused = { !X, !Y, !Z, !W },                                  \
+        __VA_ARGS__                                                             \
+    )
+
+#define WRAP_COMMON_PATTERNS(FUNC, ...)                                         \
+    WRAP_PATTERN(FUNC, 1, 0, 0, 0, __VA_ARGS__);                                \
+    WRAP_PATTERN(FUNC, 1, 0, 0, 1, __VA_ARGS__);                                \
+    WRAP_PATTERN(FUNC, 1, 1, 1, 0, __VA_ARGS__);                                \
+    WRAP_PATTERN(FUNC, 1, 1, 1, 1, __VA_ARGS__)
+
+#define REF_COMMON_PATTERNS(NAME)                                               \
+    fn(op_##NAME##_1000),                                                       \
+    fn(op_##NAME##_1001),                                                       \
+    fn(op_##NAME##_1110),                                                       \
+    fn(op_##NAME##_1111)
+
+#endif /* SWSCALE_OPS_BACKEND_H */
diff --git a/libswscale/ops_tmpl_common.c b/libswscale/ops_tmpl_common.c
new file mode 100644
index 0000000000..a9410a8a61
--- /dev/null
+++ b/libswscale/ops_tmpl_common.c
@@ -0,0 +1,176 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "ops_backend.h"
+
+#ifndef BIT_DEPTH
+#  error Should only be included from ops_tmpl_*.c!
+#endif
+
+#define WRAP_CONVERT_UINT(N)                                                    \
+DECL_PATTERN(convert_uint##N)                                                   \
+{                                                                               \
+    u##N##block_t xu, yu, zu, wu;                                               \
+                                                                                \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {                                  \
+        if (X)                                                                  \
+            xu[i] = x[i];                                                       \
+        if (Y)                                                                  \
+            yu[i] = y[i];                                                       \
+        if (Z)                                                                  \
+            zu[i] = z[i];                                                       \
+        if (W)                                                                  \
+            wu[i] = w[i];                                                       \
+    }                                                                           \
+                                                                                \
+    CONTINUE(u##N##block_t, xu, yu, zu, wu);                                    \
+}                                                                               \
+                                                                                \
+WRAP_COMMON_PATTERNS(convert_uint##N,                                           \
+    .op.op = SWS_OP_CONVERT,                                                    \
+    .op.convert.to = SWS_PIXEL_U##N,                                            \
+);
+
+#if BIT_DEPTH != 8
+WRAP_CONVERT_UINT(8)
+#endif
+
+#if BIT_DEPTH != 16
+WRAP_CONVERT_UINT(16)
+#endif
+
+#if BIT_DEPTH != 32 || IS_FLOAT
+WRAP_CONVERT_UINT(32)
+#endif
+
+DECL_PATTERN(clear)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (!X)
+            x[i] = impl->priv.px[0];
+        if (!Y)
+            y[i] = impl->priv.px[1];
+        if (!Z)
+            z[i] = impl->priv.px[2];
+        if (!W)
+            w[i] = impl->priv.px[3];
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+#define WRAP_CLEAR(X, Y, Z, W)                                                  \
+DECL_IMPL(clear##_##X##Y##Z##W)                                                 \
+{                                                                               \
+    CALL(clear, X, Y, Z, W);                                                    \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(clear##_##X##Y##Z##W,                                                \
+    .setup = ff_sws_setup_q4,                                                   \
+    .flexible = true,                                                           \
+    .op.op = SWS_OP_CLEAR,                                                      \
+    .op.comps.unused = { !X, !Y, !Z, !W },                                      \
+);
+
+WRAP_CLEAR(1, 1, 1, 0) /* rgba alpha */
+WRAP_CLEAR(0, 1, 1, 1) /* argb alpha */
+
+WRAP_CLEAR(0, 0, 1, 1) /* vuya chroma */
+WRAP_CLEAR(1, 0, 0, 1) /* yuva chroma */
+WRAP_CLEAR(1, 1, 0, 0) /* ayuv chroma */
+WRAP_CLEAR(0, 1, 0, 1) /* uyva chroma */
+WRAP_CLEAR(1, 0, 1, 0) /* xvyu chroma */
+
+WRAP_CLEAR(1, 0, 0, 0) /* gray -> yuva */
+WRAP_CLEAR(0, 1, 0, 0) /* gray -> ayuv */
+WRAP_CLEAR(0, 0, 1, 0) /* gray -> vuya */
+
+DECL_PATTERN(min)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] = FFMIN(x[i], impl->priv.px[0]);
+        if (Y)
+            y[i] = FFMIN(y[i], impl->priv.px[1]);
+        if (Z)
+            z[i] = FFMIN(z[i], impl->priv.px[2]);
+        if (W)
+            w[i] = FFMIN(w[i], impl->priv.px[3]);
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_PATTERN(max)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] = FFMAX(x[i], impl->priv.px[0]);
+        if (Y)
+            y[i] = FFMAX(y[i], impl->priv.px[1]);
+        if (Z)
+            z[i] = FFMAX(z[i], impl->priv.px[2]);
+        if (W)
+            w[i] = FFMAX(w[i], impl->priv.px[3]);
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(min,
+    .op.op = SWS_OP_MIN,
+    .setup = ff_sws_setup_q4,
+    .flexible = true,
+);
+
+WRAP_COMMON_PATTERNS(max,
+    .op.op = SWS_OP_MAX,
+    .setup = ff_sws_setup_q4,
+    .flexible = true,
+);
+
+DECL_PATTERN(scale)
+{
+    const pixel_t scale = impl->priv.px[0];
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] *= scale;
+        if (Y)
+            y[i] *= scale;
+        if (Z)
+            z[i] *= scale;
+        if (W)
+            w[i] *= scale;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(scale,
+    .op.op = SWS_OP_SCALE,
+    .setup = ff_sws_setup_q,
+    .flexible = true,
+);
diff --git a/libswscale/ops_tmpl_float.c b/libswscale/ops_tmpl_float.c
new file mode 100644
index 0000000000..9acdbd01bf
--- /dev/null
+++ b/libswscale/ops_tmpl_float.c
@@ -0,0 +1,255 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+
+#include "ops_backend.h"
+
+#ifndef BIT_DEPTH
+#  define BIT_DEPTH 32
+#endif
+
+#if BIT_DEPTH == 32
+#  define PIXEL_TYPE SWS_PIXEL_F32
+#  define PIXEL_MAX  FLT_MAX
+#  define PIXEL_MIN  FLT_MIN
+#  define pixel_t    float
+#  define block_t    f32block_t
+#  define px         f32
+#else
+#  error Invalid BIT_DEPTH
+#endif
+
+#define IS_FLOAT 1
+#define FMT_CHAR f
+#include "ops_tmpl_common.c"
+
+#define MAX_DITHER_SIZE 16
+#if MAX_DITHER_SIZE > SWS_BLOCK_SIZE
+#  define DITHER_ROW_SIZE MAX_DITHER_SIZE
+#else
+#  define DITHER_ROW_SIZE SWS_BLOCK_SIZE
+#endif
+
+typedef struct {
+    pixel_t matrix[MAX_DITHER_SIZE][DITHER_ROW_SIZE];
+} fn(DitherCoeffs);
+
+DECL_SETUP(setup_dither)
+{
+    fn(DitherCoeffs) c = {0};
+    const int size = 1 << op->dither.size_log2;
+
+    if (!size) {
+        /* We special case this value */
+        av_assert1(!av_cmp_q(op->dither.matrix[0], av_make_q(1, 2)));
+        out->ptr = NULL;
+        return 0;
+    }
+
+    for (int y = 0; y < size; y++) {
+        for (int x = 0; x < size; x++)
+            c.matrix[y][x] = av_q2pixel(op->dither.matrix[y * size + x]);
+        for (int x = size; x < SWS_BLOCK_SIZE; x++)
+            c.matrix[y][x] = c.matrix[y][x % size]; /* pad to chunk size */
+    }
+
+    return SETUP_MEMDUP(c);
+}
+
+DECL_FUNC(dither, const int size_log2)
+{
+    const fn(DitherCoeffs) *restrict c = impl->priv.ptr;
+    const int mask = (1 << size_log2) - 1;
+    const int y_line = iter->y;
+    const int row0 = (y_line +  0) & mask;
+    const int row1 = (y_line +  3) & mask;
+    const int row2 = (y_line +  2) & mask;
+    const int row3 = (y_line +  5) & mask;
+    const int base = iter->x & (SWS_BLOCK_SIZE & (MAX_DITHER_SIZE - 1));
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] += size_log2 ? c->matrix[row0][base + i] : (pixel_t) 0.5;
+        y[i] += size_log2 ? c->matrix[row1][base + i] : (pixel_t) 0.5;
+        z[i] += size_log2 ? c->matrix[row2][base + i] : (pixel_t) 0.5;
+        w[i] += size_log2 ? c->matrix[row3][base + i] : (pixel_t) 0.5;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+#define WRAP_DITHER(N)                                                          \
+DECL_IMPL(dither##N)                                                            \
+{                                                                               \
+    CALL(dither, N);                                                            \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(dither##N,                                                           \
+    .op.op = SWS_OP_DITHER,                                                     \
+    .op.dither.size_log2 = N,                                                   \
+    .setup = fn(setup_dither),                                                  \
+    .free = av_free,                                                            \
+);
+
+WRAP_DITHER(0)
+WRAP_DITHER(1)
+WRAP_DITHER(2)
+WRAP_DITHER(3)
+WRAP_DITHER(4)
+
+typedef struct {
+    /* Stored in split form for convenience */
+    pixel_t m[4][4];
+    pixel_t k[4];
+} fn(LinCoeffs);
+
+DECL_SETUP(setup_linear)
+{
+    fn(LinCoeffs) c;
+
+    for (int i = 0; i < 4; i++) {
+        for (int j = 0; j < 4; j++)
+            c.m[i][j] = av_q2pixel(op->lin.m[i][j]);
+        c.k[i] = av_q2pixel(op->lin.m[i][4]);
+    }
+
+    return SETUP_MEMDUP(c);
+}
+
+/**
+ * Fully general case for a 5x5 linear affine transformation. Should never be
+ * called without constant `mask`. This function will compile down to the
+ * appropriately optimized version for the required subset of operations when
+ * called with a constant mask.
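+ *
+ * For example, WRAP_LINEAR(diag3, SWS_MASK_DIAG3) below instantiates a
+ * variant in which only the three diagonal multiplies survive, with the
+ * remaining terms reduced to constants or passthroughs at compile time.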
+ */
+DECL_FUNC(linear_mask, const uint32_t mask)
+{
+    const fn(LinCoeffs) c = *(const fn(LinCoeffs) *) impl->priv.ptr;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        const pixel_t xx = x[i];
+        const pixel_t yy = y[i];
+        const pixel_t zz = z[i];
+        const pixel_t ww = w[i];
+
+        x[i]  = (mask & SWS_MASK_OFF(0)) ? c.k[0] : 0;
+        x[i] += (mask & SWS_MASK(0, 0))  ? c.m[0][0] * xx : xx;
+        x[i] += (mask & SWS_MASK(0, 1))  ? c.m[0][1] * yy : 0;
+        x[i] += (mask & SWS_MASK(0, 2))  ? c.m[0][2] * zz : 0;
+        x[i] += (mask & SWS_MASK(0, 3))  ? c.m[0][3] * ww : 0;
+
+        y[i]  = (mask & SWS_MASK_OFF(1)) ? c.k[1] : 0;
+        y[i] += (mask & SWS_MASK(1, 0))  ? c.m[1][0] * xx : 0;
+        y[i] += (mask & SWS_MASK(1, 1))  ? c.m[1][1] * yy : yy;
+        y[i] += (mask & SWS_MASK(1, 2))  ? c.m[1][2] * zz : 0;
+        y[i] += (mask & SWS_MASK(1, 3))  ? c.m[1][3] * ww : 0;
+
+        z[i]  = (mask & SWS_MASK_OFF(2)) ? c.k[2] : 0;
+        z[i] += (mask & SWS_MASK(2, 0))  ? c.m[2][0] * xx : 0;
+        z[i] += (mask & SWS_MASK(2, 1))  ? c.m[2][1] * yy : 0;
+        z[i] += (mask & SWS_MASK(2, 2))  ? c.m[2][2] * zz : zz;
+        z[i] += (mask & SWS_MASK(2, 3))  ? c.m[2][3] * ww : 0;
+
+        w[i]  = (mask & SWS_MASK_OFF(3)) ? c.k[3] : 0;
+        w[i] += (mask & SWS_MASK(3, 0))  ? c.m[3][0] * xx : 0;
+        w[i] += (mask & SWS_MASK(3, 1))  ? c.m[3][1] * yy : 0;
+        w[i] += (mask & SWS_MASK(3, 2))  ? c.m[3][2] * zz : 0;
+        w[i] += (mask & SWS_MASK(3, 3))  ? c.m[3][3] * ww : ww;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+#define WRAP_LINEAR(NAME, MASK)                                                 \
+DECL_IMPL(linear_##NAME)                                                        \
+{                                                                               \
+    CALL(linear_mask, MASK);                                                    \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(linear_##NAME,                                                       \
+    .setup = fn(setup_linear),                                                  \
+    .free = av_free,                                                            \
+    .op.op = SWS_OP_LINEAR,                                                     \
+    .op.lin.mask = (MASK),                                                      \
+);
+
+WRAP_LINEAR(luma,      SWS_MASK_LUMA)
+WRAP_LINEAR(alpha,     SWS_MASK_ALPHA)
+WRAP_LINEAR(lumalpha,  SWS_MASK_LUMA | SWS_MASK_ALPHA)
+WRAP_LINEAR(dot3,      0b111)
+WRAP_LINEAR(row0,      SWS_MASK_ROW(0))
+WRAP_LINEAR(row0a,     SWS_MASK_ROW(0) | SWS_MASK_ALPHA)
+WRAP_LINEAR(diag3,     SWS_MASK_DIAG3)
+WRAP_LINEAR(diag4,     SWS_MASK_DIAG4)
+WRAP_LINEAR(diagoff3,  SWS_MASK_DIAG3 | SWS_MASK_OFF3)
+WRAP_LINEAR(matrix3,   SWS_MASK_MAT3)
+WRAP_LINEAR(affine3,   SWS_MASK_MAT3 | SWS_MASK_OFF3)
+WRAP_LINEAR(affine3a,  SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA)
+WRAP_LINEAR(matrix4,   SWS_MASK_MAT4)
+WRAP_LINEAR(affine4,   SWS_MASK_MAT4 | SWS_MASK_OFF4)
+
+static const SwsOpTable fn(op_table_float) = {
+    .block_size = SWS_BLOCK_SIZE,
+    .entries = {
+        REF_COMMON_PATTERNS(convert_uint8),
+        REF_COMMON_PATTERNS(convert_uint16),
+        REF_COMMON_PATTERNS(convert_uint32),
+
+        fn(op_clear_1110),
+        REF_COMMON_PATTERNS(min),
+        REF_COMMON_PATTERNS(max),
+        REF_COMMON_PATTERNS(scale),
+
+        fn(op_dither0),
+        fn(op_dither1),
+        fn(op_dither2),
+        fn(op_dither3),
+        fn(op_dither4),
+
+        fn(op_linear_luma),
+        fn(op_linear_alpha),
+        fn(op_linear_lumalpha),
+        fn(op_linear_dot3),
+        fn(op_linear_row0),
+        fn(op_linear_row0a),
+        fn(op_linear_diag3),
+        fn(op_linear_diag4),
+        fn(op_linear_diagoff3),
+        fn(op_linear_matrix3),
+        fn(op_linear_affine3),
+        fn(op_linear_affine3a),
+        fn(op_linear_matrix4),
+        fn(op_linear_affine4),
+
+        {{0}}
+    },
+};
+
+#undef PIXEL_TYPE
+#undef PIXEL_MAX
+#undef PIXEL_MIN
+#undef pixel_t
+#undef block_t
+#undef px
+
+#undef FMT_CHAR
+#undef IS_FLOAT
diff --git a/libswscale/ops_tmpl_int.c b/libswscale/ops_tmpl_int.c
new file mode 100644
index 0000000000..e91ff4fe2c
--- /dev/null
+++ b/libswscale/ops_tmpl_int.c
@@ -0,0 +1,609 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/bswap.h"
+
+#include "ops_backend.h"
+
+#ifndef BIT_DEPTH
+#  define BIT_DEPTH 8
+#endif
+
+#if BIT_DEPTH == 32
+#  define PIXEL_TYPE SWS_PIXEL_U32
+#  define PIXEL_MAX  0xFFFFFFFFu
+#  define SWAP_BYTES av_bswap32
+#  define pixel_t    uint32_t
+#  define block_t    u32block_t
+#  define px         u32
+#elif BIT_DEPTH == 16
+#  define PIXEL_TYPE SWS_PIXEL_U16
+#  define PIXEL_MAX  0xFFFFu
+#  define SWAP_BYTES av_bswap16
+#  define pixel_t    uint16_t
+#  define block_t    u16block_t
+#  define px         u16
+#elif BIT_DEPTH == 8
+#  define PIXEL_TYPE SWS_PIXEL_U8
+#  define PIXEL_MAX  0xFFu
+#  define pixel_t    uint8_t
+#  define block_t    u8block_t
+#  define px         u8
+#else
+#  error Invalid BIT_DEPTH
+#endif
+
+#define IS_FLOAT  0
+#define FMT_CHAR  u
+#define PIXEL_MIN 0
+#include "ops_tmpl_common.c"
+
+DECL_READ(read_planar, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] = in0[i];
+        if (elems > 1)
+            y[i] = in1[i];
+        if (elems > 2)
+            z[i] = in2[i];
+        if (elems > 3)
+            w[i] = in3[i];
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_READ(read_packed, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] = in0[elems * i + 0];
+        if (elems > 1)
+            y[i] = in0[elems * i + 1];
+        if (elems > 2)
+            z[i] = in0[elems * i + 2];
+        if (elems > 3)
+            w[i] = in0[elems * i + 3];
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_WRITE(write_planar, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        out0[i] = x[i];
+        if (elems > 1)
+            out1[i] = y[i];
+        if (elems > 2)
+            out2[i] = z[i];
+        if (elems > 3)
+            out3[i] = w[i];
+    }
+}
+
+DECL_WRITE(write_packed, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        out0[elems * i + 0] = x[i];
+        if (elems > 1)
+            out0[elems * i + 1] = y[i];
+        if (elems > 2)
+            out0[elems * i + 2] = z[i];
+        if (elems > 3)
+            out0[elems * i + 3] = w[i];
+    }
+}
+
+#define WRAP_READ(FUNC, ELEMS, FRAC, PACKED)                                    \
+DECL_IMPL(FUNC##ELEMS)                                                          \
+{                                                                               \
+    CALL_READ(FUNC, ELEMS);                                                     \
+    for (int i = 0; i < (PACKED ? 1 : ELEMS); i++)                              \
+        iter->in[i] += sizeof(block_t) * (PACKED ? ELEMS : 1) >> FRAC;          \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(FUNC##ELEMS,                                                         \
+    .op.op = SWS_OP_READ,                                                       \
+    .op.rw = {                                                                  \
+        .elems  = ELEMS,                                                        \
+        .packed = PACKED,                                                       \
+        .frac   = FRAC,                                                         \
+    },                                                                          \
+);
+
+WRAP_READ(read_planar, 1, 0, false)
+WRAP_READ(read_planar, 2, 0, false)
+WRAP_READ(read_planar, 3, 0, false)
+WRAP_READ(read_planar, 4, 0, false)
+WRAP_READ(read_packed, 2, 0, true)
+WRAP_READ(read_packed, 3, 0, true)
+WRAP_READ(read_packed, 4, 0, true)
+
+#define WRAP_WRITE(FUNC, ELEMS, FRAC, PACKED)                                   \
+DECL_IMPL(FUNC##ELEMS)                                                          \
+{                                                                               \
+    CALL_WRITE(FUNC, ELEMS);                                                    \
+    for (int i = 0; i < (PACKED ? 1 : ELEMS); i++)                              \
+        iter->out[i] += sizeof(block_t) * (PACKED ? ELEMS : 1) >> FRAC;         \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(FUNC##ELEMS,                                                         \
+    .op.op = SWS_OP_WRITE,                                                      \
+    .op.rw = {                                                                  \
+        .elems  = ELEMS,                                                        \
+        .packed = PACKED,                                                       \
+        .frac   = FRAC,                                                         \
+    },                                                                          \
+);
+
+WRAP_WRITE(write_planar, 1, 0, false)
+WRAP_WRITE(write_planar, 2, 0, false)
+WRAP_WRITE(write_planar, 3, 0, false)
+WRAP_WRITE(write_planar, 4, 0, false)
+WRAP_WRITE(write_packed, 2, 0, true)
+WRAP_WRITE(write_packed, 3, 0, true)
+WRAP_WRITE(write_packed, 4, 0, true)
+
+#if BIT_DEPTH == 8
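+/* Sub-byte packed formats: each byte carries two 4-bit or eight 1-bit
+ * pixels, unpacked MSB first */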
+DECL_READ(read_nibbles, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 2) {
+        const pixel_t val = ((const pixel_t *) in0)[i >> 1];
+        x[i + 0] = val >> 4;  /* high nibble */
+        x[i + 1] = val & 0xF; /* low nibble */
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_READ(read_bits, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 8) {
+        const pixel_t val = ((const pixel_t *) in0)[i >> 3];
+        x[i + 0] = (val >> 7) & 1;
+        x[i + 1] = (val >> 6) & 1;
+        x[i + 2] = (val >> 5) & 1;
+        x[i + 3] = (val >> 4) & 1;
+        x[i + 4] = (val >> 3) & 1;
+        x[i + 5] = (val >> 2) & 1;
+        x[i + 6] = (val >> 1) & 1;
+        x[i + 7] = (val >> 0) & 1;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_READ(read_nibbles, 1, 1, false)
+WRAP_READ(read_bits,    1, 3, false)
+
+DECL_WRITE(write_nibbles, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 2)
+        out0[i >> 1] = x[i] << 4 | x[i + 1];
+}
+
+DECL_WRITE(write_bits, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 8) {
+        out0[i >> 3] = x[i + 0] << 7 |
+                       x[i + 1] << 6 |
+                       x[i + 2] << 5 |
+                       x[i + 3] << 4 |
+                       x[i + 4] << 3 |
+                       x[i + 5] << 2 |
+                       x[i + 6] << 1 |
+                       x[i + 7];
+    }
+}
+
+WRAP_WRITE(write_nibbles, 1, 1, false)
+WRAP_WRITE(write_bits,    1, 3, false)
+#endif /* BIT_DEPTH == 8 */
+
+#ifdef SWAP_BYTES
+DECL_PATTERN(swap_bytes)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] = SWAP_BYTES(x[i]);
+        if (Y)
+            y[i] = SWAP_BYTES(y[i]);
+        if (Z)
+            z[i] = SWAP_BYTES(z[i]);
+        if (W)
+            w[i] = SWAP_BYTES(w[i]);
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(swap_bytes, .op.op = SWS_OP_SWAP_BYTES);
+#endif /* SWAP_BYTES */
+
+#if BIT_DEPTH == 8
+DECL_PATTERN(expand16)
+{
+    u16block_t x16, y16, z16, w16;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x16[i] = x[i] << 8 | x[i];
+        if (Y)
+            y16[i] = y[i] << 8 | y[i];
+        if (Z)
+            z16[i] = z[i] << 8 | z[i];
+        if (W)
+            w16[i] = w[i] << 8 | w[i];
+    }
+
+    CONTINUE(u16block_t, x16, y16, z16, w16);
+}
+
+WRAP_COMMON_PATTERNS(expand16,
+    .op.op = SWS_OP_CONVERT,
+    .op.convert.to = SWS_PIXEL_U16,
+    .op.convert.expand = true,
+);
+
+DECL_PATTERN(expand32)
+{
+    u32block_t x32, y32, z32, w32;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x32[i] = (uint32_t) x[i] << 24 | x[i] << 16 | x[i] << 8 | x[i];
+        if (Y)
+            y32[i] = (uint32_t) y[i] << 24 | y[i] << 16 | y[i] << 8 | y[i];
+        if (Z)
+            z32[i] = (uint32_t) z[i] << 24 | z[i] << 16 | z[i] << 8 | z[i];
+        if (W)
+            w32[i] = (uint32_t) w[i] << 24 | w[i] << 16 | w[i] << 8 | w[i];
+    }
+
+    CONTINUE(u32block_t, x32, y32, z32, w32);
+}
+
+WRAP_COMMON_PATTERNS(expand32,
+    .op.op = SWS_OP_CONVERT,
+    .op.convert.to = SWS_PIXEL_U32,
+    .op.convert.expand = true,
+);
+#endif
+
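+/* Bit-field (un)packing within a single pixel. X/Y/Z/W give the component
+ * bit widths from most to least significant; e.g. the 5,6,5 pattern packs
+ * x << 11 | y << 5 | z into one 16-bit value. */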
+#define WRAP_PACK_UNPACK(X, Y, Z, W)                                            \
+inline DECL_IMPL(pack_##X##Y##Z##W)                                             \
+{                                                                               \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {                                  \
+        x[i] = x[i] << (Y+Z+W);                                                 \
+        if (Y)                                                                  \
+            x[i] |= y[i] << (Z+W);                                              \
+        if (Z)                                                                  \
+            x[i] |= z[i] << W;                                                  \
+        if (W)                                                                  \
+            x[i] |= w[i];                                                       \
+    }                                                                           \
+                                                                                \
+    CONTINUE(block_t, x, y, z, w);                                              \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(pack_##X##Y##Z##W,                                                   \
+    .op.op = SWS_OP_PACK,                                                       \
+    .op.pack.pattern = { X, Y, Z, W },                                          \
+    .op.comps.unused = { !X, !Y, !Z, !W },                                      \
+);                                                                              \
+                                                                                \
+inline DECL_IMPL(unpack_##X##Y##Z##W)                                           \
+{                                                                               \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {                                  \
+        const pixel_t val = x[i];                                               \
+        x[i] = val >> (Y+Z+W);                                                  \
+        if (Y)                                                                  \
+            y[i] = (val >> (Z+W)) & ((1 << Y) - 1);                             \
+        if (Z)                                                                  \
+            z[i] = (val >> W) & ((1 << Z) - 1);                                 \
+        if (W)                                                                  \
+            w[i] = val & ((1 << W) - 1);                                        \
+    }                                                                           \
+                                                                                \
+    CONTINUE(block_t, x, y, z, w);                                              \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(unpack_##X##Y##Z##W,                                                 \
+    .op.op = SWS_OP_UNPACK,                                                     \
+    .op.pack.pattern = { X, Y, Z, W },                                          \
+    .op.comps.flags = {                                                         \
+        X ? 0 : SWS_COMP_GARBAGE, Y ? 0 : SWS_COMP_GARBAGE,                     \
+        Z ? 0 : SWS_COMP_GARBAGE, W ? 0 : SWS_COMP_GARBAGE,                     \
+    },                                                                          \
+);
+
+WRAP_PACK_UNPACK( 3,  3,  2,  0)
+WRAP_PACK_UNPACK( 2,  3,  3,  0)
+WRAP_PACK_UNPACK( 1,  2,  1,  0)
+WRAP_PACK_UNPACK( 5,  6,  5,  0)
+WRAP_PACK_UNPACK( 5,  5,  5,  0)
+WRAP_PACK_UNPACK( 4,  4,  4,  0)
+WRAP_PACK_UNPACK( 2, 10, 10, 10)
+WRAP_PACK_UNPACK(10, 10, 10,  2)
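+/* These cover the usual bit-packed layouts, e.g. 1:2:1 / 3:3:2 (rgb4/rgb8
+ * style), 4:4:4 / 5:5:5 / 5:6:5, and 2:10:10:10 / 10:10:10:2; each depth's
+ * op table below references only the patterns that fit its pixel size. */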
+
+#if BIT_DEPTH != 8
+DECL_PATTERN(lshift)
+{
+    const uint8_t amount = impl->priv.u8[0];
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] <<= amount;
+        y[i] <<= amount;
+        z[i] <<= amount;
+        w[i] <<= amount;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_PATTERN(rshift)
+{
+    const uint8_t amount = impl->priv.u8[0];
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] >>= amount;
+        y[i] >>= amount;
+        z[i] >>= amount;
+        w[i] >>= amount;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(lshift,
+    .op.op    = SWS_OP_LSHIFT,
+    .setup    = ff_sws_setup_u8,
+    .flexible = true,
+);
+
+WRAP_COMMON_PATTERNS(rshift,
+    .op.op    = SWS_OP_RSHIFT,
+    .setup    = ff_sws_setup_u8,
+    .flexible = true,
+);
+#endif /* BIT_DEPTH != 8 */
+
+DECL_PATTERN(convert_float)
+{
+    f32block_t xf, yf, zf, wf;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        xf[i] = x[i];
+        yf[i] = y[i];
+        zf[i] = z[i];
+        wf[i] = w[i];
+    }
+
+    CONTINUE(f32block_t, xf, yf, zf, wf);
+}
+
+WRAP_COMMON_PATTERNS(convert_float,
+    .op.op = SWS_OP_CONVERT,
+    .op.convert.to = SWS_PIXEL_F32,
+);
+
+/**
+ * Swizzle by directly swapping the order of arguments to the continuation.
+ * Note that this is only safe to do if no arguments are duplicated.
+ */
+#define DECL_SWIZZLE(X, Y, Z, W)                                                \
+static SWS_FUNC void                                                            \
+fn(swizzle_##X##Y##Z##W)(SwsOpIter *restrict iter,                              \
+                         const SwsOpImpl *restrict impl,                        \
+                         block_t c0, block_t c1, block_t c2, block_t c3)        \
+{                                                                               \
+    CONTINUE(block_t, c##X, c##Y, c##Z, c##W);                                  \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(swizzle_##X##Y##Z##W,                                                \
+    .op.op = SWS_OP_SWIZZLE,                                                    \
+    .op.swizzle = SWS_SWIZZLE(X, Y, Z, W),                                      \
+);
+
+DECL_SWIZZLE(3, 0, 1, 2)
+DECL_SWIZZLE(3, 0, 2, 1)
+DECL_SWIZZLE(2, 1, 0, 3)
+DECL_SWIZZLE(3, 2, 1, 0)
+DECL_SWIZZLE(3, 1, 0, 2)
+DECL_SWIZZLE(3, 2, 0, 1)
+DECL_SWIZZLE(1, 2, 0, 3)
+DECL_SWIZZLE(1, 0, 2, 3)
+DECL_SWIZZLE(2, 0, 1, 3)
+DECL_SWIZZLE(2, 3, 1, 0)
+DECL_SWIZZLE(2, 1, 3, 0)
+DECL_SWIZZLE(1, 2, 3, 0)
+DECL_SWIZZLE(1, 3, 2, 0)
+DECL_SWIZZLE(0, 2, 1, 3)
+DECL_SWIZZLE(0, 2, 3, 1)
+DECL_SWIZZLE(0, 3, 1, 2)
+DECL_SWIZZLE(3, 1, 2, 0)
+DECL_SWIZZLE(0, 3, 2, 1)
+
+/* Broadcast luma -> rgb (only used for y(a) -> rgb(a)) */
+#define DECL_EXPAND_LUMA(X, W, T0, T1)                                          \
+static SWS_FUNC void                                                            \
+fn(expand_luma_##X##W)(SwsOpIter *restrict iter,                                \
+                       const SwsOpImpl *restrict impl,                          \
+                       block_t c0, block_t c1,  block_t c2, block_t c3)         \
+{                                                                               \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++)                                    \
+        T0[i] = T1[i] = c0[i];                                                  \
+                                                                                \
+    CONTINUE(block_t, c##X, T0, T1, c##W);                                      \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(expand_luma_##X##W,                                                  \
+    .op.op = SWS_OP_SWIZZLE,                                                    \
+    .op.swizzle = SWS_SWIZZLE(X, 0, 0, W),                                      \
+);
+
+DECL_EXPAND_LUMA(0, 3, c1, c2)
+DECL_EXPAND_LUMA(3, 0, c1, c2)
+DECL_EXPAND_LUMA(1, 0, c2, c3)
+DECL_EXPAND_LUMA(0, 1, c2, c3)
+
+static const SwsOpTable fn(op_table_int) = {
+    .block_size = SWS_BLOCK_SIZE,
+    .entries = {
+        fn(op_read_planar1),
+        fn(op_read_planar2),
+        fn(op_read_planar3),
+        fn(op_read_planar4),
+        fn(op_read_packed2),
+        fn(op_read_packed3),
+        fn(op_read_packed4),
+
+        fn(op_write_planar1),
+        fn(op_write_planar2),
+        fn(op_write_planar3),
+        fn(op_write_planar4),
+        fn(op_write_packed2),
+        fn(op_write_packed3),
+        fn(op_write_packed4),
+
+#if BIT_DEPTH == 8
+        fn(op_read_bits1),
+        fn(op_read_nibbles1),
+        fn(op_write_bits1),
+        fn(op_write_nibbles1),
+
+        fn(op_pack_1210),
+        fn(op_pack_2330),
+        fn(op_pack_3320),
+
+        fn(op_unpack_1210),
+        fn(op_unpack_2330),
+        fn(op_unpack_3320),
+
+        REF_COMMON_PATTERNS(expand16),
+        REF_COMMON_PATTERNS(expand32),
+#elif BIT_DEPTH == 16
+        fn(op_pack_4440),
+        fn(op_pack_5550),
+        fn(op_pack_5650),
+        fn(op_unpack_4440),
+        fn(op_unpack_5550),
+        fn(op_unpack_5650),
+#elif BIT_DEPTH == 32
+        fn(op_pack_2101010),
+        fn(op_pack_1010102),
+        fn(op_unpack_2101010),
+        fn(op_unpack_1010102),
+#endif
+
+#ifdef SWAP_BYTES
+        REF_COMMON_PATTERNS(swap_bytes),
+#endif
+
+        REF_COMMON_PATTERNS(min),
+        REF_COMMON_PATTERNS(max),
+        REF_COMMON_PATTERNS(scale),
+        REF_COMMON_PATTERNS(convert_float),
+
+        fn(op_clear_1110),
+        fn(op_clear_0111),
+        fn(op_clear_0011),
+        fn(op_clear_1001),
+        fn(op_clear_1100),
+        fn(op_clear_0101),
+        fn(op_clear_1010),
+        fn(op_clear_1000),
+        fn(op_clear_0100),
+        fn(op_clear_0010),
+
+        fn(op_swizzle_3012),
+        fn(op_swizzle_3021),
+        fn(op_swizzle_2103),
+        fn(op_swizzle_3210),
+        fn(op_swizzle_3102),
+        fn(op_swizzle_3201),
+        fn(op_swizzle_1203),
+        fn(op_swizzle_1023),
+        fn(op_swizzle_2013),
+        fn(op_swizzle_2310),
+        fn(op_swizzle_2130),
+        fn(op_swizzle_1230),
+        fn(op_swizzle_1320),
+        fn(op_swizzle_0213),
+        fn(op_swizzle_0231),
+        fn(op_swizzle_0312),
+        fn(op_swizzle_3120),
+        fn(op_swizzle_0321),
+
+        fn(op_expand_luma_03),
+        fn(op_expand_luma_30),
+        fn(op_expand_luma_10),
+        fn(op_expand_luma_01),
+
+#if BIT_DEPTH != 8
+        REF_COMMON_PATTERNS(lshift),
+        REF_COMMON_PATTERNS(rshift),
+        REF_COMMON_PATTERNS(convert_uint8),
+#endif /* BIT_DEPTH != 8 */
+
+#if BIT_DEPTH != 16
+        REF_COMMON_PATTERNS(convert_uint16),
+#endif
+#if BIT_DEPTH != 32
+        REF_COMMON_PATTERNS(convert_uint32),
+#endif
+
+        {{0}}
+    },
+};
+
+#undef PIXEL_TYPE
+#undef PIXEL_MAX
+#undef PIXEL_MIN
+#undef SWAP_BYTES
+#undef pixel_t
+#undef block_t
+#undef px
+
+#undef FMT_CHAR
+#undef IS_FLOAT
-- 
2.49.0


* [FFmpeg-devel] [PATCH 11/17] swscale/x86: add SIMD backend
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (9 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 10/17] swscale/ops_backend: add reference backend based on C templates Niklas Haas
@ 2025-04-26 17:41 ` Niklas Haas
  2025-04-29 13:00   ` Michael Niedermayer
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 12/17] tests/checkasm: increase number of runs in between measurements Niklas Haas
                   ` (8 subsequent siblings)
  19 siblings, 1 reply; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
floating point operations. While this is not yet 100% coverage, it's good
enough for the vast majority of formats out there.

Of special note is the packed shuffle solver, which can reduce any compatible
series of operations down to a single pshufb loop. This takes care of any sort
of packed swizzle, but also e.g. grayscale to packed RGB expansion, RGB bit
depth conversions, endianness swapping and so on.
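
To illustrate the mechanism with a concrete (hypothetical, not part of the
patch) example: a packed rgb24 -> bgr0 conversion reduces to a single
16-byte pshufb control. With 3-byte input and 4-byte output groups,
16 / max(3, 4) = 4 pixel groups fit per 128-bit lane, which corresponds to
the 12-in/16-out kernel variant instantiated in ops.c:

    /* Sketch: build the pshufb control that turns four packed rgb24
     * pixels into four bgr0 pixels in one in-register shuffle. */
    static void rgb24_to_bgr0_mask(uint8_t shuffle[16])
    {
        for (int n = 0; n < 4; n++) {
            shuffle[4 * n + 0] = 3 * n + 2; /* B */
            shuffle[4 * n + 1] = 3 * n + 1; /* G */
            shuffle[4 * n + 2] = 3 * n + 0; /* R */
            shuffle[4 * n + 3] = 0x80;      /* pshufb clears this byte */
        }
    }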
---
 libswscale/ops.c              |   4 +
 libswscale/x86/Makefile       |   3 +
 libswscale/x86/ops.c          | 735 ++++++++++++++++++++++++++++
 libswscale/x86/ops_common.asm | 208 ++++++++
 libswscale/x86/ops_float.asm  | 376 +++++++++++++++
 libswscale/x86/ops_int.asm    | 882 ++++++++++++++++++++++++++++++++++
 6 files changed, 2208 insertions(+)
 create mode 100644 libswscale/x86/ops.c
 create mode 100644 libswscale/x86/ops_common.asm
 create mode 100644 libswscale/x86/ops_float.asm
 create mode 100644 libswscale/x86/ops_int.asm

diff --git a/libswscale/ops.c b/libswscale/ops.c
index 9600e3c9df..e408d7ca42 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -27,9 +27,13 @@
 #include "ops.h"
 #include "ops_internal.h"
 
+extern SwsOpBackend backend_x86;
 extern SwsOpBackend backend_c;
 
 const SwsOpBackend * const ff_sws_op_backends[] = {
+#if ARCH_X86
+    &backend_x86,
+#endif
     &backend_c,
     NULL
 };
diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index f00154941d..a04bc8336f 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -10,6 +10,9 @@ OBJS-$(CONFIG_XMM_CLOBBER_TEST) += x86/w64xmmtest.o
 
 X86ASM-OBJS                     += x86/input.o                          \
                                    x86/output.o                         \
+                                   x86/ops_int.o                        \
+                                   x86/ops_float.o                      \
+                                   x86/ops.o                            \
                                    x86/scale.o                          \
                                    x86/scale_avx2.o                          \
                                    x86/range_convert.o                  \
diff --git a/libswscale/x86/ops.c b/libswscale/x86/ops.c
new file mode 100644
index 0000000000..d37edb72f1
--- /dev/null
+++ b/libswscale/x86/ops.c
@@ -0,0 +1,735 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <float.h>
+
+#include <libavutil/avassert.h>
+#include <libavutil/bswap.h>
+#include <libavutil/mem.h>
+
+#include "../ops_chain.h"
+
+#define DECL_ENTRY(TYPE, NAME, ...)                                             \
+    static const SwsOpEntry op_##NAME = {                                       \
+        .op.type = SWS_PIXEL_##TYPE,                                            \
+        __VA_ARGS__                                                             \
+    }
+
+#define DECL_ASM(TYPE, NAME, ...)                                               \
+    void ff_##NAME(void);                                                       \
+    DECL_ENTRY(TYPE, NAME,                                                      \
+        .func = ff_##NAME,                                                      \
+        __VA_ARGS__)
+
+#define DECL_PATTERN(TYPE, NAME, X, Y, Z, W, ...)                               \
+    DECL_ASM(TYPE, p##X##Y##Z##W##_##NAME,                                      \
+        .op.comps.unused = { !X, !Y, !Z, !W },                                  \
+        __VA_ARGS__                                                             \
+    )
+
+#define REF_PATTERN(NAME, X, Y, Z, W)                                           \
+    op_p##X##Y##Z##W##_##NAME
+
+#define DECL_COMMON_PATTERNS(TYPE, NAME, ...)                                   \
+    DECL_PATTERN(TYPE, NAME, 1, 0, 0, 0, __VA_ARGS__);                          \
+    DECL_PATTERN(TYPE, NAME, 1, 0, 0, 1, __VA_ARGS__);                          \
+    DECL_PATTERN(TYPE, NAME, 1, 1, 1, 0, __VA_ARGS__);                          \
+    DECL_PATTERN(TYPE, NAME, 1, 1, 1, 1, __VA_ARGS__)                           \
+
+#define REF_COMMON_PATTERNS(NAME)                                               \
+    REF_PATTERN(NAME, 1, 0, 0, 0),                                              \
+    REF_PATTERN(NAME, 1, 0, 0, 1),                                              \
+    REF_PATTERN(NAME, 1, 1, 1, 0),                                              \
+    REF_PATTERN(NAME, 1, 1, 1, 1)
+
+#define DECL_RW(EXT, TYPE, NAME, OP, ELEMS, PACKED, FRAC)                       \
+    DECL_ASM(TYPE, NAME##ELEMS##EXT,                                            \
+        .op.op = SWS_OP_##OP,                                                   \
+        .op.rw = { .elems = ELEMS, .packed = PACKED, .frac = FRAC },            \
+    );
+
+#define DECL_PACKED_RW(EXT, DEPTH)                                              \
+    DECL_RW(EXT, U##DEPTH, read##DEPTH##_packed,  READ,  2, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, read##DEPTH##_packed,  READ,  3, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, read##DEPTH##_packed,  READ,  4, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, write##DEPTH##_packed, WRITE, 2, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, write##DEPTH##_packed, WRITE, 3, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, write##DEPTH##_packed, WRITE, 4, true,  0)           \
+
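+/* Build a pshufb control vector that reverses the byte order within each
+ * element, e.g. 1,0,3,2,... for 16-bit and 3,2,1,0,7,6,5,4,... for 32-bit
+ * elements */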
+static int setup_swap_bytes(const SwsOp *op, SwsOpPriv *out)
+{
+    const int mask = ff_sws_pixel_type_size(op->type) - 1;
+    for (int i = 0; i < 16; i++)
+        out->u8[i] = (i & ~mask) | (mask - (i & mask));
+    return 0;
+}
+
+#define DECL_SWAP_BYTES(EXT, TYPE, X, Y, Z, W)                                  \
+    DECL_PATTERN(TYPE, swap_bytes_##TYPE##EXT, X, Y, Z, W,                      \
+        .func = ff_p##X##Y##Z##W##_shuffle##EXT,                                \
+        .op.op = SWS_OP_SWAP_BYTES,                                             \
+        .setup = setup_swap_bytes,                                              \
+    );
+
+#define DECL_CLEAR_ALPHA(EXT, IDX)                                              \
+    DECL_ASM(U8, clear_alpha##IDX##EXT,                                         \
+        .op.op = SWS_OP_CLEAR,                                                  \
+        .op.c.q4[IDX] = { .num = -1, .den = 1 },                                \
+        .op.comps.unused[IDX] = true,                                           \
+    );                                                                          \
+
+#define DECL_CLEAR_ZERO(EXT, IDX)                                               \
+    DECL_ASM(U8, clear_zero##IDX##EXT,                                          \
+        .op.op = SWS_OP_CLEAR,                                                  \
+        .op.c.q4[IDX] = { .num = 0, .den = 1 },                                 \
+        .op.comps.unused[IDX] = true,                                           \
+    );
+
+static int setup_clear(const SwsOp *op, SwsOpPriv *out)
+{
+    for (int i = 0; i < 4; i++)
+        out->u32[i] = (uint32_t) op->c.q4[i].num;
+    return 0;
+}
+
+#define DECL_CLEAR(EXT, X, Y, Z, W)                                             \
+    DECL_PATTERN(U8, clear##EXT, X, Y, Z, W,                                    \
+        .op.op = SWS_OP_CLEAR,                                                  \
+        .setup = setup_clear,                                                   \
+        .flexible = true,                                                       \
+    );
+
+#define DECL_SWIZZLE(EXT, X, Y, Z, W)                                           \
+    DECL_ASM(U8, swizzle_##X##Y##Z##W##EXT,                                     \
+        .op.op = SWS_OP_SWIZZLE,                                                \
+        .op.swizzle = SWS_SWIZZLE( X, Y, Z, W ),                                \
+    );
+
+#define DECL_CONVERT(EXT, FROM, TO)                                             \
+    DECL_COMMON_PATTERNS(FROM, convert_##FROM##_##TO##EXT,                      \
+        .op.op = SWS_OP_CONVERT,                                                \
+        .op.convert.to = SWS_PIXEL_##TO,                                        \
+    );
+
+#define DECL_EXPAND(EXT, FROM, TO)                                              \
+    DECL_COMMON_PATTERNS(FROM, expand_##FROM##_##TO##EXT,                       \
+        .op.op = SWS_OP_CONVERT,                                                \
+        .op.convert.to = SWS_PIXEL_##TO,                                        \
+        .op.convert.expand = true,                                              \
+    );
+
+static int setup_shift(const SwsOp *op, SwsOpPriv *out)
+{
+    out->u16[0] = op->c.u;
+    return 0;
+}
+
+#define DECL_SHIFT16(EXT)                                                       \
+    DECL_COMMON_PATTERNS(U16, lshift16##EXT,                                    \
+        .op.op = SWS_OP_LSHIFT,                                                 \
+        .setup = setup_shift,                                                   \
+    );                                                                          \
+                                                                                \
+    DECL_COMMON_PATTERNS(U16, rshift16##EXT,                                    \
+        .op.op = SWS_OP_RSHIFT,                                                 \
+        .setup = setup_shift,                                                   \
+    );
+
+#define DECL_MIN_MAX(EXT)                                                       \
+    DECL_COMMON_PATTERNS(F32, min##EXT,                                         \
+        .op.op = SWS_OP_MIN,                                                    \
+        .setup = ff_sws_setup_q4,                                               \
+        .flexible = true,                                                       \
+    );                                                                          \
+                                                                                \
+    DECL_COMMON_PATTERNS(F32, max##EXT,                                         \
+        .op.op = SWS_OP_MAX,                                                    \
+        .setup = ff_sws_setup_q4,                                               \
+        .flexible = true,                                                       \
+    );
+
+#define DECL_SCALE(EXT)                                                         \
+    DECL_COMMON_PATTERNS(F32, scale##EXT,                                       \
+        .op.op = SWS_OP_SCALE,                                                  \
+        .setup = ff_sws_setup_q,                                                \
+    );
+
+/* A 2x2 dither matrix fits inside SwsOpPriv directly, saving an indirection in this case */
+static_assert(sizeof(SwsOpPriv) >= sizeof(float[2][2]), "2x2 dither matrix too large");
+static int setup_dither(const SwsOp *op, SwsOpPriv *out)
+{
+    const int size = 1 << op->dither.size_log2;
+    float *matrix = out->f32;
+    if (size > 2) {
+        matrix = out->ptr = av_mallocz(size * size * sizeof(*matrix));
+        if (!matrix)
+            return AVERROR(ENOMEM);
+    }
+
+    for (int i = 0; i < size * size; i++)
+        matrix[i] = (float) op->dither.matrix[i].num / op->dither.matrix[i].den;
+
+    return 0;
+}
+
+#define DECL_DITHER(EXT, SIZE)                                                  \
+    DECL_COMMON_PATTERNS(F32, dither##SIZE##EXT,                                \
+        .op.op = SWS_OP_DITHER,                                                 \
+        .op.dither.size_log2 = SIZE,                                            \
+        .setup = setup_dither,                                                  \
+        .free  = SIZE > 1 ? av_free : NULL, /* setup allocates for 1 << SIZE > 2 */ \
+    );
+
+static int setup_linear(const SwsOp *op, SwsOpPriv *out)
+{
+    float *matrix = out->ptr = av_mallocz(sizeof(float[4][5]));
+    if (!matrix)
+        return AVERROR(ENOMEM);
+
+    for (int y = 0; y < 4; y++) {
+        for (int x = 0; x < 5; x++)
+            matrix[y * 5 + x] = (float) op->lin.m[y][x].num / op->lin.m[y][x].den;
+    }
+
+    return 0;
+}
+
+#define DECL_LINEAR(EXT, NAME, MASK)                                            \
+    DECL_ASM(F32, NAME##EXT,                                                    \
+        .op.op = SWS_OP_LINEAR,                                                 \
+        .op.lin.mask = (MASK),                                                  \
+        .setup = setup_linear,                                                  \
+        .free  = av_free,                                                       \
+    );
+
+#define DECL_FUNCS_8(SIZE, EXT, FLAG)                                           \
+    DECL_RW(EXT, U8, read_planar,   READ,  1, false, 0)                         \
+    DECL_RW(EXT, U8, read_planar,   READ,  2, false, 0)                         \
+    DECL_RW(EXT, U8, read_planar,   READ,  3, false, 0)                         \
+    DECL_RW(EXT, U8, read_planar,   READ,  4, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 1, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 2, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 3, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 4, false, 0)                         \
+    DECL_RW(EXT, U8, read_nibbles,  READ,  1, false, 1)                         \
+    DECL_RW(EXT, U8, read_bits,     READ,  1, false, 3)                         \
+    DECL_RW(EXT, U8, write_bits,    WRITE, 1, false, 3)                         \
+    DECL_PACKED_RW(EXT, 8)                                                      \
+    void ff_p1000_shuffle##EXT(void);                                           \
+    void ff_p1001_shuffle##EXT(void);                                           \
+    void ff_p1110_shuffle##EXT(void);                                           \
+    void ff_p1111_shuffle##EXT(void);                                           \
+    DECL_SWIZZLE(EXT, 3, 0, 1, 2)                                               \
+    DECL_SWIZZLE(EXT, 3, 0, 2, 1)                                               \
+    DECL_SWIZZLE(EXT, 2, 1, 0, 3)                                               \
+    DECL_SWIZZLE(EXT, 3, 2, 1, 0)                                               \
+    DECL_SWIZZLE(EXT, 3, 1, 0, 2)                                               \
+    DECL_SWIZZLE(EXT, 3, 2, 0, 1)                                               \
+    DECL_SWIZZLE(EXT, 1, 2, 0, 3)                                               \
+    DECL_SWIZZLE(EXT, 1, 0, 2, 3)                                               \
+    DECL_SWIZZLE(EXT, 2, 0, 1, 3)                                               \
+    DECL_SWIZZLE(EXT, 2, 3, 1, 0)                                               \
+    DECL_SWIZZLE(EXT, 2, 1, 3, 0)                                               \
+    DECL_SWIZZLE(EXT, 1, 2, 3, 0)                                               \
+    DECL_SWIZZLE(EXT, 1, 3, 2, 0)                                               \
+    DECL_SWIZZLE(EXT, 0, 2, 1, 3)                                               \
+    DECL_SWIZZLE(EXT, 0, 2, 3, 1)                                               \
+    DECL_SWIZZLE(EXT, 0, 3, 1, 2)                                               \
+    DECL_SWIZZLE(EXT, 3, 1, 2, 0)                                               \
+    DECL_SWIZZLE(EXT, 0, 3, 2, 1)                                               \
+    DECL_SWIZZLE(EXT, 0, 0, 0, 3)                                               \
+    DECL_SWIZZLE(EXT, 3, 0, 0, 0)                                               \
+    DECL_SWIZZLE(EXT, 0, 0, 0, 1)                                               \
+    DECL_SWIZZLE(EXT, 1, 0, 0, 0)                                               \
+    DECL_CLEAR_ALPHA(EXT, 0)                                                    \
+    DECL_CLEAR_ALPHA(EXT, 1)                                                    \
+    DECL_CLEAR_ALPHA(EXT, 3)                                                    \
+    DECL_CLEAR_ZERO(EXT, 0)                                                     \
+    DECL_CLEAR_ZERO(EXT, 1)                                                     \
+    DECL_CLEAR_ZERO(EXT, 3)                                                     \
+    DECL_CLEAR(EXT, 1, 1, 1, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 1, 1, 1)                                                 \
+    DECL_CLEAR(EXT, 0, 0, 1, 1)                                                 \
+    DECL_CLEAR(EXT, 1, 0, 0, 1)                                                 \
+    DECL_CLEAR(EXT, 1, 1, 0, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 1, 0, 1)                                                 \
+    DECL_CLEAR(EXT, 1, 0, 1, 0)                                                 \
+    DECL_CLEAR(EXT, 1, 0, 0, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 1, 0, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 0, 1, 0)                                                 \
+                                                                                \
+static const SwsOpTable ops8##EXT = {                                           \
+    .cpu_flags = AV_CPU_FLAG_##FLAG,                                            \
+    .block_size = SIZE,                                                         \
+    .entries = {                                                                \
+        op_read_planar1##EXT,                                                   \
+        op_read_planar2##EXT,                                                   \
+        op_read_planar3##EXT,                                                   \
+        op_read_planar4##EXT,                                                   \
+        op_write_planar1##EXT,                                                  \
+        op_write_planar2##EXT,                                                  \
+        op_write_planar3##EXT,                                                  \
+        op_write_planar4##EXT,                                                  \
+        op_read8_packed2##EXT,                                                  \
+        op_read8_packed3##EXT,                                                  \
+        op_read8_packed4##EXT,                                                  \
+        op_write8_packed2##EXT,                                                 \
+        op_write8_packed3##EXT,                                                 \
+        op_write8_packed4##EXT,                                                 \
+        op_read_nibbles1##EXT,                                                  \
+        op_read_bits1##EXT,                                                     \
+        op_write_bits1##EXT,                                                    \
+        op_swizzle_3012##EXT,                                                   \
+        op_swizzle_3021##EXT,                                                   \
+        op_swizzle_2103##EXT,                                                   \
+        op_swizzle_3210##EXT,                                                   \
+        op_swizzle_3102##EXT,                                                   \
+        op_swizzle_3201##EXT,                                                   \
+        op_swizzle_1203##EXT,                                                   \
+        op_swizzle_1023##EXT,                                                   \
+        op_swizzle_2013##EXT,                                                   \
+        op_swizzle_2310##EXT,                                                   \
+        op_swizzle_2130##EXT,                                                   \
+        op_swizzle_1230##EXT,                                                   \
+        op_swizzle_1320##EXT,                                                   \
+        op_swizzle_0213##EXT,                                                   \
+        op_swizzle_0231##EXT,                                                   \
+        op_swizzle_0312##EXT,                                                   \
+        op_swizzle_3120##EXT,                                                   \
+        op_swizzle_0321##EXT,                                                   \
+        op_swizzle_0003##EXT,                                                   \
+        op_swizzle_0001##EXT,                                                   \
+        op_swizzle_3000##EXT,                                                   \
+        op_swizzle_1000##EXT,                                                   \
+        op_clear_alpha0##EXT,                                                   \
+        op_clear_alpha1##EXT,                                                   \
+        op_clear_alpha3##EXT,                                                   \
+        op_clear_zero0##EXT,                                                    \
+        op_clear_zero1##EXT,                                                    \
+        op_clear_zero3##EXT,                                                    \
+        REF_PATTERN(clear##EXT, 1, 1, 1, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 1, 1, 1),                                    \
+        REF_PATTERN(clear##EXT, 0, 0, 1, 1),                                    \
+        REF_PATTERN(clear##EXT, 1, 0, 0, 1),                                    \
+        REF_PATTERN(clear##EXT, 1, 1, 0, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 1, 0, 1),                                    \
+        REF_PATTERN(clear##EXT, 1, 0, 1, 0),                                    \
+        REF_PATTERN(clear##EXT, 1, 0, 0, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 1, 0, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 0, 1, 0),                                    \
+        {{0}}                                                                   \
+    },                                                                          \
+};
+
+#define DECL_FUNCS_16(SIZE, EXT, FLAG)                                          \
+    DECL_PACKED_RW(EXT, 16)                                                     \
+    DECL_SWAP_BYTES(EXT, U16, 1, 0, 0, 0)                                       \
+    DECL_SWAP_BYTES(EXT, U16, 1, 0, 0, 1)                                       \
+    DECL_SWAP_BYTES(EXT, U16, 1, 1, 1, 0)                                       \
+    DECL_SWAP_BYTES(EXT, U16, 1, 1, 1, 1)                                       \
+    DECL_SHIFT16(EXT)                                                           \
+    DECL_CONVERT(EXT,  U8, U16)                                                 \
+    DECL_CONVERT(EXT, U16,  U8)                                                 \
+    DECL_EXPAND(EXT,   U8, U16)                                                 \
+                                                                                \
+static const SwsOpTable ops16##EXT = {                                          \
+    .cpu_flags = AV_CPU_FLAG_##FLAG,                                            \
+    .block_size = SIZE,                                                         \
+    .entries = {                                                                \
+        op_read16_packed2##EXT,                                                 \
+        op_read16_packed3##EXT,                                                 \
+        op_read16_packed4##EXT,                                                 \
+        op_write16_packed2##EXT,                                                \
+        op_write16_packed3##EXT,                                                \
+        op_write16_packed4##EXT,                                                \
+        REF_COMMON_PATTERNS(swap_bytes_U16##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U8_U16##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U16_U8##EXT),                               \
+        REF_COMMON_PATTERNS(expand_U8_U16##EXT),                                \
+        REF_COMMON_PATTERNS(lshift16##EXT),                                     \
+        REF_COMMON_PATTERNS(rshift16##EXT),                                     \
+        {{0}}                                                                   \
+    },                                                                          \
+};
+
+#define DECL_FUNCS_32(SIZE, EXT, FLAG)                                          \
+    DECL_PACKED_RW(_m2##EXT, 32)                                                \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 0, 0, 0)                                  \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 0, 0, 1)                                  \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 1, 1, 0)                                  \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 1, 1, 1)                                  \
+    DECL_CONVERT(EXT,  U8, U32)                                                 \
+    DECL_CONVERT(EXT, U32,  U8)                                                 \
+    DECL_CONVERT(EXT, U16, U32)                                                 \
+    DECL_CONVERT(EXT, U32, U16)                                                 \
+    DECL_CONVERT(EXT,  U8, F32)                                                 \
+    DECL_CONVERT(EXT, F32,  U8)                                                 \
+    DECL_CONVERT(EXT, U16, F32)                                                 \
+    DECL_CONVERT(EXT, F32, U16)                                                 \
+    DECL_EXPAND(EXT,   U8, U32)                                                 \
+    DECL_MIN_MAX(EXT)                                                           \
+    DECL_SCALE(EXT)                                                             \
+    DECL_DITHER(EXT, 0)                                                         \
+    DECL_DITHER(EXT, 1)                                                         \
+    DECL_DITHER(EXT, 2)                                                         \
+    DECL_DITHER(EXT, 3)                                                         \
+    DECL_DITHER(EXT, 4)                                                         \
+    DECL_LINEAR(EXT, luma,      SWS_MASK_LUMA)                                  \
+    DECL_LINEAR(EXT, alpha,     SWS_MASK_ALPHA)                                 \
+    DECL_LINEAR(EXT, lumalpha,  SWS_MASK_LUMA | SWS_MASK_ALPHA)                 \
+    DECL_LINEAR(EXT, dot3,      0b111)                                          \
+    DECL_LINEAR(EXT, row0,      SWS_MASK_ROW(0))                                \
+    DECL_LINEAR(EXT, row0a,     SWS_MASK_ROW(0) | SWS_MASK_ALPHA)               \
+    DECL_LINEAR(EXT, diag3,     SWS_MASK_DIAG3)                                 \
+    DECL_LINEAR(EXT, diag4,     SWS_MASK_DIAG4)                                 \
+    DECL_LINEAR(EXT, diagoff3,  SWS_MASK_DIAG3 | SWS_MASK_OFF3)                 \
+    DECL_LINEAR(EXT, matrix3,   SWS_MASK_MAT3)                                  \
+    DECL_LINEAR(EXT, affine3,   SWS_MASK_MAT3 | SWS_MASK_OFF3)                  \
+    DECL_LINEAR(EXT, affine3a,  SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA) \
+    DECL_LINEAR(EXT, matrix4,   SWS_MASK_MAT4)                                  \
+    DECL_LINEAR(EXT, affine4,   SWS_MASK_MAT4 | SWS_MASK_OFF4)                  \
+                                                                                \
+static const SwsOpTable ops32##EXT = {                                          \
+    .cpu_flags = AV_CPU_FLAG_##FLAG,                                            \
+    .block_size = SIZE,                                                         \
+    .entries = {                                                                \
+        op_read32_packed2_m2##EXT,                                              \
+        op_read32_packed3_m2##EXT,                                              \
+        op_read32_packed4_m2##EXT,                                              \
+        op_write32_packed2_m2##EXT,                                             \
+        op_write32_packed3_m2##EXT,                                             \
+        op_write32_packed4_m2##EXT,                                             \
+        REF_COMMON_PATTERNS(swap_bytes_U32_m2##EXT),                            \
+        REF_COMMON_PATTERNS(convert_U8_U32##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U32_U8##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U16_U32##EXT),                              \
+        REF_COMMON_PATTERNS(convert_U32_U16##EXT),                              \
+        REF_COMMON_PATTERNS(convert_U8_F32##EXT),                               \
+        REF_COMMON_PATTERNS(convert_F32_U8##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U16_F32##EXT),                              \
+        REF_COMMON_PATTERNS(convert_F32_U16##EXT),                              \
+        REF_COMMON_PATTERNS(expand_U8_U32##EXT),                                \
+        REF_COMMON_PATTERNS(min##EXT),                                          \
+        REF_COMMON_PATTERNS(max##EXT),                                          \
+        REF_COMMON_PATTERNS(scale##EXT),                                        \
+        REF_COMMON_PATTERNS(dither0##EXT),                                      \
+        REF_COMMON_PATTERNS(dither1##EXT),                                      \
+        REF_COMMON_PATTERNS(dither2##EXT),                                      \
+        REF_COMMON_PATTERNS(dither3##EXT),                                      \
+        REF_COMMON_PATTERNS(dither4##EXT),                                      \
+        op_luma##EXT,                                                           \
+        op_alpha##EXT,                                                          \
+        op_lumalpha##EXT,                                                       \
+        op_dot3##EXT,                                                           \
+        op_row0##EXT,                                                           \
+        op_row0a##EXT,                                                          \
+        op_diag3##EXT,                                                          \
+        op_diag4##EXT,                                                          \
+        op_diagoff3##EXT,                                                       \
+        op_matrix3##EXT,                                                        \
+        op_affine3##EXT,                                                        \
+        op_affine3a##EXT,                                                       \
+        op_matrix4##EXT,                                                        \
+        op_affine4##EXT,                                                        \
+        {{0}}                                                                   \
+    },                                                                          \
+};
+
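+/* _m1/_m2 pick how many vector registers' worth of pixels form one block:
+ * e.g. a 16-pixel u8 block fills one SSE4 XMM register, while a 64-pixel
+ * u8 block fills two AVX2 YMM registers. */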
+DECL_FUNCS_8(16, _m1_sse4, SSE4)
+DECL_FUNCS_8(32, _m1_avx2, AVX2)
+DECL_FUNCS_8(32, _m2_sse4, SSE4)
+DECL_FUNCS_8(64, _m2_avx2, AVX2)
+
+DECL_FUNCS_16(16, _m1_avx2, AVX2)
+DECL_FUNCS_16(32, _m2_avx2, AVX2)
+
+DECL_FUNCS_32(16, _avx2, AVX2)
+
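+/* Width in bytes of the widest available vector register (XMM or YMM) */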
+static av_const int get_mmsize(const int cpu_flags)
+{
+    if (cpu_flags & AV_CPU_FLAG_AVX2)
+        return 32;
+    else if (cpu_flags & AV_CPU_FLAG_SSE4)
+        return 16;
+    else
+        return AVERROR(ENOTSUP);
+}
+
+/**
+ * Returns true if the operation's implementation only depends on the block
+ * size, and not the underlying pixel type. Such ops can be lowered to the
+ * 8-bit kernels by scaling the block size with the element size, as done
+ * in compile().
+ */
+static bool op_is_type_invariant(const SwsOp *op)
+{
+    switch (op->op) {
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+        return !op->rw.packed && !op->rw.frac;
+    case SWS_OP_SWIZZLE:
+    case SWS_OP_CLEAR:
+        return true;
+    }
+
+    return false;
+}
+
+/* Tries to reduce a series of operations to an in-place shuffle mask.
+ * Returns 0 on success (filling in *out), AVERROR(ENOTSUP) if the list is
+ * not expressible as a shuffle, or another negative error code. */
+static int solve_shuffle(const SwsOpList *ops, int mmsize, SwsCompiledOp *out)
+{
+    if (!ops->num_ops || ops->ops[0].op != SWS_OP_READ)
+        return AVERROR(EINVAL);
+
+    const SwsOp read = ops->ops[0];
+    const int read_size = ff_sws_pixel_type_size(read.type);
+    uint32_t mask[4] = {0};
+
+    if (read.rw.frac || (!read.rw.packed && read.rw.elems > 1))
+        return AVERROR(ENOTSUP);
+
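+    /* mask[i] tracks, one byte offset per octet, where every byte of
+     * component i currently sits relative to the start of an input group;
+     * the loop below permutes these offsets instead of actual pixels. */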
+    for (int i = 0; i < read.rw.elems; i++)
+        mask[i] = 0x01010101 * i * read_size + 0x03020100;
+
+    for (int opidx = 1; opidx < ops->num_ops; opidx++) {
+        const SwsOp *op = &ops->ops[opidx];
+        switch (op->op) {
+        case SWS_OP_SWIZZLE: {
+            uint32_t orig[4] = { mask[0], mask[1], mask[2], mask[3] };
+            for (int i = 0; i < 4; i++)
+                mask[i] = orig[op->swizzle.in[i]];
+            break;
+        }
+
+        case SWS_OP_SWAP_BYTES:
+            for (int i = 0; i < 4; i++) {
+                switch (ff_sws_pixel_type_size(op->type)) {
+                case 2: mask[i] = av_bswap16(mask[i]); break;
+                case 4: mask[i] = av_bswap32(mask[i]); break;
+                }
+            }
+            break;
+
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (!op->c.q4[i].den)
+                    continue;
+                if (op->c.q4[i].num != 0)
+                    return AVERROR(ENOTSUP);
+                mask[i] = 0x80808080ul; /* pshufb implicit clear to zero */
+            }
+            break;
+
+        case SWS_OP_CONVERT: {
+            if (!op->convert.expand)
+                return AVERROR(ENOTSUP);
+            for (int i = 0; i < 4; i++) {
+                switch (ff_sws_pixel_type_size(op->type)) {
+                case 1: mask[i] = 0x01010101 * (mask[i] & 0xFF);   break;
+                case 2: mask[i] = 0x00010001 * (mask[i] & 0xFFFF); break;
+                }
+            }
+            break;
+        }
+
+        case SWS_OP_WRITE: {
+            if (op->rw.frac || !op->rw.packed)
+                return AVERROR(ENOTSUP);
+
+            /* Initialize all bytes to 0x80, which pshufb clears to zero;
+             * output bytes not assigned below thus read as zero */
+            uint8_t shuffle[16];
+            for (int i = 0; i < 16; i++)
+                shuffle[i] = 0x80;
+
+            const int write_size  = ff_sws_pixel_type_size(op->type);
+            const int read_chunk  = read.rw.elems * read_size;
+            const int write_chunk = op->rw.elems * write_size;
+            const int groups_per_lane = 16 / FFMAX(read_chunk, write_chunk);
+            for (int n = 0; n < groups_per_lane; n++) {
+                const int base_in  = n * read_chunk;
+                const int base_out = n * write_chunk;
+                for (int i = 0; i < op->rw.elems; i++) {
+                    const int offset = base_out + i * write_size;
+                    for (int b = 0; b < write_size; b++)
+                        shuffle[offset + b] = base_in + (mask[i] >> (b * 8));
+                }
+            }
+
+            const int in_per_lane  = groups_per_lane * read_chunk;
+            const int out_per_lane = groups_per_lane * write_chunk;
+            if (in_per_lane < 16 || out_per_lane < 16)
+                mmsize = 16; /* avoid cross-lane shuffle */
+
+            const int num_lanes = mmsize / 16;
+            const int in_total  = num_lanes * in_per_lane;
+            const int out_total = num_lanes * out_per_lane;
+            const int read_size = in_total <= 4 ? 4 : in_total <= 8 ? 8 : mmsize;
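+            /* The kernel may load read_size bytes and store a full vector
+             * per iteration even though only in_total/out_total bytes carry
+             * pixel data; report the slack so callers can pad buffers. */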
+            *out = (SwsCompiledOp) {
+                .priv       = av_memdup(shuffle, sizeof(shuffle)),
+                .free       = av_free,
+                .block_size = groups_per_lane * num_lanes,
+                .over_read  = read_size - in_total,
+                .over_write = mmsize - out_total,
+            };
+
+            if (!out->priv)
+                return AVERROR(ENOMEM);
+
+#define ASSIGN_SHUFFLE_FUNC(IN, OUT, EXT)                                       \
+do {                                                                            \
+    SWS_DECL_FUNC(ff_packed_shuffle##IN##_##OUT##_##EXT);                       \
+    if (in_total == IN && out_total == OUT)                                     \
+        out->func = ff_packed_shuffle##IN##_##OUT##_##EXT;                      \
+} while (0)
+
+            ASSIGN_SHUFFLE_FUNC( 5, 15, sse4);
+            ASSIGN_SHUFFLE_FUNC( 4, 16, sse4);
+            ASSIGN_SHUFFLE_FUNC( 2, 12, sse4);
+            ASSIGN_SHUFFLE_FUNC(10, 15, sse4);
+            ASSIGN_SHUFFLE_FUNC( 8, 16, sse4);
+            ASSIGN_SHUFFLE_FUNC( 4, 12, sse4);
+            ASSIGN_SHUFFLE_FUNC(15, 15, sse4);
+            ASSIGN_SHUFFLE_FUNC(12, 16, sse4);
+            ASSIGN_SHUFFLE_FUNC( 6, 12, sse4);
+            ASSIGN_SHUFFLE_FUNC(16, 12, sse4);
+            ASSIGN_SHUFFLE_FUNC(16, 16, sse4);
+            ASSIGN_SHUFFLE_FUNC( 8, 12, sse4);
+            ASSIGN_SHUFFLE_FUNC(12, 12, sse4);
+            ASSIGN_SHUFFLE_FUNC(32, 32, avx2);
+            av_assert1(out->func);
+            return 0;
+        }
+
+        default:
+            return AVERROR(ENOTSUP);
+        }
+    }
+
+    return AVERROR(EINVAL);
+}
+
+/* Normalize clear values into 32-bit integer constants */
+static void normalize_clear(SwsOp *op)
+{
+    static_assert(sizeof(uint32_t) == sizeof(int), "int size mismatch");
+    SwsOpPriv priv;
+    union {
+        uint32_t u32;
+        int i;
+    } c;
+
+    ff_sws_setup_q4(op, &priv);
+    for (int i = 0; i < 4; i++) {
+        if (!op->c.q4[i].den)
+            continue;
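+        /* Replicate sub-dword types across all 32 bits, so the kernels can
+         * splat the constant with a single dword broadcast */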
+        switch (ff_sws_pixel_type_size(op->type)) {
+        case 1: c.u32 = 0x1010101 * priv.u8[i]; break;
+        case 2: c.u32 = priv.u16[i] << 16 | priv.u16[i]; break;
+        case 4: c.u32 = priv.u32[i]; break;
+        }
+
+        op->c.q4[i].num = c.i;
+        op->c.q4[i].den = 1;
+    }
+}
+
+static int compile(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out)
+{
+    const int cpu_flags = av_get_cpu_flags();
+    const int mmsize = get_mmsize(cpu_flags);
+    if (mmsize < 0)
+        return mmsize;
+
+    av_assert1(ops->num_ops > 0);
+    const SwsOp read = ops->ops[0];
+    const SwsOp write = ops->ops[ops->num_ops - 1];
+    int ret;
+
+    /* Special fast path for in-place packed shuffle */
+    ret = solve_shuffle(ops, mmsize, out);
+    if (ret != AVERROR(ENOTSUP))
+        return ret;
+
+    SwsOpChain *chain = ff_sws_op_chain_alloc();
+    if (!chain)
+        return AVERROR(ENOMEM);
+
+    *out = (SwsCompiledOp) {
+        .priv = chain,
+        .free = (void (*)(void *)) ff_sws_op_chain_free,
+
+        /* Use at most two full vregs during the widest precision section */
+        .block_size = 2 * mmsize / ff_sws_op_list_max_size(ops),
+    };
+
+    /* 3-component reads/writes process one extra garbage word */
+    if (read.rw.packed && read.rw.elems == 3)
+        out->over_read = sizeof(uint32_t);
+    if (write.rw.packed && write.rw.elems == 3)
+        out->over_write = sizeof(uint32_t);
+
+    static const SwsOpTable *const tables[] = {
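+        /* Naming: bit depth of the ops, m1/m2 = one or two vector registers
+         * per component, plus the required instruction set */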
+        &ops8_m1_sse4,
+        &ops8_m1_avx2,
+        &ops8_m2_sse4,
+        &ops8_m2_avx2,
+        &ops16_m1_avx2,
+        &ops16_m2_avx2,
+        &ops32_avx2,
+    };
+
+    do {
+        int op_block_size = out->block_size;
+        SwsOp *op = &ops->ops[0];
+
+        if (op_is_type_invariant(op)) {
+            if (op->op == SWS_OP_CLEAR)
+                normalize_clear(op);
+            op_block_size *= ff_sws_pixel_type_size(op->type);
+            op->type = SWS_PIXEL_U8;
+        }
+
+        ret = ff_sws_op_compile_tables(tables, FF_ARRAY_ELEMS(tables), ops,
+                                       op_block_size, chain);
+    } while (ret == AVERROR(EAGAIN));
+    if (ret < 0) {
+        ff_sws_op_chain_free(chain);
+        return ret;
+    }
+
+    SWS_DECL_FUNC(ff_sws_process1_x86);
+    SWS_DECL_FUNC(ff_sws_process2_x86);
+    SWS_DECL_FUNC(ff_sws_process3_x86);
+    SWS_DECL_FUNC(ff_sws_process4_x86);
+
+    const int read_planes  = read.rw.packed  ? 1 : read.rw.elems;
+    const int write_planes = write.rw.packed ? 1 : write.rw.elems;
+    switch (FFMAX(read_planes, write_planes)) {
+    case 1: out->func = ff_sws_process1_x86; break;
+    case 2: out->func = ff_sws_process2_x86; break;
+    case 3: out->func = ff_sws_process3_x86; break;
+    case 4: out->func = ff_sws_process4_x86; break;
+    }
+
+    return ret;
+}
+
+SwsOpBackend backend_x86 = {
+    .name       = "x86",
+    .compile    = compile,
+};
diff --git a/libswscale/x86/ops_common.asm b/libswscale/x86/ops_common.asm
new file mode 100644
index 0000000000..15d171329d
--- /dev/null
+++ b/libswscale/x86/ops_common.asm
@@ -0,0 +1,208 @@
+;******************************************************************************
+;* Copyright (c) 2025 Niklas Haas
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "libavutil/x86/x86util.asm"
+
+struc SwsOpExec
+    .in0 resq 1
+    .in1 resq 1
+    .in2 resq 1
+    .in3 resq 1
+    .out0 resq 1
+    .out1 resq 1
+    .out2 resq 1
+    .out3 resq 1
+    .in_stride0 resq 1
+    .in_stride1 resq 1
+    .in_stride2 resq 1
+    .in_stride3 resq 1
+    .out_stride0 resq 1
+    .out_stride1 resq 1
+    .out_stride2 resq 1
+    .out_stride3 resq 1
+    .x resd 1
+    .y resd 1
+    .width resd 1
+    .height resd 1
+    .slice_y resd 1
+    .slice_h resd 1
+    .pixel_bits_in resd 1
+    .pixel_bits_out resd 1
+endstruc
+
+struc SwsOpImpl
+    .cont resb 16
+    .priv resb 16
+    .next resb 0
+endstruc
+
+; common macros for declaring operations
+%macro op 1 ; name
+    %ifdef X
+        %define ADD_PAT(name) p %+ X %+ Y %+ Z %+ W %+ _ %+ name
+    %else
+        %define ADD_PAT(name) name
+    %endif
+
+    %ifdef V2
+        %if V2
+            %define ADD_MUL(name) name %+ _m2
+        %else
+            %define ADD_MUL(name) name %+ _m1
+        %endif
+    %else
+        %define ADD_MUL(name) name
+    %endif
+
+    cglobal ADD_PAT(ADD_MUL(%1)), 0, 0, 16
+
+    %undef ADD_PAT
+    %undef ADD_MUL
+%endmacro
+
+%macro decl_v2 2+ ; v2, func
+    %xdefine V2 %1
+    %2
+    %undef V2
+%endmacro
+
+%macro decl_pattern 5+ ; X, Y, Z, W, func
+    %xdefine X %1
+    %xdefine Y %2
+    %xdefine Z %3
+    %xdefine W %4
+    %5
+    %undef X
+    %undef Y
+    %undef Z
+    %undef W
+%endmacro
+
+%macro decl_common_patterns 1+ ; func
+    decl_pattern 1, 0, 0, 0, %1 ; y
+    decl_pattern 1, 0, 0, 1, %1 ; ya
+    decl_pattern 1, 1, 1, 0, %1 ; yuv
+    decl_pattern 1, 1, 1, 1, %1 ; yuva
+%endmacro
+
+; common names for the internal calling convention
+%define mx      m0
+%define my      m1
+%define mz      m2
+%define mw      m3
+
+%define xmx     xm0
+%define xmy     xm1
+%define xmz     xm2
+%define xmw     xm3
+
+%define ymx     ym0
+%define ymy     ym1
+%define ymz     ym2
+%define ymw     ym3
+
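+; second bank of vectors, used by the V2 (two vregs per component) variants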
+%define mx2     m4
+%define my2     m5
+%define mz2     m6
+%define mw2     m7
+
+%define xmx2    xm4
+%define xmy2    xm5
+%define xmz2    xm6
+%define xmw2    xm7
+
+%define ymx2    ym4
+%define ymy2    ym5
+%define ymz2    ym6
+%define ymw2    ym7
+
+; from entry point signature
+%define execq   r0q
+%define implq   r1q
+%define blocksd r2d
+
+; extra registers for free use by kernels, not saved between ops
+%define tmp0q   r3q
+%define tmp1q   r4q
+%define tmp2q   r5q
+%define tmp3q   r6q
+
+%define tmp0d   r3d
+%define tmp1d   r4d
+%define tmp2d   r5d
+%define tmp3d   r6d
+
+; pinned static registers for plane pointers
+%define  in0q   r7q
+%define out0q   r8q
+%define  in1q   r9q
+%define out1q   r10q
+%define  in2q   r11q
+%define out2q   r12q
+%define  in3q   r13q
+%define out3q   r14q
+
+; load the next operation kernel
+%macro LOAD_CONT 1 ; reg
+    mov %1, [implq + SwsOpImpl.cont]
+%endmacro
+
+; tail call into the next operation kernel
+%macro CONTINUE 1 ; reg
+    add implq, SwsOpImpl.next
+    jmp %1
+    annotate_function_size
+%endmacro
+
+%macro CONTINUE 0
+    LOAD_CONT tmp0q
+    CONTINUE tmp0q
+%endmacro
+
+; return to entry point after write, avoids unnecessary vzeroupper
+%macro END_CHAIN 0
+    ret
+    annotate_function_size
+%endmacro
+
+; helper for inline conditionals
+%rmacro IF 2+ ; cond, body
+    %if %1
+        %2
+    %endif
+%endmacro
+
+; alternate name for nested usage to work around some NASM bugs
+%rmacro IF1 2+
+    %if %1
+        %2
+    %endif
+%endmacro
+
+; move at least N pixels
+%macro MOVSZ 2+ ; size, args
+    %if %1 <= 4
+        movd %2
+    %elif %1 <= 8
+        movq %2
+    %else
+        movu %2
+    %endif
+%endmacro
diff --git a/libswscale/x86/ops_float.asm b/libswscale/x86/ops_float.asm
new file mode 100644
index 0000000000..120ccc65b2
--- /dev/null
+++ b/libswscale/x86/ops_float.asm
@@ -0,0 +1,376 @@
+;******************************************************************************
+;* Copyright (c) 2025 Niklas Haas
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "ops_common.asm"
+
+SECTION .text
+
+;---------------------------------------------------------
+; Pixel type conversions
+
+%macro conv8to32f 0
+op convert_U8_F32
+        LOAD_CONT tmp0q
+IF X,   vpsrldq xmx2, xmx, 8
+IF Y,   vpsrldq xmy2, xmy, 8
+IF Z,   vpsrldq xmz2, xmz, 8
+IF W,   vpsrldq xmw2, xmw, 8
+IF X,   pmovzxbd mx, xmx
+IF Y,   pmovzxbd my, xmy
+IF Z,   pmovzxbd mz, xmz
+IF W,   pmovzxbd mw, xmw
+IF X,   pmovzxbd mx2, xmx2
+IF Y,   pmovzxbd my2, xmy2
+IF Z,   pmovzxbd mz2, xmz2
+IF W,   pmovzxbd mw2, xmw2
+IF X,   vcvtdq2ps mx, mx
+IF Y,   vcvtdq2ps my, my
+IF Z,   vcvtdq2ps mz, mz
+IF W,   vcvtdq2ps mw, mw
+IF X,   vcvtdq2ps mx2, mx2
+IF Y,   vcvtdq2ps my2, my2
+IF Z,   vcvtdq2ps mz2, mz2
+IF W,   vcvtdq2ps mw2, mw2
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv16to32f 0
+op convert_U16_F32
+        LOAD_CONT tmp0q
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   pmovzxwd mx, xmx
+IF Y,   pmovzxwd my, xmy
+IF Z,   pmovzxwd mz, xmz
+IF W,   pmovzxwd mw, xmw
+IF X,   pmovzxwd mx2, xmx2
+IF Y,   pmovzxwd my2, xmy2
+IF Z,   pmovzxwd mz2, xmz2
+IF W,   pmovzxwd mw2, xmw2
+IF X,   vcvtdq2ps mx, mx
+IF Y,   vcvtdq2ps my, my
+IF Z,   vcvtdq2ps mz, mz
+IF W,   vcvtdq2ps mw, mw
+IF X,   vcvtdq2ps mx2, mx2
+IF Y,   vcvtdq2ps my2, my2
+IF Z,   vcvtdq2ps mz2, mz2
+IF W,   vcvtdq2ps mw2, mw2
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32fto8 0
+op convert_F32_U8
+        LOAD_CONT tmp0q
+IF X,   cvttps2dq mx, mx
+IF Y,   cvttps2dq my, my
+IF Z,   cvttps2dq mz, mz
+IF W,   cvttps2dq mw, mw
+IF X,   cvttps2dq mx2, mx2
+IF Y,   cvttps2dq my2, my2
+IF Z,   cvttps2dq mz2, mz2
+IF W,   cvttps2dq mw2, mw2
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   packuswb xmx, xmx2
+IF Y,   packuswb xmy, xmy2
+IF Z,   packuswb xmz, xmz2
+IF W,   packuswb xmw, xmw2
+IF X,   vpshufd xmx, xmx, q3120
+IF Y,   vpshufd xmy, xmy, q3120
+IF Z,   vpshufd xmz, xmz, q3120
+IF W,   vpshufd xmw, xmw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32fto16 0
+op convert_F32_U16
+        LOAD_CONT tmp0q
+IF X,   cvttps2dq mx, mx
+IF Y,   cvttps2dq my, my
+IF Z,   cvttps2dq mz, mz
+IF W,   cvttps2dq mw, mw
+IF X,   cvttps2dq mx2, mx2
+IF Y,   cvttps2dq my2, my2
+IF Z,   cvttps2dq mz2, mz2
+IF W,   cvttps2dq mw2, mw2
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vpermq mx, mx, q3120
+IF Y,   vpermq my, my, q3120
+IF Z,   vpermq mz, mz, q3120
+IF W,   vpermq mw, mw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+%macro min_max 0
+op min
+IF X,   vbroadcastss m8,  [implq + SwsOpImpl.priv + 0]
+IF Y,   vbroadcastss m9,  [implq + SwsOpImpl.priv + 4]
+IF Z,   vbroadcastss m10, [implq + SwsOpImpl.priv + 8]
+IF W,   vbroadcastss m11, [implq + SwsOpImpl.priv + 12]
+        LOAD_CONT tmp0q
+IF X,   minps mx, m8
+IF Y,   minps my, m9
+IF Z,   minps mz, m10
+IF W,   minps mw, m11
+IF X,   minps mx2, m8
+IF Y,   minps my2, m9
+IF Z,   minps mz2, m10
+IF W,   minps mw2, m11
+        CONTINUE tmp0q
+
+op max
+IF X,   vbroadcastss m8,  [implq + SwsOpImpl.priv + 0]
+IF Y,   vbroadcastss m9,  [implq + SwsOpImpl.priv + 4]
+IF Z,   vbroadcastss m10, [implq + SwsOpImpl.priv + 8]
+IF W,   vbroadcastss m11, [implq + SwsOpImpl.priv + 12]
+        LOAD_CONT tmp0q
+IF X,   maxps mx, m8
+IF Y,   maxps my, m9
+IF Z,   maxps mz, m10
+IF W,   maxps mw, m11
+IF X,   maxps mx2, m8
+IF Y,   maxps my2, m9
+IF Z,   maxps mz2, m10
+IF W,   maxps mw2, m11
+        CONTINUE tmp0q
+%endmacro
+
+%macro scale 0
+op scale
+        vbroadcastss m8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   mulps mx, m8
+IF Y,   mulps my, m8
+IF Z,   mulps mz, m8
+IF W,   mulps mw, m8
+IF X,   mulps mx2, m8
+IF Y,   mulps my2, m8
+IF Z,   mulps mz2, m8
+IF W,   mulps mw2, m8
+        CONTINUE tmp0q
+%endmacro
+
+%macro load_dither_row 5 ; size_log2, y, addr, out, out2
+        lea tmp0q, %2
+        and tmp0q, (1 << %1) - 1
+        shl tmp0q, %1+2
+%if %1 == 2
+        VBROADCASTI128 %4, [%3 + tmp0q]
+%else
+        mova %4, [%3 + tmp0q]
+    %if (4 << %1) > mmsize
+        mova %5, [%3 + tmp0q + mmsize]
+    %endif
+%endif
+%endmacro
+
+%macro dither 1 ; size_log2
+op dither%1
+        %define DX  m8
+        %define DY  m9
+        %define DZ  m10
+        %define DW  m11
+        %define DX2 DX
+        %define DY2 DY
+        %define DZ2 DZ
+        %define DW2 DW
+%if %1 == 0
+        ; constant offset for all channels
+        vbroadcastss DX, [implq + SwsOpImpl.priv]
+        %define DY DX
+        %define DZ DX
+        %define DW DX
+%elif %1 == 1
+        ; 2x2 matrix, only sign of y matters
+        mov tmp0d, [execq + SwsOpExec.y]
+        and tmp0d, 1
+        shl tmp0d, 3
+    %if X || Y
+        vbroadcastsd DX, [implq + SwsOpImpl.priv + tmp0q]
+    %endif
+    %if Z || W
+        xor tmp0d, 8
+        vbroadcastsd DZ, [implq + SwsOpImpl.priv + tmp0q]
+    %endif
+        %define DY DX
+        %define DW DZ
+%else
+        ; matrix is at least 4x4, load all four channels with custom offset
+    %if (4 << %1) > mmsize
+        %define DX2 m12
+        %define DY2 m13
+        %define DZ2 m14
+        %define DW2 m15
+    %endif
+        mov tmp1d, [execq + SwsOpExec.y]
+        mov tmp2q, [implq + SwsOpImpl.priv]
+IF X,   load_dither_row %1, [tmp1d + 0], tmp2q, DX, DX2
+IF Y,   load_dither_row %1, [tmp1d + 3], tmp2q, DY, DY2
+IF Z,   load_dither_row %1, [tmp1d + 2], tmp2q, DZ, DZ2
+IF W,   load_dither_row %1, [tmp1d + 5], tmp2q, DW, DW2
+%endif
+        LOAD_CONT tmp0q
+IF X,   addps mx, DX
+IF Y,   addps my, DY
+IF Z,   addps mz, DZ
+IF W,   addps mw, DW
+IF X,   addps mx2, DX2
+IF Y,   addps my2, DY2
+IF Z,   addps mz2, DZ2
+IF W,   addps mw2, DW2
+        CONTINUE tmp0q
+%endmacro
+
+%macro dither_fns 0
+        dither 0
+        dither 1
+        dither 2
+        dither 3
+        dither 4
+%endmacro
+
+%xdefine MASK(I, J)  (1 << (5 * (I) + (J)))
+%xdefine MASK_OFF(I) MASK(I, 4)
+%xdefine MASK_ROW(I) (0b11111 << (5 * (I)))
+%xdefine MASK_COL(J) (0b1000010000100001 << J)
+%xdefine MASK_ALL    ((1 << 20) - 1)
+%xdefine MASK_LUMA   MASK(0, 0) | MASK_OFF(0)
+%xdefine MASK_ALPHA  MASK(3, 3) | MASK_OFF(3)
+%xdefine MASK_DIAG3  MASK(0, 0) | MASK(1, 1) | MASK(2, 2)
+%xdefine MASK_OFF3   MASK_OFF(0) | MASK_OFF(1) | MASK_OFF(2)
+%xdefine MASK_MAT3   MASK(0, 0) | MASK(0, 1) | MASK(0, 2) |\
+                     MASK(1, 0) | MASK(1, 1) | MASK(1, 2) |\
+                     MASK(2, 0) | MASK(2, 1) | MASK(2, 2)
+%xdefine MASK_DIAG4  MASK_DIAG3 | MASK(3, 3)
+%xdefine MASK_OFF4   MASK_OFF3 | MASK_OFF(3)
+%xdefine MASK_MAT4   MASK_ALL & ~MASK_OFF4
+
+%macro linear_row 7 ; res, x, y, z, w, row, mask
+%define COL(J) ((%7) & MASK(%6, J)) ; true if mask contains component J
+%define NOP(J) (J == %6 && !COL(J)) ; true if J is untouched input component
+
+    ; load weights
+    IF COL(0),  vbroadcastss m12,  [tmp0q + %6 * 20 + 0]
+    IF COL(1),  vbroadcastss m13,  [tmp0q + %6 * 20 + 4]
+    IF COL(2),  vbroadcastss m14,  [tmp0q + %6 * 20 + 8]
+    IF COL(3),  vbroadcastss m15,  [tmp0q + %6 * 20 + 12]
+
+    ; initialize result vector as appropriate
+    %if COL(4) ; offset
+        vbroadcastss %1, [tmp0q + %6 * 20 + 16]
+    %elif NOP(0)
+        ; directly reuse first component vector if possible
+        mova %1, %2
+    %else
+        xorps %1, %1
+    %endif
+
+    IF COL(0),  mulps m12, %2
+    IF COL(1),  mulps m13, %3
+    IF COL(2),  mulps m14, %4
+    IF COL(3),  mulps m15, %5
+    IF COL(0),  addps %1, m12
+    IF NOP(0) && COL(4), addps %1, %2 ; first vector was not reused
+    IF COL(1),  addps %1, m13
+    IF NOP(1),  addps %1, %3
+    IF COL(2),  addps %1, m14
+    IF NOP(2),  addps %1, %4
+    IF COL(3),  addps %1, m15
+    IF NOP(3),  addps %1, %5
+%endmacro
+
+%macro linear_inner 5 ; x, y, z, w, mask
+    %define ROW(I) ((%5) & MASK_ROW(I))
+    IF1 ROW(0), linear_row m8,  %1, %2, %3, %4, 0, %5
+    IF1 ROW(1), linear_row m9,  %1, %2, %3, %4, 1, %5
+    IF1 ROW(2), linear_row m10, %1, %2, %3, %4, 2, %5
+    IF1 ROW(3), linear_row m11, %1, %2, %3, %4, 3, %5
+    IF ROW(0),  mova %1, m8
+    IF ROW(1),  mova %2, m9
+    IF ROW(2),  mova %3, m10
+    IF ROW(3),  mova %4, m11
+%endmacro
+
+%macro linear_mask 2 ; name, mask
+op %1
+        mov tmp0q, [implq + SwsOpImpl.priv] ; address of matrix
+        linear_inner mx,  my,  mz,  mw,  %2
+        linear_inner mx2, my2, mz2, mw2, %2
+        CONTINUE
+%endmacro
+
+; specialized functions for very simple cases
+%macro linear_dot3 0
+op dot3
+        mov tmp0q, [implq + SwsOpImpl.priv]
+        vbroadcastss m12,  [tmp0q + 0]
+        vbroadcastss m13,  [tmp0q + 4]
+        vbroadcastss m14,  [tmp0q + 8]
+        LOAD_CONT tmp0q
+        mulps mx, m12
+        mulps m8, my, m13
+        mulps m9, mz, m14
+        addps mx, m8
+        addps mx, m9
+        mulps mx2, m12
+        mulps m10, my2, m13
+        mulps m11, mz2, m14
+        addps mx2, m10
+        addps mx2, m11
+        CONTINUE tmp0q
+%endmacro
+
+%macro linear_fns 0
+        linear_dot3
+        linear_mask luma,       MASK_LUMA
+        linear_mask alpha,      MASK_ALPHA
+        linear_mask lumalpha,   MASK_LUMA | MASK_ALPHA
+        linear_mask row0,       MASK_ROW(0)
+        linear_mask row0a,      MASK_ROW(0) | MASK_ALPHA
+        linear_mask diag3,      MASK_DIAG3
+        linear_mask diag4,      MASK_DIAG4
+        linear_mask diagoff3,   MASK_DIAG3 | MASK_OFF3
+        linear_mask matrix3,    MASK_MAT3
+        linear_mask affine3,    MASK_MAT3 | MASK_OFF3
+        linear_mask affine3a,   MASK_MAT3 | MASK_OFF3 | MASK_ALPHA
+        linear_mask matrix4,    MASK_MAT4
+        linear_mask affine4,    MASK_MAT4 | MASK_OFF4
+%endmacro
+
+INIT_YMM avx2
+decl_common_patterns conv8to32f
+decl_common_patterns conv16to32f
+decl_common_patterns conv32fto8
+decl_common_patterns conv32fto16
+decl_common_patterns min_max
+decl_common_patterns scale
+decl_common_patterns dither_fns
+linear_fns
diff --git a/libswscale/x86/ops_int.asm b/libswscale/x86/ops_int.asm
new file mode 100644
index 0000000000..3f995d71e2
--- /dev/null
+++ b/libswscale/x86/ops_int.asm
@@ -0,0 +1,882 @@
+;******************************************************************************
+;* Copyright (c) 2025 Niklas Haas
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "ops_common.asm"
+
+SECTION_RODATA
+
+expand16_shuf:  db   0,  0,  2,  2,  4,  4,  6,  6,  8,  8, 10, 10, 12, 12, 14, 14
+expand32_shuf:  db   0,  0,  0,  0,  4,  4,  4,  4,  8,  8,  8,  8, 12, 12, 12, 12
+
+read8_unpack2:  db   0,  2,  4,  6,  8, 10, 12, 14,  1,  3,  5,  7,  9, 11, 13, 15
+read8_unpack3:  db   0,  3,  6,  9,  1,  4,  7, 10,  2,  5,  8, 11, -1, -1, -1, -1
+read8_unpack4:  db   0,  4,  8, 12,  1,  5,  9, 13,  2,  6, 10, 14,  3,  7, 11, 15
+read16_unpack2: db   0,  1,  4,  5,  8,  9, 12, 13,  2,  3,  6,  7, 10, 11, 14, 15
+read16_unpack3: db   0,  1,  6,  7,  2,  3,  8,  9,  4,  5, 10, 11, -1, -1, -1, -1
+read16_unpack4: db   0,  1,  8,  9,  2,  3, 10, 11,  4,  5, 12, 13,  6,  7, 14, 15
+write8_pack2:   db   0,  8,  1,  9,  2, 10,  3, 11,  4, 12,  5, 13,  6, 14,  7, 15
+write8_pack3:   db   0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11, -1, -1, -1, -1
+write16_pack3:  db   0,  1,  4,  5,  8,  9,  2,  3,  6,  7, 10, 11, -1, -1, -1, -1
+
+%define write8_pack4  read8_unpack4
+%define write16_pack4 read16_unpack2
+%define write16_pack2 read16_unpack4
+
+align 32
+bits_shuf:      db   0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1, \
+                     2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,  3
+bits_mask:      db 128, 64, 32, 16,  8,  4,  2,  1,128, 64, 32, 16,  8,  4,  2,  1
+bits_reverse:   db   7,  6,  5,  4,  3,  2,  1,  0, 15, 14, 13, 12, 11, 10,  9,  8
+
+nibble_mask:   times 16 db 0x0F
+ones_mask:     times 16 db 0x01
+
+SECTION .text
+
+;---------------------------------------------------------
+; Global entry point
+
+%macro process_fn 1 ; num_planes
+cglobal sws_process%1_x86, 6, 7 + 2 * %1, 16
+            ; set up static registers
+            mov in0q,  [execq + SwsOpExec.in0]
+IF %1 > 1,  mov in1q,  [execq + SwsOpExec.in1]
+IF %1 > 2,  mov in2q,  [execq + SwsOpExec.in2]
+IF %1 > 3,  mov in3q,  [execq + SwsOpExec.in3]
+            mov out0q, [execq + SwsOpExec.out0]
+IF %1 > 1,  mov out1q, [execq + SwsOpExec.out1]
+IF %1 > 2,  mov out2q, [execq + SwsOpExec.out2]
+IF %1 > 3,  mov out3q, [execq + SwsOpExec.out3]
+            push implq
+.loop:
+            mov tmp0q, [implq + SwsOpImpl.cont]
+            add implq, SwsOpImpl.next
+            call tmp0q
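+            ; kernels advance implq as they execute; restore the start of the
+            ; chain before processing the next block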
+            mov implq, [rsp + 0]
+            dec blocksd
+            jg .loop
+
+            ; clean up
+            add rsp, 8
+            RET
+%endmacro
+
+process_fn 1
+process_fn 2
+process_fn 3
+process_fn 4
+
+;---------------------------------------------------------
+; Planar reads / writes
+
+%macro read_planar 1 ; elems
+op read_planar%1
+            movu mx, [in0q]
+IF %1 > 1,  movu my, [in1q]
+IF %1 > 2,  movu mz, [in2q]
+IF %1 > 3,  movu mw, [in3q]
+%if V2
+            movu mx2, [in0q + mmsize]
+IF %1 > 1,  movu my2, [in1q + mmsize]
+IF %1 > 2,  movu mz2, [in2q + mmsize]
+IF %1 > 3,  movu mw2, [in3q + mmsize]
+%endif
+            LOAD_CONT tmp0q
+            add in0q, mmsize * (1 + V2)
+IF %1 > 1,  add in1q, mmsize * (1 + V2)
+IF %1 > 2,  add in2q, mmsize * (1 + V2)
+IF %1 > 3,  add in3q, mmsize * (1 + V2)
+            CONTINUE tmp0q
+%endmacro
+
+%macro write_planar 1 ; elems
+op write_planar%1
+            movu [out0q], mx
+IF %1 > 1,  movu [out1q], my
+IF %1 > 2,  movu [out2q], mz
+IF %1 > 3,  movu [out3q], mw
+%if V2
+            movu [out0q + mmsize], mx2
+IF %1 > 1,  movu [out1q + mmsize], my2
+IF %1 > 2,  movu [out2q + mmsize], mz2
+IF %1 > 3,  movu [out3q + mmsize], mw2
+%endif
+            add out0q, mmsize * (1 + V2)
+IF %1 > 1,  add out1q, mmsize * (1 + V2)
+IF %1 > 2,  add out2q, mmsize * (1 + V2)
+IF %1 > 3,  add out3q, mmsize * (1 + V2)
+            END_CHAIN
+%endmacro
+
+%macro read_packed2 1 ; depth
+op read%1_packed2
+            movu m8,  [in0q + 0*mmsize]
+            movu m9,  [in0q + 1*mmsize]
+    IF V2,  movu m10, [in0q + 2*mmsize]
+    IF V2,  movu m11, [in0q + 3*mmsize]
+IF %1 < 32, VBROADCASTI128 m12, [read%1_unpack2]
+            LOAD_CONT tmp0q
+            add in0q, mmsize * (2 + V2 * 2)
+%if %1 == 32
+            shufps m8, m8, q3120
+            shufps m9, m9, q3120
+    IF V2,  shufps m10, m10, q3120
+    IF V2,  shufps m11, m11, q3120
+%else
+            pshufb m8, m12              ; { X0 Y0 | X1 Y1 }
+            pshufb m9, m12              ; { X2 Y2 | X3 Y3 }
+    IF V2,  pshufb m10, m12
+    IF V2,  pshufb m11, m12
+%endif
+            unpcklpd mx, m8, m9         ; { X0 X2 | X1 X3 }
+            unpckhpd my, m8, m9         ; { Y0 Y2 | Y1 Y3 }
+    IF V2,  unpcklpd mx2, m10, m11
+    IF V2,  unpckhpd my2, m10, m11
+%if avx_enabled
+            vpermq mx, mx, q3120       ; { X0 X1 | X2 X3 }
+            vpermq my, my, q3120       ; { Y0 Y1 | Y2 Y3 }
+    IF V2,  vpermq mx2, mx2, q3120
+    IF V2,  vpermq my2, my2, q3120
+%endif
+            CONTINUE tmp0q
+%endmacro
+
+%macro write_packed2 1 ; depth
+op write%1_packed2
+IF %1 < 32, VBROADCASTI128 m12, [write%1_pack2]
+            LOAD_CONT tmp0q
+%if avx_enabled
+            vpermq mx, mx, q3120       ; { X0 X2 | X1 X3 }
+            vpermq my, my, q3120       ; { Y0 Y2 | Y1 Y3 }
+    IF V2,  vpermq mx2, mx2, q3120
+    IF V2,  vpermq my2, my2, q3120
+%endif
+            unpcklpd m8, mx, my        ; { X0 Y0 | X1 Y1 }
+            unpckhpd m9, mx, my        ; { X2 Y2 | X3 Y3 }
+    IF V2,  unpcklpd m10, mx2, my2
+    IF V2,  unpckhpd m11, mx2, my2
+%if %1 == 32
+            shufps m8, m8, q3120
+            shufps m9, m9, q3120
+    IF V2,  shufps m10, m10, q3120
+    IF V2,  shufps m11, m11, q3120
+%else
+            pshufb m8, m12
+            pshufb m9, m12
+    IF V2,  pshufb m10, m12
+    IF V2,  pshufb m11, m12
+%endif
+            movu [out0q + 0*mmsize], m8
+            movu [out0q + 1*mmsize], m9
+IF V2,      movu [out0q + 2*mmsize], m10
+IF V2,      movu [out0q + 3*mmsize], m11
+            add out0q, mmsize * (2 + V2 * 2)
+            END_CHAIN
+%endmacro
+
+%macro read_packed_inner 7 ; x, y, z, w, addr, num, depth
+            movu xm8,  [%5 + 0  * %6]
+            movu xm9,  [%5 + 4  * %6]
+            movu xm10, [%5 + 8  * %6]
+            movu xm11, [%5 + 12 * %6]
+    %if avx_enabled
+            vinserti128 m8,  m8,  [%5 + 16 * %6], 1
+            vinserti128 m9,  m9,  [%5 + 20 * %6], 1
+            vinserti128 m10, m10, [%5 + 24 * %6], 1
+            vinserti128 m11, m11, [%5 + 28 * %6], 1
+    %endif
+    %if %7 == 32
+            mova %1, m8
+            mova %2, m9
+            mova %3, m10
+            mova %4, m11
+    %else
+            pshufb %1, m8,  m12         ; { X0 Y0 Z0 W0 | X4 Y4 Z4 W4 }
+            pshufb %2, m9,  m12         ; { X1 Y1 Z1 W1 | X5 Y5 Z5 W5 }
+            pshufb %3, m10, m12         ; { X2 Y2 Z2 W2 | X6 Y6 Z6 W6 }
+            pshufb %4, m11, m12         ; { X3 Y3 Z3 W3 | X7 Y7 Z7 W7 }
+    %endif
+            punpckldq m8,  %1, %2       ; { X0 X1 Y0 Y1 | X4 X5 Y4 Y5 }
+            punpckldq m9,  %3, %4       ; { X2 X3 Y2 Y3 | X6 X7 Y6 Y7 }
+            punpckhdq m10, %1, %2       ; { Z0 Z1 W0 W1 | Z4 Z5 W4 W5 }
+            punpckhdq m11, %3, %4       ; { Z2 Z3 W2 W3 | Z6 Z7 W6 W7 }
+            punpcklqdq %1, m8, m9       ; { X0 X1 X2 X3 | X4 X5 X6 X7 }
+            punpckhqdq %2, m8, m9       ; { Y0 Y1 Y2 Y3 | Y4 Y5 Y6 Y7 }
+            punpcklqdq %3, m10, m11     ; { Z0 Z1 Z2 Z3 | Z4 Z5 Z6 Z7 }
+IF %6 > 3,  punpckhqdq %4, m10, m11     ; { W0 W1 W2 W3 | W4 W5 W6 W7 }
+%endmacro
+
+%macro read_packed 2 ; num, depth
+op read%2_packed%1
+IF %2 < 32, VBROADCASTI128 m12, [read%2_unpack%1]
+            LOAD_CONT tmp0q
+            read_packed_inner mx, my, mz, mw, in0q, %1, %2
+IF1 V2,     read_packed_inner mx2, my2, mz2, mw2, in0q + %1 * mmsize, %1, %2
+            add in0q, %1 * mmsize * (1 + V2)
+            CONTINUE tmp0q
+%endmacro
+
+%macro write_packed_inner 7 ; x, y, z, w, addr, num, depth
+        punpckldq m8,  %1, %2       ; { X0 Y0 X1 Y1 | X4 Y4 X5 Y5 }
+        punpckldq m9,  %3, %4       ; { Z0 W0 Z1 W1 | Z4 W4 Z5 W5 }
+        punpckhdq m10, %1, %2       ; { X2 Y2 X3 Y3 | X6 Y6 X7 Y7 }
+        punpckhdq m11, %3, %4       ; { Z2 W2 Z3 W3 | Z6 W6 Z7 W7 }
+        punpcklqdq %1, m8, m9       ; { X0 Y0 Z0 W0 | X4 Y4 Z4 W4 }
+        punpckhqdq %2, m8, m9       ; { X1 Y1 Z1 W1 | X5 Y5 Z5 W5 }
+        punpcklqdq %3, m10, m11     ; { X2 Y2 Z2 W2 | X6 Y6 Z6 W6 }
+        punpckhqdq %4, m10, m11     ; { X3 Y3 Z3 W3 | X7 Y7 Z7 W7 }
+    %if %7 == 32
+        mova m8,  %1
+        mova m9,  %2
+        mova m10, %3
+        mova m11, %4
+    %else
+        pshufb m8,  %1, m12
+        pshufb m9,  %2, m12
+        pshufb m10, %3, m12
+        pshufb m11, %4, m12
+    %endif
+        movu [%5 +  0*%6], xm8
+        movu [%5 +  4*%6], xm9
+        movu [%5 +  8*%6], xm10
+        movu [%5 + 12*%6], xm11
+    %if avx_enabled
+        vextracti128 [%5 + 16*%6], m8, 1
+        vextracti128 [%5 + 20*%6], m9, 1
+        vextracti128 [%5 + 24*%6], m10, 1
+        vextracti128 [%5 + 28*%6], m11, 1
+    %endif
+%endmacro
+
+%macro write_packed 2 ; num, depth
+op write%2_packed%1
+IF %2 < 32, VBROADCASTI128 m12, [write%2_pack%1]
+            write_packed_inner mx, my, mz, mw, out0q, %1, %2
+IF1 V2,     write_packed_inner mx2, my2, mz2, mw2, out0q + %1 * mmsize, %1, %2
+            add out0q, %1 * mmsize * (1 + V2)
+            END_CHAIN
+%endmacro
+
+%macro rw_packed 1 ; depth
+        read_packed2 %1
+        read_packed 3, %1
+        read_packed 4, %1
+        write_packed2 %1
+        write_packed 3, %1
+        write_packed 4, %1
+%endmacro
+
+%macro read_nibbles 0
+op read_nibbles1
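+        ; expand packed 4-bit nibbles into bytes, high nibble first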
+%if avx_enabled
+        movu xmx,  [in0q]
+IF V2,  movu xmx2, [in0q + 16]
+%else
+        movq xmx,  [in0q]
+IF V2,  movq xmx2, [in0q + 8]
+%endif
+        VBROADCASTI128 m8, [nibble_mask]
+        LOAD_CONT tmp0q
+        add in0q, (mmsize >> 1) * (1 + V2)
+        pmovzxbw mx, xmx
+IF V2,  pmovzxbw mx2, xmx2
+        psllw my, mx, 8
+IF V2,  psllw my2, mx2, 8
+        psrlw mx, 4
+IF V2,  psrlw mx2, 4
+        pand my, m8
+IF V2,  pand my2, m8
+        por mx, my
+IF V2,  por mx2, my2
+        CONTINUE tmp0q
+%endmacro
+
+%macro read_bits 0
+op read_bits1
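+        ; expand each bit into a full byte with value 0 or 1, MSB first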
+%if avx_enabled
+        vpbroadcastd mx,  [in0q]
+IF V2,  vpbroadcastd mx2, [in0q + 4]
+%else
+        movd mx, [in0q]
+IF V2,  movd mx2, [in0q + 2]
+%endif
+        mova m8, [bits_shuf]
+        VBROADCASTI128 m9,  [bits_mask]
+        VBROADCASTI128 m10, [ones_mask]
+        LOAD_CONT tmp0q
+        add in0q, (mmsize >> 3) * (1 + V2)
+        pshufb mx,  m8
+IF V2,  pshufb mx2, m8
+        pand mx,  m9
+IF V2,  pand mx2, m9
+        pcmpeqb mx,  m9
+IF V2,  pcmpeqb mx2, m9
+        pand mx,  m10
+IF V2,  pand mx2, m10
+        CONTINUE tmp0q
+%endmacro
+
+%macro write_bits 0
+op write_bits1
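+        ; pack 0/1 bytes back into bits: shift bit 0 into each byte's sign
+        ; bit, reverse to MSB-first order, then extract with pmovmskb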
+        VBROADCASTI128 m8, [bits_reverse]
+        psllw mx,  7
+IF V2,  psllw mx2, 7
+        pshufb mx,  m8
+IF V2,  pshufb mx2, m8
+        pmovmskb tmp0d, mx
+IF V2,  pmovmskb tmp1d, mx2
+%if avx_enabled
+        mov [out0q],     tmp0d
+IF V2,  mov [out0q + 4], tmp1d
+%else
+        mov [out0q],     tmp0d
+IF V2,  mov [out0q + 2], tmp1d
+%endif
+        add out0q, (mmsize >> 3) * (1 + V2)
+        END_CHAIN
+%endmacro
+
+;---------------------------------------------------------
+; Generic byte order shuffle (packed swizzle, endian, etc)
+
+%macro shuffle 0
+op shuffle
+        VBROADCASTI128 m8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   pshufb mx, m8
+IF Y,   pshufb my, m8
+IF Z,   pshufb mz, m8
+IF W,   pshufb mw, m8
+%if V2
+IF X,   pshufb mx2, m8
+IF Y,   pshufb my2, m8
+IF Z,   pshufb mz2, m8
+IF W,   pshufb mw2, m8
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Clearing
+
+%macro clear_alpha 3 ; idx, vreg, vreg2
+op clear_alpha%1
+        LOAD_CONT tmp0q
+        pcmpeqb %2, %2
+IF V2,  mova %3, %2
+        CONTINUE tmp0q
+%endmacro
+
+%macro clear_zero 3 ; idx, vreg, vreg2
+op clear_zero%1
+        LOAD_CONT tmp0q
+        pxor %2, %2
+IF V2,  mova %3, %2
+        CONTINUE tmp0q
+%endmacro
+
+%macro clear_generic 0
+op clear
+            LOAD_CONT tmp0q
+%if avx_enabled
+    IF !X,  vpbroadcastd mx, [implq + SwsOpImpl.priv + 0]
+    IF !Y,  vpbroadcastd my, [implq + SwsOpImpl.priv + 4]
+    IF !Z,  vpbroadcastd mz, [implq + SwsOpImpl.priv + 8]
+    IF !W,  vpbroadcastd mw, [implq + SwsOpImpl.priv + 12]
+%else ; !avx_enabled
+    IF !X,  movd mx, [implq + SwsOpImpl.priv + 0]
+    IF !Y,  movd my, [implq + SwsOpImpl.priv + 4]
+    IF !Z,  movd mz, [implq + SwsOpImpl.priv + 8]
+    IF !W,  movd mw, [implq + SwsOpImpl.priv + 12]
+    IF !X,  pshufd mx, mx, 0
+    IF !Y,  pshufd my, my, 0
+    IF !Z,  pshufd mz, mz, 0
+    IF !W,  pshufd mw, mw, 0
+%endif
+%if V2
+    IF !X,  mova mx2, mx
+    IF !Y,  mova my2, my
+    IF !Z,  mova mz2, mz
+    IF !W,  mova mw2, mw
+%endif
+            CONTINUE tmp0q
+%endmacro
+
+%macro clear_funcs 0
+        decl_pattern 1, 1, 1, 0, clear_generic
+        decl_pattern 0, 1, 1, 1, clear_generic
+        decl_pattern 0, 0, 1, 1, clear_generic
+        decl_pattern 1, 0, 0, 1, clear_generic
+        decl_pattern 1, 1, 0, 0, clear_generic
+        decl_pattern 0, 1, 0, 1, clear_generic
+        decl_pattern 1, 0, 1, 0, clear_generic
+        decl_pattern 1, 0, 0, 0, clear_generic
+        decl_pattern 0, 1, 0, 0, clear_generic
+        decl_pattern 0, 0, 1, 0, clear_generic
+%endmacro
+
+;---------------------------------------------------------
+; Swizzling and duplicating
+
+; mA := mB, mB := mC, ... mX := mA
+%macro vrotate 2-* ; A, B, C, ...
+    %rep %0
+        %assign rot_a %1 + 4
+        %assign rot_b %2 + 4
+        mova m%1, m%2
+        IF V2, mova m%[rot_a], m%[rot_b]
+    %rotate 1
+    %endrep
+    %undef rot_a
+    %undef rot_b
+%endmacro
+
+%macro swizzle_funcs 0
+op swizzle_3012
+    LOAD_CONT tmp0q
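+; one entry per compiled op chain: .cont holds the address of the next kernel
+; to tail-call into, .priv holds 16 bytes of inline private data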
+    vrotate 8, 0, 3, 2, 1
+    CONTINUE tmp0q
+
+op swizzle_3021
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 1
+    CONTINUE tmp0q
+
+op swizzle_2103
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2
+    CONTINUE tmp0q
+
+op swizzle_3210
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3
+    vrotate 8, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_3102
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 2
+    CONTINUE tmp0q
+
+op swizzle_3201
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_1203
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_1023
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1
+    CONTINUE tmp0q
+
+op swizzle_2013
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2, 1
+    CONTINUE tmp0q
+
+op swizzle_2310
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2, 1, 3
+    CONTINUE tmp0q
+
+op swizzle_2130
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2, 3
+    CONTINUE tmp0q
+
+op swizzle_1230
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1, 2, 3
+    CONTINUE tmp0q
+
+op swizzle_1320
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1, 3
+    CONTINUE tmp0q
+
+op swizzle_0213
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_0231
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 2, 3
+    CONTINUE tmp0q
+
+op swizzle_0312
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 3, 2
+    CONTINUE tmp0q
+
+op swizzle_3120
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3
+    CONTINUE tmp0q
+
+op swizzle_0321
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 3
+    CONTINUE tmp0q
+
+op swizzle_0003
+    LOAD_CONT tmp0q
+    mova my, mx
+    mova mz, mx
+%if V2
+    mova my2, mx2
+    mova mz2, mx2
+%endif
+    CONTINUE tmp0q
+
+op swizzle_0001
+    LOAD_CONT tmp0q
+    mova mw, my
+    mova mz, mx
+    mova my, mx
+%if V2
+    mova mw2, my2
+    mova mz2, mx2
+    mova my2, mx2
+%endif
+    CONTINUE tmp0q
+
+op swizzle_3000
+    LOAD_CONT tmp0q
+    mova my, mx
+    mova mz, mx
+    mova mx, mw
+    mova mw, my
+%if V2
+    mova my2, mx2
+    mova mz2, mx2
+    mova mx2, mw2
+    mova mw2, my2
+%endif
+    CONTINUE tmp0q
+
+op swizzle_1000
+    LOAD_CONT tmp0q
+    mova mz, mx
+    mova mw, mx
+    mova mx, my
+    mova my, mz
+%if V2
+    mova mz2, mx2
+    mova mw2, mx2
+    mova mx2, my2
+    mova my2, mz2
+%endif
+    CONTINUE tmp0q
+%endmacro
+
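+; standalone entry point for the packed shuffle fast path: reads size_in
+; bytes and writes size_out bytes per block through a single pshufb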
+%macro packed_shuffle 2-3 ; size_in, size_out, shift
+cglobal packed_shuffle%1_%2, 3, 5, 2, exec, shuffle, blocks, src, dst
+            mov srcq, [execq + SwsOpExec.in0]
+            mov dstq, [execq + SwsOpExec.out0]
+            VBROADCASTI128 m1, [shuffleq]
+    %ifnum %3
+            shl blocksd, %3
+    %else
+            imul blocksd, %2
+    %endif
+IF %1==%2,  add srcq, blocksq
+            add dstq, blocksq
+            neg blocksq
+.loop:
+    %if %1 == %2
+            MOVSZ %1, m0, [srcq + blocksq]
+    %else
+            MOVSZ %1, m0, [srcq]
+    %endif
+            pshufb m0, m1
+            movu [dstq + blocksq], m0
+IF %1!=%2,  add srcq, %1
+            add blocksq, %2
+            jl .loop
+            RET
+%endmacro
+
+;---------------------------------------------------------
+; Pixel type conversions
+
+%macro conv8to16 1 ; type
+op %1_U8_U16
+            LOAD_CONT tmp0q
+%if V2
+    %if avx_enabled
+    IF X,   vextracti128 xmx2, mx, 1
+    IF Y,   vextracti128 xmy2, my, 1
+    IF Z,   vextracti128 xmz2, mz, 1
+    IF W,   vextracti128 xmw2, mw, 1
+    %else
+    IF X,   psrldq xmx2, mx, 8
+    IF Y,   psrldq xmy2, my, 8
+    IF Z,   psrldq xmz2, mz, 8
+    IF W,   psrldq xmw2, mw, 8
+    %endif
+    IF X,   pmovzxbw mx2, xmx2
+    IF Y,   pmovzxbw my2, xmy2
+    IF Z,   pmovzxbw mz2, xmz2
+    IF W,   pmovzxbw mw2, xmw2
+%endif ; V2
+    IF X,   pmovzxbw mx, xmx
+    IF Y,   pmovzxbw my, xmy
+    IF Z,   pmovzxbw mz, xmz
+    IF W,   pmovzxbw mw, xmw
+
+%ifidn %1, expand
+            VBROADCASTI128 m8, [expand16_shuf]
+    %if V2
+    IF X,   pshufb mx2, m8
+    IF Y,   pshufb my2, m8
+    IF Z,   pshufb mz2, m8
+    IF W,   pshufb mw2, m8
+    %endif
+    IF X,   pshufb mx, m8
+    IF Y,   pshufb my, m8
+    IF Z,   pshufb mz, m8
+    IF W,   pshufb mw, m8
+%endif ; expand
+            CONTINUE tmp0q
+%endmacro
+
+%macro conv16to8 0
+op convert_U16_U8
+        LOAD_CONT tmp0q
+%if V2
+        ; this code technically works for the !V2 case as well, but slower
+IF X,   packuswb mx, mx2
+IF Y,   packuswb my, my2
+IF Z,   packuswb mz, mz2
+IF W,   packuswb mw, mw2
+IF X,   vpermq mx, mx, q3120
+IF Y,   vpermq my, my, q3120
+IF Z,   vpermq mz, mz, q3120
+IF W,   vpermq mw, mw, q3120
+%else
+IF X,   vextracti128  xm8, mx, 1
+IF Y,   vextracti128  xm9, my, 1
+IF Z,   vextracti128 xm10, mz, 1
+IF W,   vextracti128 xm11, mw, 1
+IF X,   packuswb xmx, xm8
+IF Y,   packuswb xmy, xm9
+IF Z,   packuswb xmz, xm10
+IF W,   packuswb xmw, xm11
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv8to32 1 ; type
+op %1_U8_U32
+        LOAD_CONT tmp0q
+IF X,   psrldq xmx2, xmx, 8
+IF Y,   psrldq xmy2, xmy, 8
+IF Z,   psrldq xmz2, xmz, 8
+IF W,   psrldq xmw2, xmw, 8
+IF X,   pmovzxbd mx, xmx
+IF Y,   pmovzxbd my, xmy
+IF Z,   pmovzxbd mz, xmz
+IF W,   pmovzxbd mw, xmw
+IF X,   pmovzxbd mx2, xmx2
+IF Y,   pmovzxbd my2, xmy2
+IF Z,   pmovzxbd mz2, xmz2
+IF W,   pmovzxbd mw2, xmw2
+%ifidn %1, expand
+        VBROADCASTI128 m8, [expand32_shuf]
+IF X,   pshufb mx, m8
+IF Y,   pshufb my, m8
+IF Z,   pshufb mz, m8
+IF W,   pshufb mw, m8
+IF X,   pshufb mx2, m8
+IF Y,   pshufb my2, m8
+IF Z,   pshufb mz2, m8
+IF W,   pshufb mw2, m8
+%endif ; expand
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32to8 0
+op convert_U32_U8
+        LOAD_CONT tmp0q
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   packuswb xmx, xmx2
+IF Y,   packuswb xmy, xmy2
+IF Z,   packuswb xmz, xmz2
+IF W,   packuswb xmw, xmw2
+IF X,   vpshufd xmx, xmx, q3120
+IF Y,   vpshufd xmy, xmy, q3120
+IF Z,   vpshufd xmz, xmz, q3120
+IF W,   vpshufd xmw, xmw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv16to32 0
+op convert_U16_U32
+        LOAD_CONT tmp0q
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   pmovzxwd mx, xmx
+IF Y,   pmovzxwd my, xmy
+IF Z,   pmovzxwd mz, xmz
+IF W,   pmovzxwd mw, xmw
+IF X,   pmovzxwd mx2, xmx2
+IF Y,   pmovzxwd my2, xmy2
+IF Z,   pmovzxwd mz2, xmz2
+IF W,   pmovzxwd mw2, xmw2
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32to16 0
+op convert_U32_U16
+        LOAD_CONT tmp0q
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vpermq mx, mx, q3120
+IF Y,   vpermq my, my, q3120
+IF Z,   vpermq mz, mz, q3120
+IF W,   vpermq mw, mw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Shifting
+
+%macro lshift16 0
+op lshift16
+        vmovq xm8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   psllw mx, xm8
+IF Y,   psllw my, xm8
+IF Z,   psllw mz, xm8
+IF W,   psllw mw, xm8
+%if V2
+IF X,   psllw mx2, xm8
+IF Y,   psllw my2, xm8
+IF Z,   psllw mz2, xm8
+IF W,   psllw mw2, xm8
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+%macro rshift16 0
+op rshift16
+        vmovq xm8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   psrlw mx, xm8
+IF Y,   psrlw my, xm8
+IF Z,   psrlw mz, xm8
+IF W,   psrlw mw, xm8
+%if V2
+IF X,   psrlw mx2, xm8
+IF Y,   psrlw my2, xm8
+IF Z,   psrlw mz2, xm8
+IF W,   psrlw mw2, xm8
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Function instantiations
+
+%macro funcs_u8 0
+    read_planar 1
+    read_planar 2
+    read_planar 3
+    read_planar 4
+    write_planar 1
+    write_planar 2
+    write_planar 3
+    write_planar 4
+
+    rw_packed 8
+    read_nibbles
+    read_bits
+    write_bits
+
+    clear_alpha 0, mx, mx2
+    clear_alpha 1, my, my2
+    clear_alpha 3, mw, mw2
+    clear_zero  0, mx, mx2
+    clear_zero  1, my, my2
+    clear_zero  3, mw, mw2
+    clear_funcs
+    swizzle_funcs
+
+    decl_common_patterns shuffle
+%endmacro
+
+%macro funcs_u16 0
+    rw_packed 16
+    decl_common_patterns conv8to16 convert
+    decl_common_patterns conv8to16 expand
+    decl_common_patterns conv16to8
+    decl_common_patterns lshift16
+    decl_common_patterns rshift16
+%endmacro
+
+INIT_XMM sse4
+decl_v2 0, funcs_u8
+decl_v2 1, funcs_u8
+
+packed_shuffle  5, 15     ;  8 -> 24
+packed_shuffle  4, 16, 4  ;  8 -> 32, 16 -> 64
+packed_shuffle  2, 12     ;  8 -> 48
+packed_shuffle 10, 15     ; 16 -> 24
+packed_shuffle  8, 16, 4  ; 16 -> 32, 32 -> 64
+packed_shuffle  4, 12     ; 16 -> 48
+packed_shuffle 15, 15     ; 24 -> 24
+packed_shuffle 12, 16, 4  ; 24 -> 32
+packed_shuffle  6, 12     ; 24 -> 48
+packed_shuffle 16, 12     ; 32 -> 24, 64 -> 48
+packed_shuffle 16, 16, 4  ; 32 -> 32, 64 -> 64
+packed_shuffle  8, 12     ; 32 -> 48
+packed_shuffle 12, 12     ; 48 -> 48
+
+INIT_YMM avx2
+decl_v2 0, funcs_u8
+decl_v2 1, funcs_u8
+decl_v2 0, funcs_u16
+decl_v2 1, funcs_u16
+
+packed_shuffle 32, 32
+
+INIT_YMM avx2
+decl_v2 1, rw_packed 32
+decl_common_patterns conv8to32 convert
+decl_common_patterns conv8to32 expand
+decl_common_patterns conv32to8
+decl_common_patterns conv16to32
+decl_common_patterns conv32to16
-- 
2.49.0


* [FFmpeg-devel] [PATCH 12/17] tests/checkasm: increase number of runs in between measurements
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (10 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 11/17] swscale/x86: add SIMD backend Niklas Haas
@ 2025-04-26 17:41 ` Niklas Haas
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 13/17] tests/checkasm: add checkasm_check_float Niklas Haas
                   ` (7 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Sometimes, when measuring very small functions, rdtsc is not precise enough
to get a reliable measurement. Increase the number of calls inside the inner
loop from 4 to 32, which greatly reduces the relative impact of timer
granularity. This matters less with the more precise linux-perf API, but is
still useful there.

There should be no user-visible change since the number of runs is adjusted
to keep the total time spent measuring the same.
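
Concretely, each timed iteration now makes 32 calls (two CALL16 expansions)
instead of 4, while the iteration count drops from bench_runs to
bench_runs >> 3, keeping the total number of calls at roughly 4 * bench_runs.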
---
 tests/checkasm/checkasm.c |  2 +-
 tests/checkasm/checkasm.h | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 412b8b2cd1..87b75ec36c 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -624,7 +624,7 @@ static inline double avg_cycles_per_call(const CheckasmPerf *const p)
     if (p->iterations) {
         const double cycles = (double)(10 * p->cycles) / p->iterations - state.nop_time;
         if (cycles > 0.0)
-            return cycles / 4.0; /* 4 calls per iteration */
+            return cycles / 32.0; /* 32 calls per iteration */
     }
     return 0.0;
 }
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index ad239fb2a4..215d64e076 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -340,6 +340,22 @@ typedef struct CheckasmPerf {
 #define PERF_STOP(t)  t = AV_READ_TIME() - t
 #endif
 
+#define CALL4(...)\
+    do {\
+        tfunc(__VA_ARGS__); \
+        tfunc(__VA_ARGS__); \
+        tfunc(__VA_ARGS__); \
+        tfunc(__VA_ARGS__); \
+    } while (0)
+
+#define CALL16(...)\
+    do {\
+        CALL4(__VA_ARGS__); \
+        CALL4(__VA_ARGS__); \
+        CALL4(__VA_ARGS__); \
+        CALL4(__VA_ARGS__); \
+    } while (0)
+
 /* Benchmark the function */
 #define bench_new(...)\
     do {\
@@ -350,14 +366,12 @@ typedef struct CheckasmPerf {
             uint64_t tsum = 0;\
             uint64_t ti, tcount = 0;\
             uint64_t t = 0; \
-            const uint64_t truns = bench_runs;\
+            const uint64_t truns = FFMAX(bench_runs >> 3, 1);\
             checkasm_set_signal_handler_state(1);\
             for (ti = 0; ti < truns; ti++) {\
                 PERF_START(t);\
-                tfunc(__VA_ARGS__);\
-                tfunc(__VA_ARGS__);\
-                tfunc(__VA_ARGS__);\
-                tfunc(__VA_ARGS__);\
+                CALL16(__VA_ARGS__);\
+                CALL16(__VA_ARGS__);\
                 PERF_STOP(t);\
                 if (t*tcount <= tsum*4 && ti > 0) {\
                     tsum += t;\
-- 
2.49.0


* [FFmpeg-devel] [PATCH 13/17] tests/checkasm: add checkasm_check_float
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (11 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 12/17] tests/checkasm: increase number of runs in between measurements Niklas Haas
@ 2025-04-26 17:41 ` Niklas Haas
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 14/17] tests/checkasm: add checkasm tests for swscale ops Niklas Haas
                   ` (6 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

---
 tests/checkasm/checkasm.c | 1 +
 tests/checkasm/checkasm.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 87b75ec36c..6e99d33d70 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -1261,3 +1261,4 @@ DEF_CHECKASM_CHECK_FUNC(uint16_t, "%04x")
 DEF_CHECKASM_CHECK_FUNC(uint32_t, "%08x")
 DEF_CHECKASM_CHECK_FUNC(int16_t,  "%6d")
 DEF_CHECKASM_CHECK_FUNC(int32_t,  "%9d")
+DEF_CHECKASM_CHECK_FUNC(float,    "%g")
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index 215d64e076..0f8dea82e9 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -420,6 +420,7 @@ DECL_CHECKASM_CHECK_FUNC(uint16_t);
 DECL_CHECKASM_CHECK_FUNC(uint32_t);
 DECL_CHECKASM_CHECK_FUNC(int16_t);
 DECL_CHECKASM_CHECK_FUNC(int32_t);
+DECL_CHECKASM_CHECK_FUNC(float);
 
 #define PASTE(a,b) a ## b
 #define CONCAT(a,b) PASTE(a,b)
-- 
2.49.0


* [FFmpeg-devel] [PATCH 14/17] tests/checkasm: add checkasm tests for swscale ops
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (12 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 13/17] tests/checkasm: add checkasm_check_float Niklas Haas
@ 2025-04-26 17:41 ` Niklas Haas
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 15/17] swscale/format: rename legacy format conversion table Niklas Haas
                   ` (5 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Because of the lack of an external ABI on low-level kernels, we cannot
directly test internal functions. Instead, we construct a minimal op chain
consisting of a read, the op to be tested, and a write.

A bigger complication is that the backend may generate arbitrary internal
state that needs to be passed back to the implementation, so we cannot
directly call `func_ref` on the chain compiled for the tested backend. To get
around this, always compile the op chain twice: once using the backend under
test, and once using the reference C backend.
---
 tests/checkasm/Makefile   |   8 +-
 tests/checkasm/checkasm.c |   1 +
 tests/checkasm/checkasm.h |   1 +
 tests/checkasm/sw_ops.c   | 748 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 757 insertions(+), 1 deletion(-)
 create mode 100644 tests/checkasm/sw_ops.c

diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index d5c50e5599..be4c6b265f 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -65,7 +65,13 @@ AVFILTEROBJS-$(CONFIG_SOBEL_FILTER)      += vf_convolution.o
 CHECKASMOBJS-$(CONFIG_AVFILTER) += $(AVFILTEROBJS-yes)
 
 # swscale tests
-SWSCALEOBJS                             += sw_gbrp.o sw_range_convert.o sw_rgb.o sw_scale.o sw_yuv2rgb.o sw_yuv2yuv.o
+SWSCALEOBJS                             += sw_gbrp.o            \
+                                           sw_ops.o             \
+                                           sw_range_convert.o   \
+                                           sw_rgb.o             \
+                                           sw_scale.o           \
+                                           sw_yuv2rgb.o         \
+                                           sw_yuv2yuv.o
 
 CHECKASMOBJS-$(CONFIG_SWSCALE)  += $(SWSCALEOBJS)
 
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 6e99d33d70..5f3a900bfd 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -294,6 +294,7 @@ static const struct {
     { "sw_scale", checkasm_check_sw_scale },
     { "sw_yuv2rgb", checkasm_check_sw_yuv2rgb },
     { "sw_yuv2yuv", checkasm_check_sw_yuv2yuv },
+    { "sw_ops", checkasm_check_sw_ops },
 #endif
 #if CONFIG_AVUTIL
         { "aes",       checkasm_check_aes },
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index 0f8dea82e9..959e66d9f8 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -131,6 +131,7 @@ void checkasm_check_sw_rgb(void);
 void checkasm_check_sw_scale(void);
 void checkasm_check_sw_yuv2rgb(void);
 void checkasm_check_sw_yuv2yuv(void);
+void checkasm_check_sw_ops(void);
 void checkasm_check_takdsp(void);
 void checkasm_check_utvideodsp(void);
 void checkasm_check_v210dec(void);
diff --git a/tests/checkasm/sw_ops.c b/tests/checkasm/sw_ops.c
new file mode 100644
index 0000000000..7b38bd6902
--- /dev/null
+++ b/tests/checkasm/sw_ops.c
@@ -0,0 +1,748 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <math.h>
+#include <stdarg.h>
+#include <stdbool.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "libavutil/avassert.h"
+#include "libavutil/mem_internal.h"
+#include "libavutil/refstruct.h"
+
+#include "libswscale/ops.h"
+#include "libswscale/ops_internal.h"
+
+#include "checkasm.h"
+
+enum {
+    PIXELS = 64,
+};
+
+enum {
+    U8  = SWS_PIXEL_U8,
+    U16 = SWS_PIXEL_U16,
+    U32 = SWS_PIXEL_U32,
+    F32 = SWS_PIXEL_F32,
+};
+
+#define FMT(fmt, ...) tprintf((char[256]) {0}, 256, fmt, __VA_ARGS__)
+static const char *tprintf(char buf[], size_t size, const char *fmt, ...)
+{
+    va_list ap;
+    va_start(ap, fmt);
+    vsnprintf(buf, size, fmt, ap);
+    va_end(ap);
+    return buf;
+}
+
+static int rw_pixel_bits(const SwsOp *op)
+{
+    const int elems = op->rw.packed ? op->rw.elems : 1;
+    const int size  = ff_sws_pixel_type_size(op->type);
+    const int bits  = 8 >> op->rw.frac;
+    av_assert1(bits >= 1);
+    return elems * size * bits;
+}
+
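+/* Draw random bit patterns until we hit a normal float, so that NaNs,
+ * infinities and denormals never enter the test data */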
+static float rndf(void)
+{
+    union { uint32_t u; float f; } x;
+    do {
+        x.u = rnd();
+    } while (!isnormal(x.f));
+    return x.f;
+}
+
+static void fill32f(float *line, int num, unsigned range)
+{
+    const float scale = (float) range / UINT32_MAX;
+    for (int i = 0; i < num; i++)
+        line[i] = range ? scale * rnd() : rndf();
+}
+
+static void fill32(uint32_t *line, int num, unsigned range)
+{
+    for (int i = 0; i < num; i++)
+        line[i] = range ? rnd() % (range + 1) : rnd();
+}
+
+static void fill16(uint16_t *line, int num, unsigned range)
+{
+    if (!range) {
+        fill32((uint32_t *) line, AV_CEIL_RSHIFT(num, 1), 0);
+    } else {
+        for (int i = 0; i < num; i++)
+            line[i] = rnd() % (range + 1);
+    }
+}
+
+static void fill8(uint8_t *line, int num, unsigned range)
+{
+    if (!range) {
+        fill32((uint32_t *) line, AV_CEIL_RSHIFT(num, 2), 0);
+    } else {
+        for (int i = 0; i < num; i++)
+            line[i] = rnd() % (range + 1);
+    }
+}
+
+static void check_ops(const char *report, unsigned range, const SwsOp *ops)
+{
+    SwsContext *ctx = sws_alloc_context();
+    SwsCompiledOp comp_ref = {0}, comp_new = {0};
+    SwsOpList oplist = { .ops = (SwsOp *) ops };
+    const SwsOp *read_op, *write_op;
+
+    declare_func(void, const SwsOpExec *exec, const void *priv, int pixels);
+
+    DECLARE_ALIGNED_64(char, src0)[4][PIXELS * sizeof(uint32_t[4])];
+    DECLARE_ALIGNED_64(char, src1)[4][PIXELS * sizeof(uint32_t[4])];
+    DECLARE_ALIGNED_64(char, dst0)[4][PIXELS * sizeof(uint32_t[4])];
+    DECLARE_ALIGNED_64(char, dst1)[4][PIXELS * sizeof(uint32_t[4])];
+
+    if (!ctx)
+        return;
+    ctx->flags = SWS_BITEXACT;
+
+    read_op = &ops[0];
+    for (oplist.num_ops = 0; ops[oplist.num_ops].op; oplist.num_ops++)
+        write_op = &ops[oplist.num_ops];
+
+    for (int p = 0; p < 4; p++) {
+        void *plane = src0[p];
+        switch (read_op->type) {
+        case U8:    fill8(plane, sizeof(src0[p]) /  sizeof(uint8_t), range); break;
+        case U16:  fill16(plane, sizeof(src0[p]) / sizeof(uint16_t), range); break;
+        case U32:  fill32(plane, sizeof(src0[p]) / sizeof(uint32_t), range); break;
+        case F32: fill32f(plane, sizeof(src0[p]) / sizeof(uint32_t), range); break;
+        }
+    }
+
+    memcpy(src1, src0, sizeof(src0));
+    memset(dst0, 0, sizeof(dst0));
+    memset(dst1, 0, sizeof(dst1));
+
+    /* Compile `ops` using both the asm and c backends */
+    for (int n = 0; ff_sws_op_backends[n]; n++) {
+        const SwsOpBackend *backend = ff_sws_op_backends[n];
+        const bool is_ref = !strcmp(backend->name, "c");
+        if (is_ref || !comp_new.func) {
+            SwsCompiledOp comp;
+            int ret = ff_sws_ops_compile_backend(ctx, backend, &oplist, &comp);
+            if (ret == AVERROR(ENOTSUP))
+                continue;
+            else if (ret < 0)
+                fail();
+            else if (PIXELS % comp.block_size != 0)
+                fail();
+
+            if (is_ref)
+                comp_ref = comp;
+            if (!comp_new.func)
+                comp_new = comp;
+        }
+    }
+
+    av_assert0(comp_ref.func && comp_new.func);
+
+    SwsOpExec exec = {0};
+    exec.pixel_bits_in  = rw_pixel_bits(read_op);
+    exec.pixel_bits_out = rw_pixel_bits(write_op);
+    exec.width = PIXELS;
+    exec.height = exec.slice_h = 1;
+    for (int i = 0; i < 4; i++) {
+        exec.in_stride[i]  = sizeof(src0[i]);
+        exec.out_stride[i] = sizeof(dst0[i]);
+    }
+
+    if (check_func(comp_new.func, "%s", report)) {
+        func_ref = comp_ref.func; /* ignore any other asm versions */
+
+        for (int i = 0; i < 4; i++) {
+            exec.in[i]  = (void *) src0[i];
+            exec.out[i] = (void *) dst0[i];
+        }
+        call_ref(&exec, comp_ref.priv, PIXELS / comp_ref.block_size);
+
+        for (int i = 0; i < 4; i++) {
+            exec.in[i]  = (void *) src1[i];
+            exec.out[i] = (void *) dst1[i];
+        }
+        call_new(&exec, comp_new.priv, PIXELS / comp_new.block_size);
+
+        for (int i = 0; i < 4; i++) {
+            const char *name = FMT("%s[%d]", report, i);
+            const int size   = PIXELS * exec.pixel_bits_out >> 3;
+            const int stride = sizeof(dst0[i]);
+
+            switch (write_op->type) {
+            case U8:
+                checkasm_check(uint8_t, (void *) dst0[i], stride,
+                                        (void *) dst1[i], stride,
+                                        size, 1, name);
+                break;
+            case U16:
+                checkasm_check(uint16_t, (void *) dst0[i], stride,
+                                         (void *) dst1[i], stride,
+                                         size >> 1, 1, name);
+                break;
+            case U32:
+                checkasm_check(uint32_t, (void *) dst0[i], stride,
+                                         (void *) dst1[i], stride,
+                                         size >> 2, 1, name);
+                break;
+            case F32:
+                checkasm_check(float, (void *) dst0[i], stride,
+                                      (void *) dst1[i], stride,
+                                      size >> 2, 1, name);
+                break;
+            }
+
+            /* Check for over-write */
+            for (int x = size + comp_new.over_write; x < sizeof(dst1[i]); x++) {
+                if (dst1[i][x] != 0) {
+                    fprintf(stderr, "Overwrite detected in %s: [%d] = 0x%02x\n",
+                            name, x, dst1[i][x]);
+                    fail();
+                }
+            }
+
+            if (write_op->rw.packed)
+                break;
+        }
+
+        bench_new(&exec, comp_new.priv, PIXELS / comp_new.block_size);
+    }
+
+    if (comp_new.func != comp_ref.func && comp_new.free)
+        comp_new.free(comp_new.priv);
+    if (comp_ref.free)
+        comp_ref.free(comp_ref.priv);
+    sws_free_context(&ctx);
+}
+
+#define CHECK_RANGE(NAME, RANGE, N_IN, N_OUT, IN, OUT, ...)                     \
+  do {                                                                          \
+    check_ops(NAME, RANGE, (SwsOp[]) {                                          \
+        {                                                                       \
+            .op = SWS_OP_READ,                                                  \
+            .type = IN,                                                         \
+            .rw.elems = N_IN,                                                   \
+        },                                                                      \
+        __VA_ARGS__,                                                            \
+        {                                                                       \
+            .op = SWS_OP_WRITE,                                                 \
+            .type = OUT,                                                        \
+            .rw.elems = N_OUT,                                                  \
+        }, {0}                                                                  \
+    });                                                                         \
+  } while (0)
+
+#define CHECK_COMMON_RANGE(NAME, RANGE, IN, OUT, ...)                           \
+    CHECK_RANGE(FMT("%s_p1000", NAME), RANGE, 1, 1, IN, OUT, __VA_ARGS__);      \
+    CHECK_RANGE(FMT("%s_p1110", NAME), RANGE, 3, 3, IN, OUT, __VA_ARGS__);      \
+    CHECK_RANGE(FMT("%s_p1111", NAME), RANGE, 4, 4, IN, OUT, __VA_ARGS__);      \
+    CHECK_RANGE(FMT("%s_p1001", NAME), RANGE, 4, 2, IN, OUT, __VA_ARGS__, {     \
+        .op = SWS_OP_SWIZZLE,                                                   \
+        .type = OUT,                                                            \
+        .swizzle = SWS_SWIZZLE(0, 3, 1, 2),                                     \
+    })
+
+#define CHECK(NAME, N_IN, N_OUT, IN, OUT, ...) \
+    CHECK_RANGE(NAME, 0, N_IN, N_OUT, IN, OUT, __VA_ARGS__)
+
+#define CHECK_COMMON(NAME, IN, OUT, ...) \
+    CHECK_COMMON_RANGE(NAME, 0, IN, OUT, __VA_ARGS__)
+
+static void check_read_write(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        for (int i = 1; i <= 4; i++) {
+            /* Test N->N planar read/write */
+            for (int o = 1; o <= i; o++) {
+                check_ops(FMT("rw_%d_%d_%s", i, o, type), 0, (SwsOp[]) {
+                    {
+                        .op = SWS_OP_READ,
+                        .type = t,
+                        .rw.elems = i,
+                    }, {
+                        .op = SWS_OP_WRITE,
+                        .type = t,
+                        .rw.elems = o,
+                    }, {0}
+                });
+            }
+
+            /* Test packed read/write */
+            if (i == 1)
+                continue;
+
+            check_ops(FMT("read_packed%d_%s", i, type), 0, (SwsOp[]) {
+                {
+                    .op = SWS_OP_READ,
+                    .type = t,
+                    .rw.elems = i,
+                    .rw.packed = true,
+                }, {
+                    .op = SWS_OP_WRITE,
+                    .type = t,
+                    .rw.elems = i,
+                }, {0}
+            });
+
+            check_ops(FMT("write_packed%d_%s", i, type), 0, (SwsOp[]) {
+                {
+                    .op = SWS_OP_READ,
+                    .type = t,
+                    .rw.elems = i,
+                }, {
+                    .op = SWS_OP_WRITE,
+                    .type = t,
+                    .rw.elems = i,
+                    .rw.packed = true,
+                }, {0}
+            });
+        }
+    }
+
+    /* Test fractional reads/writes */
+    for (int frac = 1; frac <= 3; frac++) {
+        const int bits = 8 >> frac;
+        const int range = (1 << bits) - 1;
+        if (bits == 2)
+            continue; /* no 2 bit packed formats currently exist */
+
+        check_ops(FMT("read_frac%d", frac), 0, (SwsOp[]) {
+            {
+                .op = SWS_OP_READ,
+                .type = U8,
+                .rw.elems = 1,
+                .rw.frac  = frac,
+            }, {
+                .op = SWS_OP_WRITE,
+                .type = U8,
+                .rw.elems = 1,
+            }, {0}
+        });
+
+        check_ops(FMT("write_frac%d", frac), range, (SwsOp[]) {
+            {
+                .op = SWS_OP_READ,
+                .type = U8,
+                .rw.elems = 1,
+            }, {
+                .op = SWS_OP_WRITE,
+                .type = U8,
+                .rw.elems = 1,
+                .rw.frac  = frac,
+            }, {0}
+        });
+    }
+}
+
+static void check_swap_bytes(void)
+{
+    CHECK_COMMON("swap_bytes_16", U16, U16, {
+        .op   = SWS_OP_SWAP_BYTES,
+        .type = U16,
+    });
+
+    CHECK_COMMON("swap_bytes_32", U32, U32, {
+        .op   = SWS_OP_SWAP_BYTES,
+        .type = U32,
+    });
+}
+
+static void check_pack_unpack(void)
+{
+    const struct {
+        SwsPixelType type;
+        SwsPackOp op;
+    } patterns[] = {
+        { U8, {{ 3,  3,  2 }}},
+        { U8, {{ 2,  3,  3 }}},
+        { U8, {{ 1,  2,  1 }}},
+        {U16, {{ 5,  6,  5 }}},
+        {U16, {{ 5,  5,  5 }}},
+        {U16, {{ 4,  4,  4 }}},
+        {U32, {{ 2, 10, 10, 10 }}},
+        {U32, {{10, 10, 10,  2 }}},
+    };
+
+    for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+        const SwsPixelType type = patterns[i].type;
+        const SwsPackOp pack = patterns[i].op;
+        const int num = pack.pattern[3] ? 4 : 3;
+        const char *pat = FMT("%d%d%d%d", pack.pattern[0], pack.pattern[1],
+                                          pack.pattern[2], pack.pattern[3]);
+
+        CHECK(FMT("pack_%s", pat), num, 1, type, type, {
+            .op   = SWS_OP_PACK,
+            .type = type,
+            .pack = pack,
+        });
+
+        CHECK(FMT("unpack_%s", pat), 1, num, type, type, {
+            .op   = SWS_OP_UNPACK,
+            .type = type,
+            .pack = pack,
+        });
+    }
+}
+
+static AVRational rndq(SwsPixelType t)
+{
+    const unsigned num = rnd();
+    if (ff_sws_pixel_type_is_int(t)) {
+        /* use 64-bit math to avoid undefined behavior when shifting by 32 */
+        const uint64_t mask = (1ULL << (ff_sws_pixel_type_size(t) * 8)) - 1;
+        return (AVRational) { num & mask, 1 };
+    } else {
+        const unsigned den = rnd();
+        return (AVRational) { num, den ? den : 1 };
+    }
+}
+
+static void check_clear(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        const int bits = ff_sws_pixel_type_size(t) * 8;
+
+        /* TODO: AVRational can't fit 32 bit constants */
+        if (bits < 32) {
+            const AVRational chroma = (AVRational) { 1 << (bits - 1), 1};
+            const AVRational alpha  = (AVRational) { (1 << bits) - 1, 1};
+            const AVRational zero   = (AVRational) { 0, 1};
+            const AVRational none = {0};
+
+            const SwsConst patterns[] = {
+                /* Zero only */
+                {.q4 = {   none,   none,   none,   zero }},
+                {.q4 = {   zero,   none,   none,   none }},
+                /* Alpha only */
+                {.q4 = {   none,   none,   none,  alpha }},
+                {.q4 = {  alpha,   none,   none,   none }},
+                /* Chroma only */
+                {.q4 = { chroma, chroma,   none,   none }},
+                {.q4 = {   none, chroma, chroma,   none }},
+                {.q4 = {   none,   none, chroma, chroma }},
+                {.q4 = { chroma,   none, chroma,   none }},
+                {.q4 = {   none, chroma,   none, chroma }},
+                /* Alpha+chroma */
+                {.q4 = { chroma, chroma,   none,  alpha }},
+                {.q4 = {   none, chroma, chroma,  alpha }},
+                {.q4 = {  alpha,   none, chroma, chroma }},
+                {.q4 = { chroma,   none, chroma,  alpha }},
+                {.q4 = {  alpha, chroma,   none, chroma }},
+                /* Random values */
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+            };
+
+            for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+                CHECK(FMT("clear_pattern_%s[%d]", type, i), 4, 4, t, t, {
+                    .op   = SWS_OP_CLEAR,
+                    .type = t,
+                    .c    = patterns[i],
+                });
+            }
+        } else if (!ff_sws_pixel_type_is_int(t)) {
+            /* Floating point YUV doesn't exist, only alpha needs to be cleared */
+            CHECK(FMT("clear_alpha_%s", type), 4, 4, t, t, {
+                .op      = SWS_OP_CLEAR,
+                .type    = t,
+                .c.q4[3] = { 0, 1 },
+            });
+        }
+    }
+}
+
+static void check_shift(void)
+{
+    for (SwsPixelType t = U16; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        if (!ff_sws_pixel_type_is_int(t))
+            continue;
+
+        for (int shift = 1; shift <= 8; shift++) {
+            CHECK_COMMON(FMT("lshift%d_%s", shift, type), t, t, {
+                .op   = SWS_OP_LSHIFT,
+                .type = t,
+                .c.u  = shift,
+            });
+
+            CHECK_COMMON(FMT("rshift%d_%s", shift, type), t, t, {
+                .op   = SWS_OP_RSHIFT,
+                .type = t,
+                .c.u  = shift,
+            });
+        }
+    }
+}
+
+static void check_swizzle(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        static const int patterns[][4] = {
+            /* Pure swizzle */
+            {3, 0, 1, 2},
+            {3, 0, 2, 1},
+            {2, 1, 0, 3},
+            {3, 2, 1, 0},
+            {3, 1, 0, 2},
+            {3, 2, 0, 1},
+            {1, 2, 0, 3},
+            {1, 0, 2, 3},
+            {2, 0, 1, 3},
+            {2, 3, 1, 0},
+            {2, 1, 3, 0},
+            {1, 2, 3, 0},
+            {1, 3, 2, 0},
+            {0, 2, 1, 3},
+            {0, 2, 3, 1},
+            {0, 3, 1, 2},
+            {3, 1, 2, 0},
+            {0, 3, 2, 1},
+            /* Luma expansion */
+            {0, 0, 0, 3},
+            {3, 0, 0, 0},
+            {0, 0, 0, 1},
+            {1, 0, 0, 0},
+        };
+
+        for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+            const int x = patterns[i][0], y = patterns[i][1],
+                      z = patterns[i][2], w = patterns[i][3];
+            CHECK(FMT("swizzle_%d%d%d%d_%s", x, y, z, w, type), 4, 4, t, t, {
+                .op = SWS_OP_SWIZZLE,
+                .type = t,
+                .swizzle = SWS_SWIZZLE(x, y, z, w),
+            });
+        }
+    }
+}
+
+static void check_convert(void)
+{
+    for (SwsPixelType i = U8; i < SWS_PIXEL_TYPE_NB; i++) {
+        const char *itype = ff_sws_pixel_type_name(i);
+        const int isize = ff_sws_pixel_type_size(i);
+        for (SwsPixelType o = U8; o < SWS_PIXEL_TYPE_NB; o++) {
+            const char *otype = ff_sws_pixel_type_name(o);
+            const int osize = ff_sws_pixel_type_size(o);
+            const char *name = FMT("convert_%s_%s", itype, otype);
+            if (i == o)
+                continue;
+
+            if (isize < osize || !ff_sws_pixel_type_is_int(o)) {
+                CHECK_COMMON(name, i, o, {
+                    .op = SWS_OP_CONVERT,
+                    .type = i,
+                    .convert.to = o,
+                });
+            } else if (isize > osize || !ff_sws_pixel_type_is_int(i)) {
+                uint32_t range = (1 << osize * 8) - 1;
+                CHECK_COMMON_RANGE(name, range, i, o, {
+                    .op = SWS_OP_CONVERT,
+                    .type = i,
+                    .convert.to = o,
+                });
+            }
+        }
+    }
+
+    /* Check expanding conversions */
+    CHECK_COMMON("expand16", U8, U16, {
+        .op = SWS_OP_CONVERT,
+        .type = U8,
+        .convert.to = U16,
+        .convert.expand = true,
+    });
+
+    CHECK_COMMON("expand32", U8, U32, {
+        .op = SWS_OP_CONVERT,
+        .type = U8,
+        .convert.to = U32,
+        .convert.expand = true,
+    });
+}
+
+static void check_dither(void)
+{
+    for (SwsPixelType t = F32; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        if (ff_sws_pixel_type_is_int(t))
+            continue;
+
+        /* Test all sizes up to 16x16 */
+        for (int size_log2 = 0; size_log2 <= 4; size_log2++) {
+            const int size = 1 << size_log2;
+            AVRational *matrix = av_refstruct_allocz(size * size * sizeof(*matrix));
+            if (!matrix) {
+                fail();
+                return;
+            }
+
+            if (size == 1) {
+                matrix[0] = (AVRational) { 1, 2 };
+            } else {
+                for (int i = 0; i < size * size; i++)
+                    matrix[i] = rndq(t);
+            }
+
+            CHECK_COMMON(FMT("dither_%dx%d_%s", size, size, type), t, t, {
+                .op = SWS_OP_DITHER,
+                .type = t,
+                .dither.size_log2 = size_log2,
+                .dither.matrix = matrix,
+            });
+
+            av_refstruct_unref(&matrix);
+        }
+    }
+}
+
+static void check_min_max(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        CHECK_COMMON(FMT("min_%s", type), t, t, {
+            .op = SWS_OP_MIN,
+            .type = t,
+            .c.q4 = { rndq(t), rndq(t), rndq(t), rndq(t) },
+        });
+
+        CHECK_COMMON(FMT("max_%s", type), t, t, {
+            .op = SWS_OP_MAX,
+            .type = t,
+            .c.q4 = { rndq(t), rndq(t), rndq(t), rndq(t) },
+        });
+    }
+}
+
+static void check_linear(void)
+{
+    static const struct {
+        const char *name;
+        uint32_t mask;
+    } patterns[] = {
+        { "noop",               0 },
+        { "luma",               SWS_MASK_LUMA },
+        { "alpha",              SWS_MASK_ALPHA },
+        { "luma+alpha",         SWS_MASK_LUMA | SWS_MASK_ALPHA },
+        { "dot3",               0b111 },
+        { "dot4",               0b1111 },
+        { "row0",               SWS_MASK_ROW(0) },
+        { "row0+alpha",         SWS_MASK_ROW(0) | SWS_MASK_ALPHA },
+        { "off3",               SWS_MASK_OFF3 },
+        { "off3+alpha",         SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag3",              SWS_MASK_DIAG3 },
+        { "diag4",              SWS_MASK_DIAG4 },
+        { "diag3+alpha",        SWS_MASK_DIAG3 | SWS_MASK_ALPHA },
+        { "diag3+off3",         SWS_MASK_DIAG3 | SWS_MASK_OFF3 },
+        { "diag3+off3+alpha",   SWS_MASK_DIAG3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag4+off4",         SWS_MASK_DIAG4 | SWS_MASK_OFF4 },
+        { "matrix3",            SWS_MASK_MAT3 },
+        { "matrix3+off3",       SWS_MASK_MAT3 | SWS_MASK_OFF3 },
+        { "matrix3+off3+alpha", SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "matrix4",            SWS_MASK_MAT4 },
+        { "matrix4+off4",       SWS_MASK_MAT4 | SWS_MASK_OFF4 },
+    };
+
+    for (SwsPixelType t = F32; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        if (ff_sws_pixel_type_is_int(t))
+            continue;
+
+        for (int p = 0; p < FF_ARRAY_ELEMS(patterns); p++) {
+            const uint32_t mask = patterns[p].mask;
+            SwsLinearOp lin = { .mask = mask };
+
+            for (int i = 0; i < 4; i++) {
+                for (int j = 0; j < 5; j++) {
+                    if (mask & SWS_MASK(i, j)) {
+                        lin.m[i][j] = rndq(t);
+                    } else {
+                        lin.m[i][j] = (AVRational) { i == j, 1 };
+                    }
+                }
+            }
+
+            CHECK(FMT("linear_%s_%s", patterns[p].name, type), 4, 4, t, t, {
+                .op = SWS_OP_LINEAR,
+                .type = t,
+                .lin = lin,
+            });
+        }
+    }
+}
+
+static void check_scale(void)
+{
+    for (SwsPixelType t = F32; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        const int bits = ff_sws_pixel_type_size(t) * 8;
+        if (ff_sws_pixel_type_is_int(t)) {
+            /* Ensure the result won't exceed the value range */
+            const unsigned max = (1 << bits) - 1;
+            const unsigned scale = rnd() & max;
+            const unsigned range = max / (scale ? scale : 1);
+            CHECK_COMMON_RANGE(FMT("scale_%s", type), range, t, t, {
+                .op   = SWS_OP_SCALE,
+                .type = t,
+                .c.q  = { scale, 1 },
+            });
+        } else {
+            CHECK_COMMON(FMT("scale_%s", type), t, t, {
+                .op   = SWS_OP_SCALE,
+                .type = t,
+                .c.q  = rndq(t),
+            });
+        }
+    }
+}
+
+void checkasm_check_sw_ops(void)
+{
+    check_read_write();
+    report("read_write");
+    check_swap_bytes();
+    report("swap_bytes");
+    check_pack_unpack();
+    report("pack_unpack");
+    check_clear();
+    report("clear");
+    check_shift();
+    report("shift");
+    check_swizzle();
+    report("swizzle");
+    check_convert();
+    report("convert");
+    check_dither();
+    report("dither");
+    check_min_max();
+    report("min_max");
+    check_linear();
+    report("linear");
+    check_scale();
+    report("scale");
+}
-- 
2.49.0


* [FFmpeg-devel] [PATCH 15/17] swscale/format: rename legacy format conversion table
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (13 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 14/17] tests/checkasm: add checkasm tests for swscale ops Niklas Haas
@ 2025-04-26 17:41 ` Niklas Haas
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 16/17] swscale/format: add new format decode/encode logic Niklas Haas
                   ` (4 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

---
 libswscale/format.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/libswscale/format.c b/libswscale/format.c
index e4c1348b90..b77081dd7a 100644
--- a/libswscale/format.c
+++ b/libswscale/format.c
@@ -24,14 +24,14 @@
 
 #include "format.h"
 
-typedef struct FormatEntry {
+typedef struct LegacyFormatEntry {
     uint8_t is_supported_in         :1;
     uint8_t is_supported_out        :1;
     uint8_t is_supported_endianness :1;
-} FormatEntry;
+} LegacyFormatEntry;
 
 /* Format support table for legacy swscale */
-static const FormatEntry format_entries[] = {
+static const LegacyFormatEntry legacy_format_entries[] = {
     [AV_PIX_FMT_YUV420P]        = { 1, 1 },
     [AV_PIX_FMT_YUYV422]        = { 1, 1 },
     [AV_PIX_FMT_RGB24]          = { 1, 1 },
@@ -262,20 +262,20 @@ static const FormatEntry format_entries[] = {
 
 int sws_isSupportedInput(enum AVPixelFormat pix_fmt)
 {
-    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
-           format_entries[pix_fmt].is_supported_in : 0;
+    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(legacy_format_entries) ?
+           legacy_format_entries[pix_fmt].is_supported_in : 0;
 }
 
 int sws_isSupportedOutput(enum AVPixelFormat pix_fmt)
 {
-    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
-           format_entries[pix_fmt].is_supported_out : 0;
+    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(legacy_format_entries) ?
+           legacy_format_entries[pix_fmt].is_supported_out : 0;
 }
 
 int sws_isSupportedEndiannessConversion(enum AVPixelFormat pix_fmt)
 {
-    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
-           format_entries[pix_fmt].is_supported_endianness : 0;
+    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(legacy_format_entries) ?
+           legacy_format_entries[pix_fmt].is_supported_endianness : 0;
 }
 
 /**
-- 
2.49.0


* [FFmpeg-devel] [PATCH 16/17] swscale/format: add new format decode/encode logic
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (14 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 15/17] swscale/format: rename legacy format conversion table Niklas Haas
@ 2025-04-26 17:41 ` Niklas Haas
  2025-05-02 14:10   ` Michael Niedermayer
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 17/17] swscale/graph: allow experimental use of new format handler Niklas Haas
                   ` (3 subsequent siblings)
  19 siblings, 1 reply; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This patch adds format handling code for the new operations. This entails
fully decoding a format to standardized RGB, and the inverse.

Handling it this way means we can always guarantee that a conversion path
exists from A to B without having to explicitly cover the logic for each
pair of formats; and choosing RGB instead of YUV as the intermediate (as was
done in swscale v1) is more flexible with regard to enabling further
operations such as primaries conversions, linear scaling, etc.

In the case of a YUV->YUV transform, the redundant matrix multiplications
will be canceled out anyway by the optimization pass.
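
To illustrate, decoding e.g. AV_PIX_FMT_RGB565LE (on a little-endian host)
expands to roughly the following op list, with no-op stages such as the
identity swizzle and zero shift omitted (a sketch only; the exact list also
depends on the optimizer pass):

    SWS_OP_READ    (U16, 1 element)          load packed 16-bit words
    SWS_OP_UNPACK  (U16, pattern {5, 6, 5})  split into R, G, B components
    SWS_OP_CONVERT (U16 -> U8)               convert to the component type
    SWS_OP_CLEAR   (alpha := 0)              initialize the absent alpha
    SWS_OP_CONVERT (U8 -> F32)               promote for color decoding
    SWS_OP_LINEAR  (scale 1/31, 1/63, 1/31)  normalize to nominal [0, 1] RGB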
---
 libswscale/format.c | 925 ++++++++++++++++++++++++++++++++++++++++++++
 libswscale/format.h |  23 ++
 2 files changed, 948 insertions(+)

diff --git a/libswscale/format.c b/libswscale/format.c
index b77081dd7a..c0e085d717 100644
--- a/libswscale/format.c
+++ b/libswscale/format.c
@@ -21,8 +21,22 @@
 #include "libavutil/avassert.h"
 #include "libavutil/hdr_dynamic_metadata.h"
 #include "libavutil/mastering_display_metadata.h"
+#include "libavutil/refstruct.h"
 
 #include "format.h"
+#include "csputils.h"
+#include "ops_internal.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+#define Q0   Q(0)
+#define Q1   Q(1)
+
+#define RET(x)                                                                 \
+    do {                                                                       \
+        int __ret = (x);                                                       \
+        if (__ret  < 0)                                                        \
+            return __ret;                                                      \
+    } while (0)
 
 typedef struct LegacyFormatEntry {
     uint8_t is_supported_in         :1;
@@ -582,3 +596,914 @@ int sws_is_noop(const AVFrame *dst, const AVFrame *src)
 
     return 1;
 }
+
+/* Returns the type suitable for a pixel after fully decoding/unpacking it */
+static SwsPixelType fmt_pixel_type(enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    const int bits = FFALIGN(desc->comp[0].depth, 8);
+    if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) {
+        switch (bits) {
+        case 32: return SWS_PIXEL_F32;
+        }
+    } else {
+        switch (bits) {
+        case  8: return SWS_PIXEL_U8;
+        case 16: return SWS_PIXEL_U16;
+        case 32: return SWS_PIXEL_U32;
+        }
+    }
+
+    return SWS_PIXEL_NONE;
+}
+
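+/* Swizzle mapping from standardized RGBA/YUVA component order to the
+ * format's native component order (applied when encoding; the inverse is
+ * applied when decoding) */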
+static SwsSwizzleOp fmt_swizzle(enum AVPixelFormat fmt)
+{
+    switch (fmt) {
+    case AV_PIX_FMT_ARGB:
+    case AV_PIX_FMT_0RGB:
+    case AV_PIX_FMT_AYUV64LE:
+    case AV_PIX_FMT_AYUV64BE:
+    case AV_PIX_FMT_AYUV:
+    case AV_PIX_FMT_X2RGB10LE:
+    case AV_PIX_FMT_X2RGB10BE:
+        return (SwsSwizzleOp) {{ .x = 3, 0, 1, 2 }};
+    case AV_PIX_FMT_BGR24:
+    case AV_PIX_FMT_BGR8:
+    case AV_PIX_FMT_BGR4:
+    case AV_PIX_FMT_BGR4_BYTE:
+    case AV_PIX_FMT_BGRA:
+    case AV_PIX_FMT_BGR565BE:
+    case AV_PIX_FMT_BGR565LE:
+    case AV_PIX_FMT_BGR555BE:
+    case AV_PIX_FMT_BGR555LE:
+    case AV_PIX_FMT_BGR444BE:
+    case AV_PIX_FMT_BGR444LE:
+    case AV_PIX_FMT_BGR48BE:
+    case AV_PIX_FMT_BGR48LE:
+    case AV_PIX_FMT_BGRA64BE:
+    case AV_PIX_FMT_BGRA64LE:
+    case AV_PIX_FMT_BGR0:
+    case AV_PIX_FMT_VUYA:
+    case AV_PIX_FMT_VUYX:
+        return (SwsSwizzleOp) {{ .x = 2, 1, 0, 3 }};
+    case AV_PIX_FMT_ABGR:
+    case AV_PIX_FMT_0BGR:
+    case AV_PIX_FMT_X2BGR10LE:
+    case AV_PIX_FMT_X2BGR10BE:
+        return (SwsSwizzleOp) {{ .x = 3, 2, 1, 0 }};
+    case AV_PIX_FMT_YA8:
+    case AV_PIX_FMT_YA16BE:
+    case AV_PIX_FMT_YA16LE:
+        return (SwsSwizzleOp) {{ .x = 0, 3, 1, 2 }};
+    case AV_PIX_FMT_XV30BE:
+    case AV_PIX_FMT_XV30LE:
+        return (SwsSwizzleOp) {{ .x = 3, 2, 0, 1 }};
+    case AV_PIX_FMT_VYU444:
+    case AV_PIX_FMT_V30XBE:
+    case AV_PIX_FMT_V30XLE:
+        return (SwsSwizzleOp) {{ .x = 2, 0, 1, 3 }};
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XV48BE:
+    case AV_PIX_FMT_XV48LE:
+    case AV_PIX_FMT_UYVA:
+        return (SwsSwizzleOp) {{ .x = 1, 0, 2, 3 }};
+    case AV_PIX_FMT_GBRP:
+    case AV_PIX_FMT_GBRP9BE:
+    case AV_PIX_FMT_GBRP9LE:
+    case AV_PIX_FMT_GBRP10BE:
+    case AV_PIX_FMT_GBRP10LE:
+    case AV_PIX_FMT_GBRP12BE:
+    case AV_PIX_FMT_GBRP12LE:
+    case AV_PIX_FMT_GBRP14BE:
+    case AV_PIX_FMT_GBRP14LE:
+    case AV_PIX_FMT_GBRP16BE:
+    case AV_PIX_FMT_GBRP16LE:
+    case AV_PIX_FMT_GBRPF16BE:
+    case AV_PIX_FMT_GBRPF16LE:
+    case AV_PIX_FMT_GBRAP:
+    case AV_PIX_FMT_GBRAP10LE:
+    case AV_PIX_FMT_GBRAP10BE:
+    case AV_PIX_FMT_GBRAP12LE:
+    case AV_PIX_FMT_GBRAP12BE:
+    case AV_PIX_FMT_GBRAP14LE:
+    case AV_PIX_FMT_GBRAP14BE:
+    case AV_PIX_FMT_GBRAP16LE:
+    case AV_PIX_FMT_GBRAP16BE:
+    case AV_PIX_FMT_GBRPF32BE:
+    case AV_PIX_FMT_GBRPF32LE:
+    case AV_PIX_FMT_GBRAPF16BE:
+    case AV_PIX_FMT_GBRAPF16LE:
+    case AV_PIX_FMT_GBRAPF32BE:
+    case AV_PIX_FMT_GBRAPF32LE:
+        return (SwsSwizzleOp) {{ .x = 1, 2, 0, 3 }};
+    default:
+        return (SwsSwizzleOp) {{ .x = 0, 1, 2, 3 }};
+    }
+}
+
+static SwsSwizzleOp swizzle_inv(SwsSwizzleOp swiz)
+{
+    /* The forward swizzle maps Input[x] to Output[swiz.x]; invert that mapping */
+    unsigned out[4];
+    out[swiz.x] = 0;
+    out[swiz.y] = 1;
+    out[swiz.z] = 2;
+    out[swiz.w] = 3;
+    return (SwsSwizzleOp) {{ .x = out[0], out[1], out[2], out[3] }};
+}
+
+/* Shift factor for MSB aligned formats */
+static int fmt_shift(enum AVPixelFormat fmt)
+{
+    switch (fmt) {
+    case AV_PIX_FMT_P010BE:
+    case AV_PIX_FMT_P010LE:
+    case AV_PIX_FMT_P210BE:
+    case AV_PIX_FMT_P210LE:
+    case AV_PIX_FMT_Y210BE:
+    case AV_PIX_FMT_Y210LE:
+        return 6;
+    case AV_PIX_FMT_P012BE:
+    case AV_PIX_FMT_P012LE:
+    case AV_PIX_FMT_P212BE:
+    case AV_PIX_FMT_P212LE:
+    case AV_PIX_FMT_P412BE:
+    case AV_PIX_FMT_P412LE:
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XYZ12BE:
+    case AV_PIX_FMT_XYZ12LE:
+        return 4;
+    }
+
+    return 0;
+}
+
+/**
+ * This initializes all absent components explicitly to zero. There is no
+ * need to worry about the correct neutral value, as the decoding logic will
+ * implicitly ignore and overwrite absent components in any case. This function
+ * is just to ensure that we don't operate on undefined memory. In most cases,
+ * it will end up getting pushed towards the output or optimized away entirely
+ * by the optimization pass.
+ */
+static SwsConst fmt_clear(enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    const bool has_chroma = desc->nb_components >= 3;
+    const bool has_alpha  = desc->flags & AV_PIX_FMT_FLAG_ALPHA;
+
+    SwsConst c = {0};
+    if (!has_chroma)
+        c.q4[1] = c.q4[2] = Q0;
+    if (!has_alpha)
+        c.q4[3] = Q0;
+
+    return c;
+}
+
+static int fmt_read_write(enum AVPixelFormat fmt, SwsReadWriteOp *rw_op,
+                          SwsPackOp *pack_op)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    if (!desc)
+        return AVERROR(EINVAL);
+
+    switch (fmt) {
+    case AV_PIX_FMT_NONE:
+    case AV_PIX_FMT_NB:
+        break;
+
+    /* Packed bitstream formats */
+    case AV_PIX_FMT_MONOWHITE:
+    case AV_PIX_FMT_MONOBLACK:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) {
+            .elems = 1,
+            .frac  = 3,
+        };
+        return 0;
+    case AV_PIX_FMT_RGB4:
+    case AV_PIX_FMT_BGR4:
+        *pack_op = (SwsPackOp) {{ 1, 2, 1 }};
+        *rw_op = (SwsReadWriteOp) {
+            .elems = 1,
+            .frac  = 1,
+        };
+        return 0;
+    /* Packed 8-bit aligned formats */
+    case AV_PIX_FMT_RGB4_BYTE:
+    case AV_PIX_FMT_BGR4_BYTE:
+        *pack_op = (SwsPackOp) {{ 1, 2, 1 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_BGR8:
+        *pack_op = (SwsPackOp) {{ 2, 3, 3 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_RGB8:
+        *pack_op = (SwsPackOp) {{ 3, 3, 2 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+
+    /* Packed 16-bit aligned formats */
+    case AV_PIX_FMT_RGB565BE:
+    case AV_PIX_FMT_RGB565LE:
+    case AV_PIX_FMT_BGR565BE:
+    case AV_PIX_FMT_BGR565LE:
+        *pack_op = (SwsPackOp) {{ 5, 6, 5 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_RGB555BE:
+    case AV_PIX_FMT_RGB555LE:
+    case AV_PIX_FMT_BGR555BE:
+    case AV_PIX_FMT_BGR555LE:
+        *pack_op = (SwsPackOp) {{ 5, 5, 5 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_RGB444BE:
+    case AV_PIX_FMT_RGB444LE:
+    case AV_PIX_FMT_BGR444BE:
+    case AV_PIX_FMT_BGR444LE:
+        *pack_op = (SwsPackOp) {{ 4, 4, 4 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    /* Packed 32-bit aligned 4:4:4 formats */
+    case AV_PIX_FMT_X2RGB10BE:
+    case AV_PIX_FMT_X2RGB10LE:
+    case AV_PIX_FMT_X2BGR10BE:
+    case AV_PIX_FMT_X2BGR10LE:
+    case AV_PIX_FMT_XV30BE:
+    case AV_PIX_FMT_XV30LE:
+        *pack_op = (SwsPackOp) {{ 2, 10, 10, 10 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_V30XBE:
+    case AV_PIX_FMT_V30XLE:
+        *pack_op = (SwsPackOp) {{ 10, 10, 10, 2 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    /* 3 component formats with one channel ignored */
+    case AV_PIX_FMT_RGB0:
+    case AV_PIX_FMT_BGR0:
+    case AV_PIX_FMT_0RGB:
+    case AV_PIX_FMT_0BGR:
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XV48BE:
+    case AV_PIX_FMT_XV48LE:
+    case AV_PIX_FMT_VUYX:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) { .elems = 4, .packed = true };
+        return 0;
+    /* Unpacked byte-aligned 4:4:4 formats */
+    case AV_PIX_FMT_YUV444P:
+    case AV_PIX_FMT_YUVJ444P:
+    case AV_PIX_FMT_YUV444P9BE:
+    case AV_PIX_FMT_YUV444P9LE:
+    case AV_PIX_FMT_YUV444P10BE:
+    case AV_PIX_FMT_YUV444P10LE:
+    case AV_PIX_FMT_YUV444P12BE:
+    case AV_PIX_FMT_YUV444P12LE:
+    case AV_PIX_FMT_YUV444P14BE:
+    case AV_PIX_FMT_YUV444P14LE:
+    case AV_PIX_FMT_YUV444P16BE:
+    case AV_PIX_FMT_YUV444P16LE:
+    case AV_PIX_FMT_YUVA444P:
+    case AV_PIX_FMT_YUVA444P9BE:
+    case AV_PIX_FMT_YUVA444P9LE:
+    case AV_PIX_FMT_YUVA444P10BE:
+    case AV_PIX_FMT_YUVA444P10LE:
+    case AV_PIX_FMT_YUVA444P12BE:
+    case AV_PIX_FMT_YUVA444P12LE:
+    case AV_PIX_FMT_YUVA444P16BE:
+    case AV_PIX_FMT_YUVA444P16LE:
+    case AV_PIX_FMT_AYUV:
+    case AV_PIX_FMT_UYVA:
+    case AV_PIX_FMT_VYU444:
+    case AV_PIX_FMT_AYUV64BE:
+    case AV_PIX_FMT_AYUV64LE:
+    case AV_PIX_FMT_VUYA:
+    case AV_PIX_FMT_RGB24:
+    case AV_PIX_FMT_BGR24:
+    case AV_PIX_FMT_RGB48BE:
+    case AV_PIX_FMT_RGB48LE:
+    case AV_PIX_FMT_BGR48BE:
+    case AV_PIX_FMT_BGR48LE:
+    //case AV_PIX_FMT_RGB96BE: TODO: AVRational can't fit 2^32-1
+    //case AV_PIX_FMT_RGB96LE:
+    //case AV_PIX_FMT_RGBF16BE: TODO: no support for float16 currently
+    //case AV_PIX_FMT_RGBF16LE:
+    case AV_PIX_FMT_RGBF32BE:
+    case AV_PIX_FMT_RGBF32LE:
+    case AV_PIX_FMT_ARGB:
+    case AV_PIX_FMT_RGBA:
+    case AV_PIX_FMT_ABGR:
+    case AV_PIX_FMT_BGRA:
+    case AV_PIX_FMT_RGBA64BE:
+    case AV_PIX_FMT_RGBA64LE:
+    case AV_PIX_FMT_BGRA64BE:
+    case AV_PIX_FMT_BGRA64LE:
+    //case AV_PIX_FMT_RGBA128BE: TODO: AVRational can't fit 2^32-1
+    //case AV_PIX_FMT_RGBA128LE:
+    case AV_PIX_FMT_RGBAF32BE:
+    case AV_PIX_FMT_RGBAF32LE:
+    case AV_PIX_FMT_GBRP:
+    case AV_PIX_FMT_GBRP9BE:
+    case AV_PIX_FMT_GBRP9LE:
+    case AV_PIX_FMT_GBRP10BE:
+    case AV_PIX_FMT_GBRP10LE:
+    case AV_PIX_FMT_GBRP12BE:
+    case AV_PIX_FMT_GBRP12LE:
+    case AV_PIX_FMT_GBRP14BE:
+    case AV_PIX_FMT_GBRP14LE:
+    case AV_PIX_FMT_GBRP16BE:
+    case AV_PIX_FMT_GBRP16LE:
+    //case AV_PIX_FMT_GBRPF16BE: TODO
+    //case AV_PIX_FMT_GBRPF16LE:
+    case AV_PIX_FMT_GBRPF32BE:
+    case AV_PIX_FMT_GBRPF32LE:
+    case AV_PIX_FMT_GBRAP:
+    case AV_PIX_FMT_GBRAP10BE:
+    case AV_PIX_FMT_GBRAP10LE:
+    case AV_PIX_FMT_GBRAP12BE:
+    case AV_PIX_FMT_GBRAP12LE:
+    case AV_PIX_FMT_GBRAP14BE:
+    case AV_PIX_FMT_GBRAP14LE:
+    case AV_PIX_FMT_GBRAP16BE:
+    case AV_PIX_FMT_GBRAP16LE:
+    //case AV_PIX_FMT_GBRAPF16BE: TODO
+    //case AV_PIX_FMT_GBRAPF16LE:
+    case AV_PIX_FMT_GBRAPF32BE:
+    case AV_PIX_FMT_GBRAPF32LE:
+    case AV_PIX_FMT_GRAY8:
+    case AV_PIX_FMT_GRAY9BE:
+    case AV_PIX_FMT_GRAY9LE:
+    case AV_PIX_FMT_GRAY10BE:
+    case AV_PIX_FMT_GRAY10LE:
+    case AV_PIX_FMT_GRAY12BE:
+    case AV_PIX_FMT_GRAY12LE:
+    case AV_PIX_FMT_GRAY14BE:
+    case AV_PIX_FMT_GRAY14LE:
+    case AV_PIX_FMT_GRAY16BE:
+    case AV_PIX_FMT_GRAY16LE:
+    //case AV_PIX_FMT_GRAYF16BE: TODO
+    //case AV_PIX_FMT_GRAYF16LE:
+    //case AV_PIX_FMT_YAF16BE:
+    //case AV_PIX_FMT_YAF16LE:
+    case AV_PIX_FMT_GRAYF32BE:
+    case AV_PIX_FMT_GRAYF32LE:
+    case AV_PIX_FMT_YAF32BE:
+    case AV_PIX_FMT_YAF32LE:
+    case AV_PIX_FMT_YA8:
+    case AV_PIX_FMT_YA16LE:
+    case AV_PIX_FMT_YA16BE:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) {
+            .elems  = desc->nb_components,
+            .packed = desc->nb_components > 1 && !(desc->flags & AV_PIX_FMT_FLAG_PLANAR),
+        };
+        return 0;
+    }
+
+    return AVERROR(ENOTSUP);
+}
+
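+/* Smallest unsigned pixel type that can hold all bits of a packed pattern */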
+static SwsPixelType get_packed_type(SwsPackOp pack)
+{
+    const int sum = pack.pattern[0] + pack.pattern[1] +
+                    pack.pattern[2] + pack.pattern[3];
+    if (sum > 16)
+        return SWS_PIXEL_U32;
+    else if (sum > 8)
+        return SWS_PIXEL_U16;
+    else
+        return SWS_PIXEL_U8;
+}
+
+#if HAVE_BIGENDIAN
+#  define NATIVE_ENDIAN_FLAG AV_PIX_FMT_FLAG_BE
+#else
+#  define NATIVE_ENDIAN_FLAG 0
+#endif
+
+int ff_sws_decode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    SwsPixelType pixel_type = fmt_pixel_type(fmt);
+    SwsPixelType raw_type = pixel_type;
+    SwsReadWriteOp rw_op;
+    SwsPackOp unpack;
+
+    RET(fmt_read_write(fmt, &rw_op, &unpack));
+    if (unpack.pattern[0])
+        raw_type = get_packed_type(unpack);
+
+    /* TODO: handle subsampled or semipacked input formats */
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_READ,
+        .type = raw_type,
+        .rw   = rw_op,
+    }));
+
+    if ((desc->flags & AV_PIX_FMT_FLAG_BE) != NATIVE_ENDIAN_FLAG) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_SWAP_BYTES,
+            .type = raw_type,
+        }));
+    }
+
+    if (unpack.pattern[0]) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_UNPACK,
+            .type = raw_type,
+            .pack = unpack,
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CONVERT,
+            .type = raw_type,
+            .convert.to = pixel_type,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op      = SWS_OP_SWIZZLE,
+        .type    = pixel_type,
+        .swizzle = swizzle_inv(fmt_swizzle(fmt)),
+    }));
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_RSHIFT,
+        .type = pixel_type,
+        .c.u  = fmt_shift(fmt),
+    }));
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_CLEAR,
+        .type = pixel_type,
+        .c    = fmt_clear(fmt),
+    }));
+
+    return 0;
+}
+
+int ff_sws_encode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    SwsPixelType pixel_type = fmt_pixel_type(fmt);
+    SwsPixelType raw_type = pixel_type;
+    SwsReadWriteOp rw_op;
+    SwsPackOp pack;
+
+    RET(fmt_read_write(fmt, &rw_op, &pack));
+    if (pack.pattern[0])
+        raw_type = get_packed_type(pack);
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_LSHIFT,
+        .type = pixel_type,
+        .c.u  = fmt_shift(fmt),
+    }));
+
+    if (rw_op.elems > desc->nb_components) {
+        /* The format writes an unused alpha channel; clear it explicitly for sanity */
+        av_assert1(!(desc->flags & AV_PIX_FMT_FLAG_ALPHA));
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CLEAR,
+            .type = pixel_type,
+            .c.q4[3] = Q0,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op      = SWS_OP_SWIZZLE,
+        .type    = pixel_type,
+        .swizzle = fmt_swizzle(fmt),
+    }));
+
+    if (pack.pattern[0]) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CONVERT,
+            .type = pixel_type,
+            .convert.to = raw_type,
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_PACK,
+            .type = raw_type,
+            .pack = pack,
+        }));
+    }
+
+    if ((desc->flags & AV_PIX_FMT_FLAG_BE) != NATIVE_ENDIAN_FLAG) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_SWAP_BYTES,
+            .type = raw_type,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_WRITE,
+        .type = raw_type,
+        .rw   = rw_op,
+    }));
+    return 0;
+}
+
+static inline AVRational av_neg_q(AVRational x)
+{
+    return (AVRational) { -x.num, x.den };
+}
+
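+/* Linear operation mapping nominal [0, 1] components to the format's integer
+ * code range; e.g. for 8-bit limited range: Y = 219 y + 16, C = 224 c + 128 */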
+static SwsLinearOp fmt_encode_range(const SwsFormat fmt, bool *incomplete)
+{
+    SwsLinearOp c = { .m = {
+        { Q1, Q0, Q0, Q0, Q0 },
+        { Q0, Q1, Q0, Q0, Q0 },
+        { Q0, Q0, Q1, Q0, Q0 },
+        { Q0, Q0, Q0, Q1, Q0 },
+    }};
+
+    const int depth0 = fmt.desc->comp[0].depth;
+    const int depth1 = fmt.desc->comp[1].depth;
+    const int depth2 = fmt.desc->comp[2].depth;
+    const int depth3 = fmt.desc->comp[3].depth;
+
+    if (fmt.desc->flags & AV_PIX_FMT_FLAG_FLOAT)
+        return c; /* floats are directly output as-is */
+
+    if (fmt.csp == AVCOL_SPC_RGB || (fmt.desc->flags & AV_PIX_FMT_FLAG_XYZ)) {
+        c.m[0][0] = Q((1 << depth0) - 1);
+        c.m[1][1] = Q((1 << depth1) - 1);
+        c.m[2][2] = Q((1 << depth2) - 1);
+    } else if (fmt.range == AVCOL_RANGE_JPEG) {
+        /* Full range YUV */
+        c.m[0][0] = Q((1 << depth0) - 1);
+        if (fmt.desc->nb_components >= 3) {
+            /* This follows the ITU-R convention, which is slightly different
+             * from the JFIF convention. */
+            c.m[1][1] = Q((1 << depth1) - 1);
+            c.m[2][2] = Q((1 << depth2) - 1);
+            c.m[1][4] = Q(1 << (depth1 - 1));
+            c.m[2][4] = Q(1 << (depth2 - 1));
+        }
+    } else {
+        /* Limited range YUV */
+        if (fmt.range == AVCOL_RANGE_UNSPECIFIED)
+            *incomplete = true;
+        c.m[0][0] = Q(219 << (depth0 - 8));
+        c.m[0][4] = Q( 16 << (depth0 - 8));
+        if (fmt.desc->nb_components >= 3) {
+            c.m[1][1] = Q(224 << (depth1 - 8));
+            c.m[2][2] = Q(224 << (depth2 - 8));
+            c.m[1][4] = Q(128 << (depth1 - 8));
+            c.m[2][4] = Q(128 << (depth2 - 8));
+        }
+    }
+
+    if (fmt.desc->flags & AV_PIX_FMT_FLAG_ALPHA) {
+        const bool is_ya = fmt.desc->nb_components == 2;
+        c.m[3][3] = Q((1 << (is_ya ? depth1 : depth3)) - 1);
+    }
+
+    if (fmt.format == AV_PIX_FMT_MONOWHITE) {
+        /* This format is inverted, 0 = white, 1 = black */
+        c.m[0][4] = av_add_q(c.m[0][4], c.m[0][0]);
+        c.m[0][0] = av_neg_q(c.m[0][0]);
+    }
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+static SwsLinearOp fmt_decode_range(const SwsFormat fmt, bool *incomplete)
+{
+    SwsLinearOp c = fmt_encode_range(fmt, incomplete);
+
+    /* Invert main diagonal + offset: x = s * y + k  ==>  y = (x - k) / s */
+    for (int i = 0; i < 4; i++) {
+        c.m[i][i] = av_inv_q(c.m[i][i]);
+        c.m[i][4] = av_mul_q(c.m[i][4], av_neg_q(c.m[i][i]));
+    }
+
+    /* Explicitly initialize alpha for sanity */
+    if (!(fmt.desc->flags & AV_PIX_FMT_FLAG_ALPHA))
+        c.m[3][4] = Q1;
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+static AVRational *generate_bayer_matrix(const int size_log2)
+{
+    const int size = 1 << size_log2;
+    const int num_entries = size * size;
+    AVRational *m = av_refstruct_allocz(sizeof(*m) * num_entries);
+    av_assert1(size_log2 < 16);
+    if (!m)
+        return NULL;
+
+    /* Start with a 1x1 matrix */
+    m[0] = Q0;
+
+    /* Generate three copies of the current, appropriately scaled and offset */
+    for (int sz = 1; sz < size; sz <<= 1) {
+        const int den = 4 * sz * sz;
+        for (int y = 0; y < sz; y++) {
+            for (int x = 0; x < sz; x++) {
+                const AVRational cur = m[y * size + x];
+                m[(y + sz) * size + x + sz] = av_add_q(cur, av_make_q(1, den));
+                m[(y     ) * size + x + sz] = av_add_q(cur, av_make_q(2, den));
+                m[(y + sz) * size + x     ] = av_add_q(cur, av_make_q(3, den));
+            }
+        }
+    }
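+
+    /* For example, for size == 4 the loop above produces the classic 4x4
+     * Bayer matrix (in units of 1/16, before the rounding bias added below):
+     *
+     *     0   8   2  10
+     *    12   4  14   6
+     *     3  11   1   9
+     *    15   7  13   5
+     */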
+
+    /**
+     * To correctly round, we need to evenly distribute the result on [0, 1),
+     * giving an average value of 1/2.
+     *
+     * After the above construction, we have a matrix with average value:
+     *   [ 0/N + 1/N + 2/N + ... (N-1)/N ] / N = (N-1)/(2N)
+     * where N = size * size is the total number of entries.
+     *
+     * To make the average value equal to 1/2 = N/(2N), add a bias of 1/(2N).
+     */
+    for (int i = 0; i < num_entries; i++)
+        m[i] = av_add_q(m[i], av_make_q(1, 2 * num_entries));
+
+    return m;
+}
+
+static bool trc_is_hdr(enum AVColorTransferCharacteristic trc)
+{
+    switch (trc) {
+    case AVCOL_TRC_LOG:
+    case AVCOL_TRC_LOG_SQRT:
+    case AVCOL_TRC_SMPTEST2084:
+    case AVCOL_TRC_ARIB_STD_B67:
+        return true;
+    default:
+        static_assert(AVCOL_TRC_NB == 19, "Update this list when adding TRCs");
+        return false;
+    }
+}
+
+static int fmt_dither(SwsContext *ctx, SwsOpList *ops,
+                      const SwsPixelType type, const SwsFormat fmt)
+{
+    SwsDither mode = ctx->dither;
+    SwsDitherOp dither;
+
+    if (mode == SWS_DITHER_AUTO) {
+        /* Visual threshold of perception: 12 bits for SDR, 14 bits for HDR */
+        const int jnd_bits = trc_is_hdr(fmt.color.trc) ? 14 : 12;
+        const int bpc = fmt.desc->comp[0].depth;
+        mode = bpc >= jnd_bits ? SWS_DITHER_NONE : SWS_DITHER_BAYER;
+    }
+
+    switch (mode) {
+    case SWS_DITHER_NONE:
+        if (ctx->flags & SWS_ACCURATE_RND) {
+            /* Add constant 0.5 for correct rounding */
+            AVRational *bias = av_refstruct_allocz(sizeof(*bias));
+            if (!bias)
+                return AVERROR(ENOMEM);
+            *bias = (AVRational) {1, 2};
+            return ff_sws_op_list_append(ops, &(SwsOp) {
+                .op   = SWS_OP_DITHER,
+                .type = type,
+                .dither.matrix = bias,
+            });
+        } else {
+            return 0; /* No-op */
+        }
+    case SWS_DITHER_BAYER:
+        /* Hardcode a 16x16 matrix for now; in theory we could adjust this
+         * based on the expected level of precision in the output, since lower
+         * bit depth outputs can make do with smaller dither matrices; however,
+         * in practice we probably want to use error diffusion for such low
+         * bit depths anyway */
+        dither.size_log2 = 4;
+        dither.matrix = generate_bayer_matrix(dither.size_log2);
+        if (!dither.matrix)
+            return AVERROR(ENOMEM);
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .op     = SWS_OP_DITHER,
+            .type   = type,
+            .dither = dither,
+        });
+    case SWS_DITHER_ED:
+    case SWS_DITHER_A_DITHER:
+    case SWS_DITHER_X_DITHER:
+        return AVERROR(ENOTSUP);
+
+    case SWS_DITHER_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid dither mode");
+    return AVERROR(EINVAL);
+}
+
+static inline SwsLinearOp
+linear_mat3(const AVRational m00, const AVRational m01, const AVRational m02,
+            const AVRational m10, const AVRational m11, const AVRational m12,
+            const AVRational m20, const AVRational m21, const AVRational m22)
+{
+    SwsLinearOp c = {{
+        { m00, m01, m02, Q0, Q0 },
+        { m10, m11, m12, Q0, Q0 },
+        { m20, m21, m22, Q0, Q0 },
+        {  Q0,  Q0,  Q0, Q1, Q0 },
+    }};
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+int ff_sws_decode_colors(SwsContext *ctx, SwsPixelType type,
+                         SwsOpList *ops, const SwsFormat fmt, bool *incomplete)
+{
+    const AVLumaCoefficients *c = av_csp_luma_coeffs_from_avcsp(fmt.csp);
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op         = SWS_OP_CONVERT,
+        .type       = fmt_pixel_type(fmt.format),
+        .convert.to = type,
+    }));
+
+    /* Decode pixel format into standardized range */
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .type = type,
+        .op   = SWS_OP_LINEAR,
+        .lin  = fmt_decode_range(fmt, incomplete),
+    }));
+
+    /* Final step, decode colorspace */
+    switch (fmt.csp) {
+    case AVCOL_SPC_RGB:
+        return 0;
+    case AVCOL_SPC_UNSPECIFIED:
+        c = av_csp_luma_coeffs_from_avcsp(AVCOL_SPC_BT470BG);
+        *incomplete = true;
+        /* fall through */
+    case AVCOL_SPC_FCC:
+    case AVCOL_SPC_BT470BG:
+    case AVCOL_SPC_SMPTE170M:
+    case AVCOL_SPC_BT709:
+    case AVCOL_SPC_SMPTE240M:
+    case AVCOL_SPC_BT2020_NCL: {
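+        /* Derived from the standard YCbCr decoding equations:
+         *   R = Y + 2 (1 - cr) Cr
+         *   G = Y - (cb / cg) 2 (1 - cb) Cb - (cr / cg) 2 (1 - cr) Cr
+         *   B = Y + 2 (1 - cb) Cb
+         */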
+        AVRational crg = av_sub_q(Q0, av_div_q(c->cr, c->cg));
+        AVRational cbg = av_sub_q(Q0, av_div_q(c->cb, c->cg));
+        AVRational m02 = av_mul_q(Q(2), av_sub_q(Q1, c->cr));
+        AVRational m21 = av_mul_q(Q(2), av_sub_q(Q1, c->cb));
+        AVRational m11 = av_mul_q(cbg, m21);
+        AVRational m12 = av_mul_q(crg, m02);
+
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                Q1,  Q0, m02,
+                Q1, m11, m12,
+                Q1, m21,  Q0
+            ),
+        });
+    }
+
+    case AVCOL_SPC_YCGCO:
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                Q1, Q(-1), Q( 1),
+                Q1, Q( 1), Q( 0),
+                Q1, Q(-1), Q(-1)
+            ),
+        });
+
+    case AVCOL_SPC_BT2020_CL:
+    case AVCOL_SPC_SMPTE2085:
+    case AVCOL_SPC_CHROMA_DERIVED_NCL:
+    case AVCOL_SPC_CHROMA_DERIVED_CL:
+    case AVCOL_SPC_ICTCP:
+    case AVCOL_SPC_IPT_C2:
+    case AVCOL_SPC_YCGCO_RE:
+    case AVCOL_SPC_YCGCO_RO:
+        return AVERROR(ENOTSUP);
+
+    case AVCOL_SPC_RESERVED:
+        return AVERROR(EINVAL);
+
+    case AVCOL_SPC_NB:
+        break;
+    }
+
+    av_assert0(!"Corrupt AVColorSpace value?");
+    return AVERROR(EINVAL);
+}
+
+int ff_sws_encode_colors(SwsContext *ctx, SwsPixelType type,
+                         SwsOpList *ops, const SwsFormat fmt, bool *incomplete)
+{
+    const AVLumaCoefficients *c = av_csp_luma_coeffs_from_avcsp(fmt.csp);
+
+    switch (fmt.csp) {
+    case AVCOL_SPC_RGB:
+        break;
+    case AVCOL_SPC_UNSPECIFIED:
+        c = av_csp_luma_coeffs_from_avcsp(AVCOL_SPC_BT470BG);
+        *incomplete = true;
+        /* fall through */
+    case AVCOL_SPC_FCC:
+    case AVCOL_SPC_BT470BG:
+    case AVCOL_SPC_SMPTE170M:
+    case AVCOL_SPC_BT709:
+    case AVCOL_SPC_SMPTE240M:
+    case AVCOL_SPC_BT2020_NCL: {
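+        /* Derived from the standard YCbCr encoding equations:
+         *   Y  = cr R + cg G + cb B
+         *   Cb = (B - Y) / (2 (1 - cb))
+         *   Cr = (R - Y) / (2 (1 - cr))
+         */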
+        AVRational cb1 = av_sub_q(c->cb, Q1);
+        AVRational cr1 = av_sub_q(c->cr, Q1);
+        AVRational m20 = av_make_q(1, 2);
+        AVRational m10 = av_mul_q(m20, av_div_q(c->cr, cb1));
+        AVRational m11 = av_mul_q(m20, av_div_q(c->cg, cb1));
+        AVRational m21 = av_mul_q(m20, av_div_q(c->cg, cr1));
+        AVRational m22 = av_mul_q(m20, av_div_q(c->cb, cr1));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                c->cr, c->cg, c->cb,
+                m10,     m11,   m20,
+                m20,     m21,   m22
+            ),
+        }));
+        break;
+    }
+
+    case AVCOL_SPC_YCGCO:
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                av_make_q( 1, 4), av_make_q(1, 2), av_make_q( 1, 4),
+                av_make_q( 1, 2), av_make_q(0, 1), av_make_q(-1, 2),
+                av_make_q(-1, 4), av_make_q(1, 2), av_make_q(-1, 4)
+            ),
+        }));
+        break;
+
+    case AVCOL_SPC_BT2020_CL:
+    case AVCOL_SPC_SMPTE2085:
+    case AVCOL_SPC_CHROMA_DERIVED_NCL:
+    case AVCOL_SPC_CHROMA_DERIVED_CL:
+    case AVCOL_SPC_ICTCP:
+    case AVCOL_SPC_IPT_C2:
+    case AVCOL_SPC_YCGCO_RE:
+    case AVCOL_SPC_YCGCO_RO:
+        return AVERROR(ENOTSUP);
+
+    case AVCOL_SPC_RESERVED:
+    case AVCOL_SPC_NB:
+        return AVERROR(EINVAL);
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .type = type,
+        .op   = SWS_OP_LINEAR,
+        .lin  = fmt_encode_range(fmt, incomplete),
+    }));
+
+    if (!(fmt.desc->flags & AV_PIX_FMT_FLAG_FLOAT)) {
+        SwsConst range = {0};
+
+        const bool is_ya = fmt.desc->nb_components == 2;
+        for (int i = 0; i < fmt.desc->nb_components; i++) {
+            /* Clamp to legal pixel range */
+            const int idx = i * (is_ya ? 3 : 1);
+            range.q4[idx] = Q((1 << fmt.desc->comp[i].depth) - 1);
+        }
+
+        RET(fmt_dither(ctx, ops, type, fmt));
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_MAX,
+            .type = type,
+            .c.q4 = { Q0, Q0, Q0, Q0 },
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_MIN,
+            .type = type,
+            .c    = range,
+        }));
+    }
+
+    return ff_sws_op_list_append(ops, &(SwsOp) {
+        .type       = type,
+        .op         = SWS_OP_CONVERT,
+        .convert.to = fmt_pixel_type(fmt.format),
+    });
+}
diff --git a/libswscale/format.h b/libswscale/format.h
index 3b6d745159..3475d31e90 100644
--- a/libswscale/format.h
+++ b/libswscale/format.h
@@ -134,4 +134,27 @@ int ff_test_fmt(const SwsFormat *fmt, int output);
 /* Returns true if the formats are incomplete, false otherwise */
 bool ff_infer_colors(SwsColor *src, SwsColor *dst);
 
+typedef struct SwsOpList SwsOpList;
+typedef enum SwsPixelType SwsPixelType;
+
+/**
+ * Append a set of operations for decoding/encoding raw pixels. This will
+ * handle input read/write, swizzling, shifting and byte swapping.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_decode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt);
+int ff_sws_encode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt);
+
+/**
+ * Append a set of operations for transforming decoded pixel values to/from
+ * normalized RGB in the specified gamut and pixel type.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_decode_colors(SwsContext *ctx, SwsPixelType type, SwsOpList *ops,
+                         const SwsFormat fmt, bool *incomplete);
+int ff_sws_encode_colors(SwsContext *ctx, SwsPixelType type, SwsOpList *ops,
+                         const SwsFormat fmt, bool *incomplete);
+
 #endif /* SWSCALE_FORMAT_H */
-- 
2.49.0
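
For reference, the chroma rows built from cb1/cr1 in ff_sws_encode_colors
above follow directly from the standard YCbCr definitions. With luma weights
Kr + Kg + Kb = 1:

    Y  = Kr*R + Kg*G + Kb*B
    Cb = (B - Y) / (2*(1 - Kb))
    Cr = (R - Y) / (2*(1 - Kr))

Expanding Cb gives (-Kr*R - Kg*G + (1 - Kb)*B) / (2*(1 - Kb)); writing
cb1 = Kb - 1 and m20 = 1/2 turns this row into (m20*Kr/cb1, m20*Kg/cb1, m20),
i.e. exactly (m10, m11, m20) as computed. The Cr row (m20, m21, m22) follows
analogously with cr1 = Kr - 1.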


^ permalink raw reply	[flat|nested] 33+ messages in thread

* [FFmpeg-devel] [PATCH 17/17] swscale/graph: allow experimental use of new format handler
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (15 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 16/17] swscale/format: add new format decode/encode logic Niklas Haas
@ 2025-04-26 17:41 ` Niklas Haas
  2025-04-26 22:22 ` [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 17:41 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

\o/
---
 libswscale/graph.c | 77 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 75 insertions(+), 2 deletions(-)

diff --git a/libswscale/graph.c b/libswscale/graph.c
index b921b7ec02..947da45aa3 100644
--- a/libswscale/graph.c
+++ b/libswscale/graph.c
@@ -34,6 +34,7 @@
 #include "lut3d.h"
 #include "swscale_internal.h"
 #include "graph.h"
+#include "ops.h"
 
 static int pass_alloc_output(SwsPass *pass)
 {
@@ -453,6 +454,78 @@ static int add_legacy_sws_pass(SwsGraph *graph, SwsFormat src, SwsFormat dst,
     return 0;
 }
 
+/*********************
+ * Format conversion *
+ *********************/
+
+static int add_convert_pass(SwsGraph *graph, SwsFormat src, SwsFormat dst,
+                            SwsPass *input, SwsPass **output)
+{
+    const SwsPixelType type = SWS_PIXEL_F32;
+
+    SwsContext *ctx = graph->ctx;
+    SwsOpList *ops = NULL;
+    int ret = AVERROR(ENOTSUP);
+
+    /* Mark the entire new ops infrastructure as experimental for now */
+    if (!(ctx->flags & SWS_EXPERIMENTAL))
+        goto fail;
+
+    /* The new format conversion layer cannot scale for now */
+    if (src.width != dst.width || src.height != dst.height ||
+        src.desc->log2_chroma_h || src.desc->log2_chroma_w ||
+        dst.desc->log2_chroma_h || dst.desc->log2_chroma_w)
+        goto fail;
+
+    ops = ff_sws_op_list_alloc();
+    if (!ops)
+        return AVERROR(ENOMEM);
+
+    ret = ff_sws_decode_pixfmt(ops, src.format);
+    if (ret < 0)
+        goto fail;
+    ret = ff_sws_decode_colors(ctx, type, ops, src, &graph->incomplete);
+    if (ret < 0)
+        goto fail;
+    ret = ff_sws_encode_colors(ctx, type, ops, dst, &graph->incomplete);
+    if (ret < 0)
+        goto fail;
+    ret = ff_sws_encode_pixfmt(ops, dst.format);
+    if (ret < 0)
+        goto fail;
+
+    av_log(ctx, AV_LOG_VERBOSE, "Conversion pass for %s -> %s:\n",
+           av_get_pix_fmt_name(src.format), av_get_pix_fmt_name(dst.format));
+
+    av_log(ctx, AV_LOG_DEBUG, "Unoptimized operation list:\n");
+    ff_sws_op_list_print(ctx, AV_LOG_DEBUG, ops);
+    av_log(ctx, AV_LOG_DEBUG, "Optimized operation list:\n");
+
+    ff_sws_op_list_optimize(ops);
+    if (ops->num_ops == 0) {
+        av_log(ctx, AV_LOG_VERBOSE, "  optimized into memcpy\n");
+        ff_sws_op_list_free(&ops);
+        *output = input;
+        return 0;
+    }
+
+    ff_sws_op_list_print(ctx, AV_LOG_VERBOSE, ops);
+
+    ret = ff_sws_compile_pass(graph, ops, 0, dst, input, output);
+    if (ret < 0)
+        goto fail;
+
+    ret = 0;
+    /* fall through */
+
+fail:
+    ff_sws_op_list_free(&ops);
+    if (ret == AVERROR(ENOTSUP))
+        return add_legacy_sws_pass(graph, src, dst, input, output);
+    return ret;
+}
+
+
 /**************************
  * Gamut and tone mapping *
  **************************/
@@ -522,7 +595,7 @@ static int adapt_colors(SwsGraph *graph, SwsFormat src, SwsFormat dst,
     if (fmt_in != src.format) {
         SwsFormat tmp = src;
         tmp.format = fmt_in;
-        ret = add_legacy_sws_pass(graph, src, tmp, input, &input);
+        ret = add_convert_pass(graph, src, tmp, input, &input);
         if (ret < 0)
             return ret;
     }
@@ -564,7 +637,7 @@ static int init_passes(SwsGraph *graph)
     src.color  = dst.color;
 
     if (!ff_fmt_equal(&src, &dst)) {
-        ret = add_legacy_sws_pass(graph, src, dst, pass, &pass);
+        ret = add_convert_pass(graph, src, dst, pass, &pass);
         if (ret < 0)
             return ret;
     }
-- 
2.49.0
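
A minimal usage sketch of the whole stack, assuming the public SwsContext
fields and the SWS_EXPERIMENTAL flag from patch 7 (illustrative, untested):

    SwsContext *sws = sws_alloc_context();
    if (!sws)
        return AVERROR(ENOMEM);

    /* opt in to the new ops-based conversion path */
    sws->flags |= SWS_EXPERIMENTAL;

    /* src and dst must have equal dimensions and no chroma subsampling
     * change, or add_convert_pass() falls back to the legacy path */
    ret = sws_scale_frame(sws, dst, src);
    sws_free_context(&sws);

Here `src` and `dst` are AVFrames with format/width/height already set; the
names are placeholders.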


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC]
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (16 preceding siblings ...)
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 17/17] swscale/graph: allow experimental use of new format handler Niklas Haas
@ 2025-04-26 22:22 ` Niklas Haas
  2025-05-02 17:51 ` Niklas Haas
  2025-05-16 11:09 ` Niklas Haas
  19 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-04-26 22:22 UTC (permalink / raw)
  To: ffmpeg-devel

On Sat, 26 Apr 2025 19:41:04 +0200 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> Hi all,
>
> After extensive amounts of refactoring and iteration on the design and API,
> and the implementation of an x86 SIMD backend, I'm happy to present the
> revised version of my ongoing swscale rewrite. Now with 100% less reliance on
> compiler autovectorization.

Small heads-up: the current version of the code can be found at

https://github.com/haasn/FFmpeg/commits/swscale6

(The version I sent here had a small bug in the dither calculation code;
I've fixed it at the above link.)

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 11/17] swscale/x86: add SIMD backend
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 11/17] swscale/x86: add SIMD backend Niklas Haas
@ 2025-04-29 13:00   ` Michael Niedermayer
  2025-04-30 16:24     ` Niklas Haas
  0 siblings, 1 reply; 33+ messages in thread
From: Michael Niedermayer @ 2025-04-29 13:00 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


On Sat, Apr 26, 2025 at 07:41:15PM +0200, Niklas Haas wrote:
> From: Niklas Haas <git@haasn.dev>
> 
> This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
> floating point operations. While this is not yet 100% coverage, it's good
> enough for the vast majority of formats out there.
> 
> Of special note is the packed shuffle solver, which can reduce any compatible
> series of operations down to a single pshufb loop. This takes care of any sort
> of packed swizzle, but also e.g. grayscale to packed RGB expansion, RGB bit
> depth conversions, endianness swapping and so on.
> ---
>  libswscale/ops.c              |   4 +
>  libswscale/x86/Makefile       |   3 +
>  libswscale/x86/ops.c          | 735 ++++++++++++++++++++++++++++
>  libswscale/x86/ops_common.asm | 208 ++++++++
>  libswscale/x86/ops_float.asm  | 376 +++++++++++++++
>  libswscale/x86/ops_int.asm    | 882 ++++++++++++++++++++++++++++++++++
>  6 files changed, 2208 insertions(+)
>  create mode 100644 libswscale/x86/ops.c
>  create mode 100644 libswscale/x86/ops_common.asm
>  create mode 100644 libswscale/x86/ops_float.asm
>  create mode 100644 libswscale/x86/ops_int.asm

breaks build:

X86ASM  libswscale/x86/ops_float.o
libswscale/x86/ops_common.asm:180: error: unknown preprocessor directive `%rmacro'
libswscale/x86/ops_common.asm:180: error: label or instruction expected at start of line
libswscale/x86/ops_common.asm:181: error: `%1': not in a macro call
libswscale/x86/ops_common.asm:181: error: expression syntax error
libswscale/x86/ops_common.asm:182: error: `%2': not in a macro call
libswscale/x86/ops_common.asm:184: error: `%endmacro': not defining a macro
libswscale/x86/ops_common.asm:187: error: unknown preprocessor directive `%rmacro'
libswscale/x86/ops_common.asm:187: error: label or instruction expected at start of line
libswscale/x86/ops_common.asm:188: error: `%1': not in a macro call
libswscale/x86/ops_common.asm:188: error: expression syntax error
libswscale/x86/ops_common.asm:189: error: `%2': not in a macro call
libswscale/x86/ops_common.asm:191: error: `%endmacro': not defining a macro
libswscale/x86/ops_float.asm:369: error: parser: instruction expected
libswscale/x86/ops_common.asm:99: ... from macro `decl_common_patterns' defined here
libswscale/x86/ops_common.asm:91: ... from macro `decl_pattern' defined here
libswscale/x86/ops_float.asm:31: ... from macro `conv8to32f' defined here
libswscale/x86/ops_float.asm:369: error: parser: instruction expected
libswscale/x86/ops_common.asm:99: ... from macro `decl_common_patterns' defined here
libswscale/x86/ops_common.asm:91: ... from macro `decl_pattern' defined here
libswscale/x86/ops_float.asm:32: ... from macro `conv8to32f' defined here
libswscale/x86/ops_float.asm:369: error: parser: instruction expected
libswscale/x86/ops_common.asm:99: ... from macro `decl_common_patterns' defined here
libswscale/x86/ops_common.asm:91: ... from macro `decl_pattern' defined here
[snipped a long list of similar looking errors]

thx

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Its not that you shouldnt use gotos but rather that you should write
readable code and code with gotos often but not always is less readable


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 11/17] swscale/x86: add SIMD backend
  2025-04-29 13:00   ` Michael Niedermayer
@ 2025-04-30 16:24     ` Niklas Haas
  0 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-04-30 16:24 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Tue, 29 Apr 2025 15:00:50 +0200 Michael Niedermayer <michael@niedermayer.cc> wrote:
> On Sat, Apr 26, 2025 at 07:41:15PM +0200, Niklas Haas wrote:
> > From: Niklas Haas <git@haasn.dev>
> >
> > This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
> > floating point operations. While this is not yet 100% coverage, it's good
> > enough for the vast majority of formats out there.
> >
> > Of special note is the packed shuffle solver, which can reduce any compatible
> > series of operations down to a single pshufb loop. This takes care of any sort
> > of packed swizzle, but also e.g. grayscale to packed RGB expansion, RGB bit
> > depth conversions, endianness swapping and so on.
> > ---
> >  libswscale/ops.c              |   4 +
> >  libswscale/x86/Makefile       |   3 +
> >  libswscale/x86/ops.c          | 735 ++++++++++++++++++++++++++++
> >  libswscale/x86/ops_common.asm | 208 ++++++++
> >  libswscale/x86/ops_float.asm  | 376 +++++++++++++++
> >  libswscale/x86/ops_int.asm    | 882 ++++++++++++++++++++++++++++++++++
> >  6 files changed, 2208 insertions(+)
> >  create mode 100644 libswscale/x86/ops.c
> >  create mode 100644 libswscale/x86/ops_common.asm
> >  create mode 100644 libswscale/x86/ops_float.asm
> >  create mode 100644 libswscale/x86/ops_int.asm
>
> breaks build:
>
> X86ASM  libswscale/x86/ops_float.o
> libswscale/x86/ops_common.asm:180: error: unknown preprocessor directive `%rmacro'
> libswscale/x86/ops_common.asm:180: error: label or instruction expected at start of line
> libswscale/x86/ops_common.asm:181: error: `%1': not in a macro call
> libswscale/x86/ops_common.asm:181: error: expression syntax error
> libswscale/x86/ops_common.asm:182: error: `%2': not in a macro call
> libswscale/x86/ops_common.asm:184: error: `%endmacro': not defining a macro
> libswscale/x86/ops_common.asm:187: error: unknown preprocessor directive `%rmacro'
> libswscale/x86/ops_common.asm:187: error: label or instruction expected at start of line
> libswscale/x86/ops_common.asm:188: error: `%1': not in a macro call
> libswscale/x86/ops_common.asm:188: error: expression syntax error
> libswscale/x86/ops_common.asm:189: error: `%2': not in a macro call
> libswscale/x86/ops_common.asm:191: error: `%endmacro': not defining a macro
> libswscale/x86/ops_float.asm:369: error: parser: instruction expected
> libswscale/x86/ops_common.asm:99: ... from macro `decl_common_patterns' defined here
> libswscale/x86/ops_common.asm:91: ... from macro `decl_pattern' defined here
> libswscale/x86/ops_float.asm:31: ... from macro `conv8to32f' defined here
> libswscale/x86/ops_float.asm:369: error: parser: instruction expected
> libswscale/x86/ops_common.asm:99: ... from macro `decl_common_patterns' defined here
> libswscale/x86/ops_common.asm:91: ... from macro `decl_pattern' defined here
> libswscale/x86/ops_float.asm:32: ... from macro `conv8to32f' defined here
> libswscale/x86/ops_float.asm:369: error: parser: instruction expected
> libswscale/x86/ops_common.asm:99: ... from macro `decl_common_patterns' defined here
> libswscale/x86/ops_common.asm:91: ... from macro `decl_pattern' defined here
> [snipped a long list of similar looking errors]

Should be fixed (on my GH branch); it turns out %rmacro (the recursive-macro
variant, which not every assembler understands) was not actually necessary
here, so I just replaced it with %macro.

>
> thx
>
> [...]
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> Its not that you shouldnt use gotos but rather that you should write
> readable code and code with gotos often but not always is less readable

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 16/17] swscale/format: add new format decode/encode logic
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 16/17] swscale/format: add new format decode/encode logic Niklas Haas
@ 2025-05-02 14:10   ` Michael Niedermayer
  2025-05-02 14:36     ` Niklas Haas
  0 siblings, 1 reply; 33+ messages in thread
From: Michael Niedermayer @ 2025-05-02 14:10 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


On Sat, Apr 26, 2025 at 07:41:20PM +0200, Niklas Haas wrote:
> From: Niklas Haas <git@haasn.dev>
> 
> This patch adds format handling code for the new operations. This entails
> fully decoding a format to standardized RGB, and the inverse.
> 
> Handling it this way means we can always guarantee that a conversion path
> exists from A to B without having to explicitly cover logic for each path;
> and choosing RGB instead of YUV as the intermediate (as was done in swscale
> v1) is more flexible with regards to enabling further operations such as
> primaries conversions, linear scaling, etc.
> 
> In the case of a YUV->YUV transform, the redundant matrix multiplication will
> be canceled out anyway.
> ---
>  libswscale/format.c | 925 ++++++++++++++++++++++++++++++++++++++++++++
>  libswscale/format.h |  23 ++
>  2 files changed, 948 insertions(+)

this or rather the equivalent from your repo breaks here:

In file included from libswscale/ops.h:24,
                 from libswscale/ops_internal.h:26,
                 from libswscale/format.c:28:
libswscale/format.c: In function ‘trc_is_hdr’:
libswscale/format.c:1249:9: error: a label can only be part of a statement and a declaration is not a statement
 1249 |         static_assert(AVCOL_TRC_NB == 19, "Update this list when adding TRCs");
      |         ^~~~~~~~~~~~~
make: *** [ffbuild/common.mak:81: libswscale/format.o] Error 1
make: *** Waiting for unfinished jobs....


[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Some people wanted to paint the bikeshed green, some blue and some pink.
People argued and fought, when they finally agreed, only rust was left.


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 16/17] swscale/format: add new format decode/encode logic
  2025-05-02 14:10   ` Michael Niedermayer
@ 2025-05-02 14:36     ` Niklas Haas
  0 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-05-02 14:36 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Fri, 02 May 2025 16:10:57 +0200 Michael Niedermayer <michael@niedermayer.cc> wrote:
> On Sat, Apr 26, 2025 at 07:41:20PM +0200, Niklas Haas wrote:
> > From: Niklas Haas <git@haasn.dev>
> >
> > This patch adds format handling code for the new operations. This entails
> > fully decoding a format to standardized RGB, and the inverse.
> >
> > Handling it this way means we can always guarantee that a conversion path
> > exists from A to B without having to explicitly cover logic for each path;
> > and choosing RGB instead of YUV as the intermediate (as was done in swscale
> > v1) is more flexible with regards to enabling further operations such as
> > primaries conversions, linear scaling, etc.
> >
> > In the case of a YUV->YUV transform, the redundant matrix multiplication will
> > be canceled out anyway.
> > ---
> >  libswscale/format.c | 925 ++++++++++++++++++++++++++++++++++++++++++++
> >  libswscale/format.h |  23 ++
> >  2 files changed, 948 insertions(+)
>
> this or rather the equivalent from your repo breaks here:
>
> In file included from libswscale/ops.h:24,
>                  from libswscale/ops_internal.h:26,
>                  from libswscale/format.c:28:
> libswscale/format.c: In function ‘trc_is_hdr’:
> libswscale/format.c:1249:9: error: a label can only be part of a statement and a declaration is not a statement
>  1249 |         static_assert(AVCOL_TRC_NB == 19, "Update this list when adding TRCs");
>       |         ^~~~~~~~~~~~~
> make: *** [ffbuild/common.mak:81: libswscale/format.o] Error 1
> make: *** Waiting for unfinished jobs....

Fixed (by moving the static_assert out of the switch/case).
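
For the record, a minimal sketch of the failure mode (illustrative, not the
exact code from the branch): before C23, a case label must be followed by a
statement, and static_assert() is a declaration, hence the diagnostic quoted
above. Hoisting the assertion out of the switch makes the label legal again:

    /* breaks on older compilers: declaration directly after a label */
    switch (trc) {
    case AVCOL_TRC_SMPTE2084:
        static_assert(AVCOL_TRC_NB == 19, "Update this list when adding TRCs");
        return true;
    /* ... */
    }

    /* fixed: assertion hoisted out of the switch */
    static_assert(AVCOL_TRC_NB == 19, "Update this list when adding TRCs");
    switch (trc) {
    case AVCOL_TRC_SMPTE2084:
        return true;
    /* ... */
    }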

>
>
> [...]
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> Some people wanted to paint the bikeshed green, some blue and some pink.
> People argued and fought, when they finally agreed, only rust was left.

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 10/17] swscale/ops_backend: add reference backend based on C templates
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 10/17] swscale/ops_backend: add reference backend based on C templates Niklas Haas
@ 2025-05-02 15:06   ` Michael Niedermayer
  2025-05-08 12:24     ` Niklas Haas
  0 siblings, 1 reply; 33+ messages in thread
From: Michael Niedermayer @ 2025-05-02 15:06 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


On Sat, Apr 26, 2025 at 07:41:14PM +0200, Niklas Haas wrote:
> From: Niklas Haas <git@haasn.dev>
> 
> This will serve as a reference for the SIMD backends to come. That said,
> with auto-vectorization enabled, the performance of this is not atrocious, and
> can often beat even the old SIMD.
> 
> In theory, we can dramatically speed it up by using GCC vectors instead of
> arrays, but the performance gains from this are too dependent on exact GCC
> > versions and flags, so in practice it's not a substitute for a SIMD
> implementation.
> ---
>  libswscale/Makefile          |   6 +
>  libswscale/ops.c             |   3 +
>  libswscale/ops.h             |   2 -
>  libswscale/ops_backend.c     | 101 ++++++
>  libswscale/ops_backend.h     | 181 +++++++++++
>  libswscale/ops_tmpl_common.c | 176 ++++++++++
>  libswscale/ops_tmpl_float.c  | 255 +++++++++++++++
>  libswscale/ops_tmpl_int.c    | 609 +++++++++++++++++++++++++++++++++++
>  8 files changed, 1331 insertions(+), 2 deletions(-)
>  create mode 100644 libswscale/ops_backend.c
>  create mode 100644 libswscale/ops_backend.h
>  create mode 100644 libswscale/ops_tmpl_common.c
>  create mode 100644 libswscale/ops_tmpl_float.c
>  create mode 100644 libswscale/ops_tmpl_int.c

arm breaker

CC      libswscale/ops_backend.o
In file included from src/libswscale/ops_backend.c:21:0:
src/libswscale/ops_tmpl_int.c:492:12: error: initializer element is not constant
         fn(op_read_planar1),
            ^
src/libswscale/ops_backend.h:78:27: note: in definition of macro ‘bitfn2’
 #define bitfn2(name, ext) name ## _ ## ext
                           ^~~~
src/libswscale/ops_backend.h:82:19: note: in expansion of macro ‘bitfn’
 #define fn(name)  bitfn(name, FN_SUFFIX)
                   ^~~~~
src/libswscale/ops_tmpl_int.c:492:9: note: in expansion of macro ‘fn’
         fn(op_read_planar1),
         ^~
src/libswscale/ops_tmpl_int.c:492:12: note: (near initialization for ‘op_table_int_u8.entries[0]’)
         fn(op_read_planar1),
            ^
src/libswscale/ops_backend.h:78:27: note: in definition of macro ‘bitfn2’
 #define bitfn2(name, ext) name ## _ ## ext
                           ^~~~
src/libswscale/ops_backend.h:82:19: note: in expansion of macro ‘bitfn’
 #define fn(name)  bitfn(name, FN_SUFFIX)
                   ^~~~~
src/libswscale/ops_tmpl_int.c:492:9: note: in expansion of macro ‘fn’
         fn(op_read_planar1),
         ^~
src/libswscale/ops_tmpl_int.c:493:12: error: initializer element is not constant
         fn(op_read_planar2),
            ^
src/libswscale/ops_backend.h:78:27: note: in definition of macro ‘bitfn2’
 #define bitfn2(name, ext) name ## _ ## ext
                           ^~~~
src/libswscale/ops_backend.h:82:19: note: in expansion of macro ‘bitfn’
 #define fn(name)  bitfn(name, FN_SUFFIX)
                   ^~~~~
src/libswscale/ops_tmpl_int.c:493:9: note: in expansion of macro ‘fn’
         fn(op_read_planar2),
         ^~
src/libswscale/ops_tmpl_int.c:493:12: note: (near initialization for ‘op_table_int_u8.entries[1]’)

................

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

In a rich man's house there is no place to spit but his face.
-- Diogenes of Sinope


^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC]
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (17 preceding siblings ...)
  2025-04-26 22:22 ` [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
@ 2025-05-02 17:51 ` Niklas Haas
  2025-05-16 11:09 ` Niklas Haas
  19 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-05-02 17:51 UTC (permalink / raw)
  To: ffmpeg-devel

On Sat, 26 Apr 2025 19:41:04 +0200 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> Hi all,
>
> After extensive amounts of refactoring and iteration on the design and API,
> and the implementation of an x86 SIMD backend, I'm happy to present the
> revised version of my ongoing swscale rewrite. Now with 100% less reliance on
> compiler autovectorization.
>
> As before, I recommend (re)reading the design document to understand the
> motivation, structure and implementation details of this rewrite. At this
> point, I expect the major API and internal organization decisions to remain
> stable.
>
> I will preface with some benchmark figures, on my (new) AMD Ryzen 9 9950X3D:
>
> All formats:
>   - single thread: Overall speedup=2.109x faster, min=0.018x max=40.309x
>   - multi thread:  Overall speedup=2.607x faster, min=0.112x max=254.738x
>
> "Common" formats: (referenced >100 times in FFmpeg source code)
>   - single thread: Overall speedup=2.797x faster, min=0.408x max=16.514x
>   - multi thread:  Overall speedup=2.870x faster, min=0.715x max=21.983x

Small update: I noticed that one code path was accidentally not enabled. I
also implemented asm for the remaining bit-packed formats. After those two
changes, the new numbers are:

All formats:
  - single thread: Overall speedup=4.247x faster, min=0.177x max=224.809x
  - multi thread:  Overall speedup=4.000x faster, min=0.256x max=968.725x

"Common" formats:
  - single thread: Overall speedup=3.174x faster, min=0.596x max=12.616x
  - multi thread:  Overall speedup=3.005x faster, min=0.617x max=14.739x

>
> However, the main goal of this rewrite is not to improve performance, but to
> improve the maintainability, extensibility and correctness of the code. Most of
> the slowdowns for "common" formats are due to increased correctness (e.g.
> accurate rounding and dithering), and not the result of a regression per se.
>
> All of the remaining slowdowns (notably, the 0.1x cases) are due to incomplete
> coverage of the x86 SIMD. Notably, this currently affects bit packed formats
> (e.g. rgb8, rgb4). (I also did not yet incorporate any AVX-512 code, which
> some of the existing routines take advantage of)
>
> While I will continue working on this and expanding coverage to all remaining
> operations, I felt that now is a good point in time to get some code review
> and feedback regardless. I would especially appreciate code review of the x86
> SIMD code inside libswscale/x86/ops_*.asm, as this is my first time writing
> x86 assembly code.
>
>  doc/APIchanges                |   3 +
>  doc/scaler.texi               |   3 +
>  doc/swscale-v2.txt            | 344 +++++++++++++++++++++++++++
>  libswscale/Makefile           |   9 +
>  libswscale/format.c           | 945 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>  libswscale/format.h           |  29 ++-
>  libswscale/graph.c            | 151 ++++++++----
>  libswscale/graph.h            |  37 ++-
>  libswscale/ops.c              | 850 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  libswscale/ops.h              | 263 +++++++++++++++++++++
>  libswscale/ops_backend.c      | 101 ++++++++
>  libswscale/ops_backend.h      | 181 ++++++++++++++
>  libswscale/ops_chain.c        | 291 +++++++++++++++++++++++
>  libswscale/ops_chain.h        | 108 +++++++++
>  libswscale/ops_internal.h     | 103 ++++++++
>  libswscale/ops_optimizer.c    | 810 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  libswscale/ops_tmpl_common.c  | 176 ++++++++++++++
>  libswscale/ops_tmpl_float.c   | 255 ++++++++++++++++++++
>  libswscale/ops_tmpl_int.c     | 609 +++++++++++++++++++++++++++++++++++++++++++++++
>  libswscale/options.c          |   1 +
>  libswscale/swscale.h          |   7 +
>  libswscale/tests/swscale.c    |  11 +-
>  libswscale/version.h          |   2 +-
>  libswscale/x86/Makefile       |   3 +
>  libswscale/x86/ops.c          | 735 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  libswscale/x86/ops_common.asm | 208 ++++++++++++++++
>  libswscale/x86/ops_float.asm  | 376 +++++++++++++++++++++++++++++
>  libswscale/x86/ops_int.asm    | 882 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  tests/checkasm/Makefile       |   8 +-
>  tests/checkasm/checkasm.c     |   4 +-
>  tests/checkasm/checkasm.h     |  26 +-
>  tests/checkasm/sw_ops.c       | 748 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  32 files changed, 8206 insertions(+), 73 deletions(-)
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 07/17] swscale: add SWS_EXPERIMENTAL flag
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 07/17] swscale: add SWS_EXPERIMENTAL flag Niklas Haas
@ 2025-05-08 11:37   ` Niklas Haas
  0 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-05-08 11:37 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

On Sat, 26 Apr 2025 19:41:11 +0200 Niklas Haas <ffmpeg@haasn.xyz> wrote:
> From: Niklas Haas <git@haasn.dev>
>
> Give users and developers a way to opt in to the new format conversion code,
> and more code from the swscale rewrite in general.

This conflicts with the existing option "experimental", which maps to SWS_X
(the "experimental" scaler), so we need to find a new name for it.

I also propose that we deprecate SWS_X, perhaps alongside other obscure and
less useful options like a_dither and x_dither.
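
(For reference, the entry currently occupying the name in
libswscale/options.c looks roughly like this, quoted from memory:

    { "experimental", "experimental", 0, AV_OPT_TYPE_CONST, { .i64 = SWS_X }, .flags = VE, .unit = "sws_flags" },

so the new flag needs a different option string, or the old entry would have
to be removed first.)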

> ---
>  doc/APIchanges       | 3 +++
>  doc/scaler.texi      | 3 +++
>  libswscale/options.c | 1 +
>  libswscale/swscale.h | 7 +++++++
>  libswscale/version.h | 2 +-
>  5 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/doc/APIchanges b/doc/APIchanges
> index 22aa6fa5c7..84bc721569 100644
> --- a/doc/APIchanges
> +++ b/doc/APIchanges
> @@ -2,6 +2,9 @@ The last version increases of all libraries were on 2025-03-28
>
>  API changes, most recent first:
>
> +2025-04-xx - xxxxxxxxxx - lsws 9.1.100 - swscale.h
> +  Add SWS_EXPERIMENTAL flag.
> +
>  2025-04-16 - c818c67991 - libpostproc 59.1.100 - postprocess.h
>    Deprecate PP_CPU_CAPS_3DNOW.
>
> diff --git a/doc/scaler.texi b/doc/scaler.texi
> index eb045de6b7..519a83b5d3 100644
> --- a/doc/scaler.texi
> +++ b/doc/scaler.texi
> @@ -68,6 +68,9 @@ Select full chroma input.
>
>  @item bitexact
>  Enable bitexact output.
> +
> +@item experimental
> +Allow the use of experimental new code. For testing only.
>  @end table
>
>  @item srcw @var{(API only)}
> diff --git a/libswscale/options.c b/libswscale/options.c
> index feecae8c89..044c7c7f0b 100644
> --- a/libswscale/options.c
> +++ b/libswscale/options.c
> @@ -50,6 +50,7 @@ static const AVOption swscale_options[] = {
>          { "full_chroma_inp", "full chroma input",             0,  AV_OPT_TYPE_CONST, { .i64 = SWS_FULL_CHR_H_INP }, .flags = VE, .unit = "sws_flags" },
>          { "bitexact",        "bit-exact mode",                0,  AV_OPT_TYPE_CONST, { .i64 = SWS_BITEXACT       }, .flags = VE, .unit = "sws_flags" },
>          { "error_diffusion", "error diffusion dither",        0,  AV_OPT_TYPE_CONST, { .i64 = SWS_ERROR_DIFFUSION}, .flags = VE, .unit = "sws_flags" },
> +        { "experimental",    "allow experimental new code",   0,  AV_OPT_TYPE_CONST, { .i64 = SWS_EXPERIMENTAL   }, .flags = VE, .unit = "sws_flags" },
>
>      { "param0",          "scaler param 0", OFFSET(scaler_params[0]), AV_OPT_TYPE_DOUBLE, { .dbl = SWS_PARAM_DEFAULT  }, INT_MIN, INT_MAX, VE },
>      { "param1",          "scaler param 1", OFFSET(scaler_params[1]), AV_OPT_TYPE_DOUBLE, { .dbl = SWS_PARAM_DEFAULT  }, INT_MIN, INT_MAX, VE },
> diff --git a/libswscale/swscale.h b/libswscale/swscale.h
> index b04aa182d2..82a69e97fc 100644
> --- a/libswscale/swscale.h
> +++ b/libswscale/swscale.h
> @@ -155,6 +155,13 @@ typedef enum SwsFlags {
>      SWS_ACCURATE_RND   = 1 << 18,
>      SWS_BITEXACT       = 1 << 19,
>
> +    /**
> +     * Allow using experimental new code paths. This may be faster, slower,
> +     * or produce different output, with semantics subject to change at any
> +     * point in time. For testing and debugging purposes only.
> +     */
> +    SWS_EXPERIMENTAL   = 1 << 20,
> +
>      /**
>       * Deprecated flags.
>       */
> diff --git a/libswscale/version.h b/libswscale/version.h
> index 148efd83eb..4e54701aba 100644
> --- a/libswscale/version.h
> +++ b/libswscale/version.h
> @@ -28,7 +28,7 @@
>
>  #include "version_major.h"
>
> -#define LIBSWSCALE_VERSION_MINOR   0
> +#define LIBSWSCALE_VERSION_MINOR   1
>  #define LIBSWSCALE_VERSION_MICRO 100
>
>  #define LIBSWSCALE_VERSION_INT  AV_VERSION_INT(LIBSWSCALE_VERSION_MAJOR, \
> --
> 2.49.0
>

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 10/17] swscale/ops_backend: add reference backend based on C templates
  2025-05-02 15:06   ` Michael Niedermayer
@ 2025-05-08 12:24     ` Niklas Haas
  0 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-05-08 12:24 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Fri, 02 May 2025 17:06:27 +0200 Michael Niedermayer <michael@niedermayer.cc> wrote:
> On Sat, Apr 26, 2025 at 07:41:14PM +0200, Niklas Haas wrote:
> > From: Niklas Haas <git@haasn.dev>
> >
> > This will serve as a reference for the SIMD backends to come. That said,
> > with auto-vectorization enabled, the performance of this is not atrocious, and
> > can often beat even the old SIMD.
> >
> > In theory, we can dramatically speed it up by using GCC vectors instead of
> > arrays, but the performance gains from this are too dependent on exact GCC
> > versions and flags, so in practice it's not a substitute for a SIMD
> > implementation.
> > ---
> >  libswscale/Makefile          |   6 +
> >  libswscale/ops.c             |   3 +
> >  libswscale/ops.h             |   2 -
> >  libswscale/ops_backend.c     | 101 ++++++
> >  libswscale/ops_backend.h     | 181 +++++++++++
> >  libswscale/ops_tmpl_common.c | 176 ++++++++++
> >  libswscale/ops_tmpl_float.c  | 255 +++++++++++++++
> >  libswscale/ops_tmpl_int.c    | 609 +++++++++++++++++++++++++++++++++++
> >  8 files changed, 1331 insertions(+), 2 deletions(-)
> >  create mode 100644 libswscale/ops_backend.c
> >  create mode 100644 libswscale/ops_backend.h
> >  create mode 100644 libswscale/ops_tmpl_common.c
> >  create mode 100644 libswscale/ops_tmpl_float.c
> >  create mode 100644 libswscale/ops_tmpl_int.c
>
> arm breaker
>
> CC      libswscale/ops_backend.o
> In file included from src/libswscale/ops_backend.c:21:0:
> src/libswscale/ops_tmpl_int.c:492:12: error: initializer element is not constant
>          fn(op_read_planar1),
>             ^
> src/libswscale/ops_backend.h:78:27: note: in definition of macro ‘bitfn2’
>  #define bitfn2(name, ext) name ## _ ## ext
>                            ^~~~
> src/libswscale/ops_backend.h:82:19: note: in expansion of macro ‘bitfn’
>  #define fn(name)  bitfn(name, FN_SUFFIX)
>                    ^~~~~
> src/libswscale/ops_tmpl_int.c:492:9: note: in expansion of macro ‘fn’
>          fn(op_read_planar1),
>          ^~
> src/libswscale/ops_tmpl_int.c:492:12: note: (near initialization for ‘op_table_int_u8.entries[0]’)
>          fn(op_read_planar1),
>             ^
> src/libswscale/ops_backend.h:78:27: note: in definition of macro ‘bitfn2’
>  #define bitfn2(name, ext) name ## _ ## ext
>                            ^~~~
> src/libswscale/ops_backend.h:82:19: note: in expansion of macro ‘bitfn’
>  #define fn(name)  bitfn(name, FN_SUFFIX)
>                    ^~~~~
> src/libswscale/ops_tmpl_int.c:492:9: note: in expansion of macro ‘fn’
>          fn(op_read_planar1),
>          ^~
> src/libswscale/ops_tmpl_int.c:493:12: error: initializer element is not constant
>          fn(op_read_planar2),
>             ^
> src/libswscale/ops_backend.h:78:27: note: in definition of macro ‘bitfn2’
>  #define bitfn2(name, ext) name ## _ ## ext
>                            ^~~~
> src/libswscale/ops_backend.h:82:19: note: in expansion of macro ‘bitfn’
>  #define fn(name)  bitfn(name, FN_SUFFIX)
>                    ^~~~~
> src/libswscale/ops_tmpl_int.c:493:9: note: in expansion of macro ‘fn’
>          fn(op_read_planar2),
>          ^~
> src/libswscale/ops_tmpl_int.c:493:12: note: (near initialization for ‘op_table_int_u8.entries[1]’)

Fixed (hopefully) by making the op table entries indirect.

>
> ................
>
> [...]
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> In a rich man's house there is no place to spit but his face.
> -- Diogenes of Sinope

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC]
  2025-04-26 17:41 [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC] Niklas Haas
                   ` (18 preceding siblings ...)
  2025-05-02 17:51 ` Niklas Haas
@ 2025-05-16 11:09 ` Niklas Haas
  2025-05-16 14:32   ` Ramiro Polla
  19 siblings, 1 reply; 33+ messages in thread
From: Niklas Haas @ 2025-05-16 11:09 UTC (permalink / raw)
  To: ffmpeg-devel

I would like to merge at least the first half of this series, containing
mostly preliminary changes, if there are no further objections.

After they are merged, I will send a rebased v2 with my latest changes, which
include some subsequent refactors that I decided not to squash into my
previous commits, in order to simplify our development process (and avoid
having to re-review them).

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC]
  2025-05-16 11:09 ` Niklas Haas
@ 2025-05-16 14:32   ` Ramiro Polla
  2025-05-16 14:39     ` Niklas Haas
  0 siblings, 1 reply; 33+ messages in thread
From: Ramiro Polla @ 2025-05-16 14:32 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

Hi Niklas,

On Fri, May 16, 2025 at 1:09 PM Niklas Haas <ffmpeg@haasn.xyz> wrote:
> I would like to merge at least the first half of this series, containing
> mostly preliminary changes, if there are no further objections.

Can you list the patches that you would like to merge?

Ramiro

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC]
  2025-05-16 14:32   ` Ramiro Polla
@ 2025-05-16 14:39     ` Niklas Haas
  2025-05-16 15:44       ` Ramiro Polla
  0 siblings, 1 reply; 33+ messages in thread
From: Niklas Haas @ 2025-05-16 14:39 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Fri, 16 May 2025 16:32:00 +0200 Ramiro Polla <ramiro.polla@gmail.com> wrote:
> Hi Niklas,
>
> On Fri, May 16, 2025 at 1:09 PM Niklas Haas <ffmpeg@haasn.xyz> wrote:
> > I would like to merge at least the first half of this series, containing
> > mostly preliminary changes, if there are no further objections.
>
> Can you list the patches that you would like to merge?

Patches 1 through 6, and maybe patch 7 if we can come up with a good solution
for the name conflict. (Maybe you have some ideas?)

>
> Ramiro

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 04/17] swscale/graph: move vshift() and shift_img() to shared header
  2025-04-26 17:41 ` [FFmpeg-devel] [PATCH 04/17] swscale/graph: move vshift() and shift_img() to shared header Niklas Haas
@ 2025-05-16 15:41   ` Ramiro Polla
  0 siblings, 0 replies; 33+ messages in thread
From: Ramiro Polla @ 2025-05-16 15:41 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Sat, Apr 26, 2025 at 7:57 PM Niklas Haas <ffmpeg@haasn.xyz> wrote:
>
> From: Niklas Haas <git@haasn.dev>
>
> I need to reuse these inside `ops.c`.
> ---
>  libswscale/graph.c | 29 +++++++----------------------
>  libswscale/graph.h | 13 +++++++++++++
>  2 files changed, 20 insertions(+), 22 deletions(-)
>
> diff --git a/libswscale/graph.c b/libswscale/graph.c
> index c5a46eb257..b921b7ec02 100644
> --- a/libswscale/graph.c
> +++ b/libswscale/graph.c
> @@ -94,29 +94,14 @@ static int pass_append(SwsGraph *graph, enum AVPixelFormat fmt, int w, int h,
>      return 0;
>  }
>
> -static int vshift(enum AVPixelFormat fmt, int plane)
> -{
> -    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
> -    return (plane == 1 || plane == 2) ? desc->log2_chroma_h : 0;
> -}
> -
> -/* Shift an image vertically by y lines */
> -static SwsImg shift_img(const SwsImg *img_base, int y)
> -{
> -    SwsImg img = *img_base;
> -    for (int i = 0; i < 4 && img.data[i]; i++)
> -        img.data[i] += (y >> vshift(img.fmt, i)) * img.linesize[i];
> -    return img;
> -}
> -
>  static void run_copy(const SwsImg *out_base, const SwsImg *in_base,
>                       int y, int h, const SwsPass *pass)
>  {
> -    SwsImg in  = shift_img(in_base,  y);
> -    SwsImg out = shift_img(out_base, y);
> +    SwsImg in  = ff_sws_img_shift(*in_base,  y);
> +    SwsImg out = ff_sws_img_shift(*out_base, y);
>
>      for (int i = 0; i < FF_ARRAY_ELEMS(out.data) && out.data[i]; i++) {
> -        const int lines = h >> vshift(in.fmt, i);
> +        const int lines = h >> ff_fmt_vshift(in.fmt, i);
>          av_assert1(in.data[i]);
>
>          if (in.linesize[i] == out.linesize[i]) {
[...]
> diff --git a/libswscale/graph.h b/libswscale/graph.h
> index 62b622a065..191734b794 100644
> --- a/libswscale/graph.h
> +++ b/libswscale/graph.h
> @@ -34,6 +34,19 @@ typedef struct SwsImg {
>      int linesize[4];
>  } SwsImg;
>
> +static av_always_inline av_const int ff_fmt_vshift(enum AVPixelFormat fmt, int plane)
> +{
> +    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
> +    return (plane == 1 || plane == 2) ? desc->log2_chroma_h : 0;
> +}
> +
> +static av_const inline SwsImg ff_sws_img_shift(SwsImg img, const int y)
> +{
> +    for (int i = 0; i < 4 && img.data[i]; i++)
> +        img.data[i] += (y >> ff_fmt_vshift(img.fmt, i)) * img.linesize[i];
> +    return img;
> +}

I find it weird to pass the struct itself as an argument. The previous
version took a const pointer and made the copy itself; the new version
dereferences at the call site to make the copy. The compiler probably
optimizes both down to the same binary, since the function is inline.

Either is fine by me, btw; I just wanted to point it out.
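
To make the two shapes concrete (a sketch based on the hunk above):

    SwsImg in = shift_img(in_base, y);          /* old: callee copies */
    SwsImg in = ff_sws_img_shift(*in_base, y);  /* new: copy at call site */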

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [FFmpeg-devel] [PATCH 00/17] swscale v2: new framework [RFC]
  2025-05-16 14:39     ` Niklas Haas
@ 2025-05-16 15:44       ` Ramiro Polla
  0 siblings, 0 replies; 33+ messages in thread
From: Ramiro Polla @ 2025-05-16 15:44 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Fri, May 16, 2025 at 4:39 PM Niklas Haas <ffmpeg@haasn.xyz> wrote:
> On Fri, 16 May 2025 16:32:00 +0200 Ramiro Polla <ramiro.polla@gmail.com> wrote:
> > On Fri, May 16, 2025 at 1:09 PM Niklas Haas <ffmpeg@haasn.xyz> wrote:
> > > I would like to merge at least the first half of this series, containing
> > > mostly preliminary changes, if there are no further objections.
> >
> > Can you list the patches that you would like to merge?
>
> Patches 1 through 6, and maybe patch 7 if we can come up with a good solution
> for the name conflict. (Maybe you have some ideas?)

Patches 1 through 6 look good to me (with only a nit for patch 4).

I don't have any good ideas about what to do about the flag from patch
7, but if we were to deprecate flags it would be good to do it before
we branch out 8.0.

Ramiro

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [FFmpeg-devel] [PATCH 16/17] swscale/format: add new format decode/encode logic
  2025-05-18 14:59 [FFmpeg-devel] [PATCH 01/17] swscale/format: rename legacy format conversion table Niklas Haas
@ 2025-05-18 14:59 ` Niklas Haas
  0 siblings, 0 replies; 33+ messages in thread
From: Niklas Haas @ 2025-05-18 14:59 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This patch adds format handling code for the new operations. This entails
fully decoding a format to standardized RGB, and the inverse.

Handling it this way means we can always guarantee that a conversion path
exists from A to B without having to explicitly cover logic for each path;
and choosing RGB instead of YUV as the intermediate (as was done in swscale
v1) is more flexible with regards to enabling further operations such as
primaries conversions, linear scaling, etc.

In the case of a YUV->YUV transform, the redundant matrix multiplication will
be canceled out anyway.
---
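(The cancellation presumably works like this: decoding appends the inverse
matrix M^-1 for YUV -> RGB and encoding appends M for RGB -> YUV, so when
both sides use the same coefficients the optimizer can fold the two adjacent
SWS_OP_LINEAR ops into their product M * M^-1 = I and drop the identity.)
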
 libswscale/format.c | 925 ++++++++++++++++++++++++++++++++++++++++++++
 libswscale/format.h |  23 ++
 2 files changed, 948 insertions(+)

diff --git a/libswscale/format.c b/libswscale/format.c
index b77081dd7a..c4c661ac0e 100644
--- a/libswscale/format.c
+++ b/libswscale/format.c
@@ -21,8 +21,22 @@
 #include "libavutil/avassert.h"
 #include "libavutil/hdr_dynamic_metadata.h"
 #include "libavutil/mastering_display_metadata.h"
+#include "libavutil/refstruct.h"
 
 #include "format.h"
+#include "csputils.h"
+#include "ops_internal.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+#define Q0   Q(0)
+#define Q1   Q(1)
+
+#define RET(x)                                                                 \
+    do {                                                                       \
+        int __ret = (x);                                                       \
+        if (__ret  < 0)                                                        \
+            return __ret;                                                      \
+    } while (0)
 
 typedef struct LegacyFormatEntry {
     uint8_t is_supported_in         :1;
@@ -582,3 +596,914 @@ int sws_is_noop(const AVFrame *dst, const AVFrame *src)
 
     return 1;
 }
+
+/* Returns the type suitable for a pixel after fully decoding/unpacking it */
+static SwsPixelType fmt_pixel_type(enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    const int bits = FFALIGN(desc->comp[0].depth, 8);
+    if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) {
+        switch (bits) {
+        case 32: return SWS_PIXEL_F32;
+        }
+    } else {
+        switch (bits) {
+        case  8: return SWS_PIXEL_U8;
+        case 16: return SWS_PIXEL_U16;
+        case 32: return SWS_PIXEL_U32;
+        }
+    }
+
+    return SWS_PIXEL_NONE;
+}
+
+static SwsSwizzleOp fmt_swizzle(enum AVPixelFormat fmt)
+{
+    switch (fmt) {
+    case AV_PIX_FMT_ARGB:
+    case AV_PIX_FMT_0RGB:
+    case AV_PIX_FMT_AYUV64LE:
+    case AV_PIX_FMT_AYUV64BE:
+    case AV_PIX_FMT_AYUV:
+    case AV_PIX_FMT_X2RGB10LE:
+    case AV_PIX_FMT_X2RGB10BE:
+        return (SwsSwizzleOp) {{ .x = 3, 0, 1, 2 }};
+    case AV_PIX_FMT_BGR24:
+    case AV_PIX_FMT_BGR8:
+    case AV_PIX_FMT_BGR4:
+    case AV_PIX_FMT_BGR4_BYTE:
+    case AV_PIX_FMT_BGRA:
+    case AV_PIX_FMT_BGR565BE:
+    case AV_PIX_FMT_BGR565LE:
+    case AV_PIX_FMT_BGR555BE:
+    case AV_PIX_FMT_BGR555LE:
+    case AV_PIX_FMT_BGR444BE:
+    case AV_PIX_FMT_BGR444LE:
+    case AV_PIX_FMT_BGR48BE:
+    case AV_PIX_FMT_BGR48LE:
+    case AV_PIX_FMT_BGRA64BE:
+    case AV_PIX_FMT_BGRA64LE:
+    case AV_PIX_FMT_BGR0:
+    case AV_PIX_FMT_VUYA:
+    case AV_PIX_FMT_VUYX:
+        return (SwsSwizzleOp) {{ .x = 2, 1, 0, 3 }};
+    case AV_PIX_FMT_ABGR:
+    case AV_PIX_FMT_0BGR:
+    case AV_PIX_FMT_X2BGR10LE:
+    case AV_PIX_FMT_X2BGR10BE:
+        return (SwsSwizzleOp) {{ .x = 3, 2, 1, 0 }};
+    case AV_PIX_FMT_YA8:
+    case AV_PIX_FMT_YA16BE:
+    case AV_PIX_FMT_YA16LE:
+        return (SwsSwizzleOp) {{ .x = 0, 3, 1, 2 }};
+    case AV_PIX_FMT_XV30BE:
+    case AV_PIX_FMT_XV30LE:
+        return (SwsSwizzleOp) {{ .x = 3, 2, 0, 1 }};
+    case AV_PIX_FMT_VYU444:
+    case AV_PIX_FMT_V30XBE:
+    case AV_PIX_FMT_V30XLE:
+        return (SwsSwizzleOp) {{ .x = 2, 0, 1, 3 }};
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XV48BE:
+    case AV_PIX_FMT_XV48LE:
+    case AV_PIX_FMT_UYVA:
+        return (SwsSwizzleOp) {{ .x = 1, 0, 2, 3 }};
+    case AV_PIX_FMT_GBRP:
+    case AV_PIX_FMT_GBRP9BE:
+    case AV_PIX_FMT_GBRP9LE:
+    case AV_PIX_FMT_GBRP10BE:
+    case AV_PIX_FMT_GBRP10LE:
+    case AV_PIX_FMT_GBRP12BE:
+    case AV_PIX_FMT_GBRP12LE:
+    case AV_PIX_FMT_GBRP14BE:
+    case AV_PIX_FMT_GBRP14LE:
+    case AV_PIX_FMT_GBRP16BE:
+    case AV_PIX_FMT_GBRP16LE:
+    case AV_PIX_FMT_GBRPF16BE:
+    case AV_PIX_FMT_GBRPF16LE:
+    case AV_PIX_FMT_GBRAP:
+    case AV_PIX_FMT_GBRAP10LE:
+    case AV_PIX_FMT_GBRAP10BE:
+    case AV_PIX_FMT_GBRAP12LE:
+    case AV_PIX_FMT_GBRAP12BE:
+    case AV_PIX_FMT_GBRAP14LE:
+    case AV_PIX_FMT_GBRAP14BE:
+    case AV_PIX_FMT_GBRAP16LE:
+    case AV_PIX_FMT_GBRAP16BE:
+    case AV_PIX_FMT_GBRPF32BE:
+    case AV_PIX_FMT_GBRPF32LE:
+    case AV_PIX_FMT_GBRAPF16BE:
+    case AV_PIX_FMT_GBRAPF16LE:
+    case AV_PIX_FMT_GBRAPF32BE:
+    case AV_PIX_FMT_GBRAPF32LE:
+        return (SwsSwizzleOp) {{ .x = 1, 2, 0, 3 }};
+    default:
+        return (SwsSwizzleOp) {{ .x = 0, 1, 2, 3 }};
+    }
+}
+
+static SwsSwizzleOp swizzle_inv(SwsSwizzleOp swiz) {
+    /* Input[x] =: Output[swizzle.x] */
+    unsigned out[4];
+    out[swiz.x] = 0;
+    out[swiz.y] = 1;
+    out[swiz.z] = 2;
+    out[swiz.w] = 3;
+    return (SwsSwizzleOp) {{ .x = out[0], out[1], out[2], out[3] }};
+}
+
+/* Shift factor for MSB aligned formats */
+static int fmt_shift(enum AVPixelFormat fmt)
+{
+    switch (fmt) {
+    case AV_PIX_FMT_P010BE:
+    case AV_PIX_FMT_P010LE:
+    case AV_PIX_FMT_P210BE:
+    case AV_PIX_FMT_P210LE:
+    case AV_PIX_FMT_Y210BE:
+    case AV_PIX_FMT_Y210LE:
+        return 6;
+    case AV_PIX_FMT_P012BE:
+    case AV_PIX_FMT_P012LE:
+    case AV_PIX_FMT_P212BE:
+    case AV_PIX_FMT_P212LE:
+    case AV_PIX_FMT_P412BE:
+    case AV_PIX_FMT_P412LE:
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XYZ12BE:
+    case AV_PIX_FMT_XYZ12LE:
+        return 4;
+    }
+
+    return 0;
+}
+
+/**
+ * This initializes all absent components explicitly to zero. There is no
+ * need to worry about the correct neutral value as fmt_decode() will
+ * implicitly ignore and overwrite absent components in any case. This function
+ * is just to ensure that we don't operate on undefined memory. In most cases,
+ * it will end up getting pushed towards the output or optimized away entirely
+ * by the optimization pass.
+ */
+static SwsConst fmt_clear(enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    const bool has_chroma = desc->nb_components >= 3;
+    const bool has_alpha  = desc->flags & AV_PIX_FMT_FLAG_ALPHA;
+
+    SwsConst c = {0};
+    if (!has_chroma)
+        c.q4[1] = c.q4[2] = Q0;
+    if (!has_alpha)
+        c.q4[3] = Q0;
+
+    return c;
+}
+
+static int fmt_read_write(enum AVPixelFormat fmt, SwsReadWriteOp *rw_op,
+                          SwsPackOp *pack_op)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    if (!desc)
+        return AVERROR(EINVAL);
+
+    switch (fmt) {
+    case AV_PIX_FMT_NONE:
+    case AV_PIX_FMT_NB:
+        break;
+
+    /* Packed bitstream formats */
+    case AV_PIX_FMT_MONOWHITE:
+    case AV_PIX_FMT_MONOBLACK:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) {
+            .elems = 1,
+            .frac  = 3,
+        };
+        return 0;
+    case AV_PIX_FMT_RGB4:
+    case AV_PIX_FMT_BGR4:
+        *pack_op = (SwsPackOp) {{ 1, 2, 1 }};
+        *rw_op = (SwsReadWriteOp) {
+            .elems = 1,
+            .frac  = 1,
+        };
+        return 0;
+    /* Packed 8-bit aligned formats */
+    case AV_PIX_FMT_RGB4_BYTE:
+    case AV_PIX_FMT_BGR4_BYTE:
+        *pack_op = (SwsPackOp) {{ 1, 2, 1 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_BGR8:
+        *pack_op = (SwsPackOp) {{ 2, 3, 3 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_RGB8:
+        *pack_op = (SwsPackOp) {{ 3, 3, 2 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+
+    /* Packed 16-bit aligned formats */
+    case AV_PIX_FMT_RGB565BE:
+    case AV_PIX_FMT_RGB565LE:
+    case AV_PIX_FMT_BGR565BE:
+    case AV_PIX_FMT_BGR565LE:
+        *pack_op = (SwsPackOp) {{ 5, 6, 5 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_RGB555BE:
+    case AV_PIX_FMT_RGB555LE:
+    case AV_PIX_FMT_BGR555BE:
+    case AV_PIX_FMT_BGR555LE:
+        *pack_op = (SwsPackOp) {{ 5, 5, 5 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_RGB444BE:
+    case AV_PIX_FMT_RGB444LE:
+    case AV_PIX_FMT_BGR444BE:
+    case AV_PIX_FMT_BGR444LE:
+        *pack_op = (SwsPackOp) {{ 4, 4, 4 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    /* Packed 32-bit aligned 4:4:4 formats */
+    case AV_PIX_FMT_X2RGB10BE:
+    case AV_PIX_FMT_X2RGB10LE:
+    case AV_PIX_FMT_X2BGR10BE:
+    case AV_PIX_FMT_X2BGR10LE:
+    case AV_PIX_FMT_XV30BE:
+    case AV_PIX_FMT_XV30LE:
+        *pack_op = (SwsPackOp) {{ 2, 10, 10, 10 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    case AV_PIX_FMT_V30XBE:
+    case AV_PIX_FMT_V30XLE:
+        *pack_op = (SwsPackOp) {{ 10, 10, 10, 2 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1 };
+        return 0;
+    /* 3 component formats with one channel ignored */
+    case AV_PIX_FMT_RGB0:
+    case AV_PIX_FMT_BGR0:
+    case AV_PIX_FMT_0RGB:
+    case AV_PIX_FMT_0BGR:
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XV48BE:
+    case AV_PIX_FMT_XV48LE:
+    case AV_PIX_FMT_VUYX:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) { .elems = 4, .packed = true };
+        return 0;
+    /* Unpacked byte-aligned 4:4:4 formats */
+    case AV_PIX_FMT_YUV444P:
+    case AV_PIX_FMT_YUVJ444P:
+    case AV_PIX_FMT_YUV444P9BE:
+    case AV_PIX_FMT_YUV444P9LE:
+    case AV_PIX_FMT_YUV444P10BE:
+    case AV_PIX_FMT_YUV444P10LE:
+    case AV_PIX_FMT_YUV444P12BE:
+    case AV_PIX_FMT_YUV444P12LE:
+    case AV_PIX_FMT_YUV444P14BE:
+    case AV_PIX_FMT_YUV444P14LE:
+    case AV_PIX_FMT_YUV444P16BE:
+    case AV_PIX_FMT_YUV444P16LE:
+    case AV_PIX_FMT_YUVA444P:
+    case AV_PIX_FMT_YUVA444P9BE:
+    case AV_PIX_FMT_YUVA444P9LE:
+    case AV_PIX_FMT_YUVA444P10BE:
+    case AV_PIX_FMT_YUVA444P10LE:
+    case AV_PIX_FMT_YUVA444P12BE:
+    case AV_PIX_FMT_YUVA444P12LE:
+    case AV_PIX_FMT_YUVA444P16BE:
+    case AV_PIX_FMT_YUVA444P16LE:
+    case AV_PIX_FMT_AYUV:
+    case AV_PIX_FMT_UYVA:
+    case AV_PIX_FMT_VYU444:
+    case AV_PIX_FMT_AYUV64BE:
+    case AV_PIX_FMT_AYUV64LE:
+    case AV_PIX_FMT_VUYA:
+    case AV_PIX_FMT_RGB24:
+    case AV_PIX_FMT_BGR24:
+    case AV_PIX_FMT_RGB48BE:
+    case AV_PIX_FMT_RGB48LE:
+    case AV_PIX_FMT_BGR48BE:
+    case AV_PIX_FMT_BGR48LE:
+    //case AV_PIX_FMT_RGB96BE: TODO: AVRational can't fit 2^32-1
+    //case AV_PIX_FMT_RGB96LE:
+    //case AV_PIX_FMT_RGBF16BE: TODO: no support for float16 currently
+    //case AV_PIX_FMT_RGBF16LE:
+    case AV_PIX_FMT_RGBF32BE:
+    case AV_PIX_FMT_RGBF32LE:
+    case AV_PIX_FMT_ARGB:
+    case AV_PIX_FMT_RGBA:
+    case AV_PIX_FMT_ABGR:
+    case AV_PIX_FMT_BGRA:
+    case AV_PIX_FMT_RGBA64BE:
+    case AV_PIX_FMT_RGBA64LE:
+    case AV_PIX_FMT_BGRA64BE:
+    case AV_PIX_FMT_BGRA64LE:
+    //case AV_PIX_FMT_RGBA128BE: TODO: AVRational can't fit 2^32-1
+    //case AV_PIX_FMT_RGBA128LE:
+    case AV_PIX_FMT_RGBAF32BE:
+    case AV_PIX_FMT_RGBAF32LE:
+    case AV_PIX_FMT_GBRP:
+    case AV_PIX_FMT_GBRP9BE:
+    case AV_PIX_FMT_GBRP9LE:
+    case AV_PIX_FMT_GBRP10BE:
+    case AV_PIX_FMT_GBRP10LE:
+    case AV_PIX_FMT_GBRP12BE:
+    case AV_PIX_FMT_GBRP12LE:
+    case AV_PIX_FMT_GBRP14BE:
+    case AV_PIX_FMT_GBRP14LE:
+    case AV_PIX_FMT_GBRP16BE:
+    case AV_PIX_FMT_GBRP16LE:
+    //case AV_PIX_FMT_GBRPF16BE: TODO
+    //case AV_PIX_FMT_GBRPF16LE:
+    case AV_PIX_FMT_GBRPF32BE:
+    case AV_PIX_FMT_GBRPF32LE:
+    case AV_PIX_FMT_GBRAP:
+    case AV_PIX_FMT_GBRAP10BE:
+    case AV_PIX_FMT_GBRAP10LE:
+    case AV_PIX_FMT_GBRAP12BE:
+    case AV_PIX_FMT_GBRAP12LE:
+    case AV_PIX_FMT_GBRAP14BE:
+    case AV_PIX_FMT_GBRAP14LE:
+    case AV_PIX_FMT_GBRAP16BE:
+    case AV_PIX_FMT_GBRAP16LE:
+    //case AV_PIX_FMT_GBRAPF16BE: TODO
+    //case AV_PIX_FMT_GBRAPF16LE:
+    case AV_PIX_FMT_GBRAPF32BE:
+    case AV_PIX_FMT_GBRAPF32LE:
+    case AV_PIX_FMT_GRAY8:
+    case AV_PIX_FMT_GRAY9BE:
+    case AV_PIX_FMT_GRAY9LE:
+    case AV_PIX_FMT_GRAY10BE:
+    case AV_PIX_FMT_GRAY10LE:
+    case AV_PIX_FMT_GRAY12BE:
+    case AV_PIX_FMT_GRAY12LE:
+    case AV_PIX_FMT_GRAY14BE:
+    case AV_PIX_FMT_GRAY14LE:
+    case AV_PIX_FMT_GRAY16BE:
+    case AV_PIX_FMT_GRAY16LE:
+    //case AV_PIX_FMT_GRAYF16BE: TODO
+    //case AV_PIX_FMT_GRAYF16LE:
+    //case AV_PIX_FMT_YAF16BE:
+    //case AV_PIX_FMT_YAF16LE:
+    case AV_PIX_FMT_GRAYF32BE:
+    case AV_PIX_FMT_GRAYF32LE:
+    case AV_PIX_FMT_YAF32BE:
+    case AV_PIX_FMT_YAF32LE:
+    case AV_PIX_FMT_YA8:
+    case AV_PIX_FMT_YA16LE:
+    case AV_PIX_FMT_YA16BE:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) {
+            .elems  = desc->nb_components,
+            .packed = desc->nb_components > 1 && !(desc->flags & AV_PIX_FMT_FLAG_PLANAR),
+        };
+        return 0;
+    }
+
+    return AVERROR(ENOTSUP);
+}
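+
+/* A few illustrative mappings from the table above: RGB24 reads three packed
+ * elements per pixel, GBRP reads three planar elements (packed = false), and
+ * RGB565 reads a single 16-bit element that is subsequently unpacked using
+ * the {5, 6, 5} bit pattern. */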
+
+static SwsPixelType get_packed_type(SwsPackOp pack)
+{
+    const int sum = pack.pattern[0] + pack.pattern[1] +
+                    pack.pattern[2] + pack.pattern[3];
+    if (sum > 16)
+        return SWS_PIXEL_U32;
+    else if (sum > 8)
+        return SWS_PIXEL_U16;
+    else
+        return SWS_PIXEL_U8;
+}
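+
+/* e.g. RGB565 (5+6+5 = 16 bits) packs into a U16, while X2RGB10
+ * (2+10+10+10 = 32 bits) requires a U32 container */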
+
+#if HAVE_BIGENDIAN
+#  define NATIVE_ENDIAN_FLAG AV_PIX_FMT_FLAG_BE
+#else
+#  define NATIVE_ENDIAN_FLAG 0
+#endif
+
+int ff_sws_decode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    SwsPixelType pixel_type = fmt_pixel_type(fmt);
+    SwsPixelType raw_type = pixel_type;
+    SwsReadWriteOp rw_op;
+    SwsPackOp unpack;
+
+    RET(fmt_read_write(fmt, &rw_op, &unpack));
+    if (unpack.pattern[0])
+        raw_type = get_packed_type(unpack);
+
+    /* TODO: handle subsampled or semipacked input formats */
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_READ,
+        .type = raw_type,
+        .rw   = rw_op,
+    }));
+
+    if ((desc->flags & AV_PIX_FMT_FLAG_BE) != NATIVE_ENDIAN_FLAG) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_SWAP_BYTES,
+            .type = raw_type,
+        }));
+    }
+
+    if (unpack.pattern[0]) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_UNPACK,
+            .type = raw_type,
+            .pack = unpack,
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CONVERT,
+            .type = raw_type,
+            .convert.to = pixel_type,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op      = SWS_OP_SWIZZLE,
+        .type    = pixel_type,
+        .swizzle = swizzle_inv(fmt_swizzle(fmt)),
+    }));
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_RSHIFT,
+        .type = pixel_type,
+        .c.u  = fmt_shift(fmt),
+    }));
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_CLEAR,
+        .type = pixel_type,
+        .c    = fmt_clear(fmt),
+    }));
+
+    return 0;
+}
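+
+/* Sketch of the resulting op chain when decoding e.g. AV_PIX_FMT_GBRAP on a
+ * little endian machine:
+ *   READ (4 planar elems) -> SWIZZLE {2, 0, 1, 3} -> RSHIFT 0 -> CLEAR (none)
+ * The trailing no-op shift and clear are expected to be eliminated by the
+ * optimizer pass. */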
+
+int ff_sws_encode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    SwsPixelType pixel_type = fmt_pixel_type(fmt);
+    SwsPixelType raw_type = pixel_type;
+    SwsReadWriteOp rw_op;
+    SwsPackOp pack;
+
+    RET(fmt_read_write(fmt, &rw_op, &pack));
+    if (pack.pattern[0])
+        raw_type = get_packed_type(pack);
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_LSHIFT,
+        .type = pixel_type,
+        .c.u  = fmt_shift(fmt),
+    }));
+
+    if (rw_op.elems > desc->nb_components) {
+        /* Format writes unused alpha channel, clear it explicitly for sanity */
+        av_assert1(!(desc->flags & AV_PIX_FMT_FLAG_ALPHA));
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CLEAR,
+            .type = pixel_type,
+            .c.q4[3] = Q0,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op      = SWS_OP_SWIZZLE,
+        .type    = pixel_type,
+        .swizzle = fmt_swizzle(fmt),
+    }));
+
+    if (pack.pattern[0]) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CONVERT,
+            .type = pixel_type,
+            .convert.to = raw_type,
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_PACK,
+            .type = raw_type,
+            .pack = pack,
+        }));
+    }
+
+    if ((desc->flags & AV_PIX_FMT_FLAG_BE) != NATIVE_ENDIAN_FLAG) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_SWAP_BYTES,
+            .type = raw_type,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_WRITE,
+        .type = raw_type,
+        .rw   = rw_op,
+    }));
+    return 0;
+}
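+
+/* The encode path mirrors the decode path above in reverse order:
+ *   LSHIFT -> (CLEAR) -> SWIZZLE -> (CONVERT -> PACK) -> (SWAP_BYTES) -> WRITE
+ * with the parenthesized stages only emitted when applicable. */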
+
+static inline AVRational av_neg_q(AVRational x)
+{
+    return (AVRational) { -x.num, x.den };
+}
+
+static SwsLinearOp fmt_encode_range(const SwsFormat fmt, bool *incomplete)
+{
+    SwsLinearOp c = { .m = {
+        { Q1, Q0, Q0, Q0, Q0 },
+        { Q0, Q1, Q0, Q0, Q0 },
+        { Q0, Q0, Q1, Q0, Q0 },
+        { Q0, Q0, Q0, Q1, Q0 },
+    }};
+
+    const int depth0 = fmt.desc->comp[0].depth;
+    const int depth1 = fmt.desc->comp[1].depth;
+    const int depth2 = fmt.desc->comp[2].depth;
+    const int depth3 = fmt.desc->comp[3].depth;
+
+    if (fmt.desc->flags & AV_PIX_FMT_FLAG_FLOAT)
+        return c; /* floats are output as-is */
+
+    if (fmt.csp == AVCOL_SPC_RGB || (fmt.desc->flags & AV_PIX_FMT_FLAG_XYZ)) {
+        c.m[0][0] = Q((1 << depth0) - 1);
+        c.m[1][1] = Q((1 << depth1) - 1);
+        c.m[2][2] = Q((1 << depth2) - 1);
+    } else if (fmt.range == AVCOL_RANGE_JPEG) {
+        /* Full range YUV */
+        c.m[0][0] = Q((1 << depth0) - 1);
+        if (fmt.desc->nb_components >= 3) {
+            /* This follows the ITU-R convention, which is slightly different
+             * from the JFIF convention. */
+            c.m[1][1] = Q((1 << depth1) - 1);
+            c.m[2][2] = Q((1 << depth2) - 1);
+            c.m[1][4] = Q(1 << (depth1 - 1));
+            c.m[2][4] = Q(1 << (depth2 - 1));
+        }
+    } else {
+        /* Limited range YUV */
+        if (fmt.range == AVCOL_RANGE_UNSPECIFIED)
+            *incomplete = true;
+        c.m[0][0] = Q(219 << (depth0 - 8));
+        c.m[0][4] = Q( 16 << (depth0 - 8));
+        if (fmt.desc->nb_components >= 3) {
+            c.m[1][1] = Q(224 << (depth1 - 8));
+            c.m[2][2] = Q(224 << (depth2 - 8));
+            c.m[1][4] = Q(128 << (depth1 - 8));
+            c.m[2][4] = Q(128 << (depth2 - 8));
+        }
+    }
+
+    if (fmt.desc->flags & AV_PIX_FMT_FLAG_ALPHA) {
+        const bool is_ya = fmt.desc->nb_components == 2;
+        c.m[3][3] = Q((1 << (is_ya ? depth1 : depth3)) - 1);
+    }
+
+    if (fmt.format == AV_PIX_FMT_MONOWHITE) {
+        /* This format is inverted: 0 = white, 1 = black */
+        c.m[0][4] = av_add_q(c.m[0][4], c.m[0][0]);
+        c.m[0][0] = av_neg_q(c.m[0][0]);
+    }
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
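+
+/* e.g. for 8-bit limited-range YUV this yields the familiar encoding
+ * Y' = 219 Y + 16 and Cb' = 224 Cb + 128 (with Cb centered on zero) */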
+
+static SwsLinearOp fmt_decode_range(const SwsFormat fmt, bool *incomplete)
+{
+    SwsLinearOp c = fmt_encode_range(fmt, incomplete);
+
+    /* Invert main diagonal + offset: x = s * y + k  ==>  y = (x - k) / s */
+    for (int i = 0; i < 4; i++) {
+        c.m[i][i] = av_inv_q(c.m[i][i]);
+        c.m[i][4] = av_mul_q(c.m[i][4], av_neg_q(c.m[i][i]));
+    }
+
+    /* Explicitly initialize alpha for sanity */
+    if (!(fmt.desc->flags & AV_PIX_FMT_FLAG_ALPHA))
+        c.m[3][4] = Q1;
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
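+
+/* e.g. 8-bit limited-range luma decodes as Y = (Y' - 16) / 219 */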
+
+static AVRational *generate_bayer_matrix(const int size_log2)
+{
+    const int size = 1 << size_log2;
+    const int num_entries = size * size;
+    AVRational *m = av_refstruct_allocz(sizeof(*m) * num_entries);
+    av_assert1(size_log2 < 16);
+    if (!m)
+        return NULL;
+
+    /* Start with a 1x1 matrix */
+    m[0] = Q0;
+
+    /* Generate three copies of the current, appropriately scaled and offset */
+    for (int sz = 1; sz < size; sz <<= 1) {
+        const int den = 4 * sz * sz;
+        for (int y = 0; y < sz; y++) {
+            for (int x = 0; x < sz; x++) {
+                const AVRational cur = m[y * size + x];
+                m[(y + sz) * size + x + sz] = av_add_q(cur, av_make_q(1, den));
+                m[(y     ) * size + x + sz] = av_add_q(cur, av_make_q(2, den));
+                m[(y + sz) * size + x     ] = av_add_q(cur, av_make_q(3, den));
+            }
+        }
+    }
+
+    /**
+     * To correctly round, we need to evenly distribute the result on [0, 1),
+     * giving an average value of 1/2.
+     *
+     * After the above construction, we have a matrix with average value:
+     *   [ 0/N + 1/N + 2/N + ... (N-1)/N ] / N = (N-1)/(2N)
+     * where N = size * size is the total number of entries.
+     *
+     * To make the average value equal to 1/2 = N/(2N), add a bias of 1/(2N).
+     */
+    for (int i = 0; i < num_entries; i++)
+        m[i] = av_add_q(m[i], av_make_q(1, 2 * num_entries));
+
+    return m;
+}
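+
+/* For size_log2 = 1 this produces the 2x2 matrix
+ *   [ 1/8  5/8 ]
+ *   [ 7/8  3/8 ]
+ * whose entries average to exactly 1/2, as required above. */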
+
+static bool trc_is_hdr(enum AVColorTransferCharacteristic trc)
+{
+    static_assert(AVCOL_TRC_NB == 19, "Update this list when adding TRCs");
+    switch (trc) {
+    case AVCOL_TRC_LOG:
+    case AVCOL_TRC_LOG_SQRT:
+    case AVCOL_TRC_SMPTEST2084:
+    case AVCOL_TRC_ARIB_STD_B67:
+        return true;
+    default:
+        return false;
+    }
+}
+
+static int fmt_dither(SwsContext *ctx, SwsOpList *ops,
+                      const SwsPixelType type, const SwsFormat fmt)
+{
+    SwsDither mode = ctx->dither;
+    SwsDitherOp dither;
+
+    if (mode == SWS_DITHER_AUTO) {
+        /* Visual threshold of perception: 12 bits for SDR, 14 bits for HDR */
+        const int jnd_bits = trc_is_hdr(fmt.color.trc) ? 14 : 12;
+        const int bpc = fmt.desc->comp[0].depth;
+        mode = bpc >= jnd_bits ? SWS_DITHER_NONE : SWS_DITHER_BAYER;
+    }
+
+    switch (mode) {
+    case SWS_DITHER_NONE:
+        if (ctx->flags & SWS_ACCURATE_RND) {
+            /* Add constant 0.5 for correct rounding */
+            AVRational *bias = av_refstruct_allocz(sizeof(*bias));
+            if (!bias)
+                return AVERROR(ENOMEM);
+            *bias = (AVRational) {1, 2};
+            return ff_sws_op_list_append(ops, &(SwsOp) {
+                .op   = SWS_OP_DITHER,
+                .type = type,
+                .dither.matrix = bias,
+            });
+        } else {
+            return 0; /* No-op */
+        }
+    case SWS_DITHER_BAYER:
+        /* Hardcode a 16x16 matrix for now; in theory we could adjust this
+         * based on the expected level of precision in the output, since
+         * lower bit depth outputs can get by with smaller dither matrices;
+         * in practice, however, we probably want to use error diffusion
+         * for such low bit depths anyway */
+        dither.size_log2 = 4;
+        dither.matrix = generate_bayer_matrix(dither.size_log2);
+        if (!dither.matrix)
+            return AVERROR(ENOMEM);
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .op     = SWS_OP_DITHER,
+            .type   = type,
+            .dither = dither,
+        });
+    case SWS_DITHER_ED:
+    case SWS_DITHER_A_DITHER:
+    case SWS_DITHER_X_DITHER:
+        return AVERROR(ENOTSUP);
+
+    case SWS_DITHER_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid dither mode");
+    return AVERROR(EINVAL);
+}
+
+static inline SwsLinearOp
+linear_mat3(const AVRational m00, const AVRational m01, const AVRational m02,
+            const AVRational m10, const AVRational m11, const AVRational m12,
+            const AVRational m20, const AVRational m21, const AVRational m22)
+{
+    SwsLinearOp c = {{
+        { m00, m01, m02, Q0, Q0 },
+        { m10, m11, m12, Q0, Q0 },
+        { m20, m21, m22, Q0, Q0 },
+        {  Q0,  Q0,  Q0, Q1, Q0 },
+    }};
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+int ff_sws_decode_colors(SwsContext *ctx, SwsPixelType type,
+                         SwsOpList *ops, const SwsFormat fmt, bool *incomplete)
+{
+    const AVLumaCoefficients *c = av_csp_luma_coeffs_from_avcsp(fmt.csp);
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op         = SWS_OP_CONVERT,
+        .type       = fmt_pixel_type(fmt.format),
+        .convert.to = type,
+    }));
+
+    /* Decode pixel format into standardized range */
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .type = type,
+        .op   = SWS_OP_LINEAR,
+        .lin  = fmt_decode_range(fmt, incomplete),
+    }));
+
+    /* Final step, decode colorspace */
+    switch (fmt.csp) {
+    case AVCOL_SPC_RGB:
+        return 0;
+    case AVCOL_SPC_UNSPECIFIED:
+        c = av_csp_luma_coeffs_from_avcsp(AVCOL_SPC_BT470BG);
+        *incomplete = true;
+        /* fall through */
+    case AVCOL_SPC_FCC:
+    case AVCOL_SPC_BT470BG:
+    case AVCOL_SPC_SMPTE170M:
+    case AVCOL_SPC_BT709:
+    case AVCOL_SPC_SMPTE240M:
+    case AVCOL_SPC_BT2020_NCL: {
+        AVRational crg = av_sub_q(Q0, av_div_q(c->cr, c->cg));
+        AVRational cbg = av_sub_q(Q0, av_div_q(c->cb, c->cg));
+        AVRational m02 = av_mul_q(Q(2), av_sub_q(Q1, c->cr));
+        AVRational m21 = av_mul_q(Q(2), av_sub_q(Q1, c->cb));
+        AVRational m11 = av_mul_q(cbg, m21);
+        AVRational m12 = av_mul_q(crg, m02);
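+        /* e.g. for BT.601 (cr = 0.299, cb = 0.114) this recovers the
+         * familiar constants m02 = 1.402, m21 = 1.772, m11 ≈ -0.3441,
+         * m12 ≈ -0.7141 */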
+
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                Q1,  Q0, m02,
+                Q1, m11, m12,
+                Q1, m21,  Q0
+            ),
+        });
+    }
+
+    case AVCOL_SPC_YCGCO:
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                Q1, Q(-1), Q( 1),
+                Q1, Q( 1), Q( 0),
+                Q1, Q(-1), Q(-1)
+            ),
+        });
+
+    case AVCOL_SPC_BT2020_CL:
+    case AVCOL_SPC_SMPTE2085:
+    case AVCOL_SPC_CHROMA_DERIVED_NCL:
+    case AVCOL_SPC_CHROMA_DERIVED_CL:
+    case AVCOL_SPC_ICTCP:
+    case AVCOL_SPC_IPT_C2:
+    case AVCOL_SPC_YCGCO_RE:
+    case AVCOL_SPC_YCGCO_RO:
+        return AVERROR(ENOTSUP);
+
+    case AVCOL_SPC_RESERVED:
+        return AVERROR(EINVAL);
+
+    case AVCOL_SPC_NB:
+        break;
+    }
+
+    av_assert0(!"Corrupt AVColorSpace value?");
+    return AVERROR(EINVAL);
+}
+
+int ff_sws_encode_colors(SwsContext *ctx, SwsPixelType type,
+                         SwsOpList *ops, const SwsFormat fmt, bool *incomplete)
+{
+    const AVLumaCoefficients *c = av_csp_luma_coeffs_from_avcsp(fmt.csp);
+
+    switch (fmt.csp) {
+    case AVCOL_SPC_RGB:
+        break;
+    case AVCOL_SPC_UNSPECIFIED:
+        c = av_csp_luma_coeffs_from_avcsp(AVCOL_SPC_BT470BG);
+        *incomplete = true;
+        /* fall through */
+    case AVCOL_SPC_FCC:
+    case AVCOL_SPC_BT470BG:
+    case AVCOL_SPC_SMPTE170M:
+    case AVCOL_SPC_BT709:
+    case AVCOL_SPC_SMPTE240M:
+    case AVCOL_SPC_BT2020_NCL: {
+        AVRational cb1 = av_sub_q(c->cb, Q1);
+        AVRational cr1 = av_sub_q(c->cr, Q1);
+        AVRational m20 = av_make_q(1,2);
+        AVRational m10 = av_mul_q(m20, av_div_q(c->cr, cb1));
+        AVRational m11 = av_mul_q(m20, av_div_q(c->cg, cb1));
+        AVRational m21 = av_mul_q(m20, av_div_q(c->cg, cr1));
+        AVRational m22 = av_mul_q(m20, av_div_q(c->cb, cr1));
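+        /* e.g. for BT.601 this gives the canonical RGB -> Cb/Cr rows
+         * (-0.1687, -0.3313, 0.5) and (0.5, -0.4187, -0.0813) */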
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                c->cr, c->cg, c->cb,
+                m10,   m11,   m20,
+                m20,   m21,   m22
+            ),
+        }));
+        break;
+    }
+
+    case AVCOL_SPC_YCGCO:
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                av_make_q( 1, 4), av_make_q(1, 2), av_make_q( 1, 4),
+                av_make_q(-1, 4), av_make_q(1, 2), av_make_q(-1, 4),
+                av_make_q( 1, 2), av_make_q(0, 1), av_make_q(-1, 2)
+            ),
+        }));
+        break;
+
+    case AVCOL_SPC_BT2020_CL:
+    case AVCOL_SPC_SMPTE2085:
+    case AVCOL_SPC_CHROMA_DERIVED_NCL:
+    case AVCOL_SPC_CHROMA_DERIVED_CL:
+    case AVCOL_SPC_ICTCP:
+    case AVCOL_SPC_IPT_C2:
+    case AVCOL_SPC_YCGCO_RE:
+    case AVCOL_SPC_YCGCO_RO:
+        return AVERROR(ENOTSUP);
+
+    case AVCOL_SPC_RESERVED:
+    case AVCOL_SPC_NB:
+        return AVERROR(EINVAL);
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .type = type,
+        .op   = SWS_OP_LINEAR,
+        .lin  = fmt_encode_range(fmt, incomplete),
+    }));
+
+    if (!(fmt.desc->flags & AV_PIX_FMT_FLAG_FLOAT)) {
+        SwsConst range = {0};
+
+        const bool is_ya = fmt.desc->nb_components == 2;
+        for (int i = 0; i < fmt.desc->nb_components; i++) {
+            /* Clamp to legal pixel range */
+            const int idx = i * (is_ya ? 3 : 1);
+            range.q4[idx] = Q((1 << fmt.desc->comp[i].depth) - 1);
+        }
+
+        RET(fmt_dither(ctx, ops, type, fmt));
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_MAX,
+            .type = type,
+            .c.q4 = { Q0, Q0, Q0, Q0 },
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_MIN,
+            .type = type,
+            .c    = range,
+        }));
+    }
+
+    return ff_sws_op_list_append(ops, &(SwsOp) {
+        .type       = type,
+        .op         = SWS_OP_CONVERT,
+        .convert.to = fmt_pixel_type(fmt.format),
+    });
+}
diff --git a/libswscale/format.h b/libswscale/format.h
index be92038f4f..e6a1fd7116 100644
--- a/libswscale/format.h
+++ b/libswscale/format.h
@@ -148,4 +148,27 @@ int ff_test_fmt(const SwsFormat *fmt, int output);
 /* Returns true if the formats are incomplete, false otherwise */
 bool ff_infer_colors(SwsColor *src, SwsColor *dst);
 
+typedef struct SwsOpList SwsOpList;
+typedef enum SwsPixelType SwsPixelType;
+
+/**
+ * Append a set of operations for decoding/encoding raw pixels. This handles
+ * the raw reads/writes, swizzling, shifting and byte swapping.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_decode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt);
+int ff_sws_encode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt);
+
+/**
+ * Append a set of operations for transforming decoded pixel values to/from
+ * normalized RGB in the specified gamut and pixel type.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_decode_colors(SwsContext *ctx, SwsPixelType type, SwsOpList *ops,
+                         const SwsFormat fmt, bool *incomplete);
+int ff_sws_encode_colors(SwsContext *ctx, SwsPixelType type, SwsOpList *ops,
+                         const SwsFormat fmt, bool *incomplete);
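+
+/* A typical conversion pipeline (sketch) chains these helpers:
+ *   ff_sws_decode_pixfmt() -> ff_sws_decode_colors() -> (processing)
+ *   -> ff_sws_encode_colors() -> ff_sws_encode_pixfmt() */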
+
 #endif /* SWSCALE_FORMAT_H */
-- 
2.49.0
