Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework
From: Niklas Haas @ 2025-05-21 12:43 UTC
  To: ffmpeg-devel

Changes since v1:
- keep track of `packed` status even for single-element bit-packed formats
- fix memory leak of dither matrix
- fix AVRational printing of infinities
- fix value range tracking for big endian formats
- fix some overflow bugs on 32-bit
- remove unneeded internal helper
- add optimization for convert->swizzle->convert
- clean up the generated shuffle mask when clearing multiple bytes
- slightly tune the x86 asm loops
- add an `unsigned max_ulp` to checkasm_check_float()


* [FFmpeg-devel] [PATCH v2 01/17] swscale/format: rename legacy format conversion table
From: Niklas Haas @ 2025-05-21 12:43 UTC
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

---
 libswscale/format.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/libswscale/format.c b/libswscale/format.c
index e4c1348b90..b77081dd7a 100644
--- a/libswscale/format.c
+++ b/libswscale/format.c
@@ -24,14 +24,14 @@
 
 #include "format.h"
 
-typedef struct FormatEntry {
+typedef struct LegacyFormatEntry {
     uint8_t is_supported_in         :1;
     uint8_t is_supported_out        :1;
     uint8_t is_supported_endianness :1;
-} FormatEntry;
+} LegacyFormatEntry;
 
 /* Format support table for legacy swscale */
-static const FormatEntry format_entries[] = {
+static const LegacyFormatEntry legacy_format_entries[] = {
     [AV_PIX_FMT_YUV420P]        = { 1, 1 },
     [AV_PIX_FMT_YUYV422]        = { 1, 1 },
     [AV_PIX_FMT_RGB24]          = { 1, 1 },
@@ -262,20 +262,20 @@ static const FormatEntry format_entries[] = {
 
 int sws_isSupportedInput(enum AVPixelFormat pix_fmt)
 {
-    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
-           format_entries[pix_fmt].is_supported_in : 0;
+    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(legacy_format_entries) ?
+           legacy_format_entries[pix_fmt].is_supported_in : 0;
 }
 
 int sws_isSupportedOutput(enum AVPixelFormat pix_fmt)
 {
-    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
-           format_entries[pix_fmt].is_supported_out : 0;
+    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(legacy_format_entries) ?
+           legacy_format_entries[pix_fmt].is_supported_out : 0;
 }
 
 int sws_isSupportedEndiannessConversion(enum AVPixelFormat pix_fmt)
 {
-    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
-           format_entries[pix_fmt].is_supported_endianness : 0;
+    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(legacy_format_entries) ?
+           legacy_format_entries[pix_fmt].is_supported_endianness : 0;
 }
 
 /**
-- 
2.49.0


* [FFmpeg-devel] [PATCH v2 02/17] swscale/format: add ff_fmt_clear()
From: Niklas Haas @ 2025-05-21 12:43 UTC
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Reset an SwsFormat to its fully unset/invalid state.
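
As a minimal usage sketch (hypothetical caller, not part of this patch;
assumes the internal format.h header is available):

    SwsFormat fmt;
    ff_fmt_clear(&fmt);
    /* fmt is now fully unset: fmt.format == AV_PIX_FMT_NONE, etc. */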
---
 libswscale/format.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/libswscale/format.h b/libswscale/format.h
index 3b6d745159..be92038f4f 100644
--- a/libswscale/format.h
+++ b/libswscale/format.h
@@ -85,6 +85,20 @@ typedef struct SwsFormat {
     SwsColor color;
 } SwsFormat;
 
+static inline void ff_fmt_clear(SwsFormat *fmt)
+{
+    *fmt = (SwsFormat) {
+        .format     = AV_PIX_FMT_NONE,
+        .range      = AVCOL_RANGE_UNSPECIFIED,
+        .csp        = AVCOL_SPC_UNSPECIFIED,
+        .loc        = AVCHROMA_LOC_UNSPECIFIED,
+        .color = {
+            .prim = AVCOL_PRI_UNSPECIFIED,
+            .trc  = AVCOL_TRC_UNSPECIFIED,
+        },
+    };
+}
+
 /**
  * This function also sanitizes and strips the input data, removing irrelevant
  * fields for certain formats.
-- 
2.49.0


* [FFmpeg-devel] [PATCH v2 03/17] tests/checkasm: increase number of runs in between measurements
From: Niklas Haas @ 2025-05-21 12:43 UTC
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Sometimes, when measuring very small functions, rdtsc is not accurate enough
to get a reliable measurement. Increase the number of calls inside the inner
timing loop from 4 to 32, which should help considerably. This matters less
when using the more precise linux-perf API, but is still useful.

There should be no user-visible change since the number of runs is adjusted
to keep the total time spent measuring the same.
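
To sanity-check that: previously, each of bench_runs iterations timed 4
calls, i.e. 4 * bench_runs calls in total; now, truns = bench_runs >> 3
iterations time 2 * 16 = 32 calls each, i.e. 32 * (bench_runs / 8) =
4 * bench_runs calls in total, the same amount of work (up to the
FFMAX(..., 1) clamp for very small values of bench_runs).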
---
 tests/checkasm/checkasm.c |  2 +-
 tests/checkasm/checkasm.h | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 0734cd26bf..71d1e5766c 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -628,7 +628,7 @@ static inline double avg_cycles_per_call(const CheckasmPerf *const p)
     if (p->iterations) {
         const double cycles = (double)(10 * p->cycles) / p->iterations - state.nop_time;
         if (cycles > 0.0)
-            return cycles / 4.0; /* 4 calls per iteration */
+            return cycles / 32.0; /* 32 calls per iteration */
     }
     return 0.0;
 }
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index 146bfdec35..ad7ed10613 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -342,6 +342,22 @@ typedef struct CheckasmPerf {
 #define PERF_STOP(t)  t = AV_READ_TIME() - t
 #endif
 
+#define CALL4(...)\
+    do {\
+        tfunc(__VA_ARGS__); \
+        tfunc(__VA_ARGS__); \
+        tfunc(__VA_ARGS__); \
+        tfunc(__VA_ARGS__); \
+    } while (0)
+
+#define CALL16(...)\
+    do {\
+        CALL4(__VA_ARGS__); \
+        CALL4(__VA_ARGS__); \
+        CALL4(__VA_ARGS__); \
+        CALL4(__VA_ARGS__); \
+    } while (0)
+
 /* Benchmark the function */
 #define bench_new(...)\
     do {\
@@ -352,14 +368,12 @@ typedef struct CheckasmPerf {
             uint64_t tsum = 0;\
             uint64_t ti, tcount = 0;\
             uint64_t t = 0; \
-            const uint64_t truns = bench_runs;\
+            const uint64_t truns = FFMAX(bench_runs >> 3, 1);\
             checkasm_set_signal_handler_state(1);\
             for (ti = 0; ti < truns; ti++) {\
                 PERF_START(t);\
-                tfunc(__VA_ARGS__);\
-                tfunc(__VA_ARGS__);\
-                tfunc(__VA_ARGS__);\
-                tfunc(__VA_ARGS__);\
+                CALL16(__VA_ARGS__);\
+                CALL16(__VA_ARGS__);\
                 PERF_STOP(t);\
                 if (t*tcount <= tsum*4 && ti > 0) {\
                     tsum += t;\
-- 
2.49.0


* [FFmpeg-devel] [PATCH v2 04/17] tests/checkasm: generalize DEF_CHECKASM_CHECK_FUNC to floats
From: Niklas Haas @ 2025-05-21 12:43 UTC
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Split the standard macro into its body (implementation) and its declaration,
and use a macro argument in place of the raw `memcmp` call. The main
difference is that the comparison now takes the number of pixels instead of
the number of bytes, to match the signature of float_near_ulp_array().
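
As an illustration, a checkasm test for a float kernel could then compare
two output buffers with an explicit ULP tolerance roughly as follows
(hypothetical call; the buffer names and the tolerance of 4 ULP are made up):

    if (checkasm_check_float_ulp(__FILE__, __LINE__,
                                 ref_out, stride, new_out, stride,
                                 w, h, "output", 4 /* max_ulp */,
                                 1, 1, 0 /* align_w, align_h, padding */))
        fail();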
---
 tests/checkasm/checkasm.c | 52 ++++++++++++++++++++++++++-------------
 tests/checkasm/checkasm.h |  7 ++++++
 2 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 71d1e5766c..f393a0cb96 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -1187,14 +1187,8 @@ static int check_err(const char *file, int line,
     return 0;
 }
 
-#define DEF_CHECKASM_CHECK_FUNC(type, fmt) \
-int checkasm_check_##type(const char *file, int line, \
-                          const type *buf1, ptrdiff_t stride1, \
-                          const type *buf2, ptrdiff_t stride2, \
-                          int w, int h, const char *name, \
-                          int align_w, int align_h, \
-                          int padding) \
-{ \
+#define DEF_CHECKASM_CHECK_BODY(compare, type, fmt) \
+do { \
     int64_t aligned_w = (w - 1LL + align_w) & ~(align_w - 1); \
     int64_t aligned_h = (h - 1LL + align_h) & ~(align_h - 1); \
     int err = 0; \
@@ -1204,7 +1198,7 @@ int checkasm_check_##type(const char *file, int line, \
     stride1 /= sizeof(*buf1); \
     stride2 /= sizeof(*buf2); \
     for (y = 0; y < h; y++) \
-        if (memcmp(&buf1[y*stride1], &buf2[y*stride2], w*sizeof(*buf1))) \
+        if (!compare(&buf1[y*stride1], &buf2[y*stride2], w)) \
             break; \
     if (y != h) { \
         if (check_err(file, line, name, w, h, &err)) \
@@ -1226,38 +1220,50 @@ int checkasm_check_##type(const char *file, int line, \
         buf2 -= h*stride2; \
     } \
     for (y = -padding; y < 0; y++) \
-        if (memcmp(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
-                   (w + 2*padding)*sizeof(*buf1))) { \
+        if (!compare(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
+                     w + 2*padding)) { \
             if (check_err(file, line, name, w, h, &err)) \
                 return 1; \
             fprintf(stderr, " overwrite above\n"); \
             break; \
         } \
     for (y = aligned_h; y < aligned_h + padding; y++) \
-        if (memcmp(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
-                   (w + 2*padding)*sizeof(*buf1))) { \
+        if (!compare(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
+                     w + 2*padding)) { \
             if (check_err(file, line, name, w, h, &err)) \
                 return 1; \
             fprintf(stderr, " overwrite below\n"); \
             break; \
         } \
     for (y = 0; y < h; y++) \
-        if (memcmp(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
-                   padding*sizeof(*buf1))) { \
+        if (!compare(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
+                     padding)) { \
             if (check_err(file, line, name, w, h, &err)) \
                 return 1; \
             fprintf(stderr, " overwrite left\n"); \
             break; \
         } \
     for (y = 0; y < h; y++) \
-        if (memcmp(&buf1[y*stride1 + aligned_w], &buf2[y*stride2 + aligned_w], \
-                   padding*sizeof(*buf1))) { \
+        if (!compare(&buf1[y*stride1 + aligned_w], &buf2[y*stride2 + aligned_w], \
+                     padding)) { \
             if (check_err(file, line, name, w, h, &err)) \
                 return 1; \
             fprintf(stderr, " overwrite right\n"); \
             break; \
         } \
     return err; \
+} while (0)
+
+#define cmp_int(a, b, len) (!memcmp(a, b, (len) * sizeof(*(a))))
+#define DEF_CHECKASM_CHECK_FUNC(type, fmt) \
+int checkasm_check_##type(const char *file, int line, \
+                          const type *buf1, ptrdiff_t stride1, \
+                          const type *buf2, ptrdiff_t stride2, \
+                          int w, int h, const char *name, \
+                          int align_w, int align_h, \
+                          int padding) \
+{ \
+    DEF_CHECKASM_CHECK_BODY(cmp_int, type, fmt); \
 }
 
 DEF_CHECKASM_CHECK_FUNC(uint8_t,  "%02x")
@@ -1265,3 +1271,15 @@ DEF_CHECKASM_CHECK_FUNC(uint16_t, "%04x")
 DEF_CHECKASM_CHECK_FUNC(uint32_t, "%08x")
 DEF_CHECKASM_CHECK_FUNC(int16_t,  "%6d")
 DEF_CHECKASM_CHECK_FUNC(int32_t,  "%9d")
+
+int checkasm_check_float_ulp(const char *file, int line,
+                             const float *buf1, ptrdiff_t stride1,
+                             const float *buf2, ptrdiff_t stride2,
+                             int w, int h, const char *name,
+                             unsigned max_ulp, int align_w, int align_h,
+                             int padding)
+{
+    #define cmp_float(a, b, len) float_near_ulp_array(a, b, max_ulp, len)
+    DEF_CHECKASM_CHECK_BODY(cmp_float, float, "%g");
+    #undef cmp_float
+}
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index ad7ed10613..ec01bd6207 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -423,6 +423,13 @@ DECL_CHECKASM_CHECK_FUNC(uint32_t);
 DECL_CHECKASM_CHECK_FUNC(int16_t);
 DECL_CHECKASM_CHECK_FUNC(int32_t);
 
+int checkasm_check_float_ulp(const char *file, int line,
+                             const float *buf1, ptrdiff_t stride1,
+                             const float *buf2, ptrdiff_t stride2,
+                             int w, int h, const char *name,
+                             unsigned max_ulp, int align_w, int align_h,
+                             int padding);
+
 #define PASTE(a,b) a ## b
 #define CONCAT(a,b) PASTE(a,b)
 
-- 
2.49.0


* [FFmpeg-devel] [PATCH v2 05/17] swscale: add SWS_UNSTABLE flag
From: Niklas Haas @ 2025-05-21 12:43 UTC
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Give users and developers a way to opt in to the new format conversion code,
and to more of the swscale rewrite in general, even while development is
still ongoing.
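
For example (hypothetical invocations), the new code paths can then be
exercised either through the scaler option added below:

    ffmpeg -i input.mkv -sws_flags unstable output.mkv

or programmatically via the AVOption API:

    SwsContext *ctx = sws_alloc_context();
    av_opt_set(ctx, "sws_flags", "unstable", 0);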
---
 doc/APIchanges       | 3 +++
 doc/scaler.texi      | 4 ++++
 libswscale/options.c | 1 +
 libswscale/swscale.h | 7 +++++++
 libswscale/version.h | 2 +-
 5 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/doc/APIchanges b/doc/APIchanges
index d0869561f3..fb202c7908 100644
--- a/doc/APIchanges
+++ b/doc/APIchanges
@@ -2,6 +2,9 @@ The last version increases of all libraries were on 2025-03-28
 
 API changes, most recent first:
 
+2025-04-xx - xxxxxxxxxx - lsws 9.1.100 - swscale.h
+  Add SWS_UNSTABLE flag.
+
 2025-02-xx - xxxxxxxxxx - lavfi 10.10.100 - avfilter.h
   Add avfilter_link_get_hw_frames_ctx().
 
diff --git a/doc/scaler.texi b/doc/scaler.texi
index eb045de6b7..42b2377761 100644
--- a/doc/scaler.texi
+++ b/doc/scaler.texi
@@ -68,6 +68,10 @@ Select full chroma input.
 
 @item bitexact
 Enable bitexact output.
+
+@item unstable
+Allow the use of experimental new code. May subtly affect the output or even
+produce wrong results. For testing only.
 @end table
 
 @item srcw @var{(API only)}
diff --git a/libswscale/options.c b/libswscale/options.c
index feecae8c89..06e51dcfe9 100644
--- a/libswscale/options.c
+++ b/libswscale/options.c
@@ -50,6 +50,7 @@ static const AVOption swscale_options[] = {
         { "full_chroma_inp", "full chroma input",             0,  AV_OPT_TYPE_CONST, { .i64 = SWS_FULL_CHR_H_INP }, .flags = VE, .unit = "sws_flags" },
         { "bitexact",        "bit-exact mode",                0,  AV_OPT_TYPE_CONST, { .i64 = SWS_BITEXACT       }, .flags = VE, .unit = "sws_flags" },
         { "error_diffusion", "error diffusion dither",        0,  AV_OPT_TYPE_CONST, { .i64 = SWS_ERROR_DIFFUSION}, .flags = VE, .unit = "sws_flags" },
+        { "unstable",        "allow experimental new code",   0,  AV_OPT_TYPE_CONST, { .i64 = SWS_UNSTABLE       }, .flags = VE, .unit = "sws_flags" },
 
     { "param0",          "scaler param 0", OFFSET(scaler_params[0]), AV_OPT_TYPE_DOUBLE, { .dbl = SWS_PARAM_DEFAULT  }, INT_MIN, INT_MAX, VE },
     { "param1",          "scaler param 1", OFFSET(scaler_params[1]), AV_OPT_TYPE_DOUBLE, { .dbl = SWS_PARAM_DEFAULT  }, INT_MIN, INT_MAX, VE },
diff --git a/libswscale/swscale.h b/libswscale/swscale.h
index b04aa182d2..4aa072009c 100644
--- a/libswscale/swscale.h
+++ b/libswscale/swscale.h
@@ -155,6 +155,13 @@ typedef enum SwsFlags {
     SWS_ACCURATE_RND   = 1 << 18,
     SWS_BITEXACT       = 1 << 19,
 
+    /**
+     * Allow using experimental new code paths. This may be faster, slower,
+     * or produce different output, with semantics subject to change at any
+     * point in time. For testing and debugging purposes only.
+     */
+    SWS_UNSTABLE = 1 << 20,
+
     /**
      * Deprecated flags.
      */
diff --git a/libswscale/version.h b/libswscale/version.h
index 148efd83eb..4e54701aba 100644
--- a/libswscale/version.h
+++ b/libswscale/version.h
@@ -28,7 +28,7 @@
 
 #include "version_major.h"
 
-#define LIBSWSCALE_VERSION_MINOR   0
+#define LIBSWSCALE_VERSION_MINOR   1
 #define LIBSWSCALE_VERSION_MICRO 100
 
 #define LIBSWSCALE_VERSION_INT  AV_VERSION_INT(LIBSWSCALE_VERSION_MAJOR, \
-- 
2.49.0


* [FFmpeg-devel] [PATCH v2 06/17] swscale/ops: introduce new low level framework
From: Niklas Haas @ 2025-05-21 12:43 UTC
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

See docs/swscale-v2.txt for an in-depth introduction to the new approach.

This commit merely introduces the ops definitions and boilerplate functions.
The subsequent commits will flesh out the underlying implementation.
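
As a rough sketch of the intended API use (error handling omitted; the ops
chosen below are purely illustrative):

    SwsOpList *ops = ff_sws_op_list_alloc();
    SwsOp op = {
        .op   = SWS_OP_READ,
        .type = SWS_PIXEL_U8,
        .rw   = { .elems = 4, .packed = true }, /* e.g. packed rgba */
    };
    ff_sws_op_list_append(ops, &op); /* takes ownership, zeroes `op` */
    /* ... append further ops: convert, linear, pack, write, ... */
    ff_sws_op_list_print(NULL, AV_LOG_DEBUG, ops);
    ff_sws_op_list_free(&ops);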
---
 libswscale/Makefile |   1 +
 libswscale/ops.c    | 522 ++++++++++++++++++++++++++++++++++++++++++++
 libswscale/ops.h    | 240 ++++++++++++++++++++
 3 files changed, 763 insertions(+)
 create mode 100644 libswscale/ops.c
 create mode 100644 libswscale/ops.h

diff --git a/libswscale/Makefile b/libswscale/Makefile
index d5e10d17dc..e0beef4e69 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -15,6 +15,7 @@ OBJS = alphablend.o                                     \
        graph.o                                          \
        input.o                                          \
        lut3d.o                                          \
+       ops.o                                            \
        options.o                                        \
        output.o                                         \
        rgb2rgb.o                                        \
diff --git a/libswscale/ops.c b/libswscale/ops.c
new file mode 100644
index 0000000000..004686147d
--- /dev/null
+++ b/libswscale/ops.c
@@ -0,0 +1,522 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/bswap.h"
+#include "libavutil/mem.h"
+#include "libavutil/rational.h"
+#include "libavutil/refstruct.h"
+
+#include "ops.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+
+const char *ff_sws_pixel_type_name(SwsPixelType type)
+{
+    switch (type) {
+    case SWS_PIXEL_U8:   return "u8";
+    case SWS_PIXEL_U16:  return "u16";
+    case SWS_PIXEL_U32:  return "u32";
+    case SWS_PIXEL_F32:  return "f32";
+    case SWS_PIXEL_NONE: return "none";
+    case SWS_PIXEL_TYPE_NB: break;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return "ERR";
+}
+
+int ff_sws_pixel_type_size(SwsPixelType type)
+{
+    switch (type) {
+    case SWS_PIXEL_U8:  return sizeof(uint8_t);
+    case SWS_PIXEL_U16: return sizeof(uint16_t);
+    case SWS_PIXEL_U32: return sizeof(uint32_t);
+    case SWS_PIXEL_F32: return sizeof(float);
+    case SWS_PIXEL_NONE: break;
+    case SWS_PIXEL_TYPE_NB: break;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return 0;
+}
+
+bool ff_sws_pixel_type_is_int(SwsPixelType type)
+{
+    switch (type) {
+    case SWS_PIXEL_U8:
+    case SWS_PIXEL_U16:
+    case SWS_PIXEL_U32:
+        return true;
+    case SWS_PIXEL_F32:
+        return false;
+    case SWS_PIXEL_NONE:
+    case SWS_PIXEL_TYPE_NB: break;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return false;
+}
+
+SwsPixelType ff_sws_pixel_type_to_uint(SwsPixelType type)
+{
+    if (!type)
+        return type;
+
+    switch (ff_sws_pixel_type_size(type)) {
+    case 1: return SWS_PIXEL_U8; /* ff_sws_pixel_type_size() is in bytes */
+    case 2: return SWS_PIXEL_U16;
+    case 4: return SWS_PIXEL_U32;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return SWS_PIXEL_NONE;
+}
+
+/* biased towards `a` */
+static AVRational av_min_q(AVRational a, AVRational b)
+{
+    return av_cmp_q(a, b) == 1 ? b : a;
+}
+
+static AVRational av_max_q(AVRational a, AVRational b)
+{
+    return av_cmp_q(a, b) == -1 ? b : a;
+}
+
+static AVRational expand_factor(SwsPixelType from, SwsPixelType to)
+{
+    const int src = ff_sws_pixel_type_size(from);
+    const int dst = ff_sws_pixel_type_size(to);
+    int scale = 0;
+    for (int i = 0; i < dst / src; i++)
+        scale = scale << src * 8 | 1;
+    return Q(scale);
+}
+
+void ff_sws_apply_op_q(const SwsOp *op, AVRational x[4])
+{
+    switch (op->op) {
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+        return;
+    case SWS_OP_UNPACK: {
+        unsigned val = x[0].num;
+        int shift = ff_sws_pixel_type_size(op->type) * 8;
+        for (int i = 0; i < 4; i++) {
+            const unsigned mask = (1 << op->pack.pattern[i]) - 1;
+            shift -= op->pack.pattern[i];
+            x[i] = Q((val >> shift) & mask);
+        }
+        return;
+    }
+    case SWS_OP_PACK: {
+        unsigned val = 0;
+        int shift = ff_sws_pixel_type_size(op->type) * 8;
+        for (int i = 0; i < 4; i++) {
+            const unsigned mask = (1 << op->pack.pattern[i]) - 1;
+            shift -= op->pack.pattern[i];
+            val |= (x[i].num & mask) << shift;
+        }
+        x[0] = Q(val);
+        return;
+    }
+    case SWS_OP_SWAP_BYTES:
+        switch (ff_sws_pixel_type_size(op->type)) {
+        case 2:
+            for (int i = 0; i < 4; i++)
+                x[i].num = av_bswap16(x[i].num);
+            break;
+        case 4:
+            for (int i = 0; i < 4; i++)
+                x[i].num = av_bswap32(x[i].num);
+            break;
+        }
+        return;
+    case SWS_OP_CLEAR:
+        for (int i = 0; i < 4; i++) {
+            if (op->c.q4[i].den)
+                x[i] = op->c.q4[i];
+        }
+        return;
+    case SWS_OP_LSHIFT: {
+        AVRational mult = Q(1 << op->c.u);
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_mul_q(x[i], mult) : x[i];
+        return;
+    }
+    case SWS_OP_RSHIFT: {
+        AVRational mult = Q(1 << op->c.u);
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_div_q(x[i], mult) : x[i];
+        return;
+    }
+    case SWS_OP_SWIZZLE: {
+        const AVRational orig[4] = { x[0], x[1], x[2], x[3] };
+        for (int i = 0; i < 4; i++)
+            x[i] = orig[op->swizzle.in[i]];
+        return;
+    }
+    case SWS_OP_CONVERT:
+        if (ff_sws_pixel_type_is_int(op->convert.to)) {
+            const AVRational scale = expand_factor(op->type, op->convert.to);
+            for (int i = 0; i < 4; i++) {
+                x[i] = x[i].den ? Q(x[i].num / x[i].den) : x[i];
+                if (op->convert.expand)
+                    x[i] = av_mul_q(x[i], scale);
+            }
+        }
+        return;
+    case SWS_OP_DITHER:
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_add_q(x[i], av_make_q(1, 2)) : x[i];
+        return;
+    case SWS_OP_MIN:
+        for (int i = 0; i < 4; i++)
+            x[i] = av_min_q(x[i], op->c.q4[i]);
+        return;
+    case SWS_OP_MAX:
+        for (int i = 0; i < 4; i++)
+            x[i] = av_max_q(x[i], op->c.q4[i]);
+        return;
+    case SWS_OP_LINEAR: {
+        const AVRational orig[4] = { x[0], x[1], x[2], x[3] };
+        for (int i = 0; i < 4; i++) {
+            AVRational sum = op->lin.m[i][4];
+            for (int j = 0; j < 4; j++)
+                sum = av_add_q(sum, av_mul_q(orig[j], op->lin.m[i][j]));
+            x[i] = sum;
+        }
+        return;
+    }
+    case SWS_OP_SCALE:
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_mul_q(x[i], op->c.q) : x[i];
+        return;
+    }
+
+    av_assert0(!"Invalid operation type!");
+}
+
+void ff_sws_op_uninit(SwsOp *op)
+{
+    switch (op->op) {
+    case SWS_OP_DITHER:
+        av_refstruct_unref(&op->dither.matrix);
+        break;
+    }
+
+    *op = (SwsOp) {0};
+}
+
+SwsOpList *ff_sws_op_list_alloc(void)
+{
+    SwsOpList *ops = av_mallocz(sizeof(SwsOpList));
+    if (!ops)
+        return NULL;
+
+    ff_fmt_clear(&ops->src);
+    ff_fmt_clear(&ops->dst);
+    return ops;
+}
+
+void ff_sws_op_list_free(SwsOpList **p_ops)
+{
+    SwsOpList *ops = *p_ops;
+    if (!ops)
+        return;
+
+    for (int i = 0; i < ops->num_ops; i++)
+        ff_sws_op_uninit(&ops->ops[i]);
+
+    av_freep(&ops->ops);
+    av_free(ops);
+    *p_ops = NULL;
+}
+
+SwsOpList *ff_sws_op_list_duplicate(const SwsOpList *ops)
+{
+    SwsOpList *copy = av_malloc(sizeof(*copy));
+    if (!copy)
+        return NULL;
+
+    *copy = *ops;
+    copy->ops = av_memdup(ops->ops, ops->num_ops * sizeof(ops->ops[0]));
+    if (!copy->ops) {
+        av_free(copy);
+        return NULL;
+    }
+
+    for (int i = 0; i < ops->num_ops; i++) {
+        const SwsOp *op = &ops->ops[i];
+        switch (op->op) {
+        case SWS_OP_DITHER:
+            av_refstruct_ref(copy->ops[i].dither.matrix);
+            break;
+        }
+    }
+
+    return copy;
+}
+
+void ff_sws_op_list_remove_at(SwsOpList *ops, int index, int count)
+{
+    const int end = ops->num_ops - count;
+    av_assert2(index >= 0 && count >= 0 && index + count <= ops->num_ops);
+    ff_sws_op_uninit(&ops->ops[index]);
+    for (int i = index; i < end; i++)
+        ops->ops[i] = ops->ops[i + count];
+    ops->num_ops = end;
+}
+
+int ff_sws_op_list_insert_at(SwsOpList *ops, int index, SwsOp *op)
+{
+    void *ret;
+    ret = av_dynarray2_add((void **) &ops->ops, &ops->num_ops, sizeof(*op),
+                           (const void *) op);
+    if (!ret) {
+        ff_sws_op_uninit(op);
+        return AVERROR(ENOMEM);
+    }
+
+    for (int i = ops->num_ops - 1; i > index; i--)
+        ops->ops[i] = ops->ops[i - 1];
+    ops->ops[index] = *op;
+    *op = (SwsOp) {0};
+    return 0;
+}
+
+int ff_sws_op_list_append(SwsOpList *ops, SwsOp *op)
+{
+    return ff_sws_op_list_insert_at(ops, ops->num_ops, op);
+}
+
+int ff_sws_op_list_max_size(const SwsOpList *ops)
+{
+    int max_size = 0;
+    for (int i = 0; i < ops->num_ops; i++) {
+        const int size = ff_sws_pixel_type_size(ops->ops[i].type);
+        max_size = FFMAX(max_size, size);
+    }
+
+    return max_size;
+}
+
+uint32_t ff_sws_linear_mask(const SwsLinearOp c)
+{
+    uint32_t mask = 0;
+    for (int i = 0; i < 4; i++) {
+        for (int j = 0; j < 5; j++) {
+            if (av_cmp_q(c.m[i][j], Q(i == j)))
+                mask |= SWS_MASK(i, j);
+        }
+    }
+    return mask;
+}
+
+static const char *describe_lin_mask(uint32_t mask)
+{
+    /* Try to be fairly descriptive without assuming too much */
+    static const struct {
+        const char *name;
+        uint32_t mask;
+    } patterns[] = {
+        { "noop",               0 },
+        { "luma",               SWS_MASK_LUMA },
+        { "alpha",              SWS_MASK_ALPHA },
+        { "luma+alpha",         SWS_MASK_LUMA | SWS_MASK_ALPHA },
+        { "dot3",               0b111 },
+        { "dot4",               0b1111 },
+        { "row0",               SWS_MASK_ROW(0) },
+        { "row0+alpha",         SWS_MASK_ROW(0) | SWS_MASK_ALPHA },
+        { "col0",               SWS_MASK_COL(0) },
+        { "col0+off3",          SWS_MASK_COL(0) | SWS_MASK_OFF3 },
+        { "off3",               SWS_MASK_OFF3 },
+        { "off3+alpha",         SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag3",              SWS_MASK_DIAG3 },
+        { "diag4",              SWS_MASK_DIAG4 },
+        { "diag3+alpha",        SWS_MASK_DIAG3 | SWS_MASK_ALPHA },
+        { "diag3+off3",         SWS_MASK_DIAG3 | SWS_MASK_OFF3 },
+        { "diag3+off3+alpha",   SWS_MASK_DIAG3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag4+off4",         SWS_MASK_DIAG4 | SWS_MASK_OFF4 },
+        { "matrix3",            SWS_MASK_MAT3 },
+        { "matrix3+off3",       SWS_MASK_MAT3 | SWS_MASK_OFF3 },
+        { "matrix3+off3+alpha", SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "matrix4",            SWS_MASK_MAT4 },
+        { "matrix4+off4",       SWS_MASK_MAT4 | SWS_MASK_OFF4 },
+    };
+
+    for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+        if (!(mask & ~patterns[i].mask))
+            return patterns[i].name;
+    }
+
+    return "full";
+}
+
+static char describe_comp_flags(unsigned flags)
+{
+    if (flags & SWS_COMP_GARBAGE)
+        return 'X';
+    else if (flags & SWS_COMP_ZERO)
+        return '0';
+    else if (flags & SWS_COMP_EXACT)
+        return '+';
+    else
+        return '.';
+}
+
+static const char *print_q(const AVRational q, char buf[], int buf_len)
+{
+    if (!q.den) {
+        return q.num > 0 ? "inf" : q.num < 0 ? "-inf" : "nan";
+    } else if (q.den == 1) {
+        snprintf(buf, buf_len, "%d", q.num);
+        return buf;
+    } else if (abs(q.num) > 1000 || abs(q.den) > 1000) {
+        snprintf(buf, buf_len, "%f", av_q2d(q));
+        return buf;
+    } else {
+        snprintf(buf, buf_len, "%d/%d", q.num, q.den);
+        return buf;
+    }
+}
+
+#define PRINTQ(q) print_q(q, (char[32]){0}, sizeof(char[32]) - 1)
+
+void ff_sws_op_list_print(void *log, int lev, const SwsOpList *ops)
+{
+    if (!ops->num_ops) {
+        av_log(log, lev, "  (empty)\n");
+        return;
+    }
+
+    for (int i = 0; i < ops->num_ops; i++) {
+        const SwsOp *op = &ops->ops[i];
+        av_log(log, lev, "  [%3s %c%c%c%c -> %c%c%c%c] ",
+               ff_sws_pixel_type_name(op->type),
+               op->comps.unused[0] ? 'X' : '.',
+               op->comps.unused[1] ? 'X' : '.',
+               op->comps.unused[2] ? 'X' : '.',
+               op->comps.unused[3] ? 'X' : '.',
+               describe_comp_flags(op->comps.flags[0]),
+               describe_comp_flags(op->comps.flags[1]),
+               describe_comp_flags(op->comps.flags[2]),
+               describe_comp_flags(op->comps.flags[3]));
+
+        switch (op->op) {
+        case SWS_OP_INVALID:
+            av_log(log, lev, "SWS_OP_INVALID\n");
+            break;
+        case SWS_OP_READ:
+        case SWS_OP_WRITE:
+            av_log(log, lev, "%-20s: %d elem(s) %s >> %d\n",
+                   op->op == SWS_OP_READ ? "SWS_OP_READ"
+                                         : "SWS_OP_WRITE",
+                   op->rw.elems,  op->rw.packed ? "packed" : "planar",
+                   op->rw.frac);
+            break;
+        case SWS_OP_SWAP_BYTES:
+            av_log(log, lev, "SWS_OP_SWAP_BYTES\n");
+            break;
+        case SWS_OP_LSHIFT:
+            av_log(log, lev, "%-20s: << %u\n", "SWS_OP_LSHIFT", op->c.u);
+            break;
+        case SWS_OP_RSHIFT:
+            av_log(log, lev, "%-20s: >> %u\n", "SWS_OP_RSHIFT", op->c.u);
+            break;
+        case SWS_OP_PACK:
+        case SWS_OP_UNPACK:
+            av_log(log, lev, "%-20s: {%d %d %d %d}\n",
+                   op->op == SWS_OP_PACK ? "SWS_OP_PACK"
+                                         : "SWS_OP_UNPACK",
+                   op->pack.pattern[0], op->pack.pattern[1],
+                   op->pack.pattern[2], op->pack.pattern[3]);
+            break;
+        case SWS_OP_CLEAR:
+            av_log(log, lev, "%-20s: {%s %s %s %s}\n", "SWS_OP_CLEAR",
+                   op->c.q4[0].den ? PRINTQ(op->c.q4[0]) : "_",
+                   op->c.q4[1].den ? PRINTQ(op->c.q4[1]) : "_",
+                   op->c.q4[2].den ? PRINTQ(op->c.q4[2]) : "_",
+                   op->c.q4[3].den ? PRINTQ(op->c.q4[3]) : "_");
+            break;
+        case SWS_OP_SWIZZLE:
+            av_log(log, lev, "%-20s: %d%d%d%d\n", "SWS_OP_SWIZZLE",
+                   op->swizzle.x, op->swizzle.y, op->swizzle.z, op->swizzle.w);
+            break;
+        case SWS_OP_CONVERT:
+            av_log(log, lev, "%-20s: %s -> %s%s\n", "SWS_OP_CONVERT",
+                   ff_sws_pixel_type_name(op->type),
+                   ff_sws_pixel_type_name(op->convert.to),
+                   op->convert.expand ? " (expand)" : "");
+            break;
+        case SWS_OP_DITHER:
+            av_log(log, lev, "%-20s: %dx%d matrix\n", "SWS_OP_DITHER",
+                    1 << op->dither.size_log2, 1 << op->dither.size_log2);
+            break;
+        case SWS_OP_MIN:
+            av_log(log, lev, "%-20s: x <= {%s %s %s %s}\n", "SWS_OP_MIN",
+                    op->c.q4[0].den ? PRINTQ(op->c.q4[0]) : "_",
+                    op->c.q4[1].den ? PRINTQ(op->c.q4[1]) : "_",
+                    op->c.q4[2].den ? PRINTQ(op->c.q4[2]) : "_",
+                    op->c.q4[3].den ? PRINTQ(op->c.q4[3]) : "_");
+            break;
+        case SWS_OP_MAX:
+            av_log(log, lev, "%-20s: {%s %s %s %s} <= x\n", "SWS_OP_MAX",
+                    op->c.q4[0].den ? PRINTQ(op->c.q4[0]) : "_",
+                    op->c.q4[1].den ? PRINTQ(op->c.q4[1]) : "_",
+                    op->c.q4[2].den ? PRINTQ(op->c.q4[2]) : "_",
+                    op->c.q4[3].den ? PRINTQ(op->c.q4[3]) : "_");
+            break;
+        case SWS_OP_LINEAR:
+            av_log(log, lev, "%-20s: %s [[%s %s %s %s %s] "
+                                        "[%s %s %s %s %s] "
+                                        "[%s %s %s %s %s] "
+                                        "[%s %s %s %s %s]]\n",
+                   "SWS_OP_LINEAR", describe_lin_mask(op->lin.mask),
+                   PRINTQ(op->lin.m[0][0]), PRINTQ(op->lin.m[0][1]), PRINTQ(op->lin.m[0][2]), PRINTQ(op->lin.m[0][3]), PRINTQ(op->lin.m[0][4]),
+                   PRINTQ(op->lin.m[1][0]), PRINTQ(op->lin.m[1][1]), PRINTQ(op->lin.m[1][2]), PRINTQ(op->lin.m[1][3]), PRINTQ(op->lin.m[1][4]),
+                   PRINTQ(op->lin.m[2][0]), PRINTQ(op->lin.m[2][1]), PRINTQ(op->lin.m[2][2]), PRINTQ(op->lin.m[2][3]), PRINTQ(op->lin.m[2][4]),
+                   PRINTQ(op->lin.m[3][0]), PRINTQ(op->lin.m[3][1]), PRINTQ(op->lin.m[3][2]), PRINTQ(op->lin.m[3][3]), PRINTQ(op->lin.m[3][4]));
+            break;
+        case SWS_OP_SCALE:
+            av_log(log, lev, "%-20s: * %s\n", "SWS_OP_SCALE",
+                   PRINTQ(op->c.q));
+            break;
+        case SWS_OP_TYPE_NB:
+            break;
+        }
+
+        if (op->comps.min[0].den || op->comps.min[1].den ||
+            op->comps.min[2].den || op->comps.min[3].den ||
+            op->comps.max[0].den || op->comps.max[1].den ||
+            op->comps.max[2].den || op->comps.max[3].den)
+        {
+            av_log(log, AV_LOG_TRACE, "    min: {%s, %s, %s, %s}, max: {%s, %s, %s, %s}\n",
+                PRINTQ(op->comps.min[0]), PRINTQ(op->comps.min[1]),
+                PRINTQ(op->comps.min[2]), PRINTQ(op->comps.min[3]),
+                PRINTQ(op->comps.max[0]), PRINTQ(op->comps.max[1]),
+                PRINTQ(op->comps.max[2]), PRINTQ(op->comps.max[3]));
+        }
+
+    }
+
+    av_log(log, lev, "    (X = unused, + = exact, 0 = zero)\n");
+}
diff --git a/libswscale/ops.h b/libswscale/ops.h
new file mode 100644
index 0000000000..85462ae337
--- /dev/null
+++ b/libswscale/ops.h
@@ -0,0 +1,240 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_H
+#define SWSCALE_OPS_H
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdalign.h>
+
+#include "graph.h"
+
+typedef enum SwsPixelType {
+    SWS_PIXEL_NONE = 0,
+    SWS_PIXEL_U8,
+    SWS_PIXEL_U16,
+    SWS_PIXEL_U32,
+    SWS_PIXEL_F32,
+    SWS_PIXEL_TYPE_NB
+} SwsPixelType;
+
+const char *ff_sws_pixel_type_name(SwsPixelType type);
+int ff_sws_pixel_type_size(SwsPixelType type) av_const;
+bool ff_sws_pixel_type_is_int(SwsPixelType type) av_const;
+SwsPixelType ff_sws_pixel_type_to_uint(SwsPixelType type) av_const;
+
+typedef enum SwsOpType {
+    SWS_OP_INVALID = 0,
+
+    /* Input/output handling */
+    SWS_OP_READ,            /* gather raw pixels from planes */
+    SWS_OP_WRITE,           /* write raw pixels to planes */
+    SWS_OP_SWAP_BYTES,      /* swap byte order (for differing endianness) */
+    SWS_OP_UNPACK,          /* split tightly packed data into components */
+    SWS_OP_PACK,            /* compress components into tightly packed data */
+
+    /* Pixel manipulation */
+    SWS_OP_CLEAR,           /* clear pixel values */
+    SWS_OP_LSHIFT,          /* logical left shift of raw pixel values by (u8) */
+    SWS_OP_RSHIFT,          /* right shift of raw pixel values by (u8) */
+    SWS_OP_SWIZZLE,         /* rearrange channel order, or duplicate channels */
+    SWS_OP_CONVERT,         /* convert (cast) between formats */
+    SWS_OP_DITHER,          /* add dithering noise */
+
+    /* Arithmetic operations */
+    SWS_OP_LINEAR,          /* generalized linear affine transform */
+    SWS_OP_SCALE,           /* multiplication by scalar (q) */
+    SWS_OP_MIN,             /* numeric minimum (q4) */
+    SWS_OP_MAX,             /* numeric maximum (q4) */
+
+    SWS_OP_TYPE_NB,
+} SwsOpType;
+
+enum SwsCompFlags {
+    SWS_COMP_GARBAGE = 1 << 0, /* contents are undefined / garbage data */
+    SWS_COMP_EXACT   = 1 << 1, /* value is an in-range, exact, integer */
+    SWS_COMP_ZERO    = 1 << 2, /* known to be a constant zero */
+};
+
+typedef union SwsConst {
+    /* Generic constant value */
+    AVRational q;
+    AVRational q4[4];
+    unsigned u;
+} SwsConst;
+
+typedef struct SwsComps {
+    unsigned flags[4]; /* knowledge about (output) component contents */
+    bool unused[4];    /* which input components are definitely unused */
+
+    /* Keeps track of the known possible value range, or {0, 0} for undefined
+     * or (unknown range) floating point inputs */
+    AVRational min[4], max[4];
+} SwsComps;
+
+typedef struct SwsReadWriteOp {
+    int elems;   /* number of elements (of type `op.type`) to read/write */
+    bool packed; /* read multiple elements from a single plane */
+    int frac;    /* fractional pixel step factor (log2) */
+
+    /** Examples:
+     *    rgba      = 4x u8 packed
+     *    yuv444p   = 3x u8
+     *    rgb565    = 1x u16   <- use SWS_OP_UNPACK to unpack
+     *    monow     = 1x u8 (frac 3)
+     *    rgb4      = 1x u8 (frac 1)
+     */
+} SwsReadWriteOp;
+
+typedef struct SwsPackOp {
+    int pattern[4]; /* bit depth pattern, from MSB to LSB */
+} SwsPackOp;
+
+typedef struct SwsSwizzleOp {
+    /**
+     * Input component for each output component:
+     *   Out[x] := In[swizzle.in[x]]
+     */
+    union {
+        uint32_t mask;
+        uint8_t in[4];
+        struct { uint8_t x, y, z, w; };
+    };
+} SwsSwizzleOp;
+
+#define SWS_SWIZZLE(X,Y,Z,W) ((SwsSwizzleOp) { .in = {X, Y, Z, W} })
+
+typedef struct SwsConvertOp {
+    SwsPixelType to; /* type of pixel to convert to */
+    bool expand; /* if true, integers are expanded to the full range */
+} SwsConvertOp;
+
+typedef struct SwsDitherOp {
+    AVRational *matrix; /* tightly packed dither matrix (refstruct) */
+    int size_log2; /* size (in bits) of the dither matrix */
+} SwsDitherOp;
+
+typedef struct SwsLinearOp {
+    /**
+     * Generalized 5x5 affine transformation:
+     *   [ Out.x ] = [ A B C D E ]
+     *   [ Out.y ] = [ F G H I J ] * [ x y z w 1 ]
+     *   [ Out.z ] = [ K L M N O ]
+     *   [ Out.w ] = [ P Q R S T ]
+     *
+     * The mask keeps track of which components differ from an identity matrix.
+     * There may be more efficient implementations of particular subsets, for
+     * example the common subset of {A, E, G, J, M, O} can be implemented with
+     * just three fused multiply-add operations.
+     */
+    AVRational m[4][5];
+    uint32_t mask; /* m[i][j] <-> 1 << (5 * i + j) */
+} SwsLinearOp;
+
+#define SWS_MASK(I, J)  (1 << (5 * (I) + (J)))
+#define SWS_MASK_OFF(I) SWS_MASK(I, 4)
+#define SWS_MASK_ROW(I) (0b11111 << (5 * (I)))
+#define SWS_MASK_COL(J) (0b1000010000100001 << (J))
+
+enum {
+    SWS_MASK_ALL   = (1 << 20) - 1,
+    SWS_MASK_LUMA  = SWS_MASK(0, 0) | SWS_MASK_OFF(0),
+    SWS_MASK_ALPHA = SWS_MASK(3, 3) | SWS_MASK_OFF(3),
+
+    SWS_MASK_DIAG3 = SWS_MASK(0, 0)  | SWS_MASK(1, 1)  | SWS_MASK(2, 2),
+    SWS_MASK_OFF3  = SWS_MASK_OFF(0) | SWS_MASK_OFF(1) | SWS_MASK_OFF(2),
+    SWS_MASK_MAT3  = SWS_MASK(0, 0)  | SWS_MASK(0, 1)  | SWS_MASK(0, 2) |
+                     SWS_MASK(1, 0)  | SWS_MASK(1, 1)  | SWS_MASK(1, 2) |
+                     SWS_MASK(2, 0)  | SWS_MASK(2, 1)  | SWS_MASK(2, 2),
+
+    SWS_MASK_DIAG4 = SWS_MASK_DIAG3  | SWS_MASK(3, 3),
+    SWS_MASK_OFF4  = SWS_MASK_OFF3   | SWS_MASK_OFF(3),
+    SWS_MASK_MAT4  = SWS_MASK_ALL & ~SWS_MASK_OFF4,
+};
+
+/* Helper function to compute the correct mask */
+uint32_t ff_sws_linear_mask(SwsLinearOp);
+
+typedef struct SwsOp {
+    SwsOpType op;      /* operation to perform */
+    SwsPixelType type; /* pixel type to operate on */
+    union {
+        SwsReadWriteOp  rw;
+        SwsPackOp       pack;
+        SwsSwizzleOp    swizzle;
+        SwsConvertOp    convert;
+        SwsDitherOp     dither;
+        SwsLinearOp     lin;
+        SwsConst        c;
+    };
+
+    /* For internal use inside ff_sws_*() functions */
+    SwsComps comps;
+} SwsOp;
+
+/**
+ * Frees any allocations associated with an SwsOp and sets it to {0}.
+ */
+void ff_sws_op_uninit(SwsOp *op);
+
+/**
+ * Apply an operation to an AVRational. No-op for read/write operations.
+ */
+void ff_sws_apply_op_q(const SwsOp *op, AVRational x[4]);
+
+/**
+ * Helper struct for representing a list of operations.
+ */
+typedef struct SwsOpList {
+    SwsOp *ops;
+    int num_ops;
+
+    /* Purely informative metadata associated with this operation list */
+    SwsFormat src, dst;
+} SwsOpList;
+
+SwsOpList *ff_sws_op_list_alloc(void);
+void ff_sws_op_list_free(SwsOpList **ops);
+
+/**
+ * Returns a duplicate of `ops`, or NULL on OOM.
+ */
+SwsOpList *ff_sws_op_list_duplicate(const SwsOpList *ops);
+
+/**
+ * Returns the size of the largest pixel type used in `ops`.
+ */
+int ff_sws_op_list_max_size(const SwsOpList *ops);
+
+/**
+ * These will take over ownership of `op` and set it to {0}, even on failure.
+ */
+int ff_sws_op_list_append(SwsOpList *ops, SwsOp *op);
+int ff_sws_op_list_insert_at(SwsOpList *ops, int index, SwsOp *op);
+
+void ff_sws_op_list_remove_at(SwsOpList *ops, int index, int count);
+
+/**
+ * Print out the contents of an operation list.
+ */
+void ff_sws_op_list_print(void *log_ctx, int log_level, const SwsOpList *ops);
+
+#endif
-- 
2.49.0


* [FFmpeg-devel] [PATCH v2 07/17] swscale/optimizer: add high-level ops optimizer
From: Niklas Haas @ 2025-05-21 12:43 UTC
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This is responsible for taking a "naive" ops list and optimizing it as much
as possible. It also includes a small analyzer that generates the component
metadata used by the optimizer.
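
As a sketch of the intended use (error handling omitted), a caller simply
runs the optimizer over a freshly constructed list:

    ff_sws_op_list_optimize(ops); /* rewrites `ops` in place */

which fuses compatible operations and drops redundant ones, so that e.g. a
convert -> swizzle -> convert sequence (see the v2 changelog) can collapse
into a single swizzle.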
---
 libswscale/Makefile        |   1 +
 libswscale/ops.h           |  12 +
 libswscale/ops_optimizer.c | 783 +++++++++++++++++++++++++++++++++++++
 3 files changed, 796 insertions(+)
 create mode 100644 libswscale/ops_optimizer.c

diff --git a/libswscale/Makefile b/libswscale/Makefile
index e0beef4e69..810c9dee78 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -16,6 +16,7 @@ OBJS = alphablend.o                                     \
        input.o                                          \
        lut3d.o                                          \
        ops.o                                            \
+       ops_optimizer.o                                  \
        options.o                                        \
        output.o                                         \
        rgb2rgb.o                                        \
diff --git a/libswscale/ops.h b/libswscale/ops.h
index 85462ae337..ae65d578b3 100644
--- a/libswscale/ops.h
+++ b/libswscale/ops.h
@@ -237,4 +237,16 @@ void ff_sws_op_list_remove_at(SwsOpList *ops, int index, int count);
  */
 void ff_sws_op_list_print(void *log_ctx, int log_level, const SwsOpList *ops);
 
+/**
+ * Infer + propagate known information about components. Called automatically
+ * when needed by the optimizer and compiler.
+ */
+void ff_sws_op_list_update_comps(SwsOpList *ops);
+
+/**
+ * Fuse compatible and eliminate redundant operations, as well as replacing
+ * some operations with more efficient alternatives.
+ */
+int ff_sws_op_list_optimize(SwsOpList *ops);
+
 #endif
diff --git a/libswscale/ops_optimizer.c b/libswscale/ops_optimizer.c
new file mode 100644
index 0000000000..d503bf7bf3
--- /dev/null
+++ b/libswscale/ops_optimizer.c
@@ -0,0 +1,783 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/rational.h"
+
+#include "ops.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+
+#define RET(x)                                                                 \
+    do {                                                                       \
+        if ((ret = (x)) < 0)                                                   \
+            return ret;                                                        \
+    } while (0)
+
+/* Returns true for operations that are independent per channel. These can
+ * usually be commuted freely with other such operations. */
+static bool op_type_is_independent(SwsOpType op)
+{
+    switch (op) {
+    case SWS_OP_SWAP_BYTES:
+    case SWS_OP_LSHIFT:
+    case SWS_OP_RSHIFT:
+    case SWS_OP_CONVERT:
+    case SWS_OP_DITHER:
+    case SWS_OP_MIN:
+    case SWS_OP_MAX:
+    case SWS_OP_SCALE:
+        return true;
+    case SWS_OP_INVALID:
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+    case SWS_OP_SWIZZLE:
+    case SWS_OP_CLEAR:
+    case SWS_OP_LINEAR:
+    case SWS_OP_PACK:
+    case SWS_OP_UNPACK:
+        return false;
+    case SWS_OP_TYPE_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid operation type!");
+    return false;
+}
+
+static AVRational expand_factor(SwsPixelType from, SwsPixelType to)
+{
+    const int src = ff_sws_pixel_type_size(from);
+    const int dst = ff_sws_pixel_type_size(to);
+    int scale = 0;
+    for (int i = 0; i < dst / src; i++)
+        scale = scale << src * 8 | 1;
+    return Q(scale);
+}
+
+/* merge_comp_flags() forms a monoid with flags_identity as the null element */
+static const unsigned flags_identity = SWS_COMP_ZERO | SWS_COMP_EXACT;
+static unsigned merge_comp_flags(unsigned a, unsigned b)
+{
+    const unsigned flags_or  = SWS_COMP_GARBAGE;
+    const unsigned flags_and = SWS_COMP_ZERO | SWS_COMP_EXACT;
+    return ((a & b) & flags_and) | ((a | b) & flags_or);
+}
+
+/* Infer + propagate known information about components */
+void ff_sws_op_list_update_comps(SwsOpList *ops)
+{
+    SwsComps next = { .unused = {true, true, true, true} };
+    SwsComps prev = { .flags = {
+        SWS_COMP_GARBAGE, SWS_COMP_GARBAGE, SWS_COMP_GARBAGE, SWS_COMP_GARBAGE,
+    }};
+
+    /* Forwards pass, propagates knowledge about the incoming pixel values */
+    for (int n = 0; n < ops->num_ops; n++) {
+        SwsOp *op = &ops->ops[n];
+
+        /* Prefill min/max values automatically; may have to be fixed in
+         * special cases */
+        memcpy(op->comps.min, prev.min, sizeof(prev.min));
+        memcpy(op->comps.max, prev.max, sizeof(prev.max));
+
+        if (op->op != SWS_OP_SWAP_BYTES) {
+            ff_sws_apply_op_q(op, op->comps.min);
+            ff_sws_apply_op_q(op, op->comps.max);
+        }
+
+        switch (op->op) {
+        case SWS_OP_READ:
+            for (int i = 0; i < op->rw.elems; i++) {
+                if (ff_sws_pixel_type_is_int(op->type)) {
+                    int bits = 8 * ff_sws_pixel_type_size(op->type);
+                    if (!op->rw.packed && ops->src.desc) {
+                        /* Use legal value range from pixdesc if available;
+                         * we don't need to do this for packed formats because
+                         * non-byte-aligned packed formats will necessarily go
+                         * through SWS_OP_UNPACK anyway */
+                        for (int c = 0; c < 4; c++) {
+                            if (ops->src.desc->comp[c].plane == i) {
+                                bits = ops->src.desc->comp[c].depth;
+                                break;
+                            }
+                        }
+                    }
+
+                    op->comps.flags[i] = SWS_COMP_EXACT;
+                    op->comps.min[i] = Q(0);
+                    op->comps.max[i] = Q((1ULL << bits) - 1);
+                }
+            }
+            for (int i = op->rw.elems; i < 4; i++)
+                op->comps.flags[i] = prev.flags[i];
+            break;
+        case SWS_OP_WRITE:
+            for (int i = 0; i < op->rw.elems; i++)
+                av_assert1(!(prev.flags[i] & SWS_COMP_GARBAGE));
+            /* fall through */
+        case SWS_OP_SWAP_BYTES:
+        case SWS_OP_LSHIFT:
+        case SWS_OP_RSHIFT:
+        case SWS_OP_MIN:
+        case SWS_OP_MAX:
+            /* Linearly propagate flags per component */
+            for (int i = 0; i < 4; i++)
+                op->comps.flags[i] = prev.flags[i];
+            break;
+        case SWS_OP_DITHER:
+            /* Strip zero flag because of the nonzero dithering offset */
+            for (int i = 0; i < 4; i++)
+                op->comps.flags[i] = prev.flags[i] & ~SWS_COMP_ZERO;
+            break;
+        case SWS_OP_UNPACK:
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    op->comps.flags[i] = prev.flags[0];
+                else
+                    op->comps.flags[i] = SWS_COMP_GARBAGE;
+            }
+            break;
+        case SWS_OP_PACK: {
+            unsigned flags = flags_identity;
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    flags = merge_comp_flags(flags, prev.flags[i]);
+                if (i > 0) /* clear remaining comps for sanity */
+                    op->comps.flags[i] = SWS_COMP_GARBAGE;
+            }
+            op->comps.flags[0] = flags;
+            break;
+        }
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (op->c.q4[i].den) {
+                    if (op->c.q4[i].num == 0) {
+                        op->comps.flags[i] = SWS_COMP_ZERO | SWS_COMP_EXACT;
+                    } else if (op->c.q4[i].den == 1) {
+                        op->comps.flags[i] = SWS_COMP_EXACT;
+                    }
+                } else {
+                    op->comps.flags[i] = prev.flags[i];
+                }
+            }
+            break;
+        case SWS_OP_SWIZZLE:
+            for (int i = 0; i < 4; i++)
+                op->comps.flags[i] = prev.flags[op->swizzle.in[i]];
+            break;
+        case SWS_OP_CONVERT:
+            for (int i = 0; i < 4; i++) {
+                op->comps.flags[i] = prev.flags[i];
+                if (ff_sws_pixel_type_is_int(op->convert.to))
+                    op->comps.flags[i] |= SWS_COMP_EXACT;
+            }
+            break;
+        case SWS_OP_LINEAR:
+            for (int i = 0; i < 4; i++) {
+                unsigned flags = flags_identity;
+                AVRational min = Q(0), max = Q(0);
+                for (int j = 0; j < 4; j++) {
+                    const AVRational k = op->lin.m[i][j];
+                    AVRational mink = av_mul_q(prev.min[j], k);
+                    AVRational maxk = av_mul_q(prev.max[j], k);
+                    if (k.num) {
+                        flags = merge_comp_flags(flags, prev.flags[j]);
+                        if (k.den != 1) /* fractional coefficient */
+                            flags &= ~SWS_COMP_EXACT;
+                        if (k.num < 0)
+                            FFSWAP(AVRational, mink, maxk);
+                        min = av_add_q(min, mink);
+                        max = av_add_q(max, maxk);
+                    }
+                }
+                if (op->lin.m[i][4].num) { /* nonzero offset */
+                    flags &= ~SWS_COMP_ZERO;
+                    if (op->lin.m[i][4].den != 1) /* fractional offset */
+                        flags &= ~SWS_COMP_EXACT;
+                    min = av_add_q(min, op->lin.m[i][4]);
+                    max = av_add_q(max, op->lin.m[i][4]);
+                }
+                op->comps.flags[i] = flags;
+                op->comps.min[i] = min;
+                op->comps.max[i] = max;
+            }
+            break;
+        case SWS_OP_SCALE:
+            for (int i = 0; i < 4; i++) {
+                op->comps.flags[i] = prev.flags[i];
+                if (op->c.q.den != 1) /* fractional scale */
+                    op->comps.flags[i] &= ~SWS_COMP_EXACT;
+                if (op->c.q.num < 0)
+                    FFSWAP(AVRational, op->comps.min[i], op->comps.max[i]);
+            }
+            break;
+
+        case SWS_OP_INVALID:
+        case SWS_OP_TYPE_NB:
+            av_assert0(!"Invalid operation type!");
+        }
+
+        prev = op->comps;
+    }
+
+    /* Backwards pass, solves for component dependencies */
+    for (int n = ops->num_ops - 1; n >= 0; n--) {
+        SwsOp *op = &ops->ops[n];
+
+        switch (op->op) {
+        case SWS_OP_READ:
+        case SWS_OP_WRITE:
+            for (int i = 0; i < op->rw.elems; i++)
+                op->comps.unused[i] = op->op == SWS_OP_READ;
+            for (int i = op->rw.elems; i < 4; i++)
+                op->comps.unused[i] = next.unused[i];
+            break;
+        case SWS_OP_SWAP_BYTES:
+        case SWS_OP_LSHIFT:
+        case SWS_OP_RSHIFT:
+        case SWS_OP_CONVERT:
+        case SWS_OP_DITHER:
+        case SWS_OP_MIN:
+        case SWS_OP_MAX:
+        case SWS_OP_SCALE:
+            for (int i = 0; i < 4; i++)
+                op->comps.unused[i] = next.unused[i];
+            break;
+        case SWS_OP_UNPACK: {
+            bool unused = true;
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    unused &= next.unused[i];
+                op->comps.unused[i] = i > 0;
+            }
+            op->comps.unused[0] = unused;
+            break;
+        }
+        case SWS_OP_PACK:
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    op->comps.unused[i] = next.unused[0];
+                else
+                    op->comps.unused[i] = true;
+            }
+            break;
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (op->c.q4[i].den)
+                    op->comps.unused[i] = true;
+                else
+                    op->comps.unused[i] = next.unused[i];
+            }
+            break;
+        case SWS_OP_SWIZZLE: {
+            bool unused[4] = { true, true, true, true };
+            for (int i = 0; i < 4; i++)
+                unused[op->swizzle.in[i]] &= next.unused[i];
+            for (int i = 0; i < 4; i++)
+                op->comps.unused[i] = unused[i];
+            break;
+        }
+        case SWS_OP_LINEAR:
+            for (int j = 0; j < 4; j++) {
+                bool unused = true;
+                for (int i = 0; i < 4; i++) {
+                    if (op->lin.m[i][j].num)
+                        unused &= next.unused[i];
+                }
+                op->comps.unused[j] = unused;
+            }
+            break;
+        }
+
+        next = op->comps;
+    }
+}
+
+/* Returns log2(x) if x is a power of two, and 0 otherwise */
+static int exact_log2(const int x)
+{
+    int p;
+    if (x <= 0)
+        return 0;
+    p = av_log2(x);
+    return (1 << p) == x ? p : 0;
+}
+
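+/* e.g. exact_log2_q((AVRational) {4, 1}) == 2,
+ *      exact_log2_q((AVRational) {1, 8}) == -3,
+ *      exact_log2_q((AVRational) {3, 2}) == 0 (not a power of two) */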
+static int exact_log2_q(const AVRational x)
+{
+    if (x.den == 1)
+        return exact_log2(x.num);
+    else if (x.num == 1)
+        return -exact_log2(x.den);
+    else
+        return 0;
+}
+
+/**
+ * If a linear operation can be reduced to a multiplication by a scalar
+ * constant, returns true and writes the scaling factor to `out_scale`;
+ * returns false otherwise.
+ */
+static bool extract_scalar(const SwsLinearOp *c, SwsComps prev, SwsComps next,
+                           SwsConst *out_scale)
+{
+    SwsConst scale = {0};
+
+    /* There are components not on the main diagonal */
+    if (c->mask & ~SWS_MASK_DIAG4)
+        return false;
+
+    for (int i = 0; i < 4; i++) {
+        const AVRational s = c->m[i][i];
+        if ((prev.flags[i] & SWS_COMP_ZERO) || next.unused[i])
+            continue;
+        if (scale.q.den && av_cmp_q(s, scale.q))
+            return false;
+        scale.q = s;
+    }
+
+    if (scale.q.den)
+        *out_scale = scale;
+    return scale.q.den;
+}
+
+/* Extracts an integer clear operation (subset) from the given linear op. */
+static bool extract_constant_rows(SwsLinearOp *c, SwsComps prev,
+                                  SwsConst *out_clear)
+{
+    SwsConst clear = {0};
+    bool ret = false;
+
+    for (int i = 0; i < 4; i++) {
+        bool const_row = c->m[i][4].den == 1; /* offset is integer */
+        for (int j = 0; j < 4; j++) {
+            const_row &= c->m[i][j].num == 0 || /* scalar is zero */
+                         (prev.flags[j] & SWS_COMP_ZERO); /* input is zero */
+        }
+        if (const_row && (c->mask & SWS_MASK_ROW(i))) {
+            clear.q4[i] = c->m[i][4];
+            for (int j = 0; j < 5; j++)
+                c->m[i][j] = Q(i == j);
+            c->mask &= ~SWS_MASK_ROW(i);
+            ret = true;
+        }
+    }
+
+    if (ret)
+        *out_clear = clear;
+    return ret;
+}
+
+/* Unswizzle a linear operation by aligning single-input rows with
+ * their corresponding diagonal */
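+/* e.g. if row 0's only contributing input is component 2 (m[0][2] != 0),
+ * the coefficient is moved to the diagonal (m[0][0]) and the inputs are
+ * remapped by a preceding swizzle with in[0] = 2 */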
+static bool extract_swizzle(SwsLinearOp *op, SwsComps prev, SwsSwizzleOp *out_swiz)
+{
+    SwsSwizzleOp swiz = SWS_SWIZZLE(0, 1, 2, 3);
+    SwsLinearOp c = *op;
+
+    for (int i = 0; i < 4; i++) {
+        int idx = -1;
+        for (int j = 0; j < 4; j++) {
+            if (!c.m[i][j].num || (prev.flags[j] & SWS_COMP_ZERO))
+                continue;
+            if (idx >= 0)
+                return false; /* multiple inputs */
+            idx = j;
+        }
+
+        if (idx >= 0 && idx != i) {
+            /* Move coefficient to the diagonal */
+            c.m[i][i] = c.m[i][idx];
+            c.m[i][idx] = Q(0);
+            swiz.in[i] = idx;
+        }
+    }
+
+    if (swiz.mask == SWS_SWIZZLE(0, 1, 2, 3).mask)
+        return false; /* no swizzle was identified */
+
+    c.mask = ff_sws_linear_mask(c);
+    *out_swiz = swiz;
+    *op = c;
+    return true;
+}
+
+int ff_sws_op_list_optimize(SwsOpList *ops)
+{
+    int ret;
+
+retry:
+    ff_sws_op_list_update_comps(ops);
+
+    for (int n = 0; n < ops->num_ops;) {
+        SwsOp dummy = {0};
+        SwsOp *op = &ops->ops[n];
+        SwsOp *prev = n ? &ops->ops[n - 1] : &dummy;
+        SwsOp *next = n + 1 < ops->num_ops ? &ops->ops[n + 1] : &dummy;
+
+        /* common helper variable */
+        bool noop = true;
+
+        switch (op->op) {
+        case SWS_OP_READ:
+            /* Optimized further into refcopy / memcpy */
+            if (next->op == SWS_OP_WRITE &&
+                next->rw.elems == op->rw.elems &&
+                next->rw.packed == op->rw.packed &&
+                next->rw.frac == op->rw.frac)
+            {
+                ff_sws_op_list_remove_at(ops, n, 2);
+                av_assert1(ops->num_ops == 0);
+                return 0;
+            }
+
+            /* Skip reading extra unneeded components */
+            if (!op->rw.packed) {
+                int needed = op->rw.elems;
+                while (needed > 0 && next->comps.unused[needed - 1])
+                    needed--;
+                if (op->rw.elems != needed) {
+                    op->rw.elems = needed;
+                    op->rw.packed &= op->rw.elems > 1;
+                    goto retry;
+                }
+            }
+            break;
+
+        case SWS_OP_SWAP_BYTES:
+            /* Redundant (double) swap */
+            if (next->op == SWS_OP_SWAP_BYTES) {
+                ff_sws_op_list_remove_at(ops, n, 2);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_UNPACK:
+            /* Redundant unpack+pack */
+            if (next->op == SWS_OP_PACK && next->type == op->type &&
+                next->pack.pattern[0] == op->pack.pattern[0] &&
+                next->pack.pattern[1] == op->pack.pattern[1] &&
+                next->pack.pattern[2] == op->pack.pattern[2] &&
+                next->pack.pattern[3] == op->pack.pattern[3])
+            {
+                ff_sws_op_list_remove_at(ops, n, 2);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_LSHIFT:
+        case SWS_OP_RSHIFT:
+            /* Two shifts in the same direction */
+            if (next->op == op->op) {
+                op->c.u += next->c.u;
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* No-op shift */
+            if (!op->c.u) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (!op->c.q4[i].den)
+                    continue;
+
+                if ((prev->comps.flags[i] & SWS_COMP_ZERO) &&
+                    !(prev->comps.flags[i] & SWS_COMP_GARBAGE) &&
+                    op->c.q4[i].num == 0)
+                {
+                    /* Redundant clear-to-zero of zero component */
+                    op->c.q4[i].den = 0;
+                } else if (next->comps.unused[i]) {
+                    /* Unnecessary clear of unused component */
+                    op->c.q4[i] = (AVRational) {0, 0};
+                } else if (op->c.q4[i].den) {
+                    noop = false;
+                }
+            }
+
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            /* Transitive clear */
+            if (next->op == SWS_OP_CLEAR) {
+                for (int i = 0; i < 4; i++) {
+                    if (next->c.q4[i].den)
+                        op->c.q4[i] = next->c.q4[i];
+                }
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* Prefer to clear as late as possible, to avoid doing
+             * redundant work */
+            if ((op_type_is_independent(next->op) && next->op != SWS_OP_SWAP_BYTES) ||
+                next->op == SWS_OP_SWIZZLE)
+            {
+                if (next->op == SWS_OP_CONVERT)
+                    op->type = next->convert.to;
+                ff_sws_apply_op_q(next, op->c.q4);
+                FFSWAP(SwsOp, *op, *next);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_SWIZZLE: {
+            bool seen[4] = {0};
+            bool has_duplicates = false;
+            for (int i = 0; i < 4; i++) {
+                if (next->comps.unused[i])
+                    continue;
+                if (op->swizzle.in[i] != i)
+                    noop = false;
+                has_duplicates |= seen[op->swizzle.in[i]];
+                seen[op->swizzle.in[i]] = true;
+            }
+
+            /* Identity swizzle */
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            /* Transitive swizzle */
+            if (next->op == SWS_OP_SWIZZLE) {
+                const SwsSwizzleOp orig = op->swizzle;
+                for (int i = 0; i < 4; i++)
+                    op->swizzle.in[i] = orig.in[next->swizzle.in[i]];
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* Try to push swizzles with duplicates towards the output */
+            if (has_duplicates && op_type_is_independent(next->op)) {
+                if (next->op == SWS_OP_CONVERT)
+                    op->type = next->convert.to;
+                if (next->op == SWS_OP_MIN || next->op == SWS_OP_MAX) {
+                    /* Un-swizzle the next operation */
+                    const SwsConst c = next->c;
+                    for (int i = 0; i < 4; i++) {
+                        if (!next->comps.unused[i])
+                            next->c.q4[op->swizzle.in[i]] = c.q4[i];
+                    }
+                }
+                FFSWAP(SwsOp, *op, *next);
+                goto retry;
+            }
+
+            /* Move swizzle out of the way between two converts so that
+             * they may be merged */
+            if (prev->op == SWS_OP_CONVERT && next->op == SWS_OP_CONVERT) {
+                op->type = next->convert.to;
+                FFSWAP(SwsOp, *op, *next);
+                goto retry;
+            }
+            break;
+        }
+
+        case SWS_OP_CONVERT:
+            /* No-op conversion */
+            if (op->type == op->convert.to) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            /* Transitive conversion */
+            if (next->op == SWS_OP_CONVERT &&
+                op->convert.expand == next->convert.expand)
+            {
+                av_assert1(op->convert.to == next->type);
+                op->convert.to = next->convert.to;
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* Conversion followed by integer expansion */
+            if (next->op == SWS_OP_SCALE &&
+                !av_cmp_q(next->c.q, expand_factor(op->type, op->convert.to)))
+            {
+                op->convert.expand = true;
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_MIN:
+            for (int i = 0; i < 4; i++) {
+                if (next->comps.unused[i] || !op->c.q4[i].den)
+                    continue;
+                if (av_cmp_q(op->c.q4[i], prev->comps.max[i]) < 0)
+                    noop = false;
+            }
+
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_MAX:
+            for (int i = 0; i < 4; i++) {
+                if (next->comps.unused[i] || !op->c.q4[i].den)
+                    continue;
+                if (av_cmp_q(prev->comps.min[i], op->c.q4[i]) < 0)
+                    noop = false;
+            }
+
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_DITHER:
+            for (int i = 0; i < 4; i++) {
+                noop &= (prev->comps.flags[i] & SWS_COMP_EXACT) ||
+                        next->comps.unused[i];
+            }
+
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_LINEAR: {
+            SwsSwizzleOp swizzle;
+            SwsConst c;
+
+            /* No-op (identity) linear operation */
+            if (!op->lin.mask) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            if (next->op == SWS_OP_LINEAR) {
+                /* 5x5 matrix multiplication after appending [ 0 0 0 0 1 ] */
+                const SwsLinearOp m1 = op->lin;
+                const SwsLinearOp m2 = next->lin;
+                for (int i = 0; i < 4; i++) {
+                    for (int j = 0; j < 5; j++) {
+                        AVRational sum = Q(0);
+                        for (int k = 0; k < 4; k++)
+                            sum = av_add_q(sum, av_mul_q(m2.m[i][k], m1.m[k][j]));
+                        if (j == 4) /* m1.m[4][j] == 1 */
+                            sum = av_add_q(sum, m2.m[i][4]);
+                        op->lin.m[i][j] = sum;
+                    }
+                }
+                op->lin.mask = ff_sws_linear_mask(op->lin);
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* Optimize away zero columns */
+            for (int j = 0; j < 4; j++) {
+                const uint32_t col = SWS_MASK_COL(j);
+                if (!(prev->comps.flags[j] & SWS_COMP_ZERO) || !(op->lin.mask & col))
+                    continue;
+                for (int i = 0; i < 4; i++)
+                    op->lin.m[i][j] = Q(i == j);
+                op->lin.mask &= ~col;
+                goto retry;
+            }
+
+            /* Optimize away unused rows */
+            for (int i = 0; i < 4; i++) {
+                const uint32_t row = SWS_MASK_ROW(i);
+                if (!next->comps.unused[i] || !(op->lin.mask & row))
+                    continue;
+                for (int j = 0; j < 5; j++)
+                    op->lin.m[i][j] = Q(i == j);
+                op->lin.mask &= ~row;
+                goto retry;
+            }
+
+            /* Convert constant rows to explicit clear instruction */
+            if (extract_constant_rows(&op->lin, prev->comps, &c)) {
+                RET(ff_sws_op_list_insert_at(ops, n + 1, &(SwsOp) {
+                    .op    = SWS_OP_CLEAR,
+                    .type  = op->type,
+                    .comps = op->comps,
+                    .c     = c,
+                }));
+                goto retry;
+            }
+
+            /* Multiplication by scalar constant */
+            if (extract_scalar(&op->lin, prev->comps, next->comps, &c)) {
+                op->op = SWS_OP_SCALE;
+                op->c  = c;
+                goto retry;
+            }
+
+            /* Swizzle by fixed pattern */
+            if (extract_swizzle(&op->lin, prev->comps, &swizzle)) {
+                RET(ff_sws_op_list_insert_at(ops, n, &(SwsOp) {
+                    .op      = SWS_OP_SWIZZLE,
+                    .type    = op->type,
+                    .swizzle = swizzle,
+                }));
+                goto retry;
+            }
+            break;
+        }
+
+        case SWS_OP_SCALE: {
+            const int factor2 = exact_log2_q(op->c.q);
+
+            /* No-op scaling */
+            if (op->c.q.num == 1 && op->c.q.den == 1) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            /* Scaling by integer before conversion to int */
+            if (op->c.q.den == 1 &&
+                next->op == SWS_OP_CONVERT &&
+                ff_sws_pixel_type_is_int(next->convert.to))
+            {
+                op->type = next->convert.to;
+                FFSWAP(SwsOp, *op, *next);
+                goto retry;
+            }
+
+            /* Scaling by exact power of two */
+            if (factor2 && ff_sws_pixel_type_is_int(op->type)) {
+                op->op = factor2 > 0 ? SWS_OP_LSHIFT : SWS_OP_RSHIFT;
+                op->c.u = FFABS(factor2);
+                goto retry;
+            }
+            break;
+        }
+        }
+
+        /* No optimization triggered, move on to next operation */
+        n++;
+    }
+
+    return 0;
+}
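+
+/* e.g. a LINEAR op whose matrix is diag(1/2, 1/2, 1/2, 1/2) is reduced by
+ * extract_scalar() to a single SCALE by 1/2, which exact_log2_q() then
+ * turns into a right-shift by 1 for integer pixel types */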
-- 
2.49.0

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [FFmpeg-devel] [PATCH v2 08/17] swscale/ops_internal: add internal ops backend API
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (6 preceding siblings ...)
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 07/17] swscale/optimizer: add high-level ops optimizer Niklas Haas
@ 2025-05-21 12:43 ` Niklas Haas
  2025-05-23 16:27   ` Michael Niedermayer
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 09/17] swscale/ops: add dispatch layer Niklas Haas
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:43 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This adds an internal API for ops backends, which are responsible for
compiling op lists into executable functions.
---
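As an illustration of the backend contract (not part of this patch; the
backend and function names are hypothetical sketches), a minimal backend
that declines every op list could look like:

    static int noop_compile(SwsContext *ctx, SwsOpList *ops,
                            SwsCompiledOp *out)
    {
        /* Hypothetical example: declining with ENOTSUP makes
         * ff_sws_ops_compile() fall through to the next backend in
         * ff_sws_op_backends[] */
        return AVERROR(ENOTSUP);
    }

    static const SwsOpBackend backend_noop = {
        .name    = "noop",
        .compile = noop_compile,
    };
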
 libswscale/ops.c          |  62 ++++++++++++++++++++++
 libswscale/ops_internal.h | 108 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)
 create mode 100644 libswscale/ops_internal.h

diff --git a/libswscale/ops.c b/libswscale/ops.c
index 004686147d..8491bd9cad 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -25,9 +25,22 @@
 #include "libavutil/refstruct.h"
 
 #include "ops.h"
+#include "ops_internal.h"
+
+const SwsOpBackend * const ff_sws_op_backends[] = {
+    NULL
+};
+
+const int ff_sws_num_op_backends = FF_ARRAY_ELEMS(ff_sws_op_backends) - 1;
 
 #define Q(N) ((AVRational) { N, 1 })
 
+#define RET(x)                                                                 \
+    do {                                                                       \
+        if ((ret = (x)) < 0)                                                   \
+            return ret;                                                        \
+    } while (0)
+
 const char *ff_sws_pixel_type_name(SwsPixelType type)
 {
     switch (type) {
@@ -520,3 +533,52 @@ void ff_sws_op_list_print(void *log, int lev, const SwsOpList *ops)
 
     av_log(log, lev, "    (X = unused, + = exact, 0 = zero)\n");
 }
+
+int ff_sws_ops_compile_backend(SwsContext *ctx, const SwsOpBackend *backend,
+                               const SwsOpList *ops, SwsCompiledOp *out)
+{
+    SwsOpList *copy, rest;
+    int ret = 0;
+
+    copy = ff_sws_op_list_duplicate(ops);
+    if (!copy)
+        return AVERROR(ENOMEM);
+
+    /* Ensure these are always set during compilation */
+    ff_sws_op_list_update_comps(copy);
+
+    /* Make an on-stack copy of `ops` to ensure we can still properly clean up
+     * the copy afterwards */
+    rest = *copy;
+
+    ret = backend->compile(ctx, &rest, out);
+    if (ret == AVERROR(ENOTSUP)) {
+        av_log(ctx, AV_LOG_DEBUG, "Backend '%s' does not support operations:\n", backend->name);
+        ff_sws_op_list_print(ctx, AV_LOG_DEBUG, &rest);
+    } else if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to compile operations: %s\n", av_err2str(ret));
+        ff_sws_op_list_print(ctx, AV_LOG_ERROR, &rest);
+    }
+
+    ff_sws_op_list_free(&copy);
+    return ret;
+}
+
+int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out)
+{
+    for (int n = 0; ff_sws_op_backends[n]; n++) {
+        const SwsOpBackend *backend = ff_sws_op_backends[n];
+        if (ff_sws_ops_compile_backend(ctx, backend, ops, out) < 0)
+            continue;
+
+        av_log(ctx, AV_LOG_VERBOSE, "Compiled using backend '%s': "
+               "block size = %d, over-read = %d, over-write = %d, cpu flags = 0x%x\n",
+               backend->name, out->block_size, out->over_read, out->over_write,
+               out->cpu_flags);
+        return 0;
+    }
+
+    av_log(ctx, AV_LOG_WARNING, "No backend found for operations:\n");
+    ff_sws_op_list_print(ctx, AV_LOG_WARNING, ops);
+    return AVERROR(ENOTSUP);
+}
diff --git a/libswscale/ops_internal.h b/libswscale/ops_internal.h
new file mode 100644
index 0000000000..9fd866430b
--- /dev/null
+++ b/libswscale/ops_internal.h
@@ -0,0 +1,108 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_INTERNAL_H
+#define SWSCALE_OPS_INTERNAL_H
+
+#include "libavutil/mem_internal.h"
+
+#include "ops.h"
+
+/**
+ * Global execution context for all compiled functions.
+ *
+ * Note: This struct is hard-coded in assembly, so do not change the layout
+ * without updating the corresponding assembly definitions.
+ */
+typedef struct SwsOpExec {
+    /* The data pointers point to the first pixel to process */
+    DECLARE_ALIGNED_32(const uint8_t, *in[4]);
+    DECLARE_ALIGNED_32(uint8_t, *out[4]);
+
+    /* Separation between lines in bytes */
+    DECLARE_ALIGNED_32(ptrdiff_t, in_stride[4]);
+    DECLARE_ALIGNED_32(ptrdiff_t, out_stride[4]);
+
+    /* Extra metadata, may or may not be useful */
+    int32_t width, height;      /* Overall image dimensions */
+    int32_t slice_y, slice_h;   /* Start and height of current slice */
+    int32_t pixel_bits_in;      /* Bits per input pixel */
+    int32_t pixel_bits_out;     /* Bits per output pixel */
+} SwsOpExec;
+
+static_assert(sizeof(SwsOpExec) == 16 * sizeof(void *) + 8 * sizeof(int32_t),
+              "SwsOpExec layout mismatch");
+
+/**
+ * Process a given range of pixel blocks.
+ *
+ * Note: `bx_start` and `bx_end` are in units of `SwsCompiledOp.block_size`.
+ */
+typedef void (*SwsOpFunc)(const SwsOpExec *exec, const void *priv,
+                          int bx_start, int y_start, int bx_end, int y_end);
+
+#define SWS_DECL_FUNC(NAME) \
+    void NAME(const SwsOpExec *, const void *, int, int, int, int)
+
+typedef struct SwsCompiledOp {
+    SwsOpFunc func;
+
+    int block_size; /* number of pixels processed per iteration */
+    int over_read;  /* implementation over-reads input by this many bytes */
+    int over_write; /* implementation over-writes output by this many bytes */
+    int cpu_flags;  /* active set of CPU flags (informative) */
+
+    /* Arbitrary private data */
+    void *priv;
+    void (*free)(void *priv);
+} SwsCompiledOp;
+
+typedef struct SwsOpBackend {
+    const char *name; /* Descriptive name for this backend */
+
+    /**
+     * Compile an operation list to an implementation chain. May modify `ops`
+     * freely; the original list will be freed automatically by the caller.
+     *
+     * Returns 0 or a negative error code.
+     */
+    int (*compile)(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out);
+} SwsOpBackend;
+
+/* List of all backends, terminated by NULL */
+extern const SwsOpBackend *const ff_sws_op_backends[];
+extern const int ff_sws_num_op_backends; /* excludes terminating NULL */
+
+/**
+ * Attempt to compile a list of operations using a specific backend.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_ops_compile_backend(SwsContext *ctx, const SwsOpBackend *backend,
+                               const SwsOpList *ops, SwsCompiledOp *out);
+
+/**
+ * Compile a list of operations using the best available backend.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out);
+
+#endif
-- 
2.49.0

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [FFmpeg-devel] [PATCH v2 09/17] swscale/ops: add dispatch layer
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (7 preceding siblings ...)
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 08/17] swscale/ops_internal: add internal ops backend API Niklas Haas
@ 2025-05-21 12:43 ` Niklas Haas
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 10/17] swscale/optimizer: add packed shuffle solver Niklas Haas
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:43 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This handles the low-level execution of an op list, and integration into
the SwsGraph infrastructure. To handle frames with insufficient padding in
the stride (or a width smaller than one block size), we use a fallback loop
that pads the last column of pixels using `memcpy` into an appropriately
sized buffer.
---
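As an illustration of the tail handling (the numbers are hypothetical):
with a 16-pixel block size and packed 24-bit pixels at width 100,

    num_blocks = (100 + 16 - 1) / 16 = 7
    safe_width = (7 - 1) * 16        = 96   /* pixels via the fast path */
    tail       = 100 - 96            = 4    /* pixels (12 bytes) per row,
                                               bounced through the padded
                                               temporary buffer */
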
 libswscale/ops.c | 256 +++++++++++++++++++++++++++++++++++++++++++++++
 libswscale/ops.h |  14 +++
 2 files changed, 270 insertions(+)

diff --git a/libswscale/ops.c b/libswscale/ops.c
index 8491bd9cad..d466f5e45c 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -582,3 +582,259 @@ int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out
     ff_sws_op_list_print(ctx, AV_LOG_WARNING, ops);
     return AVERROR(ENOTSUP);
 }
+
+typedef struct SwsOpPass {
+    SwsCompiledOp comp;
+    SwsOpExec exec_base;
+    int num_blocks;
+    int tail_off_in;
+    int tail_off_out;
+    int tail_size_in;
+    int tail_size_out;
+    bool memcpy_in;
+    bool memcpy_out;
+} SwsOpPass;
+
+static void op_pass_free(void *ptr)
+{
+    SwsOpPass *p = ptr;
+    if (!p)
+        return;
+
+    if (p->comp.free)
+        p->comp.free(p->comp.priv);
+
+    av_free(p);
+}
+
+static void op_pass_setup(const SwsImg *out, const SwsImg *in, const SwsPass *pass)
+{
+    const AVPixFmtDescriptor *indesc  = av_pix_fmt_desc_get(in->fmt);
+    const AVPixFmtDescriptor *outdesc = av_pix_fmt_desc_get(out->fmt);
+
+    SwsOpPass *p = pass->priv;
+    SwsOpExec *exec = &p->exec_base;
+    const SwsCompiledOp *comp = &p->comp;
+    const int block_size = comp->block_size;
+    p->num_blocks = (pass->width + block_size - 1) / block_size;
+
+    /* Set up main loop parameters */
+    const int aligned_w  = p->num_blocks * block_size;
+    const int safe_width = (p->num_blocks - 1) * block_size;
+    const int tail_size  = pass->width - safe_width;
+    p->tail_off_in   = safe_width * exec->pixel_bits_in  >> 3;
+    p->tail_off_out  = safe_width * exec->pixel_bits_out >> 3;
+    p->tail_size_in  = tail_size  * exec->pixel_bits_in  >> 3;
+    p->tail_size_out = tail_size  * exec->pixel_bits_out >> 3;
+    p->memcpy_in     = false;
+    p->memcpy_out    = false;
+
+    for (int i = 0; i < 4 && in->data[i]; i++) {
+        const int sub_x      = (i == 1 || i == 2) ? indesc->log2_chroma_w : 0;
+        const int plane_w    = (aligned_w + sub_x) >> sub_x;
+        const int plane_pad  = (comp->over_read + sub_x) >> sub_x;
+        const int plane_size = plane_w * exec->pixel_bits_in >> 3;
+        p->memcpy_in |= plane_size + plane_pad > in->linesize[i];
+        exec->in_stride[i] = in->linesize[i];
+    }
+
+    for (int i = 0; i < 4 && out->data[i]; i++) {
+        const int sub_x      = (i == 1 || i == 2) ? outdesc->log2_chroma_w : 0;
+        const int plane_w    = (aligned_w + sub_x) >> sub_x;
+        const int plane_pad  = (comp->over_write + sub_x) >> sub_x;
+        const int plane_size = plane_w * exec->pixel_bits_out >> 3;
+        p->memcpy_out |= plane_size + plane_pad > out->linesize[i];
+        exec->out_stride[i] = out->linesize[i];
+    }
+}
+
+/* Dispatch kernel over the last column of the image using memcpy */
+static av_always_inline void
+handle_tail(const SwsOpPass *p, SwsOpExec *exec,
+            const SwsImg *out_base, const bool copy_out,
+            const SwsImg *in_base, const bool copy_in,
+            int y, const int h)
+{
+    DECLARE_ALIGNED_64(uint8_t, tmp)[2][4][sizeof(uint32_t[128])];
+
+    const SwsCompiledOp *comp = &p->comp;
+    const int tail_size_in  = p->tail_size_in;
+    const int tail_size_out = p->tail_size_out;
+    const int bx = p->num_blocks - 1;
+
+    SwsImg in  = ff_sws_img_shift(in_base,  y);
+    SwsImg out = ff_sws_img_shift(out_base, y);
+    for (int i = 0; i < 4 && in.data[i]; i++) {
+        in.data[i]  += p->tail_off_in;
+        if (copy_in) {
+            exec->in[i] = (void *) tmp[0][i];
+            exec->in_stride[i] = sizeof(tmp[0][i]);
+        } else {
+            exec->in[i] = in.data[i];
+        }
+    }
+
+    for (int i = 0; i < 4 && out.data[i]; i++) {
+        out.data[i] += p->tail_off_out;
+        if (copy_out) {
+            exec->out[i] = (void *) tmp[1][i];
+            exec->out_stride[i] = sizeof(tmp[1][i]);
+        } else {
+            exec->out[i] = out.data[i];
+        }
+    }
+
+    for (int y_end = y + h; y < y_end; y++) {
+        if (copy_in) {
+            for (int i = 0; i < 4 && in.data[i]; i++) {
+                av_assert2(tmp[0][i] + tail_size_in < (uint8_t *) tmp[1]);
+                memcpy(tmp[0][i], in.data[i], tail_size_in);
+                in.data[i] += in.linesize[i];
+            }
+        }
+
+        comp->func(exec, comp->priv, bx, y, p->num_blocks, y + 1);
+
+        if (copy_out) {
+            for (int i = 0; i < 4 && out.data[i]; i++) {
+                av_assert2(tmp[1][i] + tail_size_out < (uint8_t *) tmp[2]);
+                memcpy(out.data[i], tmp[1][i], tail_size_out);
+                out.data[i] += out.linesize[i];
+            }
+        }
+
+        for (int i = 0; i < 4; i++) {
+            if (!copy_in)
+                exec->in[i] += in.linesize[i];
+            if (!copy_out)
+                exec->out[i] += out.linesize[i];
+        }
+    }
+}
+
+static void op_pass_run(const SwsImg *out_base, const SwsImg *in_base,
+                        const int y, const int h, const SwsPass *pass)
+{
+    const SwsOpPass *p = pass->priv;
+    const SwsCompiledOp *comp = &p->comp;
+
+    /* Fill exec metadata for this slice */
+    const SwsImg in  = ff_sws_img_shift(in_base,  y);
+    const SwsImg out = ff_sws_img_shift(out_base, y);
+    SwsOpExec exec = p->exec_base;
+    exec.slice_y = y;
+    exec.slice_h = h;
+    for (int i = 0; i < 4; i++) {
+        exec.in[i]  = in.data[i];
+        exec.out[i] = out.data[i];
+    }
+
+    /* To ensure safety, we need to consider the following:
+     *
+     * 1. We can overread the input, unless this is the last line of an
+     *    unpadded buffer. All defined operations can handle arbitrary pixel
+     *    input, so overreading arbitrary data is fine.
+     *
+     * 2. We can overwrite the output, as long as we don't write more than
+     *    the number of pixels that fit into one linesize. So we always need
+     *    to memcpy the last column on the output side if unpadded.
+     *
+     * 3. For the last row, we also need to memcpy the remainder of the
+     *    input, to avoid reading past the end of the buffer. Note that
+     *    since we know the run() function is called on stripes of the same
+     *    buffer, we don't need to worry about this for the end of a slice.
+     */
+
+    const int last_slice  = y + h == pass->height;
+    const bool memcpy_in  = last_slice && p->memcpy_in;
+    const bool memcpy_out = p->memcpy_out;
+    const int num_blocks  = p->num_blocks;
+    const int blocks_main = num_blocks - memcpy_out;
+    const int h_main      = h - memcpy_in;
+
+    /* Handle main section */
+    comp->func(&exec, comp->priv, 0, y, blocks_main, y + h_main);
+
+    if (memcpy_in) {
+        /* Safe part of last row */
+        for (int i = 0; i < 4; i++) {
+            exec.in[i]  += h_main * in.linesize[i];
+            exec.out[i] += h_main * out.linesize[i];
+        }
+        comp->func(&exec, comp->priv, 0, y + h_main, num_blocks - 1, y + h);
+    }
+
+    /* Handle last column via memcpy; these take over `exec`, so call them last */
+    if (memcpy_out)
+        handle_tail(p, &exec, out_base, true, in_base, false, y, h_main);
+    if (memcpy_in)
+        handle_tail(p, &exec, out_base, memcpy_out, in_base, true, y + h_main, 1);
+}
+
+static int rw_pixel_bits(const SwsOp *op)
+{
+    const int elems = op->rw.packed ? op->rw.elems : 1;
+    const int size  = ff_sws_pixel_type_size(op->type);
+    const int bits  = 8 >> op->rw.frac;
+    av_assert1(bits >= 1);
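+    /* e.g. packed rgb24: elems = 3, size = 1, bits = 8 -> 24 bits/pixel;
+     * a planar 16-bit plane: elems = 1, size = 2, bits = 8 -> 16 */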
+    return elems * size * bits;
+}
+
+int ff_sws_compile_pass(SwsGraph *graph, SwsOpList *ops, int flags, SwsFormat dst,
+                        SwsPass *input, SwsPass **output)
+{
+    SwsContext *ctx = graph->ctx;
+    SwsOpPass *p = NULL;
+    const SwsOp *read, *write;
+    SwsPass *pass;
+    int ret;
+
+    if (ops->num_ops < 2) {
+        av_log(ctx, AV_LOG_ERROR, "Need at least two operations.\n");
+        return AVERROR(EINVAL);
+    }
+
+    read  = &ops->ops[0];
+    write = &ops->ops[ops->num_ops - 1];
+
+    if (read->op != SWS_OP_READ || write->op != SWS_OP_WRITE) {
+        av_log(ctx, AV_LOG_ERROR, "First and last operations must be a read "
+               "and write, respectively.\n");
+        return AVERROR(EINVAL);
+    }
+
+    if (flags & SWS_OP_FLAG_OPTIMIZE)
+        RET(ff_sws_op_list_optimize(ops));
+    else
+        ff_sws_op_list_update_comps(ops);
+
+    p = av_mallocz(sizeof(*p));
+    if (!p)
+        return AVERROR(ENOMEM);
+
+    p->exec_base = (SwsOpExec) {
+        .width  = dst.width,
+        .height = dst.height,
+        .pixel_bits_in  = rw_pixel_bits(read),
+        .pixel_bits_out = rw_pixel_bits(write),
+    };
+
+    ret = ff_sws_ops_compile(ctx, ops, &p->comp);
+    if (ret < 0)
+        goto fail;
+
+    pass = ff_sws_graph_add_pass(graph, dst.format, dst.width, dst.height, input,
+                                 1, p, op_pass_run);
+    if (!pass) {
+        ret = AVERROR(ENOMEM);
+        goto fail;
+    }
+    pass->setup = op_pass_setup;
+    pass->free  = op_pass_free;
+
+    *output = pass;
+    return 0;
+
+fail:
+    op_pass_free(p);
+    return ret;
+}
diff --git a/libswscale/ops.h b/libswscale/ops.h
index ae65d578b3..1a992f42ec 100644
--- a/libswscale/ops.h
+++ b/libswscale/ops.h
@@ -249,4 +249,18 @@ void ff_sws_op_list_update_comps(SwsOpList *ops);
  */
 int ff_sws_op_list_optimize(SwsOpList *ops);
 
+enum SwsOpCompileFlags {
+    /* Automatically optimize the operations when compiling */
+    SWS_OP_FLAG_OPTIMIZE = 1 << 0,
+};
+
+/**
+ * Resolves an operation list to a graph pass. The first and last operations
+ * must be a read/write respectively. `flags` is a list of SwsOpCompileFlags.
+ *
+ * Note: `ops` may be modified by this function.
+ */
+int ff_sws_compile_pass(SwsGraph *graph, SwsOpList *ops, int flags, SwsFormat dst,
+                        SwsPass *input, SwsPass **output);
+
 #endif
-- 
2.49.0

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [FFmpeg-devel] [PATCH v2 10/17] swscale/optimizer: add packed shuffle solver
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (8 preceding siblings ...)
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 09/17] swscale/ops: add dispatch layer Niklas Haas
@ 2025-05-21 12:43 ` Niklas Haas
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 11/17] swscale/ops_chain: add internal abstraction for kernel linking Niklas Haas
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:43 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This can turn any compatible sequence of operations into a single packed
shuffle, including packed swizzling, grayscale->RGB conversion, endianness
swapping, RGB bit depth conversions, rgb24->rgb0 alpha clearing and more.
---
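As a worked example (illustrative, not taken from the patch): an
rgb24 -> bgr24 conversion, i.e. READ(3, packed) -> SWIZZLE(2,1,0,3) ->
WRITE(3, packed), solves to the 16-byte shuffle mask

    { 2, 1, 0,  5, 4, 3,  8, 7, 6,  11, 10, 9,  14, 13, 12,  X }

processing 5 pixels (15 bytes read and written) per iteration, where X
stays at clear_val as a no-op lane.
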
 libswscale/ops_internal.h  | 17 +++++++
 libswscale/ops_optimizer.c | 96 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 113 insertions(+)

diff --git a/libswscale/ops_internal.h b/libswscale/ops_internal.h
index 9fd866430b..ab957b0837 100644
--- a/libswscale/ops_internal.h
+++ b/libswscale/ops_internal.h
@@ -105,4 +105,21 @@ int ff_sws_ops_compile_backend(SwsContext *ctx, const SwsOpBackend *backend,
  */
 int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out);
 
+/**
+ * "Solve" an op list into a fixed shuffle mask, with an optional ability to
+ * also directly clear the output value (for e.g. rgb24 -> rgb0).
+ *
+ * @param ops         The operation list to decompose.
+ * @param shuffle     The output shuffle mask.
+ * @param size        The size (in bytes) of the output shuffle mask.
+ * @param clear_val   If nonzero, this index will be used to clear the output.
+ * @param read_bytes  Returns the number of bytes read per shuffle iteration.
+ * @param write_bytes Returns the number of bytes written per shuffle iteration.
+ *
+ * @return  The number of pixels processed per iteration, or a negative error
+ *          code; in particular AVERROR(ENOTSUP) for unsupported operations.
+ */
+int ff_sws_solve_shuffle(const SwsOpList *ops, uint8_t shuffle[], int size,
+                         uint8_t clear_val, int *read_bytes, int *write_bytes);
+
 #endif
diff --git a/libswscale/ops_optimizer.c b/libswscale/ops_optimizer.c
index d503bf7bf3..9cde60ed58 100644
--- a/libswscale/ops_optimizer.c
+++ b/libswscale/ops_optimizer.c
@@ -19,9 +19,11 @@
  */
 
 #include "libavutil/avassert.h"
+#include <libavutil/bswap.h>
 #include "libavutil/rational.h"
 
 #include "ops.h"
+#include "ops_internal.h"
 
 #define Q(N) ((AVRational) { N, 1 })
 
@@ -781,3 +783,97 @@ retry:
 
     return 0;
 }
+
+int ff_sws_solve_shuffle(const SwsOpList *const ops, uint8_t shuffle[],
+                         int shuffle_size, uint8_t clear_val,
+                         int *out_read_bytes, int *out_write_bytes)
+{
+    SwsOp read;
+    int read_size;
+    uint32_t mask[4] = {0};
+
+    if (!ops->num_ops || ops->ops[0].op != SWS_OP_READ)
+        return AVERROR(EINVAL);
+
+    read      = ops->ops[0];
+    read_size = ff_sws_pixel_type_size(read.type);
+    if (read.rw.frac || (!read.rw.packed && read.rw.elems > 1))
+        return AVERROR(ENOTSUP);
+
+    for (int i = 0; i < read.rw.elems; i++)
+        mask[i] = 0x01010101 * i * read_size + 0x03020100;
+
+    for (int opidx = 1; opidx < ops->num_ops; opidx++) {
+        const SwsOp *op = &ops->ops[opidx];
+        switch (op->op) {
+        case SWS_OP_SWIZZLE: {
+            uint32_t orig[4] = { mask[0], mask[1], mask[2], mask[3] };
+            for (int i = 0; i < 4; i++)
+                mask[i] = orig[op->swizzle.in[i]];
+            break;
+        }
+
+        case SWS_OP_SWAP_BYTES:
+            for (int i = 0; i < 4; i++) {
+                switch (ff_sws_pixel_type_size(op->type)) {
+                case 2: mask[i] = av_bswap16(mask[i]); break;
+                case 4: mask[i] = av_bswap32(mask[i]); break;
+                }
+            }
+            break;
+
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (!op->c.q4[i].den)
+                    continue;
+                if (op->c.q4[i].num != 0 || !clear_val)
+                    return AVERROR(ENOTSUP);
+                mask[i] = 0x1010101ul * clear_val;
+            }
+            break;
+
+        case SWS_OP_CONVERT: {
+            if (!op->convert.expand)
+                return AVERROR(ENOTSUP);
+            for (int i = 0; i < 4; i++) {
+                switch (ff_sws_pixel_type_size(op->type)) {
+                case 1: mask[i] = 0x01010101 * (mask[i] & 0xFF);   break;
+                case 2: mask[i] = 0x00010001 * (mask[i] & 0xFFFF); break;
+                }
+            }
+            break;
+        }
+
+        case SWS_OP_WRITE: {
+            if (op->rw.frac || !op->rw.packed)
+                return AVERROR(ENOTSUP);
+
+            /* Initialize to no-op */
+            memset(shuffle, clear_val, shuffle_size);
+
+            const int write_size  = ff_sws_pixel_type_size(op->type);
+            const int read_chunk  = read.rw.elems * read_size;
+            const int write_chunk = op->rw.elems * write_size;
+            const int num_groups  = shuffle_size / FFMAX(read_chunk, write_chunk);
+            for (int n = 0; n < num_groups; n++) {
+                const int base_in  = n * read_chunk;
+                const int base_out = n * write_chunk;
+                for (int i = 0; i < op->rw.elems; i++) {
+                    const int offset = base_out + i * write_size;
+                    for (int b = 0; b < write_size; b++) {
+                        const uint8_t idx = mask[i] >> (b * 8);
+                        if (idx != clear_val)
+                            shuffle[offset + b] = base_in + idx;
+                    }
+                }
+            }
+
+            *out_read_bytes  = num_groups * read_chunk;
+            *out_write_bytes = num_groups * write_chunk;
+            return num_groups;
+        }
+
+        default:
+            return AVERROR(ENOTSUP);
+        }
+    }
+
+    return AVERROR(EINVAL);
+}
-- 
2.49.0

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [FFmpeg-devel] [PATCH v2 11/17] swscale/ops_chain: add internal abstraction for kernel linking
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (9 preceding siblings ...)
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 10/17] swscale/optimizer: add packed shuffle solver Niklas Haas
@ 2025-05-21 12:43 ` Niklas Haas
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 12/17] swscale/ops_backend: add reference backend based on C templates Niklas Haas
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:43 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

See doc/swscale-v2.txt for design details.
---
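Illustrative use of the chain API (the kernel `some_kernel`, its private
data and the surrounding error handling are hypothetical):

    SwsOpChain *chain = ff_sws_op_chain_alloc();
    if (!chain)
        return AVERROR(ENOMEM);

    /* Hypothetical kernel: its private data is stored in the slot after
     * the function pointer and released via the matching free callback
     * when the chain is destroyed */
    ret = ff_sws_op_chain_append(chain, some_kernel, av_free, priv);
    if (ret < 0)
        ff_sws_op_chain_free(chain);
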
 libswscale/Makefile    |   1 +
 libswscale/ops_chain.c | 293 +++++++++++++++++++++++++++++++++++++++++
 libswscale/ops_chain.h | 109 +++++++++++++++
 3 files changed, 403 insertions(+)
 create mode 100644 libswscale/ops_chain.c
 create mode 100644 libswscale/ops_chain.h

diff --git a/libswscale/Makefile b/libswscale/Makefile
index 810c9dee78..c9dfa78c89 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -16,6 +16,7 @@ OBJS = alphablend.o                                     \
        input.o                                          \
        lut3d.o                                          \
        ops.o                                            \
+       ops_chain.o                                      \
        ops_optimizer.o                                  \
        options.o                                        \
        output.o                                         \
diff --git a/libswscale/ops_chain.c b/libswscale/ops_chain.c
new file mode 100644
index 0000000000..cba825ee41
--- /dev/null
+++ b/libswscale/ops_chain.c
@@ -0,0 +1,293 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/mem.h"
+#include "libavutil/rational.h"
+
+#include "ops_chain.h"
+
+SwsOpChain *ff_sws_op_chain_alloc(void)
+{
+    return av_mallocz(sizeof(SwsOpChain));
+}
+
+void ff_sws_op_chain_free(SwsOpChain *chain)
+{
+    if (!chain)
+        return;
+
+    for (int i = 0; i < chain->num_impl + 1; i++) {
+        if (chain->free[i])
+            chain->free[i](chain->impl[i].priv.ptr);
+    }
+
+    av_free(chain);
+}
+
+int ff_sws_op_chain_append(SwsOpChain *chain, SwsFuncPtr func,
+                           void (*free)(void *), SwsOpPriv priv)
+{
+    const int idx = chain->num_impl;
+    if (idx == SWS_MAX_OPS)
+        return AVERROR(EINVAL);
+
+    av_assert1(func);
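+    /* Staggered layout: the kernel stored at impl[idx] is paired with the
+     * private data stored at impl[idx + 1], and free[] follows the priv
+     * slot (hence the idx + 1 indices below and in ff_sws_op_chain_free) */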
+    chain->impl[idx].cont = func;
+    chain->impl[idx + 1].priv = priv;
+    chain->free[idx + 1] = free;
+    chain->num_impl++;
+    return 0;
+}
+
+/**
+ * Match an operation against a reference operation. Returns a score for how
+ * well the reference matches the operation, or 0 if there is no match.
+ *
+ * If `ref->comps` has any flags set, they must be set in `op` as well.
+ * Likewise, if `ref->comps` has any components marked as unused, they must
+ * be marked as unused in `op` as well.
+ *
+ * For SWS_OP_LINEAR, `ref->lin.mask` must be a superset of
+ * `op->lin.mask`, but may not contain any columns explicitly ignored by
+ * `op->comps.unused`.
+ *
+ * For SWS_OP_READ, SWS_OP_WRITE, SWS_OP_SWAP_BYTES and SWS_OP_SWIZZLE, the
+ * exact type is not checked, just the size.
+ *
+ * Components set in `next.unused` are ignored when matching. If
+ * `entry->flexible` is true, the op body is ignored; only the operation,
+ * pixel type, and component masks are checked.
+ */
+static int op_match(const SwsOp *op, const SwsOpEntry *entry, const SwsComps next)
+{
+    const SwsOp *ref = &entry->op;
+    int score = 10;
+    if (op->op != ref->op)
+        return 0;
+
+    switch (op->op) {
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+    case SWS_OP_SWAP_BYTES:
+    case SWS_OP_SWIZZLE:
+        /* Only the size matters for these operations */
+        if (ff_sws_pixel_type_size(op->type) != ff_sws_pixel_type_size(ref->type))
+            return 0;
+        break;
+    default:
+        if (op->type != ref->type)
+            return 0;
+        break;
+    }
+
+    for (int i = 0; i < 4; i++) {
+        if (ref->comps.unused[i]) {
+            if (op->comps.unused[i])
+                score += 1; /* Operating on fewer components is better .. */
+            else
+                return 0; /* .. but not too few! */
+        }
+
+        if (ref->comps.flags[i]) {
+            if (ref->comps.flags[i] & ~op->comps.flags[i]) {
+                return false; /* Missing required output assumptions */
+            } else {
+                /* Implementation is more specialized */
+                score += av_popcount(ref->comps.flags[i]);
+            }
+        }
+    }
+
+    /* Flexible variants always match, but lower the score to prioritize more
+     * specific implementations if they exist */
+    if (entry->flexible)
+        return score - 5;
+
+    switch (op->op) {
+    case SWS_OP_INVALID:
+        return 0;
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+        if (op->rw.elems   != ref->rw.elems ||
+            op->rw.frac    != ref->rw.frac  ||
+            (op->rw.elems > 1 && op->rw.packed != ref->rw.packed))
+            return 0;
+        return score;
+    case SWS_OP_SWAP_BYTES:
+        return score;
+    case SWS_OP_PACK:
+    case SWS_OP_UNPACK:
+        for (int i = 0; i < 4 && op->pack.pattern[i]; i++) {
+            if (op->pack.pattern[i] != ref->pack.pattern[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_CLEAR:
+        for (int i = 0; i < 4; i++) {
+            if (!op->c.q4[i].den)
+                continue;
+            if (av_cmp_q(op->c.q4[i], ref->c.q4[i]) && !next.unused[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_LSHIFT:
+    case SWS_OP_RSHIFT:
+        return op->c.u == ref->c.u ? score : 0;
+    case SWS_OP_SWIZZLE:
+        for (int i = 0; i < 4; i++) {
+            if (op->swizzle.in[i] != ref->swizzle.in[i] && !next.unused[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_CONVERT:
+        if (op->convert.to     != ref->convert.to ||
+            op->convert.expand != ref->convert.expand)
+            return 0;
+        return score;
+    case SWS_OP_DITHER:
+        return op->dither.size_log2 == ref->dither.size_log2 ? score : 0;
+    case SWS_OP_MIN:
+    case SWS_OP_MAX:
+        for (int i = 0; i < 4; i++) {
+            if (av_cmp_q(op->c.q4[i], ref->c.q4[i]) && !next.unused[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_LINEAR:
+        /* All required elements must be present */
+        if (op->lin.mask & ~ref->lin.mask)
+            return 0;
+        /* To avoid operating on possibly undefined memory, filter out
+         * implementations that operate on more input components */
+        for (int i = 0; i < 4; i++) {
+            if ((ref->lin.mask & SWS_MASK_COL(i)) && op->comps.unused[i])
+                return 0;
+        }
+        /* Prioritize smaller implementations */
+        score += av_popcount(SWS_MASK_ALL ^ ref->lin.mask);
+        return score;
+    case SWS_OP_SCALE:
+        return score;
+    case SWS_OP_TYPE_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid operation type!");
+    return 0;
+}
+
+int ff_sws_op_compile_tables(const SwsOpTable *const tables[], int num_tables,
+                             SwsOpList *ops, const int block_size,
+                             SwsOpChain *chain)
+{
+    static const SwsOp dummy = { .comps.unused = { true, true, true, true }};
+    const SwsOp *next = ops->num_ops > 1 ? &ops->ops[1] : &dummy;
+    const unsigned cpu_flags = av_get_cpu_flags();
+    const SwsOpEntry *best = NULL;
+    const SwsOp *op = &ops->ops[0];
+    int ret, best_score = 0, best_cpu_flags = 0;
+    SwsOpPriv priv = {0};
+
+    for (int n = 0; n < num_tables; n++) {
+        const SwsOpTable *table = tables[n];
+        if ((table->block_size && table->block_size != block_size) ||
+            (table->cpu_flags & ~cpu_flags))
+            continue;
+
+        for (int i = 0; table->entries[i]; i++) {
+            const SwsOpEntry *entry = table->entries[i];
+            int score = op_match(op, entry, next->comps);
+            if (score > best_score) {
+                best_score = score;
+                best_cpu_flags = table->cpu_flags;
+                best = entry;
+            }
+        }
+    }
+
+    if (!best)
+        return AVERROR(ENOTSUP);
+
+    if (best->setup) {
+        ret = best->setup(op, &priv);
+        if (ret < 0)
+            return ret;
+    }
+
+    chain->cpu_flags |= best_cpu_flags;
+    ret = ff_sws_op_chain_append(chain, best->func, best->free, priv);
+    if (ret < 0) {
+        if (best->free)
+            best->free(&priv);
+        return ret;
+    }
+
+    ops->ops++;
+    ops->num_ops--;
+    return ops->num_ops ? AVERROR(EAGAIN) : 0;
+}
+
+#define q2pixel(type, q) ((q).den ? (type) (q).num / (q).den : 0)
+
+int ff_sws_setup_u8(const SwsOp *op, SwsOpPriv *out)
+{
+    out->u8[0] = op->c.u;
+    return 0;
+}
+
+int ff_sws_setup_u(const SwsOp *op, SwsOpPriv *out)
+{
+    switch (op->type) {
+    case SWS_PIXEL_U8:  out->u8[0]  = op->c.u; return 0;
+    case SWS_PIXEL_U16: out->u16[0] = op->c.u; return 0;
+    case SWS_PIXEL_U32: out->u32[0] = op->c.u; return 0;
+    case SWS_PIXEL_F32: out->f32[0] = op->c.u; return 0;
+    default: return AVERROR(EINVAL);
+    }
+}
+
+int ff_sws_setup_q(const SwsOp *op, SwsOpPriv *out)
+{
+    switch (op->type) {
+    case SWS_PIXEL_U8:  out->u8[0]  = q2pixel(uint8_t,  op->c.q); return 0;
+    case SWS_PIXEL_U16: out->u16[0] = q2pixel(uint16_t, op->c.q); return 0;
+    case SWS_PIXEL_U32: out->u32[0] = q2pixel(uint32_t, op->c.q); return 0;
+    case SWS_PIXEL_F32: out->f32[0] = q2pixel(float,    op->c.q); return 0;
+    default: return AVERROR(EINVAL);
+    }
+}
+
+int ff_sws_setup_q4(const SwsOp *op, SwsOpPriv *out)
+{
+    for (int i = 0; i < 4; i++) {
+        switch (op->type) {
+        case SWS_PIXEL_U8:  out->u8[i]  = q2pixel(uint8_t,  op->c.q4[i]); break;
+        case SWS_PIXEL_U16: out->u16[i] = q2pixel(uint16_t, op->c.q4[i]); break;
+        case SWS_PIXEL_U32: out->u32[i] = q2pixel(uint32_t, op->c.q4[i]); break;
+        case SWS_PIXEL_F32: out->f32[i] = q2pixel(float,    op->c.q4[i]); break;
+        default: return AVERROR(EINVAL);
+        }
+    }
+
+    return 0;
+}
diff --git a/libswscale/ops_chain.h b/libswscale/ops_chain.h
new file mode 100644
index 0000000000..6cbc3adabb
--- /dev/null
+++ b/libswscale/ops_chain.h
@@ -0,0 +1,109 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_CHAIN_H
+#define SWSCALE_OPS_CHAIN_H
+
+#include "libavutil/cpu.h"
+
+#include "ops_internal.h"
+
+/**
+ * Helpers for SIMD implementations based on chained kernels, using a
+ * continuation passing style to link them together.
+ */
+
+/**
+ * Private data for each kernel.
+ */
+typedef union SwsOpPriv {
+    DECLARE_ALIGNED_16(char, data)[16];
+
+    /* Common types */
+    void *ptr;
+    uint8_t   u8[16];
+    uint16_t u16[8];
+    uint32_t u32[4];
+    float    f32[4];
+} SwsOpPriv;
+
+static_assert(sizeof(SwsOpPriv) == 16, "SwsOpPriv size mismatch");
+
+/* Setup helpers */
+int ff_sws_setup_u(const SwsOp *op, SwsOpPriv *out);
+int ff_sws_setup_u8(const SwsOp *op, SwsOpPriv *out);
+int ff_sws_setup_q(const SwsOp *op, SwsOpPriv *out);
+int ff_sws_setup_q4(const SwsOp *op, SwsOpPriv *out);
+
+/**
+ * Per-kernel execution context.
+ *
+ * Note: This struct is hard-coded in assembly, so do not change the layout.
+ */
+typedef void (*SwsFuncPtr)(void);
+typedef struct SwsOpImpl {
+    SwsFuncPtr cont; /* [offset =  0] Continuation for this operation. */
+    SwsOpPriv  priv; /* [offset = 16] Private data for this operation. */
+} SwsOpImpl;
+
+static_assert(sizeof(SwsOpImpl) == 32,         "SwsOpImpl layout mismatch");
+static_assert(offsetof(SwsOpImpl, priv) == 16, "SwsOpImpl layout mismatch");
+
+/* Compiled chain of operations, which can be dispatched efficiently */
+typedef struct SwsOpChain {
+#define SWS_MAX_OPS 16
+    SwsOpImpl impl[SWS_MAX_OPS + 1]; /* reserve extra space for the entrypoint */
+    void (*free[SWS_MAX_OPS + 1])(void *);
+    int num_impl;
+    int cpu_flags; /* set of all used CPU flags */
+} SwsOpChain;
+
+SwsOpChain *ff_sws_op_chain_alloc(void);
+void ff_sws_op_chain_free(SwsOpChain *chain);
+
+/* Returns 0 on success, or a negative error code. */
+int ff_sws_op_chain_append(SwsOpChain *chain, SwsFuncPtr func,
+                           void (*free)(void *), SwsOpPriv priv);
+
+typedef struct SwsOpEntry {
+    SwsOp op;
+    SwsFuncPtr func;
+    bool flexible; /* if true, only the type and op are matched */
+    int (*setup)(const SwsOp *op, SwsOpPriv *out); /* optional */
+    void (*free)(void *priv);
+} SwsOpEntry;
+
+typedef struct SwsOpTable {
+    unsigned cpu_flags;   /* required CPU flags for this table */
+    int block_size;       /* fixed block size of this table */
+    const SwsOpEntry *entries[]; /* terminated by NULL */
+} SwsOpTable;
+
+/**
+ * "Compile" a single op by looking it up in a list of fixed size op tables.
+ * See `op_match` in `ops.c` for details on how the matching works.
+ *
+ * Returns 0, AVERROR(EAGAIN), or a negative error code.
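+ * AVERROR(EAGAIN) means the first op was compiled and consumed from `ops`;
+ * call this function again until the op list is empty.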
+ */
+int ff_sws_op_compile_tables(const SwsOpTable *const tables[], int num_tables,
+                             SwsOpList *ops, const int block_size,
+                             SwsOpChain *chain);
+
+#endif /* SWSCALE_OPS_CHAIN_H */
-- 
2.49.0


* [FFmpeg-devel] [PATCH v2 12/17] swscale/ops_backend: add reference backend based on C templates
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (10 preceding siblings ...)
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 11/17] swscale/ops_chain: add internal abstraction for kernel linking Niklas Haas
@ 2025-05-21 12:43 ` Niklas Haas
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 13/17] swscale/ops_memcpy: add 'memcpy' backend for plane->plane copies Niklas Haas
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:43 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This will serve as a reference for the SIMD backends to come. That said,
with auto-vectorization enabled, the performance of this is not atrocious.
It easily beats the old C code and sometimes even the old SIMD.

In theory, we can dramatically speed it up by using GCC vectors instead of
arrays, but the performance gains from this are too dependent on exact GCC
versions and flags, so in practice it's not a substitute for a SIMD
implementation.
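
As a rough sketch of the chaining model (simplified, hypothetical names;
the real templates pass four component blocks plus per-op private data):

    #define SWS_BLOCK_SIZE 32

    typedef struct Impl Impl;
    typedef void (*Kernel)(float block[SWS_BLOCK_SIZE], const Impl *next);

    struct Impl {
        Kernel cont; /* continuation: the next kernel in the chain */
    };

    /* One op: scale every pixel in the block, then hand off */
    static void scale_half(float block[SWS_BLOCK_SIZE], const Impl *next)
    {
        for (int i = 0; i < SWS_BLOCK_SIZE; i++)
            block[i] *= 0.5f;
        next->cont(block, next + 1); /* tail call into the next kernel */
    }

Each block of pixels thus makes a single pass through the entire chain
per iteration, instead of every op making a full pass over the image.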
---
 libswscale/Makefile          |   6 +
 libswscale/ops.c             |   3 +
 libswscale/ops_backend.c     | 105 ++++++
 libswscale/ops_backend.h     | 167 ++++++++++
 libswscale/ops_tmpl_common.c | 176 ++++++++++
 libswscale/ops_tmpl_float.c  | 257 +++++++++++++++
 libswscale/ops_tmpl_int.c    | 608 +++++++++++++++++++++++++++++++++++
 7 files changed, 1322 insertions(+)
 create mode 100644 libswscale/ops_backend.c
 create mode 100644 libswscale/ops_backend.h
 create mode 100644 libswscale/ops_tmpl_common.c
 create mode 100644 libswscale/ops_tmpl_float.c
 create mode 100644 libswscale/ops_tmpl_int.c

diff --git a/libswscale/Makefile b/libswscale/Makefile
index c9dfa78c89..6e5696c5a6 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -16,6 +16,7 @@ OBJS = alphablend.o                                     \
        input.o                                          \
        lut3d.o                                          \
        ops.o                                            \
+       ops_backend.o                                    \
        ops_chain.o                                      \
        ops_optimizer.o                                  \
        options.o                                        \
@@ -29,6 +30,11 @@ OBJS = alphablend.o                                     \
        yuv2rgb.o                                        \
        vscale.o                                         \
 
+OPS-CFLAGS = -Wno-uninitialized \
+             -ffinite-math-only
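+# The op templates intentionally pass unused vector lanes through
+# uninitialized, and the float kernels assume no NaNs/infinities.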
+
+$(SUBDIR)ops_backend.o: CFLAGS += $(OPS-CFLAGS)
+
 # Objects duplicated from other libraries for shared builds
 SHLIBOBJS                    += log2_tab.o half2float.o
 
diff --git a/libswscale/ops.c b/libswscale/ops.c
index d466f5e45c..3b9c2844f8 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -27,7 +27,10 @@
 #include "ops.h"
 #include "ops_internal.h"
 
+extern SwsOpBackend backend_c;
+
 const SwsOpBackend * const ff_sws_op_backends[] = {
+    &backend_c,
     NULL
 };
 
diff --git a/libswscale/ops_backend.c b/libswscale/ops_backend.c
new file mode 100644
index 0000000000..47ce992bb3
--- /dev/null
+++ b/libswscale/ops_backend.c
@@ -0,0 +1,105 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "ops_backend.h"
+
+/* Array-based reference implementation */
+
+#ifndef SWS_BLOCK_SIZE
+#  define SWS_BLOCK_SIZE 32
+#endif
+
+typedef  uint8_t  u8block_t[SWS_BLOCK_SIZE];
+typedef uint16_t u16block_t[SWS_BLOCK_SIZE];
+typedef uint32_t u32block_t[SWS_BLOCK_SIZE];
+typedef    float f32block_t[SWS_BLOCK_SIZE];
+
+#define BIT_DEPTH 8
+# include "ops_tmpl_int.c"
+#undef BIT_DEPTH
+
+#define BIT_DEPTH 16
+# include "ops_tmpl_int.c"
+#undef BIT_DEPTH
+
+#define BIT_DEPTH 32
+# include "ops_tmpl_int.c"
+# include "ops_tmpl_float.c"
+#undef BIT_DEPTH
+
+static void process(const SwsOpExec *exec, const void *priv,
+                    const int bx_start, const int y_start, int bx_end, int y_end)
+{
+    const SwsOpChain *chain = priv;
+    const SwsOpImpl *impl = chain->impl;
+    SwsOpIter iter;
+
+    for (iter.y = y_start; iter.y < y_end; iter.y++) {
+        for (int i = 0; i < 4; i++) {
+            iter.in[i]  = exec->in[i]  + (iter.y - y_start) * exec->in_stride[i];
+            iter.out[i] = exec->out[i] + (iter.y - y_start) * exec->out_stride[i];
+        }
+
+        for (int block = bx_start; block < bx_end; block++) {
+            iter.x = block * SWS_BLOCK_SIZE;
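+            /* Invoke the chain entry point; each kernel tail-calls the
+             * next via impl->cont until the last kernel returns */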
+            ((void (*)(SwsOpIter *, const SwsOpImpl *)) impl->cont)
+                (&iter, &impl[1]);
+        }
+    }
+}
+
+static int compile(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out)
+{
+    int ret;
+
+    SwsOpChain *chain = ff_sws_op_chain_alloc();
+    if (!chain)
+        return AVERROR(ENOMEM);
+
+    static const SwsOpTable *const tables[] = {
+        &bitfn(op_table_int,    u8),
+        &bitfn(op_table_int,   u16),
+        &bitfn(op_table_int,   u32),
+        &bitfn(op_table_float, f32),
+    };
+
+    do {
+        ret = ff_sws_op_compile_tables(tables, FF_ARRAY_ELEMS(tables), ops,
+                                       SWS_BLOCK_SIZE, chain);
+    } while (ret == AVERROR(EAGAIN));
+    if (ret < 0) {
+        ff_sws_op_chain_free(chain);
+        return ret;
+    }
+
+    *out = (SwsCompiledOp) {
+        .func       = process,
+        .block_size = SWS_BLOCK_SIZE,
+        .cpu_flags  = chain->cpu_flags,
+        .priv       = chain,
+        .free       = (void (*)(void *)) ff_sws_op_chain_free,
+    };
+    return 0;
+}
+
+SwsOpBackend backend_c = {
+    .name       = "c",
+    .compile    = compile,
+};
diff --git a/libswscale/ops_backend.h b/libswscale/ops_backend.h
new file mode 100644
index 0000000000..690302b264
--- /dev/null
+++ b/libswscale/ops_backend.h
@@ -0,0 +1,167 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_BACKEND_H
+#define SWSCALE_OPS_BACKEND_H
+
+/**
+ * Helper macros for the C-based backend.
+ *
+ * To use these macros, the following types must be defined:
+ *  - PIXEL_TYPE should be one of SWS_PIXEL_*
+ *  - pixel_t should be the type of pixels
+ *  - block_t should be the type of blocks (groups of pixels)
+ */
+
+#include <assert.h>
+#include <float.h>
+#include <stdint.h>
+
+#include "libavutil/attributes.h"
+#include "libavutil/mem.h"
+
+#include "ops_chain.h"
+
+/**
+ * Internal context holding per-iter execution data. The data pointers will be
+ * directly incremented by the corresponding read/write functions.
+ */
+typedef struct SwsOpIter {
+    const uint8_t *in[4];
+    uint8_t *out[4];
+    int x, y;
+} SwsOpIter;
+
+#ifdef __clang__
+#  define SWS_FUNC
+#  define SWS_LOOP AV_PRAGMA(clang loop vectorize(assume_safety))
+#elif defined(__GNUC__)
+#  define SWS_FUNC __attribute__((optimize("tree-vectorize")))
+#  define SWS_LOOP AV_PRAGMA(GCC ivdep)
+#else
+#  define SWS_FUNC
+#  define SWS_LOOP
+#endif
+
+/* Miscellaneous helpers */
+#define bitfn2(name, ext) name ## _ ## ext
+#define bitfn(name, ext)  bitfn2(name, ext)
+
+#define FN_SUFFIX AV_JOIN(FMT_CHAR, BIT_DEPTH)
+#define fn(name)  bitfn(name, FN_SUFFIX)
+
+#define av_q2pixel(q) ((q).den ? (pixel_t) (q).num / (q).den : 0)
+
+/* Helper macros to make writing common function signatures less painful */
+#define DECL_FUNC(NAME, ...)                                                    \
+    static av_always_inline void fn(NAME)(SwsOpIter *restrict iter,             \
+                                          const SwsOpImpl *restrict impl,       \
+                                          block_t x, block_t y,                 \
+                                          block_t z, block_t w,                 \
+                                          __VA_ARGS__)
+
+#define DECL_READ(NAME, ...)                                                    \
+    static av_always_inline void fn(NAME)(SwsOpIter *restrict iter,             \
+                                          const SwsOpImpl *restrict impl,       \
+                                          const pixel_t *restrict in0,          \
+                                          const pixel_t *restrict in1,          \
+                                          const pixel_t *restrict in2,          \
+                                          const pixel_t *restrict in3,          \
+                                          __VA_ARGS__)
+
+#define DECL_WRITE(NAME, ...)                                                   \
+    DECL_FUNC(NAME, pixel_t *restrict out0, pixel_t *restrict out1,             \
+                    pixel_t *restrict out2, pixel_t *restrict out3,             \
+                    __VA_ARGS__)
+
+/* Helper macros to call into functions declared with DECL_FUNC_* */
+#define CALL(FUNC, ...) \
+    fn(FUNC)(iter, impl, x, y, z, w, __VA_ARGS__)
+
+#define CALL_READ(FUNC, ...)                                                    \
+    fn(FUNC)(iter, impl, (const pixel_t *) iter->in[0],                         \
+                         (const pixel_t *) iter->in[1],                         \
+                         (const pixel_t *) iter->in[2],                         \
+                         (const pixel_t *) iter->in[3], __VA_ARGS__)
+
+#define CALL_WRITE(FUNC, ...)                                                   \
+    CALL(FUNC, (pixel_t *) iter->out[0], (pixel_t *) iter->out[1],              \
+               (pixel_t *) iter->out[2], (pixel_t *) iter->out[3], __VA_ARGS__)
+
+/* Helper macros to declare continuation functions */
+#define DECL_IMPL(NAME)                                                         \
+    static SWS_FUNC void fn(NAME)(SwsOpIter *restrict iter,                     \
+                                  const SwsOpImpl *restrict impl,               \
+                                  block_t x, block_t y,                         \
+                                  block_t z, block_t w)                         \
+
+/* Helper macro to call into the next continuation with a given type */
+#define CONTINUE(TYPE, ...)                                                     \
+    ((void (*)(SwsOpIter *, const SwsOpImpl *,                                  \
+               TYPE x, TYPE y, TYPE z, TYPE w)) impl->cont)                     \
+        (iter, &impl[1], __VA_ARGS__)
+
+/* Helper macros for common op setup code */
+#define DECL_SETUP(NAME)                                                        \
+    static int fn(NAME)(const SwsOp *op, SwsOpPriv *out)
+
+#define SETUP_MEMDUP(c) ff_setup_memdup(&(c), sizeof(c), out)
+static inline int ff_setup_memdup(const void *c, size_t size, SwsOpPriv *out)
+{
+    out->ptr = av_memdup(c, size);
+    return out->ptr ? 0 : AVERROR(ENOMEM);
+}
+
+/* Helper macro for declaring op table entries */
+#define DECL_ENTRY(NAME, ...)                                                   \
+    static const SwsOpEntry fn(op_##NAME) = {                                   \
+        .func = (SwsFuncPtr) fn(NAME),                                          \
+        .op.type = PIXEL_TYPE,                                                  \
+        __VA_ARGS__                                                             \
+    }
+
+/* Helpers to define functions for common subsets of components */
+#define DECL_PATTERN(NAME) \
+    DECL_FUNC(NAME, const bool X, const bool Y, const bool Z, const bool W)
+
+#define WRAP_PATTERN(FUNC, X, Y, Z, W, ...)                                     \
+    DECL_IMPL(FUNC##_##X##Y##Z##W)                                              \
+    {                                                                           \
+        CALL(FUNC, X, Y, Z, W);                                                 \
+    }                                                                           \
+                                                                                \
+    DECL_ENTRY(FUNC##_##X##Y##Z##W,                                             \
+        .op.comps.unused = { !X, !Y, !Z, !W },                                  \
+        __VA_ARGS__                                                             \
+    )
+
+#define WRAP_COMMON_PATTERNS(FUNC, ...)                                         \
+    WRAP_PATTERN(FUNC, 1, 0, 0, 0, __VA_ARGS__);                                \
+    WRAP_PATTERN(FUNC, 1, 0, 0, 1, __VA_ARGS__);                                \
+    WRAP_PATTERN(FUNC, 1, 1, 1, 0, __VA_ARGS__);                                \
+    WRAP_PATTERN(FUNC, 1, 1, 1, 1, __VA_ARGS__)
+
+#define REF_COMMON_PATTERNS(NAME)                                               \
+    &fn(op_##NAME##_1000),                                                      \
+    &fn(op_##NAME##_1001),                                                      \
+    &fn(op_##NAME##_1110),                                                      \
+    &fn(op_##NAME##_1111)
+
+#endif /* SWSCALE_OPS_BACKEND_H */
diff --git a/libswscale/ops_tmpl_common.c b/libswscale/ops_tmpl_common.c
new file mode 100644
index 0000000000..a9410a8a61
--- /dev/null
+++ b/libswscale/ops_tmpl_common.c
@@ -0,0 +1,176 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "ops_backend.h"
+
+#ifndef BIT_DEPTH
+#  error Should only be included from ops_tmpl_*.c!
+#endif
+
+#define WRAP_CONVERT_UINT(N)                                                    \
+DECL_PATTERN(convert_uint##N)                                                   \
+{                                                                               \
+    u##N##block_t xu, yu, zu, wu;                                               \
+                                                                                \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {                                  \
+        if (X)                                                                  \
+            xu[i] = x[i];                                                       \
+        if (Y)                                                                  \
+            yu[i] = y[i];                                                       \
+        if (Z)                                                                  \
+            zu[i] = z[i];                                                       \
+        if (W)                                                                  \
+            wu[i] = w[i];                                                       \
+    }                                                                           \
+                                                                                \
+    CONTINUE(u##N##block_t, xu, yu, zu, wu);                                    \
+}                                                                               \
+                                                                                \
+WRAP_COMMON_PATTERNS(convert_uint##N,                                           \
+    .op.op = SWS_OP_CONVERT,                                                    \
+    .op.convert.to = SWS_PIXEL_U##N,                                            \
+);
+
+#if BIT_DEPTH != 8
+WRAP_CONVERT_UINT(8)
+#endif
+
+#if BIT_DEPTH != 16
+WRAP_CONVERT_UINT(16)
+#endif
+
+#if BIT_DEPTH != 32 || IS_FLOAT
+WRAP_CONVERT_UINT(32)
+#endif
+
+DECL_FUNC(clear, const bool X, const bool Y, const bool Z, const bool W)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (!X)
+            x[i] = impl->priv.px[0];
+        if (!Y)
+            y[i] = impl->priv.px[1];
+        if (!Z)
+            z[i] = impl->priv.px[2];
+        if (!W)
+            w[i] = impl->priv.px[3];
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+#define WRAP_CLEAR(X, Y, Z, W)                                                  \
+DECL_IMPL(clear##_##X##Y##Z##W)                                                 \
+{                                                                               \
+    CALL(clear, X, Y, Z, W);                                                    \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(clear##_##X##Y##Z##W,                                                \
+    .setup = ff_sws_setup_q4,                                                   \
+    .flexible = true,                                                           \
+    .op.op = SWS_OP_CLEAR,                                                      \
+    .op.comps.unused = { !X, !Y, !Z, !W },                                      \
+);
+
+WRAP_CLEAR(1, 1, 1, 0) /* rgba alpha */
+WRAP_CLEAR(0, 1, 1, 1) /* argb alpha */
+
+WRAP_CLEAR(0, 0, 1, 1) /* vuya chroma */
+WRAP_CLEAR(1, 0, 0, 1) /* yuva chroma */
+WRAP_CLEAR(1, 1, 0, 0) /* ayuv chroma */
+WRAP_CLEAR(0, 1, 0, 1) /* uyva chroma */
+WRAP_CLEAR(1, 0, 1, 0) /* xvyu chroma */
+
+WRAP_CLEAR(1, 0, 0, 0) /* gray -> yuva */
+WRAP_CLEAR(0, 1, 0, 0) /* gray -> ayuv */
+WRAP_CLEAR(0, 0, 1, 0) /* gray -> vuya */
+
+DECL_PATTERN(min)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] = FFMIN(x[i], impl->priv.px[0]);
+        if (Y)
+            y[i] = FFMIN(y[i], impl->priv.px[1]);
+        if (Z)
+            z[i] = FFMIN(z[i], impl->priv.px[2]);
+        if (W)
+            w[i] = FFMIN(w[i], impl->priv.px[3]);
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_PATTERN(max)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] = FFMAX(x[i], impl->priv.px[0]);
+        if (Y)
+            y[i] = FFMAX(y[i], impl->priv.px[1]);
+        if (Z)
+            z[i] = FFMAX(z[i], impl->priv.px[2]);
+        if (W)
+            w[i] = FFMAX(w[i], impl->priv.px[3]);
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(min,
+    .op.op = SWS_OP_MIN,
+    .setup = ff_sws_setup_q4,
+    .flexible = true,
+);
+
+WRAP_COMMON_PATTERNS(max,
+    .op.op = SWS_OP_MAX,
+    .setup = ff_sws_setup_q4,
+    .flexible = true,
+);
+
+DECL_PATTERN(scale)
+{
+    const pixel_t scale = impl->priv.px[0];
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] *= scale;
+        if (Y)
+            y[i] *= scale;
+        if (Z)
+            z[i] *= scale;
+        if (W)
+            w[i] *= scale;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(scale,
+    .op.op = SWS_OP_SCALE,
+    .setup = ff_sws_setup_q,
+    .flexible = true,
+);
diff --git a/libswscale/ops_tmpl_float.c b/libswscale/ops_tmpl_float.c
new file mode 100644
index 0000000000..050a8a77f8
--- /dev/null
+++ b/libswscale/ops_tmpl_float.c
@@ -0,0 +1,257 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+
+#include "ops_backend.h"
+
+#ifndef BIT_DEPTH
+#  define BIT_DEPTH 32
+#endif
+
+#if BIT_DEPTH == 32
+#  define PIXEL_TYPE SWS_PIXEL_F32
+#  define PIXEL_MAX  FLT_MAX
+#  define PIXEL_MIN  FLT_MIN
+#  define pixel_t    float
+#  define block_t    f32block_t
+#  define px         f32
+#else
+#  error Invalid BIT_DEPTH
+#endif
+
+#define IS_FLOAT 1
+#define FMT_CHAR f
+#include "ops_tmpl_common.c"
+
+DECL_SETUP(setup_dither)
+{
+    const int size = 1 << op->dither.size_log2;
+    if (!size) {
+        /* We special case this value */
+        av_assert1(!av_cmp_q(op->dither.matrix[0], av_make_q(1, 2)));
+        out->ptr = NULL;
+        return 0;
+    }
+
+    const int width = FFMAX(size, SWS_BLOCK_SIZE);
+    pixel_t *matrix = out->ptr = av_malloc(sizeof(pixel_t) * size * width);
+    if (!matrix)
+        return AVERROR(ENOMEM);
+
+    for (int y = 0; y < size; y++) {
+        for (int x = 0; x < size; x++)
+            matrix[y * width + x] = av_q2pixel(op->dither.matrix[y * size + x]);
+        for (int x = size; x < width; x++) /* pad to block size */
+            matrix[y * width + x] = matrix[y * width + (x % size)];
+    }
+
+    return 0;
+}
+
+DECL_FUNC(dither, const int size_log2)
+{
+    const pixel_t *restrict matrix = impl->priv.ptr;
+    const int mask = (1 << size_log2) - 1;
+    const int y_line = iter->y;
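+    /* Sample a different matrix row per component to decorrelate the
+     * dither pattern between planes */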
+    const int row0 = (y_line +  0) & mask;
+    const int row1 = (y_line +  3) & mask;
+    const int row2 = (y_line +  2) & mask;
+    const int row3 = (y_line +  5) & mask;
+    const int size = 1 << size_log2;
+    const int width = FFMAX(size, SWS_BLOCK_SIZE);
+    const int base = iter->x & ~(SWS_BLOCK_SIZE - 1) & (size - 1);
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] += size_log2 ? matrix[row0 * width + base + i] : (pixel_t) 0.5;
+        y[i] += size_log2 ? matrix[row1 * width + base + i] : (pixel_t) 0.5;
+        z[i] += size_log2 ? matrix[row2 * width + base + i] : (pixel_t) 0.5;
+        w[i] += size_log2 ? matrix[row3 * width + base + i] : (pixel_t) 0.5;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+#define WRAP_DITHER(N)                                                          \
+DECL_IMPL(dither##N)                                                            \
+{                                                                               \
+    CALL(dither, N);                                                            \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(dither##N,                                                           \
+    .op.op = SWS_OP_DITHER,                                                     \
+    .op.dither.size_log2 = N,                                                   \
+    .setup = fn(setup_dither),                                                  \
+    .free = av_free,                                                            \
+);
+
+WRAP_DITHER(0)
+WRAP_DITHER(1)
+WRAP_DITHER(2)
+WRAP_DITHER(3)
+WRAP_DITHER(4)
+WRAP_DITHER(5)
+WRAP_DITHER(6)
+WRAP_DITHER(7)
+WRAP_DITHER(8)
+
+typedef struct {
+    /* Stored in split form for convenience */
+    pixel_t m[4][4];
+    pixel_t k[4];
+} fn(LinCoeffs);
+
+DECL_SETUP(setup_linear)
+{
+    fn(LinCoeffs) c;
+
+    for (int i = 0; i < 4; i++) {
+        for (int j = 0; j < 4; j++)
+            c.m[i][j] = av_q2pixel(op->lin.m[i][j]);
+        c.k[i] = av_q2pixel(op->lin.m[i][4]);
+    }
+
+    return SETUP_MEMDUP(c);
+}
+
+/**
+ * Fully general case for a 5x5 linear affine transformation. Should never be
+ * called without constant `mask`. This function will compile down to the
+ * appropriately optimized version for the required subset of operations when
+ * called with a constant mask.
+ */
+DECL_FUNC(linear_mask, const uint32_t mask)
+{
+    const fn(LinCoeffs) c = *(const fn(LinCoeffs) *) impl->priv.ptr;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        const pixel_t xx = x[i];
+        const pixel_t yy = y[i];
+        const pixel_t zz = z[i];
+        const pixel_t ww = w[i];
+
+        x[i]  = (mask & SWS_MASK_OFF(0)) ? c.k[0] : 0;
+        x[i] += (mask & SWS_MASK(0, 0))  ? c.m[0][0] * xx : xx;
+        x[i] += (mask & SWS_MASK(0, 1))  ? c.m[0][1] * yy : 0;
+        x[i] += (mask & SWS_MASK(0, 2))  ? c.m[0][2] * zz : 0;
+        x[i] += (mask & SWS_MASK(0, 3))  ? c.m[0][3] * ww : 0;
+
+        y[i]  = (mask & SWS_MASK_OFF(1)) ? c.k[1] : 0;
+        y[i] += (mask & SWS_MASK(1, 0))  ? c.m[1][0] * xx : 0;
+        y[i] += (mask & SWS_MASK(1, 1))  ? c.m[1][1] * yy : yy;
+        y[i] += (mask & SWS_MASK(1, 2))  ? c.m[1][2] * zz : 0;
+        y[i] += (mask & SWS_MASK(1, 3))  ? c.m[1][3] * ww : 0;
+
+        z[i]  = (mask & SWS_MASK_OFF(2)) ? c.k[2] : 0;
+        z[i] += (mask & SWS_MASK(2, 0))  ? c.m[2][0] * xx : 0;
+        z[i] += (mask & SWS_MASK(2, 1))  ? c.m[2][1] * yy : 0;
+        z[i] += (mask & SWS_MASK(2, 2))  ? c.m[2][2] * zz : zz;
+        z[i] += (mask & SWS_MASK(2, 3))  ? c.m[2][3] * ww : 0;
+
+        w[i]  = (mask & SWS_MASK_OFF(3)) ? c.k[3] : 0;
+        w[i] += (mask & SWS_MASK(3, 0))  ? c.m[3][0] * xx : 0;
+        w[i] += (mask & SWS_MASK(3, 1))  ? c.m[3][1] * yy : 0;
+        w[i] += (mask & SWS_MASK(3, 2))  ? c.m[3][2] * zz : 0;
+        w[i] += (mask & SWS_MASK(3, 3))  ? c.m[3][3] * ww : ww;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+#define WRAP_LINEAR(NAME, MASK)                                                 \
+DECL_IMPL(linear_##NAME)                                                        \
+{                                                                               \
+    CALL(linear_mask, MASK);                                                    \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(linear_##NAME,                                                       \
+    .setup = fn(setup_linear),                                                  \
+    .free = av_free,                                                            \
+    .op.op = SWS_OP_LINEAR,                                                     \
+    .op.lin.mask = (MASK),                                                      \
+);
+
+WRAP_LINEAR(luma,      SWS_MASK_LUMA)
+WRAP_LINEAR(alpha,     SWS_MASK_ALPHA)
+WRAP_LINEAR(lumalpha,  SWS_MASK_LUMA | SWS_MASK_ALPHA)
+WRAP_LINEAR(dot3,      0b111)
+WRAP_LINEAR(row0,      SWS_MASK_ROW(0))
+WRAP_LINEAR(row0a,     SWS_MASK_ROW(0) | SWS_MASK_ALPHA)
+WRAP_LINEAR(diag3,     SWS_MASK_DIAG3)
+WRAP_LINEAR(diag4,     SWS_MASK_DIAG4)
+WRAP_LINEAR(diagoff3,  SWS_MASK_DIAG3 | SWS_MASK_OFF3)
+WRAP_LINEAR(matrix3,   SWS_MASK_MAT3)
+WRAP_LINEAR(affine3,   SWS_MASK_MAT3 | SWS_MASK_OFF3)
+WRAP_LINEAR(affine3a,  SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA)
+WRAP_LINEAR(matrix4,   SWS_MASK_MAT4)
+WRAP_LINEAR(affine4,   SWS_MASK_MAT4 | SWS_MASK_OFF4)
+
+static const SwsOpTable fn(op_table_float) = {
+    .block_size = SWS_BLOCK_SIZE,
+    .entries = {
+        REF_COMMON_PATTERNS(convert_uint8),
+        REF_COMMON_PATTERNS(convert_uint16),
+        REF_COMMON_PATTERNS(convert_uint32),
+
+        &fn(op_clear_1110),
+        REF_COMMON_PATTERNS(min),
+        REF_COMMON_PATTERNS(max),
+        REF_COMMON_PATTERNS(scale),
+
+        &fn(op_dither0),
+        &fn(op_dither1),
+        &fn(op_dither2),
+        &fn(op_dither3),
+        &fn(op_dither4),
+        &fn(op_dither5),
+        &fn(op_dither6),
+        &fn(op_dither7),
+        &fn(op_dither8),
+
+        &fn(op_linear_luma),
+        &fn(op_linear_alpha),
+        &fn(op_linear_lumalpha),
+        &fn(op_linear_dot3),
+        &fn(op_linear_row0),
+        &fn(op_linear_row0a),
+        &fn(op_linear_diag3),
+        &fn(op_linear_diag4),
+        &fn(op_linear_diagoff3),
+        &fn(op_linear_matrix3),
+        &fn(op_linear_affine3),
+        &fn(op_linear_affine3a),
+        &fn(op_linear_matrix4),
+        &fn(op_linear_affine4),
+
+        NULL
+    },
+};
+
+#undef PIXEL_TYPE
+#undef PIXEL_MAX
+#undef PIXEL_MIN
+#undef pixel_t
+#undef block_t
+#undef px
+
+#undef FMT_CHAR
+#undef IS_FLOAT
diff --git a/libswscale/ops_tmpl_int.c b/libswscale/ops_tmpl_int.c
new file mode 100644
index 0000000000..ee9e43fddb
--- /dev/null
+++ b/libswscale/ops_tmpl_int.c
@@ -0,0 +1,608 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/bswap.h"
+
+#include "ops_backend.h"
+
+#ifndef BIT_DEPTH
+#  define BIT_DEPTH 8
+#endif
+
+#if BIT_DEPTH == 32
+#  define PIXEL_TYPE SWS_PIXEL_U32
+#  define PIXEL_MAX  0xFFFFFFFFu
+#  define SWAP_BYTES av_bswap32
+#  define pixel_t    uint32_t
+#  define block_t    u32block_t
+#  define px         u32
+#elif BIT_DEPTH == 16
+#  define PIXEL_TYPE SWS_PIXEL_U16
+#  define PIXEL_MAX  0xFFFFu
+#  define SWAP_BYTES av_bswap16
+#  define pixel_t    uint16_t
+#  define block_t    u16block_t
+#  define px         u16
+#elif BIT_DEPTH == 8
+#  define PIXEL_TYPE SWS_PIXEL_U8
+#  define PIXEL_MAX  0xFFu
+#  define pixel_t    uint8_t
+#  define block_t    u8block_t
+#  define px         u8
+#else
+#  error Invalid BIT_DEPTH
+#endif
+
+#define IS_FLOAT  0
+#define FMT_CHAR  u
+#define PIXEL_MIN 0
+#include "ops_tmpl_common.c"
+
+DECL_READ(read_planar, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] = in0[i];
+        if (elems > 1)
+            y[i] = in1[i];
+        if (elems > 2)
+            z[i] = in2[i];
+        if (elems > 3)
+            w[i] = in3[i];
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_READ(read_packed, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] = in0[elems * i + 0];
+        if (elems > 1)
+            y[i] = in0[elems * i + 1];
+        if (elems > 2)
+            z[i] = in0[elems * i + 2];
+        if (elems > 3)
+            w[i] = in0[elems * i + 3];
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_WRITE(write_planar, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        out0[i] = x[i];
+        if (elems > 1)
+            out1[i] = y[i];
+        if (elems > 2)
+            out2[i] = z[i];
+        if (elems > 3)
+            out3[i] = w[i];
+    }
+}
+
+DECL_WRITE(write_packed, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        out0[elems * i + 0] = x[i];
+        if (elems > 1)
+            out0[elems * i + 1] = y[i];
+        if (elems > 2)
+            out0[elems * i + 2] = z[i];
+        if (elems > 3)
+            out0[elems * i + 3] = w[i];
+    }
+}
+
+#define WRAP_READ(FUNC, ELEMS, FRAC, PACKED)                                    \
+DECL_IMPL(FUNC##ELEMS)                                                          \
+{                                                                               \
+    CALL_READ(FUNC, ELEMS);                                                     \
+    for (int i = 0; i < (PACKED ? 1 : ELEMS); i++)                              \
+        iter->in[i] += sizeof(block_t) * (PACKED ? ELEMS : 1) >> FRAC;          \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(FUNC##ELEMS,                                                         \
+    .op.op = SWS_OP_READ,                                                       \
+    .op.rw = {                                                                  \
+        .elems  = ELEMS,                                                        \
+        .packed = PACKED,                                                       \
+        .frac   = FRAC,                                                         \
+    },                                                                          \
+);
+
+WRAP_READ(read_planar, 1, 0, false)
+WRAP_READ(read_planar, 2, 0, false)
+WRAP_READ(read_planar, 3, 0, false)
+WRAP_READ(read_planar, 4, 0, false)
+WRAP_READ(read_packed, 2, 0, true)
+WRAP_READ(read_packed, 3, 0, true)
+WRAP_READ(read_packed, 4, 0, true)
+
+#define WRAP_WRITE(FUNC, ELEMS, FRAC, PACKED)                                   \
+DECL_IMPL(FUNC##ELEMS)                                                          \
+{                                                                               \
+    CALL_WRITE(FUNC, ELEMS);                                                    \
+    for (int i = 0; i < (PACKED ? 1 : ELEMS); i++)                              \
+        iter->out[i] += sizeof(block_t) * (PACKED ? ELEMS : 1) >> FRAC;         \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(FUNC##ELEMS,                                                         \
+    .op.op = SWS_OP_WRITE,                                                      \
+    .op.rw = {                                                                  \
+        .elems  = ELEMS,                                                        \
+        .packed = PACKED,                                                       \
+        .frac   = FRAC,                                                         \
+    },                                                                          \
+);
+
+WRAP_WRITE(write_planar, 1, 0, false)
+WRAP_WRITE(write_planar, 2, 0, false)
+WRAP_WRITE(write_planar, 3, 0, false)
+WRAP_WRITE(write_planar, 4, 0, false)
+WRAP_WRITE(write_packed, 2, 0, true)
+WRAP_WRITE(write_packed, 3, 0, true)
+WRAP_WRITE(write_packed, 4, 0, true)
+
+#if BIT_DEPTH == 8
+DECL_READ(read_nibbles, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 2) {
+        const pixel_t val = ((const pixel_t *) in0)[i >> 1];
+        x[i + 0] = val >> 4;  /* high nibble */
+        x[i + 1] = val & 0xF; /* low nibble */
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_READ(read_bits, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 8) {
+        const pixel_t val = ((const pixel_t *) in0)[i >> 3];
+        x[i + 0] = (val >> 7) & 1;
+        x[i + 1] = (val >> 6) & 1;
+        x[i + 2] = (val >> 5) & 1;
+        x[i + 3] = (val >> 4) & 1;
+        x[i + 4] = (val >> 3) & 1;
+        x[i + 5] = (val >> 2) & 1;
+        x[i + 6] = (val >> 1) & 1;
+        x[i + 7] = (val >> 0) & 1;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_READ(read_nibbles, 1, 1, false)
+WRAP_READ(read_bits,    1, 3, false)
+
+DECL_WRITE(write_nibbles, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 2)
+        out0[i >> 1] = x[i] << 4 | x[i + 1];
+}
+
+DECL_WRITE(write_bits, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 8) {
+        out0[i >> 3] = x[i + 0] << 7 |
+                       x[i + 1] << 6 |
+                       x[i + 2] << 5 |
+                       x[i + 3] << 4 |
+                       x[i + 4] << 3 |
+                       x[i + 5] << 2 |
+                       x[i + 6] << 1 |
+                       x[i + 7];
+    }
+}
+
+WRAP_WRITE(write_nibbles, 1, 1, false)
+WRAP_WRITE(write_bits,    1, 3, false)
+#endif /* BIT_DEPTH == 8 */
+
+#ifdef SWAP_BYTES
+DECL_PATTERN(swap_bytes)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] = SWAP_BYTES(x[i]);
+        if (Y)
+            y[i] = SWAP_BYTES(y[i]);
+        if (Z)
+            z[i] = SWAP_BYTES(z[i]);
+        if (W)
+            w[i] = SWAP_BYTES(w[i]);
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(swap_bytes, .op.op = SWS_OP_SWAP_BYTES);
+#endif /* SWAP_BYTES */
+
+#if BIT_DEPTH == 8
+DECL_PATTERN(expand16)
+{
+    u16block_t x16, y16, z16, w16;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x16[i] = x[i] << 8 | x[i];
+        if (Y)
+            y16[i] = y[i] << 8 | y[i];
+        if (Z)
+            z16[i] = z[i] << 8 | z[i];
+        if (W)
+            w16[i] = w[i] << 8 | w[i];
+    }
+
+    CONTINUE(u16block_t, x16, y16, z16, w16);
+}
+
+WRAP_COMMON_PATTERNS(expand16,
+    .op.op = SWS_OP_CONVERT,
+    .op.convert.to = SWS_PIXEL_U16,
+    .op.convert.expand = true,
+);
+
+DECL_PATTERN(expand32)
+{
+    u32block_t x32, y32, z32, w32;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x32[i] = x[i] << 24 | x[i] << 16 | x[i] << 8 | x[i];
+        y32[i] = y[i] << 24 | y[i] << 16 | y[i] << 8 | y[i];
+        z32[i] = z[i] << 24 | z[i] << 16 | z[i] << 8 | z[i];
+        w32[i] = w[i] << 24 | w[i] << 16 | w[i] << 8 | w[i];
+    }
+
+    CONTINUE(u32block_t, x32, y32, z32, w32);
+}
+
+WRAP_COMMON_PATTERNS(expand32,
+    .op.op = SWS_OP_CONVERT,
+    .op.convert.to = SWS_PIXEL_U32,
+    .op.convert.expand = true,
+);
+#endif
+
+#define WRAP_PACK_UNPACK(X, Y, Z, W)                                            \
+inline DECL_IMPL(pack_##X##Y##Z##W)                                             \
+{                                                                               \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {                                  \
+        x[i] = x[i] << (Y+Z+W);                                                 \
+        if (Y)                                                                  \
+            x[i] |= y[i] << (Z+W);                                              \
+        if (Z)                                                                  \
+            x[i] |= z[i] << W;                                                  \
+        if (W)                                                                  \
+            x[i] |= w[i];                                                       \
+    }                                                                           \
+                                                                                \
+    CONTINUE(block_t, x, y, z, w);                                              \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(pack_##X##Y##Z##W,                                                   \
+    .op.op = SWS_OP_PACK,                                                       \
+    .op.pack.pattern = { X, Y, Z, W },                                          \
+    .op.comps.unused = { !X, !Y, !Z, !W },                                      \
+);                                                                              \
+                                                                                \
+inline DECL_IMPL(unpack_##X##Y##Z##W)                                           \
+{                                                                               \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {                                  \
+        const pixel_t val = x[i];                                               \
+        x[i] = val >> (Y+Z+W);                                                  \
+        if (Y)                                                                  \
+            y[i] = (val >> (Z+W)) & ((1 << Y) - 1);                             \
+        if (Z)                                                                  \
+            z[i] = (val >> W) & ((1 << Z) - 1);                                 \
+        if (W)                                                                  \
+            w[i] = val & ((1 << W) - 1);                                        \
+    }                                                                           \
+                                                                                \
+    CONTINUE(block_t, x, y, z, w);                                              \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(unpack_##X##Y##Z##W,                                                 \
+    .op.op = SWS_OP_UNPACK,                                                     \
+    .op.pack.pattern = { X, Y, Z, W },                                          \
+    .op.comps.flags = {                                                         \
+        X ? 0 : SWS_COMP_GARBAGE, Y ? 0 : SWS_COMP_GARBAGE,                     \
+        Z ? 0 : SWS_COMP_GARBAGE, W ? 0 : SWS_COMP_GARBAGE,                     \
+    },                                                                          \
+);
+
+WRAP_PACK_UNPACK( 3,  3,  2,  0)
+WRAP_PACK_UNPACK( 2,  3,  3,  0)
+WRAP_PACK_UNPACK( 1,  2,  1,  0)
+WRAP_PACK_UNPACK( 5,  6,  5,  0)
+WRAP_PACK_UNPACK( 5,  5,  5,  0)
+WRAP_PACK_UNPACK( 4,  4,  4,  0)
+WRAP_PACK_UNPACK( 2, 10, 10, 10)
+WRAP_PACK_UNPACK(10, 10, 10,  2)
+
+#if BIT_DEPTH != 8
+DECL_PATTERN(lshift)
+{
+    const uint8_t amount = impl->priv.u8[0];
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] <<= amount;
+        y[i] <<= amount;
+        z[i] <<= amount;
+        w[i] <<= amount;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_PATTERN(rshift)
+{
+    const uint8_t amount = impl->priv.u8[0];
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] >>= amount;
+        y[i] >>= amount;
+        z[i] >>= amount;
+        w[i] >>= amount;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(lshift,
+    .op.op    = SWS_OP_LSHIFT,
+    .setup    = ff_sws_setup_u8,
+    .flexible = true,
+);
+
+WRAP_COMMON_PATTERNS(rshift,
+    .op.op    = SWS_OP_RSHIFT,
+    .setup    = ff_sws_setup_u8,
+    .flexible = true,
+);
+#endif /* BIT_DEPTH != 8 */
+
+DECL_PATTERN(convert_float)
+{
+    f32block_t xf, yf, zf, wf;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        xf[i] = x[i];
+        yf[i] = y[i];
+        zf[i] = z[i];
+        wf[i] = w[i];
+    }
+
+    CONTINUE(f32block_t, xf, yf, zf, wf);
+}
+
+WRAP_COMMON_PATTERNS(convert_float,
+    .op.op = SWS_OP_CONVERT,
+    .op.convert.to = SWS_PIXEL_F32,
+);
+
+/**
+ * Swizzle by directly swapping the order of arguments to the continuation.
+ * Note that this is only safe to do if no arguments are duplicated.
+ */
+#define DECL_SWIZZLE(X, Y, Z, W)                                                \
+static SWS_FUNC void                                                            \
+fn(swizzle_##X##Y##Z##W)(SwsOpIter *restrict iter,                              \
+                         const SwsOpImpl *restrict impl,                        \
+                         block_t c0, block_t c1, block_t c2, block_t c3)        \
+{                                                                               \
+    CONTINUE(block_t, c##X, c##Y, c##Z, c##W);                                  \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(swizzle_##X##Y##Z##W,                                                \
+    .op.op = SWS_OP_SWIZZLE,                                                    \
+    .op.swizzle = SWS_SWIZZLE(X, Y, Z, W),                                      \
+);
+
+DECL_SWIZZLE(3, 0, 1, 2)
+DECL_SWIZZLE(3, 0, 2, 1)
+DECL_SWIZZLE(2, 1, 0, 3)
+DECL_SWIZZLE(3, 2, 1, 0)
+DECL_SWIZZLE(3, 1, 0, 2)
+DECL_SWIZZLE(3, 2, 0, 1)
+DECL_SWIZZLE(1, 2, 0, 3)
+DECL_SWIZZLE(1, 0, 2, 3)
+DECL_SWIZZLE(2, 0, 1, 3)
+DECL_SWIZZLE(2, 3, 1, 0)
+DECL_SWIZZLE(2, 1, 3, 0)
+DECL_SWIZZLE(1, 2, 3, 0)
+DECL_SWIZZLE(1, 3, 2, 0)
+DECL_SWIZZLE(0, 2, 1, 3)
+DECL_SWIZZLE(0, 2, 3, 1)
+DECL_SWIZZLE(0, 3, 1, 2)
+DECL_SWIZZLE(3, 1, 2, 0)
+DECL_SWIZZLE(0, 3, 2, 1)
+
+/* Broadcast luma -> rgb (only used for y(a) -> rgb(a)) */
+#define DECL_EXPAND_LUMA(X, W, T0, T1)                                          \
+static SWS_FUNC void                                                            \
+fn(expand_luma_##X##W)(SwsOpIter *restrict iter,                                \
+                       const SwsOpImpl *restrict impl,                          \
+                       block_t c0, block_t c1,  block_t c2, block_t c3)         \
+{                                                                               \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++)                                    \
+        T0[i] = T1[i] = c0[i];                                                  \
+                                                                                \
+    CONTINUE(block_t, c##X, T0, T1, c##W);                                      \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(expand_luma_##X##W,                                                  \
+    .op.op = SWS_OP_SWIZZLE,                                                    \
+    .op.swizzle = SWS_SWIZZLE(X, 0, 0, W),                                      \
+);
+
+DECL_EXPAND_LUMA(0, 3, c1, c2)
+DECL_EXPAND_LUMA(3, 0, c1, c2)
+DECL_EXPAND_LUMA(1, 0, c2, c3)
+DECL_EXPAND_LUMA(0, 1, c2, c3)
+
+static const SwsOpTable fn(op_table_int) = {
+    .block_size = SWS_BLOCK_SIZE,
+    .entries = {
+        &fn(op_read_planar1),
+        &fn(op_read_planar2),
+        &fn(op_read_planar3),
+        &fn(op_read_planar4),
+        &fn(op_read_packed2),
+        &fn(op_read_packed3),
+        &fn(op_read_packed4),
+
+        &fn(op_write_planar1),
+        &fn(op_write_planar2),
+        &fn(op_write_planar3),
+        &fn(op_write_planar4),
+        &fn(op_write_packed2),
+        &fn(op_write_packed3),
+        &fn(op_write_packed4),
+
+#if BIT_DEPTH == 8
+        &fn(op_read_bits1),
+        &fn(op_read_nibbles1),
+        &fn(op_write_bits1),
+        &fn(op_write_nibbles1),
+
+        &fn(op_pack_1210),
+        &fn(op_pack_2330),
+        &fn(op_pack_3320),
+
+        &fn(op_unpack_1210),
+        &fn(op_unpack_2330),
+        &fn(op_unpack_3320),
+
+        REF_COMMON_PATTERNS(expand16),
+        REF_COMMON_PATTERNS(expand32),
+#elif BIT_DEPTH == 16
+        &fn(op_pack_4440),
+        &fn(op_pack_5550),
+        &fn(op_pack_5650),
+        &fn(op_unpack_4440),
+        &fn(op_unpack_5550),
+        &fn(op_unpack_5650),
+#elif BIT_DEPTH == 32
+        &fn(op_pack_2101010),
+        &fn(op_pack_1010102),
+        &fn(op_unpack_2101010),
+        &fn(op_unpack_1010102),
+#endif
+
+#ifdef SWAP_BYTES
+        REF_COMMON_PATTERNS(swap_bytes),
+#endif
+
+        REF_COMMON_PATTERNS(min),
+        REF_COMMON_PATTERNS(max),
+        REF_COMMON_PATTERNS(scale),
+        REF_COMMON_PATTERNS(convert_float),
+
+        &fn(op_clear_1110),
+        &fn(op_clear_0111),
+        &fn(op_clear_0011),
+        &fn(op_clear_1001),
+        &fn(op_clear_1100),
+        &fn(op_clear_0101),
+        &fn(op_clear_1010),
+        &fn(op_clear_1000),
+        &fn(op_clear_0100),
+        &fn(op_clear_0010),
+
+        &fn(op_swizzle_3012),
+        &fn(op_swizzle_3021),
+        &fn(op_swizzle_2103),
+        &fn(op_swizzle_3210),
+        &fn(op_swizzle_3102),
+        &fn(op_swizzle_3201),
+        &fn(op_swizzle_1203),
+        &fn(op_swizzle_1023),
+        &fn(op_swizzle_2013),
+        &fn(op_swizzle_2310),
+        &fn(op_swizzle_2130),
+        &fn(op_swizzle_1230),
+        &fn(op_swizzle_1320),
+        &fn(op_swizzle_0213),
+        &fn(op_swizzle_0231),
+        &fn(op_swizzle_0312),
+        &fn(op_swizzle_3120),
+        &fn(op_swizzle_0321),
+
+        &fn(op_expand_luma_03),
+        &fn(op_expand_luma_30),
+        &fn(op_expand_luma_10),
+        &fn(op_expand_luma_01),
+
+#if BIT_DEPTH != 8
+        REF_COMMON_PATTERNS(lshift),
+        REF_COMMON_PATTERNS(rshift),
+        REF_COMMON_PATTERNS(convert_uint8),
+#endif /* BIT_DEPTH != 8 */
+
+#if BIT_DEPTH != 16
+        REF_COMMON_PATTERNS(convert_uint16),
+#endif
+#if BIT_DEPTH != 32
+        REF_COMMON_PATTERNS(convert_uint32),
+#endif
+
+        NULL
+    },
+};
+
+#undef PIXEL_TYPE
+#undef PIXEL_MAX
+#undef PIXEL_MIN
+#undef SWAP_BYTES
+#undef pixel_t
+#undef block_t
+#undef px
+
+#undef FMT_CHAR
+#undef IS_FLOAT
-- 
2.49.0

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [FFmpeg-devel] [PATCH v2 13/17] swscale/ops_memcpy: add 'memcpy' backend for plane->plane copies
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (11 preceding siblings ...)
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 12/17] swscale/ops_backend: add reference backend based on C templates Niklas Haas
@ 2025-05-21 12:43 ` Niklas Haas
  2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 14/17] swscale/x86: add SIMD backend Niklas Haas
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:43 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Provides a generic fast path for any operation list that can be decomposed
into a series of memcpy and memset operations.

25% faster than the x86 backend for yuv444p -> yuva444p
33% faster than the x86 backend for gray -> yuvj444p
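
As an illustrative sketch (hand-written here, not actual compiler output):
for gray -> yuvj444p the op list reduces to one plane copy plus two plane
clears, i.e. a MemcpyPriv state along the lines of:

    /* Hypothetical state for gray -> yuvj444p: plane 0 is copied from
     * input plane 0; planes 1 and 2 are memset to 0x80 (neutral chroma). */
    MemcpyPriv p = {
        .num_planes  = 3,
        .index       = { 0, -1, -1 },
        .clear_value = { 0, 0x80, 0x80 },
    };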
---
 libswscale/Makefile     |   1 +
 libswscale/ops.c        |   2 +
 libswscale/ops_memcpy.c | 132 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 135 insertions(+)
 create mode 100644 libswscale/ops_memcpy.c

diff --git a/libswscale/Makefile b/libswscale/Makefile
index 6e5696c5a6..136d33f6bc 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -18,6 +18,7 @@ OBJS = alphablend.o                                     \
        ops.o                                            \
        ops_backend.o                                    \
        ops_chain.o                                      \
+       ops_memcpy.o                                     \
        ops_optimizer.o                                  \
        options.o                                        \
        output.o                                         \
diff --git a/libswscale/ops.c b/libswscale/ops.c
index 3b9c2844f8..6403eff324 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -28,8 +28,10 @@
 #include "ops_internal.h"
 
 extern SwsOpBackend backend_c;
+extern SwsOpBackend backend_memcpy;
 
 const SwsOpBackend * const ff_sws_op_backends[] = {
+    &backend_memcpy,
     &backend_c,
     NULL
 };
diff --git a/libswscale/ops_memcpy.c b/libswscale/ops_memcpy.c
new file mode 100644
index 0000000000..1fcb58d452
--- /dev/null
+++ b/libswscale/ops_memcpy.c
@@ -0,0 +1,132 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+
+#include "ops_backend.h"
+
+typedef struct MemcpyPriv {
+    int num_planes;
+    int index[4]; /* or -1 to clear plane */
+    uint8_t clear_value[4];
+} MemcpyPriv;
+
+/* Memcpy backend for trivial cases */
+
+static void process(const SwsOpExec *exec, const void *priv,
+                    int x_start, int y_start, int x_end, int y_end)
+{
+    const MemcpyPriv *p = priv;
+    const int lines = y_end - y_start;
+    av_assert1(x_start == 0 && x_end == exec->width);
+
+    for (int i = 0; i < p->num_planes; i++) {
+        uint8_t *out = exec->out[i];
+        const int idx = p->index[i];
+        if (idx < 0) {
+            memset(out, p->clear_value[i], exec->out_stride[i] * lines);
+        } else if (exec->out_stride[i] == exec->in_stride[idx]) {
+            memcpy(out, exec->in[idx], exec->out_stride[i] * lines);
+        } else {
+            const int bytes = x_end * exec->pixel_bits_out >> 3;
+            const uint8_t *in = exec->in[idx];
+            for (int y = y_start; y < y_end; y++) {
+                memcpy(out, in, bytes);
+                out += exec->out_stride[i];
+                in  += exec->in_stride[idx];
+            }
+        }
+    }
+}
+
+static int compile(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out)
+{
+    MemcpyPriv p = {0};
+
+    for (int n = 0; n < ops->num_ops; n++) {
+        const SwsOp *op = &ops->ops[n];
+        switch (op->op) {
+        case SWS_OP_READ:
+            if ((op->rw.packed && op->rw.elems != 1) || op->rw.frac)
+                return AVERROR(ENOTSUP);
+            for (int i = 0; i < op->rw.elems; i++)
+                p.index[i] = i;
+            break;
+
+        case SWS_OP_SWIZZLE: {
+            const MemcpyPriv orig = p;
+            for (int i = 0; i < 4; i++) {
+                /* Explicitly exclude swizzle masks that contain duplicates,
+                 * because these are wasteful to implement as a memcpy */
+                for (int j = 0; j < i; j++) {
+                    if (op->swizzle.in[i] == op->swizzle.in[j])
+                        return AVERROR(ENOTSUP);
+                }
+                p.index[i] = orig.index[op->swizzle.in[i]];
+            }
+            break;
+        }
+
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (!op->c.q4[i].den)
+                    continue;
+                if (op->c.q4[i].den != 1)
+                    return AVERROR(ENOTSUP);
+
+                /* Ensure all bytes to be cleared are the same, because we
+                 * can't memset on multi-byte sequences */
+                uint8_t val = op->c.q4[i].num & 0xFF;
+                uint32_t ref = val;
+                switch (ff_sws_pixel_type_size(op->type)) {
+                case 2: ref *= 0x101; break;
+                case 4: ref *= 0x1010101; break;
+                }
+                if (ref != op->c.q4[i].num)
+                    return AVERROR(ENOTSUP);
+                p.clear_value[i] = val;
+                p.index[i] = -1;
+            }
+            break;
+
+        case SWS_OP_WRITE:
+            if ((op->rw.packed && op->rw.elems != 1) || op->rw.frac)
+                return AVERROR(ENOTSUP);
+            p.num_planes = op->rw.elems;
+            break;
+
+        default:
+            return AVERROR(ENOTSUP);
+        }
+    }
+
+    *out = (SwsCompiledOp) {
+        .block_size = 1,
+        .func = process,
+        .priv = av_memdup(&p, sizeof(p)),
+        .free = av_free,
+    };
+    return out->priv ? 0 : AVERROR(ENOMEM);
+}
+
+SwsOpBackend backend_memcpy = {
+    .name    = "memcpy",
+    .compile = compile,
+};
-- 
2.49.0

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [FFmpeg-devel] [PATCH v2 14/17] swscale/x86: add SIMD backend
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (12 preceding siblings ...)
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 13/17] swscale/ops_memcpy: add 'memcpy' backend for plane->plane copies Niklas Haas
@ 2025-05-21 12:44 ` Niklas Haas
  2025-05-21 14:11   ` Kieran Kunhya via ffmpeg-devel
  2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 15/17] tests/checkasm: add checkasm tests for swscale ops Niklas Haas
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:44 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This covers most 8-bit and 16-bit ops, some 32-bit ops, and all floating-point
operations. While this is not yet 100% coverage, it's good enough for the vast
majority of formats out there.

Of special note is the packed shuffle fast path, which uses pshufb at vector
sizes up to AVX512.
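
As an illustrative sketch (hand-written here, not the solver's actual
output): an rgba -> bgra conversion reduces to a single in-lane byte
shuffle, whose 16-byte pshufb control vector would look like:

    /* Hypothetical pshufb pattern for rgba -> bgra: each 128-bit lane
     * permutes four 4-byte pixels, so wider (AVX2/AVX512) vectors repeat
     * the same pattern per lane. */
    static const uint8_t rgba_to_bgra[16] = {
         2,  1,  0,  3,   6,  5,  4,  7,
        10,  9,  8, 11,  14, 13, 12, 15,
    };

Since pshufb cannot move bytes across 128-bit lanes, op lists that read or
write fewer than 16 bytes per lane fall back to XMM-sized vectors.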
---
 libswscale/ops.c              |    4 +
 libswscale/x86/Makefile       |    3 +
 libswscale/x86/ops.c          |  706 ++++++++++++++++++++++
 libswscale/x86/ops_common.asm |  187 ++++++
 libswscale/x86/ops_float.asm  |  386 ++++++++++++
 libswscale/x86/ops_int.asm    | 1050 +++++++++++++++++++++++++++++++++
 6 files changed, 2336 insertions(+)
 create mode 100644 libswscale/x86/ops.c
 create mode 100644 libswscale/x86/ops_common.asm
 create mode 100644 libswscale/x86/ops_float.asm
 create mode 100644 libswscale/x86/ops_int.asm

diff --git a/libswscale/ops.c b/libswscale/ops.c
index 6403eff324..8a27e70ef9 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -29,9 +29,13 @@
 
 extern SwsOpBackend backend_c;
 extern SwsOpBackend backend_memcpy;
+extern SwsOpBackend backend_x86;
 
 const SwsOpBackend * const ff_sws_op_backends[] = {
     &backend_memcpy,
+#if ARCH_X86
+    &backend_x86,
+#endif
     &backend_c,
     NULL
 };
diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index f00154941d..a04bc8336f 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -10,6 +10,9 @@ OBJS-$(CONFIG_XMM_CLOBBER_TEST) += x86/w64xmmtest.o
 
 X86ASM-OBJS                     += x86/input.o                          \
                                    x86/output.o                         \
+                                   x86/ops_int.o                        \
+                                   x86/ops_float.o                      \
+                                   x86/ops.o                            \
                                    x86/scale.o                          \
                                    x86/scale_avx2.o                          \
                                    x86/range_convert.o                  \
diff --git a/libswscale/x86/ops.c b/libswscale/x86/ops.c
new file mode 100644
index 0000000000..d5fd046d64
--- /dev/null
+++ b/libswscale/x86/ops.c
@@ -0,0 +1,706 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <float.h>
+
+#include <libavutil/avassert.h>
+#include <libavutil/mem.h>
+
+#include "../ops_chain.h"
+
+#define DECL_ENTRY(TYPE, NAME, ...)                                             \
+    static const SwsOpEntry op_##NAME = {                                       \
+        .op.type = SWS_PIXEL_##TYPE,                                            \
+        __VA_ARGS__                                                             \
+    }
+
+#define DECL_ASM(TYPE, NAME, ...)                                               \
+    void ff_##NAME(void);                                                       \
+    DECL_ENTRY(TYPE, NAME,                                                      \
+        .func = ff_##NAME,                                                      \
+        __VA_ARGS__)
+
+#define DECL_PATTERN(TYPE, NAME, X, Y, Z, W, ...)                               \
+    DECL_ASM(TYPE, p##X##Y##Z##W##_##NAME,                                      \
+        .op.comps.unused = { !X, !Y, !Z, !W },                                  \
+        __VA_ARGS__                                                             \
+    )
+
+#define REF_PATTERN(NAME, X, Y, Z, W)                                           \
+    &op_p##X##Y##Z##W##_##NAME
+
+#define DECL_COMMON_PATTERNS(TYPE, NAME, ...)                                   \
+    DECL_PATTERN(TYPE, NAME, 1, 0, 0, 0, __VA_ARGS__);                          \
+    DECL_PATTERN(TYPE, NAME, 1, 0, 0, 1, __VA_ARGS__);                          \
+    DECL_PATTERN(TYPE, NAME, 1, 1, 1, 0, __VA_ARGS__);                          \
+    DECL_PATTERN(TYPE, NAME, 1, 1, 1, 1, __VA_ARGS__)                           \
+
+#define REF_COMMON_PATTERNS(NAME)                                               \
+    REF_PATTERN(NAME, 1, 0, 0, 0),                                              \
+    REF_PATTERN(NAME, 1, 0, 0, 1),                                              \
+    REF_PATTERN(NAME, 1, 1, 1, 0),                                              \
+    REF_PATTERN(NAME, 1, 1, 1, 1)
+
+#define DECL_RW(EXT, TYPE, NAME, OP, ELEMS, PACKED, FRAC)                       \
+    DECL_ASM(TYPE, NAME##ELEMS##EXT,                                            \
+        .op.op = SWS_OP_##OP,                                                   \
+        .op.rw = { .elems = ELEMS, .packed = PACKED, .frac = FRAC },            \
+    );
+
+#define DECL_PACKED_RW(EXT, DEPTH)                                              \
+    DECL_RW(EXT, U##DEPTH, read##DEPTH##_packed,  READ,  2, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, read##DEPTH##_packed,  READ,  3, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, read##DEPTH##_packed,  READ,  4, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, write##DEPTH##_packed, WRITE, 2, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, write##DEPTH##_packed, WRITE, 3, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, write##DEPTH##_packed, WRITE, 4, true,  0)           \
+
+#define DECL_PACK_UNPACK(EXT, TYPE, X, Y, Z, W)                                 \
+    DECL_ASM(TYPE, pack_##X##Y##Z##W##EXT,                                      \
+        .op.op = SWS_OP_PACK,                                                   \
+        .op.pack.pattern = {X, Y, Z, W},                                        \
+    );                                                                          \
+                                                                                \
+    DECL_ASM(TYPE, unpack_##X##Y##Z##W##EXT,                                    \
+        .op.op = SWS_OP_UNPACK,                                                 \
+        .op.pack.pattern = {X, Y, Z, W},                                        \
+    );                                                                          \
+
+static int setup_swap_bytes(const SwsOp *op, SwsOpPriv *out)
+{
+    const int mask = ff_sws_pixel_type_size(op->type) - 1;
+    for (int i = 0; i < 16; i++)
+        out->u8[i] = (i & ~mask) | (mask - (i & mask));
+    return 0;
+}
+
+#define DECL_SWAP_BYTES(EXT, TYPE, X, Y, Z, W)                                  \
+    DECL_PATTERN(TYPE, swap_bytes_##TYPE##EXT, X, Y, Z, W,                      \
+        .func = ff_p##X##Y##Z##W##_shuffle##EXT,                                \
+        .op.op = SWS_OP_SWAP_BYTES,                                             \
+        .setup = setup_swap_bytes,                                              \
+    );
+
+#define DECL_CLEAR_ALPHA(EXT, IDX)                                              \
+    DECL_ASM(U8, clear_alpha##IDX##EXT,                                         \
+        .op.op = SWS_OP_CLEAR,                                                  \
+        .op.c.q4[IDX] = { .num = -1, .den = 1 },                                \
+        .op.comps.unused[IDX] = true,                                           \
+    );                                                                          \
+
+#define DECL_CLEAR_ZERO(EXT, IDX)                                               \
+    DECL_ASM(U8, clear_zero##IDX##EXT,                                          \
+        .op.op = SWS_OP_CLEAR,                                                  \
+        .op.c.q4[IDX] = { .num = 0, .den = 1 },                                 \
+        .op.comps.unused[IDX] = true,                                           \
+    );
+
+static int setup_clear(const SwsOp *op, SwsOpPriv *out)
+{
+    for (int i = 0; i < 4; i++)
+        out->u32[i] = (uint32_t) op->c.q4[i].num;
+    return 0;
+}
+
+#define DECL_CLEAR(EXT, X, Y, Z, W)                                             \
+    DECL_PATTERN(U8, clear##EXT, X, Y, Z, W,                                    \
+        .op.op = SWS_OP_CLEAR,                                                  \
+        .setup = setup_clear,                                                   \
+        .flexible = true,                                                       \
+    );
+
+#define DECL_SWIZZLE(EXT, X, Y, Z, W)                                           \
+    DECL_ASM(U8, swizzle_##X##Y##Z##W##EXT,                                     \
+        .op.op = SWS_OP_SWIZZLE,                                                \
+        .op.swizzle = SWS_SWIZZLE( X, Y, Z, W ),                                \
+    );
+
+#define DECL_CONVERT(EXT, FROM, TO)                                             \
+    DECL_COMMON_PATTERNS(FROM, convert_##FROM##_##TO##EXT,                      \
+        .op.op = SWS_OP_CONVERT,                                                \
+        .op.convert.to = SWS_PIXEL_##TO,                                        \
+    );
+
+#define DECL_EXPAND(EXT, FROM, TO)                                              \
+    DECL_COMMON_PATTERNS(FROM, expand_##FROM##_##TO##EXT,                       \
+        .op.op = SWS_OP_CONVERT,                                                \
+        .op.convert.to = SWS_PIXEL_##TO,                                        \
+        .op.convert.expand = true,                                              \
+    );
+
+static int setup_shift(const SwsOp *op, SwsOpPriv *out)
+{
+    out->u16[0] = op->c.u;
+    return 0;
+}
+
+#define DECL_SHIFT16(EXT)                                                       \
+    DECL_COMMON_PATTERNS(U16, lshift16##EXT,                                    \
+        .op.op = SWS_OP_LSHIFT,                                                 \
+        .setup = setup_shift,                                                   \
+        .flexible = true,                                                       \
+    );                                                                          \
+                                                                                \
+    DECL_COMMON_PATTERNS(U16, rshift16##EXT,                                    \
+        .op.op = SWS_OP_RSHIFT,                                                 \
+        .setup = setup_shift,                                                   \
+        .flexible = true,                                                       \
+    );
+
+#define DECL_MIN_MAX(EXT)                                                       \
+    DECL_COMMON_PATTERNS(F32, min##EXT,                                         \
+        .op.op = SWS_OP_MIN,                                                    \
+        .setup = ff_sws_setup_q4,                                               \
+        .flexible = true,                                                       \
+    );                                                                          \
+                                                                                \
+    DECL_COMMON_PATTERNS(F32, max##EXT,                                         \
+        .op.op = SWS_OP_MAX,                                                    \
+        .setup = ff_sws_setup_q4,                                               \
+        .flexible = true,                                                       \
+    );
+
+#define DECL_SCALE(EXT)                                                         \
+    DECL_COMMON_PATTERNS(F32, scale##EXT,                                       \
+        .op.op = SWS_OP_SCALE,                                                  \
+        .setup = ff_sws_setup_q,                                                \
+    );
+
+/* 2x2 matrix fits inside SwsOpPriv directly, saving an indirection in this case */
+static_assert(sizeof(SwsOpPriv) >= sizeof(float[2][2]), "2x2 dither matrix too large");
+static int setup_dither(const SwsOp *op, SwsOpPriv *out)
+{
+    const int size = 1 << op->dither.size_log2;
+    float *matrix = out->f32;
+    if (size > 2) {
+        matrix = out->ptr = av_mallocz(size * size * sizeof(*matrix));
+        if (!matrix)
+            return AVERROR(ENOMEM);
+    }
+
+    for (int i = 0; i < size * size; i++)
+        matrix[i] = (float) op->dither.matrix[i].num / op->dither.matrix[i].den;
+
+    return 0;
+}
+
+#define DECL_DITHER(EXT, SIZE)                                                  \
+    DECL_COMMON_PATTERNS(F32, dither##SIZE##EXT,                                \
+        .op.op = SWS_OP_DITHER,                                                 \
+        .op.dither.size_log2 = SIZE,                                            \
+        .setup = setup_dither,                                                  \
+        .free  = SIZE > 2 ? av_free : NULL,                                     \
+    );
+
+static int setup_linear(const SwsOp *op, SwsOpPriv *out)
+{
+    float *matrix = out->ptr = av_mallocz(sizeof(float[4][5]));
+    if (!matrix)
+        return AVERROR(ENOMEM);
+
+    for (int y = 0; y < 4; y++) {
+        for (int x = 0; x < 5; x++)
+            matrix[y * 5 + x] = (float) op->lin.m[y][x].num / op->lin.m[y][x].den;
+    }
+
+    return 0;
+}
+
+#define DECL_LINEAR(EXT, NAME, MASK)                                            \
+    DECL_ASM(F32, NAME##EXT,                                                    \
+        .op.op = SWS_OP_LINEAR,                                                 \
+        .op.lin.mask = (MASK),                                                  \
+        .setup = setup_linear,                                                  \
+        .free  = av_free,                                                       \
+    );
+
+#define DECL_FUNCS_8(SIZE, EXT, FLAG)                                           \
+    DECL_RW(EXT, U8, read_planar,   READ,  1, false, 0)                         \
+    DECL_RW(EXT, U8, read_planar,   READ,  2, false, 0)                         \
+    DECL_RW(EXT, U8, read_planar,   READ,  3, false, 0)                         \
+    DECL_RW(EXT, U8, read_planar,   READ,  4, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 1, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 2, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 3, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 4, false, 0)                         \
+    DECL_RW(EXT, U8, read_nibbles,  READ,  1, false, 1)                         \
+    DECL_RW(EXT, U8, read_bits,     READ,  1, false, 3)                         \
+    DECL_RW(EXT, U8, write_bits,    WRITE, 1, false, 3)                         \
+    DECL_PACKED_RW(EXT, 8)                                                      \
+    DECL_PACK_UNPACK(EXT, U8, 1, 2, 1, 0)                                       \
+    DECL_PACK_UNPACK(EXT, U8, 3, 3, 2, 0)                                       \
+    DECL_PACK_UNPACK(EXT, U8, 2, 3, 3, 0)                                       \
+    void ff_p1000_shuffle##EXT(void);                                           \
+    void ff_p1001_shuffle##EXT(void);                                           \
+    void ff_p1110_shuffle##EXT(void);                                           \
+    void ff_p1111_shuffle##EXT(void);                                           \
+    DECL_SWIZZLE(EXT, 3, 0, 1, 2)                                               \
+    DECL_SWIZZLE(EXT, 3, 0, 2, 1)                                               \
+    DECL_SWIZZLE(EXT, 2, 1, 0, 3)                                               \
+    DECL_SWIZZLE(EXT, 3, 2, 1, 0)                                               \
+    DECL_SWIZZLE(EXT, 3, 1, 0, 2)                                               \
+    DECL_SWIZZLE(EXT, 3, 2, 0, 1)                                               \
+    DECL_SWIZZLE(EXT, 1, 2, 0, 3)                                               \
+    DECL_SWIZZLE(EXT, 1, 0, 2, 3)                                               \
+    DECL_SWIZZLE(EXT, 2, 0, 1, 3)                                               \
+    DECL_SWIZZLE(EXT, 2, 3, 1, 0)                                               \
+    DECL_SWIZZLE(EXT, 2, 1, 3, 0)                                               \
+    DECL_SWIZZLE(EXT, 1, 2, 3, 0)                                               \
+    DECL_SWIZZLE(EXT, 1, 3, 2, 0)                                               \
+    DECL_SWIZZLE(EXT, 0, 2, 1, 3)                                               \
+    DECL_SWIZZLE(EXT, 0, 2, 3, 1)                                               \
+    DECL_SWIZZLE(EXT, 0, 3, 1, 2)                                               \
+    DECL_SWIZZLE(EXT, 3, 1, 2, 0)                                               \
+    DECL_SWIZZLE(EXT, 0, 3, 2, 1)                                               \
+    DECL_SWIZZLE(EXT, 0, 0, 0, 3)                                               \
+    DECL_SWIZZLE(EXT, 3, 0, 0, 0)                                               \
+    DECL_SWIZZLE(EXT, 0, 0, 0, 1)                                               \
+    DECL_SWIZZLE(EXT, 1, 0, 0, 0)                                               \
+    DECL_CLEAR_ALPHA(EXT, 0)                                                    \
+    DECL_CLEAR_ALPHA(EXT, 1)                                                    \
+    DECL_CLEAR_ALPHA(EXT, 3)                                                    \
+    DECL_CLEAR_ZERO(EXT, 0)                                                     \
+    DECL_CLEAR_ZERO(EXT, 1)                                                     \
+    DECL_CLEAR_ZERO(EXT, 3)                                                     \
+    DECL_CLEAR(EXT, 1, 1, 1, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 1, 1, 1)                                                 \
+    DECL_CLEAR(EXT, 0, 0, 1, 1)                                                 \
+    DECL_CLEAR(EXT, 1, 0, 0, 1)                                                 \
+    DECL_CLEAR(EXT, 1, 1, 0, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 1, 0, 1)                                                 \
+    DECL_CLEAR(EXT, 1, 0, 1, 0)                                                 \
+    DECL_CLEAR(EXT, 1, 0, 0, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 1, 0, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 0, 1, 0)                                                 \
+                                                                                \
+static const SwsOpTable ops8##EXT = {                                           \
+    .cpu_flags = AV_CPU_FLAG_##FLAG,                                            \
+    .block_size = SIZE,                                                         \
+    .entries = {                                                                \
+        &op_read_planar1##EXT,                                                  \
+        &op_read_planar2##EXT,                                                  \
+        &op_read_planar3##EXT,                                                  \
+        &op_read_planar4##EXT,                                                  \
+        &op_write_planar1##EXT,                                                 \
+        &op_write_planar2##EXT,                                                 \
+        &op_write_planar3##EXT,                                                 \
+        &op_write_planar4##EXT,                                                 \
+        &op_read8_packed2##EXT,                                                 \
+        &op_read8_packed3##EXT,                                                 \
+        &op_read8_packed4##EXT,                                                 \
+        &op_write8_packed2##EXT,                                                \
+        &op_write8_packed3##EXT,                                                \
+        &op_write8_packed4##EXT,                                                \
+        &op_read_nibbles1##EXT,                                                 \
+        &op_read_bits1##EXT,                                                    \
+        &op_write_bits1##EXT,                                                   \
+        &op_pack_1210##EXT,                                                     \
+        &op_pack_3320##EXT,                                                     \
+        &op_pack_2330##EXT,                                                     \
+        &op_unpack_1210##EXT,                                                   \
+        &op_unpack_3320##EXT,                                                   \
+        &op_unpack_2330##EXT,                                                   \
+        &op_swizzle_3012##EXT,                                                  \
+        &op_swizzle_3021##EXT,                                                  \
+        &op_swizzle_2103##EXT,                                                  \
+        &op_swizzle_3210##EXT,                                                  \
+        &op_swizzle_3102##EXT,                                                  \
+        &op_swizzle_3201##EXT,                                                  \
+        &op_swizzle_1203##EXT,                                                  \
+        &op_swizzle_1023##EXT,                                                  \
+        &op_swizzle_2013##EXT,                                                  \
+        &op_swizzle_2310##EXT,                                                  \
+        &op_swizzle_2130##EXT,                                                  \
+        &op_swizzle_1230##EXT,                                                  \
+        &op_swizzle_1320##EXT,                                                  \
+        &op_swizzle_0213##EXT,                                                  \
+        &op_swizzle_0231##EXT,                                                  \
+        &op_swizzle_0312##EXT,                                                  \
+        &op_swizzle_3120##EXT,                                                  \
+        &op_swizzle_0321##EXT,                                                  \
+        &op_swizzle_0003##EXT,                                                  \
+        &op_swizzle_0001##EXT,                                                  \
+        &op_swizzle_3000##EXT,                                                  \
+        &op_swizzle_1000##EXT,                                                  \
+        &op_clear_alpha0##EXT,                                                  \
+        &op_clear_alpha1##EXT,                                                  \
+        &op_clear_alpha3##EXT,                                                  \
+        &op_clear_zero0##EXT,                                                   \
+        &op_clear_zero1##EXT,                                                   \
+        &op_clear_zero3##EXT,                                                   \
+        REF_PATTERN(clear##EXT, 1, 1, 1, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 1, 1, 1),                                    \
+        REF_PATTERN(clear##EXT, 0, 0, 1, 1),                                    \
+        REF_PATTERN(clear##EXT, 1, 0, 0, 1),                                    \
+        REF_PATTERN(clear##EXT, 1, 1, 0, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 1, 0, 1),                                    \
+        REF_PATTERN(clear##EXT, 1, 0, 1, 0),                                    \
+        REF_PATTERN(clear##EXT, 1, 0, 0, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 1, 0, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 0, 1, 0),                                    \
+        NULL                                                                    \
+    },                                                                          \
+};
+
+#define DECL_FUNCS_16(SIZE, EXT, FLAG)                                          \
+    DECL_PACKED_RW(EXT, 16)                                                     \
+    DECL_PACK_UNPACK(EXT, U16, 4, 4, 4, 0)                                      \
+    DECL_PACK_UNPACK(EXT, U16, 5, 5, 5, 0)                                      \
+    DECL_PACK_UNPACK(EXT, U16, 5, 6, 5, 0)                                      \
+    DECL_SWAP_BYTES(EXT, U16, 1, 0, 0, 0)                                       \
+    DECL_SWAP_BYTES(EXT, U16, 1, 0, 0, 1)                                       \
+    DECL_SWAP_BYTES(EXT, U16, 1, 1, 1, 0)                                       \
+    DECL_SWAP_BYTES(EXT, U16, 1, 1, 1, 1)                                       \
+    DECL_SHIFT16(EXT)                                                           \
+    DECL_CONVERT(EXT,  U8, U16)                                                 \
+    DECL_CONVERT(EXT, U16,  U8)                                                 \
+    DECL_EXPAND(EXT,   U8, U16)                                                 \
+                                                                                \
+static const SwsOpTable ops16##EXT = {                                          \
+    .cpu_flags = AV_CPU_FLAG_##FLAG,                                            \
+    .block_size = SIZE,                                                         \
+    .entries = {                                                                \
+        &op_read16_packed2##EXT,                                                \
+        &op_read16_packed3##EXT,                                                \
+        &op_read16_packed4##EXT,                                                \
+        &op_write16_packed2##EXT,                                               \
+        &op_write16_packed3##EXT,                                               \
+        &op_write16_packed4##EXT,                                               \
+        &op_pack_4440##EXT,                                                     \
+        &op_pack_5550##EXT,                                                     \
+        &op_pack_5650##EXT,                                                     \
+        &op_unpack_4440##EXT,                                                   \
+        &op_unpack_5550##EXT,                                                   \
+        &op_unpack_5650##EXT,                                                   \
+        REF_COMMON_PATTERNS(swap_bytes_U16##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U8_U16##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U16_U8##EXT),                               \
+        REF_COMMON_PATTERNS(expand_U8_U16##EXT),                                \
+        REF_COMMON_PATTERNS(lshift16##EXT),                                     \
+        REF_COMMON_PATTERNS(rshift16##EXT),                                     \
+        NULL                                                                    \
+    },                                                                          \
+};
+
+#define DECL_FUNCS_32(SIZE, EXT, FLAG)                                          \
+    DECL_PACKED_RW(_m2##EXT, 32)                                                \
+    DECL_PACK_UNPACK(_m2##EXT, U32, 10, 10, 10, 2)                              \
+    DECL_PACK_UNPACK(_m2##EXT, U32, 2, 10, 10, 10)                              \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 0, 0, 0)                                  \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 0, 0, 1)                                  \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 1, 1, 0)                                  \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 1, 1, 1)                                  \
+    DECL_CONVERT(EXT,  U8, U32)                                                 \
+    DECL_CONVERT(EXT, U32,  U8)                                                 \
+    DECL_CONVERT(EXT, U16, U32)                                                 \
+    DECL_CONVERT(EXT, U32, U16)                                                 \
+    DECL_CONVERT(EXT,  U8, F32)                                                 \
+    DECL_CONVERT(EXT, F32,  U8)                                                 \
+    DECL_CONVERT(EXT, U16, F32)                                                 \
+    DECL_CONVERT(EXT, F32, U16)                                                 \
+    DECL_EXPAND(EXT,   U8, U32)                                                 \
+    DECL_MIN_MAX(EXT)                                                           \
+    DECL_SCALE(EXT)                                                             \
+    DECL_DITHER(EXT, 0)                                                         \
+    DECL_DITHER(EXT, 1)                                                         \
+    DECL_DITHER(EXT, 2)                                                         \
+    DECL_DITHER(EXT, 3)                                                         \
+    DECL_DITHER(EXT, 4)                                                         \
+    DECL_DITHER(EXT, 5)                                                         \
+    DECL_DITHER(EXT, 6)                                                         \
+    DECL_DITHER(EXT, 7)                                                         \
+    DECL_DITHER(EXT, 8)                                                         \
+    DECL_LINEAR(EXT, luma,      SWS_MASK_LUMA)                                  \
+    DECL_LINEAR(EXT, alpha,     SWS_MASK_ALPHA)                                 \
+    DECL_LINEAR(EXT, lumalpha,  SWS_MASK_LUMA | SWS_MASK_ALPHA)                 \
+    DECL_LINEAR(EXT, dot3,      0b111)                                          \
+    DECL_LINEAR(EXT, row0,      SWS_MASK_ROW(0))                                \
+    DECL_LINEAR(EXT, row0a,     SWS_MASK_ROW(0) | SWS_MASK_ALPHA)               \
+    DECL_LINEAR(EXT, diag3,     SWS_MASK_DIAG3)                                 \
+    DECL_LINEAR(EXT, diag4,     SWS_MASK_DIAG4)                                 \
+    DECL_LINEAR(EXT, diagoff3,  SWS_MASK_DIAG3 | SWS_MASK_OFF3)                 \
+    DECL_LINEAR(EXT, matrix3,   SWS_MASK_MAT3)                                  \
+    DECL_LINEAR(EXT, affine3,   SWS_MASK_MAT3 | SWS_MASK_OFF3)                  \
+    DECL_LINEAR(EXT, affine3a,  SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA) \
+    DECL_LINEAR(EXT, matrix4,   SWS_MASK_MAT4)                                  \
+    DECL_LINEAR(EXT, affine4,   SWS_MASK_MAT4 | SWS_MASK_OFF4)                  \
+                                                                                \
+static const SwsOpTable ops32##EXT = {                                          \
+    .cpu_flags = AV_CPU_FLAG_##FLAG,                                            \
+    .block_size = SIZE,                                                         \
+    .entries = {                                                                \
+        &op_read32_packed2_m2##EXT,                                             \
+        &op_read32_packed3_m2##EXT,                                             \
+        &op_read32_packed4_m2##EXT,                                             \
+        &op_write32_packed2_m2##EXT,                                            \
+        &op_write32_packed3_m2##EXT,                                            \
+        &op_write32_packed4_m2##EXT,                                            \
+        &op_pack_1010102_m2##EXT,                                               \
+        &op_pack_2101010_m2##EXT,                                               \
+        &op_unpack_1010102_m2##EXT,                                             \
+        &op_unpack_2101010_m2##EXT,                                             \
+        REF_COMMON_PATTERNS(swap_bytes_U32_m2##EXT),                            \
+        REF_COMMON_PATTERNS(convert_U8_U32##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U32_U8##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U16_U32##EXT),                              \
+        REF_COMMON_PATTERNS(convert_U32_U16##EXT),                              \
+        REF_COMMON_PATTERNS(convert_U8_F32##EXT),                               \
+        REF_COMMON_PATTERNS(convert_F32_U8##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U16_F32##EXT),                              \
+        REF_COMMON_PATTERNS(convert_F32_U16##EXT),                              \
+        REF_COMMON_PATTERNS(expand_U8_U32##EXT),                                \
+        REF_COMMON_PATTERNS(min##EXT),                                          \
+        REF_COMMON_PATTERNS(max##EXT),                                          \
+        REF_COMMON_PATTERNS(scale##EXT),                                        \
+        REF_COMMON_PATTERNS(dither0##EXT),                                      \
+        REF_COMMON_PATTERNS(dither1##EXT),                                      \
+        REF_COMMON_PATTERNS(dither2##EXT),                                      \
+        REF_COMMON_PATTERNS(dither3##EXT),                                      \
+        REF_COMMON_PATTERNS(dither4##EXT),                                      \
+        REF_COMMON_PATTERNS(dither5##EXT),                                      \
+        REF_COMMON_PATTERNS(dither6##EXT),                                      \
+        REF_COMMON_PATTERNS(dither7##EXT),                                      \
+        REF_COMMON_PATTERNS(dither8##EXT),                                      \
+        &op_luma##EXT,                                                          \
+        &op_alpha##EXT,                                                         \
+        &op_lumalpha##EXT,                                                      \
+        &op_dot3##EXT,                                                          \
+        &op_row0##EXT,                                                          \
+        &op_row0a##EXT,                                                         \
+        &op_diag3##EXT,                                                         \
+        &op_diag4##EXT,                                                         \
+        &op_diagoff3##EXT,                                                      \
+        &op_matrix3##EXT,                                                       \
+        &op_affine3##EXT,                                                       \
+        &op_affine3a##EXT,                                                      \
+        &op_matrix4##EXT,                                                       \
+        &op_affine4##EXT,                                                       \
+        NULL                                                                    \
+    },                                                                          \
+};
+
+DECL_FUNCS_8(16, _m1_sse4, SSE4)
+DECL_FUNCS_8(32, _m1_avx2, AVX2)
+DECL_FUNCS_8(32, _m2_sse4, SSE4)
+DECL_FUNCS_8(64, _m2_avx2, AVX2)
+
+DECL_FUNCS_16(16, _m1_avx2, AVX2)
+DECL_FUNCS_16(32, _m2_avx2, AVX2)
+
+DECL_FUNCS_32(16, _avx2, AVX2)
+
+static av_const int get_mmsize(const int cpu_flags)
+{
+    if (cpu_flags & AV_CPU_FLAG_AVX512)
+        return 64;
+    else if (cpu_flags & AV_CPU_FLAG_AVX2)
+        return 32;
+    else if (cpu_flags & AV_CPU_FLAG_SSE4)
+        return 16;
+    else
+        return AVERROR(ENOTSUP);
+}
+
+/**
+ * Returns true if the operation's implementation depends only on the block
+ * size, and not on the underlying pixel type.
+ */
+static bool op_is_type_invariant(const SwsOp *op)
+{
+    switch (op->op) {
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+        return !op->rw.packed && !op->rw.frac;
+    case SWS_OP_SWIZZLE:
+    case SWS_OP_CLEAR:
+        return true;
+    }
+
+    return false;
+}
+
+static int solve_shuffle(const SwsOpList *ops, int mmsize, SwsCompiledOp *out)
+{
+    uint8_t shuffle[16];
+    int read_bytes, write_bytes;
+    int pixels;
+
+    pixels = ff_sws_solve_shuffle(ops, shuffle, 16, 0x80, &read_bytes, &write_bytes);
+    if (pixels < 0)
+        return pixels;
+
+    if (read_bytes < 16 || write_bytes < 16)
+        mmsize = 16; /* avoid cross-lane shuffle */
+
+    const int num_lanes = mmsize / 16;
+    const int in_total  = num_lanes * read_bytes;
+    const int out_total = num_lanes * write_bytes;
+    const int read_size = in_total <= 4 ? 4 : in_total <= 8 ? 8 : mmsize;
+    *out = (SwsCompiledOp) {
+        .priv       = av_memdup(shuffle, sizeof(shuffle)),
+        .free       = av_free,
+        .block_size = pixels * num_lanes,
+        .over_read  = read_size - in_total,
+        .over_write = mmsize - out_total,
+        .cpu_flags  = mmsize > 32 ? AV_CPU_FLAG_AVX512 :
+                      mmsize > 16 ? AV_CPU_FLAG_AVX2 :
+                                    AV_CPU_FLAG_SSE4,
+    };
+
+    if (!out->priv)
+        return AVERROR(ENOMEM);
+
+#define ASSIGN_SHUFFLE_FUNC(IN, OUT, EXT)                                       \
+do {                                                                            \
+    SWS_DECL_FUNC(ff_packed_shuffle##IN##_##OUT##_##EXT);                       \
+    if (in_total == IN && out_total == OUT)                                     \
+        out->func = ff_packed_shuffle##IN##_##OUT##_##EXT;                      \
+} while (0)
+
+    ASSIGN_SHUFFLE_FUNC( 5, 15, sse4);
+    ASSIGN_SHUFFLE_FUNC( 4, 16, sse4);
+    ASSIGN_SHUFFLE_FUNC( 2, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(10, 15, sse4);
+    ASSIGN_SHUFFLE_FUNC( 8, 16, sse4);
+    ASSIGN_SHUFFLE_FUNC( 4, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(15, 15, sse4);
+    ASSIGN_SHUFFLE_FUNC(12, 16, sse4);
+    ASSIGN_SHUFFLE_FUNC( 6, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(16, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(16, 16, sse4);
+    ASSIGN_SHUFFLE_FUNC( 8, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(12, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(32, 32, avx2);
+    ASSIGN_SHUFFLE_FUNC(64, 64, avx512);
+    av_assert1(out->func);
+    return 0;
+}
+
+/* Normalize clear values into 32-bit integer constants */
+static void normalize_clear(SwsOp *op)
+{
+    static_assert(sizeof(uint32_t) == sizeof(int), "int size mismatch");
+    SwsOpPriv priv;
+    union {
+        uint32_t u32;
+        int i;
+    } c;
+
+    ff_sws_setup_q4(op, &priv);
+    for (int i = 0; i < 4; i++) {
+        if (!op->c.q4[i].den)
+            continue;
+        switch (ff_sws_pixel_type_size(op->type)) {
+        case 1: c.u32 = 0x1010101 * priv.u8[i]; break;
+        case 2: c.u32 = priv.u16[i] << 16 | priv.u16[i]; break;
+        case 4: c.u32 = priv.u32[i]; break;
+        }
+
+        op->c.q4[i].num = c.i;
+        op->c.q4[i].den = 1;
+    }
+}
+
+static int compile(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out)
+{
+    const int cpu_flags = av_get_cpu_flags();
+    const int mmsize = get_mmsize(cpu_flags);
+    if (mmsize < 0)
+        return mmsize;
+
+    av_assert1(ops->num_ops > 0);
+    const SwsOp read = ops->ops[0];
+    const SwsOp write = ops->ops[ops->num_ops - 1];
+    int ret;
+
+    /* Special fast path for in-place packed shuffle */
+    ret = solve_shuffle(ops, mmsize, out);
+    if (ret != AVERROR(ENOTSUP))
+        return ret;
+
+    SwsOpChain *chain = ff_sws_op_chain_alloc();
+    if (!chain)
+        return AVERROR(ENOMEM);
+
+    *out = (SwsCompiledOp) {
+        .priv = chain,
+        .free = (void (*)(void *)) ff_sws_op_chain_free,
+
+        /* Use at most two full YMM regs during the widest precision section */
+        .block_size = 2 * FFMIN(mmsize, 32) / ff_sws_op_list_max_size(ops),
+    };
+
+    /* 3-component reads/writes process one extra garbage word */
+    if (read.rw.packed && read.rw.elems == 3)
+        out->over_read = sizeof(uint32_t);
+    if (write.rw.packed && write.rw.elems == 3)
+        out->over_write = sizeof(uint32_t);
+
+    static const SwsOpTable *const tables[] = {
+        &ops8_m1_sse4,
+        &ops8_m1_avx2,
+        &ops8_m2_sse4,
+        &ops8_m2_avx2,
+        &ops16_m1_avx2,
+        &ops16_m2_avx2,
+        &ops32_avx2,
+    };
+
+    do {
+        int op_block_size = out->block_size;
+        SwsOp *op = &ops->ops[0];
+
+        if (op_is_type_invariant(op)) {
+            if (op->op == SWS_OP_CLEAR)
+                normalize_clear(op);
+            op_block_size *= ff_sws_pixel_type_size(op->type);
+            op->type = SWS_PIXEL_U8;
+        }
+
+        ret = ff_sws_op_compile_tables(tables, FF_ARRAY_ELEMS(tables), ops,
+                                       op_block_size, chain);
+    } while (ret == AVERROR(EAGAIN));
+    if (ret < 0) {
+        ff_sws_op_chain_free(chain);
+        return ret;
+    }
+
+    SWS_DECL_FUNC(ff_sws_process1_sse4);
+    SWS_DECL_FUNC(ff_sws_process2_sse4);
+    SWS_DECL_FUNC(ff_sws_process3_sse4);
+    SWS_DECL_FUNC(ff_sws_process4_sse4);
+
+    const int read_planes  = read.rw.packed  ? 1 : read.rw.elems;
+    const int write_planes = write.rw.packed ? 1 : write.rw.elems;
+    switch (FFMAX(read_planes, write_planes)) {
+    case 1: out->func = ff_sws_process1_sse4; break;
+    case 2: out->func = ff_sws_process2_sse4; break;
+    case 3: out->func = ff_sws_process3_sse4; break;
+    case 4: out->func = ff_sws_process4_sse4; break;
+    }
+
+    out->cpu_flags = chain->cpu_flags;
+    return ret;
+}
+
+SwsOpBackend backend_x86 = {
+    .name       = "x86",
+    .compile    = compile,
+};
diff --git a/libswscale/x86/ops_common.asm b/libswscale/x86/ops_common.asm
new file mode 100644
index 0000000000..400bdfd3bf
--- /dev/null
+++ b/libswscale/x86/ops_common.asm
@@ -0,0 +1,187 @@
+;******************************************************************************
+;* Copyright (c) 2025 Niklas Haas
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "libavutil/x86/x86util.asm"
+
+struc SwsOpExec
+    .in0 resq 1
+    .in1 resq 1
+    .in2 resq 1
+    .in3 resq 1
+    .out0 resq 1
+    .out1 resq 1
+    .out2 resq 1
+    .out3 resq 1
+    .in_stride0 resq 1
+    .in_stride1 resq 1
+    .in_stride2 resq 1
+    .in_stride3 resq 1
+    .out_stride0 resq 1
+    .out_stride1 resq 1
+    .out_stride2 resq 1
+    .out_stride3 resq 1
+    .width resd 1
+    .height resd 1
+    .slice_y resd 1
+    .slice_h resd 1
+    .pixel_bits_in resd 1
+    .pixel_bits_out resd 1
+endstruc
+
+struc SwsOpImpl
+    .cont resb 16
+    .priv resb 16
+    .next resb 0
+endstruc
+
+; common macros for declaring operations
+%macro op 1 ; name
+    %ifdef X
+        %define ADD_PAT(name) p %+ X %+ Y %+ Z %+ W %+ _ %+ name
+    %else
+        %define ADD_PAT(name) name
+    %endif
+
+    %ifdef V2
+        %if V2
+            %define ADD_MUL(name) name %+ _m2
+        %else
+            %define ADD_MUL(name) name %+ _m1
+        %endif
+    %else
+        %define ADD_MUL(name) name
+    %endif
+
+    cglobal ADD_PAT(ADD_MUL(%1)), 0, 0, 0 ; already allocated by entry point
+
+    %undef ADD_PAT
+    %undef ADD_MUL
+%endmacro
+
+%macro decl_v2 2+ ; v2, func
+    %xdefine V2 %1
+    %2
+    %undef V2
+%endmacro
+
+%macro decl_pattern 5+ ; X, Y, Z, W, func
+    %xdefine X %1
+    %xdefine Y %2
+    %xdefine Z %3
+    %xdefine W %4
+    %5
+    %undef X
+    %undef Y
+    %undef Z
+    %undef W
+%endmacro
+
+%macro decl_common_patterns 1+ ; func
+    decl_pattern 1, 0, 0, 0, %1 ; y
+    decl_pattern 1, 0, 0, 1, %1 ; ya
+    decl_pattern 1, 1, 1, 0, %1 ; yuv
+    decl_pattern 1, 1, 1, 1, %1 ; yuva
+%endmacro
+
+; common names for the internal calling convention
+%define mx      m0
+%define my      m1
+%define mz      m2
+%define mw      m3
+
+%define xmx     xm0
+%define xmy     xm1
+%define xmz     xm2
+%define xmw     xm3
+
+%define ymx     ym0
+%define ymy     ym1
+%define ymz     ym2
+%define ymw     ym3
+
+%define mx2     m4
+%define my2     m5
+%define mz2     m6
+%define mw2     m7
+
+%define xmx2    xm4
+%define xmy2    xm5
+%define xmz2    xm6
+%define xmw2    xm7
+
+%define ymx2    ym4
+%define ymy2    ym5
+%define ymz2    ym6
+%define ymw2    ym7
+
+; from entry point signature
+%define execq   r0q
+%define implq   r1q
+%define bxd     r2d
+%define yd      r3d
+%define bxendd  r4d
+
+; extra registers for free use by kernels, not saved between ops
+%define tmp0q   r5q
+%define tmp1q   r6q
+
+%define tmp0d   r5d
+%define tmp1d   r6d
+
+; pinned static registers for plane pointers, incremented by read/write ops
+%define  in0q   r7q
+%define out0q   r8q
+%define  in1q   r9q
+%define out1q   r10q
+%define  in2q   r11q
+%define out2q   r12q
+%define  in3q   r13q
+%define out3q   r14q
+
+; load the next operation kernel
+%macro LOAD_CONT 1 ; reg
+    mov %1, [implq + SwsOpImpl.cont]
+%endmacro
+
+; tail call into the next operation kernel
+%macro CONTINUE 1 ; reg
+    add implq, SwsOpImpl.next
+    jmp %1
+    annotate_function_size
+%endmacro
+
+%macro CONTINUE 0
+    LOAD_CONT tmp0q
+    CONTINUE tmp0q
+%endmacro
+
+; helper for inline conditionals
+%macro IF 2+ ; cond, body
+    %if %1
+        %2
+    %endif
+%endmacro
+
+; alternate name for nested usage to work around some NASM bugs
+%macro IF1 2+
+    %if %1
+        %2
+    %endif
+%endmacro
diff --git a/libswscale/x86/ops_float.asm b/libswscale/x86/ops_float.asm
new file mode 100644
index 0000000000..9077b266be
--- /dev/null
+++ b/libswscale/x86/ops_float.asm
@@ -0,0 +1,386 @@
+;******************************************************************************
+;* Copyright (c) 2025 Niklas Haas
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "ops_common.asm"
+
+SECTION .text
+
+;---------------------------------------------------------
+; Pixel type conversions
+
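+; All float kernels process two vector halves (mx..mw and mx2..mw2) per
+; iteration, i.e. 2 * mmsize / 4 pixels per component.
+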
+%macro conv8to32f 0
+op convert_U8_F32
+        LOAD_CONT tmp0q
+IF X,   vpsrldq xmx2, xmx, 8
+IF Y,   vpsrldq xmy2, xmy, 8
+IF Z,   vpsrldq xmz2, xmz, 8
+IF W,   vpsrldq xmw2, xmw, 8
+IF X,   pmovzxbd mx, xmx
+IF Y,   pmovzxbd my, xmy
+IF Z,   pmovzxbd mz, xmz
+IF W,   pmovzxbd mw, xmw
+IF X,   pmovzxbd mx2, xmx2
+IF Y,   pmovzxbd my2, xmy2
+IF Z,   pmovzxbd mz2, xmz2
+IF W,   pmovzxbd mw2, xmw2
+IF X,   vcvtdq2ps mx, mx
+IF Y,   vcvtdq2ps my, my
+IF Z,   vcvtdq2ps mz, mz
+IF W,   vcvtdq2ps mw, mw
+IF X,   vcvtdq2ps mx2, mx2
+IF Y,   vcvtdq2ps my2, my2
+IF Z,   vcvtdq2ps mz2, mz2
+IF W,   vcvtdq2ps mw2, mw2
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv16to32f 0
+op convert_U16_F32
+        LOAD_CONT tmp0q
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   pmovzxwd mx, xmx
+IF Y,   pmovzxwd my, xmy
+IF Z,   pmovzxwd mz, xmz
+IF W,   pmovzxwd mw, xmw
+IF X,   pmovzxwd mx2, xmx2
+IF Y,   pmovzxwd my2, xmy2
+IF Z,   pmovzxwd mz2, xmz2
+IF W,   pmovzxwd mw2, xmw2
+IF X,   vcvtdq2ps mx, mx
+IF Y,   vcvtdq2ps my, my
+IF Z,   vcvtdq2ps mz, mz
+IF W,   vcvtdq2ps mw, mw
+IF X,   vcvtdq2ps mx2, mx2
+IF Y,   vcvtdq2ps my2, my2
+IF Z,   vcvtdq2ps mz2, mz2
+IF W,   vcvtdq2ps mw2, mw2
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32fto8 0
+op convert_F32_U8
+        LOAD_CONT tmp0q
+IF X,   cvttps2dq mx, mx
+IF Y,   cvttps2dq my, my
+IF Z,   cvttps2dq mz, mz
+IF W,   cvttps2dq mw, mw
+IF X,   cvttps2dq mx2, mx2
+IF Y,   cvttps2dq my2, my2
+IF Z,   cvttps2dq mz2, mz2
+IF W,   cvttps2dq mw2, mw2
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   packuswb xmx, xmx2
+IF Y,   packuswb xmy, xmy2
+IF Z,   packuswb xmz, xmz2
+IF W,   packuswb xmw, xmw2
+IF X,   vpshufd xmx, xmx, q3120
+IF Y,   vpshufd xmy, xmy, q3120
+IF Z,   vpshufd xmz, xmz, q3120
+IF W,   vpshufd xmw, xmw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32fto16 0
+op convert_F32_U16
+        LOAD_CONT tmp0q
+IF X,   cvttps2dq mx, mx
+IF Y,   cvttps2dq my, my
+IF Z,   cvttps2dq mz, mz
+IF W,   cvttps2dq mw, mw
+IF X,   cvttps2dq mx2, mx2
+IF Y,   cvttps2dq my2, my2
+IF Z,   cvttps2dq mz2, mz2
+IF W,   cvttps2dq mw2, mw2
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vpermq mx, mx, q3120
+IF Y,   vpermq my, my, q3120
+IF Z,   vpermq mz, mz, q3120
+IF W,   vpermq mw, mw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+%macro min_max 0
+op min
+IF X,   vbroadcastss m8,  [implq + SwsOpImpl.priv + 0]
+IF Y,   vbroadcastss m9,  [implq + SwsOpImpl.priv + 4]
+IF Z,   vbroadcastss m10, [implq + SwsOpImpl.priv + 8]
+IF W,   vbroadcastss m11, [implq + SwsOpImpl.priv + 12]
+        LOAD_CONT tmp0q
+IF X,   minps mx, m8
+IF Y,   minps my, m9
+IF Z,   minps mz, m10
+IF W,   minps mw, m11
+IF X,   minps mx2, m8
+IF Y,   minps my2, m9
+IF Z,   minps mz2, m10
+IF W,   minps mw2, m11
+        CONTINUE tmp0q
+
+op max
+IF X,   vbroadcastss m8,  [implq + SwsOpImpl.priv + 0]
+IF Y,   vbroadcastss m9,  [implq + SwsOpImpl.priv + 4]
+IF Z,   vbroadcastss m10, [implq + SwsOpImpl.priv + 8]
+IF W,   vbroadcastss m11, [implq + SwsOpImpl.priv + 12]
+        LOAD_CONT tmp0q
+IF X,   maxps mx, m8
+IF Y,   maxps my, m9
+IF Z,   maxps mz, m10
+IF W,   maxps mw, m11
+IF X,   maxps mx2, m8
+IF Y,   maxps my2, m9
+IF Z,   maxps mz2, m10
+IF W,   maxps mw2, m11
+        CONTINUE tmp0q
+%endmacro
+
+%macro scale 0
+op scale
+        vbroadcastss m8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   mulps mx, m8
+IF Y,   mulps my, m8
+IF Z,   mulps mz, m8
+IF W,   mulps mw, m8
+IF X,   mulps mx2, m8
+IF Y,   mulps my2, m8
+IF Z,   mulps mz2, m8
+IF W,   mulps mw2, m8
+        CONTINUE tmp0q
+%endmacro
+
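+; computes tmp0q = ((y + phase) & (size - 1)) * sizeof(float[size]) and loads
+; one row of the dither matrix; the 16-byte rows of a 4x4 matrix are
+; broadcast to fill the whole vector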
+%macro load_dither_row 5 ; size_log2, y, addr, out, out2
+        lea tmp0q, %2
+        and tmp0q, (1 << %1) - 1
+        shl tmp0q, %1+2
+%if %1 == 2
+        VBROADCASTI128 %4, [%3 + tmp0q]
+%else
+        mova %4, [%3 + tmp0q]
+    %if (4 << %1) > mmsize
+        mova %5, [%3 + tmp0q + mmsize]
+    %endif
+%endif
+%endmacro
+
+%macro dither 1 ; size_log2
+op dither%1
+        %define DX  m8
+        %define DY  m9
+        %define DZ  m10
+        %define DW  m11
+        %define DX2 DX
+        %define DY2 DY
+        %define DZ2 DZ
+        %define DW2 DW
+%if %1 == 0
+        ; constant offset for all channels
+        vbroadcastss DX, [implq + SwsOpImpl.priv]
+        %define DY DX
+        %define DZ DX
+        %define DW DX
+%elif %1 == 1
+        ; 2x2 matrix, only sign of y matters
+        mov tmp0d, yd
+        and tmp0d, 1
+        shl tmp0d, 3
+    %if X || Z
+        vbroadcastsd DX, [implq + SwsOpImpl.priv + tmp0q]
+    %endif
+    %if Y || W
+        xor tmp0d, 8
+        vbroadcastsd DY, [implq + SwsOpImpl.priv + tmp0q]
+    %endif
+        %define DZ DX
+        %define DW DY
+%else
+        ; matrix is at least 4x4, load all four channels with custom offset
+    %if (4 << %1) > mmsize
+        %define DX2 m12
+        %define DY2 m13
+        %define DZ2 m14
+        %define DW2 m15
+    %endif
+        mov tmp1q, [implq + SwsOpImpl.priv]
+    %if (4 << %1) > 2 * mmsize
+        ; need to add in x offset
+        mov tmp0d, bxd
+        shl tmp0d, 6 ; sizeof(float[16])
+        and tmp0d, (4 << %1) - 1
+        add tmp1q, tmp0q
+    %endif
+IF X,   load_dither_row %1, [yd + 0], tmp1q, DX, DX2
+IF Y,   load_dither_row %1, [yd + 3], tmp1q, DY, DY2
+IF Z,   load_dither_row %1, [yd + 2], tmp1q, DZ, DZ2
+IF W,   load_dither_row %1, [yd + 5], tmp1q, DW, DW2
+%endif
+        LOAD_CONT tmp0q
+IF X,   addps mx, DX
+IF Y,   addps my, DY
+IF Z,   addps mz, DZ
+IF W,   addps mw, DW
+IF X,   addps mx2, DX2
+IF Y,   addps my2, DY2
+IF Z,   addps mz2, DZ2
+IF W,   addps mw2, DW2
+        CONTINUE tmp0q
+%endmacro
+
+%macro dither_fns 0
+        dither 0
+        dither 1
+        dither 2
+        dither 3
+        dither 4
+        dither 5
+        dither 6
+        dither 7
+        dither 8
+%endmacro
+
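+; Each row I of the 4x5 linear transform occupies 5 consecutive mask bits:
+; bit (5*I + J) selects matrix weight (I, J), bit (5*I + 4) the row offset.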
+%xdefine MASK(I, J)  (1 << (5 * (I) + (J)))
+%xdefine MASK_OFF(I) MASK(I, 4)
+%xdefine MASK_ROW(I) (0b11111 << (5 * (I)))
+%xdefine MASK_COL(J) (0b1000010000100001 << J)
+%xdefine MASK_ALL    ((1 << 20) - 1)
+%xdefine MASK_LUMA   MASK(0, 0) | MASK_OFF(0)
+%xdefine MASK_ALPHA  MASK(3, 3) | MASK_OFF(3)
+%xdefine MASK_DIAG3  MASK(0, 0) | MASK(1, 1) | MASK(2, 2)
+%xdefine MASK_OFF3   MASK_OFF(0) | MASK_OFF(1) | MASK_OFF(2)
+%xdefine MASK_MAT3   MASK(0, 0) | MASK(0, 1) | MASK(0, 2) |\
+                     MASK(1, 0) | MASK(1, 1) | MASK(1, 2) |\
+                     MASK(2, 0) | MASK(2, 1) | MASK(2, 2)
+%xdefine MASK_DIAG4  MASK_DIAG3 | MASK(3, 3)
+%xdefine MASK_OFF4   MASK_OFF3 | MASK_OFF(3)
+%xdefine MASK_MAT4   MASK_ALL & ~MASK_OFF4
+
+%macro linear_row 7 ; res, x, y, z, w, row, mask
+%define COL(J) ((%7) & MASK(%6, J)) ; true if mask contains component J
+%define NOP(J) (J == %6 && !COL(J)) ; true if J is untouched input component
+
+    ; load weights
+    IF COL(0),  vbroadcastss m12,  [tmp0q + %6 * 20 + 0]
+    IF COL(1),  vbroadcastss m13,  [tmp0q + %6 * 20 + 4]
+    IF COL(2),  vbroadcastss m14,  [tmp0q + %6 * 20 + 8]
+    IF COL(3),  vbroadcastss m15,  [tmp0q + %6 * 20 + 12]
+
+    ; initialize result vector as appropriate
+    %if COL(4) ; offset
+        vbroadcastss %1, [tmp0q + %6 * 20 + 16]
+    %elif NOP(0)
+        ; directly reuse first component vector if possible
+        mova %1, %2
+    %else
+        xorps %1, %1
+    %endif
+
+    IF COL(0),  mulps m12, %2
+    IF COL(1),  mulps m13, %3
+    IF COL(2),  mulps m14, %4
+    IF COL(3),  mulps m15, %5
+    IF COL(0),  addps %1, m12
+    IF NOP(0) && COL(4), addps %1, %2 ; first vector was not reused
+    IF COL(1),  addps %1, m13
+    IF NOP(1),  addps %1, %3
+    IF COL(2),  addps %1, m14
+    IF NOP(2),  addps %1, %4
+    IF COL(3),  addps %1, m15
+    IF NOP(3),  addps %1, %5
+%endmacro
+
+%macro linear_inner 5 ; x, y, z, w, mask
+    %define ROW(I) ((%5) & MASK_ROW(I))
+    IF1 ROW(0), linear_row m8,  %1, %2, %3, %4, 0, %5
+    IF1 ROW(1), linear_row m9,  %1, %2, %3, %4, 1, %5
+    IF1 ROW(2), linear_row m10, %1, %2, %3, %4, 2, %5
+    IF1 ROW(3), linear_row m11, %1, %2, %3, %4, 3, %5
+    IF ROW(0),  mova %1, m8
+    IF ROW(1),  mova %2, m9
+    IF ROW(2),  mova %3, m10
+    IF ROW(3),  mova %4, m11
+%endmacro
+
+%macro linear_mask 2 ; name, mask
+op %1
+        mov tmp0q, [implq + SwsOpImpl.priv] ; address of matrix
+        linear_inner mx,  my,  mz,  mw,  %2
+        linear_inner mx2, my2, mz2, mw2, %2
+        CONTINUE
+%endmacro
+
+; specialized functions for very simple cases
+%macro linear_dot3 0
+op dot3
+        mov tmp0q, [implq + SwsOpImpl.priv]
+        vbroadcastss m12,  [tmp0q + 0]
+        vbroadcastss m13,  [tmp0q + 4]
+        vbroadcastss m14,  [tmp0q + 8]
+        LOAD_CONT tmp0q
+        mulps mx, m12
+        mulps m8, my, m13
+        mulps m9, mz, m14
+        addps mx, m8
+        addps mx, m9
+        mulps mx2, m12
+        mulps m10, my2, m13
+        mulps m11, mz2, m14
+        addps mx2, m10
+        addps mx2, m11
+        CONTINUE tmp0q
+%endmacro
+
+%macro linear_fns 0
+        linear_dot3
+        linear_mask luma,       MASK_LUMA
+        linear_mask alpha,      MASK_ALPHA
+        linear_mask lumalpha,   MASK_LUMA | MASK_ALPHA
+        linear_mask row0,       MASK_ROW(0)
+        linear_mask row0a,      MASK_ROW(0) | MASK_ALPHA
+        linear_mask diag3,      MASK_DIAG3
+        linear_mask diag4,      MASK_DIAG4
+        linear_mask diagoff3,   MASK_DIAG3 | MASK_OFF3
+        linear_mask matrix3,    MASK_MAT3
+        linear_mask affine3,    MASK_MAT3 | MASK_OFF3
+        linear_mask affine3a,   MASK_MAT3 | MASK_OFF3 | MASK_ALPHA
+        linear_mask matrix4,    MASK_MAT4
+        linear_mask affine4,    MASK_MAT4 | MASK_OFF4
+%endmacro
+
+INIT_YMM avx2
+decl_common_patterns conv8to32f
+decl_common_patterns conv16to32f
+decl_common_patterns conv32fto8
+decl_common_patterns conv32fto16
+decl_common_patterns min_max
+decl_common_patterns scale
+decl_common_patterns dither_fns
+linear_fns
diff --git a/libswscale/x86/ops_int.asm b/libswscale/x86/ops_int.asm
new file mode 100644
index 0000000000..ca5a483a2c
--- /dev/null
+++ b/libswscale/x86/ops_int.asm
@@ -0,0 +1,1050 @@
+;******************************************************************************
+;* Copyright (c) 2025 Niklas Haas
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "ops_common.asm"
+
+SECTION_RODATA
+
+expand16_shuf:  db   0,  0,  2,  2,  4,  4,  6,  6,  8,  8, 10, 10, 12, 12, 14, 14
+expand32_shuf:  db   0,  0,  0,  0,  4,  4,  4,  4,  8,  8,  8,  8, 12, 12, 12, 12
+
+read8_unpack2:  db   0,  2,  4,  6,  8, 10, 12, 14,  1,  3,  5,  7,  9, 11, 13, 15
+read8_unpack3:  db   0,  3,  6,  9,  1,  4,  7, 10,  2,  5,  8, 11, -1, -1, -1, -1
+read8_unpack4:  db   0,  4,  8, 12,  1,  5,  9, 13,  2,  6, 10, 14,  3,  7, 11, 15
+read16_unpack2: db   0,  1,  4,  5,  8,  9, 12, 13,  2,  3,  6,  7, 10, 11, 14, 15
+read16_unpack3: db   0,  1,  6,  7,  2,  3,  8,  9,  4,  5, 10, 11, -1, -1, -1, -1
+read16_unpack4: db   0,  1,  8,  9,  2,  3, 10, 11,  4,  5, 12, 13,  6,  7, 14, 15
+write8_pack2:   db   0,  8,  1,  9,  2, 10,  3, 11,  4, 12,  5, 13,  6, 14,  7, 15
+write8_pack3:   db   0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11, -1, -1, -1, -1
+write16_pack3:  db   0,  1,  4,  5,  8,  9,  2,  3,  6,  7, 10, 11, -1, -1, -1, -1
+
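+; packing is the inverse permutation of unpacking; for these layouts the
+; inverse already exists as another table (read8_unpack4 is a 4x4 byte
+; transpose and thus its own inverse)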
+%define write8_pack4  read8_unpack4
+%define write16_pack4 read16_unpack2
+%define write16_pack2 read16_unpack4
+
+align 32
+bits_shuf:      db   0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1, \
+                     2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,  3
+bits_mask:      db 128, 64, 32, 16,  8,  4,  2,  1,128, 64, 32, 16,  8,  4,  2,  1
+bits_reverse:   db   7,  6,  5,  4,  3,  2,  1,  0, 15, 14, 13, 12, 11, 10,  9,  8
+
+mask1: times 32 db 0x01
+mask2: times 32 db 0x03
+mask3: times 32 db 0x07
+mask4: times 32 db 0x0F
+
+SECTION .text
+
+;---------------------------------------------------------
+; Global entry point
+
+%macro prep_addr 3 ; num_planes, dstp, srcp
+    %if %1 == 1
+        mov tmp0q, [%3]
+        mov [%2], tmp0q
+    %elif %1 == 2
+        mova xm0, [%3]
+        mova [%2], xm0
+    %else
+        mova xm0, [%3]
+        mova xm1, [%3 + 16]
+        mova [%2], xm0
+        mova [%2 + 16], xm1
+    %endif
+%endmacro
+
+%macro incr_addr 3 ; num_planes, addrp, stridep
+    %if %1 == 1
+        mov tmp0q, [%2]
+        add tmp0q, [%3]
+        mov [%2], tmp0q
+    %elif %1 == 2
+        mova xm0, [%2]
+        paddq xm0, [%3]
+        mova [%2], xm0
+    %else
+        mova xm0, [%2]
+        mova xm1, [%2 + 16]
+        paddq xm0, [%3]
+        paddq xm1, [%3]
+        mova [%2], xm0
+        mova [%2 + 16], xm1
+    %endif
+%endmacro
+
+%macro process_fn 1 ; num_planes
+cglobal sws_process%1, 6, 7 + 2 * %1, 16
+            ; Args:
+            ;   execq, implq, bxd, yd, bxendd as defined in ops_common.asm
+            ;   tmp0d initially holds y_end, will be pushed to stack
+            ; Stack layout:
+            ;   [rsp +  0] = [qword] in0
+            ;   [rsp +  8] = [qword] in1
+            ;   [rsp + 16] = [qword] in2
+            ;   [rsp + 24] = [qword] in3
+            ;   [rsp + 32] = [qword] out0
+            ;   [rsp + 40] = [qword] out1
+            ;   [rsp + 48] = [qword] out2
+            ;   [rsp + 56] = [qword] out3
+            ;   [rsp + 64] = [qword] saved impl
+            ;   [rsp + 72] = [dword] saved bx start
+            ;   [rsp + 76] = [dword] saved y end
+            ;   [rsp + 80] = [qword] saved rsp
+            mov tmp1q, rsp
+            sub rsp, 88
+            and rsp, -32
+            mov [rsp + 64], implq
+            mov [rsp + 72], bxd
+            mov [rsp + 76], tmp0d
+            mov [rsp + 80], tmp1q
+            prep_addr %1, rsp,      execq + SwsOpExec.in0
+            prep_addr %1, rsp + 32, execq + SwsOpExec.out0
+.outer:
+            ; set up static registers
+            mov in0q,  [rsp +  0]
+IF %1 > 1,  mov in1q,  [rsp +  8]
+IF %1 > 2,  mov in2q,  [rsp + 16]
+IF %1 > 3,  mov in3q,  [rsp + 24]
+            mov out0q, [rsp + 32]
+IF %1 > 1,  mov out1q, [rsp + 40]
+IF %1 > 2,  mov out2q, [rsp + 48]
+IF %1 > 3,  mov out3q, [rsp + 56]
+.inner:
+            mov tmp0q, [implq + SwsOpImpl.cont]
+            add implq, SwsOpImpl.next
+            call tmp0q
+            mov implq, [rsp + 64]
+            inc bxd
+            cmp bxd, bxendd
+            jne .inner
+            inc yd
+            cmp yd, [rsp + 76]
+            je .end
+            incr_addr %1, rsp,      execq + SwsOpExec.in_stride0
+            incr_addr %1, rsp + 32, execq + SwsOpExec.out_stride0
+            mov bxd, [rsp + 72]
+            jmp .outer
+
+.end:
+            ; clean up
+            mov rsp, [rsp + 80]
+            RET
+%endmacro
+
+;---------------------------------------------------------
+; Planar reads / writes
+
+%macro read_planar 1 ; elems
+op read_planar%1
+            movu mx, [in0q]
+IF %1 > 1,  movu my, [in1q]
+IF %1 > 2,  movu mz, [in2q]
+IF %1 > 3,  movu mw, [in3q]
+%if V2
+            movu mx2, [in0q + mmsize]
+IF %1 > 1,  movu my2, [in1q + mmsize]
+IF %1 > 2,  movu mz2, [in2q + mmsize]
+IF %1 > 3,  movu mw2, [in3q + mmsize]
+%endif
+            LOAD_CONT tmp0q
+            add in0q, mmsize * (1 + V2)
+IF %1 > 1,  add in1q, mmsize * (1 + V2)
+IF %1 > 2,  add in2q, mmsize * (1 + V2)
+IF %1 > 3,  add in3q, mmsize * (1 + V2)
+            CONTINUE tmp0q
+%endmacro
+
+%macro write_planar 1 ; elems
+op write_planar%1
+            movu [out0q], mx
+IF %1 > 1,  movu [out1q], my
+IF %1 > 2,  movu [out2q], mz
+IF %1 > 3,  movu [out3q], mw
+%if V2
+            movu [out0q + mmsize], mx2
+IF %1 > 1,  movu [out1q + mmsize], my2
+IF %1 > 2,  movu [out2q + mmsize], mz2
+IF %1 > 3,  movu [out3q + mmsize], mw2
+%endif
+            add out0q, mmsize * (1 + V2)
+IF %1 > 1,  add out1q, mmsize * (1 + V2)
+IF %1 > 2,  add out2q, mmsize * (1 + V2)
+IF %1 > 3,  add out3q, mmsize * (1 + V2)
+            RET
+%endmacro
+
+%macro read_packed2 1 ; depth
+op read%1_packed2
+            movu m8,  [in0q + 0*mmsize]
+            movu m9,  [in0q + 1*mmsize]
+    IF V2,  movu m10, [in0q + 2*mmsize]
+    IF V2,  movu m11, [in0q + 3*mmsize]
+IF %1 < 32, VBROADCASTI128 m12, [read%1_unpack2]
+            LOAD_CONT tmp0q
+            add in0q, mmsize * (2 + V2 * 2)
+%if %1 == 32
+            shufps m8, m8, q3120
+            shufps m9, m9, q3120
+    IF V2,  shufps m10, m10, q3120
+    IF V2,  shufps m11, m11, q3120
+%else
+            pshufb m8, m12              ; { X0 Y0 | X1 Y1 }
+            pshufb m9, m12              ; { X2 Y2 | X3 Y3 }
+    IF V2,  pshufb m10, m12
+    IF V2,  pshufb m11, m12
+%endif
+            unpcklpd mx, m8, m9         ; { X0 X2 | X1 X3 }
+            unpckhpd my, m8, m9         ; { Y0 Y2 | Y1 Y3 }
+    IF V2,  unpcklpd mx2, m10, m11
+    IF V2,  unpckhpd my2, m10, m11
+%if avx_enabled
+            vpermq mx, mx, q3120       ; { X0 X1 | X2 X3 }
+            vpermq my, my, q3120       ; { Y0 Y1 | Y2 Y3 }
+    IF V2,  vpermq mx2, mx2, q3120
+    IF V2,  vpermq my2, my2, q3120
+%endif
+            CONTINUE tmp0q
+%endmacro
+
+%macro write_packed2 1 ; depth
+op write%1_packed2
+IF %1 < 32, VBROADCASTI128 m12, [write%1_pack2]
+            LOAD_CONT tmp0q
+%if avx_enabled
+            vpermq mx, mx, q3120       ; { X0 X2 | X1 X3 }
+            vpermq my, my, q3120       ; { Y0 Y2 | Y1 Y3 }
+    IF V2,  vpermq mx2, mx2, q3120
+    IF V2,  vpermq my2, my2, q3120
+%endif
+            unpcklpd m8, mx, my        ; { X0 Y0 | X1 Y1 }
+            unpckhpd m9, mx, my        ; { X2 Y2 | X3 Y3 }
+    IF V2,  unpcklpd m10, mx2, my2
+    IF V2,  unpckhpd m11, mx2, my2
+%if %1 == 32
+            shufps m8, m8, q3120
+            shufps m9, m9, q3120
+    IF V2,  shufps m10, m10, q3120
+    IF V2,  shufps m11, m11, q3120
+%else
+            pshufb m8, m12
+            pshufb m9, m12
+    IF V2,  pshufb m10, m12
+    IF V2,  pshufb m11, m12
+%endif
+            movu [out0q + 0*mmsize], m8
+            movu [out0q + 1*mmsize], m9
+IF V2,      movu [out0q + 2*mmsize], m10
+IF V2,      movu [out0q + 3*mmsize], m11
+            add out0q, mmsize * (2 + V2 * 2)
+            RET
+%endmacro
+
+%macro read_packed_inner 7 ; x, y, z, w, addr, num, depth
+            movu xm8,  [%5 + 0  * %6]
+            movu xm9,  [%5 + 4  * %6]
+            movu xm10, [%5 + 8  * %6]
+            movu xm11, [%5 + 12 * %6]
+    %if avx_enabled
+            vinserti128 m8,  m8,  [%5 + 16 * %6], 1
+            vinserti128 m9,  m9,  [%5 + 20 * %6], 1
+            vinserti128 m10, m10, [%5 + 24 * %6], 1
+            vinserti128 m11, m11, [%5 + 28 * %6], 1
+    %endif
+    %if %7 == 32
+            mova %1, m8
+            mova %2, m9
+            mova %3, m10
+            mova %4, m11
+    %else
+            pshufb %1, m8,  m12         ; { X0 Y0 Z0 W0 | X4 Y4 Z4 W4 }
+            pshufb %2, m9,  m12         ; { X1 Y1 Z1 W1 | X5 Y5 Z5 W5 }
+            pshufb %3, m10, m12         ; { X2 Y2 Z2 W2 | X6 Y6 Z6 W6 }
+            pshufb %4, m11, m12         ; { X3 Y3 Z3 W3 | X7 Y7 Z7 W7 }
+    %endif
+            punpckldq m8,  %1, %2       ; { X0 X1 Y0 Y1 | X4 X5 Y4 Y5 }
+            punpckldq m9,  %3, %4       ; { X2 X3 Y2 Y3 | X6 X7 Y6 Y7 }
+            punpckhdq m10, %1, %2       ; { Z0 Z1 W0 W1 | Z4 Z5 W4 W5 }
+            punpckhdq m11, %3, %4       ; { Z2 Z3 W2 W3 | Z6 Z7 W6 W7 }
+            punpcklqdq %1, m8, m9       ; { X0 X1 X2 X3 | X4 X5 X6 X7 }
+            punpckhqdq %2, m8, m9       ; { Y0 Y1 Y2 Y3 | Y4 Y5 Y6 Y7 }
+            punpcklqdq %3, m10, m11     ; { Z0 Z1 Z2 Z3 | Z4 Z5 Z6 Z7 }
+IF %6 > 3,  punpckhqdq %4, m10, m11     ; { W0 W1 W2 W3 | W4 W5 W6 W7 }
+%endmacro
+
+%macro read_packed 2 ; num, depth
+op read%2_packed%1
+IF %2 < 32, VBROADCASTI128 m12, [read%2_unpack%1]
+            LOAD_CONT tmp0q
+            read_packed_inner mx, my, mz, mw, in0q, %1, %2
+IF1 V2,     read_packed_inner mx2, my2, mz2, mw2, in0q + %1 * mmsize, %1, %2
+            add in0q, %1 * mmsize * (1 + V2)
+            CONTINUE tmp0q
+%endmacro
+
+%macro write_packed_inner 7 ; x, y, z, w, addr, num, depth
+        punpckldq m8,  %1, %2       ; { X0 Y0 X1 Y1 | X4 Y4 X5 Y5 }
+        punpckldq m9,  %3, %4       ; { Z0 W0 Z1 W1 | Z4 W4 Z5 W5 }
+        punpckhdq m10, %1, %2       ; { X2 Y2 X3 Y3 | X6 Y6 X7 Y7 }
+        punpckhdq m11, %3, %4       ; { Z2 W2 Z3 W3 | Z6 W6 Z7 W7 }
+        punpcklqdq %1, m8, m9       ; { X0 Y0 Z0 W0 | X4 Y4 Z4 W4 }
+        punpckhqdq %2, m8, m9       ; { X1 Y1 Z1 W1 | X5 Y5 Z5 W5 }
+        punpcklqdq %3, m10, m11     ; { X2 Y2 Z2 W2 | X6 Y6 Z6 W6 }
+        punpckhqdq %4, m10, m11     ; { X3 Y3 Z3 W3 | X7 Y7 Z7 W7 }
+    %if %7 == 32
+        mova m8,  %1
+        mova m9,  %2
+        mova m10, %3
+        mova m11, %4
+    %else
+        pshufb m8,  %1, m12
+        pshufb m9,  %2, m12
+        pshufb m10, %3, m12
+        pshufb m11, %4, m12
+    %endif
+        movu [%5 +  0*%6], xm8
+        movu [%5 +  4*%6], xm9
+        movu [%5 +  8*%6], xm10
+        movu [%5 + 12*%6], xm11
+    %if avx_enabled
+        vextracti128 [%5 + 16*%6], m8, 1
+        vextracti128 [%5 + 20*%6], m9, 1
+        vextracti128 [%5 + 24*%6], m10, 1
+        vextracti128 [%5 + 28*%6], m11, 1
+    %endif
+%endmacro
+
+%macro write_packed 2 ; num, depth
+op write%2_packed%1
+IF %2 < 32, VBROADCASTI128 m12, [write%2_pack%1]
+            write_packed_inner mx, my, mz, mw, out0q, %1, %2
+IF1 V2,     write_packed_inner mx2, my2, mz2, mw2, out0q + %1 * mmsize, %1, %2
+            add out0q, %1 * mmsize * (1 + V2)
+            RET
+%endmacro
+
+%macro rw_packed 1 ; depth
+        read_packed2 %1
+        read_packed 3, %1
+        read_packed 4, %1
+        write_packed2 %1
+        write_packed 3, %1
+        write_packed 4, %1
+%endmacro
+
+%macro read_nibbles 0
+op read_nibbles1
+%if avx_enabled
+        movu xmx,  [in0q]
+IF V2,  movu xmx2, [in0q + 16]
+%else
+        movq xmx,  [in0q]
+IF V2,  movq xmx2, [in0q + 8]
+%endif
+        VBROADCASTI128 m8, [mask4]
+        LOAD_CONT tmp0q
+        add in0q, (mmsize >> 1) * (1 + V2)
+        pmovzxbw mx, xmx
+IF V2,  pmovzxbw mx2, xmx2
+        psllw my, mx, 8
+IF V2,  psllw my2, mx2, 8
+        psrlw mx, 4
+IF V2,  psrlw mx2, 4
+        pand my, m8
+IF V2,  pand my2, m8
+        por mx, my
+IF V2,  por mx2, my2
+        CONTINUE tmp0q
+%endmacro
+
+%macro read_bits 0
+op read_bits1
+%if avx_enabled
+        vpbroadcastd mx,  [in0q]
+IF V2,  vpbroadcastd mx2, [in0q + 4]
+%else
+        movd mx, [in0q]
+IF V2,  movd mx2, [in0q + 2]
+%endif
+        mova m8, [bits_shuf]
+        VBROADCASTI128 m9,  [bits_mask]
+        VBROADCASTI128 m10, [mask1]
+        LOAD_CONT tmp0q
+        add in0q, (mmsize >> 3) * (1 + V2)
+        pshufb mx,  m8
+IF V2,  pshufb mx2, m8
+        pand mx,  m9
+IF V2,  pand mx2, m9
+        pcmpeqb mx,  m9
+IF V2,  pcmpeqb mx2, m9
+        pand mx,  m10
+IF V2,  pand mx2, m10
+        CONTINUE tmp0q
+%endmacro
+
+%macro write_bits 0
+op write_bits1
+        VBROADCASTI128 m8, [bits_reverse]
+        psllw mx,  7
+IF V2,  psllw mx2, 7
+        pshufb mx,  m8
+IF V2,  pshufb mx2, m8
+        pmovmskb tmp0d, mx
+IF V2,  pmovmskb tmp1d, mx2
+%if avx_enabled
+        mov [out0q],     tmp0d
+IF V2,  mov [out0q + 4], tmp1d
+%else
+        mov [out0q],     tmp0d
+IF V2,  mov [out0q + 2], tmp1d
+%endif
+        add out0q, (mmsize >> 3) * (1 + V2)
+        RET
+%endmacro
+
+;--------------------------
+; Pixel packing / unpacking
+
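+; e.g. `pack_generic 5, 6, 5` declares pack_5650, computing x << 11 | y << 5 | z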
+%macro pack_generic 3-4 0 ; x, y, z, w
+op pack_%1%2%3%4
+        ; pslld works for all sizes because the input should not overflow
+IF %2,  pslld mx, %4+%3+%2
+IF %3,  pslld my, %4+%3
+IF %4,  pslld mz, %4
+IF %2,  por mx, my
+IF %3,  por mx, mz
+IF %4,  por mx, mw
+    %if V2
+IF %2,  pslld mx2, %4+%3+%2
+IF %3,  pslld my2, %4+%3
+IF %4,  pslld mz2, %4
+IF %2,  por mx2, my2
+IF %3,  por mx2, mz2
+IF %4,  por mx2, mw2
+    %endif
+        CONTINUE
+%endmacro
+
+%macro unpack 5-6 0 ; type, bits, x, y, z, w
+op unpack_%3%4%5%6
+        ; clear high bits by shifting left
+IF %6,  vpsll%1 mw, mx, %2 - (%6)
+IF %5,  vpsll%1 mz, mx, %2 - (%6+%5)
+IF %4,  vpsll%1 my, mx, %2 - (%6+%5+%4)
+        psrl%1 mx, %4+%5+%6
+IF %4,  psrl%1 my, %2 - %4
+IF %5,  psrl%1 mz, %2 - %5
+IF %6,  psrl%1 mw, %2 - %6
+    %if V2
+IF %6,  vpsll%1 mw2, mx2, %2 - (%6)
+IF %5,  vpsll%1 mz2, mx2, %2 - (%6+%5)
+IF %4,  vpsll%1 my2, mx2, %2 - (%6+%5+%4)
+        psrl%1 mx2, %4+%5+%6
+IF %4,  psrl%1 my2, %2 - %4
+IF %5,  psrl%1 mz2, %2 - %5
+IF %6,  psrl%1 mw2, %2 - %6
+    %endif
+        CONTINUE
+%endmacro
+
+%macro unpack8 3 ; x, y, z
+op unpack_%1%2%3 %+ 0
+        pand mz, mx, [mask%3]
+        psrld my, mx, %3
+        psrld mx, %3+%2
+        pand my, [mask%2]
+        pand mx, [mask%1]
+    %if V2
+        pand mz2, mx2, [mask%3]
+        psrld my2, mx2, %3
+        psrld mx2, %3+%2
+        pand my2, [mask%2]
+        pand mx2, [mask%1]
+    %endif
+        CONTINUE
+%endmacro
+
+;---------------------------------------------------------
+; Generic byte order shuffle (packed swizzle, endian, etc)
+
+%macro shuffle 0
+op shuffle
+        VBROADCASTI128 m8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   pshufb mx, m8
+IF Y,   pshufb my, m8
+IF Z,   pshufb mz, m8
+IF W,   pshufb mw, m8
+%if V2
+IF X,   pshufb mx2, m8
+IF Y,   pshufb my2, m8
+IF Z,   pshufb mz2, m8
+IF W,   pshufb mw2, m8
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Clearing
+
+%macro clear_alpha 3 ; idx, vreg, vreg2
+op clear_alpha%1
+        LOAD_CONT tmp0q
+        pcmpeqb %2, %2
+IF V2,  mova %3, %2
+        CONTINUE tmp0q
+%endmacro
+
+%macro clear_zero 3 ; idx, vreg, vreg2
+op clear_zero%1
+        LOAD_CONT tmp0q
+        pxor %2, %2
+IF V2,  mova %3, %2
+        CONTINUE tmp0q
+%endmacro
+
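+; the X/Y/Z/W pattern bits mark the channels to keep; all remaining channels
+; are overwritten with the constant value stored in .priv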
+%macro clear_generic 0
+op clear
+            LOAD_CONT tmp0q
+%if avx_enabled
+    IF !X,  vpbroadcastd mx, [implq + SwsOpImpl.priv + 0]
+    IF !Y,  vpbroadcastd my, [implq + SwsOpImpl.priv + 4]
+    IF !Z,  vpbroadcastd mz, [implq + SwsOpImpl.priv + 8]
+    IF !W,  vpbroadcastd mw, [implq + SwsOpImpl.priv + 12]
+%else ; !avx_enabled
+    IF !X,  movd mx, [implq + SwsOpImpl.priv + 0]
+    IF !Y,  movd my, [implq + SwsOpImpl.priv + 4]
+    IF !Z,  movd mz, [implq + SwsOpImpl.priv + 8]
+    IF !W,  movd mw, [implq + SwsOpImpl.priv + 12]
+    IF !X,  pshufd mx, mx, 0
+    IF !Y,  pshufd my, my, 0
+    IF !Z,  pshufd mz, mz, 0
+    IF !W,  pshufd mw, mw, 0
+%endif
+%if V2
+    IF !X,  mova mx2, mx
+    IF !Y,  mova my2, my
+    IF !Z,  mova mz2, mz
+    IF !W,  mova mw2, mw
+%endif
+            CONTINUE tmp0q
+%endmacro
+
+%macro clear_funcs 0
+        decl_pattern 1, 1, 1, 0, clear_generic
+        decl_pattern 0, 1, 1, 1, clear_generic
+        decl_pattern 0, 0, 1, 1, clear_generic
+        decl_pattern 1, 0, 0, 1, clear_generic
+        decl_pattern 1, 1, 0, 0, clear_generic
+        decl_pattern 0, 1, 0, 1, clear_generic
+        decl_pattern 1, 0, 1, 0, clear_generic
+        decl_pattern 1, 0, 0, 0, clear_generic
+        decl_pattern 0, 1, 0, 0, clear_generic
+        decl_pattern 0, 0, 1, 0, clear_generic
+%endmacro
+
+;---------------------------------------------------------
+; Swizzling and duplicating
+
+; mA := mB, mB := mC, ... mX := mA
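+; e.g. `vrotate 8, 0, 3, 2, 1` expands to m8 := m0, m0 := m3, m3 := m2,
+; m2 := m1, m1 := m8 (plus the same on m12/m4/m7/m6/m5 for the second half)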
+%macro vrotate 2-* ; A, B, C, ...
+    %rep %0
+        %assign rot_a %1 + 4
+        %assign rot_b %2 + 4
+        mova m%1, m%2
+        IF V2, mova m%[rot_a], m%[rot_b]
+    %rotate 1
+    %endrep
+    %undef rot_a
+    %undef rot_b
+%endmacro
+
+%macro swizzle_funcs 0
+op swizzle_3012
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 2, 1
+    CONTINUE tmp0q
+
+op swizzle_3021
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 1
+    CONTINUE tmp0q
+
+op swizzle_2103
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2
+    CONTINUE tmp0q
+
+op swizzle_3210
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3
+    vrotate 8, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_3102
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 2
+    CONTINUE tmp0q
+
+op swizzle_3201
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_1203
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_1023
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1
+    CONTINUE tmp0q
+
+op swizzle_2013
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2, 1
+    CONTINUE tmp0q
+
+op swizzle_2310
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2, 1, 3
+    CONTINUE tmp0q
+
+op swizzle_2130
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2, 3
+    CONTINUE tmp0q
+
+op swizzle_1230
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1, 2, 3
+    CONTINUE tmp0q
+
+op swizzle_1320
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1, 3
+    CONTINUE tmp0q
+
+op swizzle_0213
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_0231
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 2, 3
+    CONTINUE tmp0q
+
+op swizzle_0312
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 3, 2
+    CONTINUE tmp0q
+
+op swizzle_3120
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3
+    CONTINUE tmp0q
+
+op swizzle_0321
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 3
+    CONTINUE tmp0q
+
+op swizzle_0003
+    LOAD_CONT tmp0q
+    mova my, mx
+    mova mz, mx
+%if V2
+    mova my2, mx2
+    mova mz2, mx2
+%endif
+    CONTINUE tmp0q
+
+op swizzle_0001
+    LOAD_CONT tmp0q
+    mova mw, my
+    mova mz, mx
+    mova my, mx
+%if V2
+    mova mw2, my2
+    mova mz2, mx2
+    mova my2, mx2
+%endif
+    CONTINUE tmp0q
+
+op swizzle_3000
+    LOAD_CONT tmp0q
+    mova my, mx
+    mova mz, mx
+    mova mx, mw
+    mova mw, my
+%if V2
+    mova my2, mx2
+    mova mz2, mx2
+    mova mx2, mw2
+    mova mw2, my2
+%endif
+    CONTINUE tmp0q
+
+op swizzle_1000
+    LOAD_CONT tmp0q
+    mova mz, mx
+    mova mw, mx
+    mova mx, my
+    mova my, mz
+%if V2
+    mova mz2, mx2
+    mova mw2, mx2
+    mova mx2, my2
+    mova my2, mz2
+%endif
+    CONTINUE tmp0q
+%endmacro
+
+%macro packed_shuffle 2 ; size_in, size_out
+cglobal packed_shuffle%1_%2, 6, 10, 2, \
+    exec, shuffle, bx, y, bxend, yend, src, dst, src_stride, dst_stride
+            mov srcq, [execq + SwsOpExec.in0]
+            mov dstq, [execq + SwsOpExec.out0]
+            mov src_strideq, [execq + SwsOpExec.in_stride0]
+            mov dst_strideq, [execq + SwsOpExec.out_stride0]
+            VBROADCASTI128 m1, [shuffleq]
+            sub bxendd, bxd
+            sub yendd, yd
+            ; reuse regs
+    %define srcidxq execq
+            imul srcidxq, bxendq, -%1
+%if %1 == %2
+    %define dstidxq srcidxq
+%else
+    %define dstidxq shuffleq ; no longer needed reg
+            imul dstidxq, bxendq, -%2
+%endif
+            sub srcq, srcidxq
+            sub dstq, dstidxq
+.loop:
+    %if %1 <= 4
+            movd m0, [srcq + srcidxq]
+    %elif %1 <= 8
+            movq m0, [srcq + srcidxq]
+    %else
+            movu m0, [srcq + srcidxq]
+    %endif
+            pshufb m0, m1
+            movu [dstq + dstidxq], m0
+            add srcidxq, %1
+IF %1 != %2,add dstidxq, %2
+            jnz .loop
+            add srcq, src_strideq
+            add dstq, dst_strideq
+            imul srcidxq, bxendq, -%1
+IF %1 != %2,imul dstidxq, bxendq, -%2
+            dec yendd
+            jnz .loop
+            RET
+%endmacro
+
+;---------------------------------------------------------
+; Pixel type conversions
+
+%macro conv8to16 1 ; type
+op %1_U8_U16
+            LOAD_CONT tmp0q
+%if V2
+    %if avx_enabled
+    IF X,   vextracti128 xmx2, mx, 1
+    IF Y,   vextracti128 xmy2, my, 1
+    IF Z,   vextracti128 xmz2, mz, 1
+    IF W,   vextracti128 xmw2, mw, 1
+    %else
+    IF X,   psrldq xmx2, mx, 8
+    IF Y,   psrldq xmy2, my, 8
+    IF Z,   psrldq xmz2, mz, 8
+    IF W,   psrldq xmw2, mw, 8
+    %endif
+    IF X,   pmovzxbw mx2, xmx2
+    IF Y,   pmovzxbw my2, xmy2
+    IF Z,   pmovzxbw mz2, xmz2
+    IF W,   pmovzxbw mw2, xmw2
+%endif ; V2
+    IF X,   pmovzxbw mx, xmx
+    IF Y,   pmovzxbw my, xmy
+    IF Z,   pmovzxbw mz, xmz
+    IF W,   pmovzxbw mw, xmw
+
+%ifidn %1, expand
+            VBROADCASTI128 m8, [expand16_shuf]
+    %if V2
+    IF X,   pshufb mx2, m8
+    IF Y,   pshufb my2, m8
+    IF Z,   pshufb mz2, m8
+    IF W,   pshufb mw2, m8
+    %endif
+    IF X,   pshufb mx, m8
+    IF Y,   pshufb my, m8
+    IF Z,   pshufb mz, m8
+    IF W,   pshufb mw, m8
+%endif ; expand
+            CONTINUE tmp0q
+%endmacro
+
+%macro conv16to8 0
+op convert_U16_U8
+        LOAD_CONT tmp0q
+%if V2
+        ; this code technically works for the !V2 case as well, but slower
+IF X,   packuswb mx, mx2
+IF Y,   packuswb my, my2
+IF Z,   packuswb mz, mz2
+IF W,   packuswb mw, mw2
+IF X,   vpermq mx, mx, q3120
+IF Y,   vpermq my, my, q3120
+IF Z,   vpermq mz, mz, q3120
+IF W,   vpermq mw, mw, q3120
+%else
+IF X,   vextracti128  xm8, mx, 1
+IF Y,   vextracti128  xm9, my, 1
+IF Z,   vextracti128 xm10, mz, 1
+IF W,   vextracti128 xm11, mw, 1
+IF X,   packuswb xmx, xm8
+IF Y,   packuswb xmy, xm9
+IF Z,   packuswb xmz, xm10
+IF W,   packuswb xmw, xm11
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv8to32 1 ; type
+op %1_U8_U32
+        LOAD_CONT tmp0q
+IF X,   psrldq xmx2, xmx, 8
+IF Y,   psrldq xmy2, xmy, 8
+IF Z,   psrldq xmz2, xmz, 8
+IF W,   psrldq xmw2, xmw, 8
+IF X,   pmovzxbd mx, xmx
+IF Y,   pmovzxbd my, xmy
+IF Z,   pmovzxbd mz, xmz
+IF W,   pmovzxbd mw, xmw
+IF X,   pmovzxbd mx2, xmx2
+IF Y,   pmovzxbd my2, xmy2
+IF Z,   pmovzxbd mz2, xmz2
+IF W,   pmovzxbd mw2, xmw2
+%ifidn %1, expand
+        VBROADCASTI128 m8, [expand32_shuf]
+IF X,   pshufb mx, m8
+IF Y,   pshufb my, m8
+IF Z,   pshufb mz, m8
+IF W,   pshufb mw, m8
+IF X,   pshufb mx2, m8
+IF Y,   pshufb my2, m8
+IF Z,   pshufb mz2, m8
+IF W,   pshufb mw2, m8
+%endif ; expand
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32to8 0
+op convert_U32_U8
+        LOAD_CONT tmp0q
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   packuswb xmx, xmx2
+IF Y,   packuswb xmy, xmy2
+IF Z,   packuswb xmz, xmz2
+IF W,   packuswb xmw, xmw2
+IF X,   vpshufd xmx, xmx, q3120
+IF Y,   vpshufd xmy, xmy, q3120
+IF Z,   vpshufd xmz, xmz, q3120
+IF W,   vpshufd xmw, xmw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv16to32 0
+op convert_U16_U32
+        LOAD_CONT tmp0q
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   pmovzxwd mx, xmx
+IF Y,   pmovzxwd my, xmy
+IF Z,   pmovzxwd mz, xmz
+IF W,   pmovzxwd mw, xmw
+IF X,   pmovzxwd mx2, xmx2
+IF Y,   pmovzxwd my2, xmy2
+IF Z,   pmovzxwd mz2, xmz2
+IF W,   pmovzxwd mw2, xmw2
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32to16 0
+op convert_U32_U16
+        LOAD_CONT tmp0q
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vpermq mx, mx, q3120
+IF Y,   vpermq my, my, q3120
+IF Z,   vpermq mz, mz, q3120
+IF W,   vpermq mw, mw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Shifting
+
+%macro lshift16 0
+op lshift16
+        vmovq xm8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   psllw mx, xm8
+IF Y,   psllw my, xm8
+IF Z,   psllw mz, xm8
+IF W,   psllw mw, xm8
+%if V2
+IF X,   psllw mx2, xm8
+IF Y,   psllw my2, xm8
+IF Z,   psllw mz2, xm8
+IF W,   psllw mw2, xm8
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+%macro rshift16 0
+op rshift16
+        vmovq xm8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   psrlw mx, xm8
+IF Y,   psrlw my, xm8
+IF Z,   psrlw mz, xm8
+IF W,   psrlw mw, xm8
+%if V2
+IF X,   psrlw mx2, xm8
+IF Y,   psrlw my2, xm8
+IF Z,   psrlw mz2, xm8
+IF W,   psrlw mw2, xm8
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Function instantiations
+
+%macro funcs_u8 0
+    read_planar 1
+    read_planar 2
+    read_planar 3
+    read_planar 4
+    write_planar 1
+    write_planar 2
+    write_planar 3
+    write_planar 4
+
+    rw_packed 8
+    read_nibbles
+    read_bits
+    write_bits
+
+    pack_generic 1, 2, 1
+    pack_generic 3, 3, 2
+    pack_generic 2, 3, 3
+    unpack8 1, 2, 1
+    unpack8 3, 3, 2
+    unpack8 2, 3, 3
+
+    clear_alpha 0, mx, mx2
+    clear_alpha 1, my, my2
+    clear_alpha 3, mw, mw2
+    clear_zero  0, mx, mx2
+    clear_zero  1, my, my2
+    clear_zero  3, mw, mw2
+    clear_funcs
+    swizzle_funcs
+
+    decl_common_patterns shuffle
+%endmacro
+
+%macro funcs_u16 0
+    rw_packed 16
+    pack_generic  4, 4, 4
+    pack_generic  5, 5, 5
+    pack_generic  5, 6, 5
+    unpack w, 16, 4, 4, 4
+    unpack w, 16, 5, 5, 5
+    unpack w, 16, 5, 6, 5
+    decl_common_patterns conv8to16 convert
+    decl_common_patterns conv8to16 expand
+    decl_common_patterns conv16to8
+    decl_common_patterns lshift16
+    decl_common_patterns rshift16
+%endmacro
+
+INIT_XMM sse4
+decl_v2 0, funcs_u8
+decl_v2 1, funcs_u8
+
+process_fn 1
+process_fn 2
+process_fn 3
+process_fn 4
+
+packed_shuffle  5, 15 ;  8 -> 24
+packed_shuffle  4, 16 ;  8 -> 32, 16 -> 64
+packed_shuffle  2, 12 ;  8 -> 48
+packed_shuffle 10, 15 ; 16 -> 24
+packed_shuffle  8, 16 ; 16 -> 32, 32 -> 64
+packed_shuffle  4, 12 ; 16 -> 48
+packed_shuffle 15, 15 ; 24 -> 24
+packed_shuffle 12, 16 ; 24 -> 32
+packed_shuffle  6, 12 ; 24 -> 48
+packed_shuffle 16, 12 ; 32 -> 24, 64 -> 48
+packed_shuffle 16, 16 ; 32 -> 32, 64 -> 64
+packed_shuffle  8, 12 ; 32 -> 48
+packed_shuffle 12, 12 ; 48 -> 48
+
+INIT_YMM avx2
+decl_v2 0, funcs_u8
+decl_v2 1, funcs_u8
+decl_v2 0, funcs_u16
+decl_v2 1, funcs_u16
+
+packed_shuffle 32, 32
+
+INIT_YMM avx2
+decl_v2 1, rw_packed 32
+decl_v2 1, pack_generic  10, 10, 10,  2
+decl_v2 1, pack_generic   2, 10, 10, 10
+decl_v2 1, unpack d, 32, 10, 10, 10,  2
+decl_v2 1, unpack d, 32,  2, 10, 10, 10
+decl_common_patterns conv8to32 convert
+decl_common_patterns conv8to32 expand
+decl_common_patterns conv32to8
+decl_common_patterns conv16to32
+decl_common_patterns conv32to16
+
+INIT_ZMM avx512
+packed_shuffle 64, 64
-- 
2.49.0

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 21+ messages in thread

* [FFmpeg-devel] [PATCH v2 15/17] tests/checkasm: add checkasm tests for swscale ops
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (13 preceding siblings ...)
  2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 14/17] swscale/x86: add SIMD backend Niklas Haas
@ 2025-05-21 12:44 ` Niklas Haas
  2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 16/17] swscale/format: add new format decode/encode logic Niklas Haas
  2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 17/17] swscale/graph: allow experimental use of new format handler Niklas Haas
  16 siblings, 0 replies; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:44 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Because of the lack of an external ABI on low-level kernels, we cannot
directly test internal functions. Instead, we construct a minimal op chain
consisting of a read, the op to be tested, and a write.
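
For example, the byte swapping tests compile and compare the minimal chain
{ SWS_OP_READ, SWS_OP_SWAP_BYTES, SWS_OP_WRITE } once per backend.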

A bigger complication arises from the fact that the backend may generate
arbitrary internal state that needs to be passed back to the implementation,
so we cannot simply call `func_ref` on the chain generated by the backend
under test. To get around this, we always compile the op chain twice - once
using the backend to be tested, and once using the reference C backend.

The actual entry point may also just be a shared wrapper, so we need to be
careful to run checkasm_check_func() on a pseudo-pointer that is unique for
each combination of backend and active CPU flags.
---
 tests/checkasm/Makefile   |   8 +-
 tests/checkasm/checkasm.c |   1 +
 tests/checkasm/checkasm.h |   1 +
 tests/checkasm/sw_ops.c   | 770 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 779 insertions(+), 1 deletion(-)
 create mode 100644 tests/checkasm/sw_ops.c

diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index fabbf595b4..d38ec371df 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -66,7 +66,13 @@ AVFILTEROBJS-$(CONFIG_SOBEL_FILTER)      += vf_convolution.o
 CHECKASMOBJS-$(CONFIG_AVFILTER) += $(AVFILTEROBJS-yes)
 
 # swscale tests
-SWSCALEOBJS                             += sw_gbrp.o sw_range_convert.o sw_rgb.o sw_scale.o sw_yuv2rgb.o sw_yuv2yuv.o
+SWSCALEOBJS                             += sw_gbrp.o            \
+                                           sw_ops.o             \
+                                           sw_range_convert.o   \
+                                           sw_rgb.o             \
+                                           sw_scale.o           \
+                                           sw_yuv2rgb.o         \
+                                           sw_yuv2yuv.o
 
 CHECKASMOBJS-$(CONFIG_SWSCALE)  += $(SWSCALEOBJS)
 
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index f393a0cb96..11bd5668cf 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -298,6 +298,7 @@ static const struct {
     { "sw_scale", checkasm_check_sw_scale },
     { "sw_yuv2rgb", checkasm_check_sw_yuv2rgb },
     { "sw_yuv2yuv", checkasm_check_sw_yuv2yuv },
+    { "sw_ops", checkasm_check_sw_ops },
 #endif
 #if CONFIG_AVUTIL
         { "aes",       checkasm_check_aes },
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index ec01bd6207..d69f4cb835 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -132,6 +132,7 @@ void checkasm_check_sw_rgb(void);
 void checkasm_check_sw_scale(void);
 void checkasm_check_sw_yuv2rgb(void);
 void checkasm_check_sw_yuv2yuv(void);
+void checkasm_check_sw_ops(void);
 void checkasm_check_takdsp(void);
 void checkasm_check_utvideodsp(void);
 void checkasm_check_v210dec(void);
diff --git a/tests/checkasm/sw_ops.c b/tests/checkasm/sw_ops.c
new file mode 100644
index 0000000000..e82f028fd9
--- /dev/null
+++ b/tests/checkasm/sw_ops.c
@@ -0,0 +1,770 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <math.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <string.h>
+
+#include "libavutil/avassert.h"
+#include "libavutil/mem_internal.h"
+#include "libavutil/refstruct.h"
+
+#include "libswscale/ops.h"
+#include "libswscale/ops_internal.h"
+
+#include "checkasm.h"
+
+enum {
+    LINES  = 2,
+    PLANES = 4,
+    PIXELS = 64,
+};
+
+enum {
+    U8  = SWS_PIXEL_U8,
+    U16 = SWS_PIXEL_U16,
+    U32 = SWS_PIXEL_U32,
+    F32 = SWS_PIXEL_F32,
+};
+
+#define FMT(fmt, ...) tprintf((char[256]) {0}, 256, fmt, __VA_ARGS__)
+static const char *tprintf(char buf[], size_t size, const char *fmt, ...)
+{
+    va_list ap;
+    va_start(ap, fmt);
+    vsnprintf(buf, size, fmt, ap);
+    va_end(ap);
+    return buf;
+}
+
+static int rw_pixel_bits(const SwsOp *op)
+{
+    const int elems = op->rw.packed ? op->rw.elems : 1;
+    const int size  = ff_sws_pixel_type_size(op->type);
+    const int bits  = 8 >> op->rw.frac;
+    av_assert1(bits >= 1);
+    return elems * size * bits;
+}
+
+static float rndf(void)
+{
+    union { uint32_t u; float f; } x;
+    do {
+        x.u = rnd();
+    } while (!isnormal(x.f));
+    return x.f;
+}
+
+static void fill32f(float *line, int num, unsigned range)
+{
+    const float scale = (float) range / UINT32_MAX;
+    for (int i = 0; i < num; i++)
+        line[i] = range ? scale * rnd() : rndf();
+}
+
+static void fill32(uint32_t *line, int num, unsigned range)
+{
+    for (int i = 0; i < num; i++)
+        line[i] = range ? rnd() % (range + 1) : rnd();
+}
+
+static void fill16(uint16_t *line, int num, unsigned range)
+{
+    if (!range) {
+        fill32((uint32_t *) line, AV_CEIL_RSHIFT(num, 1), 0);
+    } else {
+        for (int i = 0; i < num; i++)
+            line[i] = rnd() % (range + 1);
+    }
+}
+
+static void fill8(uint8_t *line, int num, unsigned range)
+{
+    if (!range) {
+        fill32((uint32_t *) line, AV_CEIL_RSHIFT(num, 2), 0);
+    } else {
+        for (int i = 0; i < num; i++)
+            line[i] = rnd() % (range + 1);
+    }
+}
+
+static void check_ops(const char *report, const unsigned ranges[PLANES],
+                      const SwsOp *ops)
+{
+    SwsContext *ctx = sws_alloc_context();
+    SwsCompiledOp comp_ref = {0}, comp_new = {0};
+    const SwsOpBackend *backend_new = NULL;
+    SwsOpList oplist = { .ops = (SwsOp *) ops };
+    const SwsOp *read_op, *write_op;
+    static const unsigned def_ranges[4] = {0};
+    if (!ranges)
+        ranges = def_ranges;
+
+    declare_func(void, const SwsOpExec *, const void *, int bx, int y, int bx_end, int y_end);
+
+    DECLARE_ALIGNED_64(char, src0)[PLANES][LINES][PIXELS * sizeof(uint32_t[4])];
+    DECLARE_ALIGNED_64(char, src1)[PLANES][LINES][PIXELS * sizeof(uint32_t[4])];
+    DECLARE_ALIGNED_64(char, dst0)[PLANES][LINES][PIXELS * sizeof(uint32_t[4])];
+    DECLARE_ALIGNED_64(char, dst1)[PLANES][LINES][PIXELS * sizeof(uint32_t[4])];
+
+    if (!ctx)
+        return;
+    ctx->flags = SWS_BITEXACT;
+
+    read_op = &ops[0];
+    for (oplist.num_ops = 0; ops[oplist.num_ops].op; oplist.num_ops++)
+        write_op = &ops[oplist.num_ops];
+
+    for (int p = 0; p < PLANES; p++) {
+        void *plane = src0[p];
+        switch (read_op->type) {
+        case U8:    fill8(plane, sizeof(src0[p]) /  sizeof(uint8_t), ranges[p]); break;
+        case U16:  fill16(plane, sizeof(src0[p]) / sizeof(uint16_t), ranges[p]); break;
+        case U32:  fill32(plane, sizeof(src0[p]) / sizeof(uint32_t), ranges[p]); break;
+        case F32: fill32f(plane, sizeof(src0[p]) / sizeof(uint32_t), ranges[p]); break;
+        }
+    }
+
+    memcpy(src1, src0, sizeof(src0));
+    memset(dst0, 0, sizeof(dst0));
+    memset(dst1, 0, sizeof(dst1));
+
+    /* Compile `ops` using both the asm and c backends */
+    for (int n = 0; ff_sws_op_backends[n]; n++) {
+        const SwsOpBackend *backend = ff_sws_op_backends[n];
+        const bool is_ref = !strcmp(backend->name, "c");
+        if (is_ref || !comp_new.func) {
+            SwsCompiledOp comp;
+            int ret = ff_sws_ops_compile_backend(ctx, backend, &oplist, &comp);
+            if (ret == AVERROR(ENOTSUP))
+                continue;
+            else if (ret < 0)
+                fail();
+            else if (PIXELS % comp.block_size != 0)
+                fail();
+
+            if (is_ref)
+                comp_ref = comp;
+            if (!comp_new.func) {
+                comp_new = comp;
+                backend_new = backend;
+            }
+        }
+    }
+
+    av_assert0(comp_ref.func && comp_new.func);
+
+    SwsOpExec exec = {0};
+    exec.pixel_bits_in  = rw_pixel_bits(read_op);
+    exec.pixel_bits_out = rw_pixel_bits(write_op);
+    exec.width = PIXELS;
+    exec.height = exec.slice_h = 1;
+    for (int i = 0; i < PLANES; i++) {
+        exec.in_stride[i]  = sizeof(src0[i][0]);
+        exec.out_stride[i] = sizeof(dst0[i][0]);
+    }
+
+    /**
+     * Don't use check_func() because the actual function pointer may be a
+     * wrapper shared by multiple implementations. Instead, take a hash of both
+     * the backend pointer and the active CPU flags.
+     */
+    uintptr_t id = (uintptr_t) backend_new;
+    id ^= (id << 6) + (id >> 2) + 0x9e3779b97f4a7c15 + comp_new.cpu_flags;
+
+    checkasm_save_context();
+    if (checkasm_check_func((void *) id, "%s", report)) {
+        func_new = comp_new.func;
+        func_ref = comp_ref.func;
+
+        for (int i = 0; i < PLANES; i++) {
+            exec.in[i]  = (void *) src0[i];
+            exec.out[i] = (void *) dst0[i];
+        }
+        call_ref(&exec, comp_ref.priv, 0, 0, PIXELS / comp_ref.block_size, LINES);
+
+        for (int i = 0; i < PLANES; i++) {
+            exec.in[i]  = (void *) src1[i];
+            exec.out[i] = (void *) dst1[i];
+        }
+        call_new(&exec, comp_new.priv, 0, 0, PIXELS / comp_new.block_size, LINES);
+
+        for (int i = 0; i < PLANES; i++) {
+            const char *name = FMT("%s[%d]", report, i);
+            const int size   = PIXELS * exec.pixel_bits_out >> 3;
+            const int stride = sizeof(dst0[i][0]);
+
+            switch (write_op->type) {
+            case U8:
+                checkasm_check(uint8_t, (void *) dst0[i], stride,
+                                        (void *) dst1[i], stride,
+                                        size, LINES, name);
+                break;
+            case U16:
+                checkasm_check(uint16_t, (void *) dst0[i], stride,
+                                         (void *) dst1[i], stride,
+                                         size >> 1, LINES, name);
+                break;
+            case U32:
+                checkasm_check(uint32_t, (void *) dst0[i], stride,
+                                         (void *) dst1[i], stride,
+                                         size >> 2, LINES, name);
+                break;
+            case F32:
+                checkasm_check(float_ulp, (void *) dst0[i], stride,
+                                          (void *) dst1[i], stride,
+                                          size >> 2, LINES, name, 0);
+                break;
+            }
+
+            if (write_op->rw.packed)
+                break;
+        }
+
+        bench_new(&exec, comp_new.priv, 0, 0, PIXELS / comp_new.block_size, LINES);
+    }
+
+    if (comp_new.func != comp_ref.func && comp_new.free)
+        comp_new.free(comp_new.priv);
+    if (comp_ref.free)
+        comp_ref.free(comp_ref.priv);
+    sws_free_context(&ctx);
+}
+
+#define CHECK_RANGES(NAME, RANGES, N_IN, N_OUT, IN, OUT, ...)                   \
+  do {                                                                          \
+      check_ops(NAME, RANGES, (SwsOp[]) {                                       \
+        {                                                                       \
+            .op = SWS_OP_READ,                                                  \
+            .type = IN,                                                         \
+            .rw.elems = N_IN,                                                   \
+        },                                                                      \
+        __VA_ARGS__,                                                            \
+        {                                                                       \
+            .op = SWS_OP_WRITE,                                                 \
+            .type = OUT,                                                        \
+            .rw.elems = N_OUT,                                                  \
+        }, {0}                                                                  \
+    });                                                                         \
+  } while (0)
+
+#define MK_RANGES(R) ((const unsigned[]) { R, R, R, R })
+#define CHECK_RANGE(NAME, RANGE, N_IN, N_OUT, IN, OUT, ...)                     \
+    CHECK_RANGES(NAME, MK_RANGES(RANGE), N_IN, N_OUT, IN, OUT, __VA_ARGS__)
+
+#define CHECK_COMMON_RANGE(NAME, RANGE, IN, OUT, ...)                           \
+    CHECK_RANGE(FMT("%s_p1000", NAME), RANGE, 1, 1, IN, OUT, __VA_ARGS__);      \
+    CHECK_RANGE(FMT("%s_p1110", NAME), RANGE, 3, 3, IN, OUT, __VA_ARGS__);      \
+    CHECK_RANGE(FMT("%s_p1111", NAME), RANGE, 4, 4, IN, OUT, __VA_ARGS__);      \
+    CHECK_RANGE(FMT("%s_p1001", NAME), RANGE, 4, 2, IN, OUT, __VA_ARGS__, {     \
+        .op = SWS_OP_SWIZZLE,                                                   \
+        .type = OUT,                                                            \
+        .swizzle = SWS_SWIZZLE(0, 3, 1, 2),                                     \
+    })
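+
+/* The p1001 variant reads four channels but writes only two; the swizzle
+ * moves the alpha channel into the second output component. */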
+
+#define CHECK(NAME, N_IN, N_OUT, IN, OUT, ...) \
+    CHECK_RANGE(NAME, 0, N_IN, N_OUT, IN, OUT, __VA_ARGS__)
+
+#define CHECK_COMMON(NAME, IN, OUT, ...) \
+    CHECK_COMMON_RANGE(NAME, 0, IN, OUT, __VA_ARGS__)
+
+static void check_read_write(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        for (int i = 1; i <= 4; i++) {
+            /* Test N->N planar read/write */
+            for (int o = 1; o <= i; o++) {
+                check_ops(FMT("rw_%d_%d_%s", i, o, type), NULL, (SwsOp[]) {
+                    {
+                        .op = SWS_OP_READ,
+                        .type = t,
+                        .rw.elems = i,
+                    }, {
+                        .op = SWS_OP_WRITE,
+                        .type = t,
+                        .rw.elems = o,
+                    }, {0}
+                });
+            }
+
+            /* Test packed read/write */
+            if (i == 1)
+                continue;
+
+            check_ops(FMT("read_packed%d_%s", i, type), NULL, (SwsOp[]) {
+                {
+                    .op = SWS_OP_READ,
+                    .type = t,
+                    .rw.elems = i,
+                    .rw.packed = true,
+                }, {
+                    .op = SWS_OP_WRITE,
+                    .type = t,
+                    .rw.elems = i,
+                }, {0}
+            });
+
+            check_ops(FMT("write_packed%d_%s", i, type), NULL, (SwsOp[]) {
+                {
+                    .op = SWS_OP_READ,
+                    .type = t,
+                    .rw.elems = i,
+                }, {
+                    .op = SWS_OP_WRITE,
+                    .type = t,
+                    .rw.elems = i,
+                    .rw.packed = true,
+                }, {0}
+            });
+        }
+    }
+
+    /* Test fractional reads/writes */
+    for (int frac = 1; frac <= 3; frac++) {
+        const int bits = 8 >> frac;
+        const int range = (1 << bits) - 1;
+        if (bits == 2)
+            continue; /* no 2 bit packed formats currently exist */
+
+        check_ops(FMT("read_frac%d", frac), NULL, (SwsOp[]) {
+            {
+                .op = SWS_OP_READ,
+                .type = U8,
+                .rw.elems = 1,
+                .rw.frac  = frac,
+            }, {
+                .op = SWS_OP_WRITE,
+                .type = U8,
+                .rw.elems = 1,
+            }, {0}
+        });
+
+        check_ops(FMT("write_frac%d", frac), MK_RANGES(range), (SwsOp[]) {
+            {
+                .op = SWS_OP_READ,
+                .type = U8,
+                .rw.elems = 1,
+            }, {
+                .op = SWS_OP_WRITE,
+                .type = U8,
+                .rw.elems = 1,
+                .rw.frac  = frac,
+            }, {0}
+        });
+    }
+}
+
+static void check_swap_bytes(void)
+{
+    CHECK_COMMON("swap_bytes_16", U16, U16, {
+        .op   = SWS_OP_SWAP_BYTES,
+        .type = U16,
+    });
+
+    CHECK_COMMON("swap_bytes_32", U32, U32, {
+        .op   = SWS_OP_SWAP_BYTES,
+        .type = U32,
+    });
+}
+
+static void check_pack_unpack(void)
+{
+    const struct {
+        SwsPixelType type;
+        SwsPackOp op;
+    } patterns[] = {
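+        /* Bit layouts mirroring real packed formats
+         * (e.g. RGB8, RGB4, RGB565, RGB555, RGB444, X2RGB10, V30X) */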
+        { U8, {{ 3,  3,  2 }}},
+        { U8, {{ 2,  3,  3 }}},
+        { U8, {{ 1,  2,  1 }}},
+        {U16, {{ 5,  6,  5 }}},
+        {U16, {{ 5,  5,  5 }}},
+        {U16, {{ 4,  4,  4 }}},
+        {U32, {{ 2, 10, 10, 10 }}},
+        {U32, {{10, 10, 10,  2 }}},
+    };
+
+    for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+        const SwsPixelType type = patterns[i].type;
+        const SwsPackOp pack = patterns[i].op;
+        const int num = pack.pattern[3] ? 4 : 3;
+        const char *pat = FMT("%d%d%d%d", pack.pattern[0], pack.pattern[1],
+                                          pack.pattern[2], pack.pattern[3]);
+        const int total = pack.pattern[0] + pack.pattern[1] +
+                          pack.pattern[2] + pack.pattern[3];
+        const unsigned ranges[4] = {
+            (1 << pack.pattern[0]) - 1,
+            (1 << pack.pattern[1]) - 1,
+            (1 << pack.pattern[2]) - 1,
+            (1 << pack.pattern[3]) - 1,
+        };
+
+        CHECK_RANGES(FMT("pack_%s", pat), ranges, num, 1, type, type, {
+            .op   = SWS_OP_PACK,
+            .type = type,
+            .pack = pack,
+        });
+
+        CHECK_RANGE(FMT("unpack_%s", pat), (1 << total) - 1, 1, num, type, type, {
+            .op   = SWS_OP_UNPACK,
+            .type = type,
+            .pack = pack,
+        });
+    }
+}
+
+static AVRational rndq(SwsPixelType t)
+{
+    const unsigned num = rnd();
+    if (ff_sws_pixel_type_is_int(t)) {
+        const int bits = ff_sws_pixel_type_size(t) * 8;
+        /* 64-bit shift avoids undefined behavior when bits == 32 */
+        const unsigned mask = (unsigned)((1ULL << bits) - 1);
+        return (AVRational) { num & mask, 1 };
+    } else {
+        const unsigned den = rnd();
+        return (AVRational) { num, den ? den : 1 };
+    }
+}
+
+static void check_clear(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        const int bits = ff_sws_pixel_type_size(t) * 8;
+
+        /* TODO: AVRational can't fit 32 bit constants */
+        if (bits < 32) {
+            const AVRational chroma = (AVRational) { 1 << (bits - 1), 1};
+            const AVRational alpha  = (AVRational) { (1 << bits) - 1, 1};
+            const AVRational zero   = (AVRational) { 0, 1};
+            const AVRational none = {0};
+
+            const SwsConst patterns[] = {
+                /* Zero only */
+                {.q4 = {   none,   none,   none,   zero }},
+                {.q4 = {   zero,   none,   none,   none }},
+                /* Alpha only */
+                {.q4 = {   none,   none,   none,  alpha }},
+                {.q4 = {  alpha,   none,   none,   none }},
+                /* Chroma only */
+                {.q4 = { chroma, chroma,   none,   none }},
+                {.q4 = {   none, chroma, chroma,   none }},
+                {.q4 = {   none,   none, chroma, chroma }},
+                {.q4 = { chroma,   none, chroma,   none }},
+                {.q4 = {   none, chroma,   none, chroma }},
+                /* Alpha+chroma */
+                {.q4 = { chroma, chroma,   none,  alpha }},
+                {.q4 = {   none, chroma, chroma,  alpha }},
+                {.q4 = {  alpha,   none, chroma, chroma }},
+                {.q4 = { chroma,   none, chroma,  alpha }},
+                {.q4 = {  alpha, chroma,   none, chroma }},
+                /* Random values */
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+            };
+
+            for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+                CHECK(FMT("clear_pattern_%s[%d]", type, i), 4, 4, t, t, {
+                    .op   = SWS_OP_CLEAR,
+                    .type = t,
+                    .c    = patterns[i],
+                });
+            }
+        } else if (!ff_sws_pixel_type_is_int(t)) {
+            /* Floating point YUV doesn't exist; only alpha needs to be cleared */
+            CHECK(FMT("clear_alpha_%s", type), 4, 4, t, t, {
+                .op      = SWS_OP_CLEAR,
+                .type    = t,
+                .c.q4[3] = { 0, 1 },
+            });
+        }
+    }
+}
+
+static void check_shift(void)
+{
+    for (SwsPixelType t = U16; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        if (!ff_sws_pixel_type_is_int(t))
+            continue;
+
+        for (int shift = 1; shift <= 8; shift++) {
+            CHECK_COMMON(FMT("lshift%d_%s", shift, type), t, t, {
+                .op   = SWS_OP_LSHIFT,
+                .type = t,
+                .c.u  = shift,
+            });
+
+            CHECK_COMMON(FMT("rshift%d_%s", shift, type), t, t, {
+                .op   = SWS_OP_RSHIFT,
+                .type = t,
+                .c.u  = shift,
+            });
+        }
+    }
+}
+
+static void check_swizzle(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        static const int patterns[][4] = {
+            /* Pure swizzle */
+            {3, 0, 1, 2},
+            {3, 0, 2, 1},
+            {2, 1, 0, 3},
+            {3, 2, 1, 0},
+            {3, 1, 0, 2},
+            {3, 2, 0, 1},
+            {1, 2, 0, 3},
+            {1, 0, 2, 3},
+            {2, 0, 1, 3},
+            {2, 3, 1, 0},
+            {2, 1, 3, 0},
+            {1, 2, 3, 0},
+            {1, 3, 2, 0},
+            {0, 2, 1, 3},
+            {0, 2, 3, 1},
+            {0, 3, 1, 2},
+            {3, 1, 2, 0},
+            {0, 3, 2, 1},
+            /* Luma expansion */
+            {0, 0, 0, 3},
+            {3, 0, 0, 0},
+            {0, 0, 0, 1},
+            {1, 0, 0, 0},
+        };
+
+        for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+            const int x = patterns[i][0], y = patterns[i][1],
+                      z = patterns[i][2], w = patterns[i][3];
+            CHECK(FMT("swizzle_%d%d%d%d_%s", x, y, z, w, type), 4, 4, t, t, {
+                .op = SWS_OP_SWIZZLE,
+                .type = t,
+                .swizzle = SWS_SWIZZLE(x, y, z, w),
+            });
+        }
+    }
+}
+
+static void check_convert(void)
+{
+    for (SwsPixelType i = U8; i < SWS_PIXEL_TYPE_NB; i++) {
+        const char *itype = ff_sws_pixel_type_name(i);
+        const int isize = ff_sws_pixel_type_size(i);
+        for (SwsPixelType o = U8; o < SWS_PIXEL_TYPE_NB; o++) {
+            const char *otype = ff_sws_pixel_type_name(o);
+            const int osize = ff_sws_pixel_type_size(o);
+            const char *name = FMT("convert_%s_%s", itype, otype);
+            if (i == o)
+                continue;
+
+            if (isize < osize || !ff_sws_pixel_type_is_int(o)) {
+                CHECK_COMMON(name, i, o, {
+                    .op = SWS_OP_CONVERT,
+                    .type = i,
+                    .convert.to = o,
+                });
+            } else if (isize > osize || !ff_sws_pixel_type_is_int(i)) {
+                /* 64-bit shift avoids undefined behavior when osize == 4 */
+                uint32_t range = (uint32_t)((1ULL << osize * 8) - 1);
+                CHECK_COMMON_RANGE(name, range, i, o, {
+                    .op = SWS_OP_CONVERT,
+                    .type = i,
+                    .convert.to = o,
+                });
+            }
+        }
+    }
+
+    /* Check expanding conversions */
+    CHECK_COMMON("expand16", U8, U16, {
+        .op = SWS_OP_CONVERT,
+        .type = U8,
+        .convert.to = U16,
+        .convert.expand = true,
+    });
+
+    CHECK_COMMON("expand32", U8, U32, {
+        .op = SWS_OP_CONVERT,
+        .type = U8,
+        .convert.to = U32,
+        .convert.expand = true,
+    });
+}
+
+static void check_dither(void)
+{
+    for (SwsPixelType t = F32; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        if (ff_sws_pixel_type_is_int(t))
+            continue;
+
+        /* Test all sizes up to 256x256 */
+        for (int size_log2 = 0; size_log2 <= 8; size_log2++) {
+            const int size = 1 << size_log2;
+            AVRational *matrix = av_refstruct_allocz(size * size * sizeof(*matrix));
+            if (!matrix) {
+                fail();
+                return;
+            }
+
+            if (size == 1) {
+                matrix[0] = (AVRational) { 1, 2 };
+            } else {
+                for (int i = 0; i < size * size; i++)
+                    matrix[i] = rndq(t);
+            }
+
+            CHECK_COMMON(FMT("dither_%dx%d_%s", size, size, type), t, t, {
+                .op = SWS_OP_DITHER,
+                .type = t,
+                .dither.size_log2 = size_log2,
+                .dither.matrix = matrix,
+            });
+
+            av_refstruct_unref(&matrix);
+        }
+    }
+}
+
+static void check_min_max(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        CHECK_COMMON(FMT("min_%s", type), t, t, {
+            .op = SWS_OP_MIN,
+            .type = t,
+            .c.q4 = { rndq(t), rndq(t), rndq(t), rndq(t) },
+        });
+
+        CHECK_COMMON(FMT("max_%s", type), t, t, {
+            .op = SWS_OP_MAX,
+            .type = t,
+            .c.q4 = { rndq(t), rndq(t), rndq(t), rndq(t) },
+        });
+    }
+}
+
+static void check_linear(void)
+{
+    static const struct {
+        const char *name;
+        uint32_t mask;
+    } patterns[] = {
+        { "noop",               0 },
+        { "luma",               SWS_MASK_LUMA },
+        { "alpha",              SWS_MASK_ALPHA },
+        { "luma+alpha",         SWS_MASK_LUMA | SWS_MASK_ALPHA },
+        { "dot3",               0b111 },
+        { "dot4",               0b1111 },
+        { "row0",               SWS_MASK_ROW(0) },
+        { "row0+alpha",         SWS_MASK_ROW(0) | SWS_MASK_ALPHA },
+        { "off3",               SWS_MASK_OFF3 },
+        { "off3+alpha",         SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag3",              SWS_MASK_DIAG3 },
+        { "diag4",              SWS_MASK_DIAG4 },
+        { "diag3+alpha",        SWS_MASK_DIAG3 | SWS_MASK_ALPHA },
+        { "diag3+off3",         SWS_MASK_DIAG3 | SWS_MASK_OFF3 },
+        { "diag3+off3+alpha",   SWS_MASK_DIAG3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag4+off4",         SWS_MASK_DIAG4 | SWS_MASK_OFF4 },
+        { "matrix3",            SWS_MASK_MAT3 },
+        { "matrix3+off3",       SWS_MASK_MAT3 | SWS_MASK_OFF3 },
+        { "matrix3+off3+alpha", SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "matrix4",            SWS_MASK_MAT4 },
+        { "matrix4+off4",       SWS_MASK_MAT4 | SWS_MASK_OFF4 },
+    };
+
+    for (SwsPixelType t = F32; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        if (ff_sws_pixel_type_is_int(t))
+            continue;
+
+        for (int p = 0; p < FF_ARRAY_ELEMS(patterns); p++) {
+            const uint32_t mask = patterns[p].mask;
+            SwsLinearOp lin = { .mask = mask };
+
+            for (int i = 0; i < 4; i++) {
+                for (int j = 0; j < 5; j++) {
+                    if (mask & SWS_MASK(i, j)) {
+                        lin.m[i][j] = rndq(t);
+                    } else {
+                        lin.m[i][j] = (AVRational) { i == j, 1 };
+                    }
+                }
+            }
+
+            CHECK(FMT("linear_%s_%s", patterns[p].name, type), 4, 4, t, t, {
+                .op = SWS_OP_LINEAR,
+                .type = t,
+                .lin = lin,
+            });
+        }
+    }
+}
+
+static void check_scale(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        const int bits = ff_sws_pixel_type_size(t) * 8;
+        if (ff_sws_pixel_type_is_int(t)) {
+            /* Ensure the result won't exceed the value range; the 64-bit
+             * shift avoids undefined behavior for 32-bit types */
+            const unsigned max = (unsigned)((1ULL << bits) - 1);
+            const unsigned scale = rnd() & max;
+            const unsigned range = max / (scale ? scale : 1);
+            CHECK_COMMON_RANGE(FMT("scale_%s", type), range, t, t, {
+                .op   = SWS_OP_SCALE,
+                .type = t,
+                .c.q  = { scale, 1 },
+            });
+        } else {
+            CHECK_COMMON(FMT("scale_%s", type), t, t, {
+                .op   = SWS_OP_SCALE,
+                .type = t,
+                .c.q  = rndq(t),
+            });
+        }
+    }
+}
+
+void checkasm_check_sw_ops(void)
+{
+    check_read_write();
+    report("read_write");
+    check_swap_bytes();
+    report("swap_bytes");
+    check_pack_unpack();
+    report("pack_unpack");
+    check_clear();
+    report("clear");
+    check_shift();
+    report("shift");
+    check_swizzle();
+    report("swizzle");
+    check_convert();
+    report("convert");
+    check_dither();
+    report("dither");
+    check_min_max();
+    report("min_max");
+    check_linear();
+    report("linear");
+    check_scale();
+    report("scale");
+}
-- 
2.49.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [FFmpeg-devel] [PATCH v2 16/17] swscale/format: add new format decode/encode logic
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (14 preceding siblings ...)
  2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 15/17] tests/checkasm: add checkasm tests for swscale ops Niklas Haas
@ 2025-05-21 12:44 ` Niklas Haas
  2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 17/17] swscale/graph: allow experimental use of new format handler Niklas Haas
  16 siblings, 0 replies; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:44 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This patch adds format handling code for the new operations. This entails
fully decoding a format to standardized RGB, and the inverse.

Handling it this way means we can always guarantee that a conversion path
exists from A to B without having to explicitly cover the logic for each
pair of formats; and choosing RGB instead of YUV as the intermediate (as was
done in swscale v1) is more flexible with regard to enabling further
operations such as primaries conversions, linear scaling, etc.

In the case of a YUV->YUV transform, the redundant matrix multiplication
will be canceled out anyway.
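
To illustrate how these pieces compose, here is a condensed sketch of how
the following patch chains the four entry points added here (error handling
and the SWS_UNSTABLE gating elided):

    const SwsPixelType type = SWS_PIXEL_F32;
    bool incomplete = false;
    SwsOpList *ops = ff_sws_op_list_alloc();
    ops->src = src;
    ops->dst = dst;
    ff_sws_decode_pixfmt(ops, src.format);                  /* raw -> pixels */
    ff_sws_decode_colors(ctx, type, ops, src, &incomplete); /* pixels -> RGB */
    ff_sws_encode_colors(ctx, type, ops, dst, &incomplete); /* RGB -> pixels */
    ff_sws_encode_pixfmt(ops, dst.format);                  /* pixels -> raw */
    ff_sws_op_list_optimize(ops); /* cancels e.g. paired YUV<->RGB matrices */
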
---
 libswscale/format.c | 926 ++++++++++++++++++++++++++++++++++++++++++++
 libswscale/format.h |  23 ++
 2 files changed, 949 insertions(+)

diff --git a/libswscale/format.c b/libswscale/format.c
index b77081dd7a..7cbc5b37db 100644
--- a/libswscale/format.c
+++ b/libswscale/format.c
@@ -21,8 +21,22 @@
 #include "libavutil/avassert.h"
 #include "libavutil/hdr_dynamic_metadata.h"
 #include "libavutil/mastering_display_metadata.h"
+#include "libavutil/refstruct.h"
 
 #include "format.h"
+#include "csputils.h"
+#include "ops_internal.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+#define Q0   Q(0)
+#define Q1   Q(1)
+
+#define RET(x)                                                                 \
+    do {                                                                       \
+        int __ret = (x);                                                       \
+        if (__ret  < 0)                                                        \
+            return __ret;                                                      \
+    } while (0)
 
 typedef struct LegacyFormatEntry {
     uint8_t is_supported_in         :1;
@@ -582,3 +596,915 @@ int sws_is_noop(const AVFrame *dst, const AVFrame *src)
 
     return 1;
 }
+
+/* Returns the type suitable for a pixel after fully decoding/unpacking it */
+static SwsPixelType fmt_pixel_type(enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    const int bits = FFALIGN(desc->comp[0].depth, 8);
+    if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) {
+        switch (bits) {
+        case 32: return SWS_PIXEL_F32;
+        }
+    } else {
+        switch (bits) {
+        case  8: return SWS_PIXEL_U8;
+        case 16: return SWS_PIXEL_U16;
+        case 32: return SWS_PIXEL_U32;
+        }
+    }
+
+    return SWS_PIXEL_NONE;
+}
+
+static SwsSwizzleOp fmt_swizzle(enum AVPixelFormat fmt)
+{
+    switch (fmt) {
+    case AV_PIX_FMT_ARGB:
+    case AV_PIX_FMT_0RGB:
+    case AV_PIX_FMT_AYUV64LE:
+    case AV_PIX_FMT_AYUV64BE:
+    case AV_PIX_FMT_AYUV:
+    case AV_PIX_FMT_X2RGB10LE:
+    case AV_PIX_FMT_X2RGB10BE:
+        return (SwsSwizzleOp) {{ .x = 3, 0, 1, 2 }};
+    case AV_PIX_FMT_BGR24:
+    case AV_PIX_FMT_BGR8:
+    case AV_PIX_FMT_BGR4:
+    case AV_PIX_FMT_BGR4_BYTE:
+    case AV_PIX_FMT_BGRA:
+    case AV_PIX_FMT_BGR565BE:
+    case AV_PIX_FMT_BGR565LE:
+    case AV_PIX_FMT_BGR555BE:
+    case AV_PIX_FMT_BGR555LE:
+    case AV_PIX_FMT_BGR444BE:
+    case AV_PIX_FMT_BGR444LE:
+    case AV_PIX_FMT_BGR48BE:
+    case AV_PIX_FMT_BGR48LE:
+    case AV_PIX_FMT_BGRA64BE:
+    case AV_PIX_FMT_BGRA64LE:
+    case AV_PIX_FMT_BGR0:
+    case AV_PIX_FMT_VUYA:
+    case AV_PIX_FMT_VUYX:
+        return (SwsSwizzleOp) {{ .x = 2, 1, 0, 3 }};
+    case AV_PIX_FMT_ABGR:
+    case AV_PIX_FMT_0BGR:
+    case AV_PIX_FMT_X2BGR10LE:
+    case AV_PIX_FMT_X2BGR10BE:
+        return (SwsSwizzleOp) {{ .x = 3, 2, 1, 0 }};
+    case AV_PIX_FMT_YA8:
+    case AV_PIX_FMT_YA16BE:
+    case AV_PIX_FMT_YA16LE:
+        return (SwsSwizzleOp) {{ .x = 0, 3, 1, 2 }};
+    case AV_PIX_FMT_XV30BE:
+    case AV_PIX_FMT_XV30LE:
+        return (SwsSwizzleOp) {{ .x = 3, 2, 0, 1 }};
+    case AV_PIX_FMT_VYU444:
+    case AV_PIX_FMT_V30XBE:
+    case AV_PIX_FMT_V30XLE:
+        return (SwsSwizzleOp) {{ .x = 2, 0, 1, 3 }};
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XV48BE:
+    case AV_PIX_FMT_XV48LE:
+    case AV_PIX_FMT_UYVA:
+        return (SwsSwizzleOp) {{ .x = 1, 0, 2, 3 }};
+    case AV_PIX_FMT_GBRP:
+    case AV_PIX_FMT_GBRP9BE:
+    case AV_PIX_FMT_GBRP9LE:
+    case AV_PIX_FMT_GBRP10BE:
+    case AV_PIX_FMT_GBRP10LE:
+    case AV_PIX_FMT_GBRP12BE:
+    case AV_PIX_FMT_GBRP12LE:
+    case AV_PIX_FMT_GBRP14BE:
+    case AV_PIX_FMT_GBRP14LE:
+    case AV_PIX_FMT_GBRP16BE:
+    case AV_PIX_FMT_GBRP16LE:
+    case AV_PIX_FMT_GBRPF16BE:
+    case AV_PIX_FMT_GBRPF16LE:
+    case AV_PIX_FMT_GBRAP:
+    case AV_PIX_FMT_GBRAP10LE:
+    case AV_PIX_FMT_GBRAP10BE:
+    case AV_PIX_FMT_GBRAP12LE:
+    case AV_PIX_FMT_GBRAP12BE:
+    case AV_PIX_FMT_GBRAP14LE:
+    case AV_PIX_FMT_GBRAP14BE:
+    case AV_PIX_FMT_GBRAP16LE:
+    case AV_PIX_FMT_GBRAP16BE:
+    case AV_PIX_FMT_GBRPF32BE:
+    case AV_PIX_FMT_GBRPF32LE:
+    case AV_PIX_FMT_GBRAPF16BE:
+    case AV_PIX_FMT_GBRAPF16LE:
+    case AV_PIX_FMT_GBRAPF32BE:
+    case AV_PIX_FMT_GBRAPF32LE:
+        return (SwsSwizzleOp) {{ .x = 1, 2, 0, 3 }};
+    default:
+        return (SwsSwizzleOp) {{ .x = 0, 1, 2, 3 }};
+    }
+}
+
+static SwsSwizzleOp swizzle_inv(SwsSwizzleOp swiz)
+{
+    /* Input[x] =: Output[swizzle.x] */
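+    /* e.g. the ARGB swizzle {3, 0, 1, 2} inverts to {1, 2, 3, 0} */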
+    unsigned out[4];
+    out[swiz.x] = 0;
+    out[swiz.y] = 1;
+    out[swiz.z] = 2;
+    out[swiz.w] = 3;
+    return (SwsSwizzleOp) {{ .x = out[0], out[1], out[2], out[3] }};
+}
+
+/* Shift factor for MSB aligned formats */
+static int fmt_shift(enum AVPixelFormat fmt)
+{
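+    /* e.g. P010 carries 10 significant bits in the MSBs of each 16-bit
+     * word, leaving 16 - 10 = 6 low bits of padding */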
+    switch (fmt) {
+    case AV_PIX_FMT_P010BE:
+    case AV_PIX_FMT_P010LE:
+    case AV_PIX_FMT_P210BE:
+    case AV_PIX_FMT_P210LE:
+    case AV_PIX_FMT_Y210BE:
+    case AV_PIX_FMT_Y210LE:
+        return 6;
+    case AV_PIX_FMT_P012BE:
+    case AV_PIX_FMT_P012LE:
+    case AV_PIX_FMT_P212BE:
+    case AV_PIX_FMT_P212LE:
+    case AV_PIX_FMT_P412BE:
+    case AV_PIX_FMT_P412LE:
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XYZ12BE:
+    case AV_PIX_FMT_XYZ12LE:
+        return 4;
+    }
+
+    return 0;
+}
+
+/**
+ * This initializes all absent components explicitly to zero. There is no
+ * need to worry about the correct neutral value as fmt_decode() will
+ * implicitly ignore and overwrite absent components in any case. This function
+ * is just to ensure that we don't operate on undefined memory. In most cases,
+ * it will end up getting pushed towards the output or optimized away entirely
+ * by the optimization pass.
+ */
+static SwsConst fmt_clear(enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    const bool has_chroma = desc->nb_components >= 3;
+    const bool has_alpha  = desc->flags & AV_PIX_FMT_FLAG_ALPHA;
+
+    SwsConst c = {0};
+    if (!has_chroma)
+        c.q4[1] = c.q4[2] = Q0;
+    if (!has_alpha)
+        c.q4[3] = Q0;
+
+    return c;
+}
+
+static int fmt_read_write(enum AVPixelFormat fmt, SwsReadWriteOp *rw_op,
+                          SwsPackOp *pack_op)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    if (!desc)
+        return AVERROR(EINVAL);
+
+    switch (fmt) {
+    case AV_PIX_FMT_NONE:
+    case AV_PIX_FMT_NB:
+        break;
+
+    /* Packed bitstream formats */
+    case AV_PIX_FMT_MONOWHITE:
+    case AV_PIX_FMT_MONOBLACK:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) {
+            .elems = 1,
+            .frac  = 3,
+        };
+        return 0;
+    case AV_PIX_FMT_RGB4:
+    case AV_PIX_FMT_BGR4:
+        *pack_op = (SwsPackOp) {{ 1, 2, 1 }};
+        *rw_op = (SwsReadWriteOp) {
+            .elems = 1,
+            .packed = true,
+            .frac  = 1,
+        };
+        return 0;
+    /* Packed 8-bit aligned formats */
+    case AV_PIX_FMT_RGB4_BYTE:
+    case AV_PIX_FMT_BGR4_BYTE:
+        *pack_op = (SwsPackOp) {{ 1, 2, 1 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_BGR8:
+        *pack_op = (SwsPackOp) {{ 2, 3, 3 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_RGB8:
+        *pack_op = (SwsPackOp) {{ 3, 3, 2 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+
+    /* Packed 16-bit aligned formats */
+    case AV_PIX_FMT_RGB565BE:
+    case AV_PIX_FMT_RGB565LE:
+    case AV_PIX_FMT_BGR565BE:
+    case AV_PIX_FMT_BGR565LE:
+        *pack_op = (SwsPackOp) {{ 5, 6, 5 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_RGB555BE:
+    case AV_PIX_FMT_RGB555LE:
+    case AV_PIX_FMT_BGR555BE:
+    case AV_PIX_FMT_BGR555LE:
+        *pack_op = (SwsPackOp) {{ 5, 5, 5 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_RGB444BE:
+    case AV_PIX_FMT_RGB444LE:
+    case AV_PIX_FMT_BGR444BE:
+    case AV_PIX_FMT_BGR444LE:
+        *pack_op = (SwsPackOp) {{ 4, 4, 4 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    /* Packed 32-bit aligned 4:4:4 formats */
+    case AV_PIX_FMT_X2RGB10BE:
+    case AV_PIX_FMT_X2RGB10LE:
+    case AV_PIX_FMT_X2BGR10BE:
+    case AV_PIX_FMT_X2BGR10LE:
+    case AV_PIX_FMT_XV30BE:
+    case AV_PIX_FMT_XV30LE:
+        *pack_op = (SwsPackOp) {{ 2, 10, 10, 10 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_V30XBE:
+    case AV_PIX_FMT_V30XLE:
+        *pack_op = (SwsPackOp) {{ 10, 10, 10, 2 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    /* 3 component formats with one channel ignored */
+    case AV_PIX_FMT_RGB0:
+    case AV_PIX_FMT_BGR0:
+    case AV_PIX_FMT_0RGB:
+    case AV_PIX_FMT_0BGR:
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XV48BE:
+    case AV_PIX_FMT_XV48LE:
+    case AV_PIX_FMT_VUYX:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) { .elems = 4, .packed = true };
+        return 0;
+    /* Unpacked byte-aligned 4:4:4 formats */
+    case AV_PIX_FMT_YUV444P:
+    case AV_PIX_FMT_YUVJ444P:
+    case AV_PIX_FMT_YUV444P9BE:
+    case AV_PIX_FMT_YUV444P9LE:
+    case AV_PIX_FMT_YUV444P10BE:
+    case AV_PIX_FMT_YUV444P10LE:
+    case AV_PIX_FMT_YUV444P12BE:
+    case AV_PIX_FMT_YUV444P12LE:
+    case AV_PIX_FMT_YUV444P14BE:
+    case AV_PIX_FMT_YUV444P14LE:
+    case AV_PIX_FMT_YUV444P16BE:
+    case AV_PIX_FMT_YUV444P16LE:
+    case AV_PIX_FMT_YUVA444P:
+    case AV_PIX_FMT_YUVA444P9BE:
+    case AV_PIX_FMT_YUVA444P9LE:
+    case AV_PIX_FMT_YUVA444P10BE:
+    case AV_PIX_FMT_YUVA444P10LE:
+    case AV_PIX_FMT_YUVA444P12BE:
+    case AV_PIX_FMT_YUVA444P12LE:
+    case AV_PIX_FMT_YUVA444P16BE:
+    case AV_PIX_FMT_YUVA444P16LE:
+    case AV_PIX_FMT_AYUV:
+    case AV_PIX_FMT_UYVA:
+    case AV_PIX_FMT_VYU444:
+    case AV_PIX_FMT_AYUV64BE:
+    case AV_PIX_FMT_AYUV64LE:
+    case AV_PIX_FMT_VUYA:
+    case AV_PIX_FMT_RGB24:
+    case AV_PIX_FMT_BGR24:
+    case AV_PIX_FMT_RGB48BE:
+    case AV_PIX_FMT_RGB48LE:
+    case AV_PIX_FMT_BGR48BE:
+    case AV_PIX_FMT_BGR48LE:
+    //case AV_PIX_FMT_RGB96BE: TODO: AVRational can't fit 2^32-1
+    //case AV_PIX_FMT_RGB96LE:
+    //case AV_PIX_FMT_RGBF16BE: TODO: no support for float16 currently
+    //case AV_PIX_FMT_RGBF16LE:
+    case AV_PIX_FMT_RGBF32BE:
+    case AV_PIX_FMT_RGBF32LE:
+    case AV_PIX_FMT_ARGB:
+    case AV_PIX_FMT_RGBA:
+    case AV_PIX_FMT_ABGR:
+    case AV_PIX_FMT_BGRA:
+    case AV_PIX_FMT_RGBA64BE:
+    case AV_PIX_FMT_RGBA64LE:
+    case AV_PIX_FMT_BGRA64BE:
+    case AV_PIX_FMT_BGRA64LE:
+    //case AV_PIX_FMT_RGBA128BE: TODO: AVRational can't fit 2^32-1
+    //case AV_PIX_FMT_RGBA128LE:
+    case AV_PIX_FMT_RGBAF32BE:
+    case AV_PIX_FMT_RGBAF32LE:
+    case AV_PIX_FMT_GBRP:
+    case AV_PIX_FMT_GBRP9BE:
+    case AV_PIX_FMT_GBRP9LE:
+    case AV_PIX_FMT_GBRP10BE:
+    case AV_PIX_FMT_GBRP10LE:
+    case AV_PIX_FMT_GBRP12BE:
+    case AV_PIX_FMT_GBRP12LE:
+    case AV_PIX_FMT_GBRP14BE:
+    case AV_PIX_FMT_GBRP14LE:
+    case AV_PIX_FMT_GBRP16BE:
+    case AV_PIX_FMT_GBRP16LE:
+    //case AV_PIX_FMT_GBRPF16BE: TODO
+    //case AV_PIX_FMT_GBRPF16LE:
+    case AV_PIX_FMT_GBRPF32BE:
+    case AV_PIX_FMT_GBRPF32LE:
+    case AV_PIX_FMT_GBRAP:
+    case AV_PIX_FMT_GBRAP10BE:
+    case AV_PIX_FMT_GBRAP10LE:
+    case AV_PIX_FMT_GBRAP12BE:
+    case AV_PIX_FMT_GBRAP12LE:
+    case AV_PIX_FMT_GBRAP14BE:
+    case AV_PIX_FMT_GBRAP14LE:
+    case AV_PIX_FMT_GBRAP16BE:
+    case AV_PIX_FMT_GBRAP16LE:
+    //case AV_PIX_FMT_GBRAPF16BE: TODO
+    //case AV_PIX_FMT_GBRAPF16LE:
+    case AV_PIX_FMT_GBRAPF32BE:
+    case AV_PIX_FMT_GBRAPF32LE:
+    case AV_PIX_FMT_GRAY8:
+    case AV_PIX_FMT_GRAY9BE:
+    case AV_PIX_FMT_GRAY9LE:
+    case AV_PIX_FMT_GRAY10BE:
+    case AV_PIX_FMT_GRAY10LE:
+    case AV_PIX_FMT_GRAY12BE:
+    case AV_PIX_FMT_GRAY12LE:
+    case AV_PIX_FMT_GRAY14BE:
+    case AV_PIX_FMT_GRAY14LE:
+    case AV_PIX_FMT_GRAY16BE:
+    case AV_PIX_FMT_GRAY16LE:
+    //case AV_PIX_FMT_GRAYF16BE: TODO
+    //case AV_PIX_FMT_GRAYF16LE:
+    //case AV_PIX_FMT_YAF16BE:
+    //case AV_PIX_FMT_YAF16LE:
+    case AV_PIX_FMT_GRAYF32BE:
+    case AV_PIX_FMT_GRAYF32LE:
+    case AV_PIX_FMT_YAF32BE:
+    case AV_PIX_FMT_YAF32LE:
+    case AV_PIX_FMT_YA8:
+    case AV_PIX_FMT_YA16LE:
+    case AV_PIX_FMT_YA16BE:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) {
+            .elems  = desc->nb_components,
+            .packed = desc->nb_components > 1 && !(desc->flags & AV_PIX_FMT_FLAG_PLANAR),
+        };
+        return 0;
+    }
+
+    return AVERROR(ENOTSUP);
+}
+
+static SwsPixelType get_packed_type(SwsPackOp pack)
+{
+    const int sum = pack.pattern[0] + pack.pattern[1] +
+                    pack.pattern[2] + pack.pattern[3];
+    if (sum > 16)
+        return SWS_PIXEL_U32;
+    else if (sum > 8)
+        return SWS_PIXEL_U16;
+    else
+        return SWS_PIXEL_U8;
+}
+
+#if HAVE_BIGENDIAN
+#  define NATIVE_ENDIAN_FLAG AV_PIX_FMT_FLAG_BE
+#else
+#  define NATIVE_ENDIAN_FLAG 0
+#endif
+
+int ff_sws_decode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    SwsPixelType pixel_type = fmt_pixel_type(fmt);
+    SwsPixelType raw_type = pixel_type;
+    SwsReadWriteOp rw_op;
+    SwsPackOp unpack;
+
+    RET(fmt_read_write(fmt, &rw_op, &unpack));
+    if (unpack.pattern[0])
+        raw_type = get_packed_type(unpack);
+
+    /* TODO: handle subsampled or semipacked input formats */
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_READ,
+        .type = raw_type,
+        .rw   = rw_op,
+    }));
+
+    if ((desc->flags & AV_PIX_FMT_FLAG_BE) != NATIVE_ENDIAN_FLAG) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_SWAP_BYTES,
+            .type = raw_type,
+        }));
+    }
+
+    if (unpack.pattern[0]) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_UNPACK,
+            .type = raw_type,
+            .pack = unpack,
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CONVERT,
+            .type = raw_type,
+            .convert.to = pixel_type,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op      = SWS_OP_SWIZZLE,
+        .type    = pixel_type,
+        .swizzle = swizzle_inv(fmt_swizzle(fmt)),
+    }));
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_RSHIFT,
+        .type = pixel_type,
+        .c.u  = fmt_shift(fmt),
+    }));
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_CLEAR,
+        .type = pixel_type,
+        .c    = fmt_clear(fmt),
+    }));
+
+    return 0;
+}
+
+int ff_sws_encode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    SwsPixelType pixel_type = fmt_pixel_type(fmt);
+    SwsPixelType raw_type = pixel_type;
+    SwsReadWriteOp rw_op;
+    SwsPackOp pack;
+
+    RET(fmt_read_write(fmt, &rw_op, &pack));
+    if (pack.pattern[0])
+        raw_type = get_packed_type(pack);
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_LSHIFT,
+        .type = pixel_type,
+        .c.u  = fmt_shift(fmt),
+    }));
+
+    if (rw_op.elems > desc->nb_components) {
+        /* The format writes an unused alpha channel; clear it explicitly for sanity */
+        av_assert1(!(desc->flags & AV_PIX_FMT_FLAG_ALPHA));
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CLEAR,
+            .type = pixel_type,
+            .c.q4[3] = Q0,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op      = SWS_OP_SWIZZLE,
+        .type    = pixel_type,
+        .swizzle = fmt_swizzle(fmt),
+    }));
+
+    if (pack.pattern[0]) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CONVERT,
+            .type = pixel_type,
+            .convert.to = raw_type,
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_PACK,
+            .type = raw_type,
+            .pack = pack,
+        }));
+    }
+
+    if ((desc->flags & AV_PIX_FMT_FLAG_BE) != NATIVE_ENDIAN_FLAG) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_SWAP_BYTES,
+            .type = raw_type,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_WRITE,
+        .type = raw_type,
+        .rw   = rw_op,
+    }));
+    return 0;
+}
+
+static inline AVRational av_neg_q(AVRational x)
+{
+    return (AVRational) { -x.num, x.den };
+}
+
+static SwsLinearOp fmt_encode_range(const SwsFormat fmt, bool *incomplete)
+{
+    SwsLinearOp c = { .m = {
+        { Q1, Q0, Q0, Q0, Q0 },
+        { Q0, Q1, Q0, Q0, Q0 },
+        { Q0, Q0, Q1, Q0, Q0 },
+        { Q0, Q0, Q0, Q1, Q0 },
+    }};
+
+    const int depth0 = fmt.desc->comp[0].depth;
+    const int depth1 = fmt.desc->comp[1].depth;
+    const int depth2 = fmt.desc->comp[2].depth;
+    const int depth3 = fmt.desc->comp[3].depth;
+
+    if (fmt.desc->flags & AV_PIX_FMT_FLAG_FLOAT)
+        return c; /* floats are directly output as-is */
+
+    if (fmt.csp == AVCOL_SPC_RGB || (fmt.desc->flags & AV_PIX_FMT_FLAG_XYZ)) {
+        c.m[0][0] = Q((1 << depth0) - 1);
+        c.m[1][1] = Q((1 << depth1) - 1);
+        c.m[2][2] = Q((1 << depth2) - 1);
+    } else if (fmt.range == AVCOL_RANGE_JPEG) {
+        /* Full range YUV */
+        c.m[0][0] = Q((1 << depth0) - 1);
+        if (fmt.desc->nb_components >= 3) {
+            /* This follows the ITU-R convention, which is slightly different
+             * from the JFIF convention. */
+            c.m[1][1] = Q((1 << depth1) - 1);
+            c.m[2][2] = Q((1 << depth2) - 1);
+            c.m[1][4] = Q(1 << (depth1 - 1));
+            c.m[2][4] = Q(1 << (depth2 - 1));
+        }
+    } else {
+        /* Limited range YUV */
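+        /* Standard ITU-R narrow range: at 8 bits, luma spans [16, 235]
+         * (scale 219) and chroma spans [16, 240] (scale 224, centered on
+         * 128); scaled accordingly for higher bit depths */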
+        if (fmt.range == AVCOL_RANGE_UNSPECIFIED)
+            *incomplete = true;
+        c.m[0][0] = Q(219 << (depth0 - 8));
+        c.m[0][4] = Q( 16 << (depth0 - 8));
+        if (fmt.desc->nb_components >= 3) {
+            c.m[1][1] = Q(224 << (depth1 - 8));
+            c.m[2][2] = Q(224 << (depth2 - 8));
+            c.m[1][4] = Q(128 << (depth1 - 8));
+            c.m[2][4] = Q(128 << (depth2 - 8));
+        }
+    }
+
+    if (fmt.desc->flags & AV_PIX_FMT_FLAG_ALPHA) {
+        const bool is_ya = fmt.desc->nb_components == 2;
+        c.m[3][3] = Q((1 << (is_ya ? depth1 : depth3)) - 1);
+    }
+
+    if (fmt.format == AV_PIX_FMT_MONOWHITE) {
+        /* This format is inverted, 0 = white, 1 = black */
+        c.m[0][4] = av_add_q(c.m[0][4], c.m[0][0]);
+        c.m[0][0] = av_neg_q(c.m[0][0]);
+    }
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+static SwsLinearOp fmt_decode_range(const SwsFormat fmt, bool *incomplete)
+{
+    SwsLinearOp c = fmt_encode_range(fmt, incomplete);
+
+    /* Invert main diagonal + offset: x = s * y + k  ==>  y = (x - k) / s */
+    for (int i = 0; i < 4; i++) {
+        c.m[i][i] = av_inv_q(c.m[i][i]);
+        c.m[i][4] = av_mul_q(c.m[i][4], av_neg_q(c.m[i][i]));
+    }
+
+    /* Explicitly initialize alpha for sanity */
+    if (!(fmt.desc->flags & AV_PIX_FMT_FLAG_ALPHA))
+        c.m[3][4] = Q1;
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+static AVRational *generate_bayer_matrix(const int size_log2)
+{
+    const int size = 1 << size_log2;
+    const int num_entries = size * size;
+    AVRational *m = av_refstruct_allocz(sizeof(*m) * num_entries);
+    av_assert1(size_log2 < 16);
+    if (!m)
+        return NULL;
+
+    /* Start with a 1x1 matrix */
+    m[0] = Q0;
+
+    /* Generate three copies of the current, appropriately scaled and offset */
+    for (int sz = 1; sz < size; sz <<= 1) {
+        const int den = 4 * sz * sz;
+        for (int y = 0; y < sz; y++) {
+            for (int x = 0; x < sz; x++) {
+                const AVRational cur = m[y * size + x];
+                m[(y + sz) * size + x + sz] = av_add_q(cur, av_make_q(1, den));
+                m[(y     ) * size + x + sz] = av_add_q(cur, av_make_q(2, den));
+                m[(y + sz) * size + x     ] = av_add_q(cur, av_make_q(3, den));
+            }
+        }
+    }
+
+    /**
+     * To correctly round, we need to evenly distribute the result on [0, 1),
+     * giving an average value of 1/2.
+     *
+     * After the above construction, we have a matrix with average value:
+     *   [ 0/N + 1/N + 2/N + ... (N-1)/N ] / N = (N-1)/(2N)
+     * where N = size * size is the total number of entries.
+     *
+     * To make the average value equal to 1/2 = N/(2N), add a bias of 1/(2N).
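+     *
+     * For example, size_log2 = 1 yields the classic 2x2 Bayer matrix
+     *   { 1/8, 5/8,
+     *     7/8, 3/8 }
+     * after the bias is applied.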
+     */
+    for (int i = 0; i < num_entries; i++)
+        m[i] = av_add_q(m[i], av_make_q(1, 2 * num_entries));
+
+    return m;
+}
+
+static bool trc_is_hdr(enum AVColorTransferCharacteristic trc)
+{
+    static_assert(AVCOL_TRC_NB == 19, "Update this list when adding TRCs");
+    switch (trc) {
+    case AVCOL_TRC_LOG:
+    case AVCOL_TRC_LOG_SQRT:
+    case AVCOL_TRC_SMPTEST2084:
+    case AVCOL_TRC_ARIB_STD_B67:
+        return true;
+    default:
+        return false;
+    }
+}
+
+static int fmt_dither(SwsContext *ctx, SwsOpList *ops,
+                      const SwsPixelType type, const SwsFormat fmt)
+{
+    SwsDither mode = ctx->dither;
+    SwsDitherOp dither;
+
+    if (mode == SWS_DITHER_AUTO) {
+        /* Visual threshold of perception: 12 bits for SDR, 14 bits for HDR */
+        const int jnd_bits = trc_is_hdr(fmt.color.trc) ? 14 : 12;
+        const int bpc = fmt.desc->comp[0].depth;
+        mode = bpc >= jnd_bits ? SWS_DITHER_NONE : SWS_DITHER_BAYER;
+    }
+
+    switch (mode) {
+    case SWS_DITHER_NONE:
+        if (ctx->flags & SWS_ACCURATE_RND) {
+            /* Add constant 0.5 for correct rounding */
+            AVRational *bias = av_refstruct_allocz(sizeof(*bias));
+            if (!bias)
+                return AVERROR(ENOMEM);
+            *bias = (AVRational) {1, 2};
+            return ff_sws_op_list_append(ops, &(SwsOp) {
+                .op   = SWS_OP_DITHER,
+                .type = type,
+                .dither.matrix = bias,
+            });
+        } else {
+            return 0; /* No-op */
+        }
+    case SWS_DITHER_BAYER:
+        /* Hardcode a 16x16 matrix for now; in theory we could adjust this
+         * based on the expected level of precision in the output, since
+         * lower bit depth outputs can get by with smaller dither matrices;
+         * in practice, however, we probably want to use error diffusion for
+         * such low bit depths anyway */
+        dither.size_log2 = 4;
+        dither.matrix = generate_bayer_matrix(dither.size_log2);
+        if (!dither.matrix)
+            return AVERROR(ENOMEM);
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .op     = SWS_OP_DITHER,
+            .type   = type,
+            .dither = dither,
+        });
+    case SWS_DITHER_ED:
+    case SWS_DITHER_A_DITHER:
+    case SWS_DITHER_X_DITHER:
+        return AVERROR(ENOTSUP);
+
+    case SWS_DITHER_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid dither mode");
+    return AVERROR(EINVAL);
+}
+
+static inline SwsLinearOp
+linear_mat3(const AVRational m00, const AVRational m01, const AVRational m02,
+            const AVRational m10, const AVRational m11, const AVRational m12,
+            const AVRational m20, const AVRational m21, const AVRational m22)
+{
+    SwsLinearOp c = {{
+        { m00, m01, m02, Q0, Q0 },
+        { m10, m11, m12, Q0, Q0 },
+        { m20, m21, m22, Q0, Q0 },
+        {  Q0,  Q0,  Q0, Q1, Q0 },
+    }};
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+int ff_sws_decode_colors(SwsContext *ctx, SwsPixelType type,
+                         SwsOpList *ops, const SwsFormat fmt, bool *incomplete)
+{
+    const AVLumaCoefficients *c = av_csp_luma_coeffs_from_avcsp(fmt.csp);
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op         = SWS_OP_CONVERT,
+        .type       = fmt_pixel_type(fmt.format),
+        .convert.to = type,
+    }));
+
+    /* Decode pixel format into standardized range */
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .type = type,
+        .op   = SWS_OP_LINEAR,
+        .lin  = fmt_decode_range(fmt, incomplete),
+    }));
+
+    /* Final step, decode colorspace */
+    switch (fmt.csp) {
+    case AVCOL_SPC_RGB:
+        return 0;
+    case AVCOL_SPC_UNSPECIFIED:
+        c = av_csp_luma_coeffs_from_avcsp(AVCOL_SPC_BT470BG);
+        *incomplete = true;
+        /* fall through */
+    case AVCOL_SPC_FCC:
+    case AVCOL_SPC_BT470BG:
+    case AVCOL_SPC_SMPTE170M:
+    case AVCOL_SPC_BT709:
+    case AVCOL_SPC_SMPTE240M:
+    case AVCOL_SPC_BT2020_NCL: {
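+        /* Inverse of the standard Y'CbCr model: with Y = cr*R + cg*G + cb*B,
+         * Cb = (B - Y) / (2 * (1 - cb)) and Cr = (R - Y) / (2 * (1 - cr)),
+         * so e.g. R = Y + 2 * (1 - cr) * Cr = Y + m02 * Cr */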
+        AVRational crg = av_sub_q(Q0, av_div_q(c->cr, c->cg));
+        AVRational cbg = av_sub_q(Q0, av_div_q(c->cb, c->cg));
+        AVRational m02 = av_mul_q(Q(2), av_sub_q(Q1, c->cr));
+        AVRational m21 = av_mul_q(Q(2), av_sub_q(Q1, c->cb));
+        AVRational m11 = av_mul_q(cbg, m21);
+        AVRational m12 = av_mul_q(crg, m02);
+
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                Q1,  Q0, m02,
+                Q1, m11, m12,
+                Q1, m21,  Q0
+            ),
+        });
+    }
+
+    case AVCOL_SPC_YCGCO:
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                Q1, Q(-1), Q( 1),
+                Q1, Q( 1), Q( 0),
+                Q1, Q(-1), Q(-1)
+            ),
+        });
+
+    case AVCOL_SPC_BT2020_CL:
+    case AVCOL_SPC_SMPTE2085:
+    case AVCOL_SPC_CHROMA_DERIVED_NCL:
+    case AVCOL_SPC_CHROMA_DERIVED_CL:
+    case AVCOL_SPC_ICTCP:
+    case AVCOL_SPC_IPT_C2:
+    case AVCOL_SPC_YCGCO_RE:
+    case AVCOL_SPC_YCGCO_RO:
+        return AVERROR(ENOTSUP);
+
+    case AVCOL_SPC_RESERVED:
+        return AVERROR(EINVAL);
+
+    case AVCOL_SPC_NB:
+        break;
+    }
+
+    av_assert0(!"Corrupt AVColorSpace value?");
+    return AVERROR(EINVAL);
+}
+
+int ff_sws_encode_colors(SwsContext *ctx, SwsPixelType type,
+                         SwsOpList *ops, const SwsFormat fmt, bool *incomplete)
+{
+    const AVLumaCoefficients *c = av_csp_luma_coeffs_from_avcsp(fmt.csp);
+
+    switch (fmt.csp) {
+    case AVCOL_SPC_RGB:
+        break;
+    case AVCOL_SPC_UNSPECIFIED:
+        c = av_csp_luma_coeffs_from_avcsp(AVCOL_SPC_BT470BG);
+        *incomplete = true;
+        /* fall through */
+    case AVCOL_SPC_FCC:
+    case AVCOL_SPC_BT470BG:
+    case AVCOL_SPC_SMPTE170M:
+    case AVCOL_SPC_BT709:
+    case AVCOL_SPC_SMPTE240M:
+    case AVCOL_SPC_BT2020_NCL: {
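+        /* Forward form of the same model: the first row is (cr, cg, cb)
+         * for Y, while the other two rows scale (B - Y) and (R - Y) by
+         * 1 / (2 * (1 - cb)) and 1 / (2 * (1 - cr)) respectively */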
+        AVRational cb1 = av_sub_q(c->cb, Q1);
+        AVRational cr1 = av_sub_q(c->cr, Q1);
+        AVRational m20 = av_make_q(1,2);
+        AVRational m10 = av_mul_q(m20, av_div_q(c->cr, cb1));
+        AVRational m11 = av_mul_q(m20, av_div_q(c->cg, cb1));
+        AVRational m21 = av_mul_q(m20, av_div_q(c->cg, cr1));
+        AVRational m22 = av_mul_q(m20, av_div_q(c->cb, cr1));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                c->cr, c->cg, c->cb,
+                m10,     m11,   m20,
+                m20,     m21,   m22
+            ),
+        }));
+        break;
+    }
+
+    case AVCOL_SPC_YCGCO:
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            /* Rows produce (Y, Cg, Co), matching the decode matrix above */
+            .lin  = linear_mat3(
+                av_make_q( 1, 4), av_make_q(1, 2), av_make_q( 1, 4),
+                av_make_q(-1, 4), av_make_q(1, 2), av_make_q(-1, 4),
+                av_make_q( 1, 2), av_make_q(0, 1), av_make_q(-1, 2)
+            ),
+        }));
+        break;
+
+    case AVCOL_SPC_BT2020_CL:
+    case AVCOL_SPC_SMPTE2085:
+    case AVCOL_SPC_CHROMA_DERIVED_NCL:
+    case AVCOL_SPC_CHROMA_DERIVED_CL:
+    case AVCOL_SPC_ICTCP:
+    case AVCOL_SPC_IPT_C2:
+    case AVCOL_SPC_YCGCO_RE:
+    case AVCOL_SPC_YCGCO_RO:
+        return AVERROR(ENOTSUP);
+
+    case AVCOL_SPC_RESERVED:
+    case AVCOL_SPC_NB:
+        return AVERROR(EINVAL);
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .type = type,
+        .op   = SWS_OP_LINEAR,
+        .lin  = fmt_encode_range(fmt, incomplete),
+    }));
+
+    if (!(fmt.desc->flags & AV_PIX_FMT_FLAG_FLOAT)) {
+        SwsConst range = {0};
+
+        const bool is_ya = fmt.desc->nb_components == 2;
+        for (int i = 0; i < fmt.desc->nb_components; i++) {
+            /* Clamp to legal pixel range */
+            const int idx = i * (is_ya ? 3 : 1);
+            range.q4[idx] = Q((1 << fmt.desc->comp[i].depth) - 1);
+        }
+
+        RET(fmt_dither(ctx, ops, type, fmt));
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_MAX,
+            .type = type,
+            .c.q4 = { Q0, Q0, Q0, Q0 },
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_MIN,
+            .type = type,
+            .c    = range,
+        }));
+    }
+
+    return ff_sws_op_list_append(ops, &(SwsOp) {
+        .type       = type,
+        .op         = SWS_OP_CONVERT,
+        .convert.to = fmt_pixel_type(fmt.format),
+    });
+}
diff --git a/libswscale/format.h b/libswscale/format.h
index be92038f4f..e6a1fd7116 100644
--- a/libswscale/format.h
+++ b/libswscale/format.h
@@ -148,4 +148,27 @@ int ff_test_fmt(const SwsFormat *fmt, int output);
 /* Returns true if the formats are incomplete, false otherwise */
 bool ff_infer_colors(SwsColor *src, SwsColor *dst);
 
+typedef struct SwsOpList SwsOpList;
+typedef enum SwsPixelType SwsPixelType;
+
+/**
+ * Append a set of operations for decoding/encoding raw pixels. This will
+ * handle input read/write, swizzling, shifting and byte swapping.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_decode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt);
+int ff_sws_encode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt);
+
+/**
+ * Append a set of operations for transforming decoded pixel values to/from
+ * normalized RGB in the specified gamut and pixel type.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_decode_colors(SwsContext *ctx, SwsPixelType type, SwsOpList *ops,
+                         const SwsFormat fmt, bool *incomplete);
+int ff_sws_encode_colors(SwsContext *ctx, SwsPixelType type, SwsOpList *ops,
+                         const SwsFormat fmt, bool *incomplete);
+
 #endif /* SWSCALE_FORMAT_H */
-- 
2.49.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* [FFmpeg-devel] [PATCH v2 17/17] swscale/graph: allow experimental use of new format handler
  2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
                   ` (15 preceding siblings ...)
  2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 16/17] swscale/format: add new format decode/encode logic Niklas Haas
@ 2025-05-21 12:44 ` Niklas Haas
  16 siblings, 0 replies; 21+ messages in thread
From: Niklas Haas @ 2025-05-21 12:44 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

\o/
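
As a minimal usage sketch (assuming the public flags field of the new
SwsContext API; error handling of the conversion itself omitted):

    SwsContext *sws = sws_alloc_context();
    if (!sws)
        return AVERROR(ENOMEM);
    sws->flags |= SWS_UNSTABLE; /* opt in to the experimental ops path */
    /* ...then convert as usual, e.g. via sws_scale_frame(sws, dst, src) */
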
---
 libswscale/graph.c | 84 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/libswscale/graph.c b/libswscale/graph.c
index dc7784aa49..24930e7627 100644
--- a/libswscale/graph.c
+++ b/libswscale/graph.c
@@ -34,6 +34,7 @@
 #include "lut3d.h"
 #include "swscale_internal.h"
 #include "graph.h"
+#include "ops.h"
 
 static int pass_alloc_output(SwsPass *pass)
 {
@@ -453,6 +454,85 @@ static int add_legacy_sws_pass(SwsGraph *graph, SwsFormat src, SwsFormat dst,
     return 0;
 }
 
+/*********************
+ * Format conversion *
+ *********************/
+
+static int add_convert_pass(SwsGraph *graph, SwsFormat src, SwsFormat dst,
+                            SwsPass *input, SwsPass **output)
+{
+    const SwsPixelType type = SWS_PIXEL_F32;
+
+    SwsContext *ctx = graph->ctx;
+    SwsOpList *ops = NULL;
+    int ret = AVERROR(ENOTSUP);
+
+    /* Mark the entire new ops infrastructure as experimental for now */
+    if (!(ctx->flags & SWS_UNSTABLE))
+        goto fail;
+
+    /* The new format conversion layer does not support scaling yet */
+    if (src.width != dst.width || src.height != dst.height ||
+        src.desc->log2_chroma_h || src.desc->log2_chroma_w ||
+        dst.desc->log2_chroma_h || dst.desc->log2_chroma_w)
+        goto fail;
+
+    /* The new code does not yet support alpha blending */
+    if (src.desc->flags & AV_PIX_FMT_FLAG_ALPHA &&
+        ctx->alpha_blend != SWS_ALPHA_BLEND_NONE)
+        goto fail;
+
+    ops = ff_sws_op_list_alloc();
+    if (!ops)
+        return AVERROR(ENOMEM);
+    ops->src = src;
+    ops->dst = dst;
+
+    ret = ff_sws_decode_pixfmt(ops, src.format);
+    if (ret < 0)
+        goto fail;
+    ret = ff_sws_decode_colors(ctx, type, ops, src, &graph->incomplete);
+    if (ret < 0)
+        goto fail;
+    ret = ff_sws_encode_colors(ctx, type, ops, dst, &graph->incomplete);
+    if (ret < 0)
+        goto fail;
+    ret = ff_sws_encode_pixfmt(ops, dst.format);
+    if (ret < 0)
+        goto fail;
+
+    av_log(ctx, AV_LOG_VERBOSE, "Conversion pass for %s -> %s:\n",
+           av_get_pix_fmt_name(src.format), av_get_pix_fmt_name(dst.format));
+
+    av_log(ctx, AV_LOG_DEBUG, "Unoptimized operation list:\n");
+    ff_sws_op_list_print(ctx, AV_LOG_DEBUG, ops);
+    av_log(ctx, AV_LOG_DEBUG, "Optimized operation list:\n");
+
+    ff_sws_op_list_optimize(ops);
+    if (ops->num_ops == 0) {
+        av_log(ctx, AV_LOG_VERBOSE, "  optimized into memcpy\n");
+        ff_sws_op_list_free(&ops);
+        *output = input;
+        return 0;
+    }
+
+    ff_sws_op_list_print(ctx, AV_LOG_VERBOSE, ops);
+
+    ret = ff_sws_compile_pass(graph, ops, 0, dst, input, output);
+    if (ret < 0)
+        goto fail;
+
+    ret = 0;
+    /* fall through */
+
+fail:
+    ff_sws_op_list_free(&ops);
+    if (ret == AVERROR(ENOTSUP))
+        return add_legacy_sws_pass(graph, src, dst, input, output);
+    return ret;
+}
+
 /**************************
  * Gamut and tone mapping *
  **************************/
@@ -522,7 +602,7 @@ static int adapt_colors(SwsGraph *graph, SwsFormat src, SwsFormat dst,
     if (fmt_in != src.format) {
         SwsFormat tmp = src;
         tmp.format = fmt_in;
-        ret = add_legacy_sws_pass(graph, src, tmp, input, &input);
+        ret = add_convert_pass(graph, src, tmp, input, &input);
         if (ret < 0)
             return ret;
     }
@@ -564,7 +644,7 @@ static int init_passes(SwsGraph *graph)
     src.color  = dst.color;
 
     if (!ff_fmt_equal(&src, &dst)) {
-        ret = add_legacy_sws_pass(graph, src, dst, pass, &pass);
+        ret = add_convert_pass(graph, src, dst, pass, &pass);
         if (ret < 0)
             return ret;
     }
-- 
2.49.0


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [FFmpeg-devel] [PATCH v2 14/17] swscale/x86: add SIMD backend
  2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 14/17] swscale/x86: add SIMD backend Niklas Haas
@ 2025-05-21 14:11   ` Kieran Kunhya via ffmpeg-devel
  0 siblings, 0 replies; 21+ messages in thread
From: Kieran Kunhya via ffmpeg-devel @ 2025-05-21 14:11 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Kieran Kunhya, Niklas Haas

On Wed, May 21, 2025 at 2:00 PM Niklas Haas <ffmpeg@haasn.xyz> wrote:
>
> From: Niklas Haas <git@haasn.dev>
>
> This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
> floating point operations. While this is not yet 100% coverage, it's good
> enough for the vast majority of formats out there.
>
> Of special note is the packed shuffle fast path, which uses pshufb at vector
> sizes up to AVX512.

Can I ask if this has some kind of design documentation? Because it's
not exactly simple to understand what's going on here.
I would not like to repeat the mistakes of swscale.

Kieran

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [FFmpeg-devel] [PATCH v2 08/17] swscale/ops_internal: add internal ops backend API
  2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 08/17] swscale/ops_internal: add internal ops backend API Niklas Haas
@ 2025-05-23 16:27   ` Michael Niedermayer
  2025-05-23 16:52     ` Niklas Haas
  0 siblings, 1 reply; 21+ messages in thread
From: Michael Niedermayer @ 2025-05-23 16:27 UTC (permalink / raw)
  To: FFmpeg development discussions and patches



On Wed, May 21, 2025 at 02:43:54PM +0200, Niklas Haas wrote:
> From: Niklas Haas <git@haasn.dev>
> 
> This adds an internal API for ops backends, which are responsible for
> compiling op lists into executable functions.
> ---
>  libswscale/ops.c          |  62 ++++++++++++++++++++++
>  libswscale/ops_internal.h | 108 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 170 insertions(+)
>  create mode 100644 libswscale/ops_internal.h

ubuntu x86-32

In file included from src/libavutil/internal.h:39:0,
                 from src/libavutil/common.h:50,
                 from src/libavutil/avutil.h:300,
                 from src/libswscale/swscale.h:33,
                 from src/libswscale/graph.h:27,
                 from src/libswscale/ops.h:28,
                 from src/libswscale/ops.c:27:
src/libswscale/ops_internal.h:50:1: error: static assertion failed: "SwsOpExec layout mismatch"
 static_assert(sizeof(SwsOpExec) == 16 * sizeof(void *) + 8 * sizeof(int32_t),
 ^
make: *** [/home/michael/ffmpeg-git/ffmpeg/ffbuild/common.mak:81: libswscale/ops.o] Error 1
make: *** Waiting for unfinished jobs....
AR	libavcodec/libavcodec.a

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

While the State exists there can be no freedom; when there is freedom there
will be no State. -- Vladimir Lenin


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [FFmpeg-devel] [PATCH v2 08/17] swscale/ops_internal: add internal ops backend API
  2025-05-23 16:27   ` Michael Niedermayer
@ 2025-05-23 16:52     ` Niklas Haas
  0 siblings, 0 replies; 21+ messages in thread
From: Niklas Haas @ 2025-05-23 16:52 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Fri, 23 May 2025 18:27:38 +0200 Michael Niedermayer <michael@niedermayer.cc> wrote:
> On Wed, May 21, 2025 at 02:43:54PM +0200, Niklas Haas wrote:
> > From: Niklas Haas <git@haasn.dev>
> >
> > This adds an internal API for ops backends, which are responsible for
> > compiling op lists into executable functions.
> > ---
> >  libswscale/ops.c          |  62 ++++++++++++++++++++++
> >  libswscale/ops_internal.h | 108 ++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 170 insertions(+)
> >  create mode 100644 libswscale/ops_internal.h
>
> ubuntu x86-32
>
> In file included from src/libavutil/internal.h:39:0,
>                  from src/libavutil/common.h:50,
>                  from src/libavutil/avutil.h:300,
>                  from src/libswscale/swscale.h:33,
>                  from src/libswscale/graph.h:27,
>                  from src/libswscale/ops.h:28,
>                  from src/libswscale/ops.c:27:
> src/libswscale/ops_internal.h:50:1: error: static assertion failed: "SwsOpExec layout mismatch"
>  static_assert(sizeof(SwsOpExec) == 16 * sizeof(void *) + 8 * sizeof(int32_t),
>  ^
> make: *** [/home/michael/ffmpeg-git/ffmpeg/ffbuild/common.mak:81: libswscale/ops.o] Error 1
> make: *** Waiting for unfinished jobs....

Fixed; I've opted to move the alignment to the usage site instead.
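
A rough sketch of what moving the alignment to the usage site can look
like (hypothetical names again; in FFmpeg proper, DECLARE_ALIGNED on the
object achieves the same effect): the struct type stays unpadded, so the
layout assert keeps holding on every ABI, and the over-alignment is
requested only on the objects that need it:

#include <stdalign.h>
#include <stdint.h>

/* Same hypothetical stand-in as in the sketch upthread: no alignment
 * attribute on the type, so sizeof() stays at the natural value. */
typedef struct ExecPlain {
    void    *ptrs[16];
    int32_t  vals[8];
} ExecPlain;

void use_exec(void)
{
    /* The alignment lives on the object (the "usage site"), e.g. so
     * that code consuming it can rely on an aligned base address: */
    alignas(64) ExecPlain exec = {0};
    (void)exec;
}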

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-05-23 16:52 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-21 12:43 [FFmpeg-devel] [PATCH v2 00/17] swscale: new ops framework Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 01/17] swscale/format: rename legacy format conversion table Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 02/17] swscale/format: add ff_fmt_clear() Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 03/17] tests/checkasm: increase number of runs in between measurements Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 04/17] tests/checkasm: generalize DEF_CHECKASM_CHECK_FUNC to floats Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 05/17] swscale: add SWS_UNSTABLE flag Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 06/17] swscale/ops: introduce new low level framework Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 07/17] swscale/optimizer: add high-level ops optimizer Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 08/17] swscale/ops_internal: add internal ops backend API Niklas Haas
2025-05-23 16:27   ` Michael Niedermayer
2025-05-23 16:52     ` Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 09/17] swscale/ops: add dispatch layer Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 10/17] swscale/optimizer: add packed shuffle solver Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 11/17] swscale/ops_chain: add internal abstraction for kernel linking Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 12/17] swscale/ops_backend: add reference backend based on C templates Niklas Haas
2025-05-21 12:43 ` [FFmpeg-devel] [PATCH v2 13/17] swscale/ops_memcpy: add 'memcpy' backend for plane->plane copies Niklas Haas
2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 14/17] swscale/x86: add SIMD backend Niklas Haas
2025-05-21 14:11   ` Kieran Kunhya via ffmpeg-devel
2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 15/17] tests/checkasm: add checkasm tests for swscale ops Niklas Haas
2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 16/17] swscale/format: add new format decode/encode logic Niklas Haas
2025-05-21 12:44 ` [FFmpeg-devel] [PATCH v2 17/17] swscale/graph: allow experimental use of new format handler Niklas Haas

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git