Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] (no subject)
@ 2025-05-27  7:55 Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 01/17] swscale/format: rename legacy format conversion table Niklas Haas
                   ` (17 more replies)
  0 siblings, 18 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel

Changes since v2:

- refactored x86 loop to reduce per-line overhead and simplify the code;
  eliminate SSE instructions from the process function entirely and also
  reduce the number of allocated registers by one
- remove alignment macro from SwsOpExec and align usage instead
- change pixel_bits_in/out to block_size_in/out in SwsOpExec, and add
  precomputed pointer bump fields
- fix SwsOpExec size assertion
- reduce storage size of several SwsOp types
- simplify SwsOpEntry and massively reduce its storage size, from ~300
  bytes to around 50 bytes per entry
- add lots of comments and documentation, especially for the x86 backend and
  the shuffle solver
- eliminate initializer field override warning from x86 backend
- switch from call/ret to jmp/jmp inside x86 op chain; massively speeds up
  some chains on hardware with dedicated loop buffers (see the sketch below)
- add more vzeroupper calls to break dependencies throughout the code

This branch is ~18% faster across the board, as a result of:
- adding vzeroupper: ~12%
- eliminating call/ret: ~4%
- simplifying the x86 loop: ~1%

The amount of rodata used has been reduced by ~80%.
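
For illustration, here is a conceptual C sketch of the call/ret -> jmp/jmp
change mentioned above. The real op chain is hand-written assembly; all names
below are hypothetical and this only models the control flow:

    typedef struct Exec Exec;
    typedef struct OpEntry OpEntry;
    typedef void (*op_fn)(Exec *exec, const OpEntry *next);

    struct OpEntry {
        op_fn run;    /* kernel implementing one op */
        /* per-op constants ... */
    };

    struct Exec {
        /* pixel pointers, strides, ... */
    };

    static void op_kernel(Exec *exec, const OpEntry *next)
    {
        /* ... process one block of pixels ... */

        /* Tail call straight into the next kernel (the final entry's
         * kernel simply returns). With call/ret, every kernel bounces
         * back through a central dispatcher; as a tail call this compiles
         * to a plain jmp, so a short chain runs as one straight-line loop
         * body and can fit in the CPU's dedicated loop buffer. */
        next->run(exec, next + 1);
    }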

The latest branch can be found here:
  https://github.com/haasn/FFmpeg/tree/swscale6_clean


* [FFmpeg-devel] [PATCH v3 01/17] swscale/format: rename legacy format conversion table
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 02/17] swscale/format: add ff_fmt_clear() Niklas Haas
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

---
 libswscale/format.c | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/libswscale/format.c b/libswscale/format.c
index e4c1348b90..b77081dd7a 100644
--- a/libswscale/format.c
+++ b/libswscale/format.c
@@ -24,14 +24,14 @@
 
 #include "format.h"
 
-typedef struct FormatEntry {
+typedef struct LegacyFormatEntry {
     uint8_t is_supported_in         :1;
     uint8_t is_supported_out        :1;
     uint8_t is_supported_endianness :1;
-} FormatEntry;
+} LegacyFormatEntry;
 
 /* Format support table for legacy swscale */
-static const FormatEntry format_entries[] = {
+static const LegacyFormatEntry legacy_format_entries[] = {
     [AV_PIX_FMT_YUV420P]        = { 1, 1 },
     [AV_PIX_FMT_YUYV422]        = { 1, 1 },
     [AV_PIX_FMT_RGB24]          = { 1, 1 },
@@ -262,20 +262,20 @@ static const FormatEntry format_entries[] = {
 
 int sws_isSupportedInput(enum AVPixelFormat pix_fmt)
 {
-    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
-           format_entries[pix_fmt].is_supported_in : 0;
+    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(legacy_format_entries) ?
+           legacy_format_entries[pix_fmt].is_supported_in : 0;
 }
 
 int sws_isSupportedOutput(enum AVPixelFormat pix_fmt)
 {
-    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
-           format_entries[pix_fmt].is_supported_out : 0;
+    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(legacy_format_entries) ?
+           legacy_format_entries[pix_fmt].is_supported_out : 0;
 }
 
 int sws_isSupportedEndiannessConversion(enum AVPixelFormat pix_fmt)
 {
-    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(format_entries) ?
-           format_entries[pix_fmt].is_supported_endianness : 0;
+    return (unsigned)pix_fmt < FF_ARRAY_ELEMS(legacy_format_entries) ?
+           legacy_format_entries[pix_fmt].is_supported_endianness : 0;
 }
 
 /**
-- 
2.49.0


* [FFmpeg-devel] [PATCH v3 02/17] swscale/format: add ff_fmt_clear()
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 01/17] swscale/format: rename legacy format conversion table Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 03/17] tests/checkasm: increase number of runs in between measurements Niklas Haas
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Reset an SwsFormat to its fully unset/invalid state.
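
Usage sketch (illustrative):

    SwsFormat fmt;
    ff_fmt_clear(&fmt); /* fmt.format == AV_PIX_FMT_NONE, colorimetry unspecified */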
---
 libswscale/format.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/libswscale/format.h b/libswscale/format.h
index 3b6d745159..be92038f4f 100644
--- a/libswscale/format.h
+++ b/libswscale/format.h
@@ -85,6 +85,20 @@ typedef struct SwsFormat {
     SwsColor color;
 } SwsFormat;
 
+static inline void ff_fmt_clear(SwsFormat *fmt)
+{
+    *fmt = (SwsFormat) {
+        .format     = AV_PIX_FMT_NONE,
+        .range      = AVCOL_RANGE_UNSPECIFIED,
+        .csp        = AVCOL_SPC_UNSPECIFIED,
+        .loc        = AVCHROMA_LOC_UNSPECIFIED,
+        .color = {
+            .prim = AVCOL_PRI_UNSPECIFIED,
+            .trc  = AVCOL_TRC_UNSPECIFIED,
+        },
+    };
+}
+
 /**
  * This function also sanitizes and strips the input data, removing irrelevant
  * fields for certain formats.
-- 
2.49.0


* [FFmpeg-devel] [PATCH v3 03/17] tests/checkasm: increase number of runs in between measurements
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 01/17] swscale/format: rename legacy format conversion table Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 02/17] swscale/format: add ff_fmt_clear() Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 04/17] tests/checkasm: generalize DEF_CHECKASM_CHECK_FUNC to floats Niklas Haas
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Sometimes, when measuring very small functions, rdtsc is not accurate enough
to get a reliable measurement. This increases the number of runs inside the
inner loop from 4 to 32, which should help a lot. This is less important when
using the more precise linux-perf API, but still useful.

There should be no user-visible change since the number of runs is adjusted
to keep the total time spent measuring the same.
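
(Concretely: each benchmark previously made 4 calls per iteration over
bench_runs iterations; it now makes 32 calls per iteration over bench_runs/8
iterations, i.e. 4 * bench_runs total calls either way.)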
---
 tests/checkasm/checkasm.c |  2 +-
 tests/checkasm/checkasm.h | 24 +++++++++++++++++++-----
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 0734cd26bf..71d1e5766c 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -628,7 +628,7 @@ static inline double avg_cycles_per_call(const CheckasmPerf *const p)
     if (p->iterations) {
         const double cycles = (double)(10 * p->cycles) / p->iterations - state.nop_time;
         if (cycles > 0.0)
-            return cycles / 4.0; /* 4 calls per iteration */
+            return cycles / 32.0; /* 32 calls per iteration */
     }
     return 0.0;
 }
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index 146bfdec35..ad7ed10613 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -342,6 +342,22 @@ typedef struct CheckasmPerf {
 #define PERF_STOP(t)  t = AV_READ_TIME() - t
 #endif
 
+#define CALL4(...)\
+    do {\
+        tfunc(__VA_ARGS__); \
+        tfunc(__VA_ARGS__); \
+        tfunc(__VA_ARGS__); \
+        tfunc(__VA_ARGS__); \
+    } while (0)
+
+#define CALL16(...)\
+    do {\
+        CALL4(__VA_ARGS__); \
+        CALL4(__VA_ARGS__); \
+        CALL4(__VA_ARGS__); \
+        CALL4(__VA_ARGS__); \
+    } while (0)
+
 /* Benchmark the function */
 #define bench_new(...)\
     do {\
@@ -352,14 +368,12 @@ typedef struct CheckasmPerf {
             uint64_t tsum = 0;\
             uint64_t ti, tcount = 0;\
             uint64_t t = 0; \
-            const uint64_t truns = bench_runs;\
+            const uint64_t truns = FFMAX(bench_runs >> 3, 1);\
             checkasm_set_signal_handler_state(1);\
             for (ti = 0; ti < truns; ti++) {\
                 PERF_START(t);\
-                tfunc(__VA_ARGS__);\
-                tfunc(__VA_ARGS__);\
-                tfunc(__VA_ARGS__);\
-                tfunc(__VA_ARGS__);\
+                CALL16(__VA_ARGS__);\
+                CALL16(__VA_ARGS__);\
                 PERF_STOP(t);\
                 if (t*tcount <= tsum*4 && ti > 0) {\
                     tsum += t;\
-- 
2.49.0


* [FFmpeg-devel] [PATCH v3 04/17] tests/checkasm: generalize DEF_CHECKASM_CHECK_FUNC to floats
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (2 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 03/17] tests/checkasm: increase number of runs in between measurements Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  8:24   ` Martin Storsjö
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 05/17] swscale: add SWS_UNSTABLE flag Niklas Haas
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

We split the standard macro into its body (implementation) and its
declaration, and use a macro argument in place of the raw `memcmp` call.
The major difference is that the comparison now takes the number of pixels
to compare instead of the number of bytes, to match the signature of
float_near_ulp_array().
---
 tests/checkasm/checkasm.c | 52 ++++++++++++++++++++++++++-------------
 tests/checkasm/checkasm.h |  7 ++++++
 2 files changed, 42 insertions(+), 17 deletions(-)

diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index 71d1e5766c..f393a0cb96 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -1187,14 +1187,8 @@ static int check_err(const char *file, int line,
     return 0;
 }
 
-#define DEF_CHECKASM_CHECK_FUNC(type, fmt) \
-int checkasm_check_##type(const char *file, int line, \
-                          const type *buf1, ptrdiff_t stride1, \
-                          const type *buf2, ptrdiff_t stride2, \
-                          int w, int h, const char *name, \
-                          int align_w, int align_h, \
-                          int padding) \
-{ \
+#define DEF_CHECKASM_CHECK_BODY(compare, type, fmt) \
+do { \
     int64_t aligned_w = (w - 1LL + align_w) & ~(align_w - 1); \
     int64_t aligned_h = (h - 1LL + align_h) & ~(align_h - 1); \
     int err = 0; \
@@ -1204,7 +1198,7 @@ int checkasm_check_##type(const char *file, int line, \
     stride1 /= sizeof(*buf1); \
     stride2 /= sizeof(*buf2); \
     for (y = 0; y < h; y++) \
-        if (memcmp(&buf1[y*stride1], &buf2[y*stride2], w*sizeof(*buf1))) \
+        if (!compare(&buf1[y*stride1], &buf2[y*stride2], w)) \
             break; \
     if (y != h) { \
         if (check_err(file, line, name, w, h, &err)) \
@@ -1226,38 +1220,50 @@ int checkasm_check_##type(const char *file, int line, \
         buf2 -= h*stride2; \
     } \
     for (y = -padding; y < 0; y++) \
-        if (memcmp(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
-                   (w + 2*padding)*sizeof(*buf1))) { \
+        if (!compare(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
+                     w + 2*padding)) { \
             if (check_err(file, line, name, w, h, &err)) \
                 return 1; \
             fprintf(stderr, " overwrite above\n"); \
             break; \
         } \
     for (y = aligned_h; y < aligned_h + padding; y++) \
-        if (memcmp(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
-                   (w + 2*padding)*sizeof(*buf1))) { \
+        if (!compare(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
+                     w + 2*padding)) { \
             if (check_err(file, line, name, w, h, &err)) \
                 return 1; \
             fprintf(stderr, " overwrite below\n"); \
             break; \
         } \
     for (y = 0; y < h; y++) \
-        if (memcmp(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
-                   padding*sizeof(*buf1))) { \
+        if (!compare(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
+                     padding)) { \
             if (check_err(file, line, name, w, h, &err)) \
                 return 1; \
             fprintf(stderr, " overwrite left\n"); \
             break; \
         } \
     for (y = 0; y < h; y++) \
-        if (memcmp(&buf1[y*stride1 + aligned_w], &buf2[y*stride2 + aligned_w], \
-                   padding*sizeof(*buf1))) { \
+        if (!compare(&buf1[y*stride1 + aligned_w], &buf2[y*stride2 + aligned_w], \
+                     padding)) { \
             if (check_err(file, line, name, w, h, &err)) \
                 return 1; \
             fprintf(stderr, " overwrite right\n"); \
             break; \
         } \
     return err; \
+} while (0)
+
+#define cmp_int(a, b, len) (!memcmp(a, b, (len) * sizeof(*(a))))
+#define DEF_CHECKASM_CHECK_FUNC(type, fmt) \
+int checkasm_check_##type(const char *file, int line, \
+                          const type *buf1, ptrdiff_t stride1, \
+                          const type *buf2, ptrdiff_t stride2, \
+                          int w, int h, const char *name, \
+                          int align_w, int align_h, \
+                          int padding) \
+{ \
+    DEF_CHECKASM_CHECK_BODY(cmp_int, type, fmt); \
 }
 
 DEF_CHECKASM_CHECK_FUNC(uint8_t,  "%02x")
@@ -1265,3 +1271,15 @@ DEF_CHECKASM_CHECK_FUNC(uint16_t, "%04x")
 DEF_CHECKASM_CHECK_FUNC(uint32_t, "%08x")
 DEF_CHECKASM_CHECK_FUNC(int16_t,  "%6d")
 DEF_CHECKASM_CHECK_FUNC(int32_t,  "%9d")
+
+int checkasm_check_float_ulp(const char *file, int line,
+                             const float *buf1, ptrdiff_t stride1,
+                             const float *buf2, ptrdiff_t stride2,
+                             int w, int h, const char *name,
+                             unsigned max_ulp, int align_w, int align_h,
+                             int padding)
+{
+    #define cmp_float(a, b, len) float_near_ulp_array(a, b, max_ulp, len)
+    DEF_CHECKASM_CHECK_BODY(cmp_float, float, "%g");
+    #undef cmp_float
+}
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index ad7ed10613..ec01bd6207 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -423,6 +423,13 @@ DECL_CHECKASM_CHECK_FUNC(uint32_t);
 DECL_CHECKASM_CHECK_FUNC(int16_t);
 DECL_CHECKASM_CHECK_FUNC(int32_t);
 
+int checkasm_check_float_ulp(const char *file, int line,
+                             const float *buf1, ptrdiff_t stride1,
+                             const float *buf2, ptrdiff_t stride2,
+                             int w, int h, const char *name,
+                             unsigned max_ulp, int align_w, int align_h,
+                             int padding);
+
 #define PASTE(a,b) a ## b
 #define CONCAT(a,b) PASTE(a,b)
 
-- 
2.49.0


* [FFmpeg-devel] [PATCH v3 05/17] swscale: add SWS_UNSTABLE flag
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (3 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 04/17] tests/checkasm: generalize DEF_CHECKASM_CHECK_FUNC to floats Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 06/17] swscale/ops: introduce new low level framework Niklas Haas
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Give users and developers a way to opt in to the new format conversion code,
and more code from the swscale rewrite in general, even while development is
still ongoing.
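
As a usage sketch (untested), the flag can be set through the AVOption API:

    SwsContext *sws = sws_alloc_context();
    av_opt_set(sws, "sws_flags", "unstable", 0);

or, equivalently, on the ffmpeg command line:

    ffmpeg -i input.png -sws_flags unstable output.png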
---
 doc/APIchanges       | 3 +++
 doc/scaler.texi      | 4 ++++
 libswscale/options.c | 1 +
 libswscale/swscale.h | 7 +++++++
 libswscale/version.h | 2 +-
 5 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/doc/APIchanges b/doc/APIchanges
index 91710bb27d..ae5e4b366b 100644
--- a/doc/APIchanges
+++ b/doc/APIchanges
@@ -2,6 +2,9 @@ The last version increases of all libraries were on 2025-03-28
 
 API changes, most recent first:
 
+2025-05-xx - xxxxxxxxxx - lsws 9.1.100 - swscale.h
+  Add SWS_UNSTABLE flag.
+
 2025-05-21 - xxxxxxxxxx - lavu 60.3.100 - avassert.h
   Add av_unreachable() and av_assume() macros.
 
diff --git a/doc/scaler.texi b/doc/scaler.texi
index eb045de6b7..42b2377761 100644
--- a/doc/scaler.texi
+++ b/doc/scaler.texi
@@ -68,6 +68,10 @@ Select full chroma input.
 
 @item bitexact
 Enable bitexact output.
+
+@item unstable
+Allow the use of experimental new code. May subtly affect the output or even
+produce wrong results. For testing only.
 @end table
 
 @item srcw @var{(API only)}
diff --git a/libswscale/options.c b/libswscale/options.c
index feecae8c89..06e51dcfe9 100644
--- a/libswscale/options.c
+++ b/libswscale/options.c
@@ -50,6 +50,7 @@ static const AVOption swscale_options[] = {
         { "full_chroma_inp", "full chroma input",             0,  AV_OPT_TYPE_CONST, { .i64 = SWS_FULL_CHR_H_INP }, .flags = VE, .unit = "sws_flags" },
         { "bitexact",        "bit-exact mode",                0,  AV_OPT_TYPE_CONST, { .i64 = SWS_BITEXACT       }, .flags = VE, .unit = "sws_flags" },
         { "error_diffusion", "error diffusion dither",        0,  AV_OPT_TYPE_CONST, { .i64 = SWS_ERROR_DIFFUSION}, .flags = VE, .unit = "sws_flags" },
+        { "unstable",        "allow experimental new code",   0,  AV_OPT_TYPE_CONST, { .i64 = SWS_UNSTABLE       }, .flags = VE, .unit = "sws_flags" },
 
     { "param0",          "scaler param 0", OFFSET(scaler_params[0]), AV_OPT_TYPE_DOUBLE, { .dbl = SWS_PARAM_DEFAULT  }, INT_MIN, INT_MAX, VE },
     { "param1",          "scaler param 1", OFFSET(scaler_params[1]), AV_OPT_TYPE_DOUBLE, { .dbl = SWS_PARAM_DEFAULT  }, INT_MIN, INT_MAX, VE },
diff --git a/libswscale/swscale.h b/libswscale/swscale.h
index b04aa182d2..4aa072009c 100644
--- a/libswscale/swscale.h
+++ b/libswscale/swscale.h
@@ -155,6 +155,13 @@ typedef enum SwsFlags {
     SWS_ACCURATE_RND   = 1 << 18,
     SWS_BITEXACT       = 1 << 19,
 
+    /**
+     * Allow using experimental new code paths. This may be faster, slower,
+     * or produce different output, with semantics subject to change at any
+     * point in time. For testing and debugging purposes only.
+     */
+    SWS_UNSTABLE = 1 << 20,
+
     /**
      * Deprecated flags.
      */
diff --git a/libswscale/version.h b/libswscale/version.h
index 148efd83eb..4e54701aba 100644
--- a/libswscale/version.h
+++ b/libswscale/version.h
@@ -28,7 +28,7 @@
 
 #include "version_major.h"
 
-#define LIBSWSCALE_VERSION_MINOR   0
+#define LIBSWSCALE_VERSION_MINOR   1
 #define LIBSWSCALE_VERSION_MICRO 100
 
 #define LIBSWSCALE_VERSION_INT  AV_VERSION_INT(LIBSWSCALE_VERSION_MAJOR, \
-- 
2.49.0


* [FFmpeg-devel] [PATCH v3 06/17] swscale/ops: introduce new low level framework
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (4 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 05/17] swscale: add SWS_UNSTABLE flag Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 07/17] swscale/optimizer: add high-level ops optimizer Niklas Haas
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

See docs/swscale-v2.txt for an in-depth introduction to the new approach.

This commit merely introduces the ops definitions and boilerplate functions.
The subsequent commits will flesh out the underlying implementation.
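
As a rough usage sketch of the boilerplate introduced here (hypothetical
gray8 -> gray16 chain; error handling omitted, and compiling/executing op
lists only arrives in later commits):

    SwsOpList *ops = ff_sws_op_list_alloc();

    ff_sws_op_list_append(ops, &(SwsOp) {
        .op   = SWS_OP_READ,
        .type = SWS_PIXEL_U8,
        .rw   = { .elems = 1 },
    });
    ff_sws_op_list_append(ops, &(SwsOp) {
        .op      = SWS_OP_CONVERT,
        .type    = SWS_PIXEL_U8,
        .convert = { .to = SWS_PIXEL_U16, .expand = true },
    });
    ff_sws_op_list_append(ops, &(SwsOp) {
        .op   = SWS_OP_WRITE,
        .type = SWS_PIXEL_U16,
        .rw   = { .elems = 1 },
    });

    ff_sws_op_list_print(NULL, AV_LOG_DEBUG, ops);
    ff_sws_op_list_free(&ops);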
---
 libswscale/Makefile |   1 +
 libswscale/ops.c    | 522 ++++++++++++++++++++++++++++++++++++++++++++
 libswscale/ops.h    | 240 ++++++++++++++++++++
 3 files changed, 763 insertions(+)
 create mode 100644 libswscale/ops.c
 create mode 100644 libswscale/ops.h

diff --git a/libswscale/Makefile b/libswscale/Makefile
index d5e10d17dc..e0beef4e69 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -15,6 +15,7 @@ OBJS = alphablend.o                                     \
        graph.o                                          \
        input.o                                          \
        lut3d.o                                          \
+       ops.o                                            \
        options.o                                        \
        output.o                                         \
        rgb2rgb.o                                        \
diff --git a/libswscale/ops.c b/libswscale/ops.c
new file mode 100644
index 0000000000..004686147d
--- /dev/null
+++ b/libswscale/ops.c
@@ -0,0 +1,522 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/bswap.h"
+#include "libavutil/mem.h"
+#include "libavutil/rational.h"
+#include "libavutil/refstruct.h"
+
+#include "ops.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+
+const char *ff_sws_pixel_type_name(SwsPixelType type)
+{
+    switch (type) {
+    case SWS_PIXEL_U8:   return "u8";
+    case SWS_PIXEL_U16:  return "u16";
+    case SWS_PIXEL_U32:  return "u32";
+    case SWS_PIXEL_F32:  return "f32";
+    case SWS_PIXEL_NONE: return "none";
+    case SWS_PIXEL_TYPE_NB: break;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return "ERR";
+}
+
+int ff_sws_pixel_type_size(SwsPixelType type)
+{
+    switch (type) {
+    case SWS_PIXEL_U8:  return sizeof(uint8_t);
+    case SWS_PIXEL_U16: return sizeof(uint16_t);
+    case SWS_PIXEL_U32: return sizeof(uint32_t);
+    case SWS_PIXEL_F32: return sizeof(float);
+    case SWS_PIXEL_NONE: break;
+    case SWS_PIXEL_TYPE_NB: break;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return 0;
+}
+
+bool ff_sws_pixel_type_is_int(SwsPixelType type)
+{
+    switch (type) {
+    case SWS_PIXEL_U8:
+    case SWS_PIXEL_U16:
+    case SWS_PIXEL_U32:
+        return true;
+    case SWS_PIXEL_F32:
+        return false;
+    case SWS_PIXEL_NONE:
+    case SWS_PIXEL_TYPE_NB: break;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return false;
+}
+
+SwsPixelType ff_sws_pixel_type_to_uint(SwsPixelType type)
+{
+    if (!type)
+        return type;
+
+    switch (ff_sws_pixel_type_size(type)) {
+    case 1: return SWS_PIXEL_U8;
+    case 2: return SWS_PIXEL_U16;
+    case 4: return SWS_PIXEL_U32;
+    }
+
+    av_assert0(!"Invalid pixel type!");
+    return SWS_PIXEL_NONE;
+}
+
+/* biased towards `a` */
+static AVRational av_min_q(AVRational a, AVRational b)
+{
+    return av_cmp_q(a, b) == 1 ? b : a;
+}
+
+static AVRational av_max_q(AVRational a, AVRational b)
+{
+    return av_cmp_q(a, b) == -1 ? b : a;
+}
+
+static AVRational expand_factor(SwsPixelType from, SwsPixelType to)
+{
+    const int src = ff_sws_pixel_type_size(from);
+    const int dst = ff_sws_pixel_type_size(to);
+    int scale = 0;
+    for (int i = 0; i < dst / src; i++)
+        scale = scale << src * 8 | 1;
+    return Q(scale);
+}
+
+void ff_sws_apply_op_q(const SwsOp *op, AVRational x[4])
+{
+    switch (op->op) {
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+        return;
+    case SWS_OP_UNPACK: {
+        unsigned val = x[0].num;
+        int shift = ff_sws_pixel_type_size(op->type) * 8;
+        for (int i = 0; i < 4; i++) {
+            const unsigned mask = (1 << op->pack.pattern[i]) - 1;
+            shift -= op->pack.pattern[i];
+            x[i] = Q((val >> shift) & mask);
+        }
+        return;
+    }
+    case SWS_OP_PACK: {
+        unsigned val = 0;
+        int shift = ff_sws_pixel_type_size(op->type) * 8;
+        for (int i = 0; i < 4; i++) {
+            const unsigned mask = (1 << op->pack.pattern[i]) - 1;
+            shift -= op->pack.pattern[i];
+            val |= (x[i].num & mask) << shift;
+        }
+        x[0] = Q(val);
+        return;
+    }
+    case SWS_OP_SWAP_BYTES:
+        switch (ff_sws_pixel_type_size(op->type)) {
+        case 2:
+            for (int i = 0; i < 4; i++)
+                x[i].num = av_bswap16(x[i].num);
+            break;
+        case 4:
+            for (int i = 0; i < 4; i++)
+                x[i].num = av_bswap32(x[i].num);
+            break;
+        }
+        return;
+    case SWS_OP_CLEAR:
+        for (int i = 0; i < 4; i++) {
+            if (op->c.q4[i].den)
+                x[i] = op->c.q4[i];
+        }
+        return;
+    case SWS_OP_LSHIFT: {
+        AVRational mult = Q(1 << op->c.u);
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_mul_q(x[i], mult) : x[i];
+        return;
+    }
+    case SWS_OP_RSHIFT: {
+        AVRational mult = Q(1 << op->c.u);
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_div_q(x[i], mult) : x[i];
+        return;
+    }
+    case SWS_OP_SWIZZLE: {
+        const AVRational orig[4] = { x[0], x[1], x[2], x[3] };
+        for (int i = 0; i < 4; i++)
+            x[i] = orig[op->swizzle.in[i]];
+        return;
+    }
+    case SWS_OP_CONVERT:
+        if (ff_sws_pixel_type_is_int(op->convert.to)) {
+            const AVRational scale = expand_factor(op->type, op->convert.to);
+            for (int i = 0; i < 4; i++) {
+                x[i] = x[i].den ? Q(x[i].num / x[i].den) : x[i];
+                if (op->convert.expand)
+                    x[i] = av_mul_q(x[i], scale);
+            }
+        }
+        return;
+    case SWS_OP_DITHER:
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_add_q(x[i], av_make_q(1, 2)) : x[i];
+        return;
+    case SWS_OP_MIN:
+        for (int i = 0; i < 4; i++)
+            x[i] = av_min_q(x[i], op->c.q4[i]);
+        return;
+    case SWS_OP_MAX:
+        for (int i = 0; i < 4; i++)
+            x[i] = av_max_q(x[i], op->c.q4[i]);
+        return;
+    case SWS_OP_LINEAR: {
+        const AVRational orig[4] = { x[0], x[1], x[2], x[3] };
+        for (int i = 0; i < 4; i++) {
+            AVRational sum = op->lin.m[i][4];
+            for (int j = 0; j < 4; j++)
+                sum = av_add_q(sum, av_mul_q(orig[j], op->lin.m[i][j]));
+            x[i] = sum;
+        }
+        return;
+    }
+    case SWS_OP_SCALE:
+        for (int i = 0; i < 4; i++)
+            x[i] = x[i].den ? av_mul_q(x[i], op->c.q) : x[i];
+        return;
+    }
+
+    av_assert0(!"Invalid operation type!");
+}
+
+static void op_uninit(SwsOp *op)
+{
+    switch (op->op) {
+    case SWS_OP_DITHER:
+        av_refstruct_unref(&op->dither.matrix);
+        break;
+    }
+
+    *op = (SwsOp) {0};
+}
+
+SwsOpList *ff_sws_op_list_alloc(void)
+{
+    SwsOpList *ops = av_mallocz(sizeof(SwsOpList));
+    if (!ops)
+        return NULL;
+
+    ff_fmt_clear(&ops->src);
+    ff_fmt_clear(&ops->dst);
+    return ops;
+}
+
+void ff_sws_op_list_free(SwsOpList **p_ops)
+{
+    SwsOpList *ops = *p_ops;
+    if (!ops)
+        return;
+
+    for (int i = 0; i < ops->num_ops; i++)
+        op_uninit(&ops->ops[i]);
+
+    av_freep(&ops->ops);
+    av_free(ops);
+    *p_ops = NULL;
+}
+
+SwsOpList *ff_sws_op_list_duplicate(const SwsOpList *ops)
+{
+    SwsOpList *copy = av_malloc(sizeof(*copy));
+    if (!copy)
+        return NULL;
+
+    *copy = *ops;
+    copy->ops = av_memdup(ops->ops, ops->num_ops * sizeof(ops->ops[0]));
+    if (!copy->ops) {
+        av_free(copy);
+        return NULL;
+    }
+
+    for (int i = 0; i < ops->num_ops; i++) {
+        const SwsOp *op = &ops->ops[i];
+        switch (op->op) {
+        case SWS_OP_DITHER:
+            av_refstruct_ref(copy->ops[i].dither.matrix);
+            break;
+        }
+    }
+
+    return copy;
+}
+
+void ff_sws_op_list_remove_at(SwsOpList *ops, int index, int count)
+{
+    const int end = ops->num_ops - count;
+    av_assert2(index >= 0 && count >= 0 && index + count <= ops->num_ops);
+    op_uninit(&ops->ops[index]);
+    for (int i = index; i < end; i++)
+        ops->ops[i] = ops->ops[i + count];
+    ops->num_ops = end;
+}
+
+int ff_sws_op_list_insert_at(SwsOpList *ops, int index, SwsOp *op)
+{
+    void *ret;
+    ret = av_dynarray2_add((void **) &ops->ops, &ops->num_ops, sizeof(*op),
+                           (const void *) op);
+    if (!ret) {
+        op_uninit(op);
+        return AVERROR(ENOMEM);
+    }
+
+    for (int i = ops->num_ops - 1; i > index; i--)
+        ops->ops[i] = ops->ops[i - 1];
+    ops->ops[index] = *op;
+    *op = (SwsOp) {0};
+    return 0;
+}
+
+int ff_sws_op_list_append(SwsOpList *ops, SwsOp *op)
+{
+    return ff_sws_op_list_insert_at(ops, ops->num_ops, op);
+}
+
+int ff_sws_op_list_max_size(const SwsOpList *ops)
+{
+    int max_size = 0;
+    for (int i = 0; i < ops->num_ops; i++) {
+        const int size = ff_sws_pixel_type_size(ops->ops[i].type);
+        max_size = FFMAX(max_size, size);
+    }
+
+    return max_size;
+}
+
+uint32_t ff_sws_linear_mask(const SwsLinearOp c)
+{
+    uint32_t mask = 0;
+    for (int i = 0; i < 4; i++) {
+        for (int j = 0; j < 5; j++) {
+            if (av_cmp_q(c.m[i][j], Q(i == j)))
+                mask |= SWS_MASK(i, j);
+        }
+    }
+    return mask;
+}
+
+static const char *describe_lin_mask(uint32_t mask)
+{
+    /* Try to be fairly descriptive without assuming too much */
+    static const struct {
+        const char *name;
+        uint32_t mask;
+    } patterns[] = {
+        { "noop",               0 },
+        { "luma",               SWS_MASK_LUMA },
+        { "alpha",              SWS_MASK_ALPHA },
+        { "luma+alpha",         SWS_MASK_LUMA | SWS_MASK_ALPHA },
+        { "dot3",               0b111 },
+        { "dot4",               0b1111 },
+        { "row0",               SWS_MASK_ROW(0) },
+        { "row0+alpha",         SWS_MASK_ROW(0) | SWS_MASK_ALPHA },
+        { "col0",               SWS_MASK_COL(0) },
+        { "col0+off3",          SWS_MASK_COL(0) | SWS_MASK_OFF3 },
+        { "off3",               SWS_MASK_OFF3 },
+        { "off3+alpha",         SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag3",              SWS_MASK_DIAG3 },
+        { "diag4",              SWS_MASK_DIAG4 },
+        { "diag3+alpha",        SWS_MASK_DIAG3 | SWS_MASK_ALPHA },
+        { "diag3+off3",         SWS_MASK_DIAG3 | SWS_MASK_OFF3 },
+        { "diag3+off3+alpha",   SWS_MASK_DIAG3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag4+off4",         SWS_MASK_DIAG4 | SWS_MASK_OFF4 },
+        { "matrix3",            SWS_MASK_MAT3 },
+        { "matrix3+off3",       SWS_MASK_MAT3 | SWS_MASK_OFF3 },
+        { "matrix3+off3+alpha", SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "matrix4",            SWS_MASK_MAT4 },
+        { "matrix4+off4",       SWS_MASK_MAT4 | SWS_MASK_OFF4 },
+    };
+
+    for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+        if (!(mask & ~patterns[i].mask))
+            return patterns[i].name;
+    }
+
+    return "full";
+}
+
+static char describe_comp_flags(unsigned flags)
+{
+    if (flags & SWS_COMP_GARBAGE)
+        return 'X';
+    else if (flags & SWS_COMP_ZERO)
+        return '0';
+    else if (flags & SWS_COMP_EXACT)
+        return '+';
+    else
+        return '.';
+}
+
+static const char *print_q(const AVRational q, char buf[], int buf_len)
+{
+    if (!q.den) {
+        return q.num > 0 ? "inf" : q.num < 0 ? "-inf" : "nan";
+    } else if (q.den == 1) {
+        snprintf(buf, buf_len, "%d", q.num);
+        return buf;
+    } else if (abs(q.num) > 1000 || abs(q.den) > 1000) {
+        snprintf(buf, buf_len, "%f", av_q2d(q));
+        return buf;
+    } else {
+        snprintf(buf, buf_len, "%d/%d", q.num, q.den);
+        return buf;
+    }
+}
+
+#define PRINTQ(q) print_q(q, (char[32]){0}, sizeof(char[32]) - 1)
+
+void ff_sws_op_list_print(void *log, int lev, const SwsOpList *ops)
+{
+    if (!ops->num_ops) {
+        av_log(log, lev, "  (empty)\n");
+        return;
+    }
+
+    for (int i = 0; i < ops->num_ops; i++) {
+        const SwsOp *op = &ops->ops[i];
+        av_log(log, lev, "  [%3s %c%c%c%c -> %c%c%c%c] ",
+               ff_sws_pixel_type_name(op->type),
+               op->comps.unused[0] ? 'X' : '.',
+               op->comps.unused[1] ? 'X' : '.',
+               op->comps.unused[2] ? 'X' : '.',
+               op->comps.unused[3] ? 'X' : '.',
+               describe_comp_flags(op->comps.flags[0]),
+               describe_comp_flags(op->comps.flags[1]),
+               describe_comp_flags(op->comps.flags[2]),
+               describe_comp_flags(op->comps.flags[3]));
+
+        switch (op->op) {
+        case SWS_OP_INVALID:
+            av_log(log, lev, "SWS_OP_INVALID\n");
+            break;
+        case SWS_OP_READ:
+        case SWS_OP_WRITE:
+            av_log(log, lev, "%-20s: %d elem(s) %s >> %d\n",
+                   op->op == SWS_OP_READ ? "SWS_OP_READ"
+                                         : "SWS_OP_WRITE",
+                   op->rw.elems,  op->rw.packed ? "packed" : "planar",
+                   op->rw.frac);
+            break;
+        case SWS_OP_SWAP_BYTES:
+            av_log(log, lev, "SWS_OP_SWAP_BYTES\n");
+            break;
+        case SWS_OP_LSHIFT:
+            av_log(log, lev, "%-20s: << %u\n", "SWS_OP_LSHIFT", op->c.u);
+            break;
+        case SWS_OP_RSHIFT:
+            av_log(log, lev, "%-20s: >> %u\n", "SWS_OP_RSHIFT", op->c.u);
+            break;
+        case SWS_OP_PACK:
+        case SWS_OP_UNPACK:
+            av_log(log, lev, "%-20s: {%d %d %d %d}\n",
+                   op->op == SWS_OP_PACK ? "SWS_OP_PACK"
+                                         : "SWS_OP_UNPACK",
+                   op->pack.pattern[0], op->pack.pattern[1],
+                   op->pack.pattern[2], op->pack.pattern[3]);
+            break;
+        case SWS_OP_CLEAR:
+            av_log(log, lev, "%-20s: {%s %s %s %s}\n", "SWS_OP_CLEAR",
+                   op->c.q4[0].den ? PRINTQ(op->c.q4[0]) : "_",
+                   op->c.q4[1].den ? PRINTQ(op->c.q4[1]) : "_",
+                   op->c.q4[2].den ? PRINTQ(op->c.q4[2]) : "_",
+                   op->c.q4[3].den ? PRINTQ(op->c.q4[3]) : "_");
+            break;
+        case SWS_OP_SWIZZLE:
+            av_log(log, lev, "%-20s: %d%d%d%d\n", "SWS_OP_SWIZZLE",
+                   op->swizzle.x, op->swizzle.y, op->swizzle.z, op->swizzle.w);
+            break;
+        case SWS_OP_CONVERT:
+            av_log(log, lev, "%-20s: %s -> %s%s\n", "SWS_OP_CONVERT",
+                   ff_sws_pixel_type_name(op->type),
+                   ff_sws_pixel_type_name(op->convert.to),
+                   op->convert.expand ? " (expand)" : "");
+            break;
+        case SWS_OP_DITHER:
+            av_log(log, lev, "%-20s: %dx%d matrix\n", "SWS_OP_DITHER",
+                    1 << op->dither.size_log2, 1 << op->dither.size_log2);
+            break;
+        case SWS_OP_MIN:
+            av_log(log, lev, "%-20s: x <= {%s %s %s %s}\n", "SWS_OP_MIN",
+                    op->c.q4[0].den ? PRINTQ(op->c.q4[0]) : "_",
+                    op->c.q4[1].den ? PRINTQ(op->c.q4[1]) : "_",
+                    op->c.q4[2].den ? PRINTQ(op->c.q4[2]) : "_",
+                    op->c.q4[3].den ? PRINTQ(op->c.q4[3]) : "_");
+            break;
+        case SWS_OP_MAX:
+            av_log(log, lev, "%-20s: {%s %s %s %s} <= x\n", "SWS_OP_MAX",
+                    op->c.q4[0].den ? PRINTQ(op->c.q4[0]) : "_",
+                    op->c.q4[1].den ? PRINTQ(op->c.q4[1]) : "_",
+                    op->c.q4[2].den ? PRINTQ(op->c.q4[2]) : "_",
+                    op->c.q4[3].den ? PRINTQ(op->c.q4[3]) : "_");
+            break;
+        case SWS_OP_LINEAR:
+            av_log(log, lev, "%-20s: %s [[%s %s %s %s %s] "
+                                        "[%s %s %s %s %s] "
+                                        "[%s %s %s %s %s] "
+                                        "[%s %s %s %s %s]]\n",
+                   "SWS_OP_LINEAR", describe_lin_mask(op->lin.mask),
+                   PRINTQ(op->lin.m[0][0]), PRINTQ(op->lin.m[0][1]), PRINTQ(op->lin.m[0][2]), PRINTQ(op->lin.m[0][3]), PRINTQ(op->lin.m[0][4]),
+                   PRINTQ(op->lin.m[1][0]), PRINTQ(op->lin.m[1][1]), PRINTQ(op->lin.m[1][2]), PRINTQ(op->lin.m[1][3]), PRINTQ(op->lin.m[1][4]),
+                   PRINTQ(op->lin.m[2][0]), PRINTQ(op->lin.m[2][1]), PRINTQ(op->lin.m[2][2]), PRINTQ(op->lin.m[2][3]), PRINTQ(op->lin.m[2][4]),
+                   PRINTQ(op->lin.m[3][0]), PRINTQ(op->lin.m[3][1]), PRINTQ(op->lin.m[3][2]), PRINTQ(op->lin.m[3][3]), PRINTQ(op->lin.m[3][4]));
+            break;
+        case SWS_OP_SCALE:
+            av_log(log, lev, "%-20s: * %s\n", "SWS_OP_SCALE",
+                   PRINTQ(op->c.q));
+            break;
+        case SWS_OP_TYPE_NB:
+            break;
+        }
+
+        if (op->comps.min[0].den || op->comps.min[1].den ||
+            op->comps.min[2].den || op->comps.min[3].den ||
+            op->comps.max[0].den || op->comps.max[1].den ||
+            op->comps.max[2].den || op->comps.max[3].den)
+        {
+            av_log(log, AV_LOG_TRACE, "    min: {%s, %s, %s, %s}, max: {%s, %s, %s, %s}\n",
+                PRINTQ(op->comps.min[0]), PRINTQ(op->comps.min[1]),
+                PRINTQ(op->comps.min[2]), PRINTQ(op->comps.min[3]),
+                PRINTQ(op->comps.max[0]), PRINTQ(op->comps.max[1]),
+                PRINTQ(op->comps.max[2]), PRINTQ(op->comps.max[3]));
+        }
+
+    }
+
+    av_log(log, lev, "    (X = unused, + = exact, 0 = zero)\n");
+}
diff --git a/libswscale/ops.h b/libswscale/ops.h
new file mode 100644
index 0000000000..71841c1572
--- /dev/null
+++ b/libswscale/ops.h
@@ -0,0 +1,240 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_H
+#define SWSCALE_OPS_H
+
+#include <assert.h>
+#include <stdbool.h>
+#include <stdalign.h>
+
+#include "graph.h"
+
+typedef enum SwsPixelType {
+    SWS_PIXEL_NONE = 0,
+    SWS_PIXEL_U8,
+    SWS_PIXEL_U16,
+    SWS_PIXEL_U32,
+    SWS_PIXEL_F32,
+    SWS_PIXEL_TYPE_NB
+} SwsPixelType;
+
+const char *ff_sws_pixel_type_name(SwsPixelType type);
+int ff_sws_pixel_type_size(SwsPixelType type) av_const;
+bool ff_sws_pixel_type_is_int(SwsPixelType type) av_const;
+SwsPixelType ff_sws_pixel_type_to_uint(SwsPixelType type) av_const;
+
+typedef enum SwsOpType {
+    SWS_OP_INVALID = 0,
+
+    /* Input/output handling */
+    SWS_OP_READ,            /* gather raw pixels from planes */
+    SWS_OP_WRITE,           /* write raw pixels to planes */
+    SWS_OP_SWAP_BYTES,      /* swap byte order (for differing endianness) */
+    SWS_OP_UNPACK,          /* split tightly packed data into components */
+    SWS_OP_PACK,            /* compress components into tightly packed data */
+
+    /* Pixel manipulation */
+    SWS_OP_CLEAR,           /* clear pixel values */
+    SWS_OP_LSHIFT,          /* logical left shift of raw pixel values by (u8) */
+    SWS_OP_RSHIFT,          /* right shift of raw pixel values by (u8) */
+    SWS_OP_SWIZZLE,         /* rearrange channel order, or duplicate channels */
+    SWS_OP_CONVERT,         /* convert (cast) between formats */
+    SWS_OP_DITHER,          /* add dithering noise */
+
+    /* Arithmetic operations */
+    SWS_OP_LINEAR,          /* generalized linear affine transform */
+    SWS_OP_SCALE,           /* multiplication by scalar (q) */
+    SWS_OP_MIN,             /* numeric minimum (q4) */
+    SWS_OP_MAX,             /* numeric maximum (q4) */
+
+    SWS_OP_TYPE_NB,
+} SwsOpType;
+
+enum SwsCompFlags {
+    SWS_COMP_GARBAGE = 1 << 0, /* contents are undefined / garbage data */
+    SWS_COMP_EXACT   = 1 << 1, /* value is an in-range, exact, integer */
+    SWS_COMP_ZERO    = 1 << 2, /* known to be a constant zero */
+};
+
+typedef union SwsConst {
+    /* Generic constant value */
+    AVRational q;
+    AVRational q4[4];
+    unsigned u;
+} SwsConst;
+
+typedef struct SwsComps {
+    unsigned flags[4]; /* knowledge about (output) component contents */
+    bool unused[4];    /* which input components are definitely unused */
+
+    /* Keeps track of the known possible value range, or {0, 0} for undefined
+     * or (unknown range) floating point inputs */
+    AVRational min[4], max[4];
+} SwsComps;
+
+typedef struct SwsReadWriteOp {
+    uint8_t elems; /* number of elements (of type `op.type`) to read/write */
+    uint8_t frac;  /* fractional pixel step factor (log2) */
+    bool packed;   /* read multiple elements from a single plane */
+
+    /** Examples:
+     *    rgba      = 4x u8 packed
+     *    yuv444p   = 3x u8
+     *    rgb565    = 1x u16   <- use SWS_OP_UNPACK to unpack
+     *    monow     = 1x u8 (frac 3)
+     *    rgb4      = 1x u8 (frac 1)
+     */
+} SwsReadWriteOp;
+
+typedef struct SwsPackOp {
+    uint8_t pattern[4]; /* bit depth pattern, from MSB to LSB */
+} SwsPackOp;
+
+typedef struct SwsSwizzleOp {
+    /**
+     * Input component for each output component:
+     *   Out[x] := In[swizzle.in[x]]
+     */
+    union {
+        uint32_t mask;
+        uint8_t in[4];
+        struct { uint8_t x, y, z, w; };
+    };
+} SwsSwizzleOp;
+
+#define SWS_SWIZZLE(X,Y,Z,W) ((SwsSwizzleOp) { .in = {X, Y, Z, W} })
+
+typedef struct SwsConvertOp {
+    SwsPixelType to; /* type of pixel to convert to */
+    bool expand; /* if true, integers are expanded to the full range */
+} SwsConvertOp;
+
+typedef struct SwsDitherOp {
+    AVRational *matrix; /* tightly packed dither matrix (refstruct) */
+    int size_log2; /* size (in bits) of the dither matrix */
+} SwsDitherOp;
+
+typedef struct SwsLinearOp {
+    /**
+     * Generalized 5x5 affine transformation:
+     *   [ Out.x ] = [ A B C D E ]
+     *   [ Out.y ] = [ F G H I J ] * [ x y z w 1 ]
+     *   [ Out.z ] = [ K L M N O ]
+     *   [ Out.w ] = [ P Q R S T ]
+     *
+     * The mask keeps track of which components differ from an identity matrix.
+     * There may be more efficient implementations of particular subsets, for
+     * example the common subset of {A, E, G, J, M, O} can be implemented with
+     * just three fused multiply-add operations.
+     */
+    AVRational m[4][5];
+    uint32_t mask; /* m[i][j] <-> 1 << (5 * i + j) */
+} SwsLinearOp;
+
+#define SWS_MASK(I, J)  (1 << (5 * (I) + (J)))
+#define SWS_MASK_OFF(I) SWS_MASK(I, 4)
+#define SWS_MASK_ROW(I) (0b11111 << (5 * (I)))
+#define SWS_MASK_COL(J) (0b1000010000100001 << J)
+
+enum {
+    SWS_MASK_ALL   = (1 << 20) - 1,
+    SWS_MASK_LUMA  = SWS_MASK(0, 0) | SWS_MASK_OFF(0),
+    SWS_MASK_ALPHA = SWS_MASK(3, 3) | SWS_MASK_OFF(3),
+
+    SWS_MASK_DIAG3 = SWS_MASK(0, 0)  | SWS_MASK(1, 1)  | SWS_MASK(2, 2),
+    SWS_MASK_OFF3  = SWS_MASK_OFF(0) | SWS_MASK_OFF(1) | SWS_MASK_OFF(2),
+    SWS_MASK_MAT3  = SWS_MASK(0, 0)  | SWS_MASK(0, 1)  | SWS_MASK(0, 2) |
+                     SWS_MASK(1, 0)  | SWS_MASK(1, 1)  | SWS_MASK(1, 2) |
+                     SWS_MASK(2, 0)  | SWS_MASK(2, 1)  | SWS_MASK(2, 2),
+
+    SWS_MASK_DIAG4 = SWS_MASK_DIAG3  | SWS_MASK(3, 3),
+    SWS_MASK_OFF4  = SWS_MASK_OFF3   | SWS_MASK_OFF(3),
+    SWS_MASK_MAT4  = SWS_MASK_ALL & ~SWS_MASK_OFF4,
+};
+
+/* Helper function to compute the correct mask */
+uint32_t ff_sws_linear_mask(SwsLinearOp);
+
+typedef struct SwsOp {
+    SwsOpType op;      /* operation to perform */
+    SwsPixelType type; /* pixel type to operate on */
+    union {
+        SwsReadWriteOp  rw;
+        SwsPackOp       pack;
+        SwsSwizzleOp    swizzle;
+        SwsConvertOp    convert;
+        SwsDitherOp     dither;
+        SwsLinearOp     lin;
+        SwsConst        c;
+    };
+
+    /* For internal use inside ff_sws_*() functions */
+    SwsComps comps;
+} SwsOp;
+
+/**
+ * Frees any allocations associated with an SwsOp and sets it to {0}.
+ */
+void ff_sws_op_uninit(SwsOp *op);
+
+/**
+ * Apply an operation to an AVRational. No-op for read/write operations.
+ */
+void ff_sws_apply_op_q(const SwsOp *op, AVRational x[4]);
+
+/**
+ * Helper struct for representing a list of operations.
+ */
+typedef struct SwsOpList {
+    SwsOp *ops;
+    int num_ops;
+
+    /* Purely informative metadata associated with this operation list */
+    SwsFormat src, dst;
+} SwsOpList;
+
+SwsOpList *ff_sws_op_list_alloc(void);
+void ff_sws_op_list_free(SwsOpList **ops);
+
+/**
+ * Returns a duplicate of `ops`, or NULL on OOM.
+ */
+SwsOpList *ff_sws_op_list_duplicate(const SwsOpList *ops);
+
+/**
+ * Returns the size of the largest pixel type used in `ops`.
+ */
+int ff_sws_op_list_max_size(const SwsOpList *ops);
+
+/**
+ * These will take over ownership of `op` and set it to {0}, even on failure.
+ */
+int ff_sws_op_list_append(SwsOpList *ops, SwsOp *op);
+int ff_sws_op_list_insert_at(SwsOpList *ops, int index, SwsOp *op);
+
+void ff_sws_op_list_remove_at(SwsOpList *ops, int index, int count);
+
+/**
+ * Print out the contents of an operation list.
+ */
+void ff_sws_op_list_print(void *log_ctx, int log_level, const SwsOpList *ops);
+
+#endif
-- 
2.49.0


* [FFmpeg-devel] [PATCH v3 07/17] swscale/optimizer: add high-level ops optimizer
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (5 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 06/17] swscale/ops: introduce new low level framework Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 08/17] swscale/ops_internal: add internal ops backend API Niklas Haas
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This is responsible for taking a "naive" ops list and optimizing it
as much as possible. Also includes a small analyzer that generates component
metadata for use by the optimizer.
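
As a hypothetical illustration of the kind of rewrite this performs: a
naively constructed identity chain such as

    READ    1x u8
    CONVERT u8 -> u16 (expand)
    CONVERT u16 -> u8
    WRITE   1x u8

can be collapsed to a bare READ -> WRITE once the component metadata has
proven that the intermediate steps are exactly invertible and no other
component is live.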
---
 libswscale/Makefile        |   1 +
 libswscale/ops.h           |  12 +
 libswscale/ops_optimizer.c | 783 +++++++++++++++++++++++++++++++++++++
 3 files changed, 796 insertions(+)
 create mode 100644 libswscale/ops_optimizer.c

diff --git a/libswscale/Makefile b/libswscale/Makefile
index e0beef4e69..810c9dee78 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -16,6 +16,7 @@ OBJS = alphablend.o                                     \
        input.o                                          \
        lut3d.o                                          \
        ops.o                                            \
+       ops_optimizer.o                                  \
        options.o                                        \
        output.o                                         \
        rgb2rgb.o                                        \
diff --git a/libswscale/ops.h b/libswscale/ops.h
index 71841c1572..a90701cf50 100644
--- a/libswscale/ops.h
+++ b/libswscale/ops.h
@@ -237,4 +237,16 @@ void ff_sws_op_list_remove_at(SwsOpList *ops, int index, int count);
  */
 void ff_sws_op_list_print(void *log_ctx, int log_level, const SwsOpList *ops);
 
+/**
+ * Infer + propagate known information about components. Called automatically
+ * when needed by the optimizer and compiler.
+ */
+void ff_sws_op_list_update_comps(SwsOpList *ops);
+
+/**
+ * Fuse compatible and eliminate redundant operations, as well as replacing
+ * some operations with more efficient alternatives.
+ */
+int ff_sws_op_list_optimize(SwsOpList *ops);
+
 #endif
diff --git a/libswscale/ops_optimizer.c b/libswscale/ops_optimizer.c
new file mode 100644
index 0000000000..d503bf7bf3
--- /dev/null
+++ b/libswscale/ops_optimizer.c
@@ -0,0 +1,783 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/rational.h"
+
+#include "ops.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+
+#define RET(x)                                                                 \
+    do {                                                                       \
+        if ((ret = (x)) < 0)                                                   \
+            return ret;                                                        \
+    } while (0)
+
+/* Returns true for operations that are independent per channel. These can
+ * usually be commuted freely with other such operations. */
+static bool op_type_is_independent(SwsOpType op)
+{
+    switch (op) {
+    case SWS_OP_SWAP_BYTES:
+    case SWS_OP_LSHIFT:
+    case SWS_OP_RSHIFT:
+    case SWS_OP_CONVERT:
+    case SWS_OP_DITHER:
+    case SWS_OP_MIN:
+    case SWS_OP_MAX:
+    case SWS_OP_SCALE:
+        return true;
+    case SWS_OP_INVALID:
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+    case SWS_OP_SWIZZLE:
+    case SWS_OP_CLEAR:
+    case SWS_OP_LINEAR:
+    case SWS_OP_PACK:
+    case SWS_OP_UNPACK:
+        return false;
+    case SWS_OP_TYPE_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid operation type!");
+    return false;
+}
+
+static AVRational expand_factor(SwsPixelType from, SwsPixelType to)
+{
+    const int src = ff_sws_pixel_type_size(from);
+    const int dst = ff_sws_pixel_type_size(to);
+    int scale = 0;
+    for (int i = 0; i < dst / src; i++)
+        scale = scale << src * 8 | 1;
+    return Q(scale);
+}
+
+/* merge_comp_flags() forms a monoid with flags_identity as the identity element */
+static const unsigned flags_identity = SWS_COMP_ZERO | SWS_COMP_EXACT;
+static unsigned merge_comp_flags(unsigned a, unsigned b)
+{
+    const unsigned flags_or  = SWS_COMP_GARBAGE;
+    const unsigned flags_and = SWS_COMP_ZERO | SWS_COMP_EXACT;
+    return ((a & b) & flags_and) | ((a | b) & flags_or);
+}
+
+/* Infer + propagate known information about components */
+void ff_sws_op_list_update_comps(SwsOpList *ops)
+{
+    SwsComps next = { .unused = {true, true, true, true} };
+    SwsComps prev = { .flags = {
+        SWS_COMP_GARBAGE, SWS_COMP_GARBAGE, SWS_COMP_GARBAGE, SWS_COMP_GARBAGE,
+    }};
+
+    /* Forwards pass, propagates knowledge about the incoming pixel values */
+    for (int n = 0; n < ops->num_ops; n++) {
+        SwsOp *op = &ops->ops[n];
+
+        /* Prefill min/max values automatically; may have to be fixed in
+         * special cases */
+        memcpy(op->comps.min, prev.min, sizeof(prev.min));
+        memcpy(op->comps.max, prev.max, sizeof(prev.max));
+
+        if (op->op != SWS_OP_SWAP_BYTES) {
+            ff_sws_apply_op_q(op, op->comps.min);
+            ff_sws_apply_op_q(op, op->comps.max);
+        }
+
+        switch (op->op) {
+        case SWS_OP_READ:
+            for (int i = 0; i < op->rw.elems; i++) {
+                if (ff_sws_pixel_type_is_int(op->type)) {
+                    int bits = 8 * ff_sws_pixel_type_size(op->type);
+                    if (!op->rw.packed && ops->src.desc) {
+                        /* Use legal value range from pixdesc if available;
+                         * we don't need to do this for packed formats because
+                         * non-byte-aligned packed formats will necessarily go
+                         * through SWS_OP_UNPACK anyway */
+                        for (int c = 0; c < 4; c++) {
+                            if (ops->src.desc->comp[c].plane == i) {
+                                bits = ops->src.desc->comp[c].depth;
+                                break;
+                            }
+                        }
+                    }
+
+                    op->comps.flags[i] = SWS_COMP_EXACT;
+                    op->comps.min[i] = Q(0);
+                    op->comps.max[i] = Q((1ULL << bits) - 1);
+                }
+            }
+            for (int i = op->rw.elems; i < 4; i++)
+                op->comps.flags[i] = prev.flags[i];
+            break;
+        case SWS_OP_WRITE:
+            for (int i = 0; i < op->rw.elems; i++)
+                av_assert1(!(prev.flags[i] & SWS_COMP_GARBAGE));
+            /* fall through */
+        case SWS_OP_SWAP_BYTES:
+        case SWS_OP_LSHIFT:
+        case SWS_OP_RSHIFT:
+        case SWS_OP_MIN:
+        case SWS_OP_MAX:
+            /* Linearly propagate flags per component */
+            for (int i = 0; i < 4; i++)
+                op->comps.flags[i] = prev.flags[i];
+            break;
+        case SWS_OP_DITHER:
+            /* Strip zero flag because of the nonzero dithering offset */
+            for (int i = 0; i < 4; i++)
+                op->comps.flags[i] = prev.flags[i] & ~SWS_COMP_ZERO;
+            break;
+        case SWS_OP_UNPACK:
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    op->comps.flags[i] = prev.flags[0];
+                else
+                    op->comps.flags[i] = SWS_COMP_GARBAGE;
+            }
+            break;
+        case SWS_OP_PACK: {
+            unsigned flags = flags_identity;
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    flags = merge_comp_flags(flags, prev.flags[i]);
+                if (i > 0) /* clear remaining comps for sanity */
+                    op->comps.flags[i] = SWS_COMP_GARBAGE;
+            }
+            op->comps.flags[0] = flags;
+            break;
+        }
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (op->c.q4[i].den) {
+                    if (op->c.q4[i].num == 0) {
+                        op->comps.flags[i] = SWS_COMP_ZERO | SWS_COMP_EXACT;
+                    } else if (op->c.q4[i].den == 1) {
+                        op->comps.flags[i] = SWS_COMP_EXACT;
+                    }
+                } else {
+                    op->comps.flags[i] = prev.flags[i];
+                }
+            }
+            break;
+        case SWS_OP_SWIZZLE:
+            for (int i = 0; i < 4; i++)
+                op->comps.flags[i] = prev.flags[op->swizzle.in[i]];
+            break;
+        case SWS_OP_CONVERT:
+            for (int i = 0; i < 4; i++) {
+                op->comps.flags[i] = prev.flags[i];
+                if (ff_sws_pixel_type_is_int(op->convert.to))
+                    op->comps.flags[i] |= SWS_COMP_EXACT;
+            }
+            break;
+        case SWS_OP_LINEAR:
+            for (int i = 0; i < 4; i++) {
+                unsigned flags = flags_identity;
+                AVRational min = Q(0), max = Q(0);
+                for (int j = 0; j < 4; j++) {
+                    const AVRational k = op->lin.m[i][j];
+                    AVRational mink = av_mul_q(prev.min[j], k);
+                    AVRational maxk = av_mul_q(prev.max[j], k);
+                    if (k.num) {
+                        flags = merge_comp_flags(flags, prev.flags[j]);
+                        if (k.den != 1) /* fractional coefficient */
+                            flags &= ~SWS_COMP_EXACT;
+                        if (k.num < 0)
+                            FFSWAP(AVRational, mink, maxk);
+                        min = av_add_q(min, mink);
+                        max = av_add_q(max, maxk);
+                    }
+                }
+                if (op->lin.m[i][4].num) { /* nonzero offset */
+                    flags &= ~SWS_COMP_ZERO;
+                    if (op->lin.m[i][4].den != 1) /* fractional offset */
+                        flags &= ~SWS_COMP_EXACT;
+                    min = av_add_q(min, op->lin.m[i][4]);
+                    max = av_add_q(max, op->lin.m[i][4]);
+                }
+                op->comps.flags[i] = flags;
+                op->comps.min[i] = min;
+                op->comps.max[i] = max;
+            }
+            break;
+        case SWS_OP_SCALE:
+            for (int i = 0; i < 4; i++) {
+                op->comps.flags[i] = prev.flags[i];
+                if (op->c.q.den != 1) /* fractional scale */
+                    op->comps.flags[i] &= ~SWS_COMP_EXACT;
+                if (op->c.q.num < 0)
+                    FFSWAP(AVRational, op->comps.min[i], op->comps.max[i]);
+            }
+            break;
+
+        case SWS_OP_INVALID:
+        case SWS_OP_TYPE_NB:
+            av_assert0(!"Invalid operation type!");
+        }
+
+        prev = op->comps;
+    }
+
+    /* Backwards pass, solves for component dependencies */
+    for (int n = ops->num_ops - 1; n >= 0; n--) {
+        SwsOp *op = &ops->ops[n];
+
+        switch (op->op) {
+        case SWS_OP_READ:
+        case SWS_OP_WRITE:
+            for (int i = 0; i < op->rw.elems; i++)
+                op->comps.unused[i] = op->op == SWS_OP_READ;
+            for (int i = op->rw.elems; i < 4; i++)
+                op->comps.unused[i] = next.unused[i];
+            break;
+        case SWS_OP_SWAP_BYTES:
+        case SWS_OP_LSHIFT:
+        case SWS_OP_RSHIFT:
+        case SWS_OP_CONVERT:
+        case SWS_OP_DITHER:
+        case SWS_OP_MIN:
+        case SWS_OP_MAX:
+        case SWS_OP_SCALE:
+            for (int i = 0; i < 4; i++)
+                op->comps.unused[i] = next.unused[i];
+            break;
+        case SWS_OP_UNPACK: {
+            bool unused = true;
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    unused &= next.unused[i];
+                op->comps.unused[i] = i > 0;
+            }
+            op->comps.unused[0] = unused;
+            break;
+        }
+        case SWS_OP_PACK:
+            for (int i = 0; i < 4; i++) {
+                if (op->pack.pattern[i])
+                    op->comps.unused[i] = next.unused[0];
+                else
+                    op->comps.unused[i] = true;
+            }
+            break;
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (op->c.q4[i].den)
+                    op->comps.unused[i] = true;
+                else
+                    op->comps.unused[i] = next.unused[i];
+            }
+            break;
+        case SWS_OP_SWIZZLE: {
+            bool unused[4] = { true, true, true, true };
+            for (int i = 0; i < 4; i++)
+                unused[op->swizzle.in[i]] &= next.unused[i];
+            for (int i = 0; i < 4; i++)
+                op->comps.unused[i] = unused[i];
+            break;
+        }
+        case SWS_OP_LINEAR:
+            for (int j = 0; j < 4; j++) {
+                bool unused = true;
+                for (int i = 0; i < 4; i++) {
+                    if (op->lin.m[i][j].num)
+                        unused &= next.unused[i];
+                }
+                op->comps.unused[j] = unused;
+            }
+            break;
+        }
+
+        next = op->comps;
+    }
+}
+
+/* returns log2(x) only if x is a power of two, or 0 otherwise */
+static int exact_log2(const int x)
+{
+    int p;
+    if (x <= 0)
+        return 0;
+    p = av_log2(x);
+    return (1 << p) == x ? p : 0;
+}
+
+static int exact_log2_q(const AVRational x)
+{
+    if (x.den == 1)
+        return exact_log2(x.num);
+    else if (x.num == 1)
+        return -exact_log2(x.den);
+    else
+        return 0;
+}
+
+/**
+ * If a linear operation can be reduced to a scalar multiplication, returns
+ * true and writes the corresponding scaling factor to `out_scale`.
+ */
+static bool extract_scalar(const SwsLinearOp *c, SwsComps prev, SwsComps next,
+                           SwsConst *out_scale)
+{
+    SwsConst scale = {0};
+
+    /* There are components not on the main diagonal */
+    if (c->mask & ~SWS_MASK_DIAG4)
+        return false;
+
+    for (int i = 0; i < 4; i++) {
+        const AVRational s = c->m[i][i];
+        if ((prev.flags[i] & SWS_COMP_ZERO) || next.unused[i])
+            continue;
+        if (scale.q.den && av_cmp_q(s, scale.q))
+            return false;
+        scale.q = s;
+    }
+
+    if (scale.q.den)
+        *out_scale = scale;
+    return scale.q.den;
+}
+
+/* Extracts an integer clear operation (subset) from the given linear op. */
+static bool extract_constant_rows(SwsLinearOp *c, SwsComps prev,
+                                  SwsConst *out_clear)
+{
+    SwsConst clear = {0};
+    bool ret = false;
+
+    for (int i = 0; i < 4; i++) {
+        bool const_row = c->m[i][4].den == 1; /* offset is integer */
+        for (int j = 0; j < 4; j++) {
+            const_row &= c->m[i][j].num == 0 || /* scalar is zero */
+                         (prev.flags[j] & SWS_COMP_ZERO); /* input is zero */
+        }
+        if (const_row && (c->mask & SWS_MASK_ROW(i))) {
+            clear.q4[i] = c->m[i][4];
+            for (int j = 0; j < 5; j++)
+                c->m[i][j] = Q(i == j);
+            c->mask &= ~SWS_MASK_ROW(i);
+            ret = true;
+        }
+    }
+
+    if (ret)
+        *out_clear = clear;
+    return ret;
+}
+
+/* Unswizzle a linear operation by aligning single-input rows with
+ * their corresponding diagonal */
+static bool extract_swizzle(SwsLinearOp *op, SwsComps prev, SwsSwizzleOp *out_swiz)
+{
+    SwsSwizzleOp swiz = SWS_SWIZZLE(0, 1, 2, 3);
+    SwsLinearOp c = *op;
+
+    for (int i = 0; i < 4; i++) {
+        int idx = -1;
+        for (int j = 0; j < 4; j++) {
+            if (!c.m[i][j].num || (prev.flags[j] & SWS_COMP_ZERO))
+                continue;
+            if (idx >= 0)
+                return false; /* multiple inputs */
+            idx = j;
+        }
+
+        if (idx >= 0 && idx != i) {
+            /* Move coefficient to the diagonal */
+            c.m[i][i] = c.m[i][idx];
+            c.m[i][idx] = Q(0);
+            swiz.in[i] = idx;
+        }
+    }
+
+    if (swiz.mask == SWS_SWIZZLE(0, 1, 2, 3).mask)
+        return false; /* no swizzle was identified */
+
+    c.mask = ff_sws_linear_mask(c);
+    *out_swiz = swiz;
+    *op = c;
+    return true;
+}
+
+int ff_sws_op_list_optimize(SwsOpList *ops)
+{
+    int ret;
+
+retry:
+    ff_sws_op_list_update_comps(ops);
+
+    for (int n = 0; n < ops->num_ops;) {
+        SwsOp dummy = {0};
+        SwsOp *op = &ops->ops[n];
+        SwsOp *prev = n ? &ops->ops[n - 1] : &dummy;
+        SwsOp *next = n + 1 < ops->num_ops ? &ops->ops[n + 1] : &dummy;
+
+        /* common helper variable */
+        bool noop = true;
+
+        switch (op->op) {
+        case SWS_OP_READ:
+            /* Optimized further into refcopy / memcpy */
+            if (next->op == SWS_OP_WRITE &&
+                next->rw.elems == op->rw.elems &&
+                next->rw.packed == op->rw.packed &&
+                next->rw.frac == op->rw.frac)
+            {
+                ff_sws_op_list_remove_at(ops, n, 2);
+                av_assert1(ops->num_ops == 0);
+                return 0;
+            }
+
+            /* Skip reading extra unneeded components */
+            if (!op->rw.packed) {
+                int needed = op->rw.elems;
+                while (needed > 0 && next->comps.unused[needed - 1])
+                    needed--;
+                if (op->rw.elems != needed) {
+                    op->rw.elems = needed;
+                    op->rw.packed &= op->rw.elems > 1;
+                    goto retry;
+                }
+            }
+            break;
+
+        case SWS_OP_SWAP_BYTES:
+            /* Redundant (double) swap */
+            if (next->op == SWS_OP_SWAP_BYTES) {
+                ff_sws_op_list_remove_at(ops, n, 2);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_UNPACK:
+            /* Redundant unpack+pack */
+            if (next->op == SWS_OP_PACK && next->type == op->type &&
+                next->pack.pattern[0] == op->pack.pattern[0] &&
+                next->pack.pattern[1] == op->pack.pattern[1] &&
+                next->pack.pattern[2] == op->pack.pattern[2] &&
+                next->pack.pattern[3] == op->pack.pattern[3])
+            {
+                ff_sws_op_list_remove_at(ops, n, 2);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_LSHIFT:
+        case SWS_OP_RSHIFT:
+            /* Two shifts in the same direction */
+            if (next->op == op->op) {
+                op->c.u += next->c.u;
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* No-op shift */
+            if (!op->c.u) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (!op->c.q4[i].den)
+                    continue;
+
+                if ((prev->comps.flags[i] & SWS_COMP_ZERO) &&
+                    !(prev->comps.flags[i] & SWS_COMP_GARBAGE) &&
+                    op->c.q4[i].num == 0)
+                {
+                    /* Redundant clear-to-zero of zero component */
+                    op->c.q4[i].den = 0;
+                } else if (next->comps.unused[i]) {
+                    /* Unnecessary clear of unused component */
+                    op->c.q4[i] = (AVRational) {0, 0};
+                } else if (op->c.q4[i].den) {
+                    noop = false;
+                }
+            }
+
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            /* Transitive clear */
+            if (next->op == SWS_OP_CLEAR) {
+                for (int i = 0; i < 4; i++) {
+                    if (next->c.q4[i].den)
+                        op->c.q4[i] = next->c.q4[i];
+                }
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* Prefer to clear as late as possible, to avoid doing
+             * redundant work */
+            if ((op_type_is_independent(next->op) && next->op != SWS_OP_SWAP_BYTES) ||
+                next->op == SWS_OP_SWIZZLE)
+            {
+                if (next->op == SWS_OP_CONVERT)
+                    op->type = next->convert.to;
+                ff_sws_apply_op_q(next, op->c.q4);
+                FFSWAP(SwsOp, *op, *next);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_SWIZZLE: {
+            bool seen[4] = {0};
+            bool has_duplicates = false;
+            for (int i = 0; i < 4; i++) {
+                if (next->comps.unused[i])
+                    continue;
+                if (op->swizzle.in[i] != i)
+                    noop = false;
+                has_duplicates |= seen[op->swizzle.in[i]];
+                seen[op->swizzle.in[i]] = true;
+            }
+
+            /* Identity swizzle */
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            /* Transitive swizzle */
+            if (next->op == SWS_OP_SWIZZLE) {
+                const SwsSwizzleOp orig = op->swizzle;
+                for (int i = 0; i < 4; i++)
+                    op->swizzle.in[i] = orig.in[next->swizzle.in[i]];
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* Try to push swizzles with duplicates towards the output */
+            if (has_duplicates && op_type_is_independent(next->op)) {
+                if (next->op == SWS_OP_CONVERT)
+                    op->type = next->convert.to;
+                if (next->op == SWS_OP_MIN || next->op == SWS_OP_MAX) {
+                    /* Un-swizzle the next operation */
+                    const SwsConst c = next->c;
+                    for (int i = 0; i < 4; i++) {
+                        if (!next->comps.unused[i])
+                            next->c.q4[op->swizzle.in[i]] = c.q4[i];
+                    }
+                }
+                FFSWAP(SwsOp, *op, *next);
+                goto retry;
+            }
+
+            /* Move swizzle out of the way between two converts so that
+             * they may be merged */
+            if (prev->op == SWS_OP_CONVERT && next->op == SWS_OP_CONVERT) {
+                op->type = next->convert.to;
+                FFSWAP(SwsOp, *op, *next);
+                goto retry;
+            }
+            break;
+        }
+
+        case SWS_OP_CONVERT:
+            /* No-op conversion */
+            if (op->type == op->convert.to) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            /* Transitive conversion */
+            if (next->op == SWS_OP_CONVERT &&
+                op->convert.expand == next->convert.expand)
+            {
+                av_assert1(op->convert.to == next->type);
+                op->convert.to = next->convert.to;
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* Conversion followed by integer expansion */
+            if (next->op == SWS_OP_SCALE &&
+                !av_cmp_q(next->c.q, expand_factor(op->type, op->convert.to)))
+            {
+                op->convert.expand = true;
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_MIN:
+            for (int i = 0; i < 4; i++) {
+                if (next->comps.unused[i] || !op->c.q4[i].den)
+                    continue;
+                if (av_cmp_q(op->c.q4[i], prev->comps.max[i]) < 0)
+                    noop = false;
+            }
+
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_MAX:
+            for (int i = 0; i < 4; i++) {
+                if (next->comps.unused[i] || !op->c.q4[i].den)
+                    continue;
+                if (av_cmp_q(prev->comps.min[i], op->c.q4[i]) < 0)
+                    noop = false;
+            }
+
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_DITHER:
+            for (int i = 0; i < 4; i++) {
+                noop &= (prev->comps.flags[i] & SWS_COMP_EXACT) ||
+                        next->comps.unused[i];
+            }
+
+            if (noop) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+            break;
+
+        case SWS_OP_LINEAR: {
+            SwsSwizzleOp swizzle;
+            SwsConst c;
+
+            /* No-op (identity) linear operation */
+            if (!op->lin.mask) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            if (next->op == SWS_OP_LINEAR) {
+                /* 5x5 matrix multiplication after appending [ 0 0 0 0 1 ] */
+                const SwsLinearOp m1 = op->lin;
+                const SwsLinearOp m2 = next->lin;
+                for (int i = 0; i < 4; i++) {
+                    for (int j = 0; j < 5; j++) {
+                        AVRational sum = Q(0);
+                        for (int k = 0; k < 4; k++)
+                            sum = av_add_q(sum, av_mul_q(m2.m[i][k], m1.m[k][j]));
+                        if (j == 4) /* m1.m[4][j] == 1 */
+                            sum = av_add_q(sum, m2.m[i][4]);
+                        op->lin.m[i][j] = sum;
+                    }
+                }
+                op->lin.mask = ff_sws_linear_mask(op->lin);
+                ff_sws_op_list_remove_at(ops, n + 1, 1);
+                goto retry;
+            }
+
+            /* Optimize away zero columns */
+            for (int j = 0; j < 4; j++) {
+                const uint32_t col = SWS_MASK_COL(j);
+                if (!(prev->comps.flags[j] & SWS_COMP_ZERO) || !(op->lin.mask & col))
+                    continue;
+                for (int i = 0; i < 4; i++)
+                    op->lin.m[i][j] = Q(i == j);
+                op->lin.mask &= ~col;
+                goto retry;
+            }
+
+            /* Optimize away unused rows */
+            for (int i = 0; i < 4; i++) {
+                const uint32_t row = SWS_MASK_ROW(i);
+                if (!next->comps.unused[i] || !(op->lin.mask & row))
+                    continue;
+                for (int j = 0; j < 5; j++)
+                    op->lin.m[i][j] = Q(i == j);
+                op->lin.mask &= ~row;
+                goto retry;
+            }
+
+            /* Convert constant rows to explicit clear instruction */
+            if (extract_constant_rows(&op->lin, prev->comps, &c)) {
+                RET(ff_sws_op_list_insert_at(ops, n + 1, &(SwsOp) {
+                    .op    = SWS_OP_CLEAR,
+                    .type  = op->type,
+                    .comps = op->comps,
+                    .c     = c,
+                }));
+                goto retry;
+            }
+
+            /* Multiplication by scalar constant */
+            if (extract_scalar(&op->lin, prev->comps, next->comps, &c)) {
+                op->op = SWS_OP_SCALE;
+                op->c  = c;
+                goto retry;
+            }
+
+            /* Swizzle by fixed pattern */
+            if (extract_swizzle(&op->lin, prev->comps, &swizzle)) {
+                RET(ff_sws_op_list_insert_at(ops, n, &(SwsOp) {
+                    .op      = SWS_OP_SWIZZLE,
+                    .type    = op->type,
+                    .swizzle = swizzle,
+                }));
+                goto retry;
+            }
+            break;
+        }
+
+        case SWS_OP_SCALE: {
+            const int factor2 = exact_log2_q(op->c.q);
+
+            /* No-op scaling */
+            if (op->c.q.num == 1 && op->c.q.den == 1) {
+                ff_sws_op_list_remove_at(ops, n, 1);
+                goto retry;
+            }
+
+            /* Scaling by integer before conversion to int */
+            if (op->c.q.den == 1 &&
+                next->op == SWS_OP_CONVERT &&
+                ff_sws_pixel_type_is_int(next->convert.to))
+            {
+                op->type = next->convert.to;
+                FFSWAP(SwsOp, *op, *next);
+                goto retry;
+            }
+
+            /* Scaling by exact power of two */
+            if (factor2 && ff_sws_pixel_type_is_int(op->type)) {
+                op->op = factor2 > 0 ? SWS_OP_LSHIFT : SWS_OP_RSHIFT;
+                op->c.u = FFABS(factor2);
+                goto retry;
+            }
+            break;
+        }
+        }
+
+        /* No optimization triggered, move on to next operation */
+        n++;
+    }
+
+    return 0;
+}
-- 
2.49.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] [PATCH v3 08/17] swscale/ops_internal: add internal ops backend API
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (6 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 07/17] swscale/optimizer: add high-level ops optimizer Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 09/17] swscale/ops: add dispatch layer Niklas Haas
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This adds an internal API for ops backends, which are responsible for
compiling op lists into executable functions.
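
As a rough illustration (a hypothetical `null` backend, not part of this
patch), a backend implementation only needs to provide a name and a
compile callback:

    static int null_compile(SwsContext *ctx, SwsOpList *ops,
                            SwsCompiledOp *out)
    {
        /* A backend consumes the ops it can handle and returns
         * AVERROR(ENOTSUP) for anything it cannot; this one supports
         * nothing at all */
        return AVERROR(ENOTSUP);
    }

    static const SwsOpBackend backend_null = {
        .name    = "null",
        .compile = null_compile,
    };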
---
 libswscale/ops.c          |  62 ++++++++++++++++++++++
 libswscale/ops_internal.h | 108 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 170 insertions(+)
 create mode 100644 libswscale/ops_internal.h

diff --git a/libswscale/ops.c b/libswscale/ops.c
index 004686147d..8491bd9cad 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -25,9 +25,22 @@
 #include "libavutil/refstruct.h"
 
 #include "ops.h"
+#include "ops_internal.h"
+
+const SwsOpBackend * const ff_sws_op_backends[] = {
+    NULL
+};
+
+const int ff_sws_num_op_backends = FF_ARRAY_ELEMS(ff_sws_op_backends) - 1;
 
 #define Q(N) ((AVRational) { N, 1 })
 
+#define RET(x)                                                                 \
+    do {                                                                       \
+        if ((ret = (x)) < 0)                                                   \
+            return ret;                                                        \
+    } while (0)
+
 const char *ff_sws_pixel_type_name(SwsPixelType type)
 {
     switch (type) {
@@ -520,3 +533,52 @@ void ff_sws_op_list_print(void *log, int lev, const SwsOpList *ops)
 
     av_log(log, lev, "    (X = unused, + = exact, 0 = zero)\n");
 }
+
+int ff_sws_ops_compile_backend(SwsContext *ctx, const SwsOpBackend *backend,
+                               const SwsOpList *ops, SwsCompiledOp *out)
+{
+    SwsOpList *copy, rest;
+    int ret = 0;
+
+    copy = ff_sws_op_list_duplicate(ops);
+    if (!copy)
+        return AVERROR(ENOMEM);
+
+    /* Ensure these are always set during compilation */
+    ff_sws_op_list_update_comps(copy);
+
+    /* Make an on-stack copy of the list header, so the backend can consume
+     * it freely while we retain the original pointers for cleanup */
+    rest = *copy;
+
+    ret = backend->compile(ctx, &rest, out);
+    if (ret == AVERROR(ENOTSUP)) {
+        av_log(ctx, AV_LOG_DEBUG, "Backend '%s' does not support operations:\n", backend->name);
+        ff_sws_op_list_print(ctx, AV_LOG_DEBUG, &rest);
+    } else if (ret < 0) {
+        av_log(ctx, AV_LOG_ERROR, "Failed to compile operations: %s\n", av_err2str(ret));
+        ff_sws_op_list_print(ctx, AV_LOG_ERROR, &rest);
+    }
+
+    ff_sws_op_list_free(&copy);
+    return ret;
+}
+
+int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out)
+{
+    for (int n = 0; ff_sws_op_backends[n]; n++) {
+        const SwsOpBackend *backend = ff_sws_op_backends[n];
+        if (ff_sws_ops_compile_backend(ctx, backend, ops, out) < 0)
+            continue;
+
+        av_log(ctx, AV_LOG_VERBOSE, "Compiled using backend '%s': "
+               "block size = %d, over-read = %d, over-write = %d, cpu flags = 0x%x\n",
+               backend->name, out->block_size, out->over_read, out->over_write,
+               out->cpu_flags);
+        return 0;
+    }
+
+    av_log(ctx, AV_LOG_WARNING, "No backend found for operations:\n");
+    ff_sws_op_list_print(ctx, AV_LOG_WARNING, ops);
+    return AVERROR(ENOTSUP);
+}
diff --git a/libswscale/ops_internal.h b/libswscale/ops_internal.h
new file mode 100644
index 0000000000..9fd866430b
--- /dev/null
+++ b/libswscale/ops_internal.h
@@ -0,0 +1,108 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_INTERNAL_H
+#define SWSCALE_OPS_INTERNAL_H
+
+#include "libavutil/mem_internal.h"
+
+#include "ops.h"
+
+/**
+ * Global execution context for all compiled functions.
+ *
+ * Note: This struct is hard-coded in assembly, so do not change the layout
+ * without updating the corresponding assembly definitions.
+ */
+typedef struct SwsOpExec {
+    /* The data pointers point to the first pixel to process */
+    DECLARE_ALIGNED_32(const uint8_t, *in[4]);
+    DECLARE_ALIGNED_32(uint8_t, *out[4]);
+
+    /* Separation between lines in bytes */
+    DECLARE_ALIGNED_32(ptrdiff_t, in_stride[4]);
+    DECLARE_ALIGNED_32(ptrdiff_t, out_stride[4]);
+
+    /* Extra metadata, may or may not be useful */
+    int32_t width, height;      /* Overall image dimensions */
+    int32_t slice_y, slice_h;   /* Start and height of current slice */
+    int32_t pixel_bits_in;      /* Bits per input pixel */
+    int32_t pixel_bits_out;     /* Bits per output pixel */
+} SwsOpExec;
+
+static_assert(sizeof(SwsOpExec) == 16 * sizeof(void *) + 8 * sizeof(int32_t),
+              "SwsOpExec layout mismatch");
+
+/**
+ * Process a given range of pixel blocks.
+ *
+ * Note: `bx_start` and `bx_end` are in units of `SwsCompiledOp.block_size`.
+ */
+typedef void (*SwsOpFunc)(const SwsOpExec *exec, const void *priv,
+                          int bx_start, int y_start, int bx_end, int y_end);
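+/* For example, with block_size = 16, func(exec, priv, 2, y, 4, y + 1)
+ * processes pixels [32, 64) of line y. */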
+
+#define SWS_DECL_FUNC(NAME) \
+    void NAME(const SwsOpExec *, const void *, int, int, int, int)
+
+typedef struct SwsCompiledOp {
+    SwsOpFunc func;
+
+    int block_size; /* number of pixels processed per iteration */
+    int over_read;  /* implementation over-reads input by this many bytes */
+    int over_write; /* implementation over-writes output by this many bytes */
+    int cpu_flags;  /* active set of CPU flags (informative) */
+
+    /* Arbitrary private data */
+    void *priv;
+    void (*free)(void *priv);
+} SwsCompiledOp;
+
+typedef struct SwsOpBackend {
+    const char *name; /* Descriptive name for this backend */
+
+    /**
+     * Compile an operation list to an implementation chain. May modify `ops`
+     * freely; the original list will be freed automatically by the caller.
+     *
+     * Returns 0 or a negative error code.
+     */
+    int (*compile)(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out);
+} SwsOpBackend;
+
+/* List of all backends, terminated by NULL */
+extern const SwsOpBackend *const ff_sws_op_backends[];
+extern const int ff_sws_num_op_backends; /* excludes terminating NULL */
+
+/**
+ * Attempt to compile a list of operations using a specific backend.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_ops_compile_backend(SwsContext *ctx, const SwsOpBackend *backend,
+                               const SwsOpList *ops, SwsCompiledOp *out);
+
+/**
+ * Compile a list of operations using the best available backend.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out);
+
+#endif
-- 
2.49.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] [PATCH v3 09/17] swscale/ops: add dispatch layer
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (7 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 08/17] swscale/ops_internal: add internal ops backend API Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 10/17] swscale/optimizer: add packed shuffle solver Niklas Haas
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This handles the low-level execution of an op list, and integration into
the SwsGraph infrastructure. To handle frames with insufficient padding in
the stride (or a width smaller than one block size), we use a fallback loop
that pads the last column of pixels using `memcpy` into an appropriately
sized buffer.
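
For example (illustrative numbers only): with a block size of 16 pixels
and an image width of 100, the pass runs ceil(100 / 16) = 7 blocks per
line; the first 6 blocks cover 96 pixels directly, and the final block
covers the remaining 4 pixels, which is where the memcpy fallback kicks
in whenever the buffer padding cannot absorb the overshoot:

    /* mirrors the setup logic in op_pass_setup() */
    num_blocks = (100 + 16 - 1) / 16;      /* = 7          */
    safe_width = (num_blocks - 1) * 16;    /* = 96 pixels  */
    tail_size  = 100 - safe_width;         /* =  4 pixels  */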
---
 libswscale/ops.c          | 269 ++++++++++++++++++++++++++++++++++++++
 libswscale/ops.h          |  14 ++
 libswscale/ops_internal.h |  18 ++-
 3 files changed, 294 insertions(+), 7 deletions(-)

diff --git a/libswscale/ops.c b/libswscale/ops.c
index 8491bd9cad..60791ba875 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -582,3 +582,272 @@ int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out
     ff_sws_op_list_print(ctx, AV_LOG_WARNING, ops);
     return AVERROR(ENOTSUP);
 }
+
+typedef struct SwsOpPass {
+    SwsCompiledOp comp;
+    SwsOpExec exec_base;
+    int num_blocks;
+    int tail_off_in;
+    int tail_off_out;
+    int tail_size_in;
+    int tail_size_out;
+    int pixel_bits_in;
+    int pixel_bits_out;
+    bool memcpy_in;
+    bool memcpy_out;
+} SwsOpPass;
+
+static void op_pass_free(void *ptr)
+{
+    SwsOpPass *p = ptr;
+    if (!p)
+        return;
+
+    if (p->comp.free)
+        p->comp.free(p->comp.priv);
+
+    av_free(p);
+}
+
+static void op_pass_setup(const SwsImg *out, const SwsImg *in, const SwsPass *pass)
+{
+    const AVPixFmtDescriptor *indesc  = av_pix_fmt_desc_get(in->fmt);
+    const AVPixFmtDescriptor *outdesc = av_pix_fmt_desc_get(out->fmt);
+
+    SwsOpPass *p = pass->priv;
+    SwsOpExec *exec = &p->exec_base;
+    const SwsCompiledOp *comp = &p->comp;
+    const int block_size = comp->block_size;
+    p->num_blocks = (pass->width + block_size - 1) / block_size;
+
+    /* Set up main loop parameters */
+    const int aligned_w  = p->num_blocks * block_size;
+    const int safe_width = (p->num_blocks - 1) * block_size;
+    const int tail_size  = pass->width - safe_width;
+    p->tail_off_in   = safe_width * p->pixel_bits_in  >> 3;
+    p->tail_off_out  = safe_width * p->pixel_bits_out >> 3;
+    p->tail_size_in  = tail_size  * p->pixel_bits_in  >> 3;
+    p->tail_size_out = tail_size  * p->pixel_bits_out >> 3;
+    p->memcpy_in     = false;
+    p->memcpy_out    = false;
+
+    for (int i = 0; i < 4 && in->data[i]; i++) {
+        const int sub_x      = (i == 1 || i == 2) ? indesc->log2_chroma_w : 0;
+        const int plane_w    = (aligned_w + sub_x) >> sub_x;
+        const int plane_pad  = (comp->over_read + sub_x) >> sub_x;
+        const int plane_size = plane_w * p->pixel_bits_in >> 3;
+        p->memcpy_in |= plane_size + plane_pad > in->linesize[i];
+        exec->in_stride[i] = in->linesize[i];
+    }
+
+    for (int i = 0; i < 4 && out->data[i]; i++) {
+        const int sub_x      = (i == 1 || i == 2) ? outdesc->log2_chroma_w : 0;
+        const int plane_w    = (aligned_w + sub_x) >> sub_x;
+        const int plane_pad  = (comp->over_write + sub_x) >> sub_x;
+        const int plane_size = plane_w * p->pixel_bits_out >> 3;
+        p->memcpy_out |= plane_size + plane_pad > out->linesize[i];
+        exec->out_stride[i] = out->linesize[i];
+    }
+
+    /* Pre-fill pointer bump for the main section only; this value does not
+     * matter at all for the tail / last row handlers because they only ever
+     * process a single line */
+    const int blocks_main = p->num_blocks - p->memcpy_out;
+    for (int i = 0; i < 4; i++) {
+        exec->in_bump[i]  = in->linesize[i]  - blocks_main * exec->block_size_in;
+        exec->out_bump[i] = out->linesize[i] - blocks_main * exec->block_size_out;
+    }
+}
+
+/* Dispatch kernel over the last column of the image using memcpy */
+static av_always_inline void
+handle_tail(const SwsOpPass *p, SwsOpExec *exec,
+            const SwsImg *out_base, const bool copy_out,
+            const SwsImg *in_base, const bool copy_in,
+            int y, const int h)
+{
+    DECLARE_ALIGNED_64(uint8_t, tmp)[2][4][sizeof(uint32_t[128])];
+
+    const SwsCompiledOp *comp = &p->comp;
+    const int tail_size_in  = p->tail_size_in;
+    const int tail_size_out = p->tail_size_out;
+    const int bx = p->num_blocks - 1;
+
+    SwsImg in  = ff_sws_img_shift(in_base,  y);
+    SwsImg out = ff_sws_img_shift(out_base, y);
+    for (int i = 0; i < 4 && in.data[i]; i++) {
+        in.data[i]  += p->tail_off_in;
+        if (copy_in) {
+            exec->in[i] = (void *) tmp[0][i];
+            exec->in_stride[i] = sizeof(tmp[0][i]);
+        } else {
+            exec->in[i] = in.data[i];
+        }
+    }
+
+    for (int i = 0; i < 4 && out.data[i]; i++) {
+        out.data[i] += p->tail_off_out;
+        if (copy_out) {
+            exec->out[i] = (void *) tmp[1][i];
+            exec->out_stride[i] = sizeof(tmp[1][i]);
+        } else {
+            exec->out[i] = out.data[i];
+        }
+    }
+
+    for (int y_end = y + h; y < y_end; y++) {
+        if (copy_in) {
+            for (int i = 0; i < 4 && in.data[i]; i++) {
+                av_assert2(tmp[0][i] + tail_size_in < (uint8_t *) tmp[1]);
+                memcpy(tmp[0][i], in.data[i], tail_size_in);
+                in.data[i] += in.linesize[i];
+            }
+        }
+
+        comp->func(exec, comp->priv, bx, y, p->num_blocks, y + 1);
+
+        if (copy_out) {
+            for (int i = 0; i < 4 && out.data[i]; i++) {
+                av_assert2(tmp[1][i] + tail_size_out < (uint8_t *) tmp[2]);
+                memcpy(out.data[i], tmp[1][i], tail_size_out);
+                out.data[i] += out.linesize[i];
+            }
+        }
+
+        for (int i = 0; i < 4; i++) {
+            if (!copy_in)
+                exec->in[i] += in.linesize[i];
+            if (!copy_out)
+                exec->out[i] += out.linesize[i];
+        }
+    }
+}
+
+static void op_pass_run(const SwsImg *out_base, const SwsImg *in_base,
+                        const int y, const int h, const SwsPass *pass)
+{
+    const SwsOpPass *p = pass->priv;
+    const SwsCompiledOp *comp = &p->comp;
+    const SwsImg in  = ff_sws_img_shift(in_base,  y);
+    const SwsImg out = ff_sws_img_shift(out_base, y);
+
+    /* Fill exec metadata for this slice */
+    DECLARE_ALIGNED_32(SwsOpExec, exec) = p->exec_base;
+    exec.slice_y = y;
+    exec.slice_h = h;
+    for (int i = 0; i < 4; i++) {
+        exec.in[i]  = in.data[i];
+        exec.out[i] = out.data[i];
+    }
+
+    /* To ensure safety, we need to consider the following:
+     *
+     * 1. We can overread the input, unless this is the last line of an
+     *    unpadded buffer. All defined operations can handle arbitrary pixel
+     *    input, so overread of arbitrary data is fine.
+     *
+     * 2. We can overwrite the output, as long as we don't write more than
+     *    the number of pixels that fit into one linesize. So we always need
+     *    to memcpy the last column on the output side if unpadded.
+     *
+     * 3. For the last row, we also need to memcpy the remainder of the input,
+     *    to avoid reading past the end of the buffer. Note that since we know
+     *    the run() function is called on stripes of the same buffer, we don't
+     *    need to worry about this for the end of a slice.
+     */
+
+    const int last_slice  = y + h == pass->height;
+    const bool memcpy_in  = last_slice && p->memcpy_in;
+    const bool memcpy_out = p->memcpy_out;
+    const int num_blocks  = p->num_blocks;
+    const int blocks_main = num_blocks - memcpy_out;
+    const int h_main      = h - memcpy_in;
+
+    /* Handle main section */
+    comp->func(&exec, comp->priv, 0, y, blocks_main, y + h_main);
+
+    if (memcpy_in) {
+        /* Safe part of last row */
+        for (int i = 0; i < 4; i++) {
+            exec.in[i]  += h_main * in.linesize[i];
+            exec.out[i] += h_main * out.linesize[i];
+        }
+        comp->func(&exec, comp->priv, 0, y + h_main, num_blocks - 1, y + h);
+    }
+
+    /* Handle the last column via memcpy; this takes over `exec`, so call
+     * these last */
+    if (memcpy_out)
+        handle_tail(p, &exec, out_base, true, in_base, false, y, h_main);
+    if (memcpy_in)
+        handle_tail(p, &exec, out_base, memcpy_out, in_base, true, y + h_main, 1);
+}
+
+static int rw_pixel_bits(const SwsOp *op)
+{
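+    /* `frac` is the log2 denominator of the fractional pixel size, e.g.
+     * frac = 3 for a packed 1 bpp format (8 pixels per byte), hence the
+     * `8 >> frac` term */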
+    const int elems = op->rw.packed ? op->rw.elems : 1;
+    const int size  = ff_sws_pixel_type_size(op->type);
+    const int bits  = 8 >> op->rw.frac;
+    av_assert1(bits >= 1);
+    return elems * size * bits;
+}
+
+int ff_sws_compile_pass(SwsGraph *graph, SwsOpList *ops, int flags, SwsFormat dst,
+                        SwsPass *input, SwsPass **output)
+{
+    SwsContext *ctx = graph->ctx;
+    SwsOpPass *p = NULL;
+    const SwsOp *read = &ops->ops[0];
+    const SwsOp *write = &ops->ops[ops->num_ops - 1];
+    SwsPass *pass;
+    int ret;
+
+    if (ops->num_ops < 2) {
+        av_log(ctx, AV_LOG_ERROR, "Need at least two operations.\n");
+        return AVERROR(EINVAL);
+    }
+
+    if (read->op != SWS_OP_READ || write->op != SWS_OP_WRITE) {
+        av_log(ctx, AV_LOG_ERROR, "First and last operations must be a read "
+               "and write, respectively.\n");
+        return AVERROR(EINVAL);
+    }
+
+    if (flags & SWS_OP_FLAG_OPTIMIZE)
+        RET(ff_sws_op_list_optimize(ops));
+    else
+        ff_sws_op_list_update_comps(ops);
+
+    p = av_mallocz(sizeof(*p));
+    if (!p)
+        return AVERROR(ENOMEM);
+
+    ret = ff_sws_ops_compile(ctx, ops, &p->comp);
+    if (ret < 0)
+        goto fail;
+
+    p->pixel_bits_in  = rw_pixel_bits(read);
+    p->pixel_bits_out = rw_pixel_bits(write);
+    p->exec_base = (SwsOpExec) {
+        .width  = dst.width,
+        .height = dst.height,
+        .block_size_in  = p->comp.block_size * p->pixel_bits_in  >> 3,
+        .block_size_out = p->comp.block_size * p->pixel_bits_out >> 3,
+    };
+
+    pass = ff_sws_graph_add_pass(graph, dst.format, dst.width, dst.height, input,
+                                 1, p, op_pass_run);
+    if (!pass) {
+        ret = AVERROR(ENOMEM);
+        goto fail;
+    }
+    pass->setup = op_pass_setup;
+    pass->free  = op_pass_free;
+
+    *output = pass;
+    return 0;
+
+fail:
+    op_pass_free(p);
+    return ret;
+}
diff --git a/libswscale/ops.h b/libswscale/ops.h
index a90701cf50..c4701404e1 100644
--- a/libswscale/ops.h
+++ b/libswscale/ops.h
@@ -249,4 +249,18 @@ void ff_sws_op_list_update_comps(SwsOpList *ops);
  */
 int ff_sws_op_list_optimize(SwsOpList *ops);
 
+enum SwsOpCompileFlags {
+    /* Automatically optimize the operations when compiling */
+    SWS_OP_FLAG_OPTIMIZE = 1 << 0,
+};
+
+/**
+ * Resolves an operation list into a graph pass. The first and last operations
+ * must be a read and a write, respectively. `flags` is a bitmask of
+ * SwsOpCompileFlags.
+ *
+ * Note: `ops` may be modified by this function.
+ */
+int ff_sws_compile_pass(SwsGraph *graph, SwsOpList *ops, int flags, SwsFormat dst,
+                        SwsPass *input, SwsPass **output);
+
 #endif
diff --git a/libswscale/ops_internal.h b/libswscale/ops_internal.h
index 9fd866430b..2fbd8a55d0 100644
--- a/libswscale/ops_internal.h
+++ b/libswscale/ops_internal.h
@@ -33,21 +33,25 @@
  */
 typedef struct SwsOpExec {
     /* The data pointers point to the first pixel to process */
-    DECLARE_ALIGNED_32(const uint8_t, *in[4]);
-    DECLARE_ALIGNED_32(uint8_t, *out[4]);
+    const uint8_t *in[4];
+    uint8_t *out[4];
 
     /* Separation between lines in bytes */
-    DECLARE_ALIGNED_32(ptrdiff_t, in_stride[4]);
-    DECLARE_ALIGNED_32(ptrdiff_t, out_stride[4]);
+    ptrdiff_t in_stride[4];
+    ptrdiff_t out_stride[4];
+
+    /* Pointer bump, difference between stride and processed line size */
+    ptrdiff_t in_bump[4];
+    ptrdiff_t out_bump[4];
 
     /* Extra metadata, may or may not be useful */
     int32_t width, height;      /* Overall image dimensions */
     int32_t slice_y, slice_h;   /* Start and height of current slice */
-    int32_t pixel_bits_in;      /* Bits per input pixel */
-    int32_t pixel_bits_out;     /* Bits per output pixel */
+    int32_t block_size_in;      /* Size of a block of pixels in bytes */
+    int32_t block_size_out;
 } SwsOpExec;
 
-static_assert(sizeof(SwsOpExec) == 16 * sizeof(void *) + 8 * sizeof(int32_t),
+static_assert(sizeof(SwsOpExec) == 24 * sizeof(void *) + 6 * sizeof(int32_t),
               "SwsOpExec layout mismatch");
 
 /**
-- 
2.49.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] [PATCH v3 10/17] swscale/optimizer: add packed shuffle solver
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (8 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 09/17] swscale/ops: add dispatch layer Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 11/17] swscale/ops_chain: add internal abstraction for kernel linking Niklas Haas
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This can turn any compatible sequence of operations into a single packed
shuffle, including packed swizzling, grayscale->RGB conversion, endianness
swapping, RGB bit depth conversions, rgb24->rgb0 alpha clearing and more.
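
For example (illustration only), an rgb24 -> bgr24 conversion, i.e. the
op sequence

    READ    3x U8 (packed)
    SWIZZLE 2 1 0
    WRITE   3x U8 (packed)

solves to the repeating 3-byte shuffle mask {2, 1, 0, 5, 4, 3, 8, 7, 6,
...}, which a backend can then implement with a single byte-shuffle
instruction such as pshufb.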
---
 libswscale/ops_internal.h  | 28 +++++++++++
 libswscale/ops_optimizer.c | 96 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 124 insertions(+)

diff --git a/libswscale/ops_internal.h b/libswscale/ops_internal.h
index 2fbd8a55d0..e7b6fb1c4c 100644
--- a/libswscale/ops_internal.h
+++ b/libswscale/ops_internal.h
@@ -109,4 +109,32 @@ int ff_sws_ops_compile_backend(SwsContext *ctx, const SwsOpBackend *backend,
  */
 int ff_sws_ops_compile(SwsContext *ctx, const SwsOpList *ops, SwsCompiledOp *out);
 
+/**
+ * "Solve" an op list into a fixed shuffle mask, with an optional ability to
+ * also directly clear the output value (for e.g. rgb24 -> rgb0). This can
+ * accept any operation chain that only consists of the following operations:
+ *
+ * - SWS_OP_READ (non-planar, non-fractional)
+ * - SWS_OP_SWIZZLE
+ * - SWS_OP_SWAP_BYTES
+ * - SWS_OP_CLEAR to zero (when clear_val is specified)
+ * - SWS_OP_CONVERT (integer expand)
+ * - SWS_OP_WRITE (non-planar, non-fractional)
+ *
+ * Basically, any operation that purely consists of moving around and
+ * reordering bytes within a single plane can be turned into a shuffle mask.
+ *
+ * @param ops         The operation list to decompose.
+ * @param shuffle     The output shuffle mask.
+ * @param size        The size (in bytes) of the output shuffle mask.
+ * @param clear_val   If nonzero, this index will be used to clear the output.
+ * @param read_bytes  Returns the number of bytes read per shuffle iteration.
+ * @param write_bytes Returns the number of bytes written per shuffle iteration.
+ *
+ * @return  The number of pixels processed per iteration, or a negative error
+            code; in particular AVERROR(ENOTSUP) for unsupported operations.
+ */
+int ff_sws_solve_shuffle(const SwsOpList *ops, uint8_t shuffle[], int size,
+                         uint8_t clear_val, int *read_bytes, int *write_bytes);
+
 #endif
diff --git a/libswscale/ops_optimizer.c b/libswscale/ops_optimizer.c
index d503bf7bf3..9cde60ed58 100644
--- a/libswscale/ops_optimizer.c
+++ b/libswscale/ops_optimizer.c
@@ -19,9 +19,11 @@
  */
 
 #include "libavutil/avassert.h"
+#include <libavutil/bswap.h>
 #include "libavutil/rational.h"
 
 #include "ops.h"
+#include "ops_internal.h"
 
 #define Q(N) ((AVRational) { N, 1 })
 
@@ -781,3 +783,97 @@ retry:
 
     return 0;
 }
+
+int ff_sws_solve_shuffle(const SwsOpList *const ops, uint8_t shuffle[],
+                         int shuffle_size, uint8_t clear_val,
+                         int *out_read_bytes, int *out_write_bytes)
+{
+    const SwsOp read = ops->ops[0];
+    const int read_size = ff_sws_pixel_type_size(read.type);
+    uint32_t mask[4] = {0};
+
+    if (!ops->num_ops || read.op != SWS_OP_READ)
+        return AVERROR(EINVAL);
+    if (read.rw.frac || (!read.rw.packed && read.rw.elems > 1))
+        return AVERROR(ENOTSUP);
+
+    for (int i = 0; i < read.rw.elems; i++)
+        mask[i] = 0x01010101 * i * read_size + 0x03020100;
+
+    for (int opidx = 1; opidx < ops->num_ops; opidx++) {
+        const SwsOp *op = &ops->ops[opidx];
+        switch (op->op) {
+        case SWS_OP_SWIZZLE: {
+            uint32_t orig[4] = { mask[0], mask[1], mask[2], mask[3] };
+            for (int i = 0; i < 4; i++)
+                mask[i] = orig[op->swizzle.in[i]];
+            break;
+        }
+
+        case SWS_OP_SWAP_BYTES:
+            for (int i = 0; i < 4; i++) {
+                switch (ff_sws_pixel_type_size(op->type)) {
+                case 2: mask[i] = av_bswap16(mask[i]); break;
+                case 4: mask[i] = av_bswap32(mask[i]); break;
+                }
+            }
+            break;
+
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (!op->c.q4[i].den)
+                    continue;
+                if (op->c.q4[i].num != 0 || !clear_val)
+                    return AVERROR(ENOTSUP);
+                mask[i] = 0x1010101ul * clear_val;
+            }
+            break;
+
+        case SWS_OP_CONVERT: {
+            if (!op->convert.expand)
+                return AVERROR(ENOTSUP);
+            for (int i = 0; i < 4; i++) {
+                switch (ff_sws_pixel_type_size(op->type)) {
+                case 1: mask[i] = 0x01010101 * (mask[i] & 0xFF);   break;
+                case 2: mask[i] = 0x00010001 * (mask[i] & 0xFFFF); break;
+                }
+            }
+            break;
+        }
+
+        case SWS_OP_WRITE: {
+            if (op->rw.frac || !op->rw.packed)
+                return AVERROR(ENOTSUP);
+
+            /* Initialize to no-op */
+            memset(shuffle, clear_val, shuffle_size);
+
+            const int write_size  = ff_sws_pixel_type_size(op->type);
+            const int read_chunk  = read.rw.elems * read_size;
+            const int write_chunk = op->rw.elems * write_size;
+            const int num_groups  = shuffle_size / FFMAX(read_chunk, write_chunk);
+            for (int n = 0; n < num_groups; n++) {
+                const int base_in  = n * read_chunk;
+                const int base_out = n * write_chunk;
+                for (int i = 0; i < op->rw.elems; i++) {
+                    const int offset = base_out + i * write_size;
+                    for (int b = 0; b < write_size; b++) {
+                        const uint8_t idx = mask[i] >> (b * 8);
+                        if (idx != clear_val)
+                            shuffle[offset + b] = base_in + idx;
+                    }
+                }
+            }
+
+            *out_read_bytes  = num_groups * read_chunk;
+            *out_write_bytes = num_groups * write_chunk;
+            return num_groups;
+        }
+
+        default:
+            return AVERROR(ENOTSUP);
+        }
+    }
+
+    return AVERROR(EINVAL);
+}
-- 
2.49.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] [PATCH v3 11/17] swscale/ops_chain: add internal abstraction for kernel linking
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (9 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 10/17] swscale/optimizer: add packed shuffle solver Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 12/17] swscale/ops_backend: add reference backend basend on C templates Niklas Haas
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

See doc/swscale-v2.txt for design details.
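
In short, compiled kernels are linked into a chain where each entry holds
a continuation pointer to the next kernel along with that kernel's private
data. A rough usage sketch of the API added here (error handling omitted;
`func`, `free_priv` and `priv` assumed to come from a matched op table
entry):

    SwsOpChain *chain = ff_sws_op_chain_alloc();
    if (!chain)
        return AVERROR(ENOMEM);

    /* append one compiled kernel; `priv` is freed via the supplied
     * callback when the chain is freed */
    ret = ff_sws_op_chain_append(chain, func, free_priv, priv);

    ...

    ff_sws_op_chain_free(chain);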
---
 libswscale/Makefile    |   1 +
 libswscale/ops_chain.c | 283 +++++++++++++++++++++++++++++++++++++++++
 libswscale/ops_chain.h | 134 +++++++++++++++++++
 3 files changed, 418 insertions(+)
 create mode 100644 libswscale/ops_chain.c
 create mode 100644 libswscale/ops_chain.h

diff --git a/libswscale/Makefile b/libswscale/Makefile
index 810c9dee78..c9dfa78c89 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -16,6 +16,7 @@ OBJS = alphablend.o                                     \
        input.o                                          \
        lut3d.o                                          \
        ops.o                                            \
+       ops_chain.o                                      \
        ops_optimizer.o                                  \
        options.o                                        \
        output.o                                         \
diff --git a/libswscale/ops_chain.c b/libswscale/ops_chain.c
new file mode 100644
index 0000000000..129408df27
--- /dev/null
+++ b/libswscale/ops_chain.c
@@ -0,0 +1,283 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/mem.h"
+#include "libavutil/rational.h"
+
+#include "ops_chain.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+
+SwsOpChain *ff_sws_op_chain_alloc(void)
+{
+    return av_mallocz(sizeof(SwsOpChain));
+}
+
+void ff_sws_op_chain_free(SwsOpChain *chain)
+{
+    if (!chain)
+        return;
+
+    for (int i = 0; i < chain->num_impl + 1; i++) {
+        if (chain->free[i])
+            chain->free[i](chain->impl[i].priv.ptr);
+    }
+
+    av_free(chain);
+}
+
+int ff_sws_op_chain_append(SwsOpChain *chain, SwsFuncPtr func,
+                           void (*free)(void *), SwsOpPriv priv)
+{
+    const int idx = chain->num_impl;
+    if (idx == SWS_MAX_OPS)
+        return AVERROR(EINVAL);
+
+    av_assert1(func);
+    chain->impl[idx].cont = func;
+    chain->impl[idx + 1].priv = priv;
+    chain->free[idx + 1] = free;
+    chain->num_impl++;
+    return 0;
+}
+
+/**
+ * Match an operation against a reference entry. Returns a score for how
+ * well the entry matches the operation, or 0 if there is no match.
+ *
+ * If the entry has any components marked as unused, they must be marked
+ * as unused in `op` as well.
+ *
+ * For SWS_OP_LINEAR, `entry->linear_mask` must be a strict superset of
+ * `op->lin.mask`, but may not contain any columns explicitly ignored by
+ * `op->comps.unused`.
+ *
+ * For SWS_OP_READ, SWS_OP_WRITE, SWS_OP_SWAP_BYTES and SWS_OP_SWIZZLE, the
+ * exact type is not checked, just the size.
+ *
+ * Components set in `next.unused` are ignored when matching. If the entry
+ * is flexible, the op body is ignored: only the operation, pixel type and
+ * component masks are checked.
+ */
+static int op_match(const SwsOp *op, const SwsOpEntry *entry, const SwsComps next)
+{
+    int score = 10;
+    if (op->op != entry->op)
+        return 0;
+
+    switch (op->op) {
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+    case SWS_OP_SWAP_BYTES:
+    case SWS_OP_SWIZZLE:
+        /* Only the size matters for these operations */
+        if (ff_sws_pixel_type_size(op->type) != ff_sws_pixel_type_size(entry->type))
+            return 0;
+        break;
+    default:
+        if (op->type != entry->type)
+            return 0;
+        break;
+    }
+
+    for (int i = 0; i < 4; i++) {
+        if (entry->unused[i]) {
+            if (op->comps.unused[i])
+                score += 1; /* Operating on fewer components is better .. */
+            else
+                return 0; /* .. but not too few! */
+        }
+    }
+
+    /* Flexible variants always match, but lower the score to prioritize more
+     * specific implementations if they exist */
+    if (entry->flexible)
+        return score - 5;
+
+    switch (op->op) {
+    case SWS_OP_INVALID:
+        return 0;
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+        if (op->rw.elems   != entry->rw.elems ||
+            op->rw.frac    != entry->rw.frac  ||
+            (op->rw.elems > 1 && op->rw.packed != entry->rw.packed))
+            return 0;
+        return score;
+    case SWS_OP_SWAP_BYTES:
+        return score;
+    case SWS_OP_PACK:
+    case SWS_OP_UNPACK:
+        for (int i = 0; i < 4 && op->pack.pattern[i]; i++) {
+            if (op->pack.pattern[i] != entry->pack.pattern[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_CLEAR:
+        for (int i = 0; i < 4; i++) {
+            if (!op->c.q4[i].den)
+                continue;
+            if (av_cmp_q(op->c.q4[i], Q(entry->clear_value)) && !next.unused[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_LSHIFT:
+    case SWS_OP_RSHIFT:
+        av_assert1(entry->flexible);
+        return score;
+    case SWS_OP_SWIZZLE:
+        for (int i = 0; i < 4; i++) {
+            if (op->swizzle.in[i] != entry->swizzle.in[i] && !next.unused[i])
+                return 0;
+        }
+        return score;
+    case SWS_OP_CONVERT:
+        if (op->convert.to     != entry->convert.to ||
+            op->convert.expand != entry->convert.expand)
+            return 0;
+        return score;
+    case SWS_OP_DITHER:
+        return op->dither.size_log2 == entry->dither_size ? score : 0;
+    case SWS_OP_MIN:
+    case SWS_OP_MAX:
+        av_assert1(entry->flexible);
+        return score;
+    case SWS_OP_LINEAR:
+        /* All required elements must be present */
+        if (op->lin.mask & ~entry->linear_mask)
+            return 0;
+        /* To avoid operating on possibly undefined memory, filter out
+         * implementations that operate on more input components */
+        for (int i = 0; i < 4; i++) {
+            if ((entry->linear_mask & SWS_MASK_COL(i)) && op->comps.unused[i])
+                return 0;
+        }
+        /* Prioritize smaller implementations */
+        score += av_popcount(SWS_MASK_ALL ^ entry->linear_mask);
+        return score;
+    case SWS_OP_SCALE:
+        return score;
+    case SWS_OP_TYPE_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid operation type!");
+    return 0;
+}
+
+int ff_sws_op_compile_tables(const SwsOpTable *const tables[], int num_tables,
+                             SwsOpList *ops, const int block_size,
+                             SwsOpChain *chain)
+{
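+    /* Stand-in for the op following the final one; marking all components
+     * as unused relaxes the matching constraints for the last op */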
+    static const SwsOp dummy = { .comps.unused = { true, true, true, true }};
+    const SwsOp *next = ops->num_ops > 1 ? &ops->ops[1] : &dummy;
+    const unsigned cpu_flags = av_get_cpu_flags();
+    const SwsOpEntry *best = NULL;
+    const SwsOp *op = &ops->ops[0];
+    int ret, best_score = 0, best_cpu_flags = 0;
+    SwsOpPriv priv = {0};
+
+    for (int n = 0; n < num_tables; n++) {
+        const SwsOpTable *table = tables[n];
+        if ((table->block_size && table->block_size != block_size) ||
+            (table->cpu_flags & ~cpu_flags))
+            continue;
+
+        for (int i = 0; table->entries[i]; i++) {
+            const SwsOpEntry *entry = table->entries[i];
+            int score = op_match(op, entry, next->comps);
+            if (score > best_score) {
+                best_score = score;
+                best_cpu_flags = table->cpu_flags;
+                best = entry;
+            }
+        }
+    }
+
+    if (!best)
+        return AVERROR(ENOTSUP);
+
+    if (best->setup) {
+        ret = best->setup(op, &priv);
+        if (ret < 0)
+            return ret;
+    }
+
+    chain->cpu_flags |= best_cpu_flags;
+    ret = ff_sws_op_chain_append(chain, best->func, best->free, priv);
+    if (ret < 0) {
+        if (best->free)
+            best->free(&priv);
+        return ret;
+    }
+
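+    /* Pop the compiled op off the front of the list */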
+    ops->ops++;
+    ops->num_ops--;
+    return ops->num_ops ? AVERROR(EAGAIN) : 0;
+}
+
+#define q2pixel(type, q) ((q).den ? (type) (q).num / (q).den : 0)
+
+int ff_sws_setup_u8(const SwsOp *op, SwsOpPriv *out)
+{
+    out->u8[0] = op->c.u;
+    return 0;
+}
+
+int ff_sws_setup_u(const SwsOp *op, SwsOpPriv *out)
+{
+    switch (op->type) {
+    case SWS_PIXEL_U8:  out->u8[0]  = op->c.u; return 0;
+    case SWS_PIXEL_U16: out->u16[0] = op->c.u; return 0;
+    case SWS_PIXEL_U32: out->u32[0] = op->c.u; return 0;
+    case SWS_PIXEL_F32: out->f32[0] = op->c.u; return 0;
+    default: return AVERROR(EINVAL);
+    }
+}
+
+int ff_sws_setup_q(const SwsOp *op, SwsOpPriv *out)
+{
+    switch (op->type) {
+    case SWS_PIXEL_U8:  out->u8[0]  = q2pixel(uint8_t,  op->c.q); return 0;
+    case SWS_PIXEL_U16: out->u16[0] = q2pixel(uint16_t, op->c.q); return 0;
+    case SWS_PIXEL_U32: out->u32[0] = q2pixel(uint32_t, op->c.q); return 0;
+    case SWS_PIXEL_F32: out->f32[0] = q2pixel(float,    op->c.q); return 0;
+    default: return AVERROR(EINVAL);
+    }
+}
+
+int ff_sws_setup_q4(const SwsOp *op, SwsOpPriv *out)
+{
+    for (int i = 0; i < 4; i++) {
+        switch (op->type) {
+        case SWS_PIXEL_U8:  out->u8[i]  = q2pixel(uint8_t,  op->c.q4[i]); break;
+        case SWS_PIXEL_U16: out->u16[i] = q2pixel(uint16_t, op->c.q4[i]); break;
+        case SWS_PIXEL_U32: out->u32[i] = q2pixel(uint32_t, op->c.q4[i]); break;
+        case SWS_PIXEL_F32: out->f32[i] = q2pixel(float,    op->c.q4[i]); break;
+        default: return AVERROR(EINVAL);
+        }
+    }
+
+    return 0;
+}
diff --git a/libswscale/ops_chain.h b/libswscale/ops_chain.h
new file mode 100644
index 0000000000..777d8e13c1
--- /dev/null
+++ b/libswscale/ops_chain.h
@@ -0,0 +1,134 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_CHAIN_H
+#define SWSCALE_OPS_CHAIN_H
+
+#include "libavutil/cpu.h"
+
+#include "ops_internal.h"
+
+/**
+ * Helpers for SIMD implementations based on chained kernels, using a
+ * continuation passing style to link them together.
+ *
+ * The basic idea here is to "link" together a series of different operation
+ * kernels by constructing a list of kernel addresses into an SwsOpChain. Each
+ * kernel will load the address of the next kernel (the "continuation") from
+ * this struct, and jump directly into it; using an internal function signature
+ * that is an implementation detail of the specific backend.
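+ *
+ * As an example, the C reference backend (added in a later patch) chains
+ * kernels with an internal signature along the lines of:
+ *
+ *     void kernel(SwsOpIter *iter, const SwsOpImpl *impl,
+ *                 block_t x, block_t y, block_t z, block_t w);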
+ */
+
+/**
+ * Private data for each kernel.
+ */
+typedef union SwsOpPriv {
+    DECLARE_ALIGNED_16(char, data)[16];
+
+    /* Common types */
+    void *ptr;
+    uint8_t   u8[16];
+    uint16_t u16[8];
+    uint32_t u32[4];
+    float    f32[4];
+} SwsOpPriv;
+
+static_assert(sizeof(SwsOpPriv) == 16, "SwsOpPriv size mismatch");
+
+/* Setup helpers */
+int ff_sws_setup_u(const SwsOp *op, SwsOpPriv *out);
+int ff_sws_setup_u8(const SwsOp *op, SwsOpPriv *out);
+int ff_sws_setup_q(const SwsOp *op, SwsOpPriv *out);
+int ff_sws_setup_q4(const SwsOp *op, SwsOpPriv *out);
+
+/**
+ * Per-kernel execution context.
+ *
+ * Note: This struct is hard-coded in assembly, so do not change the layout.
+ */
+typedef void (*SwsFuncPtr)(void);
+typedef struct SwsOpImpl {
+    SwsFuncPtr cont; /* [offset =  0] Continuation for this operation. */
+    SwsOpPriv  priv; /* [offset = 16] Private data for this operation. */
+} SwsOpImpl;
+
+static_assert(sizeof(SwsOpImpl) == 32,         "SwsOpImpl layout mismatch");
+static_assert(offsetof(SwsOpImpl, priv) == 16, "SwsOpImpl layout mismatch");
+
+/**
+ * Compiled "chain" of operations, which can be dispatched efficiently.
+ * Effectively just a list of function pointers, alongside a small amount of
+ * private data for each operation.
+ */
+typedef struct SwsOpChain {
+#define SWS_MAX_OPS 16
+    SwsOpImpl impl[SWS_MAX_OPS + 1]; /* reserve extra space for the entrypoint */
+    void (*free[SWS_MAX_OPS + 1])(void *);
+    int num_impl;
+    int cpu_flags; /* set of all used CPU flags */
+} SwsOpChain;
+
+SwsOpChain *ff_sws_op_chain_alloc(void);
+void ff_sws_op_chain_free(SwsOpChain *chain);
+
+/* Returns 0 on success, or a negative error code. */
+int ff_sws_op_chain_append(SwsOpChain *chain, SwsFuncPtr func,
+                           void (*free)(void *), SwsOpPriv priv);
+
+typedef struct SwsOpEntry {
+    /* Kernel metadata; reduced size subset of SwsOp */
+    SwsOpType op;
+    SwsPixelType type;
+    bool flexible; /* if true, only the op, type and component masks are matched */
+    bool unused[4]; /* for kernels which operate on a subset of components */
+
+    union { /* extra data defining the operation, unless `flexible` is true */
+        SwsReadWriteOp rw;
+        SwsPackOp      pack;
+        SwsSwizzleOp   swizzle;
+        SwsConvertOp   convert;
+        uint32_t       linear_mask; /* subset of SwsLinearOp */
+        int            dither_size; /* subset of SwsDitherOp */
+        int            clear_value; /* clear value for integer clears */
+    };
+
+    /* Kernel implementation */
+    SwsFuncPtr func;
+    int (*setup)(const SwsOp *op, SwsOpPriv *out); /* optional */
+    void (*free)(void *priv);
+} SwsOpEntry;
+
+typedef struct SwsOpTable {
+    unsigned cpu_flags;   /* required CPU flags for this table */
+    int block_size;       /* fixed block size of this table */
+    const SwsOpEntry *entries[]; /* terminated by NULL */
+} SwsOpTable;
+
+/**
+ * "Compile" a single op by looking it up in a list of fixed size op tables.
+ * See `op_match` in `ops_chain.c` for details on how the matching works.
+ *
+ * Returns 0 once all ops have been compiled, AVERROR(EAGAIN) if ops remain,
+ * or another negative error code on failure.
+ */
+int ff_sws_op_compile_tables(const SwsOpTable *const tables[], int num_tables,
+                             SwsOpList *ops, const int block_size,
+                             SwsOpChain *chain);
+
+#endif /* SWSCALE_OPS_CHAIN_H */
-- 
2.49.0

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] [PATCH v3 12/17] swscale/ops_backend: add reference backend based on C templates
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (10 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 11/17] swscale/ops_chain: add internal abstraction for kernel linking Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 13/17] swscale/ops_memcpy: add 'memcpy' backend for plane->plane copies Niklas Haas
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This will serve as a reference for the SIMD backends to come. That said,
with auto-vectorization enabled, the performance of this is not atrocious.
It easily beats the old C code and sometimes even the old SIMD.

In theory, we can dramatically speed it up by using GCC vectors instead of
arrays, but the performance gains from this are too dependent on exact GCC
versions and flags, so it practice it's not a substitute for a SIMD
implementation.
---
 libswscale/Makefile          |   6 +
 libswscale/ops.c             |   3 +
 libswscale/ops_backend.c     | 105 ++++++
 libswscale/ops_backend.h     | 167 ++++++++++
 libswscale/ops_tmpl_common.c | 176 ++++++++++
 libswscale/ops_tmpl_float.c  | 257 +++++++++++++++
 libswscale/ops_tmpl_int.c    | 603 +++++++++++++++++++++++++++++++++++
 7 files changed, 1317 insertions(+)
 create mode 100644 libswscale/ops_backend.c
 create mode 100644 libswscale/ops_backend.h
 create mode 100644 libswscale/ops_tmpl_common.c
 create mode 100644 libswscale/ops_tmpl_float.c
 create mode 100644 libswscale/ops_tmpl_int.c

diff --git a/libswscale/Makefile b/libswscale/Makefile
index c9dfa78c89..6e5696c5a6 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -16,6 +16,7 @@ OBJS = alphablend.o                                     \
        input.o                                          \
        lut3d.o                                          \
        ops.o                                            \
+       ops_backend.o                                    \
        ops_chain.o                                      \
        ops_optimizer.o                                  \
        options.o                                        \
@@ -29,6 +30,11 @@ OBJS = alphablend.o                                     \
        yuv2rgb.o                                        \
        vscale.o                                         \
 
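+# The templated kernels intentionally leave unused block components
+# uninitialized, so suppress the corresponding warning for these objects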
+OPS-CFLAGS = -Wno-uninitialized \
+             -ffinite-math-only
+
+$(SUBDIR)ops_backend.o: CFLAGS += $(OPS-CFLAGS)
+
 # Objects duplicated from other libraries for shared builds
 SHLIBOBJS                    += log2_tab.o half2float.o
 
diff --git a/libswscale/ops.c b/libswscale/ops.c
index 60791ba875..fe7ea6a565 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -27,7 +27,10 @@
 #include "ops.h"
 #include "ops_internal.h"
 
+extern SwsOpBackend backend_c;
+
 const SwsOpBackend * const ff_sws_op_backends[] = {
+    &backend_c,
     NULL
 };
 
diff --git a/libswscale/ops_backend.c b/libswscale/ops_backend.c
new file mode 100644
index 0000000000..47ce992bb3
--- /dev/null
+++ b/libswscale/ops_backend.c
@@ -0,0 +1,105 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "ops_backend.h"
+
+/* Array-based reference implementation */
+
+#ifndef SWS_BLOCK_SIZE
+#  define SWS_BLOCK_SIZE 32
+#endif
+
+typedef  uint8_t  u8block_t[SWS_BLOCK_SIZE];
+typedef uint16_t u16block_t[SWS_BLOCK_SIZE];
+typedef uint32_t u32block_t[SWS_BLOCK_SIZE];
+typedef    float f32block_t[SWS_BLOCK_SIZE];
+
+#define BIT_DEPTH 8
+# include "ops_tmpl_int.c"
+#undef BIT_DEPTH
+
+#define BIT_DEPTH 16
+# include "ops_tmpl_int.c"
+#undef BIT_DEPTH
+
+#define BIT_DEPTH 32
+# include "ops_tmpl_int.c"
+# include "ops_tmpl_float.c"
+#undef BIT_DEPTH
+
+static void process(const SwsOpExec *exec, const void *priv,
+                    const int bx_start, const int y_start, int bx_end, int y_end)
+{
+    const SwsOpChain *chain = priv;
+    const SwsOpImpl *impl = chain->impl;
+    SwsOpIter iter;
+
+    for (iter.y = y_start; iter.y < y_end; iter.y++) {
+        for (int i = 0; i < 4; i++) {
+            iter.in[i]  = exec->in[i]  + (iter.y - y_start) * exec->in_stride[i];
+            iter.out[i] = exec->out[i] + (iter.y - y_start) * exec->out_stride[i];
+        }
+
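+        /* Run the chain once per block: impl[0].cont is the first kernel,
+         * which receives &impl[1] and chains onwards via each cont pointer */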
+        for (int block = bx_start; block < bx_end; block++) {
+            iter.x = block * SWS_BLOCK_SIZE;
+            ((void (*)(SwsOpIter *, const SwsOpImpl *)) impl->cont)
+                (&iter, &impl[1]);
+        }
+    }
+}
+
+static int compile(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out)
+{
+    int ret;
+
+    SwsOpChain *chain = ff_sws_op_chain_alloc();
+    if (!chain)
+        return AVERROR(ENOMEM);
+
+    static const SwsOpTable *const tables[] = {
+        &bitfn(op_table_int,    u8),
+        &bitfn(op_table_int,   u16),
+        &bitfn(op_table_int,   u32),
+        &bitfn(op_table_float, f32),
+    };
+
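+    /* Compile one op at a time from the front of the list; EAGAIN indicates
+     * that more ops remain */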
+    do {
+        ret = ff_sws_op_compile_tables(tables, FF_ARRAY_ELEMS(tables), ops,
+                                       SWS_BLOCK_SIZE, chain);
+    } while (ret == AVERROR(EAGAIN));
+    if (ret < 0) {
+        ff_sws_op_chain_free(chain);
+        return ret;
+    }
+
+    *out = (SwsCompiledOp) {
+        .func       = process,
+        .block_size = SWS_BLOCK_SIZE,
+        .cpu_flags  = chain->cpu_flags,
+        .priv       = chain,
+        .free       = (void (*)(void *)) ff_sws_op_chain_free,
+    };
+    return 0;
+}
+
+SwsOpBackend backend_c = {
+    .name       = "c",
+    .compile    = compile,
+};
diff --git a/libswscale/ops_backend.h b/libswscale/ops_backend.h
new file mode 100644
index 0000000000..7880bb608e
--- /dev/null
+++ b/libswscale/ops_backend.h
@@ -0,0 +1,167 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#ifndef SWSCALE_OPS_BACKEND_H
+#define SWSCALE_OPS_BACKEND_H
+
+/**
+ * Helper macros for the C-based backend.
+ *
+ * To use these macros, the following must be defined:
+ *  - PIXEL_TYPE: one of SWS_PIXEL_*
+ *  - pixel_t: the C type of a single pixel
+ *  - block_t: the C type of a block (group of pixels)
+ *  - FMT_CHAR: a single character identifying the type family (u or f)
+ */
+
+#include <assert.h>
+#include <float.h>
+#include <stdint.h>
+
+#include "libavutil/attributes.h"
+#include "libavutil/mem.h"
+
+#include "ops_chain.h"
+
+/**
+ * Internal context holding per-iter execution data. The data pointers will be
+ * directly incremented by the corresponding read/write functions.
+ */
+typedef struct SwsOpIter {
+    const uint8_t *in[4];
+    uint8_t *out[4];
+    int x, y;
+} SwsOpIter;
+
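+/* Compiler-specific hints to get the per-block loops auto-vectorized:
+ * SWS_FUNC enables tree-vectorization on GCC, and SWS_LOOP asserts that the
+ * loop iterations are independent, so the vectorizer need not prove it */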
+#ifdef __clang__
+#  define SWS_FUNC
+#  define SWS_LOOP AV_PRAGMA(clang loop vectorize(assume_safety))
+#elif defined(__GNUC__)
+#  define SWS_FUNC __attribute__((optimize("tree-vectorize")))
+#  define SWS_LOOP AV_PRAGMA(GCC ivdep)
+#else
+#  define SWS_FUNC
+#  define SWS_LOOP
+#endif
+
+/* Miscellaneous helpers */
+#define bitfn2(name, ext) name ## _ ## ext
+#define bitfn(name, ext)  bitfn2(name, ext)
+
+#define FN_SUFFIX AV_JOIN(FMT_CHAR, BIT_DEPTH)
+#define fn(name)  bitfn(name, FN_SUFFIX)
+
+#define av_q2pixel(q) ((q).den ? (pixel_t) (q).num / (q).den : 0)
+
+/* Helper macros to make writing common function signatures less painful */
+#define DECL_FUNC(NAME, ...)                                                    \
+    static av_always_inline void fn(NAME)(SwsOpIter *restrict iter,             \
+                                          const SwsOpImpl *restrict impl,       \
+                                          block_t x, block_t y,                 \
+                                          block_t z, block_t w,                 \
+                                          __VA_ARGS__)
+
+#define DECL_READ(NAME, ...)                                                    \
+    static av_always_inline void fn(NAME)(SwsOpIter *restrict iter,             \
+                                          const SwsOpImpl *restrict impl,       \
+                                          const pixel_t *restrict in0,          \
+                                          const pixel_t *restrict in1,          \
+                                          const pixel_t *restrict in2,          \
+                                          const pixel_t *restrict in3,          \
+                                          __VA_ARGS__)
+
+#define DECL_WRITE(NAME, ...)                                                   \
+    DECL_FUNC(NAME, pixel_t *restrict out0, pixel_t *restrict out1,             \
+                    pixel_t *restrict out2, pixel_t *restrict out3,             \
+                    __VA_ARGS__)
+
+/* Helper macros to call into functions declared with DECL_FUNC_* */
+#define CALL(FUNC, ...) \
+    fn(FUNC)(iter, impl, x, y, z, w, __VA_ARGS__)
+
+#define CALL_READ(FUNC, ...)                                                    \
+    fn(FUNC)(iter, impl, (const pixel_t *) iter->in[0],                         \
+                         (const pixel_t *) iter->in[1],                         \
+                         (const pixel_t *) iter->in[2],                         \
+                         (const pixel_t *) iter->in[3], __VA_ARGS__)
+
+#define CALL_WRITE(FUNC, ...)                                                   \
+    CALL(FUNC, (pixel_t *) iter->out[0], (pixel_t *) iter->out[1],              \
+               (pixel_t *) iter->out[2], (pixel_t *) iter->out[3], __VA_ARGS__)
+
+/* Helper macros to declare continuation functions */
+#define DECL_IMPL(NAME)                                                         \
+    static SWS_FUNC void fn(NAME)(SwsOpIter *restrict iter,                     \
+                                  const SwsOpImpl *restrict impl,               \
+                                  block_t x, block_t y,                         \
+                                  block_t z, block_t w)                         \
+
+/* Helper macro to call into the next continuation with a given type */
+#define CONTINUE(TYPE, ...)                                                     \
+    ((void (*)(SwsOpIter *, const SwsOpImpl *,                                  \
+               TYPE x, TYPE y, TYPE z, TYPE w)) impl->cont)                     \
+        (iter, &impl[1], __VA_ARGS__)
+
+/* Helper macros for common op setup code */
+#define DECL_SETUP(NAME)                                                        \
+    static int fn(NAME)(const SwsOp *op, SwsOpPriv *out)
+
+#define SETUP_MEMDUP(c) ff_setup_memdup(&(c), sizeof(c), out)
+static inline int ff_setup_memdup(const void *c, size_t size, SwsOpPriv *out)
+{
+    out->ptr = av_memdup(c, size);
+    return out->ptr ? 0 : AVERROR(ENOMEM);
+}
+
+/* Helper macro for declaring op table entries */
+#define DECL_ENTRY(NAME, ...)                                                   \
+    static const SwsOpEntry fn(op_##NAME) = {                                   \
+        .func = (SwsFuncPtr) fn(NAME),                                          \
+        .type = PIXEL_TYPE,                                                     \
+        __VA_ARGS__                                                             \
+    }
+
+/* Helpers to define functions for common subsets of components */
+#define DECL_PATTERN(NAME) \
+    DECL_FUNC(NAME, const bool X, const bool Y, const bool Z, const bool W)
+
+#define WRAP_PATTERN(FUNC, X, Y, Z, W, ...)                                     \
+    DECL_IMPL(FUNC##_##X##Y##Z##W)                                              \
+    {                                                                           \
+        CALL(FUNC, X, Y, Z, W);                                                 \
+    }                                                                           \
+                                                                                \
+    DECL_ENTRY(FUNC##_##X##Y##Z##W,                                             \
+        .unused = { !X, !Y, !Z, !W },                                           \
+        __VA_ARGS__                                                             \
+    )
+
+#define WRAP_COMMON_PATTERNS(FUNC, ...)                                         \
+    WRAP_PATTERN(FUNC, 1, 0, 0, 0, __VA_ARGS__);                                \
+    WRAP_PATTERN(FUNC, 1, 0, 0, 1, __VA_ARGS__);                                \
+    WRAP_PATTERN(FUNC, 1, 1, 1, 0, __VA_ARGS__);                                \
+    WRAP_PATTERN(FUNC, 1, 1, 1, 1, __VA_ARGS__)
+
+#define REF_COMMON_PATTERNS(NAME)                                               \
+    &fn(op_##NAME##_1000),                                                      \
+    &fn(op_##NAME##_1001),                                                      \
+    &fn(op_##NAME##_1110),                                                      \
+    &fn(op_##NAME##_1111)
+
+#endif /* SWSCALE_OPS_BACKEND_H */
diff --git a/libswscale/ops_tmpl_common.c b/libswscale/ops_tmpl_common.c
new file mode 100644
index 0000000000..5490f6f755
--- /dev/null
+++ b/libswscale/ops_tmpl_common.c
@@ -0,0 +1,176 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "ops_backend.h"
+
+#ifndef BIT_DEPTH
+#  error Should only be included from ops_tmpl_*.c!
+#endif
+
+#define WRAP_CONVERT_UINT(N)                                                    \
+DECL_PATTERN(convert_uint##N)                                                   \
+{                                                                               \
+    u##N##block_t xu, yu, zu, wu;                                               \
+                                                                                \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {                                  \
+        if (X)                                                                  \
+            xu[i] = x[i];                                                       \
+        if (Y)                                                                  \
+            yu[i] = y[i];                                                       \
+        if (Z)                                                                  \
+            zu[i] = z[i];                                                       \
+        if (W)                                                                  \
+            wu[i] = w[i];                                                       \
+    }                                                                           \
+                                                                                \
+    CONTINUE(u##N##block_t, xu, yu, zu, wu);                                    \
+}                                                                               \
+                                                                                \
+WRAP_COMMON_PATTERNS(convert_uint##N,                                           \
+    .op = SWS_OP_CONVERT,                                                       \
+    .convert.to = SWS_PIXEL_U##N,                                               \
+);
+
+#if BIT_DEPTH != 8
+WRAP_CONVERT_UINT(8)
+#endif
+
+#if BIT_DEPTH != 16
+WRAP_CONVERT_UINT(16)
+#endif
+
+#if BIT_DEPTH != 32 || defined(IS_FLOAT)
+WRAP_CONVERT_UINT(32)
+#endif
+
+DECL_FUNC(clear, const bool X, const bool Y, const bool Z, const bool W)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (!X)
+            x[i] = impl->priv.px[0];
+        if (!Y)
+            y[i] = impl->priv.px[1];
+        if (!Z)
+            z[i] = impl->priv.px[2];
+        if (!W)
+            w[i] = impl->priv.px[3];
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+#define WRAP_CLEAR(X, Y, Z, W)                                                  \
+DECL_IMPL(clear##_##X##Y##Z##W)                                                 \
+{                                                                               \
+    CALL(clear, X, Y, Z, W);                                                    \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(clear##_##X##Y##Z##W,                                                \
+    .setup = ff_sws_setup_q4,                                                   \
+    .op = SWS_OP_CLEAR,                                                         \
+    .flexible = true,                                                           \
+    .unused = { !X, !Y, !Z, !W },                                               \
+);
+
+WRAP_CLEAR(1, 1, 1, 0) /* rgba alpha */
+WRAP_CLEAR(0, 1, 1, 1) /* argb alpha */
+
+WRAP_CLEAR(0, 0, 1, 1) /* vuya chroma */
+WRAP_CLEAR(1, 0, 0, 1) /* yuva chroma */
+WRAP_CLEAR(1, 1, 0, 0) /* ayuv chroma */
+WRAP_CLEAR(0, 1, 0, 1) /* uyva chroma */
+WRAP_CLEAR(1, 0, 1, 0) /* xvyu chroma */
+
+WRAP_CLEAR(1, 0, 0, 0) /* gray -> yuva */
+WRAP_CLEAR(0, 1, 0, 0) /* gray -> ayuv */
+WRAP_CLEAR(0, 0, 1, 0) /* gray -> vuya */
+
+DECL_PATTERN(min)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] = FFMIN(x[i], impl->priv.px[0]);
+        if (Y)
+            y[i] = FFMIN(y[i], impl->priv.px[1]);
+        if (Z)
+            z[i] = FFMIN(z[i], impl->priv.px[2]);
+        if (W)
+            w[i] = FFMIN(w[i], impl->priv.px[3]);
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_PATTERN(max)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] = FFMAX(x[i], impl->priv.px[0]);
+        if (Y)
+            y[i] = FFMAX(y[i], impl->priv.px[1]);
+        if (Z)
+            z[i] = FFMAX(z[i], impl->priv.px[2]);
+        if (W)
+            w[i] = FFMAX(w[i], impl->priv.px[3]);
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(min,
+    .op = SWS_OP_MIN,
+    .setup = ff_sws_setup_q4,
+    .flexible = true,
+);
+
+WRAP_COMMON_PATTERNS(max,
+    .op = SWS_OP_MAX,
+    .setup = ff_sws_setup_q4,
+    .flexible = true,
+);
+
+DECL_PATTERN(scale)
+{
+    const pixel_t scale = impl->priv.px[0];
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] *= scale;
+        if (Y)
+            y[i] *= scale;
+        if (Z)
+            z[i] *= scale;
+        if (W)
+            w[i] *= scale;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(scale,
+    .op = SWS_OP_SCALE,
+    .setup = ff_sws_setup_q,
+    .flexible = true,
+);
diff --git a/libswscale/ops_tmpl_float.c b/libswscale/ops_tmpl_float.c
new file mode 100644
index 0000000000..16fb3197bb
--- /dev/null
+++ b/libswscale/ops_tmpl_float.c
@@ -0,0 +1,257 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+
+#include "ops_backend.h"
+
+#ifndef BIT_DEPTH
+#  define BIT_DEPTH 32
+#endif
+
+#if BIT_DEPTH == 32
+#  define PIXEL_TYPE SWS_PIXEL_F32
+#  define PIXEL_MAX  FLT_MAX
+#  define PIXEL_MIN  FLT_MIN
+#  define pixel_t    float
+#  define block_t    f32block_t
+#  define px         f32
+#else
+#  error Invalid BIT_DEPTH
+#endif
+
+#define IS_FLOAT 1
+#define FMT_CHAR f
+#include "ops_tmpl_common.c"
+
+DECL_SETUP(setup_dither)
+{
+    const int size = 1 << op->dither.size_log2;
+    if (!size) {
+        /* We special case this value */
+        av_assert1(!av_cmp_q(op->dither.matrix[0], av_make_q(1, 2)));
+        out->ptr = NULL;
+        return 0;
+    }
+
+    const int width = FFMAX(size, SWS_BLOCK_SIZE);
+    pixel_t *matrix = out->ptr = av_malloc(sizeof(pixel_t) * size * width);
+    if (!matrix)
+        return AVERROR(ENOMEM);
+
+    for (int y = 0; y < size; y++) {
+        for (int x = 0; x < size; x++)
+            matrix[y * width + x] = av_q2pixel(op->dither.matrix[y * size + x]);
+        for (int x = size; x < width; x++) /* pad to block size */
+            matrix[y * width + x] = matrix[y * width + (x % size)];
+    }
+
+    return 0;
+}
+
+DECL_FUNC(dither, const int size_log2)
+{
+    const pixel_t *restrict matrix = impl->priv.ptr;
+    const int mask = (1 << size_log2) - 1;
+    const int y_line = iter->y;
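+    /* Each component samples the dither matrix at a different fixed row
+     * offset, decorrelating the dither pattern between components */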
+    const int row0 = (y_line +  0) & mask;
+    const int row1 = (y_line +  3) & mask;
+    const int row2 = (y_line +  2) & mask;
+    const int row3 = (y_line +  5) & mask;
+    const int size = 1 << size_log2;
+    const int width = FFMAX(size, SWS_BLOCK_SIZE);
+    const int base = iter->x & ~(SWS_BLOCK_SIZE - 1) & (size - 1);
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] += size_log2 ? matrix[row0 * width + base + i] : (pixel_t) 0.5;
+        y[i] += size_log2 ? matrix[row1 * width + base + i] : (pixel_t) 0.5;
+        z[i] += size_log2 ? matrix[row2 * width + base + i] : (pixel_t) 0.5;
+        w[i] += size_log2 ? matrix[row3 * width + base + i] : (pixel_t) 0.5;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+#define WRAP_DITHER(N)                                                          \
+DECL_IMPL(dither##N)                                                            \
+{                                                                               \
+    CALL(dither, N);                                                            \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(dither##N,                                                           \
+    .op = SWS_OP_DITHER,                                                        \
+    .dither_size = N,                                                           \
+    .setup = fn(setup_dither),                                                  \
+    .free = av_free,                                                            \
+);
+
+WRAP_DITHER(0)
+WRAP_DITHER(1)
+WRAP_DITHER(2)
+WRAP_DITHER(3)
+WRAP_DITHER(4)
+WRAP_DITHER(5)
+WRAP_DITHER(6)
+WRAP_DITHER(7)
+WRAP_DITHER(8)
+
+typedef struct {
+    /* Stored in split form for convenience */
+    pixel_t m[4][4];
+    pixel_t k[4];
+} fn(LinCoeffs);
+
+DECL_SETUP(setup_linear)
+{
+    fn(LinCoeffs) c;
+
+    for (int i = 0; i < 4; i++) {
+        for (int j = 0; j < 4; j++)
+            c.m[i][j] = av_q2pixel(op->lin.m[i][j]);
+        c.k[i] = av_q2pixel(op->lin.m[i][4]);
+    }
+
+    return SETUP_MEMDUP(c);
+}
+
+/**
+ * Fully general case for a 5x5 linear affine transformation. Should never be
+ * called without constant `mask`. This function will compile down to the
+ * appropriately optimized version for the required subset of operations when
+ * called with a constant mask.
+ */
+DECL_FUNC(linear_mask, const uint32_t mask)
+{
+    const fn(LinCoeffs) c = *(const fn(LinCoeffs) *) impl->priv.ptr;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        const pixel_t xx = x[i];
+        const pixel_t yy = y[i];
+        const pixel_t zz = z[i];
+        const pixel_t ww = w[i];
+
+        x[i]  = (mask & SWS_MASK_OFF(0)) ? c.k[0] : 0;
+        x[i] += (mask & SWS_MASK(0, 0))  ? c.m[0][0] * xx : xx;
+        x[i] += (mask & SWS_MASK(0, 1))  ? c.m[0][1] * yy : 0;
+        x[i] += (mask & SWS_MASK(0, 2))  ? c.m[0][2] * zz : 0;
+        x[i] += (mask & SWS_MASK(0, 3))  ? c.m[0][3] * ww : 0;
+
+        y[i]  = (mask & SWS_MASK_OFF(1)) ? c.k[1] : 0;
+        y[i] += (mask & SWS_MASK(1, 0))  ? c.m[1][0] * xx : 0;
+        y[i] += (mask & SWS_MASK(1, 1))  ? c.m[1][1] * yy : yy;
+        y[i] += (mask & SWS_MASK(1, 2))  ? c.m[1][2] * zz : 0;
+        y[i] += (mask & SWS_MASK(1, 3))  ? c.m[1][3] * ww : 0;
+
+        z[i]  = (mask & SWS_MASK_OFF(2)) ? c.k[2] : 0;
+        z[i] += (mask & SWS_MASK(2, 0))  ? c.m[2][0] * xx : 0;
+        z[i] += (mask & SWS_MASK(2, 1))  ? c.m[2][1] * yy : 0;
+        z[i] += (mask & SWS_MASK(2, 2))  ? c.m[2][2] * zz : zz;
+        z[i] += (mask & SWS_MASK(2, 3))  ? c.m[2][3] * ww : 0;
+
+        w[i]  = (mask & SWS_MASK_OFF(3)) ? c.k[3] : 0;
+        w[i] += (mask & SWS_MASK(3, 0))  ? c.m[3][0] * xx : 0;
+        w[i] += (mask & SWS_MASK(3, 1))  ? c.m[3][1] * yy : 0;
+        w[i] += (mask & SWS_MASK(3, 2))  ? c.m[3][2] * zz : 0;
+        w[i] += (mask & SWS_MASK(3, 3))  ? c.m[3][3] * ww : ww;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+#define WRAP_LINEAR(NAME, MASK)                                                 \
+DECL_IMPL(linear_##NAME)                                                        \
+{                                                                               \
+    CALL(linear_mask, MASK);                                                    \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(linear_##NAME,                                                       \
+    .op    = SWS_OP_LINEAR,                                                     \
+    .setup = fn(setup_linear),                                                  \
+    .free  = av_free,                                                           \
+    .linear_mask = (MASK),                                                      \
+);
+
+WRAP_LINEAR(luma,      SWS_MASK_LUMA)
+WRAP_LINEAR(alpha,     SWS_MASK_ALPHA)
+WRAP_LINEAR(lumalpha,  SWS_MASK_LUMA | SWS_MASK_ALPHA)
+WRAP_LINEAR(dot3,      0b111)
+WRAP_LINEAR(row0,      SWS_MASK_ROW(0))
+WRAP_LINEAR(row0a,     SWS_MASK_ROW(0) | SWS_MASK_ALPHA)
+WRAP_LINEAR(diag3,     SWS_MASK_DIAG3)
+WRAP_LINEAR(diag4,     SWS_MASK_DIAG4)
+WRAP_LINEAR(diagoff3,  SWS_MASK_DIAG3 | SWS_MASK_OFF3)
+WRAP_LINEAR(matrix3,   SWS_MASK_MAT3)
+WRAP_LINEAR(affine3,   SWS_MASK_MAT3 | SWS_MASK_OFF3)
+WRAP_LINEAR(affine3a,  SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA)
+WRAP_LINEAR(matrix4,   SWS_MASK_MAT4)
+WRAP_LINEAR(affine4,   SWS_MASK_MAT4 | SWS_MASK_OFF4)
+
+static const SwsOpTable fn(op_table_float) = {
+    .block_size = SWS_BLOCK_SIZE,
+    .entries = {
+        REF_COMMON_PATTERNS(convert_uint8),
+        REF_COMMON_PATTERNS(convert_uint16),
+        REF_COMMON_PATTERNS(convert_uint32),
+
+        &fn(op_clear_1110),
+        REF_COMMON_PATTERNS(min),
+        REF_COMMON_PATTERNS(max),
+        REF_COMMON_PATTERNS(scale),
+
+        &fn(op_dither0),
+        &fn(op_dither1),
+        &fn(op_dither2),
+        &fn(op_dither3),
+        &fn(op_dither4),
+        &fn(op_dither5),
+        &fn(op_dither6),
+        &fn(op_dither7),
+        &fn(op_dither8),
+
+        &fn(op_linear_luma),
+        &fn(op_linear_alpha),
+        &fn(op_linear_lumalpha),
+        &fn(op_linear_dot3),
+        &fn(op_linear_row0),
+        &fn(op_linear_row0a),
+        &fn(op_linear_diag3),
+        &fn(op_linear_diag4),
+        &fn(op_linear_diagoff3),
+        &fn(op_linear_matrix3),
+        &fn(op_linear_affine3),
+        &fn(op_linear_affine3a),
+        &fn(op_linear_matrix4),
+        &fn(op_linear_affine4),
+
+        NULL
+    },
+};
+
+#undef PIXEL_TYPE
+#undef PIXEL_MAX
+#undef PIXEL_MIN
+#undef pixel_t
+#undef block_t
+#undef px
+
+#undef FMT_CHAR
+#undef IS_FLOAT
diff --git a/libswscale/ops_tmpl_int.c b/libswscale/ops_tmpl_int.c
new file mode 100644
index 0000000000..b8fa184975
--- /dev/null
+++ b/libswscale/ops_tmpl_int.c
@@ -0,0 +1,603 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+#include "libavutil/bswap.h"
+
+#include "ops_backend.h"
+
+#ifndef BIT_DEPTH
+#  define BIT_DEPTH 8
+#endif
+
+#if BIT_DEPTH == 32
+#  define PIXEL_TYPE SWS_PIXEL_U32
+#  define PIXEL_MAX  0xFFFFFFFFu
+#  define SWAP_BYTES av_bswap32
+#  define pixel_t    uint32_t
+#  define block_t    u32block_t
+#  define px         u32
+#elif BIT_DEPTH == 16
+#  define PIXEL_TYPE SWS_PIXEL_U16
+#  define PIXEL_MAX  0xFFFFu
+#  define SWAP_BYTES av_bswap16
+#  define pixel_t    uint16_t
+#  define block_t    u16block_t
+#  define px         u16
+#elif BIT_DEPTH == 8
+#  define PIXEL_TYPE SWS_PIXEL_U8
+#  define PIXEL_MAX  0xFFu
+#  define pixel_t    uint8_t
+#  define block_t    u8block_t
+#  define px         u8
+#else
+#  error Invalid BIT_DEPTH
+#endif
+
+#define IS_FLOAT  0
+#define FMT_CHAR  u
+#define PIXEL_MIN 0
+#include "ops_tmpl_common.c"
+
+DECL_READ(read_planar, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] = in0[i];
+        if (elems > 1)
+            y[i] = in1[i];
+        if (elems > 2)
+            z[i] = in2[i];
+        if (elems > 3)
+            w[i] = in3[i];
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_READ(read_packed, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] = in0[elems * i + 0];
+        if (elems > 1)
+            y[i] = in0[elems * i + 1];
+        if (elems > 2)
+            z[i] = in0[elems * i + 2];
+        if (elems > 3)
+            w[i] = in0[elems * i + 3];
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_WRITE(write_planar, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        out0[i] = x[i];
+        if (elems > 1)
+            out1[i] = y[i];
+        if (elems > 2)
+            out2[i] = z[i];
+        if (elems > 3)
+            out3[i] = w[i];
+    }
+}
+
+DECL_WRITE(write_packed, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        out0[elems * i + 0] = x[i];
+        if (elems > 1)
+            out0[elems * i + 1] = y[i];
+        if (elems > 2)
+            out0[elems * i + 2] = z[i];
+        if (elems > 3)
+            out0[elems * i + 3] = w[i];
+    }
+}
+
+#define WRAP_READ(FUNC, ELEMS, FRAC, PACKED)                                    \
+DECL_IMPL(FUNC##ELEMS)                                                          \
+{                                                                               \
+    CALL_READ(FUNC, ELEMS);                                                     \
+    for (int i = 0; i < (PACKED ? 1 : ELEMS); i++)                              \
+        iter->in[i] += sizeof(block_t) * (PACKED ? ELEMS : 1) >> FRAC;          \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(FUNC##ELEMS,                                                         \
+    .op = SWS_OP_READ,                                                          \
+    .rw = {                                                                     \
+        .elems  = ELEMS,                                                        \
+        .packed = PACKED,                                                       \
+        .frac   = FRAC,                                                         \
+    },                                                                          \
+);
+
+WRAP_READ(read_planar, 1, 0, false)
+WRAP_READ(read_planar, 2, 0, false)
+WRAP_READ(read_planar, 3, 0, false)
+WRAP_READ(read_planar, 4, 0, false)
+WRAP_READ(read_packed, 2, 0, true)
+WRAP_READ(read_packed, 3, 0, true)
+WRAP_READ(read_packed, 4, 0, true)
+
+#define WRAP_WRITE(FUNC, ELEMS, FRAC, PACKED)                                   \
+DECL_IMPL(FUNC##ELEMS)                                                          \
+{                                                                               \
+    CALL_WRITE(FUNC, ELEMS);                                                    \
+    for (int i = 0; i < (PACKED ? 1 : ELEMS); i++)                              \
+        iter->out[i] += sizeof(block_t) * (PACKED ? ELEMS : 1) >> FRAC;         \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(FUNC##ELEMS,                                                         \
+    .op = SWS_OP_WRITE,                                                         \
+    .rw = {                                                                     \
+        .elems  = ELEMS,                                                        \
+        .packed = PACKED,                                                       \
+        .frac   = FRAC,                                                         \
+    },                                                                          \
+);
+
+WRAP_WRITE(write_planar, 1, 0, false)
+WRAP_WRITE(write_planar, 2, 0, false)
+WRAP_WRITE(write_planar, 3, 0, false)
+WRAP_WRITE(write_planar, 4, 0, false)
+WRAP_WRITE(write_packed, 2, 0, true)
+WRAP_WRITE(write_packed, 3, 0, true)
+WRAP_WRITE(write_packed, 4, 0, true)
+
+#if BIT_DEPTH == 8
+DECL_READ(read_nibbles, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 2) {
+        const pixel_t val = ((const pixel_t *) in0)[i >> 1];
+        x[i + 0] = val >> 4;  /* high nibble */
+        x[i + 1] = val & 0xF; /* low nibble */
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_READ(read_bits, const int elems)
+{
+    block_t x, y, z, w;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 8) {
+        const pixel_t val = ((const pixel_t *) in0)[i >> 3];
+        x[i + 0] = (val >> 7) & 1;
+        x[i + 1] = (val >> 6) & 1;
+        x[i + 2] = (val >> 5) & 1;
+        x[i + 3] = (val >> 4) & 1;
+        x[i + 4] = (val >> 3) & 1;
+        x[i + 5] = (val >> 2) & 1;
+        x[i + 6] = (val >> 1) & 1;
+        x[i + 7] = (val >> 0) & 1;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_READ(read_nibbles, 1, 1, false)
+WRAP_READ(read_bits,    1, 3, false)
+
+DECL_WRITE(write_nibbles, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 2)
+        out0[i >> 1] = x[i] << 4 | x[i + 1];
+}
+
+DECL_WRITE(write_bits, const int elems)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i += 8) {
+        out0[i >> 3] = x[i + 0] << 7 |
+                       x[i + 1] << 6 |
+                       x[i + 2] << 5 |
+                       x[i + 3] << 4 |
+                       x[i + 4] << 3 |
+                       x[i + 5] << 2 |
+                       x[i + 6] << 1 |
+                       x[i + 7];
+    }
+}
+
+WRAP_WRITE(write_nibbles, 1, 1, false)
+WRAP_WRITE(write_bits,    1, 3, false)
+#endif /* BIT_DEPTH == 8 */
+
+#ifdef SWAP_BYTES
+DECL_PATTERN(swap_bytes)
+{
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x[i] = SWAP_BYTES(x[i]);
+        if (Y)
+            y[i] = SWAP_BYTES(y[i]);
+        if (Z)
+            z[i] = SWAP_BYTES(z[i]);
+        if (W)
+            w[i] = SWAP_BYTES(w[i]);
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(swap_bytes, .op = SWS_OP_SWAP_BYTES);
+#endif /* SWAP_BYTES */
+
+#if BIT_DEPTH == 8
+DECL_PATTERN(expand16)
+{
+    u16block_t x16, y16, z16, w16;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        if (X)
+            x16[i] = x[i] << 8 | x[i];
+        if (Y)
+            y16[i] = y[i] << 8 | y[i];
+        if (Z)
+            z16[i] = z[i] << 8 | z[i];
+        if (W)
+            w16[i] = w[i] << 8 | w[i];
+    }
+
+    CONTINUE(u16block_t, x16, y16, z16, w16);
+}
+
+WRAP_COMMON_PATTERNS(expand16,
+    .op = SWS_OP_CONVERT,
+    .convert.to = SWS_PIXEL_U16,
+    .convert.expand = true,
+);
+
+DECL_PATTERN(expand32)
+{
+    u32block_t x32, y32, z32, w32;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x32[i] = x[i] << 24 | x[i] << 16 | x[i] << 8 | x[i];
+        y32[i] = y[i] << 24 | y[i] << 16 | y[i] << 8 | y[i];
+        z32[i] = z[i] << 24 | z[i] << 16 | z[i] << 8 | z[i];
+        w32[i] = w[i] << 24 | w[i] << 16 | w[i] << 8 | w[i];
+    }
+
+    CONTINUE(u32block_t, x32, y32, z32, w32);
+}
+
+WRAP_COMMON_PATTERNS(expand32,
+    .op = SWS_OP_CONVERT,
+    .convert.to = SWS_PIXEL_U32,
+    .convert.expand = true,
+);
+#endif
+
+#define WRAP_PACK_UNPACK(X, Y, Z, W)                                            \
+inline DECL_IMPL(pack_##X##Y##Z##W)                                             \
+{                                                                               \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {                                  \
+        x[i] = x[i] << (Y+Z+W);                                                 \
+        if (Y)                                                                  \
+            x[i] |= y[i] << (Z+W);                                              \
+        if (Z)                                                                  \
+            x[i] |= z[i] << W;                                                  \
+        if (W)                                                                  \
+            x[i] |= w[i];                                                       \
+    }                                                                           \
+                                                                                \
+    CONTINUE(block_t, x, y, z, w);                                              \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(pack_##X##Y##Z##W,                                                   \
+    .op = SWS_OP_PACK,                                                          \
+    .pack.pattern = { X, Y, Z, W },                                             \
+);                                                                              \
+                                                                                \
+inline DECL_IMPL(unpack_##X##Y##Z##W)                                           \
+{                                                                               \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {                                  \
+        const pixel_t val = x[i];                                               \
+        x[i] = val >> (Y+Z+W);                                                  \
+        if (Y)                                                                  \
+            y[i] = (val >> (Z+W)) & ((1 << Y) - 1);                             \
+        if (Z)                                                                  \
+            z[i] = (val >> W) & ((1 << Z) - 1);                                 \
+        if (W)                                                                  \
+            w[i] = val & ((1 << W) - 1);                                        \
+    }                                                                           \
+                                                                                \
+    CONTINUE(block_t, x, y, z, w);                                              \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(unpack_##X##Y##Z##W,                                                 \
+    .op = SWS_OP_UNPACK,                                                        \
+    .pack.pattern = { X, Y, Z, W },                                             \
+);
+
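+/* Bit patterns for common packed formats, e.g. { 5, 6, 5, 0 } matches
+ * RGB565-style packing and { 2, 10, 10, 10 } matches X2RGB10-style packing */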
+WRAP_PACK_UNPACK( 3,  3,  2,  0)
+WRAP_PACK_UNPACK( 2,  3,  3,  0)
+WRAP_PACK_UNPACK( 1,  2,  1,  0)
+WRAP_PACK_UNPACK( 5,  6,  5,  0)
+WRAP_PACK_UNPACK( 5,  5,  5,  0)
+WRAP_PACK_UNPACK( 4,  4,  4,  0)
+WRAP_PACK_UNPACK( 2, 10, 10, 10)
+WRAP_PACK_UNPACK(10, 10, 10,  2)
+
+#if BIT_DEPTH != 8
+DECL_PATTERN(lshift)
+{
+    const uint8_t amount = impl->priv.u8[0];
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] <<= amount;
+        y[i] <<= amount;
+        z[i] <<= amount;
+        w[i] <<= amount;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+DECL_PATTERN(rshift)
+{
+    const uint8_t amount = impl->priv.u8[0];
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        x[i] >>= amount;
+        y[i] >>= amount;
+        z[i] >>= amount;
+        w[i] >>= amount;
+    }
+
+    CONTINUE(block_t, x, y, z, w);
+}
+
+WRAP_COMMON_PATTERNS(lshift,
+    .op       = SWS_OP_LSHIFT,
+    .setup    = ff_sws_setup_u8,
+    .flexible = true,
+);
+
+WRAP_COMMON_PATTERNS(rshift,
+    .op       = SWS_OP_RSHIFT,
+    .setup    = ff_sws_setup_u8,
+    .flexible = true,
+);
+#endif /* BIT_DEPTH != 8 */
+
+DECL_PATTERN(convert_float)
+{
+    f32block_t xf, yf, zf, wf;
+
+    SWS_LOOP
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++) {
+        xf[i] = x[i];
+        yf[i] = y[i];
+        zf[i] = z[i];
+        wf[i] = w[i];
+    }
+
+    CONTINUE(f32block_t, xf, yf, zf, wf);
+}
+
+WRAP_COMMON_PATTERNS(convert_float,
+    .op = SWS_OP_CONVERT,
+    .convert.to = SWS_PIXEL_F32,
+);
+
+/**
+ * Swizzle by directly swapping the order of arguments to the continuation.
+ * Note that this is only safe to do if no arguments are duplicated.
+ */
+#define DECL_SWIZZLE(X, Y, Z, W)                                                \
+static SWS_FUNC void                                                            \
+fn(swizzle_##X##Y##Z##W)(SwsOpIter *restrict iter,                              \
+                         const SwsOpImpl *restrict impl,                        \
+                         block_t c0, block_t c1, block_t c2, block_t c3)        \
+{                                                                               \
+    CONTINUE(block_t, c##X, c##Y, c##Z, c##W);                                  \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(swizzle_##X##Y##Z##W,                                                \
+    .op = SWS_OP_SWIZZLE,                                                       \
+    .swizzle = SWS_SWIZZLE(X, Y, Z, W),                                         \
+);
+
+DECL_SWIZZLE(3, 0, 1, 2)
+DECL_SWIZZLE(3, 0, 2, 1)
+DECL_SWIZZLE(2, 1, 0, 3)
+DECL_SWIZZLE(3, 2, 1, 0)
+DECL_SWIZZLE(3, 1, 0, 2)
+DECL_SWIZZLE(3, 2, 0, 1)
+DECL_SWIZZLE(1, 2, 0, 3)
+DECL_SWIZZLE(1, 0, 2, 3)
+DECL_SWIZZLE(2, 0, 1, 3)
+DECL_SWIZZLE(2, 3, 1, 0)
+DECL_SWIZZLE(2, 1, 3, 0)
+DECL_SWIZZLE(1, 2, 3, 0)
+DECL_SWIZZLE(1, 3, 2, 0)
+DECL_SWIZZLE(0, 2, 1, 3)
+DECL_SWIZZLE(0, 2, 3, 1)
+DECL_SWIZZLE(0, 3, 1, 2)
+DECL_SWIZZLE(3, 1, 2, 0)
+DECL_SWIZZLE(0, 3, 2, 1)
+
+/* Broadcast luma -> rgb (only used for y(a) -> rgb(a)) */
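+/* e.g. expand_luma_03 maps (y, _, _, a) to (y, y, y, a), copying c0 into
+ * the two unused middle channels */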
+#define DECL_EXPAND_LUMA(X, W, T0, T1)                                          \
+static SWS_FUNC void                                                            \
+fn(expand_luma_##X##W)(SwsOpIter *restrict iter,                                \
+                       const SwsOpImpl *restrict impl,                          \
+                       block_t c0, block_t c1,  block_t c2, block_t c3)         \
+{                                                                               \
+    SWS_LOOP                                                                    \
+    for (int i = 0; i < SWS_BLOCK_SIZE; i++)                                    \
+        T0[i] = T1[i] = c0[i];                                                  \
+                                                                                \
+    CONTINUE(block_t, c##X, T0, T1, c##W);                                      \
+}                                                                               \
+                                                                                \
+DECL_ENTRY(expand_luma_##X##W,                                                  \
+    .op = SWS_OP_SWIZZLE,                                                       \
+    .swizzle = SWS_SWIZZLE(X, 0, 0, W),                                         \
+);
+
+DECL_EXPAND_LUMA(0, 3, c1, c2)
+DECL_EXPAND_LUMA(3, 0, c1, c2)
+DECL_EXPAND_LUMA(1, 0, c2, c3)
+DECL_EXPAND_LUMA(0, 1, c2, c3)
+
+static const SwsOpTable fn(op_table_int) = {
+    .block_size = SWS_BLOCK_SIZE,
+    .entries = {
+        &fn(op_read_planar1),
+        &fn(op_read_planar2),
+        &fn(op_read_planar3),
+        &fn(op_read_planar4),
+        &fn(op_read_packed2),
+        &fn(op_read_packed3),
+        &fn(op_read_packed4),
+
+        &fn(op_write_planar1),
+        &fn(op_write_planar2),
+        &fn(op_write_planar3),
+        &fn(op_write_planar4),
+        &fn(op_write_packed2),
+        &fn(op_write_packed3),
+        &fn(op_write_packed4),
+
+#if BIT_DEPTH == 8
+        &fn(op_read_bits1),
+        &fn(op_read_nibbles1),
+        &fn(op_write_bits1),
+        &fn(op_write_nibbles1),
+
+        &fn(op_pack_1210),
+        &fn(op_pack_2330),
+        &fn(op_pack_3320),
+
+        &fn(op_unpack_1210),
+        &fn(op_unpack_2330),
+        &fn(op_unpack_3320),
+
+        REF_COMMON_PATTERNS(expand16),
+        REF_COMMON_PATTERNS(expand32),
+#elif BIT_DEPTH == 16
+        &fn(op_pack_4440),
+        &fn(op_pack_5550),
+        &fn(op_pack_5650),
+        &fn(op_unpack_4440),
+        &fn(op_unpack_5550),
+        &fn(op_unpack_5650),
+#elif BIT_DEPTH == 32
+        &fn(op_pack_2101010),
+        &fn(op_pack_1010102),
+        &fn(op_unpack_2101010),
+        &fn(op_unpack_1010102),
+#endif
+
+#ifdef SWAP_BYTES
+        REF_COMMON_PATTERNS(swap_bytes),
+#endif
+
+        REF_COMMON_PATTERNS(min),
+        REF_COMMON_PATTERNS(max),
+        REF_COMMON_PATTERNS(scale),
+        REF_COMMON_PATTERNS(convert_float),
+
+        &fn(op_clear_1110),
+        &fn(op_clear_0111),
+        &fn(op_clear_0011),
+        &fn(op_clear_1001),
+        &fn(op_clear_1100),
+        &fn(op_clear_0101),
+        &fn(op_clear_1010),
+        &fn(op_clear_1000),
+        &fn(op_clear_0100),
+        &fn(op_clear_0010),
+
+        &fn(op_swizzle_3012),
+        &fn(op_swizzle_3021),
+        &fn(op_swizzle_2103),
+        &fn(op_swizzle_3210),
+        &fn(op_swizzle_3102),
+        &fn(op_swizzle_3201),
+        &fn(op_swizzle_1203),
+        &fn(op_swizzle_1023),
+        &fn(op_swizzle_2013),
+        &fn(op_swizzle_2310),
+        &fn(op_swizzle_2130),
+        &fn(op_swizzle_1230),
+        &fn(op_swizzle_1320),
+        &fn(op_swizzle_0213),
+        &fn(op_swizzle_0231),
+        &fn(op_swizzle_0312),
+        &fn(op_swizzle_3120),
+        &fn(op_swizzle_0321),
+
+        &fn(op_expand_luma_03),
+        &fn(op_expand_luma_30),
+        &fn(op_expand_luma_10),
+        &fn(op_expand_luma_01),
+
+#if BIT_DEPTH != 8
+        REF_COMMON_PATTERNS(lshift),
+        REF_COMMON_PATTERNS(rshift),
+        REF_COMMON_PATTERNS(convert_uint8),
+#endif /* BIT_DEPTH != 8 */
+
+#if BIT_DEPTH != 16
+        REF_COMMON_PATTERNS(convert_uint16),
+#endif
+#if BIT_DEPTH != 32
+        REF_COMMON_PATTERNS(convert_uint32),
+#endif
+
+        NULL
+    },
+};
+
+#undef PIXEL_TYPE
+#undef PIXEL_MAX
+#undef PIXEL_MIN
+#undef SWAP_BYTES
+#undef pixel_t
+#undef block_t
+#undef px
+
+#undef FMT_CHAR
+#undef IS_FLOAT
-- 
2.49.0


* [FFmpeg-devel] [PATCH v3 13/17] swscale/ops_memcpy: add 'memcpy' backend for plane->plane copies
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (11 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 12/17] swscale/ops_backend: add reference backend based on C templates Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 14/17] swscale/x86: add SIMD backend Niklas Haas
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Provides a generic fast path for any operation list that can be decomposed
into a series of memcpy and memset operations.

25% faster than the x86 backend for yuv444p -> yuva444p
33% faster than the x86 backend for gray -> yuvj444p
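
For illustration (a hypothetical op list, not taken from an actual trace),
a pure plane shuffle such as yuv444p -> yvu444p decomposes as

    SWS_OP_READ    (3 planar elements)
    SWS_OP_SWIZZLE (0, 2, 1, 3)
    SWS_OP_WRITE   (3 planar elements)

which compile() reduces to p.index = { 0, 2, 1 }, i.e. one whole-plane
memcpy per output plane in process().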
---
 libswscale/Makefile     |   1 +
 libswscale/ops.c        |   2 +
 libswscale/ops_memcpy.c | 132 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 135 insertions(+)
 create mode 100644 libswscale/ops_memcpy.c

diff --git a/libswscale/Makefile b/libswscale/Makefile
index 6e5696c5a6..136d33f6bc 100644
--- a/libswscale/Makefile
+++ b/libswscale/Makefile
@@ -18,6 +18,7 @@ OBJS = alphablend.o                                     \
        ops.o                                            \
        ops_backend.o                                    \
        ops_chain.o                                      \
+       ops_memcpy.o                                     \
        ops_optimizer.o                                  \
        options.o                                        \
        output.o                                         \
diff --git a/libswscale/ops.c b/libswscale/ops.c
index fe7ea6a565..c7bdbd305c 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -28,8 +28,10 @@
 #include "ops_internal.h"
 
 extern SwsOpBackend backend_c;
+extern SwsOpBackend backend_murder;
 
 const SwsOpBackend * const ff_sws_op_backends[] = {
+    &backend_murder,
     &backend_c,
     NULL
 };
diff --git a/libswscale/ops_memcpy.c b/libswscale/ops_memcpy.c
new file mode 100644
index 0000000000..ef4784faa4
--- /dev/null
+++ b/libswscale/ops_memcpy.c
@@ -0,0 +1,132 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include "libavutil/avassert.h"
+
+#include "ops_backend.h"
+
+typedef struct MemcpyPriv {
+    int num_planes;
+    int index[4]; /* or -1 to clear plane */
+    uint8_t clear_value[4];
+} MemcpyPriv;
+
+/* Memcpy backend for trivial cases */
+
+static void process(const SwsOpExec *exec, const void *priv,
+                    int x_start, int y_start, int x_end, int y_end)
+{
+    const MemcpyPriv *p = priv;
+    const int lines = y_end - y_start;
+    av_assert1(x_start == 0 && x_end == exec->width);
+
+    for (int i = 0; i < p->num_planes; i++) {
+        uint8_t *out = exec->out[i];
+        const int idx = p->index[i];
+        if (idx < 0) {
+            memset(out, p->clear_value[i], exec->out_stride[i] * lines);
+        } else if (exec->out_stride[i] == exec->in_stride[idx]) {
+            memcpy(out, exec->in[idx], exec->out_stride[i] * lines);
+        } else {
+            const int bytes = x_end * exec->block_size_out;
+            const uint8_t *in = exec->in[idx];
+            for (int y = y_start; y < y_end; y++) {
+                memcpy(out, in, bytes);
+                out += exec->out_stride[i];
+                in  += exec->in_stride[idx];
+            }
+        }
+    }
+}
+
+static int compile(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out)
+{
+    MemcpyPriv p = {0};
+
+    for (int n = 0; n < ops->num_ops; n++) {
+        const SwsOp *op = &ops->ops[n];
+        switch (op->op) {
+        case SWS_OP_READ:
+            if ((op->rw.packed && op->rw.elems != 1) || op->rw.frac)
+                return AVERROR(ENOTSUP);
+            for (int i = 0; i < op->rw.elems; i++)
+                p.index[i] = i;
+            break;
+
+        case SWS_OP_SWIZZLE: {
+            const MemcpyPriv orig = p;
+            for (int i = 0; i < 4; i++) {
+                /* Explicitly exclude swizzle masks that contain duplicates,
+                 * because these are wasteful to implement as a memcpy */
+                for (int j = 0; j < i; j++) {
+                    if (op->swizzle.in[i] == op->swizzle.in[j])
+                        return AVERROR(ENOTSUP);
+                }
+                p.index[i] = orig.index[op->swizzle.in[i]];
+            }
+            break;
+        }
+
+        case SWS_OP_CLEAR:
+            for (int i = 0; i < 4; i++) {
+                if (!op->c.q4[i].den)
+                    continue;
+                if (op->c.q4[i].den != 1)
+                    return AVERROR(ENOTSUP);
+
+                /* Ensure all bytes to be cleared are the same, because we
+                 * can't memset on multi-byte sequences */
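+                /* (e.g. a U16 clear to 0x4040 passes; 0x1234 is rejected) */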
+                uint8_t val = op->c.q4[i].num & 0xFF;
+                uint32_t ref = val;
+                switch (ff_sws_pixel_type_size(op->type)) {
+                case 2: ref *= 0x101; break;
+                case 4: ref *= 0x1010101; break;
+                }
+                if (ref != op->c.q4[i].num)
+                    return AVERROR(ENOTSUP);
+                p.clear_value[i] = val;
+                p.index[i] = -1;
+            }
+            break;
+
+        case SWS_OP_WRITE:
+            if ((op->rw.packed && op->rw.elems != 1) || op->rw.frac)
+                return AVERROR(ENOTSUP);
+            p.num_planes = op->rw.elems;
+            break;
+
+        default:
+            return AVERROR(ENOTSUP);
+        }
+    }
+
+    *out = (SwsCompiledOp) {
+        .block_size = 1,
+        .func = process,
+        .priv = av_memdup(&p, sizeof(p)),
+        .free = av_free,
+    };
+    return out->priv ? 0 : AVERROR(ENOMEM);
+}
+
+SwsOpBackend backend_murder = {
+    .name    = "memcpy",
+    .compile = compile,
+};
-- 
2.49.0


* [FFmpeg-devel] [PATCH v3 14/17] swscale/x86: add SIMD backend
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (12 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 13/17] swscale/ops_memcpy: add 'memcpy' backend for plane->plane copies Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-30  2:23   ` Michael Niedermayer
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 15/17] tests/checkasm: add checkasm tests for swscale ops Niklas Haas
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
floating point operations. While this is not yet 100% coverage, it's good
enough for the vast majority of formats out there.

Of special note is the packed shuffle fast path, which uses pshufb at vector
sizes up to AVX512.
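
As a sketch of that fast path (hypothetical mask; assuming the pshufb
convention where a mask byte with the high bit set zeroes the output byte),
packing rgb24 -> rgb0 reads 12 and writes 16 bytes per 128-bit lane using

    { 0, 1, 2, 0x80,  3, 4, 5, 0x80,  6, 7, 8, 0x80,  9, 10, 11, 0x80 }

Because the per-lane read size is below 16 bytes, such a mask cannot cross
lanes and stays on XMM, mapping to the ff_packed_shuffle12_16_sse4 kernel.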
---
 libswscale/ops.c              |    4 +
 libswscale/x86/Makefile       |    3 +
 libswscale/x86/ops.c          |  722 +++++++++++++++++++++++
 libswscale/x86/ops_common.asm |  305 ++++++++++
 libswscale/x86/ops_float.asm  |  389 ++++++++++++
 libswscale/x86/ops_int.asm    | 1049 +++++++++++++++++++++++++++++++++
 6 files changed, 2472 insertions(+)
 create mode 100644 libswscale/x86/ops.c
 create mode 100644 libswscale/x86/ops_common.asm
 create mode 100644 libswscale/x86/ops_float.asm
 create mode 100644 libswscale/x86/ops_int.asm

diff --git a/libswscale/ops.c b/libswscale/ops.c
index c7bdbd305c..60e8bc6ed6 100644
--- a/libswscale/ops.c
+++ b/libswscale/ops.c
@@ -29,9 +29,13 @@
 
 extern SwsOpBackend backend_c;
 extern SwsOpBackend backend_murder;
+extern SwsOpBackend backend_x86;
 
 const SwsOpBackend * const ff_sws_op_backends[] = {
     &backend_murder,
+#if ARCH_X86
+    &backend_x86,
+#endif
     &backend_c,
     NULL
 };
diff --git a/libswscale/x86/Makefile b/libswscale/x86/Makefile
index f00154941d..a04bc8336f 100644
--- a/libswscale/x86/Makefile
+++ b/libswscale/x86/Makefile
@@ -10,6 +10,9 @@ OBJS-$(CONFIG_XMM_CLOBBER_TEST) += x86/w64xmmtest.o
 
 X86ASM-OBJS                     += x86/input.o                          \
                                    x86/output.o                         \
+                                   x86/ops_int.o                        \
+                                   x86/ops_float.o                      \
+                                   x86/ops.o                            \
                                    x86/scale.o                          \
                                    x86/scale_avx2.o                          \
                                    x86/range_convert.o                  \
diff --git a/libswscale/x86/ops.c b/libswscale/x86/ops.c
new file mode 100644
index 0000000000..918ac6f0f8
--- /dev/null
+++ b/libswscale/x86/ops.c
@@ -0,0 +1,722 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2.1 of the License, or (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with FFmpeg; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ */
+
+#include <float.h>
+
+#include <libavutil/avassert.h>
+#include <libavutil/mem.h>
+
+#include "../ops_chain.h"
+
+#define DECL_ENTRY(TYPE, NAME, ...)                                             \
+    static const SwsOpEntry op_##NAME = {                                       \
+        .type = SWS_PIXEL_##TYPE,                                               \
+        __VA_ARGS__                                                             \
+    }
+
+#define DECL_ASM(TYPE, NAME, ...)                                               \
+    void ff_##NAME(void);                                                       \
+    DECL_ENTRY(TYPE, NAME,                                                      \
+        .func = ff_##NAME,                                                      \
+        __VA_ARGS__)
+
+#define DECL_PATTERN(TYPE, NAME, X, Y, Z, W, ...)                               \
+    DECL_ASM(TYPE, p##X##Y##Z##W##_##NAME,                                      \
+        .unused = { !X, !Y, !Z, !W },                                           \
+        __VA_ARGS__                                                             \
+    )
+
+#define REF_PATTERN(NAME, X, Y, Z, W)                                           \
+    &op_p##X##Y##Z##W##_##NAME
+
+#define DECL_COMMON_PATTERNS(TYPE, NAME, ...)                                   \
+    DECL_PATTERN(TYPE, NAME, 1, 0, 0, 0, __VA_ARGS__);                          \
+    DECL_PATTERN(TYPE, NAME, 1, 0, 0, 1, __VA_ARGS__);                          \
+    DECL_PATTERN(TYPE, NAME, 1, 1, 1, 0, __VA_ARGS__);                          \
+    DECL_PATTERN(TYPE, NAME, 1, 1, 1, 1, __VA_ARGS__)                           \
+
+#define REF_COMMON_PATTERNS(NAME)                                               \
+    REF_PATTERN(NAME, 1, 0, 0, 0),                                              \
+    REF_PATTERN(NAME, 1, 0, 0, 1),                                              \
+    REF_PATTERN(NAME, 1, 1, 1, 0),                                              \
+    REF_PATTERN(NAME, 1, 1, 1, 1)
+
+#define DECL_RW(EXT, TYPE, NAME, OP, ELEMS, PACKED, FRAC)                       \
+    DECL_ASM(TYPE, NAME##ELEMS##EXT,                                            \
+        .op = SWS_OP_##OP,                                                      \
+        .rw = { .elems = ELEMS, .packed = PACKED, .frac = FRAC },               \
+    );
+
+#define DECL_PACKED_RW(EXT, DEPTH)                                              \
+    DECL_RW(EXT, U##DEPTH, read##DEPTH##_packed,  READ,  2, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, read##DEPTH##_packed,  READ,  3, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, read##DEPTH##_packed,  READ,  4, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, write##DEPTH##_packed, WRITE, 2, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, write##DEPTH##_packed, WRITE, 3, true,  0)           \
+    DECL_RW(EXT, U##DEPTH, write##DEPTH##_packed, WRITE, 4, true,  0)           \
+
+#define DECL_PACK_UNPACK(EXT, TYPE, X, Y, Z, W)                                 \
+    DECL_ASM(TYPE, pack_##X##Y##Z##W##EXT,                                      \
+        .op = SWS_OP_PACK,                                                      \
+        .pack.pattern = {X, Y, Z, W},                                           \
+    );                                                                          \
+                                                                                \
+    DECL_ASM(TYPE, unpack_##X##Y##Z##W##EXT,                                    \
+        .op = SWS_OP_UNPACK,                                                    \
+        .pack.pattern = {X, Y, Z, W},                                           \
+    );                                                                          \
+
+static int setup_swap_bytes(const SwsOp *op, SwsOpPriv *out)
+{
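+    /* Build a per-byte shuffle mask that reverses the bytes within each
+     * element; e.g. for U16 (mask = 1) this yields 1, 0, 3, 2, 5, 4, ... */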
+    const int mask = ff_sws_pixel_type_size(op->type) - 1;
+    for (int i = 0; i < 16; i++)
+        out->u8[i] = (i & ~mask) | (mask - (i & mask));
+    return 0;
+}
+
+#define DECL_SWAP_BYTES(EXT, TYPE, X, Y, Z, W)                                  \
+    DECL_ENTRY(TYPE, p##X##Y##Z##W##_swap_bytes_##TYPE##EXT,                    \
+        .op = SWS_OP_SWAP_BYTES,                                                \
+        .unused = { !X, !Y, !Z, !W },                                           \
+        .func = ff_p##X##Y##Z##W##_shuffle##EXT,                                \
+        .setup = setup_swap_bytes,                                              \
+    );
+
+#define DECL_CLEAR_ALPHA(EXT, IDX)                                              \
+    DECL_ASM(U8, clear_alpha##IDX##EXT,                                         \
+        .op = SWS_OP_CLEAR,                                                     \
+        .clear_value = -1,                                                      \
+        .unused[IDX] = true,                                                    \
+    );                                                                          \
+
+#define DECL_CLEAR_ZERO(EXT, IDX)                                               \
+    DECL_ASM(U8, clear_zero##IDX##EXT,                                          \
+        .op = SWS_OP_CLEAR,                                                     \
+        .clear_value = 0,                                                       \
+        .unused[IDX] = true,                                                    \
+    );
+
+static int setup_clear(const SwsOp *op, SwsOpPriv *out)
+{
+    for (int i = 0; i < 4; i++)
+        out->u32[i] = (uint32_t) op->c.q4[i].num;
+    return 0;
+}
+
+#define DECL_CLEAR(EXT, X, Y, Z, W)                                             \
+    DECL_PATTERN(U8, clear##EXT, X, Y, Z, W,                                    \
+        .op = SWS_OP_CLEAR,                                                     \
+        .setup = setup_clear,                                                   \
+        .flexible = true,                                                       \
+    );
+
+#define DECL_SWIZZLE(EXT, X, Y, Z, W)                                           \
+    DECL_ASM(U8, swizzle_##X##Y##Z##W##EXT,                                     \
+        .op = SWS_OP_SWIZZLE,                                                   \
+        .swizzle = SWS_SWIZZLE( X, Y, Z, W ),                                   \
+    );
+
+#define DECL_CONVERT(EXT, FROM, TO)                                             \
+    DECL_COMMON_PATTERNS(FROM, convert_##FROM##_##TO##EXT,                      \
+        .op = SWS_OP_CONVERT,                                                   \
+        .convert.to = SWS_PIXEL_##TO,                                           \
+    );
+
+#define DECL_EXPAND(EXT, FROM, TO)                                              \
+    DECL_COMMON_PATTERNS(FROM, expand_##FROM##_##TO##EXT,                       \
+        .op = SWS_OP_CONVERT,                                                   \
+        .convert.to = SWS_PIXEL_##TO,                                           \
+        .convert.expand = true,                                                 \
+    );
+
+static int setup_shift(const SwsOp *op, SwsOpPriv *out)
+{
+    out->u16[0] = op->c.u;
+    return 0;
+}
+
+#define DECL_SHIFT16(EXT)                                                       \
+    DECL_COMMON_PATTERNS(U16, lshift16##EXT,                                    \
+        .op = SWS_OP_LSHIFT,                                                    \
+        .setup = setup_shift,                                                   \
+        .flexible = true,                                                       \
+    );                                                                          \
+                                                                                \
+    DECL_COMMON_PATTERNS(U16, rshift16##EXT,                                    \
+        .op = SWS_OP_RSHIFT,                                                    \
+        .setup = setup_shift,                                                   \
+        .flexible = true,                                                       \
+    );
+
+#define DECL_MIN_MAX(EXT)                                                       \
+    DECL_COMMON_PATTERNS(F32, min##EXT,                                         \
+        .op = SWS_OP_MIN,                                                       \
+        .setup = ff_sws_setup_q4,                                               \
+        .flexible = true,                                                       \
+    );                                                                          \
+                                                                                \
+    DECL_COMMON_PATTERNS(F32, max##EXT,                                         \
+        .op = SWS_OP_MAX,                                                       \
+        .setup = ff_sws_setup_q4,                                               \
+        .flexible = true,                                                       \
+    );
+
+#define DECL_SCALE(EXT)                                                         \
+    DECL_COMMON_PATTERNS(F32, scale##EXT,                                       \
+        .op = SWS_OP_SCALE,                                                     \
+        .setup = ff_sws_setup_q,                                                \
+    );
+
+/* A 2x2 matrix fits inside SwsOpPriv directly; this saves an indirection */
+static_assert(sizeof(SwsOpPriv) >= sizeof(float[2][2]), "2x2 dither matrix too large");
+static int setup_dither(const SwsOp *op, SwsOpPriv *out)
+{
+    const int size = 1 << op->dither.size_log2;
+    float *matrix = out->f32;
+    if (size > 2) {
+        matrix = out->ptr = av_mallocz(size * size * sizeof(*matrix));
+        if (!matrix)
+            return AVERROR(ENOMEM);
+    }
+
+    for (int i = 0; i < size * size; i++)
+        matrix[i] = (float) op->dither.matrix[i].num / op->dither.matrix[i].den;
+
+    return 0;
+}
+
+#define DECL_DITHER(EXT, SIZE)                                                  \
+    DECL_COMMON_PATTERNS(F32, dither##SIZE##EXT,                                \
+        .op    = SWS_OP_DITHER,                                                 \
+        .setup = setup_dither,                                                  \
+        .free  = SIZE > 1 ? av_free : NULL,                                     \
+        .dither_size = SIZE,                                                    \
+    );
+
+static int setup_linear(const SwsOp *op, SwsOpPriv *out)
+{
+    float *matrix = out->ptr = av_mallocz(sizeof(float[4][5]));
+    if (!matrix)
+        return AVERROR(ENOMEM);
+
+    for (int y = 0; y < 4; y++) {
+        for (int x = 0; x < 5; x++)
+            matrix[y * 5 + x] = (float) op->lin.m[y][x].num / op->lin.m[y][x].den;
+    }
+
+    return 0;
+}
+
+#define DECL_LINEAR(EXT, NAME, MASK)                                            \
+    DECL_ASM(F32, NAME##EXT,                                                    \
+        .op    = SWS_OP_LINEAR,                                                 \
+        .setup = setup_linear,                                                  \
+        .free  = av_free,                                                       \
+        .linear_mask = (MASK),                                                  \
+    );
+
+#define DECL_FUNCS_8(SIZE, EXT, FLAG)                                           \
+    DECL_RW(EXT, U8, read_planar,   READ,  1, false, 0)                         \
+    DECL_RW(EXT, U8, read_planar,   READ,  2, false, 0)                         \
+    DECL_RW(EXT, U8, read_planar,   READ,  3, false, 0)                         \
+    DECL_RW(EXT, U8, read_planar,   READ,  4, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 1, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 2, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 3, false, 0)                         \
+    DECL_RW(EXT, U8, write_planar,  WRITE, 4, false, 0)                         \
+    DECL_RW(EXT, U8, read_nibbles,  READ,  1, false, 1)                         \
+    DECL_RW(EXT, U8, read_bits,     READ,  1, false, 3)                         \
+    DECL_RW(EXT, U8, write_bits,    WRITE, 1, false, 3)                         \
+    DECL_PACKED_RW(EXT, 8)                                                      \
+    DECL_PACK_UNPACK(EXT, U8, 1, 2, 1, 0)                                       \
+    DECL_PACK_UNPACK(EXT, U8, 3, 3, 2, 0)                                       \
+    DECL_PACK_UNPACK(EXT, U8, 2, 3, 3, 0)                                       \
+    void ff_p1000_shuffle##EXT(void);                                           \
+    void ff_p1001_shuffle##EXT(void);                                           \
+    void ff_p1110_shuffle##EXT(void);                                           \
+    void ff_p1111_shuffle##EXT(void);                                           \
+    DECL_SWIZZLE(EXT, 3, 0, 1, 2)                                               \
+    DECL_SWIZZLE(EXT, 3, 0, 2, 1)                                               \
+    DECL_SWIZZLE(EXT, 2, 1, 0, 3)                                               \
+    DECL_SWIZZLE(EXT, 3, 2, 1, 0)                                               \
+    DECL_SWIZZLE(EXT, 3, 1, 0, 2)                                               \
+    DECL_SWIZZLE(EXT, 3, 2, 0, 1)                                               \
+    DECL_SWIZZLE(EXT, 1, 2, 0, 3)                                               \
+    DECL_SWIZZLE(EXT, 1, 0, 2, 3)                                               \
+    DECL_SWIZZLE(EXT, 2, 0, 1, 3)                                               \
+    DECL_SWIZZLE(EXT, 2, 3, 1, 0)                                               \
+    DECL_SWIZZLE(EXT, 2, 1, 3, 0)                                               \
+    DECL_SWIZZLE(EXT, 1, 2, 3, 0)                                               \
+    DECL_SWIZZLE(EXT, 1, 3, 2, 0)                                               \
+    DECL_SWIZZLE(EXT, 0, 2, 1, 3)                                               \
+    DECL_SWIZZLE(EXT, 0, 2, 3, 1)                                               \
+    DECL_SWIZZLE(EXT, 0, 3, 1, 2)                                               \
+    DECL_SWIZZLE(EXT, 3, 1, 2, 0)                                               \
+    DECL_SWIZZLE(EXT, 0, 3, 2, 1)                                               \
+    DECL_SWIZZLE(EXT, 0, 0, 0, 3)                                               \
+    DECL_SWIZZLE(EXT, 3, 0, 0, 0)                                               \
+    DECL_SWIZZLE(EXT, 0, 0, 0, 1)                                               \
+    DECL_SWIZZLE(EXT, 1, 0, 0, 0)                                               \
+    DECL_CLEAR_ALPHA(EXT, 0)                                                    \
+    DECL_CLEAR_ALPHA(EXT, 1)                                                    \
+    DECL_CLEAR_ALPHA(EXT, 3)                                                    \
+    DECL_CLEAR_ZERO(EXT, 0)                                                     \
+    DECL_CLEAR_ZERO(EXT, 1)                                                     \
+    DECL_CLEAR_ZERO(EXT, 3)                                                     \
+    DECL_CLEAR(EXT, 1, 1, 1, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 1, 1, 1)                                                 \
+    DECL_CLEAR(EXT, 0, 0, 1, 1)                                                 \
+    DECL_CLEAR(EXT, 1, 0, 0, 1)                                                 \
+    DECL_CLEAR(EXT, 1, 1, 0, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 1, 0, 1)                                                 \
+    DECL_CLEAR(EXT, 1, 0, 1, 0)                                                 \
+    DECL_CLEAR(EXT, 1, 0, 0, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 1, 0, 0)                                                 \
+    DECL_CLEAR(EXT, 0, 0, 1, 0)                                                 \
+                                                                                \
+static const SwsOpTable ops8##EXT = {                                           \
+    .cpu_flags = AV_CPU_FLAG_##FLAG,                                            \
+    .block_size = SIZE,                                                         \
+    .entries = {                                                                \
+        &op_read_planar1##EXT,                                                  \
+        &op_read_planar2##EXT,                                                  \
+        &op_read_planar3##EXT,                                                  \
+        &op_read_planar4##EXT,                                                  \
+        &op_write_planar1##EXT,                                                 \
+        &op_write_planar2##EXT,                                                 \
+        &op_write_planar3##EXT,                                                 \
+        &op_write_planar4##EXT,                                                 \
+        &op_read8_packed2##EXT,                                                 \
+        &op_read8_packed3##EXT,                                                 \
+        &op_read8_packed4##EXT,                                                 \
+        &op_write8_packed2##EXT,                                                \
+        &op_write8_packed3##EXT,                                                \
+        &op_write8_packed4##EXT,                                                \
+        &op_read_nibbles1##EXT,                                                 \
+        &op_read_bits1##EXT,                                                    \
+        &op_write_bits1##EXT,                                                   \
+        &op_pack_1210##EXT,                                                     \
+        &op_pack_3320##EXT,                                                     \
+        &op_pack_2330##EXT,                                                     \
+        &op_unpack_1210##EXT,                                                   \
+        &op_unpack_3320##EXT,                                                   \
+        &op_unpack_2330##EXT,                                                   \
+        &op_swizzle_3012##EXT,                                                  \
+        &op_swizzle_3021##EXT,                                                  \
+        &op_swizzle_2103##EXT,                                                  \
+        &op_swizzle_3210##EXT,                                                  \
+        &op_swizzle_3102##EXT,                                                  \
+        &op_swizzle_3201##EXT,                                                  \
+        &op_swizzle_1203##EXT,                                                  \
+        &op_swizzle_1023##EXT,                                                  \
+        &op_swizzle_2013##EXT,                                                  \
+        &op_swizzle_2310##EXT,                                                  \
+        &op_swizzle_2130##EXT,                                                  \
+        &op_swizzle_1230##EXT,                                                  \
+        &op_swizzle_1320##EXT,                                                  \
+        &op_swizzle_0213##EXT,                                                  \
+        &op_swizzle_0231##EXT,                                                  \
+        &op_swizzle_0312##EXT,                                                  \
+        &op_swizzle_3120##EXT,                                                  \
+        &op_swizzle_0321##EXT,                                                  \
+        &op_swizzle_0003##EXT,                                                  \
+        &op_swizzle_0001##EXT,                                                  \
+        &op_swizzle_3000##EXT,                                                  \
+        &op_swizzle_1000##EXT,                                                  \
+        &op_clear_alpha0##EXT,                                                  \
+        &op_clear_alpha1##EXT,                                                  \
+        &op_clear_alpha3##EXT,                                                  \
+        &op_clear_zero0##EXT,                                                   \
+        &op_clear_zero1##EXT,                                                   \
+        &op_clear_zero3##EXT,                                                   \
+        REF_PATTERN(clear##EXT, 1, 1, 1, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 1, 1, 1),                                    \
+        REF_PATTERN(clear##EXT, 0, 0, 1, 1),                                    \
+        REF_PATTERN(clear##EXT, 1, 0, 0, 1),                                    \
+        REF_PATTERN(clear##EXT, 1, 1, 0, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 1, 0, 1),                                    \
+        REF_PATTERN(clear##EXT, 1, 0, 1, 0),                                    \
+        REF_PATTERN(clear##EXT, 1, 0, 0, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 1, 0, 0),                                    \
+        REF_PATTERN(clear##EXT, 0, 0, 1, 0),                                    \
+        NULL                                                                    \
+    },                                                                          \
+};
+
+#define DECL_FUNCS_16(SIZE, EXT, FLAG)                                          \
+    DECL_PACKED_RW(EXT, 16)                                                     \
+    DECL_PACK_UNPACK(EXT, U16, 4, 4, 4, 0)                                      \
+    DECL_PACK_UNPACK(EXT, U16, 5, 5, 5, 0)                                      \
+    DECL_PACK_UNPACK(EXT, U16, 5, 6, 5, 0)                                      \
+    DECL_SWAP_BYTES(EXT, U16, 1, 0, 0, 0)                                       \
+    DECL_SWAP_BYTES(EXT, U16, 1, 0, 0, 1)                                       \
+    DECL_SWAP_BYTES(EXT, U16, 1, 1, 1, 0)                                       \
+    DECL_SWAP_BYTES(EXT, U16, 1, 1, 1, 1)                                       \
+    DECL_SHIFT16(EXT)                                                           \
+    DECL_CONVERT(EXT,  U8, U16)                                                 \
+    DECL_CONVERT(EXT, U16,  U8)                                                 \
+    DECL_EXPAND(EXT,   U8, U16)                                                 \
+                                                                                \
+static const SwsOpTable ops16##EXT = {                                          \
+    .cpu_flags = AV_CPU_FLAG_##FLAG,                                            \
+    .block_size = SIZE,                                                         \
+    .entries = {                                                                \
+        &op_read16_packed2##EXT,                                                \
+        &op_read16_packed3##EXT,                                                \
+        &op_read16_packed4##EXT,                                                \
+        &op_write16_packed2##EXT,                                               \
+        &op_write16_packed3##EXT,                                               \
+        &op_write16_packed4##EXT,                                               \
+        &op_pack_4440##EXT,                                                     \
+        &op_pack_5550##EXT,                                                     \
+        &op_pack_5650##EXT,                                                     \
+        &op_unpack_4440##EXT,                                                   \
+        &op_unpack_5550##EXT,                                                   \
+        &op_unpack_5650##EXT,                                                   \
+        REF_COMMON_PATTERNS(swap_bytes_U16##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U8_U16##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U16_U8##EXT),                               \
+        REF_COMMON_PATTERNS(expand_U8_U16##EXT),                                \
+        REF_COMMON_PATTERNS(lshift16##EXT),                                     \
+        REF_COMMON_PATTERNS(rshift16##EXT),                                     \
+        NULL                                                                    \
+    },                                                                          \
+};
+
+#define DECL_FUNCS_32(SIZE, EXT, FLAG)                                          \
+    DECL_PACKED_RW(_m2##EXT, 32)                                                \
+    DECL_PACK_UNPACK(_m2##EXT, U32, 10, 10, 10, 2)                              \
+    DECL_PACK_UNPACK(_m2##EXT, U32, 2, 10, 10, 10)                              \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 0, 0, 0)                                  \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 0, 0, 1)                                  \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 1, 1, 0)                                  \
+    DECL_SWAP_BYTES(_m2##EXT, U32, 1, 1, 1, 1)                                  \
+    DECL_CONVERT(EXT,  U8, U32)                                                 \
+    DECL_CONVERT(EXT, U32,  U8)                                                 \
+    DECL_CONVERT(EXT, U16, U32)                                                 \
+    DECL_CONVERT(EXT, U32, U16)                                                 \
+    DECL_CONVERT(EXT,  U8, F32)                                                 \
+    DECL_CONVERT(EXT, F32,  U8)                                                 \
+    DECL_CONVERT(EXT, U16, F32)                                                 \
+    DECL_CONVERT(EXT, F32, U16)                                                 \
+    DECL_EXPAND(EXT,   U8, U32)                                                 \
+    DECL_MIN_MAX(EXT)                                                           \
+    DECL_SCALE(EXT)                                                             \
+    DECL_DITHER(EXT, 0)                                                         \
+    DECL_DITHER(EXT, 1)                                                         \
+    DECL_DITHER(EXT, 2)                                                         \
+    DECL_DITHER(EXT, 3)                                                         \
+    DECL_DITHER(EXT, 4)                                                         \
+    DECL_DITHER(EXT, 5)                                                         \
+    DECL_DITHER(EXT, 6)                                                         \
+    DECL_DITHER(EXT, 7)                                                         \
+    DECL_DITHER(EXT, 8)                                                         \
+    DECL_LINEAR(EXT, luma,      SWS_MASK_LUMA)                                  \
+    DECL_LINEAR(EXT, alpha,     SWS_MASK_ALPHA)                                 \
+    DECL_LINEAR(EXT, lumalpha,  SWS_MASK_LUMA | SWS_MASK_ALPHA)                 \
+    DECL_LINEAR(EXT, dot3,      0b111)                                          \
+    DECL_LINEAR(EXT, row0,      SWS_MASK_ROW(0))                                \
+    DECL_LINEAR(EXT, row0a,     SWS_MASK_ROW(0) | SWS_MASK_ALPHA)               \
+    DECL_LINEAR(EXT, diag3,     SWS_MASK_DIAG3)                                 \
+    DECL_LINEAR(EXT, diag4,     SWS_MASK_DIAG4)                                 \
+    DECL_LINEAR(EXT, diagoff3,  SWS_MASK_DIAG3 | SWS_MASK_OFF3)                 \
+    DECL_LINEAR(EXT, matrix3,   SWS_MASK_MAT3)                                  \
+    DECL_LINEAR(EXT, affine3,   SWS_MASK_MAT3 | SWS_MASK_OFF3)                  \
+    DECL_LINEAR(EXT, affine3a,  SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA) \
+    DECL_LINEAR(EXT, matrix4,   SWS_MASK_MAT4)                                  \
+    DECL_LINEAR(EXT, affine4,   SWS_MASK_MAT4 | SWS_MASK_OFF4)                  \
+                                                                                \
+static const SwsOpTable ops32##EXT = {                                          \
+    .cpu_flags = AV_CPU_FLAG_##FLAG,                                            \
+    .block_size = SIZE,                                                         \
+    .entries = {                                                                \
+        &op_read32_packed2_m2##EXT,                                             \
+        &op_read32_packed3_m2##EXT,                                             \
+        &op_read32_packed4_m2##EXT,                                             \
+        &op_write32_packed2_m2##EXT,                                            \
+        &op_write32_packed3_m2##EXT,                                            \
+        &op_write32_packed4_m2##EXT,                                            \
+        &op_pack_1010102_m2##EXT,                                               \
+        &op_pack_2101010_m2##EXT,                                               \
+        &op_unpack_1010102_m2##EXT,                                             \
+        &op_unpack_2101010_m2##EXT,                                             \
+        REF_COMMON_PATTERNS(swap_bytes_U32_m2##EXT),                            \
+        REF_COMMON_PATTERNS(convert_U8_U32##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U32_U8##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U16_U32##EXT),                              \
+        REF_COMMON_PATTERNS(convert_U32_U16##EXT),                              \
+        REF_COMMON_PATTERNS(convert_U8_F32##EXT),                               \
+        REF_COMMON_PATTERNS(convert_F32_U8##EXT),                               \
+        REF_COMMON_PATTERNS(convert_U16_F32##EXT),                              \
+        REF_COMMON_PATTERNS(convert_F32_U16##EXT),                              \
+        REF_COMMON_PATTERNS(expand_U8_U32##EXT),                                \
+        REF_COMMON_PATTERNS(min##EXT),                                          \
+        REF_COMMON_PATTERNS(max##EXT),                                          \
+        REF_COMMON_PATTERNS(scale##EXT),                                        \
+        REF_COMMON_PATTERNS(dither0##EXT),                                      \
+        REF_COMMON_PATTERNS(dither1##EXT),                                      \
+        REF_COMMON_PATTERNS(dither2##EXT),                                      \
+        REF_COMMON_PATTERNS(dither3##EXT),                                      \
+        REF_COMMON_PATTERNS(dither4##EXT),                                      \
+        REF_COMMON_PATTERNS(dither5##EXT),                                      \
+        REF_COMMON_PATTERNS(dither6##EXT),                                      \
+        REF_COMMON_PATTERNS(dither7##EXT),                                      \
+        REF_COMMON_PATTERNS(dither8##EXT),                                      \
+        &op_luma##EXT,                                                          \
+        &op_alpha##EXT,                                                         \
+        &op_lumalpha##EXT,                                                      \
+        &op_dot3##EXT,                                                          \
+        &op_row0##EXT,                                                          \
+        &op_row0a##EXT,                                                         \
+        &op_diag3##EXT,                                                         \
+        &op_diag4##EXT,                                                         \
+        &op_diagoff3##EXT,                                                      \
+        &op_matrix3##EXT,                                                       \
+        &op_affine3##EXT,                                                       \
+        &op_affine3a##EXT,                                                      \
+        &op_matrix4##EXT,                                                       \
+        &op_affine4##EXT,                                                       \
+        NULL                                                                    \
+    },                                                                          \
+};
+
+DECL_FUNCS_8(16, _m1_sse4, SSE4)
+DECL_FUNCS_8(32, _m1_avx2, AVX2)
+DECL_FUNCS_8(32, _m2_sse4, SSE4)
+DECL_FUNCS_8(64, _m2_avx2, AVX2)
+
+DECL_FUNCS_16(16, _m1_avx2, AVX2)
+DECL_FUNCS_16(32, _m2_avx2, AVX2)
+
+DECL_FUNCS_32(16, _avx2, AVX2)
+
+static av_const int get_mmsize(const int cpu_flags)
+{
+    if (cpu_flags & AV_CPU_FLAG_AVX512)
+        return 64;
+    else if (cpu_flags & AV_CPU_FLAG_AVX2)
+        return 32;
+    else if (cpu_flags & AV_CPU_FLAG_SSE4)
+        return 16;
+    else
+        return AVERROR(ENOTSUP);
+}
+
+/**
+ * Returns true if the operation's implementation only depends on the block
+ * size, and not the underlying pixel type
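+ * (e.g. a planar U16 read moves the same bytes as a planar U8 read with
+ * twice the block size, so compile() below can retry such ops as U8)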
+ */
+static bool op_is_type_invariant(const SwsOp *op)
+{
+    switch (op->op) {
+    case SWS_OP_READ:
+    case SWS_OP_WRITE:
+        return !op->rw.packed && !op->rw.frac;
+    case SWS_OP_SWIZZLE:
+    case SWS_OP_CLEAR:
+        return true;
+    }
+
+    return false;
+}
+
+static int solve_shuffle(const SwsOpList *ops, int mmsize, SwsCompiledOp *out)
+{
+    uint8_t shuffle[16];
+    int read_bytes, write_bytes;
+    int pixels;
+
+    /* Solve the shuffle mask for one 128-bit lane only */
+    pixels = ff_sws_solve_shuffle(ops, shuffle, 16, 0x80, &read_bytes, &write_bytes);
+    if (pixels < 0)
+        return pixels;
+
+    /* We can't shuffle across lanes, so restrict the vector size to XMM
+     * whenever the read/write size would be a subset of the full vector */
+    if (read_bytes < 16 || write_bytes < 16)
+        mmsize = 16;
+
+    const int num_lanes = mmsize / 16;
+    const int in_total  = num_lanes * read_bytes;
+    const int out_total = num_lanes * write_bytes;
+    const int read_size = in_total <= 4 ? 4 : /* movd */
+                          in_total <= 8 ? 8 : /* movq */
+                          mmsize;             /* movu */
+
+    *out = (SwsCompiledOp) {
+        .priv       = av_memdup(shuffle, sizeof(shuffle)),
+        .free       = av_free,
+        .block_size = pixels * num_lanes,
+        .over_read  = read_size - in_total,
+        .over_write = mmsize - out_total,
+        .cpu_flags  = mmsize > 32 ? AV_CPU_FLAG_AVX512 :
+                      mmsize > 16 ? AV_CPU_FLAG_AVX2 :
+                                    AV_CPU_FLAG_SSE4,
+    };
+
+    if (!out->priv)
+        return AVERROR(ENOMEM);
+
+#define ASSIGN_SHUFFLE_FUNC(IN, OUT, EXT)                                       \
+do {                                                                            \
+    SWS_DECL_FUNC(ff_packed_shuffle##IN##_##OUT##_##EXT);                       \
+    if (in_total == IN && out_total == OUT)                                     \
+        out->func = ff_packed_shuffle##IN##_##OUT##_##EXT;                      \
+} while (0)
+
+    ASSIGN_SHUFFLE_FUNC( 5, 15, sse4);
+    ASSIGN_SHUFFLE_FUNC( 4, 16, sse4);
+    ASSIGN_SHUFFLE_FUNC( 2, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(10, 15, sse4);
+    ASSIGN_SHUFFLE_FUNC( 8, 16, sse4);
+    ASSIGN_SHUFFLE_FUNC( 4, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(15, 15, sse4);
+    ASSIGN_SHUFFLE_FUNC(12, 16, sse4);
+    ASSIGN_SHUFFLE_FUNC( 6, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(16, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(16, 16, sse4);
+    ASSIGN_SHUFFLE_FUNC( 8, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(12, 12, sse4);
+    ASSIGN_SHUFFLE_FUNC(32, 32, avx2);
+    ASSIGN_SHUFFLE_FUNC(64, 64, avx512);
+    av_assert1(out->func);
+    return 0;
+}
+
+/* Normalize clear values into 32-bit integer constants */
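+/* (e.g. a U8 clear value of 0x12 becomes the splat constant 0x12121212) */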
+static void normalize_clear(SwsOp *op)
+{
+    static_assert(sizeof(uint32_t) == sizeof(int), "int size mismatch");
+    SwsOpPriv priv;
+    union {
+        uint32_t u32;
+        int i;
+    } c;
+
+    ff_sws_setup_q4(op, &priv);
+    for (int i = 0; i < 4; i++) {
+        if (!op->c.q4[i].den)
+            continue;
+        switch (ff_sws_pixel_type_size(op->type)) {
+        case 1: c.u32 = 0x1010101 * priv.u8[i]; break;
+        case 2: c.u32 = priv.u16[i] << 16 | priv.u16[i]; break;
+        case 4: c.u32 = priv.u32[i]; break;
+        }
+
+        op->c.q4[i].num = c.i;
+        op->c.q4[i].den = 1;
+    }
+}
+
+static int compile(SwsContext *ctx, SwsOpList *ops, SwsCompiledOp *out)
+{
+    const int cpu_flags = av_get_cpu_flags();
+    const int mmsize = get_mmsize(cpu_flags);
+    if (mmsize < 0)
+        return mmsize;
+
+    av_assert1(ops->num_ops > 0);
+    const SwsOp read = ops->ops[0];
+    const SwsOp write = ops->ops[ops->num_ops - 1];
+    int ret;
+
+    /* Special fast path for in-place packed shuffle */
+    ret = solve_shuffle(ops, mmsize, out);
+    if (ret != AVERROR(ENOTSUP))
+        return ret;
+
+    SwsOpChain *chain = ff_sws_op_chain_alloc();
+    if (!chain)
+        return AVERROR(ENOMEM);
+
+    *out = (SwsCompiledOp) {
+        .priv = chain,
+        .free = (void (*)(void *)) ff_sws_op_chain_free,
+
+        /* Use at most two full YMM regs during the widest precision section */
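+        /* (e.g. with AVX2 and F32 as the widest type: 2 * 32 / 4 = 16 pixels) */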
+        .block_size = 2 * FFMIN(mmsize, 32) / ff_sws_op_list_max_size(ops),
+    };
+
+    /* 3-component reads/writes process one extra garbage word */
+    if (read.rw.packed && read.rw.elems == 3)
+        out->over_read = sizeof(uint32_t);
+    if (write.rw.packed && write.rw.elems == 3)
+        out->over_write = sizeof(uint32_t);
+
+    static const SwsOpTable *const tables[] = {
+        &ops8_m1_sse4,
+        &ops8_m1_avx2,
+        &ops8_m2_sse4,
+        &ops8_m2_avx2,
+        &ops16_m1_avx2,
+        &ops16_m2_avx2,
+        &ops32_avx2,
+    };
+
+    do {
+        int op_block_size = out->block_size;
+        SwsOp *op = &ops->ops[0];
+
+        if (op_is_type_invariant(op)) {
+            if (op->op == SWS_OP_CLEAR)
+                normalize_clear(op);
+            op_block_size *= ff_sws_pixel_type_size(op->type);
+            op->type = SWS_PIXEL_U8;
+        }
+
+        ret = ff_sws_op_compile_tables(tables, FF_ARRAY_ELEMS(tables), ops,
+                                       op_block_size, chain);
+    } while (ret == AVERROR(EAGAIN));
+    if (ret < 0) {
+        ff_sws_op_chain_free(chain);
+        return ret;
+    }
+
+#define ASSIGN_PROCESS_FUNC(NAME)                               \
+    do {                                                        \
+        SWS_DECL_FUNC(NAME);                                    \
+        void NAME##_return(void) __asm__(#NAME ".return");      \
+        ret = ff_sws_op_chain_append(chain, NAME##_return,      \
+                                     NULL, (SwsOpPriv) {0});    \
+        out->func = NAME;                                       \
+    } while (0)
+
+    const int read_planes  = read.rw.packed  ? 1 : read.rw.elems;
+    const int write_planes = write.rw.packed ? 1 : write.rw.elems;
+    switch (FFMAX(read_planes, write_planes)) {
+    case 1: ASSIGN_PROCESS_FUNC(ff_sws_process1_x86); break;
+    case 2: ASSIGN_PROCESS_FUNC(ff_sws_process2_x86); break;
+    case 3: ASSIGN_PROCESS_FUNC(ff_sws_process3_x86); break;
+    case 4: ASSIGN_PROCESS_FUNC(ff_sws_process4_x86); break;
+    }
+
+    if (ret < 0) {
+        ff_sws_op_chain_free(chain);
+        return ret;
+    }
+
+    out->cpu_flags = chain->cpu_flags;
+    return 0;
+}
+
+SwsOpBackend backend_x86 = {
+    .name       = "x86",
+    .compile    = compile,
+};
diff --git a/libswscale/x86/ops_common.asm b/libswscale/x86/ops_common.asm
new file mode 100644
index 0000000000..2ea28f8fb7
--- /dev/null
+++ b/libswscale/x86/ops_common.asm
@@ -0,0 +1,305 @@
+;******************************************************************************
+;* Copyright (c) 2025 Niklas Haas
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "libavutil/x86/x86util.asm"
+
+; High-level explanation of how the x86 backend works:
+;
+; sws_processN is the shared entry point for all operation chains. This
+; function is responsible for the block loop, as well as initializing the
+; plane pointers. It will jump directly into the first operation kernel,
+; and each operation kernel will jump directly into the next one, with the
+; final kernel jumping back into the sws_process return point. (See label
+; `sws_process.return` in ops_int.asm)
+;
+; To handle the jump back to the return point, we append an extra address
+; corresponding to the correct sws_process.return label into the SwsOpChain,
+; and have the WRITE kernel jump into it as usual. (See the FINISH macro)
+;
+; Inside an operation chain, we use a custom calling convention to preserve
+; registers between kernels. The exact register allocation is found further
+; below in this file, but we basically reserve (and share) the following
+; registers:
+;
+; - const execq (read-only, shared execution data, see SwsOpExec); stores the
+;   static metadata for this call and describes the image layouts
+;
+; - implq (read-only, operation chain, see SwsOpChain); stores the private data
+;   for each operation as well as the pointer to the next kernel in the sequence.
+;   This register is automatically incremented by the CONTINUE macro, and will
+;   be reset back to the first operation kernel by sws_process.
+;
+; - bxd, yd: current line and block number, used as loop counters in sws_process.
+;   Also used by e.g. the dithering code to do position-dependent dithering.
+;
+; - tmp0, tmp1: two temporary registers which are NOT preserved between kernels
+;
+; - inNq, outNq: plane pointers. These are incremented automatically after the
+;   corresponding read/write operation, by the read/write kernels themselves.
+;   sws_process will take care of resetting these to the next line after the
+;   block loop is done.
+;
+; Additionally, we pass data between kernels by directly keeping them inside
+; vector registers. For this, we reserve the following registers:
+;
+; - mx, my, mz, mw:     low half of the X, Y, Z and W components
+; - mx2, my2, mz2, mw2: high half of the X, Y, Z and W components
+; (As well as sized variants for xmx, ymx, etc.)
+;
+; The "high half" registers are only sometimes used; in order to enable
+; processing more pixels at the same time. See `decl_v2` below, which allows
+; assembling the same operation twice, once with only the lower half (V2=0),
+; and once with both halves (V2=1). The remaining vectors are free for use
+; inside operation kernels, starting from m8.
+;
+; The basic rule is that we always use the full set of both vector registers
+; when processing the largest element size within a pixel chain. For example,
+; if we load 8-bit values and convert them to 32-bit floats internally, then
+; we would have an operation chain which combines an SSE4 V2=0 u8 kernel (128
+; bits = 16 pixels) with an AVX2 V2=1 f32 kernel (512 bits = 16 pixels). This
+; keeps the number of pixels being processed (the block size) constant. The
+; V2 setting is suffixed to the operation name (_m1 or _m2) during name
+; mangling.
+;
+; This design leaves us with the following set of possibilities:
+;
+; SSE4:
+; - max element is 32-bit: currently unsupported
+; - max element is 16-bit: currently unsupported
+; - max element is 8-bit:  block size 32, u8_m2_sse4
+;
+; AVX2:
+; - max element is 32-bit: block size 16, u32_m2_avx2, u16_m1_avx2, u8_m1_sse4
+; - max element is 16-bit: block size 32, u16_m2_avx2, u8_m1_avx2
+; - max element is 8-bit:  block size 64, u8_m2_avx2
+;
+; Meaning we need to cover the following code paths for each bit depth:
+;
+; -  8-bit kernels: m1_sse4, m2_sse4, m1_avx2, m2_avx2
+; - 16-bit kernels: m1_avx2, m2_avx2
+; - 32-bit kernels: m2_avx2
+;
+; This is achieved by macro'ing each operation kernel and declaring it once
+; per SIMD version, and (if needed) once per V2 setting using decl_v2. (See
+; the bottom of ops_int.asm for an example)
+;
+; Finally, we overload some operation kernels for different numbers of active
+; components, using the `decl_pattern` and `decl_common_patterns` macros.
+; Inside these kernels, the variables X, Y, Z and W will each be set to 0 or 1,
+; depending on which components are active for this particular kernel instance.
+; The active pattern is mangled into the kernel name as a pXYZW_ prefix.
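+;
+; For illustration only, a minimal kernel following these conventions might
+; look like the following sketch (not an actual op in this file):
+;
+;   op example_double
+;           LOAD_CONT tmp0q            ; fetch the next kernel's address
+;           paddw mx, mx               ; transform the X component...
+;   IF V2,  paddw mx2, mx2             ; ...and its high half when V2 is set
+;           CONTINUE tmp0q             ; bump implq and jump onward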
+
+struc SwsOpExec
+    .in0 resq 1
+    .in1 resq 1
+    .in2 resq 1
+    .in3 resq 1
+    .out0 resq 1
+    .out1 resq 1
+    .out2 resq 1
+    .out3 resq 1
+    .in_stride0 resq 1
+    .in_stride1 resq 1
+    .in_stride2 resq 1
+    .in_stride3 resq 1
+    .out_stride0 resq 1
+    .out_stride1 resq 1
+    .out_stride2 resq 1
+    .out_stride3 resq 1
+    .in_bump0 resq 1
+    .in_bump1 resq 1
+    .in_bump2 resq 1
+    .in_bump3 resq 1
+    .out_bump0 resq 1
+    .out_bump1 resq 1
+    .out_bump2 resq 1
+    .out_bump3 resq 1
+    .width resd 1
+    .height resd 1
+    .slice_y resd 1
+    .slice_h resd 1
+    .block_size_in resd 1
+    .block_size_out resd 1
+endstruc
+
+struc SwsOpImpl
+    .cont resb 16
+    .priv resb 16
+    .next resb 0
+endstruc
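+
+; Note: both struc layouts above are assumed to mirror the corresponding
+; C-side structs (SwsOpExec and the per-op chain entries) exactly, and must
+; be kept in sync whenever either side changes.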
+
+;---------------------------------------------------------
+; Common macros for declaring operations
+
+; Declare an operation kernel with the correct name mangling.
+%macro op 1 ; name
+    %ifdef X
+        %define ADD_PAT(name) p %+ X %+ Y %+ Z %+ W %+ _ %+ name
+    %else
+        %define ADD_PAT(name) name
+    %endif
+
+    %ifdef V2
+        %if V2
+            %define ADD_MUL(name) name %+ _m2
+        %else
+            %define ADD_MUL(name) name %+ _m1
+        %endif
+    %else
+        %define ADD_MUL(name) name
+    %endif
+
+    cglobal ADD_PAT(ADD_MUL(%1)), 0, 0, 0 ; already allocated by entry point
+
+    %undef ADD_PAT
+    %undef ADD_MUL
+%endmacro
+
+; Declare an operation kernel twice, once with V2=0 and once with V2=1
+%macro decl_v2 2+ ; v2, func
+    %xdefine V2 %1
+    %2
+    %undef V2
+%endmacro
+
+; Declare an operation kernel specialized to a given subset of active components
+%macro decl_pattern 5+ ; X, Y, Z, W, func
+    %xdefine X %1
+    %xdefine Y %2
+    %xdefine Z %3
+    %xdefine W %4
+    %5
+    %undef X
+    %undef Y
+    %undef Z
+    %undef W
+%endmacro
+
+; Declare an operation kernel specialized to each common component pattern
+%macro decl_common_patterns 1+ ; func
+    decl_pattern 1, 0, 0, 0, %1 ; y
+    decl_pattern 1, 0, 0, 1, %1 ; ya
+    decl_pattern 1, 1, 1, 0, %1 ; yuv
+    decl_pattern 1, 1, 1, 1, %1 ; yuva
+%endmacro
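+
+; For example, under `INIT_YMM avx2` with V2=1, `decl_common_patterns scale`
+; instantiates the kernels p1000_scale_m2, p1001_scale_m2, p1110_scale_m2 and
+; p1111_scale_m2 (modulo whatever prefix and suffix cglobal itself applies).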
+
+;---------------------------------------------------------
+; Common names for the internal calling convention
+%define mx      m0
+%define my      m1
+%define mz      m2
+%define mw      m3
+
+%define xmx     xm0
+%define xmy     xm1
+%define xmz     xm2
+%define xmw     xm3
+
+%define ymx     ym0
+%define ymy     ym1
+%define ymz     ym2
+%define ymw     ym3
+
+%define mx2     m4
+%define my2     m5
+%define mz2     m6
+%define mw2     m7
+
+%define xmx2    xm4
+%define xmy2    xm5
+%define xmz2    xm6
+%define xmw2    xm7
+
+%define ymx2    ym4
+%define ymy2    ym5
+%define ymz2    ym6
+%define ymw2    ym7
+
+; Reserved in this order by the signature of SwsOpFunc
+%define execq   r0q
+%define implq   r1q
+%define bxd     r2d
+%define yd      r3d
+
+; Extra registers for free use by kernels, not saved between ops
+%define tmp0q   r4q
+%define tmp1q   r5q
+
+%define tmp0d   r4d
+%define tmp1d   r5d
+
+; Registers for plane pointers; put at the end (and in ascending plane order)
+; so that we can avoid reserving them when not necessary
+%define out0q   r6q
+%define  in0q   r7q
+%define out1q   r8q
+%define  in1q   r9q
+%define out2q   r10q
+%define  in2q   r11q
+%define out3q   r12q
+%define  in3q   r13q
+
+;---------------------------------------------------------
+; Common macros for linking together different kernels
+
+; Load the next operation kernel's address to a register
+%macro LOAD_CONT 1 ; reg
+    mov %1, [implq + SwsOpImpl.cont]
+%endmacro
+
+; Tail call into the next operation kernel, given that kernel's address
+%macro CONTINUE 1 ; reg
+    add implq, SwsOpImpl.next
+    jmp %1
+    annotate_function_size
+%endmacro
+
+; Convenience macro to load and continue to the next kernel in one step
+%macro CONTINUE 0
+    LOAD_CONT tmp0q
+    CONTINUE tmp0q
+%endmacro
+
+; Final macro to end the operation chain, used by WRITE kernels to jump back
+; to the process function return point. Very similar to CONTINUE, but skips
+; incrementing the implq pointer, and also clears AVX registers to avoid
+; phantom dependencies between loop iterations.
+%macro FINISH 1 ; reg
+    %if vzeroupper_required
+        ; we may jump back into an SSE read, so always zero upper regs here
+        vzeroupper
+    %endif
+    jmp %1
+    annotate_function_size
+%endmacro
+
+; Helper for inline conditionals; used to conditionally include single lines
+%macro IF 2+ ; cond, body
+    %if %1
+        %2
+    %endif
+%endmacro
+
+; Alternate name; for nested usage (to work around NASM limitations)
+%macro IF1 2+
+    %if %1
+        %2
+    %endif
+%endmacro
diff --git a/libswscale/x86/ops_float.asm b/libswscale/x86/ops_float.asm
new file mode 100644
index 0000000000..1c92ac37fb
--- /dev/null
+++ b/libswscale/x86/ops_float.asm
@@ -0,0 +1,389 @@
+;******************************************************************************
+;* Copyright (c) 2025 Niklas Haas
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "ops_common.asm"
+
+SECTION .text
+
+;---------------------------------------------------------
+; Pixel type conversions
+
+%macro conv8to32f 0
+op convert_U8_F32
+        LOAD_CONT tmp0q
+IF X,   vpsrldq xmx2, xmx, 8
+IF Y,   vpsrldq xmy2, xmy, 8
+IF Z,   vpsrldq xmz2, xmz, 8
+IF W,   vpsrldq xmw2, xmw, 8
+IF X,   pmovzxbd mx, xmx
+IF Y,   pmovzxbd my, xmy
+IF Z,   pmovzxbd mz, xmz
+IF W,   pmovzxbd mw, xmw
+IF X,   pmovzxbd mx2, xmx2
+IF Y,   pmovzxbd my2, xmy2
+IF Z,   pmovzxbd mz2, xmz2
+IF W,   pmovzxbd mw2, xmw2
+IF X,   vcvtdq2ps mx, mx
+IF Y,   vcvtdq2ps my, my
+IF Z,   vcvtdq2ps mz, mz
+IF W,   vcvtdq2ps mw, mw
+IF X,   vcvtdq2ps mx2, mx2
+IF Y,   vcvtdq2ps my2, my2
+IF Z,   vcvtdq2ps mz2, mz2
+IF W,   vcvtdq2ps mw2, mw2
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv16to32f 0
+op convert_U16_F32
+        LOAD_CONT tmp0q
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   pmovzxwd mx, xmx
+IF Y,   pmovzxwd my, xmy
+IF Z,   pmovzxwd mz, xmz
+IF W,   pmovzxwd mw, xmw
+IF X,   pmovzxwd mx2, xmx2
+IF Y,   pmovzxwd my2, xmy2
+IF Z,   pmovzxwd mz2, xmz2
+IF W,   pmovzxwd mw2, xmw2
+IF X,   vcvtdq2ps mx, mx
+IF Y,   vcvtdq2ps my, my
+IF Z,   vcvtdq2ps mz, mz
+IF W,   vcvtdq2ps mw, mw
+IF X,   vcvtdq2ps mx2, mx2
+IF Y,   vcvtdq2ps my2, my2
+IF Z,   vcvtdq2ps mz2, mz2
+IF W,   vcvtdq2ps mw2, mw2
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32fto8 0
+op convert_F32_U8
+        LOAD_CONT tmp0q
+IF X,   cvttps2dq mx, mx
+IF Y,   cvttps2dq my, my
+IF Z,   cvttps2dq mz, mz
+IF W,   cvttps2dq mw, mw
+IF X,   cvttps2dq mx2, mx2
+IF Y,   cvttps2dq my2, my2
+IF Z,   cvttps2dq mz2, mz2
+IF W,   cvttps2dq mw2, mw2
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+        vzeroupper
+IF X,   packuswb xmx, xmx2
+IF Y,   packuswb xmy, xmy2
+IF Z,   packuswb xmz, xmz2
+IF W,   packuswb xmw, xmw2
+IF X,   vpshufd xmx, xmx, q3120
+IF Y,   vpshufd xmy, xmy, q3120
+IF Z,   vpshufd xmz, xmz, q3120
+IF W,   vpshufd xmw, xmw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32fto16 0
+op convert_F32_U16
+        LOAD_CONT tmp0q
+IF X,   cvttps2dq mx, mx
+IF Y,   cvttps2dq my, my
+IF Z,   cvttps2dq mz, mz
+IF W,   cvttps2dq mw, mw
+IF X,   cvttps2dq mx2, mx2
+IF Y,   cvttps2dq my2, my2
+IF Z,   cvttps2dq mz2, mz2
+IF W,   cvttps2dq mw2, mw2
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vpermq mx, mx, q3120
+IF Y,   vpermq my, my, q3120
+IF Z,   vpermq mz, mz, q3120
+IF W,   vpermq mw, mw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+%macro min_max 0
+op min
+IF X,   vbroadcastss m8,  [implq + SwsOpImpl.priv + 0]
+IF Y,   vbroadcastss m9,  [implq + SwsOpImpl.priv + 4]
+IF Z,   vbroadcastss m10, [implq + SwsOpImpl.priv + 8]
+IF W,   vbroadcastss m11, [implq + SwsOpImpl.priv + 12]
+        LOAD_CONT tmp0q
+IF X,   minps mx, m8
+IF Y,   minps my, m9
+IF Z,   minps mz, m10
+IF W,   minps mw, m11
+IF X,   minps mx2, m8
+IF Y,   minps my2, m9
+IF Z,   minps mz2, m10
+IF W,   minps mw2, m11
+        CONTINUE tmp0q
+
+op max
+IF X,   vbroadcastss m8,  [implq + SwsOpImpl.priv + 0]
+IF Y,   vbroadcastss m9,  [implq + SwsOpImpl.priv + 4]
+IF Z,   vbroadcastss m10, [implq + SwsOpImpl.priv + 8]
+IF W,   vbroadcastss m11, [implq + SwsOpImpl.priv + 12]
+        LOAD_CONT tmp0q
+IF X,   maxps mx, m8
+IF Y,   maxps my, m9
+IF Z,   maxps mz, m10
+IF W,   maxps mw, m11
+IF X,   maxps mx2, m8
+IF Y,   maxps my2, m9
+IF Z,   maxps mz2, m10
+IF W,   maxps mw2, m11
+        CONTINUE tmp0q
+%endmacro
+
+%macro scale 0
+op scale
+        vbroadcastss m8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   mulps mx, m8
+IF Y,   mulps my, m8
+IF Z,   mulps mz, m8
+IF W,   mulps mw, m8
+IF X,   mulps mx2, m8
+IF Y,   mulps my2, m8
+IF Z,   mulps mz2, m8
+IF W,   mulps mw2, m8
+        CONTINUE tmp0q
+%endmacro
+
+%macro load_dither_row 5 ; size_log2, y, addr, out, out2
+        lea tmp0q, %2
+        and tmp0q, (1 << %1) - 1
+        shl tmp0q, %1+2
+%if %1 == 2
+        VBROADCASTI128 %4, [%3 + tmp0q]
+%else
+        mova %4, [%3 + tmp0q]
+    %if (4 << %1) > mmsize
+        mova %5, [%3 + tmp0q + mmsize]
+    %endif
+%endif
+%endmacro
+
+%macro dither 1 ; size_log2
+op dither%1
+        %define DX  m8
+        %define DY  m9
+        %define DZ  m10
+        %define DW  m11
+        %define DX2 DX
+        %define DY2 DY
+        %define DZ2 DZ
+        %define DW2 DW
+%if %1 == 0
+        ; constant offset for all channels
+        vbroadcastss DX, [implq + SwsOpImpl.priv]
+        %define DY DX
+        %define DZ DX
+        %define DW DX
+%elif %1 == 1
+        ; 2x2 matrix, only sign of y matters
+        mov tmp0d, yd
+        and tmp0d, 1
+        shl tmp0d, 3
+    %if X || Z
+        ; dither matrix is stored directly in the private data
+        vbroadcastsd DX, [implq + SwsOpImpl.priv + tmp0q]
+    %endif
+    %if Y || W
+        xor tmp0d, 8
+        vbroadcastsd DY, [implq + SwsOpImpl.priv + tmp0q]
+    %endif
+        %define DZ DX
+        %define DW DY
+%else
+        ; matrix is at least 4x4, load all four channels with custom offset
+    %if (4 << %1) > mmsize
+        %define DX2 m12
+        %define DY2 m13
+        %define DZ2 m14
+        %define DW2 m15
+    %endif
+        ; dither matrix is stored indirectly at the private data address
+        mov tmp1q, [implq + SwsOpImpl.priv]
+    %if (4 << %1) > 2 * mmsize
+        ; need to add in x offset
+        mov tmp0d, bxd
+        shl tmp0d, 6 ; sizeof(float[16])
+        and tmp0d, (4 << %1) - 1
+        add tmp1q, tmp0q
+    %endif
+IF X,   load_dither_row %1, [yd + 0], tmp1q, DX, DX2
+IF Y,   load_dither_row %1, [yd + 3], tmp1q, DY, DY2
+IF Z,   load_dither_row %1, [yd + 2], tmp1q, DZ, DZ2
+IF W,   load_dither_row %1, [yd + 5], tmp1q, DW, DW2
+%endif
+        LOAD_CONT tmp0q
+IF X,   addps mx, DX
+IF Y,   addps my, DY
+IF Z,   addps mz, DZ
+IF W,   addps mw, DW
+IF X,   addps mx2, DX2
+IF Y,   addps my2, DY2
+IF Z,   addps mz2, DZ2
+IF W,   addps mw2, DW2
+        CONTINUE tmp0q
+%endmacro
+
+%macro dither_fns 0
+        dither 0
+        dither 1
+        dither 2
+        dither 3
+        dither 4
+        dither 5
+        dither 6
+        dither 7
+        dither 8
+%endmacro
+
+%xdefine MASK(I, J)  (1 << (5 * (I) + (J)))
+%xdefine MASK_OFF(I) MASK(I, 4)
+%xdefine MASK_ROW(I) (0b11111 << (5 * (I)))
+%xdefine MASK_COL(J) (0b1000010000100001 << J)
+%xdefine MASK_ALL    (1 << 20) - 1
+%xdefine MASK_LUMA   MASK(0, 0) | MASK_OFF(0)
+%xdefine MASK_ALPHA  MASK(3, 3) | MASK_OFF(3)
+%xdefine MASK_DIAG3  MASK(0, 0) | MASK(1, 1) | MASK(2, 2)
+%xdefine MASK_OFF3   MASK_OFF(0) | MASK_OFF(1) | MASK_OFF(2)
+%xdefine MASK_MAT3   MASK(0, 0) | MASK(0, 1) | MASK(0, 2) |\
+                     MASK(1, 0) | MASK(1, 1) | MASK(1, 2) |\
+                     MASK(2, 0) | MASK(2, 1) | MASK(2, 2)
+%xdefine MASK_DIAG4  MASK_DIAG3 | MASK(3, 3)
+%xdefine MASK_OFF4   MASK_OFF3 | MASK_OFF(3)
+%xdefine MASK_MAT4   MASK_ALL & ~MASK_OFF4
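+;
+; In other words: bit (5*I + J) of the mask marks weight J of matrix row I as
+; significant, and bit (5*I + 4) marks that row's constant offset, for 20 bits
+; covering the full 4x5 affine matrix. MASK_LUMA, for instance, selects only
+; the (0,0) weight plus row 0's offset.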
+
+%macro linear_row 7 ; res, x, y, z, w, row, mask
+%define COL(J) ((%7) & MASK(%6, J)) ; true if mask contains component J
+%define NOP(J) (J == %6 && !COL(J)) ; true if J is untouched input component
+
+    ; load weights
+    IF COL(0),  vbroadcastss m12,  [tmp0q + %6 * 20 + 0]
+    IF COL(1),  vbroadcastss m13,  [tmp0q + %6 * 20 + 4]
+    IF COL(2),  vbroadcastss m14,  [tmp0q + %6 * 20 + 8]
+    IF COL(3),  vbroadcastss m15,  [tmp0q + %6 * 20 + 12]
+
+    ; initialize result vector as appropriate
+    %if COL(4) ; offset
+        vbroadcastss %1, [tmp0q + %6 * 20 + 16]
+    %elif NOP(0)
+        ; directly reuse first component vector if possible
+        mova %1, %2
+    %else
+        xorps %1, %1
+    %endif
+
+    IF COL(0),  mulps m12, %2
+    IF COL(1),  mulps m13, %3
+    IF COL(2),  mulps m14, %4
+    IF COL(3),  mulps m15, %5
+    IF COL(0),  addps %1, m12
+    IF NOP(0) && COL(4), addps %1, %2 ; first vector was not reused
+    IF COL(1),  addps %1, m13
+    IF NOP(1),  addps %1, %3
+    IF COL(2),  addps %1, m14
+    IF NOP(2),  addps %1, %4
+    IF COL(3),  addps %1, m15
+    IF NOP(3),  addps %1, %5
+%endmacro
+
+%macro linear_inner 5 ; x, y, z, w, mask
+    %define ROW(I) ((%5) & MASK_ROW(I))
+    IF1 ROW(0), linear_row m8,  %1, %2, %3, %4, 0, %5
+    IF1 ROW(1), linear_row m9,  %1, %2, %3, %4, 1, %5
+    IF1 ROW(2), linear_row m10, %1, %2, %3, %4, 2, %5
+    IF1 ROW(3), linear_row m11, %1, %2, %3, %4, 3, %5
+    IF ROW(0),  mova %1, m8
+    IF ROW(1),  mova %2, m9
+    IF ROW(2),  mova %3, m10
+    IF ROW(3),  mova %4, m11
+%endmacro
+
+%macro linear_mask 2 ; name, mask
+op %1
+        mov tmp0q, [implq + SwsOpImpl.priv] ; address of matrix
+        linear_inner mx,  my,  mz,  mw,  %2
+        linear_inner mx2, my2, mz2, mw2, %2
+        CONTINUE
+%endmacro
+
+; specialized functions for very simple cases
+%macro linear_dot3 0
+op dot3
+        mov tmp0q, [implq + SwsOpImpl.priv]
+        vbroadcastss m12,  [tmp0q + 0]
+        vbroadcastss m13,  [tmp0q + 4]
+        vbroadcastss m14,  [tmp0q + 8]
+        LOAD_CONT tmp0q
+        mulps mx, m12
+        mulps m8, my, m13
+        mulps m9, mz, m14
+        addps mx, m8
+        addps mx, m9
+        mulps mx2, m12
+        mulps m10, my2, m13
+        mulps m11, mz2, m14
+        addps mx2, m10
+        addps mx2, m11
+        CONTINUE tmp0q
+%endmacro
+
+%macro linear_fns 0
+        linear_dot3
+        linear_mask luma,       MASK_LUMA
+        linear_mask alpha,      MASK_ALPHA
+        linear_mask lumalpha,   MASK_LUMA | MASK_ALPHA
+        linear_mask row0,       MASK_ROW(0)
+        linear_mask row0a,      MASK_ROW(0) | MASK_ALPHA
+        linear_mask diag3,      MASK_DIAG3
+        linear_mask diag4,      MASK_DIAG4
+        linear_mask diagoff3,   MASK_DIAG3 | MASK_OFF3
+        linear_mask matrix3,    MASK_MAT3
+        linear_mask affine3,    MASK_MAT3 | MASK_OFF3
+        linear_mask affine3a,   MASK_MAT3 | MASK_OFF3 | MASK_ALPHA
+        linear_mask matrix4,    MASK_MAT4
+        linear_mask affine4,    MASK_MAT4 | MASK_OFF4
+%endmacro
+
+INIT_YMM avx2
+decl_common_patterns conv8to32f
+decl_common_patterns conv16to32f
+decl_common_patterns conv32fto8
+decl_common_patterns conv32fto16
+decl_common_patterns min_max
+decl_common_patterns scale
+decl_common_patterns dither_fns
+linear_fns
diff --git a/libswscale/x86/ops_int.asm b/libswscale/x86/ops_int.asm
new file mode 100644
index 0000000000..147167b2f4
--- /dev/null
+++ b/libswscale/x86/ops_int.asm
@@ -0,0 +1,1049 @@
+;******************************************************************************
+;* Copyright (c) 2025 Niklas Haas
+;*
+;* This file is part of FFmpeg.
+;*
+;* FFmpeg is free software; you can redistribute it and/or
+;* modify it under the terms of the GNU Lesser General Public
+;* License as published by the Free Software Foundation; either
+;* version 2.1 of the License, or (at your option) any later version.
+;*
+;* FFmpeg is distributed in the hope that it will be useful,
+;* but WITHOUT ANY WARRANTY; without even the implied warranty of
+;* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+;* Lesser General Public License for more details.
+;*
+;* You should have received a copy of the GNU Lesser General Public
+;* License along with FFmpeg; if not, write to the Free Software
+;* Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+;******************************************************************************
+
+%include "ops_common.asm"
+
+SECTION_RODATA
+
+align 16
+expand16_shuf:  db   0,  0,  2,  2,  4,  4,  6,  6,  8,  8, 10, 10, 12, 12, 14, 14
+expand32_shuf:  db   0,  0,  0,  0,  4,  4,  4,  4,  8,  8,  8,  8, 12, 12, 12, 12
+
+read8_unpack2:  db   0,  2,  4,  6,  8, 10, 12, 14,  1,  3,  5,  7,  9, 11, 13, 15
+read8_unpack3:  db   0,  3,  6,  9,  1,  4,  7, 10,  2,  5,  8, 11, -1, -1, -1, -1
+read8_unpack4:  db   0,  4,  8, 12,  1,  5,  9, 13,  2,  6, 10, 14,  3,  7, 11, 15
+read16_unpack2: db   0,  1,  4,  5,  8,  9, 12, 13,  2,  3,  6,  7, 10, 11, 14, 15
+read16_unpack3: db   0,  1,  6,  7,  2,  3,  8,  9,  4,  5, 10, 11, -1, -1, -1, -1
+read16_unpack4: db   0,  1,  8,  9,  2,  3, 10, 11,  4,  5, 12, 13,  6,  7, 14, 15
+write8_pack2:   db   0,  8,  1,  9,  2, 10,  3, 11,  4, 12,  5, 13,  6, 14,  7, 15
+write8_pack3:   db   0,  4,  8,  1,  5,  9,  2,  6, 10,  3,  7, 11, -1, -1, -1, -1
+write16_pack3:  db   0,  1,  4,  5,  8,  9,  2,  3,  6,  7, 10, 11, -1, -1, -1, -1
+
+%define write8_pack4  read8_unpack4
+%define write16_pack4 read16_unpack2
+%define write16_pack2 read16_unpack4
+
+align 32
+bits_shuf:      db   0,  0,  0,  0,  0,  0,  0,  0,  1,  1,  1,  1,  1,  1,  1,  1, \
+                     2,  2,  2,  2,  2,  2,  2,  2,  3,  3,  3,  3,  3,  3,  3,  3
+bits_mask:      db 128, 64, 32, 16,  8,  4,  2,  1,128, 64, 32, 16,  8,  4,  2,  1
+bits_reverse:   db   7,  6,  5,  4,  3,  2,  1,  0, 15, 14, 13, 12, 11, 10,  9,  8
+
+align 32
+mask1: times 32 db 0x01
+mask2: times 32 db 0x03
+mask3: times 32 db 0x07
+mask4: times 32 db 0x0F
+
+SECTION .text
+
+;---------------------------------------------------------
+; Global entry point. See `ops_common.asm` for info.
+
+%macro process_fn 1 ; number of planes
+cglobal sws_process%1_x86, 6, 6 + 2 * %1, 16
+            ; Args:
+            ;   execq, implq, bxd, yd as defined in ops_common.asm
+            ;   bx_end and y_end are initially in tmp0d / tmp1d
+            ;   (see SwsOpFunc signature)
+            ;
+            ; Stack layout:
+            ;   [rsp +  0] = [qword] impl->cont (address of first kernel)
+            ;   [rsp +  8] = [qword] &impl[1]   (restore implq after chain)
+            ;   [rsp + 16] = [dword] bx start   (restore after line finish)
+            ;   [rsp + 20] = [dword] bx end     (loop counter limit)
+            ;   [rsp + 24] = [dword] y end      (loop counter limit)
+            sub rsp, 32
+            mov [rsp + 16], bxd
+            mov [rsp + 20], tmp0d ; bx_end
+            mov [rsp + 24], tmp1d ; y_end
+            mov tmp0q, [implq + SwsOpImpl.cont]
+            add implq, SwsOpImpl.next
+            mov [rsp +  0], tmp0q
+            mov [rsp +  8], implq
+
+            ; load plane pointers
+            mov in0q,  [execq + SwsOpExec.in0]
+IF %1 > 1,  mov in1q,  [execq + SwsOpExec.in1]
+IF %1 > 2,  mov in2q,  [execq + SwsOpExec.in2]
+IF %1 > 3,  mov in3q,  [execq + SwsOpExec.in3]
+            mov out0q, [execq + SwsOpExec.out0]
+IF %1 > 1,  mov out1q, [execq + SwsOpExec.out1]
+IF %1 > 2,  mov out2q, [execq + SwsOpExec.out2]
+IF %1 > 3,  mov out3q, [execq + SwsOpExec.out3]
+.loop:
+            jmp [rsp] ; call into op chain
+
+ALIGN function_align
+cglobal_label .return
+            ; op chain always returns back here
+            mov implq, [rsp + 8]
+            inc bxd
+            cmp bxd, [rsp + 20]
+            jne .loop
+            ; end of line
+            inc yd
+            cmp yd, [rsp + 24]
+            je .end
+            ; bump addresses to point to start of next line
+            add in0q,  [execq + SwsOpExec.in_bump0]
+IF %1 > 1,  add in1q,  [execq + SwsOpExec.in_bump1]
+IF %1 > 2,  add in2q,  [execq + SwsOpExec.in_bump2]
+IF %1 > 3,  add in3q,  [execq + SwsOpExec.in_bump3]
+            add out0q, [execq + SwsOpExec.out_bump0]
+IF %1 > 1,  add out1q, [execq + SwsOpExec.out_bump1]
+IF %1 > 2,  add out2q, [execq + SwsOpExec.out_bump2]
+IF %1 > 3,  add out3q, [execq + SwsOpExec.out_bump3]
+            mov bxd, [rsp + 16]
+            jmp [rsp]
+.end:
+            add rsp, 32
+            RET
+%endmacro
+
+process_fn 1
+process_fn 2
+process_fn 3
+process_fn 4
+
+;---------------------------------------------------------
+; Packed shuffle fast-path
+
+; This is a special entry point for handling a subset of operation chains
+; that can be reduced down to a single `pshufb` shuffle mask. For more details
+; about when this works, refer to the documentation of `ff_sws_solve_shuffle`.
+;
+; We specialize this function for every possible combination of pixel strides.
+; For example, gray -> gray16 is classified as an "8, 16" operation because it
+; takes 8 bytes and expands them out to 16 bytes in each application of the
+; 128-bit shuffle mask.
+;
+; Since we can't shuffle across lanes, we instantiate only SSE4 versions for
+; shuffles that are not a clean multiple of 128 bits (e.g. rgb24 -> rgb0).
+; For the clean multiples (e.g. rgba -> argb), we also define AVX2 and AVX512
+; versions that can handle a larger number of bytes at once.
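+;
+; To illustrate (mask values assumed, not taken from the actual solver): an
+; rgb24 -> rgb0 conversion would repeat a 12 -> 16 byte mask along the lines of
+;   db 0, 1, 2, -1,  3, 4, 5, -1,  6, 7, 8, -1,  9, 10, 11, -1
+; where the -1 entries make pshufb clear the corresponding destination byte.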
+
+%macro packed_shuffle 2 ; size_in, size_out
+cglobal packed_shuffle%1_%2, 6, 10, 2, \
+    exec, shuffle, bx, y, bxend, yend, src, dst, src_stride, dst_stride
+            mov srcq, [execq + SwsOpExec.in0]
+            mov dstq, [execq + SwsOpExec.out0]
+            mov src_strideq, [execq + SwsOpExec.in_stride0]
+            mov dst_strideq, [execq + SwsOpExec.out_stride0]
+            VBROADCASTI128 m1, [shuffleq]
+            sub bxendd, bxd
+            sub yendd, yd
+            ; reuse now-unneeded regs
+    %define srcidxq execq
+            imul srcidxq, bxendq, -%1
+%if %1 = %2
+    %define dstidxq srcidxq
+%else
+    %define dstidxq shuffleq ; no longer needed reg
+            imul dstidxq, bxendq, -%2
+%endif
+            sub srcq, srcidxq
+            sub dstq, dstidxq
+.loop:
+    %if %1 <= 4
+            movd m0, [srcq + srcidxq]
+    %elif %1 <= 8
+            movq m0, [srcq + srcidxq]
+    %else
+            movu m0, [srcq + srcidxq]
+    %endif
+            pshufb m0, m1
+            movu [dstq + dstidxq], m0
+            add srcidxq, %1
+IF %1 != %2,add dstidxq, %2
+            jnz .loop
+            add srcq, src_strideq
+            add dstq, dst_strideq
+            imul srcidxq, bxendq, -%1
+IF %1 != %2,imul dstidxq, bxendq, -%2
+            dec yendd
+            jnz .loop
+            RET
+%endmacro
+
+INIT_XMM sse4
+packed_shuffle  5, 15 ;  8 -> 24
+packed_shuffle  4, 16 ;  8 -> 32, 16 -> 64
+packed_shuffle  2, 12 ;  8 -> 48
+packed_shuffle 10, 15 ; 16 -> 24
+packed_shuffle  8, 16 ; 16 -> 32, 32 -> 64
+packed_shuffle  4, 12 ; 16 -> 48
+packed_shuffle 15, 15 ; 24 -> 24
+packed_shuffle 12, 16 ; 24 -> 32
+packed_shuffle  6, 12 ; 24 -> 48
+packed_shuffle 16, 12 ; 32 -> 24, 64 -> 48
+packed_shuffle 16, 16 ; 32 -> 32, 64 -> 64
+packed_shuffle  8, 12 ; 32 -> 48
+packed_shuffle 12, 12 ; 48 -> 48
+
+INIT_YMM avx2
+packed_shuffle 32, 32
+
+INIT_ZMM avx512
+packed_shuffle 64, 64
+
+;---------------------------------------------------------
+; Planar reads / writes
+
+%macro read_planar 1 ; elems
+op read_planar%1
+            movu mx, [in0q]
+IF %1 > 1,  movu my, [in1q]
+IF %1 > 2,  movu mz, [in2q]
+IF %1 > 3,  movu mw, [in3q]
+%if V2
+            movu mx2, [in0q + mmsize]
+IF %1 > 1,  movu my2, [in1q + mmsize]
+IF %1 > 2,  movu mz2, [in2q + mmsize]
+IF %1 > 3,  movu mw2, [in3q + mmsize]
+%endif
+            LOAD_CONT tmp0q
+            add in0q, mmsize * (1 + V2)
+IF %1 > 1,  add in1q, mmsize * (1 + V2)
+IF %1 > 2,  add in2q, mmsize * (1 + V2)
+IF %1 > 3,  add in3q, mmsize * (1 + V2)
+            CONTINUE tmp0q
+%endmacro
+
+%macro write_planar 1 ; elems
+op write_planar%1
+            LOAD_CONT tmp0q
+            movu [out0q], mx
+IF %1 > 1,  movu [out1q], my
+IF %1 > 2,  movu [out2q], mz
+IF %1 > 3,  movu [out3q], mw
+%if V2
+            movu [out0q + mmsize], mx2
+IF %1 > 1,  movu [out1q + mmsize], my2
+IF %1 > 2,  movu [out2q + mmsize], mz2
+IF %1 > 3,  movu [out3q + mmsize], mw2
+%endif
+            add out0q, mmsize * (1 + V2)
+IF %1 > 1,  add out1q, mmsize * (1 + V2)
+IF %1 > 2,  add out2q, mmsize * (1 + V2)
+IF %1 > 3,  add out3q, mmsize * (1 + V2)
+            FINISH tmp0q
+%endmacro
+
+%macro read_packed2 1 ; depth
+op read%1_packed2
+            movu m8,  [in0q + 0*mmsize]
+            movu m9,  [in0q + 1*mmsize]
+    IF V2,  movu m10, [in0q + 2*mmsize]
+    IF V2,  movu m11, [in0q + 3*mmsize]
+IF %1 < 32, VBROADCASTI128 m12, [read%1_unpack2]
+            LOAD_CONT tmp0q
+            add in0q, mmsize * (2 + V2 * 2)
+%if %1 == 32
+            shufps m8, m8, q3120
+            shufps m9, m9, q3120
+    IF V2,  shufps m10, m10, q3120
+    IF V2,  shufps m11, m11, q3120
+%else
+            pshufb m8, m12              ; { X0 Y0 | X1 Y1 }
+            pshufb m9, m12              ; { X2 Y2 | X3 Y3 }
+    IF V2,  pshufb m10, m12
+    IF V2,  pshufb m11, m12
+%endif
+            unpcklpd mx, m8, m9         ; { X0 X2 | X1 X3 }
+            unpckhpd my, m8, m9         ; { Y0 Y2 | Y1 Y3 }
+    IF V2,  unpcklpd mx2, m10, m11
+    IF V2,  unpckhpd my2, m10, m11
+%if avx_enabled
+            vpermq mx, mx, q3120       ; { X0 X1 | X2 X3 }
+            vpermq my, my, q3120       ; { Y0 Y1 | Y2 Y3 }
+    IF V2,  vpermq mx2, mx2, q3120
+    IF V2,  vpermq my2, my2, q3120
+%endif
+            CONTINUE tmp0q
+%endmacro
+
+%macro write_packed2 1 ; depth
+op write%1_packed2
+IF %1 < 32, VBROADCASTI128 m12, [write%1_pack2]
+            LOAD_CONT tmp0q
+%if avx_enabled
+            vpermq mx, mx, q3120       ; { X0 X2 | X1 X3 }
+            vpermq my, my, q3120       ; { Y0 Y2 | Y1 Y3 }
+    IF V2,  vpermq mx2, mx2, q3120
+    IF V2,  vpermq my2, my2, q3120
+%endif
+            unpcklpd m8, mx, my        ; { X0 Y0 | X1 Y1 }
+            unpckhpd m9, mx, my        ; { X2 Y2 | X3 Y3 }
+    IF V2,  unpcklpd m10, mx2, my2
+    IF V2,  unpckhpd m11, mx2, my2
+%if %1 == 32
+            shufps m8, m8, q3120
+            shufps m9, m9, q3120
+    IF V2,  shufps m10, m10, q3120
+    IF V2,  shufps m11, m11, q3120
+%else
+            pshufb m8, m12
+            pshufb m9, m12
+    IF V2,  pshufb m10, m12
+    IF V2,  pshufb m11, m12
+%endif
+            movu [out0q + 0*mmsize], m8
+            movu [out0q + 1*mmsize], m9
+IF V2,      movu [out0q + 2*mmsize], m10
+IF V2,      movu [out0q + 3*mmsize], m11
+            add out0q, mmsize * (2 + V2 * 2)
+            FINISH tmp0q
+%endmacro
+
+; helper macro reused for both 3 and 4 component packed reads
+%macro read_packed_inner 7 ; x, y, z, w, addr, num, depth
+            movu xm8,  [%5 + 0  * %6]
+            movu xm9,  [%5 + 4  * %6]
+            movu xm10, [%5 + 8  * %6]
+            movu xm11, [%5 + 12 * %6]
+    %if avx_enabled
+            vinserti128 m8,  m8,  [%5 + 16 * %6], 1
+            vinserti128 m9,  m9,  [%5 + 20 * %6], 1
+            vinserti128 m10, m10, [%5 + 24 * %6], 1
+            vinserti128 m11, m11, [%5 + 28 * %6], 1
+    %endif
+    %if %7 == 32
+            mova %1, m8
+            mova %2, m9
+            mova %3, m10
+            mova %4, m11
+    %else
+            pshufb %1, m8,  m12         ; { X0 Y0 Z0 W0 | X4 Y4 Z4 W4 }
+            pshufb %2, m9,  m12         ; { X1 Y1 Z1 W1 | X5 Y5 Z5 W5 }
+            pshufb %3, m10, m12         ; { X2 Y2 Z2 W2 | X6 Y6 Z6 W6 }
+            pshufb %4, m11, m12         ; { X3 Y3 Z3 W3 | X7 Y7 Z7 W7 }
+    %endif
+            punpckldq m8,  %1, %2       ; { X0 X1 Y0 Y1 | X4 X5 Y4 Y5 }
+            punpckldq m9,  %3, %4       ; { X2 X3 Y2 Y3 | X6 X7 Y6 Y7 }
+            punpckhdq m10, %1, %2       ; { Z0 Z1 W0 W1 | Z4 Z5 W4 W5 }
+            punpckhdq m11, %3, %4       ; { Z2 Z3 W2 W3 | Z6 Z7 W6 W7 }
+            punpcklqdq %1, m8, m9       ; { X0 X1 X2 X3 | X4 X5 X6 X7 }
+            punpckhqdq %2, m8, m9       ; { Y0 Y1 Y2 Y3 | Y4 Y5 Y6 Y7 }
+            punpcklqdq %3, m10, m11     ; { Z0 Z1 Z2 Z3 | Z4 Z5 Z6 Z7 }
+IF %6 > 3,  punpckhqdq %4, m10, m11     ; { W0 W1 W2 W3 | W4 W5 W6 W7 }
+%endmacro
+
+%macro read_packed 2 ; num, depth
+op read%2_packed%1
+IF %2 < 32, VBROADCASTI128 m12, [read%2_unpack%1]
+            LOAD_CONT tmp0q
+            read_packed_inner mx, my, mz, mw, in0q, %1, %2
+IF1 V2,     read_packed_inner mx2, my2, mz2, mw2, in0q + %1 * mmsize, %1, %2
+            add in0q, %1 * mmsize * (1 + V2)
+            CONTINUE tmp0q
+%endmacro
+
+%macro write_packed_inner 7 ; x, y, z, w, addr, num, depth
+        punpckldq m8,  %1, %2       ; { X0 Y0 X1 Y1 | X4 Y4 X5 Y5 }
+        punpckldq m9,  %3, %4       ; { Z0 W0 Z1 W1 | Z4 W4 Z5 W5 }
+        punpckhdq m10, %1, %2       ; { X2 Y2 X3 Y3 | X6 Y6 X7 Y7 }
+        punpckhdq m11, %3, %4       ; { Z2 W2 Z3 W3 | Z6 W6 Z7 W7 }
+        punpcklqdq %1, m8, m9       ; { X0 Y0 Z0 W0 | X4 Y4 Z4 W4 }
+        punpckhqdq %2, m8, m9       ; { X1 Y1 Z1 W1 | X5 Y5 Z5 W5 }
+        punpcklqdq %3, m10, m11     ; { X2 Y2 Z2 W2 | X6 Y6 Z6 W6 }
+        punpckhqdq %4, m10, m11     ; { X3 Y3 Z3 W3 | X7 Y7 Z7 W7 }
+    %if %7 == 32
+        mova m8,  %1
+        mova m9,  %2
+        mova m10, %3
+        mova m11, %4
+    %else
+        pshufb m8,  %1, m12
+        pshufb m9,  %2, m12
+        pshufb m10, %3, m12
+        pshufb m11, %4, m12
+    %endif
+        movu [%5 +  0*%6], xm8
+        movu [%5 +  4*%6], xm9
+        movu [%5 +  8*%6], xm10
+        movu [%5 + 12*%6], xm11
+    %if avx_enabled
+        vextracti128 [%5 + 16*%6], m8, 1
+        vextracti128 [%5 + 20*%6], m9, 1
+        vextracti128 [%5 + 24*%6], m10, 1
+        vextracti128 [%5 + 28*%6], m11, 1
+    %endif
+%endmacro
+
+%macro write_packed 2 ; num, depth
+op write%2_packed%1
+IF %2 < 32, VBROADCASTI128 m12, [write%2_pack%1]
+            LOAD_CONT tmp0q
+            write_packed_inner mx, my, mz, mw, out0q, %1, %2
+IF1 V2,     write_packed_inner mx2, my2, mz2, mw2, out0q + %1 * mmsize, %1, %2
+            add out0q, %1 * mmsize * (1 + V2)
+            FINISH tmp0q
+%endmacro
+
+%macro rw_packed 1 ; depth
+        read_packed2 %1
+        read_packed 3, %1
+        read_packed 4, %1
+        write_packed2 %1
+        write_packed 3, %1
+        write_packed 4, %1
+%endmacro
+
+%macro read_nibbles 0
+op read_nibbles1
+%if avx_enabled
+        movu xmx,  [in0q]
+IF V2,  movu xmx2, [in0q + 16]
+%else
+        movq xmx,  [in0q]
+IF V2,  movq xmx2, [in0q + 8]
+%endif
+        VBROADCASTI128 m8, [mask4]
+        LOAD_CONT tmp0q
+        add in0q, (mmsize >> 1) * (1 + V2)
+        pmovzxbw mx, xmx
+IF V2,  pmovzxbw mx2, xmx2
+        psllw my, mx, 8
+IF V2,  psllw my2, mx2, 8
+        psrlw mx, 4
+IF V2,  psrlw mx2, 4
+        pand my, m8
+IF V2,  pand my2, m8
+        por mx, my
+IF V2,  por mx2, my2
+        CONTINUE tmp0q
+%endmacro
+
+%macro read_bits 0
+op read_bits1
+%if avx_enabled
+        vpbroadcastd mx,  [in0q]
+IF V2,  vpbroadcastd mx2, [in0q + 4]
+%else
+        movd mx, [in0q]
+IF V2,  movd mx2, [in0q + 2]
+%endif
+        mova m8, [bits_shuf]
+        VBROADCASTI128 m9,  [bits_mask]
+        VBROADCASTI128 m10, [mask1]
+        LOAD_CONT tmp0q
+        add in0q, (mmsize >> 3) * (1 + V2)
+        pshufb mx,  m8
+IF V2,  pshufb mx2, m8
+        pand mx,  m9
+IF V2,  pand mx2, m9
+        pcmpeqb mx,  m9
+IF V2,  pcmpeqb mx2, m9
+        pand mx,  m10
+IF V2,  pand mx2, m10
+        CONTINUE tmp0q
+%endmacro
+
+; TODO: write_nibbles
+
+%macro write_bits 0
+op write_bits1
+        VBROADCASTI128 m8, [bits_reverse]
+        psllw mx,  7
+IF V2,  psllw mx2, 7
+        pshufb mx,  m8
+IF V2,  pshufb mx2, m8
+        pmovmskb tmp0d, mx
+IF V2,  pmovmskb tmp1d, mx2
+%if avx_enabled
+        mov [out0q],     tmp0d
+IF V2,  mov [out0q + 4], tmp1d
+%else
+        mov [out0q],     tmp0d
+IF V2,  mov [out0q + 2], tmp1d
+%endif
+        LOAD_CONT tmp0q
+        add out0q, (mmsize >> 3) * (1 + V2)
+        FINISH tmp0q
+%endmacro
+
+;--------------------------
+; Pixel packing / unpacking
+
+%macro pack_generic 3-4 0 ; x, y, z, w
+op pack_%1%2%3%4
+        ; pslld works for all sizes because the input should not overflow
+IF %2,  pslld mx, %4+%3+%2
+IF %3,  pslld my, %4+%3
+IF %4,  pslld mz, %4
+IF %2,  por mx, my
+IF %3,  por mx, mz
+IF %4,  por mx, mw
+    %if V2
+IF %2,  pslld mx2, %4+%3+%2
+IF %3,  pslld my2, %4+%3
+IF %4,  pslld mz2, %4
+IF %2,  por mx2, my2
+IF %3,  por mx2, mz2
+IF %4,  por mx2, mw2
+    %endif
+        CONTINUE
+%endmacro
+
+%macro unpack 5-6 0 ; type, bits, x, y, z, w
+op unpack_%3%4%5%6
+        ; clear high bits by shifting left
+IF %6,  vpsll%1 mw, mx, %2 - (%6)
+IF %5,  vpsll%1 mz, mx, %2 - (%6+%5)
+IF %4,  vpsll%1 my, mx, %2 - (%6+%5+%4)
+        psrl%1 mx, %4+%5+%6
+IF %4,  psrl%1 my, %2 - %4
+IF %5,  psrl%1 mz, %2 - %5
+IF %6,  psrl%1 mw, %2 - %6
+    %if V2
+IF %6,  vpsll%1 mw2, mx2, %2 - (%6)
+IF %5,  vpsll%1 mz2, mx2, %2 - (%6+%5)
+IF %4,  vpsll%1 my2, mx2, %2 - (%6+%5+%4)
+        psrl%1 mx2, %4+%5+%6
+IF %4,  psrl%1 my2, %2 - %4
+IF %5,  psrl%1 mz2, %2 - %5
+IF %6,  psrl%1 mw2, %2 - %6
+    %endif
+        CONTINUE
+%endmacro
+
+%macro unpack8 3 ; x, y, z
+op unpack_%1%2%3 %+ 0
+        pand mz, mx, [mask%3]
+        psrld my, mx, %3
+        psrld mx, %3+%2
+        pand my, [mask%2]
+        pand mx, [mask%1]
+    %if V2
+        pand mz2, mx2, [mask%3]
+        psrld my2, mx2, %3
+        psrld mx2, %3+%2
+        pand my2, [mask%2]
+        pand mx2, [mask%1]
+    %endif
+        CONTINUE
+%endmacro
+
+;---------------------------------------------------------
+; Generic byte order shuffle (packed swizzle, endian, etc)
+
+%macro shuffle 0
+op shuffle
+        VBROADCASTI128 m8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   pshufb mx, m8
+IF Y,   pshufb my, m8
+IF Z,   pshufb mz, m8
+IF W,   pshufb mw, m8
+%if V2
+IF X,   pshufb mx2, m8
+IF Y,   pshufb my2, m8
+IF Z,   pshufb mz2, m8
+IF W,   pshufb mw2, m8
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Clearing
+
+%macro clear_alpha 3 ; idx, vreg, vreg2
+op clear_alpha%1
+        LOAD_CONT tmp0q
+        pcmpeqb %2, %2
+IF V2,  mova %3, %2
+        CONTINUE tmp0q
+%endmacro
+
+%macro clear_zero 3 ; idx, vreg, vreg2
+op clear_zero%1
+        LOAD_CONT tmp0q
+        pxor %2, %2
+IF V2,  mova %3, %2
+        CONTINUE tmp0q
+%endmacro
+
+; note: the pattern is inverted for these functions; i.e. X=1 implies that we
+; *keep* the X component, not that we clear it
+%macro clear_generic 0
+op clear
+            LOAD_CONT tmp0q
+%if avx_enabled
+    IF !X,  vpbroadcastd mx, [implq + SwsOpImpl.priv + 0]
+    IF !Y,  vpbroadcastd my, [implq + SwsOpImpl.priv + 4]
+    IF !Z,  vpbroadcastd mz, [implq + SwsOpImpl.priv + 8]
+    IF !W,  vpbroadcastd mw, [implq + SwsOpImpl.priv + 12]
+%else ; !avx_enabled
+    IF !X,  movd mx, [implq + SwsOpImpl.priv + 0]
+    IF !Y,  movd my, [implq + SwsOpImpl.priv + 4]
+    IF !Z,  movd mz, [implq + SwsOpImpl.priv + 8]
+    IF !W,  movd mw, [implq + SwsOpImpl.priv + 12]
+    IF !X,  pshufd mx, mx, 0
+    IF !Y,  pshufd my, my, 0
+    IF !Z,  pshufd mz, mz, 0
+    IF !W,  pshufd mw, mw, 0
+%endif
+%if V2
+    IF !X,  mova mx2, mx
+    IF !Y,  mova my2, my
+    IF !Z,  mova mz2, mz
+    IF !W,  mova mw2, mw
+%endif
+            CONTINUE tmp0q
+%endmacro
+
+%macro clear_funcs 0
+        decl_pattern 1, 1, 1, 0, clear_generic
+        decl_pattern 0, 1, 1, 1, clear_generic
+        decl_pattern 0, 0, 1, 1, clear_generic
+        decl_pattern 1, 0, 0, 1, clear_generic
+        decl_pattern 1, 1, 0, 0, clear_generic
+        decl_pattern 0, 1, 0, 1, clear_generic
+        decl_pattern 1, 0, 1, 0, clear_generic
+        decl_pattern 1, 0, 0, 0, clear_generic
+        decl_pattern 0, 1, 0, 0, clear_generic
+        decl_pattern 0, 0, 1, 0, clear_generic
+%endmacro
+
+;---------------------------------------------------------
+; Swizzling and duplicating
+
+; mA := mB, mB := mC, ... mX := mA across both halves
+%macro vrotate 2-* ; A, B, C, ...
+    %rep %0
+        %assign rot_a %1 + 4
+        %assign rot_b %2 + 4
+        mova m%1, m%2
+        IF V2, mova m%[rot_a], m%[rot_b]
+    %rotate 1
+    %endrep
+    %undef rot_a
+    %undef rot_b
+%endmacro
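+
+; For example, `vrotate 8, 0, 3, 2, 1` saves m0 into the scratch register m8,
+; then performs m0 := m3, m3 := m2, m2 := m1 and finally m1 := m8 (the old
+; m0); the first argument is always consumed as a temporary for the cycle.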
+
+%macro swizzle_funcs 0
+op swizzle_3012
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 2, 1
+    CONTINUE tmp0q
+
+op swizzle_3021
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 1
+    CONTINUE tmp0q
+
+op swizzle_2103
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2
+    CONTINUE tmp0q
+
+op swizzle_3210
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3
+    vrotate 8, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_3102
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 2
+    CONTINUE tmp0q
+
+op swizzle_3201
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_1203
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_1023
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1
+    CONTINUE tmp0q
+
+op swizzle_2013
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2, 1
+    CONTINUE tmp0q
+
+op swizzle_2310
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2, 1, 3
+    CONTINUE tmp0q
+
+op swizzle_2130
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 2, 3
+    CONTINUE tmp0q
+
+op swizzle_1230
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1, 2, 3
+    CONTINUE tmp0q
+
+op swizzle_1320
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 1, 3
+    CONTINUE tmp0q
+
+op swizzle_0213
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 2
+    CONTINUE tmp0q
+
+op swizzle_0231
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 2, 3
+    CONTINUE tmp0q
+
+op swizzle_0312
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 3, 2
+    CONTINUE tmp0q
+
+op swizzle_3120
+    LOAD_CONT tmp0q
+    vrotate 8, 0, 3
+    CONTINUE tmp0q
+
+op swizzle_0321
+    LOAD_CONT tmp0q
+    vrotate 8, 1, 3
+    CONTINUE tmp0q
+
+op swizzle_0003
+    LOAD_CONT tmp0q
+    mova my, mx
+    mova mz, mx
+%if V2
+    mova my2, mx2
+    mova mz2, mx2
+%endif
+    CONTINUE tmp0q
+
+op swizzle_0001
+    LOAD_CONT tmp0q
+    mova mw, my
+    mova mz, mx
+    mova my, mx
+%if V2
+    mova mw2, my2
+    mova mz2, mx2
+    mova my2, mx2
+%endif
+    CONTINUE tmp0q
+
+op swizzle_3000
+    LOAD_CONT tmp0q
+    mova my, mx
+    mova mz, mx
+    mova mx, mw
+    mova mw, my
+%if V2
+    mova my2, mx2
+    mova mz2, mx2
+    mova mx2, mw2
+    mova mw2, my2
+%endif
+    CONTINUE tmp0q
+
+op swizzle_1000
+    LOAD_CONT tmp0q
+    mova mz, mx
+    mova mw, mx
+    mova mx, my
+    mova my, mz
+%if V2
+    mova mz2, mx2
+    mova mw2, mx2
+    mova mx2, my2
+    mova my2, mz2
+%endif
+    CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Pixel type conversions
+
+%macro conv8to16 1 ; type
+op %1_U8_U16
+            LOAD_CONT tmp0q
+%if V2
+    %if avx_enabled
+    IF X,   vextracti128 xmx2, mx, 1
+    IF Y,   vextracti128 xmy2, my, 1
+    IF Z,   vextracti128 xmz2, mz, 1
+    IF W,   vextracti128 xmw2, mw, 1
+    %else
+    IF X,   psrldq xmx2, mx, 8
+    IF Y,   psrldq xmy2, my, 8
+    IF Z,   psrldq xmz2, mz, 8
+    IF W,   psrldq xmw2, mw, 8
+    %endif
+    IF X,   pmovzxbw mx2, xmx2
+    IF Y,   pmovzxbw my2, xmy2
+    IF Z,   pmovzxbw mz2, xmz2
+    IF W,   pmovzxbw mw2, xmw2
+%endif ; V2
+    IF X,   pmovzxbw mx, xmx
+    IF Y,   pmovzxbw my, xmy
+    IF Z,   pmovzxbw mz, xmz
+    IF W,   pmovzxbw mw, xmw
+
+%ifidn %1, expand
+            VBROADCASTI128 m8, [expand16_shuf]
+    %if V2
+    IF X,   pshufb mx2, m8
+    IF Y,   pshufb my2, m8
+    IF Z,   pshufb mz2, m8
+    IF W,   pshufb mw2, m8
+    %endif
+    IF X,   pshufb mx, m8
+    IF Y,   pshufb my, m8
+    IF Z,   pshufb mz, m8
+    IF W,   pshufb mw, m8
+%endif ; expand
+            CONTINUE tmp0q
+%endmacro
+
+%macro conv16to8 0
+op convert_U16_U8
+        LOAD_CONT tmp0q
+%if V2
+        ; this code technically works for the !V2 case as well, but is slower
+IF X,   packuswb mx, mx2
+IF Y,   packuswb my, my2
+IF Z,   packuswb mz, mz2
+IF W,   packuswb mw, mw2
+IF X,   vpermq mx, mx, q3120
+IF Y,   vpermq my, my, q3120
+IF Z,   vpermq mz, mz, q3120
+IF W,   vpermq mw, mw, q3120
+%else
+IF X,   vextracti128  xm8, mx, 1
+IF Y,   vextracti128  xm9, my, 1
+IF Z,   vextracti128 xm10, mz, 1
+IF W,   vextracti128 xm11, mw, 1
+        vzeroupper
+IF X,   packuswb xmx, xm8
+IF Y,   packuswb xmy, xm9
+IF Z,   packuswb xmz, xm10
+IF W,   packuswb xmw, xm11
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv8to32 1 ; type
+op %1_U8_U32
+        LOAD_CONT tmp0q
+IF X,   psrldq xmx2, xmx, 8
+IF Y,   psrldq xmy2, xmy, 8
+IF Z,   psrldq xmz2, xmz, 8
+IF W,   psrldq xmw2, xmw, 8
+IF X,   pmovzxbd mx, xmx
+IF Y,   pmovzxbd my, xmy
+IF Z,   pmovzxbd mz, xmz
+IF W,   pmovzxbd mw, xmw
+IF X,   pmovzxbd mx2, xmx2
+IF Y,   pmovzxbd my2, xmy2
+IF Z,   pmovzxbd mz2, xmz2
+IF W,   pmovzxbd mw2, xmw2
+%ifidn %1, expand
+        VBROADCASTI128 m8, [expand32_shuf]
+IF X,   pshufb mx, m8
+IF Y,   pshufb my, m8
+IF Z,   pshufb mz, m8
+IF W,   pshufb mw, m8
+IF X,   pshufb mx2, m8
+IF Y,   pshufb my2, m8
+IF Z,   pshufb mz2, m8
+IF W,   pshufb mw2, m8
+%endif ; expand
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32to8 0
+op convert_U32_U8
+        LOAD_CONT tmp0q
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+        vzeroupper
+IF X,   packuswb xmx, xmx2
+IF Y,   packuswb xmy, xmy2
+IF Z,   packuswb xmz, xmz2
+IF W,   packuswb xmw, xmw2
+IF X,   vpshufd xmx, xmx, q3120
+IF Y,   vpshufd xmy, xmy, q3120
+IF Z,   vpshufd xmz, xmz, q3120
+IF W,   vpshufd xmw, xmw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv16to32 0
+op convert_U16_U32
+        LOAD_CONT tmp0q
+IF X,   vextracti128 xmx2, mx, 1
+IF Y,   vextracti128 xmy2, my, 1
+IF Z,   vextracti128 xmz2, mz, 1
+IF W,   vextracti128 xmw2, mw, 1
+IF X,   pmovzxwd mx, xmx
+IF Y,   pmovzxwd my, xmy
+IF Z,   pmovzxwd mz, xmz
+IF W,   pmovzxwd mw, xmw
+IF X,   pmovzxwd mx2, xmx2
+IF Y,   pmovzxwd my2, xmy2
+IF Z,   pmovzxwd mz2, xmz2
+IF W,   pmovzxwd mw2, xmw2
+        CONTINUE tmp0q
+%endmacro
+
+%macro conv32to16 0
+op convert_U32_U16
+        LOAD_CONT tmp0q
+IF X,   packusdw mx, mx2
+IF Y,   packusdw my, my2
+IF Z,   packusdw mz, mz2
+IF W,   packusdw mw, mw2
+IF X,   vpermq mx, mx, q3120
+IF Y,   vpermq my, my, q3120
+IF Z,   vpermq mz, mz, q3120
+IF W,   vpermq mw, mw, q3120
+        CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Shifting
+
+%macro lshift16 0
+op lshift16
+        vmovq xm8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   psllw mx, xm8
+IF Y,   psllw my, xm8
+IF Z,   psllw mz, xm8
+IF W,   psllw mw, xm8
+%if V2
+IF X,   psllw mx2, xm8
+IF Y,   psllw my2, xm8
+IF Z,   psllw mz2, xm8
+IF W,   psllw mw2, xm8
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+%macro rshift16 0
+op rshift16
+        vmovq xm8, [implq + SwsOpImpl.priv]
+        LOAD_CONT tmp0q
+IF X,   psrlw mx, xm8
+IF Y,   psrlw my, xm8
+IF Z,   psrlw mz, xm8
+IF W,   psrlw mw, xm8
+%if V2
+IF X,   psrlw mx2, xm8
+IF Y,   psrlw my2, xm8
+IF Z,   psrlw mz2, xm8
+IF W,   psrlw mw2, xm8
+%endif
+        CONTINUE tmp0q
+%endmacro
+
+;---------------------------------------------------------
+; Macro instantiations for kernel functions
+
+%macro funcs_u8 0
+    read_planar 1
+    read_planar 2
+    read_planar 3
+    read_planar 4
+    write_planar 1
+    write_planar 2
+    write_planar 3
+    write_planar 4
+
+    rw_packed 8
+    read_nibbles
+    read_bits
+    write_bits
+
+    pack_generic 1, 2, 1
+    pack_generic 3, 3, 2
+    pack_generic 2, 3, 3
+    unpack8 1, 2, 1
+    unpack8 3, 3, 2
+    unpack8 2, 3, 3
+
+    clear_alpha 0, mx, mx2
+    clear_alpha 1, my, my2
+    clear_alpha 3, mw, mw2
+    clear_zero  0, mx, mx2
+    clear_zero  1, my, my2
+    clear_zero  3, mw, mw2
+    clear_funcs
+    swizzle_funcs
+
+    decl_common_patterns shuffle
+%endmacro
+
+%macro funcs_u16 0
+    rw_packed 16
+    pack_generic  4, 4, 4
+    pack_generic  5, 5, 5
+    pack_generic  5, 6, 5
+    unpack w, 16, 4, 4, 4
+    unpack w, 16, 5, 5, 5
+    unpack w, 16, 5, 6, 5
+    decl_common_patterns conv8to16 convert
+    decl_common_patterns conv8to16 expand
+    decl_common_patterns conv16to8
+    decl_common_patterns lshift16
+    decl_common_patterns rshift16
+%endmacro
+
+INIT_XMM sse4
+decl_v2 0, funcs_u8
+decl_v2 1, funcs_u8
+
+INIT_YMM avx2
+decl_v2 0, funcs_u8
+decl_v2 1, funcs_u8
+decl_v2 0, funcs_u16
+decl_v2 1, funcs_u16
+
+INIT_YMM avx2
+decl_v2 1, rw_packed 32
+decl_v2 1, pack_generic  10, 10, 10,  2
+decl_v2 1, pack_generic   2, 10, 10, 10
+decl_v2 1, unpack d, 32, 10, 10, 10,  2
+decl_v2 1, unpack d, 32,  2, 10, 10, 10
+decl_common_patterns conv8to32 convert
+decl_common_patterns conv8to32 expand
+decl_common_patterns conv32to8
+decl_common_patterns conv16to32
+decl_common_patterns conv32to16
-- 
2.49.0


* [FFmpeg-devel] [PATCH v3 15/17] tests/checkasm: add checkasm tests for swscale ops
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (13 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 14/17] swscale/x86: add SIMD backend Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  8:25   ` Martin Storsjö
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 16/17] swscale/format: add new format decode/encode logic Niklas Haas
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

Because of the lack of an external ABI on low-level kernels, we cannot
directly test internal functions. Instead, we construct a minimal op chain
consisting of a read, the op to be tested, and a write.

The bigger complication arises from the fact that the backend may generate
arbitrary internal state that needs to be passed back to the implementation,
which means we cannot directly call `func_ref` on the generated chain. To get
around this, we always compile the op chain twice: once using the backend to be
tested, and once using the reference C backend.

The actual entry point may also just be a shared wrapper, so we need to
be very careful to run checkasm_check_func() on a pseudo-pointer that will
actually be unique for each combination of backend and active CPU flags.
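
As a sketch of the idea (struct and field names abbreviated; see
libswscale/ops.h for the real definitions), such a minimal chain looks like:

    SwsOp ops[] = {
        { .op = SWS_OP_READ,  .type = SWS_PIXEL_U8, .rw = { .elems = 1 } },
        /* ... the single op under test ... */
        { .op = SWS_OP_WRITE, .type = SWS_PIXEL_U8, .rw = { .elems = 1 } },
        {0} /* terminator, as consumed by check_ops() */
    };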
---
 tests/checkasm/Makefile   |   8 +-
 tests/checkasm/checkasm.c |   1 +
 tests/checkasm/checkasm.h |   1 +
 tests/checkasm/sw_ops.c   | 776 ++++++++++++++++++++++++++++++++++++++
 4 files changed, 785 insertions(+), 1 deletion(-)
 create mode 100644 tests/checkasm/sw_ops.c

diff --git a/tests/checkasm/Makefile b/tests/checkasm/Makefile
index fabbf595b4..d38ec371df 100644
--- a/tests/checkasm/Makefile
+++ b/tests/checkasm/Makefile
@@ -66,7 +66,13 @@ AVFILTEROBJS-$(CONFIG_SOBEL_FILTER)      += vf_convolution.o
 CHECKASMOBJS-$(CONFIG_AVFILTER) += $(AVFILTEROBJS-yes)
 
 # swscale tests
-SWSCALEOBJS                             += sw_gbrp.o sw_range_convert.o sw_rgb.o sw_scale.o sw_yuv2rgb.o sw_yuv2yuv.o
+SWSCALEOBJS                             += sw_gbrp.o            \
+                                           sw_ops.o             \
+                                           sw_range_convert.o   \
+                                           sw_rgb.o             \
+                                           sw_scale.o           \
+                                           sw_yuv2rgb.o         \
+                                           sw_yuv2yuv.o
 
 CHECKASMOBJS-$(CONFIG_SWSCALE)  += $(SWSCALEOBJS)
 
diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
index f393a0cb96..11bd5668cf 100644
--- a/tests/checkasm/checkasm.c
+++ b/tests/checkasm/checkasm.c
@@ -298,6 +298,7 @@ static const struct {
     { "sw_scale", checkasm_check_sw_scale },
     { "sw_yuv2rgb", checkasm_check_sw_yuv2rgb },
     { "sw_yuv2yuv", checkasm_check_sw_yuv2yuv },
+    { "sw_ops", checkasm_check_sw_ops },
 #endif
 #if CONFIG_AVUTIL
         { "aes",       checkasm_check_aes },
diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
index ec01bd6207..d69f4cb835 100644
--- a/tests/checkasm/checkasm.h
+++ b/tests/checkasm/checkasm.h
@@ -132,6 +132,7 @@ void checkasm_check_sw_rgb(void);
 void checkasm_check_sw_scale(void);
 void checkasm_check_sw_yuv2rgb(void);
 void checkasm_check_sw_yuv2yuv(void);
+void checkasm_check_sw_ops(void);
 void checkasm_check_takdsp(void);
 void checkasm_check_utvideodsp(void);
 void checkasm_check_v210dec(void);
diff --git a/tests/checkasm/sw_ops.c b/tests/checkasm/sw_ops.c
new file mode 100644
index 0000000000..c8cba96879
--- /dev/null
+++ b/tests/checkasm/sw_ops.c
@@ -0,0 +1,776 @@
+/**
+ * Copyright (C) 2025 Niklas Haas
+ *
+ * This file is part of FFmpeg.
+ *
+ * FFmpeg is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * FFmpeg is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License along
+ * with FFmpeg; if not, write to the Free Software Foundation, Inc.,
+ * 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
+ */
+
+#include <string.h>
+
+#include "libavutil/avassert.h"
+#include "libavutil/mem_internal.h"
+#include "libavutil/refstruct.h"
+
+#include "libswscale/ops.h"
+#include "libswscale/ops_internal.h"
+
+#include "checkasm.h"
+
+enum {
+    LINES  = 2,
+    PLANES = 4,
+    PIXELS = 64,
+};
+
+enum {
+    U8  = SWS_PIXEL_U8,
+    U16 = SWS_PIXEL_U16,
+    U32 = SWS_PIXEL_U32,
+    F32 = SWS_PIXEL_F32,
+};
+
+#define FMT(fmt, ...) tprintf((char[256]) {0}, 256, fmt, __VA_ARGS__)
+static const char *tprintf(char buf[], size_t size, const char *fmt, ...)
+{
+    va_list ap;
+    va_start(ap, fmt);
+    vsnprintf(buf, size, fmt, ap);
+    va_end(ap);
+    return buf;
+}
+
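+/* Number of bits of memory touched per pixel by a read/write op. A nonzero
+ * rw.frac shrinks this below one byte, e.g. frac=3 for 1-bit mono formats. */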
+static int rw_pixel_bits(const SwsOp *op)
+{
+    const int elems = op->rw.packed ? op->rw.elems : 1;
+    const int size  = ff_sws_pixel_type_size(op->type);
+    const int bits  = 8 >> op->rw.frac;
+    av_assert1(bits >= 1);
+    return elems * size * bits;
+}
+
+static float rndf(void)
+{
+    union { uint32_t u; float f; } x;
+    do {
+        x.u = rnd();
+    } while (!isnormal(x.f));
+    return x.f;
+}
+
+static void fill32f(float *line, int num, unsigned range)
+{
+    const float scale = (float) range / UINT32_MAX;
+    for (int i = 0; i < num; i++)
+        line[i] = range ? scale * rnd() : rndf();
+}
+
+static void fill32(uint32_t *line, int num, unsigned range)
+{
+    for (int i = 0; i < num; i++)
+        line[i] = range ? rnd() % (range + 1) : rnd();
+}
+
+static void fill16(uint16_t *line, int num, unsigned range)
+{
+    if (!range) {
+        fill32((uint32_t *) line, AV_CEIL_RSHIFT(num, 1), 0);
+    } else {
+        for (int i = 0; i < num; i++)
+            line[i] = rnd() % (range + 1);
+    }
+}
+
+static void fill8(uint8_t *line, int num, unsigned range)
+{
+    if (!range) {
+        fill32((uint32_t *) line, AV_CEIL_RSHIFT(num, 2), 0);
+    } else {
+        for (int i = 0; i < num; i++)
+            line[i] = rnd() % (range + 1);
+    }
+}
+
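+/**
+ * Compile the given op list with the reference C backend and with the first
+ * backend that accepts it (which may again be the C backend), run both over
+ * identical randomized input planes, and compare the written planes. The
+ * optional `ranges` array limits the random input values per plane; a zero
+ * range leaves the input bits fully random.
+ */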
+static void check_ops(const char *report, const unsigned ranges[PLANES],
+                      const SwsOp *ops)
+{
+    SwsContext *ctx = sws_alloc_context();
+    SwsCompiledOp comp_ref = {0}, comp_new = {0};
+    const SwsOpBackend *backend_new = NULL;
+    SwsOpList oplist = { .ops = (SwsOp *) ops };
+    const SwsOp *read_op, *write_op;
+    static const unsigned def_ranges[4] = {0};
+    if (!ranges)
+        ranges = def_ranges;
+
+    declare_func(void, const SwsOpExec *, const void *, int bx, int y, int bx_end, int y_end);
+
+    DECLARE_ALIGNED_64(char, src0)[PLANES][LINES][PIXELS * sizeof(uint32_t[4])];
+    DECLARE_ALIGNED_64(char, src1)[PLANES][LINES][PIXELS * sizeof(uint32_t[4])];
+    DECLARE_ALIGNED_64(char, dst0)[PLANES][LINES][PIXELS * sizeof(uint32_t[4])];
+    DECLARE_ALIGNED_64(char, dst1)[PLANES][LINES][PIXELS * sizeof(uint32_t[4])];
+
+    if (!ctx)
+        return;
+    ctx->flags = SWS_BITEXACT;
+
+    read_op = &ops[0];
+    for (oplist.num_ops = 0; ops[oplist.num_ops].op; oplist.num_ops++)
+        write_op = &ops[oplist.num_ops];
+
+    const int read_size  = PIXELS * rw_pixel_bits(read_op)  >> 3;
+    const int write_size = PIXELS * rw_pixel_bits(write_op) >> 3;
+
+    for (int p = 0; p < PLANES; p++) {
+        void *plane = src0[p];
+        switch (read_op->type) {
+        case U8:    fill8(plane, sizeof(src0[p]) /  sizeof(uint8_t), ranges[p]); break;
+        case U16:  fill16(plane, sizeof(src0[p]) / sizeof(uint16_t), ranges[p]); break;
+        case U32:  fill32(plane, sizeof(src0[p]) / sizeof(uint32_t), ranges[p]); break;
+        case F32: fill32f(plane, sizeof(src0[p]) / sizeof(uint32_t), ranges[p]); break;
+        }
+    }
+
+    memcpy(src1, src0, sizeof(src0));
+    memset(dst0, 0, sizeof(dst0));
+    memset(dst1, 0, sizeof(dst1));
+
+    /* Compile `ops` using both the asm and c backends */
+    for (int n = 0; ff_sws_op_backends[n]; n++) {
+        const SwsOpBackend *backend = ff_sws_op_backends[n];
+        const bool is_ref = !strcmp(backend->name, "c");
+        if (is_ref || !comp_new.func) {
+            SwsCompiledOp comp;
+            int ret = ff_sws_ops_compile_backend(ctx, backend, &oplist, &comp);
+            if (ret == AVERROR(ENOTSUP))
+                continue;
+            else if (ret < 0)
+                fail();
+            else if (PIXELS % comp.block_size != 0)
+                fail();
+
+            if (is_ref)
+                comp_ref = comp;
+            if (!comp_new.func) {
+                comp_new = comp;
+                backend_new = backend;
+            }
+        }
+    }
+
+    av_assert0(comp_ref.func && comp_new.func);
+
+    SwsOpExec exec = {0};
+    exec.width = PIXELS;
+    exec.height = exec.slice_h = 1;
+    for (int i = 0; i < PLANES; i++) {
+        exec.in_stride[i]  = sizeof(src0[i][0]);
+        exec.out_stride[i] = sizeof(dst0[i][0]);
+        exec.in_bump[i]  = exec.in_stride[i]  - read_size;
+        exec.out_bump[i] = exec.out_stride[i] - write_size;
+    }
+
+    /**
+     * Don't use check_func() because the actual function pointer may be a
+     * wrapper shared by multiple implementations. Instead, take a hash of both
+     * the backend pointer and the active CPU flags.
+     */
+    uintptr_t id = (uintptr_t) backend_new;
+    id ^= (id << 6) + (id >> 2) + 0x9e3779b97f4a7c15 + comp_new.cpu_flags;
+
+    checkasm_save_context();
+    if (checkasm_check_func((void *) id, "%s", report)) {
+        func_new = comp_new.func;
+        func_ref = comp_ref.func;
+
+        exec.block_size_in  = comp_ref.block_size * rw_pixel_bits(read_op)  >> 3;
+        exec.block_size_out = comp_ref.block_size * rw_pixel_bits(write_op) >> 3;
+        for (int i = 0; i < PLANES; i++) {
+            exec.in[i]  = (void *) src0[i];
+            exec.out[i] = (void *) dst0[i];
+        }
+        call_ref(&exec, comp_ref.priv, 0, 0, PIXELS / comp_ref.block_size, LINES);
+
+        exec.block_size_in  = comp_new.block_size * rw_pixel_bits(read_op)  >> 3;
+        exec.block_size_out = comp_new.block_size * rw_pixel_bits(write_op) >> 3;
+        for (int i = 0; i < PLANES; i++) {
+            exec.in[i]  = (void *) src1[i];
+            exec.out[i] = (void *) dst1[i];
+        }
+        call_new(&exec, comp_new.priv, 0, 0, PIXELS / comp_new.block_size, LINES);
+
+        for (int i = 0; i < PLANES; i++) {
+            const char *name = FMT("%s[%d]", report, i);
+            const int stride = sizeof(dst0[i][0]);
+
+            switch (write_op->type) {
+            case U8:
+                checkasm_check(uint8_t, (void *) dst0[i], stride,
+                                        (void *) dst1[i], stride,
+                                        write_size, LINES, name);
+                break;
+            case U16:
+                checkasm_check(uint16_t, (void *) dst0[i], stride,
+                                         (void *) dst1[i], stride,
+                                         write_size >> 1, LINES, name);
+                break;
+            case U32:
+                checkasm_check(uint32_t, (void *) dst0[i], stride,
+                                         (void *) dst1[i], stride,
+                                         write_size >> 2, LINES, name);
+                break;
+            case F32:
+                checkasm_check(float_ulp, (void *) dst0[i], stride,
+                                          (void *) dst1[i], stride,
+                                          write_size >> 2, LINES, name, 0);
+                break;
+            }
+
+            if (write_op->rw.packed)
+                break;
+        }
+
+        bench_new(&exec, comp_new.priv, 0, 0, PIXELS / comp_new.block_size, LINES);
+    }
+
+    if (comp_new.func != comp_ref.func && comp_new.free)
+        comp_new.free(comp_new.priv);
+    if (comp_ref.free)
+        comp_ref.free(comp_ref.priv);
+    sws_free_context(&ctx);
+}
+
+#define CHECK_RANGES(NAME, RANGES, N_IN, N_OUT, IN, OUT, ...)                   \
+  do {                                                                          \
+      check_ops(NAME, RANGES, (SwsOp[]) {                                       \
+        {                                                                       \
+            .op = SWS_OP_READ,                                                  \
+            .type = IN,                                                         \
+            .rw.elems = N_IN,                                                   \
+        },                                                                      \
+        __VA_ARGS__,                                                            \
+        {                                                                       \
+            .op = SWS_OP_WRITE,                                                 \
+            .type = OUT,                                                        \
+            .rw.elems = N_OUT,                                                  \
+        }, {0}                                                                  \
+    });                                                                         \
+  } while (0)
+
+#define MK_RANGES(R) ((const unsigned[]) { R, R, R, R })
+#define CHECK_RANGE(NAME, RANGE, N_IN, N_OUT, IN, OUT, ...)                     \
+    CHECK_RANGES(NAME, MK_RANGES(RANGE), N_IN, N_OUT, IN, OUT, __VA_ARGS__)
+
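+/* Exercise an op over the most common component layouts: one plane, three
+ * planes, four planes, and a four-component read written as two planes
+ * through a swizzle. */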
+#define CHECK_COMMON_RANGE(NAME, RANGE, IN, OUT, ...)                           \
+    CHECK_RANGE(FMT("%s_p1000", NAME), RANGE, 1, 1, IN, OUT, __VA_ARGS__);      \
+    CHECK_RANGE(FMT("%s_p1110", NAME), RANGE, 3, 3, IN, OUT, __VA_ARGS__);      \
+    CHECK_RANGE(FMT("%s_p1111", NAME), RANGE, 4, 4, IN, OUT, __VA_ARGS__);      \
+    CHECK_RANGE(FMT("%s_p1001", NAME), RANGE, 4, 2, IN, OUT, __VA_ARGS__, {     \
+        .op = SWS_OP_SWIZZLE,                                                   \
+        .type = OUT,                                                            \
+        .swizzle = SWS_SWIZZLE(0, 3, 1, 2),                                     \
+    })
+
+#define CHECK(NAME, N_IN, N_OUT, IN, OUT, ...) \
+    CHECK_RANGE(NAME, 0, N_IN, N_OUT, IN, OUT, __VA_ARGS__)
+
+#define CHECK_COMMON(NAME, IN, OUT, ...) \
+    CHECK_COMMON_RANGE(NAME, 0, IN, OUT, __VA_ARGS__)
+
+static void check_read_write(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        for (int i = 1; i <= 4; i++) {
+            /* Test N->N planar read/write */
+            for (int o = 1; o <= i; o++) {
+                check_ops(FMT("rw_%d_%d_%s", i, o, type), NULL, (SwsOp[]) {
+                    {
+                        .op = SWS_OP_READ,
+                        .type = t,
+                        .rw.elems = i,
+                    }, {
+                        .op = SWS_OP_WRITE,
+                        .type = t,
+                        .rw.elems = o,
+                    }, {0}
+                });
+            }
+
+            /* Test packed read/write */
+            if (i == 1)
+                continue;
+
+            check_ops(FMT("read_packed%d_%s", i, type), NULL, (SwsOp[]) {
+                {
+                    .op = SWS_OP_READ,
+                    .type = t,
+                    .rw.elems = i,
+                    .rw.packed = true,
+                }, {
+                    .op = SWS_OP_WRITE,
+                    .type = t,
+                    .rw.elems = i,
+                }, {0}
+            });
+
+            check_ops(FMT("write_packed%d_%s", i, type), NULL, (SwsOp[]) {
+                {
+                    .op = SWS_OP_READ,
+                    .type = t,
+                    .rw.elems = i,
+                }, {
+                    .op = SWS_OP_WRITE,
+                    .type = t,
+                    .rw.elems = i,
+                    .rw.packed = true,
+                }, {0}
+            });
+        }
+    }
+
+    /* Test fractional reads/writes */
+    for (int frac = 1; frac <= 3; frac++) {
+        const int bits = 8 >> frac;
+        const int range = (1 << bits) - 1;
+        if (bits == 2)
+            continue; /* no 2 bit packed formats currently exist */
+
+        check_ops(FMT("read_frac%d", frac), NULL, (SwsOp[]) {
+            {
+                .op = SWS_OP_READ,
+                .type = U8,
+                .rw.elems = 1,
+                .rw.frac  = frac,
+            }, {
+                .op = SWS_OP_WRITE,
+                .type = U8,
+                .rw.elems = 1,
+            }, {0}
+        });
+
+        check_ops(FMT("write_frac%d", frac), MK_RANGES(range), (SwsOp[]) {
+            {
+                .op = SWS_OP_READ,
+                .type = U8,
+                .rw.elems = 1,
+            }, {
+                .op = SWS_OP_WRITE,
+                .type = U8,
+                .rw.elems = 1,
+                .rw.frac  = frac,
+            }, {0}
+        });
+    }
+}
+
+static void check_swap_bytes(void)
+{
+    CHECK_COMMON("swap_bytes_16", U16, U16, {
+        .op   = SWS_OP_SWAP_BYTES,
+        .type = U16,
+    });
+
+    CHECK_COMMON("swap_bytes_32", U32, U32, {
+        .op   = SWS_OP_SWAP_BYTES,
+        .type = U32,
+    });
+}
+
+static void check_pack_unpack(void)
+{
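+    /* Bit layouts matching real packed formats, e.g. {5, 6, 5} = RGB565,
+     * {2, 10, 10, 10} = X2RGB10, {3, 3, 2} = RGB8 */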
+    const struct {
+        SwsPixelType type;
+        SwsPackOp op;
+    } patterns[] = {
+        { U8, {{ 3,  3,  2 }}},
+        { U8, {{ 2,  3,  3 }}},
+        { U8, {{ 1,  2,  1 }}},
+        {U16, {{ 5,  6,  5 }}},
+        {U16, {{ 5,  5,  5 }}},
+        {U16, {{ 4,  4,  4 }}},
+        {U32, {{ 2, 10, 10, 10 }}},
+        {U32, {{10, 10, 10,  2 }}},
+    };
+
+    for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+        const SwsPixelType type = patterns[i].type;
+        const SwsPackOp pack = patterns[i].op;
+        const int num = pack.pattern[3] ? 4 : 3;
+        const char *pat = FMT("%d%d%d%d", pack.pattern[0], pack.pattern[1],
+                                          pack.pattern[2], pack.pattern[3]);
+        const int total = pack.pattern[0] + pack.pattern[1] +
+                          pack.pattern[2] + pack.pattern[3];
+        const unsigned ranges[4] = {
+            (1 << pack.pattern[0]) - 1,
+            (1 << pack.pattern[1]) - 1,
+            (1 << pack.pattern[2]) - 1,
+            (1 << pack.pattern[3]) - 1,
+        };
+
+        CHECK_RANGES(FMT("pack_%s", pat), ranges, num, 1, type, type, {
+            .op   = SWS_OP_PACK,
+            .type = type,
+            .pack = pack,
+        });
+
+        CHECK_RANGE(FMT("unpack_%s", pat), (1 << total) - 1, 1, num, type, type, {
+            .op   = SWS_OP_UNPACK,
+            .type = type,
+            .pack = pack,
+        });
+    }
+}
+
+static AVRational rndq(SwsPixelType t)
+{
+    const unsigned num = rnd();
+    if (ff_sws_pixel_type_is_int(t)) {
+        const unsigned mask = (1 << (ff_sws_pixel_type_size(t) * 8)) - 1;
+        return (AVRational) { num & mask, 1 };
+    } else {
+        const unsigned den = rnd();
+        return (AVRational) { num, den ? den : 1 };
+    }
+}
+
+static void check_clear(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        const int bits = ff_sws_pixel_type_size(t) * 8;
+
+        /* TODO: AVRational can't fit 32 bit constants */
+        if (bits < 32) {
+            const AVRational chroma = (AVRational) { 1 << (bits - 1), 1};
+            const AVRational alpha  = (AVRational) { (1 << bits) - 1, 1};
+            const AVRational zero   = (AVRational) { 0, 1};
+            const AVRational none = {0};
+
+            const SwsConst patterns[] = {
+                /* Zero only */
+                {.q4 = {   none,   none,   none,   zero }},
+                {.q4 = {   zero,   none,   none,   none }},
+                /* Alpha only */
+                {.q4 = {   none,   none,   none,  alpha }},
+                {.q4 = {  alpha,   none,   none,   none }},
+                /* Chroma only */
+                {.q4 = { chroma, chroma,   none,   none }},
+                {.q4 = {   none, chroma, chroma,   none }},
+                {.q4 = {   none,   none, chroma, chroma }},
+                {.q4 = { chroma,   none, chroma,   none }},
+                {.q4 = {   none, chroma,   none, chroma }},
+                /* Alpha+chroma */
+                {.q4 = { chroma, chroma,   none,  alpha }},
+                {.q4 = {   none, chroma, chroma,  alpha }},
+                {.q4 = {  alpha,   none, chroma, chroma }},
+                {.q4 = { chroma,   none, chroma,  alpha }},
+                {.q4 = {  alpha, chroma,   none, chroma }},
+                /* Random values */
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+                {.q4 = { none, rndq(t), rndq(t), rndq(t) }},
+            };
+
+            for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+                CHECK(FMT("clear_pattern_%s[%d]", type, i), 4, 4, t, t, {
+                    .op   = SWS_OP_CLEAR,
+                    .type = t,
+                    .c    = patterns[i],
+                });
+            }
+        } else if (!ff_sws_pixel_type_is_int(t)) {
+            /* Floating point YUV doesn't exist, only alpha needs to be cleared */
+            CHECK(FMT("clear_alpha_%s", type), 4, 4, t, t, {
+                .op      = SWS_OP_CLEAR,
+                .type    = t,
+                .c.q4[3] = { 0, 1 },
+            });
+        }
+    }
+}
+
+static void check_shift(void)
+{
+    for (SwsPixelType t = U16; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        if (!ff_sws_pixel_type_is_int(t))
+            continue;
+
+        for (int shift = 1; shift <= 8; shift++) {
+            CHECK_COMMON(FMT("lshift%d_%s", shift, type), t, t, {
+                .op   = SWS_OP_LSHIFT,
+                .type = t,
+                .c.u  = shift,
+            });
+
+            CHECK_COMMON(FMT("rshift%d_%s", shift, type), t, t, {
+                .op   = SWS_OP_RSHIFT,
+                .type = t,
+                .c.u  = shift,
+            });
+        }
+    }
+}
+
+static void check_swizzle(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        static const int patterns[][4] = {
+            /* Pure swizzle */
+            {3, 0, 1, 2},
+            {3, 0, 2, 1},
+            {2, 1, 0, 3},
+            {3, 2, 1, 0},
+            {3, 1, 0, 2},
+            {3, 2, 0, 1},
+            {1, 2, 0, 3},
+            {1, 0, 2, 3},
+            {2, 0, 1, 3},
+            {2, 3, 1, 0},
+            {2, 1, 3, 0},
+            {1, 2, 3, 0},
+            {1, 3, 2, 0},
+            {0, 2, 1, 3},
+            {0, 2, 3, 1},
+            {0, 3, 1, 2},
+            {3, 1, 2, 0},
+            {0, 3, 2, 1},
+            /* Luma expansion */
+            {0, 0, 0, 3},
+            {3, 0, 0, 0},
+            {0, 0, 0, 1},
+            {1, 0, 0, 0},
+        };
+
+        for (int i = 0; i < FF_ARRAY_ELEMS(patterns); i++) {
+            const int x = patterns[i][0], y = patterns[i][1],
+                      z = patterns[i][2], w = patterns[i][3];
+            CHECK(FMT("swizzle_%d%d%d%d_%s", x, y, z, w, type), 4, 4, t, t, {
+                .op = SWS_OP_SWIZZLE,
+                .type = t,
+                .swizzle = SWS_SWIZZLE(x, y, z, w),
+            });
+        }
+    }
+}
+
+static void check_convert(void)
+{
+    for (SwsPixelType i = U8; i < SWS_PIXEL_TYPE_NB; i++) {
+        const char *itype = ff_sws_pixel_type_name(i);
+        const int isize = ff_sws_pixel_type_size(i);
+        for (SwsPixelType o = U8; o < SWS_PIXEL_TYPE_NB; o++) {
+            const char *otype = ff_sws_pixel_type_name(o);
+            const int osize = ff_sws_pixel_type_size(o);
+            const char *name = FMT("convert_%s_%s", itype, otype);
+            if (i == o)
+                continue;
+
+            if (isize < osize || !ff_sws_pixel_type_is_int(o)) {
+                CHECK_COMMON(name, i, o, {
+                    .op = SWS_OP_CONVERT,
+                    .type = i,
+                    .convert.to = o,
+                });
+            } else if (isize > osize || !ff_sws_pixel_type_is_int(i)) {
+                /* Use the full range for 32-bit outputs; (1 << 32) is
+                 * undefined */
+                uint32_t range = osize < 4 ? (1U << osize * 8) - 1
+                                           : UINT32_MAX;
+                CHECK_COMMON_RANGE(name, range, i, o, {
+                    .op = SWS_OP_CONVERT,
+                    .type = i,
+                    .convert.to = o,
+                });
+            }
+        }
+    }
+
+    /* Check expanding conversions */
+    CHECK_COMMON("expand16", U8, U16, {
+        .op = SWS_OP_CONVERT,
+        .type = U8,
+        .convert.to = U16,
+        .convert.expand = true,
+    });
+
+    CHECK_COMMON("expand32", U8, U32, {
+        .op = SWS_OP_CONVERT,
+        .type = U8,
+        .convert.to = U32,
+        .convert.expand = true,
+    });
+}
+
+static void check_dither(void)
+{
+    for (SwsPixelType t = F32; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        if (ff_sws_pixel_type_is_int(t))
+            continue;
+
+        /* Test all sizes up to 256x256 */
+        for (int size_log2 = 0; size_log2 <= 8; size_log2++) {
+            const int size = 1 << size_log2;
+            AVRational *matrix = av_refstruct_allocz(size * size * sizeof(*matrix));
+            if (!matrix) {
+                fail();
+                return;
+            }
+
+            if (size == 1) {
+                matrix[0] = (AVRational) { 1, 2 };
+            } else {
+                for (int i = 0; i < size * size; i++)
+                    matrix[i] = rndq(t);
+            }
+
+            CHECK_COMMON(FMT("dither_%dx%d_%s", size, size, type), t, t, {
+                .op = SWS_OP_DITHER,
+                .type = t,
+                .dither.size_log2 = size_log2,
+                .dither.matrix = matrix,
+            });
+
+            av_refstruct_unref(&matrix);
+        }
+    }
+}
+
+static void check_min_max(void)
+{
+    for (SwsPixelType t = U8; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        CHECK_COMMON(FMT("min_%s", type), t, t, {
+            .op = SWS_OP_MIN,
+            .type = t,
+            .c.q4 = { rndq(t), rndq(t), rndq(t), rndq(t) },
+        });
+
+        CHECK_COMMON(FMT("max_%s", type), t, t, {
+            .op = SWS_OP_MAX,
+            .type = t,
+            .c.q4 = { rndq(t), rndq(t), rndq(t), rndq(t) },
+        });
+    }
+}
+
+static void check_linear(void)
+{
+    static const struct {
+        const char *name;
+        uint32_t mask;
+    } patterns[] = {
+        { "noop",               0 },
+        { "luma",               SWS_MASK_LUMA },
+        { "alpha",              SWS_MASK_ALPHA },
+        { "luma+alpha",         SWS_MASK_LUMA | SWS_MASK_ALPHA },
+        { "dot3",               0b111 },
+        { "dot4",               0b1111 },
+        { "row0",               SWS_MASK_ROW(0) },
+        { "row0+alpha",         SWS_MASK_ROW(0) | SWS_MASK_ALPHA },
+        { "off3",               SWS_MASK_OFF3 },
+        { "off3+alpha",         SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag3",              SWS_MASK_DIAG3 },
+        { "diag4",              SWS_MASK_DIAG4 },
+        { "diag3+alpha",        SWS_MASK_DIAG3 | SWS_MASK_ALPHA },
+        { "diag3+off3",         SWS_MASK_DIAG3 | SWS_MASK_OFF3 },
+        { "diag3+off3+alpha",   SWS_MASK_DIAG3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "diag4+off4",         SWS_MASK_DIAG4 | SWS_MASK_OFF4 },
+        { "matrix3",            SWS_MASK_MAT3 },
+        { "matrix3+off3",       SWS_MASK_MAT3 | SWS_MASK_OFF3 },
+        { "matrix3+off3+alpha", SWS_MASK_MAT3 | SWS_MASK_OFF3 | SWS_MASK_ALPHA },
+        { "matrix4",            SWS_MASK_MAT4 },
+        { "matrix4+off4",       SWS_MASK_MAT4 | SWS_MASK_OFF4 },
+    };
+
+    for (SwsPixelType t = F32; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        if (ff_sws_pixel_type_is_int(t))
+            continue;
+
+        for (int p = 0; p < FF_ARRAY_ELEMS(patterns); p++) {
+            const uint32_t mask = patterns[p].mask;
+            SwsLinearOp lin = { .mask = mask };
+
+            for (int i = 0; i < 4; i++) {
+                for (int j = 0; j < 5; j++) {
+                    if (mask & SWS_MASK(i, j)) {
+                        lin.m[i][j] = rndq(t);
+                    } else {
+                        lin.m[i][j] = (AVRational) { i == j, 1 };
+                    }
+                }
+            }
+
+            CHECK(FMT("linear_%s_%s", patterns[p].name, type), 4, 4, t, t, {
+                .op = SWS_OP_LINEAR,
+                .type = t,
+                .lin = lin,
+            });
+        }
+    }
+}
+
+static void check_scale(void)
+{
+    for (SwsPixelType t = F32; t < SWS_PIXEL_TYPE_NB; t++) {
+        const char *type = ff_sws_pixel_type_name(t);
+        const int bits = ff_sws_pixel_type_size(t) * 8;
+        if (ff_sws_pixel_type_is_int(t)) {
+            /* Ensure the result won't exceed the value range */
+            const unsigned max = (1 << bits) - 1;
+            const unsigned scale = rnd() & max;
+            const unsigned range = max / (scale ? scale : 1);
+            CHECK_COMMON_RANGE(FMT("scale_%s", type), range, t, t, {
+                .op   = SWS_OP_SCALE,
+                .type = t,
+                .c.q  = { scale, 1 },
+            });
+        } else {
+            CHECK_COMMON(FMT("scale_%s", type), t, t, {
+                .op   = SWS_OP_SCALE,
+                .type = t,
+                .c.q  = rndq(t),
+            });
+        }
+    }
+}
+
+void checkasm_check_sw_ops(void)
+{
+    check_read_write();
+    report("read_write");
+    check_swap_bytes();
+    report("swap_bytes");
+    check_pack_unpack();
+    report("pack_unpack");
+    check_clear();
+    report("clear");
+    check_shift();
+    report("shift");
+    check_swizzle();
+    report("swizzle");
+    check_convert();
+    report("convert");
+    check_dither();
+    report("dither");
+    check_min_max();
+    report("min_max");
+    check_linear();
+    report("linear");
+    check_scale();
+    report("scale");
+}
-- 
2.49.0

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] [PATCH v3 16/17] swscale/format: add new format decode/encode logic
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (14 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 15/17] tests/checkasm: add checkasm tests for swscale ops Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 17/17] swscale/graph: allow experimental use of new format handler Niklas Haas
  2025-05-27  8:29 ` [FFmpeg-devel] (no subject) Kieran Kunhya via ffmpeg-devel
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

This patch adds format handling code for the new operations: fully decoding
a pixel format to standardized RGB, as well as the inverse (encoding) path.

Handling it this way guarantees that a conversion path exists from any
format A to any format B without having to explicitly cover logic for each
pair; and choosing RGB instead of YUV as the intermediate (as was done in
swscale v1) is more flexible with regard to further operations such as
primaries conversions, linear scaling, etc.

In the case of a YUV->YUV transform, the redundant matrix multiplication
will be canceled out anyway.
---
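As a rough usage sketch (illustrative only; the wrapper name is made up and
error handling is elided beyond the RET() helper from format.c), the new
entry points are meant to compose into a full conversion pipeline, with F32
as the working type:

    static int build_conversion(SwsContext *ctx, SwsOpList *ops,
                                SwsFormat src, SwsFormat dst)
    {
        bool incomplete = false;
        /* raw input pixels -> standardized component order */
        RET(ff_sws_decode_pixfmt(ops, src.format));
        /* decoded integer/float values -> normalized RGB */
        RET(ff_sws_decode_colors(ctx, SWS_PIXEL_F32, ops, src, &incomplete));
        /* normalized RGB -> output color encoding */
        RET(ff_sws_encode_colors(ctx, SWS_PIXEL_F32, ops, dst, &incomplete));
        /* encoded values -> raw output pixels */
        RET(ff_sws_encode_pixfmt(ops, dst.format));
        return 0;
    }

The actual integration happens in the following patch.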
 libswscale/format.c | 926 ++++++++++++++++++++++++++++++++++++++++++++
 libswscale/format.h |  23 ++
 2 files changed, 949 insertions(+)

diff --git a/libswscale/format.c b/libswscale/format.c
index b77081dd7a..7cbc5b37db 100644
--- a/libswscale/format.c
+++ b/libswscale/format.c
@@ -21,8 +21,22 @@
 #include "libavutil/avassert.h"
 #include "libavutil/hdr_dynamic_metadata.h"
 #include "libavutil/mastering_display_metadata.h"
+#include "libavutil/refstruct.h"
 
 #include "format.h"
+#include "csputils.h"
+#include "ops_internal.h"
+
+#define Q(N) ((AVRational) { N, 1 })
+#define Q0   Q(0)
+#define Q1   Q(1)
+
+#define RET(x)                                                                 \
+    do {                                                                       \
+        int __ret = (x);                                                       \
+        if (__ret  < 0)                                                        \
+            return __ret;                                                      \
+    } while (0)
 
 typedef struct LegacyFormatEntry {
     uint8_t is_supported_in         :1;
@@ -582,3 +596,915 @@ int sws_is_noop(const AVFrame *dst, const AVFrame *src)
 
     return 1;
 }
+
+/* Returns the type suitable for a pixel after fully decoding/unpacking it */
+static SwsPixelType fmt_pixel_type(enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    const int bits = FFALIGN(desc->comp[0].depth, 8);
+    if (desc->flags & AV_PIX_FMT_FLAG_FLOAT) {
+        switch (bits) {
+        case 32: return SWS_PIXEL_F32;
+        }
+    } else {
+        switch (bits) {
+        case  8: return SWS_PIXEL_U8;
+        case 16: return SWS_PIXEL_U16;
+        case 32: return SWS_PIXEL_U32;
+        }
+    }
+
+    return SWS_PIXEL_NONE;
+}
+
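+/* Swizzle taking components from the standardized order (R, G, B, A or
+ * Y, U, V, A) to the format's stored order. For example, ARGB stores alpha
+ * in the first slot, giving the pattern {3, 0, 1, 2}. */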
+static SwsSwizzleOp fmt_swizzle(enum AVPixelFormat fmt)
+{
+    switch (fmt) {
+    case AV_PIX_FMT_ARGB:
+    case AV_PIX_FMT_0RGB:
+    case AV_PIX_FMT_AYUV64LE:
+    case AV_PIX_FMT_AYUV64BE:
+    case AV_PIX_FMT_AYUV:
+    case AV_PIX_FMT_X2RGB10LE:
+    case AV_PIX_FMT_X2RGB10BE:
+        return (SwsSwizzleOp) {{ .x = 3, 0, 1, 2 }};
+    case AV_PIX_FMT_BGR24:
+    case AV_PIX_FMT_BGR8:
+    case AV_PIX_FMT_BGR4:
+    case AV_PIX_FMT_BGR4_BYTE:
+    case AV_PIX_FMT_BGRA:
+    case AV_PIX_FMT_BGR565BE:
+    case AV_PIX_FMT_BGR565LE:
+    case AV_PIX_FMT_BGR555BE:
+    case AV_PIX_FMT_BGR555LE:
+    case AV_PIX_FMT_BGR444BE:
+    case AV_PIX_FMT_BGR444LE:
+    case AV_PIX_FMT_BGR48BE:
+    case AV_PIX_FMT_BGR48LE:
+    case AV_PIX_FMT_BGRA64BE:
+    case AV_PIX_FMT_BGRA64LE:
+    case AV_PIX_FMT_BGR0:
+    case AV_PIX_FMT_VUYA:
+    case AV_PIX_FMT_VUYX:
+        return (SwsSwizzleOp) {{ .x = 2, 1, 0, 3 }};
+    case AV_PIX_FMT_ABGR:
+    case AV_PIX_FMT_0BGR:
+    case AV_PIX_FMT_X2BGR10LE:
+    case AV_PIX_FMT_X2BGR10BE:
+        return (SwsSwizzleOp) {{ .x = 3, 2, 1, 0 }};
+    case AV_PIX_FMT_YA8:
+    case AV_PIX_FMT_YA16BE:
+    case AV_PIX_FMT_YA16LE:
+        return (SwsSwizzleOp) {{ .x = 0, 3, 1, 2 }};
+    case AV_PIX_FMT_XV30BE:
+    case AV_PIX_FMT_XV30LE:
+        return (SwsSwizzleOp) {{ .x = 3, 2, 0, 1 }};
+    case AV_PIX_FMT_VYU444:
+    case AV_PIX_FMT_V30XBE:
+    case AV_PIX_FMT_V30XLE:
+        return (SwsSwizzleOp) {{ .x = 2, 0, 1, 3 }};
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XV48BE:
+    case AV_PIX_FMT_XV48LE:
+    case AV_PIX_FMT_UYVA:
+        return (SwsSwizzleOp) {{ .x = 1, 0, 2, 3 }};
+    case AV_PIX_FMT_GBRP:
+    case AV_PIX_FMT_GBRP9BE:
+    case AV_PIX_FMT_GBRP9LE:
+    case AV_PIX_FMT_GBRP10BE:
+    case AV_PIX_FMT_GBRP10LE:
+    case AV_PIX_FMT_GBRP12BE:
+    case AV_PIX_FMT_GBRP12LE:
+    case AV_PIX_FMT_GBRP14BE:
+    case AV_PIX_FMT_GBRP14LE:
+    case AV_PIX_FMT_GBRP16BE:
+    case AV_PIX_FMT_GBRP16LE:
+    case AV_PIX_FMT_GBRPF16BE:
+    case AV_PIX_FMT_GBRPF16LE:
+    case AV_PIX_FMT_GBRAP:
+    case AV_PIX_FMT_GBRAP10LE:
+    case AV_PIX_FMT_GBRAP10BE:
+    case AV_PIX_FMT_GBRAP12LE:
+    case AV_PIX_FMT_GBRAP12BE:
+    case AV_PIX_FMT_GBRAP14LE:
+    case AV_PIX_FMT_GBRAP14BE:
+    case AV_PIX_FMT_GBRAP16LE:
+    case AV_PIX_FMT_GBRAP16BE:
+    case AV_PIX_FMT_GBRPF32BE:
+    case AV_PIX_FMT_GBRPF32LE:
+    case AV_PIX_FMT_GBRAPF16BE:
+    case AV_PIX_FMT_GBRAPF16LE:
+    case AV_PIX_FMT_GBRAPF32BE:
+    case AV_PIX_FMT_GBRAPF32LE:
+        return (SwsSwizzleOp) {{ .x = 1, 2, 0, 3 }};
+    default:
+        return (SwsSwizzleOp) {{ .x = 0, 1, 2, 3 }};
+    }
+}
+
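+/* Invert a swizzle so that one table serves both directions: the encode
+ * side applies fmt_swizzle() directly, while the decode side applies its
+ * inverse. E.g. the ARGB pattern {3, 0, 1, 2} inverts to {1, 2, 3, 0},
+ * while the BGRA pattern {2, 1, 0, 3} is its own inverse. */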
+static SwsSwizzleOp swizzle_inv(SwsSwizzleOp swiz)
+{
+    /* Input[x] =: Output[swizzle.x] */
+    unsigned out[4];
+    out[swiz.x] = 0;
+    out[swiz.y] = 1;
+    out[swiz.z] = 2;
+    out[swiz.w] = 3;
+    return (SwsSwizzleOp) {{ .x = out[0], out[1], out[2], out[3] }};
+}
+
+/* Shift factor for MSB-aligned formats; e.g. P010 stores 10 significant
+ * bits in the top of each 16-bit word, requiring a right-shift of 6 on
+ * decode. */
+static int fmt_shift(enum AVPixelFormat fmt)
+{
+    switch (fmt) {
+    case AV_PIX_FMT_P010BE:
+    case AV_PIX_FMT_P010LE:
+    case AV_PIX_FMT_P210BE:
+    case AV_PIX_FMT_P210LE:
+    case AV_PIX_FMT_Y210BE:
+    case AV_PIX_FMT_Y210LE:
+        return 6;
+    case AV_PIX_FMT_P012BE:
+    case AV_PIX_FMT_P012LE:
+    case AV_PIX_FMT_P212BE:
+    case AV_PIX_FMT_P212LE:
+    case AV_PIX_FMT_P412BE:
+    case AV_PIX_FMT_P412LE:
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XYZ12BE:
+    case AV_PIX_FMT_XYZ12LE:
+        return 4;
+    }
+
+    return 0;
+}
+
+/**
+ * This initializes all absent components explicitly to zero. There is no
+ * need to worry about the correct neutral value as fmt_decode() will
+ * implicitly ignore and overwrite absent components in any case. This function
+ * is just to ensure that we don't operate on undefined memory. In most cases,
+ * it will end up getting pushed towards the output or optimized away entirely
+ * by the optimization pass.
+ */
+static SwsConst fmt_clear(enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    const bool has_chroma = desc->nb_components >= 3;
+    const bool has_alpha  = desc->flags & AV_PIX_FMT_FLAG_ALPHA;
+
+    SwsConst c = {0};
+    if (!has_chroma)
+        c.q4[1] = c.q4[2] = Q0;
+    if (!has_alpha)
+        c.q4[3] = Q0;
+
+    return c;
+}
+
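+/* Select the read/write op and optional (un)packing op for a format's raw
+ * pixel layout; returns AVERROR(ENOTSUP) for formats not yet handled. */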
+static int fmt_read_write(enum AVPixelFormat fmt, SwsReadWriteOp *rw_op,
+                          SwsPackOp *pack_op)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    if (!desc)
+        return AVERROR(EINVAL);
+
+    switch (fmt) {
+    case AV_PIX_FMT_NONE:
+    case AV_PIX_FMT_NB:
+        break;
+
+    /* Packed bitstream formats */
+    case AV_PIX_FMT_MONOWHITE:
+    case AV_PIX_FMT_MONOBLACK:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) {
+            .elems = 1,
+            .frac  = 3,
+        };
+        return 0;
+    case AV_PIX_FMT_RGB4:
+    case AV_PIX_FMT_BGR4:
+        *pack_op = (SwsPackOp) {{ 1, 2, 1 }};
+        *rw_op = (SwsReadWriteOp) {
+            .elems = 1,
+            .packed = true,
+            .frac  = 1,
+        };
+        return 0;
+    /* Packed 8-bit aligned formats */
+    case AV_PIX_FMT_RGB4_BYTE:
+    case AV_PIX_FMT_BGR4_BYTE:
+        *pack_op = (SwsPackOp) {{ 1, 2, 1 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_BGR8:
+        *pack_op = (SwsPackOp) {{ 2, 3, 3 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_RGB8:
+        *pack_op = (SwsPackOp) {{ 3, 3, 2 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+
+    /* Packed 16-bit aligned formats */
+    case AV_PIX_FMT_RGB565BE:
+    case AV_PIX_FMT_RGB565LE:
+    case AV_PIX_FMT_BGR565BE:
+    case AV_PIX_FMT_BGR565LE:
+        *pack_op = (SwsPackOp) {{ 5, 6, 5 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_RGB555BE:
+    case AV_PIX_FMT_RGB555LE:
+    case AV_PIX_FMT_BGR555BE:
+    case AV_PIX_FMT_BGR555LE:
+        *pack_op = (SwsPackOp) {{ 5, 5, 5 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_RGB444BE:
+    case AV_PIX_FMT_RGB444LE:
+    case AV_PIX_FMT_BGR444BE:
+    case AV_PIX_FMT_BGR444LE:
+        *pack_op = (SwsPackOp) {{ 4, 4, 4 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    /* Packed 32-bit aligned 4:4:4 formats */
+    case AV_PIX_FMT_X2RGB10BE:
+    case AV_PIX_FMT_X2RGB10LE:
+    case AV_PIX_FMT_X2BGR10BE:
+    case AV_PIX_FMT_X2BGR10LE:
+    case AV_PIX_FMT_XV30BE:
+    case AV_PIX_FMT_XV30LE:
+        *pack_op = (SwsPackOp) {{ 2, 10, 10, 10 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    case AV_PIX_FMT_V30XBE:
+    case AV_PIX_FMT_V30XLE:
+        *pack_op = (SwsPackOp) {{ 10, 10, 10, 2 }};
+        *rw_op = (SwsReadWriteOp) { .elems = 1, .packed = true };
+        return 0;
+    /* 3 component formats with one channel ignored */
+    case AV_PIX_FMT_RGB0:
+    case AV_PIX_FMT_BGR0:
+    case AV_PIX_FMT_0RGB:
+    case AV_PIX_FMT_0BGR:
+    case AV_PIX_FMT_XV36BE:
+    case AV_PIX_FMT_XV36LE:
+    case AV_PIX_FMT_XV48BE:
+    case AV_PIX_FMT_XV48LE:
+    case AV_PIX_FMT_VUYX:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) { .elems = 4, .packed = true };
+        return 0;
+    /* Unpacked byte-aligned 4:4:4 formats */
+    case AV_PIX_FMT_YUV444P:
+    case AV_PIX_FMT_YUVJ444P:
+    case AV_PIX_FMT_YUV444P9BE:
+    case AV_PIX_FMT_YUV444P9LE:
+    case AV_PIX_FMT_YUV444P10BE:
+    case AV_PIX_FMT_YUV444P10LE:
+    case AV_PIX_FMT_YUV444P12BE:
+    case AV_PIX_FMT_YUV444P12LE:
+    case AV_PIX_FMT_YUV444P14BE:
+    case AV_PIX_FMT_YUV444P14LE:
+    case AV_PIX_FMT_YUV444P16BE:
+    case AV_PIX_FMT_YUV444P16LE:
+    case AV_PIX_FMT_YUVA444P:
+    case AV_PIX_FMT_YUVA444P9BE:
+    case AV_PIX_FMT_YUVA444P9LE:
+    case AV_PIX_FMT_YUVA444P10BE:
+    case AV_PIX_FMT_YUVA444P10LE:
+    case AV_PIX_FMT_YUVA444P12BE:
+    case AV_PIX_FMT_YUVA444P12LE:
+    case AV_PIX_FMT_YUVA444P16BE:
+    case AV_PIX_FMT_YUVA444P16LE:
+    case AV_PIX_FMT_AYUV:
+    case AV_PIX_FMT_UYVA:
+    case AV_PIX_FMT_VYU444:
+    case AV_PIX_FMT_AYUV64BE:
+    case AV_PIX_FMT_AYUV64LE:
+    case AV_PIX_FMT_VUYA:
+    case AV_PIX_FMT_RGB24:
+    case AV_PIX_FMT_BGR24:
+    case AV_PIX_FMT_RGB48BE:
+    case AV_PIX_FMT_RGB48LE:
+    case AV_PIX_FMT_BGR48BE:
+    case AV_PIX_FMT_BGR48LE:
+    //case AV_PIX_FMT_RGB96BE: TODO: AVRational can't fit 2^32-1
+    //case AV_PIX_FMT_RGB96LE:
+    //case AV_PIX_FMT_RGBF16BE: TODO: no support for float16 currently
+    //case AV_PIX_FMT_RGBF16LE:
+    case AV_PIX_FMT_RGBF32BE:
+    case AV_PIX_FMT_RGBF32LE:
+    case AV_PIX_FMT_ARGB:
+    case AV_PIX_FMT_RGBA:
+    case AV_PIX_FMT_ABGR:
+    case AV_PIX_FMT_BGRA:
+    case AV_PIX_FMT_RGBA64BE:
+    case AV_PIX_FMT_RGBA64LE:
+    case AV_PIX_FMT_BGRA64BE:
+    case AV_PIX_FMT_BGRA64LE:
+    //case AV_PIX_FMT_RGBA128BE: TODO: AVRational can't fit 2^32-1
+    //case AV_PIX_FMT_RGBA128LE:
+    case AV_PIX_FMT_RGBAF32BE:
+    case AV_PIX_FMT_RGBAF32LE:
+    case AV_PIX_FMT_GBRP:
+    case AV_PIX_FMT_GBRP9BE:
+    case AV_PIX_FMT_GBRP9LE:
+    case AV_PIX_FMT_GBRP10BE:
+    case AV_PIX_FMT_GBRP10LE:
+    case AV_PIX_FMT_GBRP12BE:
+    case AV_PIX_FMT_GBRP12LE:
+    case AV_PIX_FMT_GBRP14BE:
+    case AV_PIX_FMT_GBRP14LE:
+    case AV_PIX_FMT_GBRP16BE:
+    case AV_PIX_FMT_GBRP16LE:
+    //case AV_PIX_FMT_GBRPF16BE: TODO
+    //case AV_PIX_FMT_GBRPF16LE:
+    case AV_PIX_FMT_GBRPF32BE:
+    case AV_PIX_FMT_GBRPF32LE:
+    case AV_PIX_FMT_GBRAP:
+    case AV_PIX_FMT_GBRAP10BE:
+    case AV_PIX_FMT_GBRAP10LE:
+    case AV_PIX_FMT_GBRAP12BE:
+    case AV_PIX_FMT_GBRAP12LE:
+    case AV_PIX_FMT_GBRAP14BE:
+    case AV_PIX_FMT_GBRAP14LE:
+    case AV_PIX_FMT_GBRAP16BE:
+    case AV_PIX_FMT_GBRAP16LE:
+    //case AV_PIX_FMT_GBRAPF16BE: TODO
+    //case AV_PIX_FMT_GBRAPF16LE:
+    case AV_PIX_FMT_GBRAPF32BE:
+    case AV_PIX_FMT_GBRAPF32LE:
+    case AV_PIX_FMT_GRAY8:
+    case AV_PIX_FMT_GRAY9BE:
+    case AV_PIX_FMT_GRAY9LE:
+    case AV_PIX_FMT_GRAY10BE:
+    case AV_PIX_FMT_GRAY10LE:
+    case AV_PIX_FMT_GRAY12BE:
+    case AV_PIX_FMT_GRAY12LE:
+    case AV_PIX_FMT_GRAY14BE:
+    case AV_PIX_FMT_GRAY14LE:
+    case AV_PIX_FMT_GRAY16BE:
+    case AV_PIX_FMT_GRAY16LE:
+    //case AV_PIX_FMT_GRAYF16BE: TODO
+    //case AV_PIX_FMT_GRAYF16LE:
+    //case AV_PIX_FMT_YAF16BE:
+    //case AV_PIX_FMT_YAF16LE:
+    case AV_PIX_FMT_GRAYF32BE:
+    case AV_PIX_FMT_GRAYF32LE:
+    case AV_PIX_FMT_YAF32BE:
+    case AV_PIX_FMT_YAF32LE:
+    case AV_PIX_FMT_YA8:
+    case AV_PIX_FMT_YA16LE:
+    case AV_PIX_FMT_YA16BE:
+        *pack_op = (SwsPackOp) {0};
+        *rw_op = (SwsReadWriteOp) {
+            .elems  = desc->nb_components,
+            .packed = desc->nb_components > 1 && !(desc->flags & AV_PIX_FMT_FLAG_PLANAR),
+        };
+        return 0;
+    }
+
+    return AVERROR(ENOTSUP);
+}
+
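+/* Smallest unsigned integer type that can hold a fully packed pixel */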
+static SwsPixelType get_packed_type(SwsPackOp pack)
+{
+    const int sum = pack.pattern[0] + pack.pattern[1] +
+                    pack.pattern[2] + pack.pattern[3];
+    if (sum > 16)
+        return SWS_PIXEL_U32;
+    else if (sum > 8)
+        return SWS_PIXEL_U16;
+    else
+        return SWS_PIXEL_U8;
+}
+
+#if HAVE_BIGENDIAN
+#  define NATIVE_ENDIAN_FLAG AV_PIX_FMT_FLAG_BE
+#else
+#  define NATIVE_ENDIAN_FLAG 0
+#endif
+
+int ff_sws_decode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    SwsPixelType pixel_type = fmt_pixel_type(fmt);
+    SwsPixelType raw_type = pixel_type;
+    SwsReadWriteOp rw_op;
+    SwsPackOp unpack;
+
+    RET(fmt_read_write(fmt, &rw_op, &unpack));
+    if (unpack.pattern[0])
+        raw_type = get_packed_type(unpack);
+
+    /* TODO: handle subsampled or semipacked input formats */
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_READ,
+        .type = raw_type,
+        .rw   = rw_op,
+    }));
+
+    if ((desc->flags & AV_PIX_FMT_FLAG_BE) != NATIVE_ENDIAN_FLAG) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_SWAP_BYTES,
+            .type = raw_type,
+        }));
+    }
+
+    if (unpack.pattern[0]) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_UNPACK,
+            .type = raw_type,
+            .pack = unpack,
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CONVERT,
+            .type = raw_type,
+            .convert.to = pixel_type,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op      = SWS_OP_SWIZZLE,
+        .type    = pixel_type,
+        .swizzle = swizzle_inv(fmt_swizzle(fmt)),
+    }));
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_RSHIFT,
+        .type = pixel_type,
+        .c.u  = fmt_shift(fmt),
+    }));
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_CLEAR,
+        .type = pixel_type,
+        .c    = fmt_clear(fmt),
+    }));
+
+    return 0;
+}
+
+int ff_sws_encode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt)
+{
+    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(fmt);
+    SwsPixelType pixel_type = fmt_pixel_type(fmt);
+    SwsPixelType raw_type = pixel_type;
+    SwsReadWriteOp rw_op;
+    SwsPackOp pack;
+
+    RET(fmt_read_write(fmt, &rw_op, &pack));
+    if (pack.pattern[0])
+        raw_type = get_packed_type(pack);
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_LSHIFT,
+        .type = pixel_type,
+        .c.u  = fmt_shift(fmt),
+    }));
+
+    if (rw_op.elems > desc->nb_components) {
+        /* Format writes unused alpha channel, clear it explicitly for sanity */
+        av_assert1(!(desc->flags & AV_PIX_FMT_FLAG_ALPHA));
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CLEAR,
+            .type = pixel_type,
+            .c.q4[3] = Q0,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op      = SWS_OP_SWIZZLE,
+        .type    = pixel_type,
+        .swizzle = fmt_swizzle(fmt),
+    }));
+
+    if (pack.pattern[0]) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_CONVERT,
+            .type = pixel_type,
+            .convert.to = raw_type,
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_PACK,
+            .type = raw_type,
+            .pack = pack,
+        }));
+    }
+
+    if ((desc->flags & AV_PIX_FMT_FLAG_BE) != NATIVE_ENDIAN_FLAG) {
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_SWAP_BYTES,
+            .type = raw_type,
+        }));
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op   = SWS_OP_WRITE,
+        .type = raw_type,
+        .rw   = rw_op,
+    }));
+    return 0;
+}
+
+static inline AVRational av_neg_q(AVRational x)
+{
+    return (AVRational) { -x.num, x.den };
+}
+
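+/* Linear transform taking normalized [0,1] components to the format's
+ * integer coding range; e.g. full-range 8-bit scales by 255, while limited
+ * range 8-bit luma scales by 219 and offsets by 16. */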
+static SwsLinearOp fmt_encode_range(const SwsFormat fmt, bool *incomplete)
+{
+    SwsLinearOp c = { .m = {
+        { Q1, Q0, Q0, Q0, Q0 },
+        { Q0, Q1, Q0, Q0, Q0 },
+        { Q0, Q0, Q1, Q0, Q0 },
+        { Q0, Q0, Q0, Q1, Q0 },
+    }};
+
+    const int depth0 = fmt.desc->comp[0].depth;
+    const int depth1 = fmt.desc->comp[1].depth;
+    const int depth2 = fmt.desc->comp[2].depth;
+    const int depth3 = fmt.desc->comp[3].depth;
+
+    if (fmt.desc->flags & AV_PIX_FMT_FLAG_FLOAT)
+        return c; /* floats are directly output as-is */
+
+    if (fmt.csp == AVCOL_SPC_RGB || (fmt.desc->flags & AV_PIX_FMT_FLAG_XYZ)) {
+        c.m[0][0] = Q((1 << depth0) - 1);
+        c.m[1][1] = Q((1 << depth1) - 1);
+        c.m[2][2] = Q((1 << depth2) - 1);
+    } else if (fmt.range == AVCOL_RANGE_JPEG) {
+        /* Full range YUV */
+        c.m[0][0] = Q((1 << depth0) - 1);
+        if (fmt.desc->nb_components >= 3) {
+            /* This follows the ITU-R convention, which is slightly different
+             * from the JFIF convention. */
+            c.m[1][1] = Q((1 << depth1) - 1);
+            c.m[2][2] = Q((1 << depth2) - 1);
+            c.m[1][4] = Q(1 << (depth1 - 1));
+            c.m[2][4] = Q(1 << (depth2 - 1));
+        }
+    } else {
+        /* Limited range YUV */
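+        /* (at 8 bits: Y in [16, 235], chroma centered on 128 in [16, 240]) */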
+        if (fmt.range == AVCOL_RANGE_UNSPECIFIED)
+            *incomplete = true;
+        c.m[0][0] = Q(219 << (depth0 - 8));
+        c.m[0][4] = Q( 16 << (depth0 - 8));
+        if (fmt.desc->nb_components >= 3) {
+            c.m[1][1] = Q(224 << (depth1 - 8));
+            c.m[2][2] = Q(224 << (depth2 - 8));
+            c.m[1][4] = Q(128 << (depth1 - 8));
+            c.m[2][4] = Q(128 << (depth2 - 8));
+        }
+    }
+
+    if (fmt.desc->flags & AV_PIX_FMT_FLAG_ALPHA) {
+        const bool is_ya = fmt.desc->nb_components == 2;
+        c.m[3][3] = Q((1 << (is_ya ? depth1 : depth3)) - 1);
+    }
+
+    if (fmt.format == AV_PIX_FMT_MONOWHITE) {
+        /* This format is inverted, 0 = white, 1 = black */
+        c.m[0][4] = av_add_q(c.m[0][4], c.m[0][0]);
+        c.m[0][0] = av_neg_q(c.m[0][0]);
+    }
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+static SwsLinearOp fmt_decode_range(const SwsFormat fmt, bool *incomplete)
+{
+    SwsLinearOp c = fmt_encode_range(fmt, incomplete);
+
+    /* Invert main diagonal + offset: x = s * y + k  ==>  y = (x - k) / s */
+    for (int i = 0; i < 4; i++) {
+        c.m[i][i] = av_inv_q(c.m[i][i]);
+        c.m[i][4] = av_mul_q(c.m[i][4], av_neg_q(c.m[i][i]));
+    }
+
+    /* Explicitly initialize alpha for sanity */
+    if (!(fmt.desc->flags & AV_PIX_FMT_FLAG_ALPHA))
+        c.m[3][4] = Q1;
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+static AVRational *generate_bayer_matrix(const int size_log2)
+{
+    const int size = 1 << size_log2;
+    const int num_entries = size * size;
+    AVRational *m = av_refstruct_allocz(sizeof(*m) * num_entries);
+    av_assert1(size_log2 < 16);
+    if (!m)
+        return NULL;
+
+    /* Start with a 1x1 matrix */
+    m[0] = Q0;
+
+    /* Generate three copies of the current, appropriately scaled and offset */
+    for (int sz = 1; sz < size; sz <<= 1) {
+        const int den = 4 * sz * sz;
+        for (int y = 0; y < sz; y++) {
+            for (int x = 0; x < sz; x++) {
+                const AVRational cur = m[y * size + x];
+                m[(y + sz) * size + x + sz] = av_add_q(cur, av_make_q(1, den));
+                m[(y     ) * size + x + sz] = av_add_q(cur, av_make_q(2, den));
+                m[(y + sz) * size + x     ] = av_add_q(cur, av_make_q(3, den));
+            }
+        }
+    }
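+    /* e.g. for size == 2, the loop above yields:
+     *   [ 0/4  2/4 ]
+     *   [ 3/4  1/4 ]
+     */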
+
+    /**
+     * To correctly round, we need to evenly distribute the result on [0, 1),
+     * giving an average value of 1/2.
+     *
+     * After the above construction, we have a matrix with average value:
+     *   [ 0/N + 1/N + 2/N + ... (N-1)/N ] / N = (N-1)/(2N)
+     * where N = size * size is the total number of entries.
+     *
+     * To make the average value equal to 1/2 = N/(2N), add a bias of 1/(2N).
+     */
+    for (int i = 0; i < num_entries; i++)
+        m[i] = av_add_q(m[i], av_make_q(1, 2 * num_entries));
+
+    return m;
+}
+
+static bool trc_is_hdr(enum AVColorTransferCharacteristic trc)
+{
+    static_assert(AVCOL_TRC_NB == 19, "Update this list when adding TRCs");
+    switch (trc) {
+    case AVCOL_TRC_LOG:
+    case AVCOL_TRC_LOG_SQRT:
+    case AVCOL_TRC_SMPTEST2084:
+    case AVCOL_TRC_ARIB_STD_B67:
+        return true;
+    default:
+        return false;
+    }
+}
+
+static int fmt_dither(SwsContext *ctx, SwsOpList *ops,
+                      const SwsPixelType type, const SwsFormat fmt)
+{
+    SwsDither mode = ctx->dither;
+    SwsDitherOp dither;
+
+    if (mode == SWS_DITHER_AUTO) {
+        /* Visual threshold of perception: 12 bits for SDR, 14 bits for HDR */
+        const int jnd_bits = trc_is_hdr(fmt.color.trc) ? 14 : 12;
+        const int bpc = fmt.desc->comp[0].depth;
+        mode = bpc >= jnd_bits ? SWS_DITHER_NONE : SWS_DITHER_BAYER;
+    }
+
+    switch (mode) {
+    case SWS_DITHER_NONE:
+        if (ctx->flags & SWS_ACCURATE_RND) {
+            /* Add constant 0.5 for correct rounding */
+            AVRational *bias = av_refstruct_allocz(sizeof(*bias));
+            if (!bias)
+                return AVERROR(ENOMEM);
+            *bias = (AVRational) {1, 2};
+            return ff_sws_op_list_append(ops, &(SwsOp) {
+                .op   = SWS_OP_DITHER,
+                .type = type,
+                .dither.matrix = bias,
+            });
+        } else {
+            return 0; /* No-op */
+        }
+    case SWS_DITHER_BAYER:
+        /* Hardcode 16x16 matrix for now; in theory we could adjust this
+         * based on the expected level of precision in the output, since lower
+         * bit depth outputs can suffice with smaller dither matrices; however
+         * in practice we probably want to use error diffusion for such low bit
+         * depths anyway */
+        dither.size_log2 = 4;
+        dither.matrix = generate_bayer_matrix(dither.size_log2);
+        if (!dither.matrix)
+            return AVERROR(ENOMEM);
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .op     = SWS_OP_DITHER,
+            .type   = type,
+            .dither = dither,
+        });
+    case SWS_DITHER_ED:
+    case SWS_DITHER_A_DITHER:
+    case SWS_DITHER_X_DITHER:
+        return AVERROR(ENOTSUP);
+
+    case SWS_DITHER_NB:
+        break;
+    }
+
+    av_assert0(!"Invalid dither mode");
+    return AVERROR(EINVAL);
+}
+
+static inline SwsLinearOp
+linear_mat3(const AVRational m00, const AVRational m01, const AVRational m02,
+            const AVRational m10, const AVRational m11, const AVRational m12,
+            const AVRational m20, const AVRational m21, const AVRational m22)
+{
+    SwsLinearOp c = {{
+        { m00, m01, m02, Q0, Q0 },
+        { m10, m11, m12, Q0, Q0 },
+        { m20, m21, m22, Q0, Q0 },
+        {  Q0,  Q0,  Q0, Q1, Q0 },
+    }};
+
+    c.mask = ff_sws_linear_mask(c);
+    return c;
+}
+
+int ff_sws_decode_colors(SwsContext *ctx, SwsPixelType type,
+                         SwsOpList *ops, const SwsFormat fmt, bool *incomplete)
+{
+    const AVLumaCoefficients *c = av_csp_luma_coeffs_from_avcsp(fmt.csp);
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .op         = SWS_OP_CONVERT,
+        .type       = fmt_pixel_type(fmt.format),
+        .convert.to = type,
+    }));
+
+    /* Decode pixel format into standardized range */
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .type = type,
+        .op   = SWS_OP_LINEAR,
+        .lin  = fmt_decode_range(fmt, incomplete),
+    }));
+
+    /* Final step, decode colorspace */
+    switch (fmt.csp) {
+    case AVCOL_SPC_RGB:
+        return 0;
+    case AVCOL_SPC_UNSPECIFIED:
+        c = av_csp_luma_coeffs_from_avcsp(AVCOL_SPC_BT470BG);
+        *incomplete = true;
+        /* fall through */
+    case AVCOL_SPC_FCC:
+    case AVCOL_SPC_BT470BG:
+    case AVCOL_SPC_SMPTE170M:
+    case AVCOL_SPC_BT709:
+    case AVCOL_SPC_SMPTE240M:
+    case AVCOL_SPC_BT2020_NCL: {
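+        /* Reconstruct R'G'B' from Y'CbCr using the luma coefficients:
+         *   R = Y + 2*(1 - cr)*Cr
+         *   G = Y - 2*(1 - cb)*(cb/cg)*Cb - 2*(1 - cr)*(cr/cg)*Cr
+         *   B = Y + 2*(1 - cb)*Cb
+         */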
+        AVRational crg = av_sub_q(Q0, av_div_q(c->cr, c->cg));
+        AVRational cbg = av_sub_q(Q0, av_div_q(c->cb, c->cg));
+        AVRational m02 = av_mul_q(Q(2), av_sub_q(Q1, c->cr));
+        AVRational m21 = av_mul_q(Q(2), av_sub_q(Q1, c->cb));
+        AVRational m11 = av_mul_q(cbg, m21);
+        AVRational m12 = av_mul_q(crg, m02);
+
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                Q1,  Q0, m02,
+                Q1, m11, m12,
+                Q1, m21,  Q0
+            ),
+        });
+    }
+
+    case AVCOL_SPC_YCGCO:
+        return ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                Q1, Q(-1), Q( 1),
+                Q1, Q( 1), Q( 0),
+                Q1, Q(-1), Q(-1)
+            ),
+        });
+
+    case AVCOL_SPC_BT2020_CL:
+    case AVCOL_SPC_SMPTE2085:
+    case AVCOL_SPC_CHROMA_DERIVED_NCL:
+    case AVCOL_SPC_CHROMA_DERIVED_CL:
+    case AVCOL_SPC_ICTCP:
+    case AVCOL_SPC_IPT_C2:
+    case AVCOL_SPC_YCGCO_RE:
+    case AVCOL_SPC_YCGCO_RO:
+        return AVERROR(ENOTSUP);
+
+    case AVCOL_SPC_RESERVED:
+        return AVERROR(EINVAL);
+
+    case AVCOL_SPC_NB:
+        break;
+    }
+
+    av_assert0(!"Corrupt AVColorSpace value?");
+    return AVERROR(EINVAL);
+}
+
+int ff_sws_encode_colors(SwsContext *ctx, SwsPixelType type,
+                         SwsOpList *ops, const SwsFormat fmt, bool *incomplete)
+{
+    const AVLumaCoefficients *c = av_csp_luma_coeffs_from_avcsp(fmt.csp);
+
+    switch (fmt.csp) {
+    case AVCOL_SPC_RGB:
+        break;
+    case AVCOL_SPC_UNSPECIFIED:
+        c = av_csp_luma_coeffs_from_avcsp(AVCOL_SPC_BT470BG);
+        *incomplete = true;
+        /* fall through */
+    case AVCOL_SPC_FCC:
+    case AVCOL_SPC_BT470BG:
+    case AVCOL_SPC_SMPTE170M:
+    case AVCOL_SPC_BT709:
+    case AVCOL_SPC_SMPTE240M:
+    case AVCOL_SPC_BT2020_NCL: {
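+        /* Standard R'G'B' -> Y'CbCr derivation from the luma coefficients:
+         *   Y  = cr*R + cg*G + cb*B
+         *   Cb = (B - Y) / (2*(1 - cb))
+         *   Cr = (R - Y) / (2*(1 - cr))
+         */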
+        AVRational cb1 = av_sub_q(c->cb, Q1);
+        AVRational cr1 = av_sub_q(c->cr, Q1);
+        AVRational m20 = av_make_q(1,2);
+        AVRational m10 = av_mul_q(m20, av_div_q(c->cr, cb1));
+        AVRational m11 = av_mul_q(m20, av_div_q(c->cg, cb1));
+        AVRational m21 = av_mul_q(m20, av_div_q(c->cg, cr1));
+        AVRational m22 = av_mul_q(m20, av_div_q(c->cb, cr1));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                c->cr, c->cg, c->cb,
+                m10,     m11,   m20,
+                m20,     m21,   m22
+            ),
+        }));
+        break;
+    }
+
+    case AVCOL_SPC_YCGCO:
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .type = type,
+            .op   = SWS_OP_LINEAR,
+            .lin  = linear_mat3(
+                av_make_q( 1, 4), av_make_q(1, 2), av_make_q( 1, 4),
+                av_make_q(-1, 4), av_make_q(1, 2), av_make_q(-1, 4),
+                av_make_q( 1, 2), av_make_q(0, 1), av_make_q(-1, 2)
+            ),
+        }));
+        break;
+
+    case AVCOL_SPC_BT2020_CL:
+    case AVCOL_SPC_SMPTE2085:
+    case AVCOL_SPC_CHROMA_DERIVED_NCL:
+    case AVCOL_SPC_CHROMA_DERIVED_CL:
+    case AVCOL_SPC_ICTCP:
+    case AVCOL_SPC_IPT_C2:
+    case AVCOL_SPC_YCGCO_RE:
+    case AVCOL_SPC_YCGCO_RO:
+        return AVERROR(ENOTSUP);
+
+    case AVCOL_SPC_RESERVED:
+    case AVCOL_SPC_NB:
+        return AVERROR(EINVAL);
+    }
+
+    RET(ff_sws_op_list_append(ops, &(SwsOp) {
+        .type = type,
+        .op   = SWS_OP_LINEAR,
+        .lin  = fmt_encode_range(fmt, incomplete),
+    }));
+
+    if (!(fmt.desc->flags & AV_PIX_FMT_FLAG_FLOAT)) {
+        SwsConst range = {0};
+
+        const bool is_ya = fmt.desc->nb_components == 2;
+        for (int i = 0; i < fmt.desc->nb_components; i++) {
+            /* Clamp to legal pixel range */
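+            /* (for two-component YA formats, component 1 is alpha, q4[3]) */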
+            const int idx = i * (is_ya ? 3 : 1);
+            range.q4[idx] = Q((1 << fmt.desc->comp[i].depth) - 1);
+        }
+
+        RET(fmt_dither(ctx, ops, type, fmt));
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_MAX,
+            .type = type,
+            .c.q4 = { Q0, Q0, Q0, Q0 },
+        }));
+
+        RET(ff_sws_op_list_append(ops, &(SwsOp) {
+            .op   = SWS_OP_MIN,
+            .type = type,
+            .c    = range,
+        }));
+    }
+
+    return ff_sws_op_list_append(ops, &(SwsOp) {
+        .type       = type,
+        .op         = SWS_OP_CONVERT,
+        .convert.to = fmt_pixel_type(fmt.format),
+    });
+}
diff --git a/libswscale/format.h b/libswscale/format.h
index be92038f4f..e6a1fd7116 100644
--- a/libswscale/format.h
+++ b/libswscale/format.h
@@ -148,4 +148,27 @@ int ff_test_fmt(const SwsFormat *fmt, int output);
 /* Returns true if the formats are incomplete, false otherwise */
 bool ff_infer_colors(SwsColor *src, SwsColor *dst);
 
+typedef struct SwsOpList SwsOpList;
+typedef enum SwsPixelType SwsPixelType;
+
+/**
+ * Append a set of operations for decoding/encoding raw pixels. This will
+ * handle input read/write, swizzling, shifting and byte swapping.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_decode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt);
+int ff_sws_encode_pixfmt(SwsOpList *ops, enum AVPixelFormat fmt);
+
+/**
+ * Append a set of operations for transforming decoded pixel values to/from
+ * normalized RGB in the specified gamut and pixel type.
+ *
+ * Returns 0 on success, or a negative error code on failure.
+ */
+int ff_sws_decode_colors(SwsContext *ctx, SwsPixelType type, SwsOpList *ops,
+                         const SwsFormat fmt, bool *incomplete);
+int ff_sws_encode_colors(SwsContext *ctx, SwsPixelType type, SwsOpList *ops,
+                         const SwsFormat fmt, bool *incomplete);
+
 #endif /* SWSCALE_FORMAT_H */
-- 
2.49.0

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] [PATCH v3 17/17] swscale/graph: allow experimental use of new format handler
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (15 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 16/17] swscale/format: add new format decode/encode logic Niklas Haas
@ 2025-05-27  7:55 ` Niklas Haas
  2025-05-27  8:29 ` [FFmpeg-devel] (no subject) Kieran Kunhya via ffmpeg-devel
  17 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  7:55 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Niklas Haas

From: Niklas Haas <git@haasn.dev>

\o/
---
 libswscale/graph.c | 84 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 82 insertions(+), 2 deletions(-)

diff --git a/libswscale/graph.c b/libswscale/graph.c
index dc7784aa49..24930e7627 100644
--- a/libswscale/graph.c
+++ b/libswscale/graph.c
@@ -34,6 +34,7 @@
 #include "lut3d.h"
 #include "swscale_internal.h"
 #include "graph.h"
+#include "ops.h"
 
 static int pass_alloc_output(SwsPass *pass)
 {
@@ -453,6 +454,85 @@ static int add_legacy_sws_pass(SwsGraph *graph, SwsFormat src, SwsFormat dst,
     return 0;
 }
 
+/*********************
+ * Format conversion *
+ *********************/
+
+static int add_convert_pass(SwsGraph *graph, SwsFormat src, SwsFormat dst,
+                            SwsPass *input, SwsPass **output)
+{
+    const SwsPixelType type = SWS_PIXEL_F32;
+
+    SwsContext *ctx = graph->ctx;
+    SwsOpList *ops = NULL;
+    int ret = AVERROR(ENOTSUP);
+
+    /* Mark the entire new ops infrastructure as experimental for now */
+    if (!(ctx->flags & SWS_UNSTABLE))
+        goto fail;
+
+    /* The new format conversion layer cannot scale for now */
+    if (src.width != dst.width || src.height != dst.height ||
+        src.desc->log2_chroma_h || src.desc->log2_chroma_w ||
+        dst.desc->log2_chroma_h || dst.desc->log2_chroma_w)
+        goto fail;
+
+    /* The new code does not yet support alpha blending */
+    if (src.desc->flags & AV_PIX_FMT_FLAG_ALPHA &&
+        ctx->alpha_blend != SWS_ALPHA_BLEND_NONE)
+        goto fail;
+
+    ops = ff_sws_op_list_alloc();
+    if (!ops)
+        return AVERROR(ENOMEM);
+    ops->src = src;
+    ops->dst = dst;
+
+    ret = ff_sws_decode_pixfmt(ops, src.format);
+    if (ret < 0)
+        goto fail;
+    ret = ff_sws_decode_colors(ctx, type, ops, src, &graph->incomplete);
+    if (ret < 0)
+        goto fail;
+    ret = ff_sws_encode_colors(ctx, type, ops, dst, &graph->incomplete);
+    if (ret < 0)
+        goto fail;
+    ret = ff_sws_encode_pixfmt(ops, dst.format);
+    if (ret < 0)
+        goto fail;
+
+    av_log(ctx, AV_LOG_VERBOSE, "Conversion pass for %s -> %s:\n",
+           av_get_pix_fmt_name(src.format), av_get_pix_fmt_name(dst.format));
+
+    av_log(ctx, AV_LOG_DEBUG, "Unoptimized operation list:\n");
+    ff_sws_op_list_print(ctx, AV_LOG_DEBUG, ops);
+    av_log(ctx, AV_LOG_DEBUG, "Optimized operation list:\n");
+
+    ff_sws_op_list_optimize(ops);
+    if (ops->num_ops == 0) {
+        av_log(ctx, AV_LOG_VERBOSE, "  optimized into memcpy\n");
+        ff_sws_op_list_free(&ops);
+        *output = input;
+        return 0;
+    }
+
+    ff_sws_op_list_print(ctx, AV_LOG_VERBOSE, ops);
+
+    ret = ff_sws_compile_pass(graph, ops, 0, dst, input, output);
+    if (ret < 0)
+        goto fail;
+
+    ret = 0;
+    /* fall through */
+
+fail:
+    ff_sws_op_list_free(&ops);
+    if (ret == AVERROR(ENOTSUP))
+        return add_legacy_sws_pass(graph, src, dst, input, output);
+    return ret;
+}
+
+
 /**************************
  * Gamut and tone mapping *
  **************************/
@@ -522,7 +602,7 @@ static int adapt_colors(SwsGraph *graph, SwsFormat src, SwsFormat dst,
     if (fmt_in != src.format) {
         SwsFormat tmp = src;
         tmp.format = fmt_in;
-        ret = add_legacy_sws_pass(graph, src, tmp, input, &input);
+        ret = add_convert_pass(graph, src, tmp, input, &input);
         if (ret < 0)
             return ret;
     }
@@ -564,7 +644,7 @@ static int init_passes(SwsGraph *graph)
     src.color  = dst.color;
 
     if (!ff_fmt_equal(&src, &dst)) {
-        ret = add_legacy_sws_pass(graph, src, dst, pass, &pass);
+        ret = add_convert_pass(graph, src, dst, pass, &pass);
         if (ret < 0)
             return ret;
     }
-- 
2.49.0

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [FFmpeg-devel] [PATCH v3 04/17] tests/checkasm: generalize DEF_CHECKASM_CHECK_FUNC to floats
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 04/17] tests/checkasm: generalize DEF_CHECKASM_CHECK_FUNC to floats Niklas Haas
@ 2025-05-27  8:24   ` Martin Storsjö
  0 siblings, 0 replies; 34+ messages in thread
From: Martin Storsjö @ 2025-05-27  8:24 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Niklas Haas

On Tue, 27 May 2025, Niklas Haas wrote:

> From: Niklas Haas <git@haasn.dev>
>
> We split the standard macro into its body (implementation) and declaration,
> and use a macro argument in place of the raw `memcmp` call, with the major
> difference that we now take the number of pixels to compare instead of the
> number of bytes (to match the signature of float_near_ulp_array).
> ---
> tests/checkasm/checkasm.c | 52 ++++++++++++++++++++++++++-------------
> tests/checkasm/checkasm.h |  7 ++++++
> 2 files changed, 42 insertions(+), 17 deletions(-)
>
> diff --git a/tests/checkasm/checkasm.c b/tests/checkasm/checkasm.c
> index 71d1e5766c..f393a0cb96 100644
> --- a/tests/checkasm/checkasm.c
> +++ b/tests/checkasm/checkasm.c
> @@ -1187,14 +1187,8 @@ static int check_err(const char *file, int line,
>     return 0;
> }
>
> -#define DEF_CHECKASM_CHECK_FUNC(type, fmt) \
> -int checkasm_check_##type(const char *file, int line, \
> -                          const type *buf1, ptrdiff_t stride1, \
> -                          const type *buf2, ptrdiff_t stride2, \
> -                          int w, int h, const char *name, \
> -                          int align_w, int align_h, \
> -                          int padding) \
> -{ \
> +#define DEF_CHECKASM_CHECK_BODY(compare, type, fmt) \
> +do { \
>     int64_t aligned_w = (w - 1LL + align_w) & ~(align_w - 1); \
>     int64_t aligned_h = (h - 1LL + align_h) & ~(align_h - 1); \
>     int err = 0; \
> @@ -1204,7 +1198,7 @@ int checkasm_check_##type(const char *file, int line, \
>     stride1 /= sizeof(*buf1); \
>     stride2 /= sizeof(*buf2); \
>     for (y = 0; y < h; y++) \
> -        if (memcmp(&buf1[y*stride1], &buf2[y*stride2], w*sizeof(*buf1))) \
> +        if (!compare(&buf1[y*stride1], &buf2[y*stride2], w)) \
>             break; \
>     if (y != h) { \
>         if (check_err(file, line, name, w, h, &err)) \
> @@ -1226,38 +1220,50 @@ int checkasm_check_##type(const char *file, int line, \
>         buf2 -= h*stride2; \
>     } \
>     for (y = -padding; y < 0; y++) \
> -        if (memcmp(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
> -                   (w + 2*padding)*sizeof(*buf1))) { \
> +        if (!compare(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
> +                     w + 2*padding)) { \
>             if (check_err(file, line, name, w, h, &err)) \
>                 return 1; \
>             fprintf(stderr, " overwrite above\n"); \
>             break; \
>         } \
>     for (y = aligned_h; y < aligned_h + padding; y++) \
> -        if (memcmp(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
> -                   (w + 2*padding)*sizeof(*buf1))) { \
> +        if (!compare(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
> +                     w + 2*padding)) { \
>             if (check_err(file, line, name, w, h, &err)) \
>                 return 1; \
>             fprintf(stderr, " overwrite below\n"); \
>             break; \
>         } \
>     for (y = 0; y < h; y++) \
> -        if (memcmp(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
> -                   padding*sizeof(*buf1))) { \
> +        if (!compare(&buf1[y*stride1 - padding], &buf2[y*stride2 - padding], \
> +                     padding)) { \
>             if (check_err(file, line, name, w, h, &err)) \
>                 return 1; \
>             fprintf(stderr, " overwrite left\n"); \
>             break; \
>         } \
>     for (y = 0; y < h; y++) \
> -        if (memcmp(&buf1[y*stride1 + aligned_w], &buf2[y*stride2 + aligned_w], \
> -                   padding*sizeof(*buf1))) { \
> +        if (!compare(&buf1[y*stride1 + aligned_w], &buf2[y*stride2 + aligned_w], \
> +                     padding)) { \
>             if (check_err(file, line, name, w, h, &err)) \
>                 return 1; \
>             fprintf(stderr, " overwrite right\n"); \
>             break; \
>         } \
>     return err; \
> +} while (0)
> +
> +#define cmp_int(a, b, len) (!memcmp(a, b, (len) * sizeof(*(a))))
> +#define DEF_CHECKASM_CHECK_FUNC(type, fmt) \
> +int checkasm_check_##type(const char *file, int line, \
> +                          const type *buf1, ptrdiff_t stride1, \
> +                          const type *buf2, ptrdiff_t stride2, \
> +                          int w, int h, const char *name, \
> +                          int align_w, int align_h, \
> +                          int padding) \
> +{ \
> +    DEF_CHECKASM_CHECK_BODY(cmp_int, type, fmt); \
> }
>
> DEF_CHECKASM_CHECK_FUNC(uint8_t,  "%02x")
> @@ -1265,3 +1271,15 @@ DEF_CHECKASM_CHECK_FUNC(uint16_t, "%04x")
> DEF_CHECKASM_CHECK_FUNC(uint32_t, "%08x")
> DEF_CHECKASM_CHECK_FUNC(int16_t,  "%6d")
> DEF_CHECKASM_CHECK_FUNC(int32_t,  "%9d")
> +
> +int checkasm_check_float_ulp(const char *file, int line,
> +                             const float *buf1, ptrdiff_t stride1,
> +                             const float *buf2, ptrdiff_t stride2,
> +                             int w, int h, const char *name,
> +                             unsigned max_ulp, int align_w, int align_h,
> +                             int padding)
> +{
> +    #define cmp_float(a, b, len) float_near_ulp_array(a, b, max_ulp, len)
> +    DEF_CHECKASM_CHECK_BODY(cmp_float, float, "%g");
> +    #undef cmp_float
> +}
> diff --git a/tests/checkasm/checkasm.h b/tests/checkasm/checkasm.h
> index ad7ed10613..ec01bd6207 100644
> --- a/tests/checkasm/checkasm.h
> +++ b/tests/checkasm/checkasm.h
> @@ -423,6 +423,13 @@ DECL_CHECKASM_CHECK_FUNC(uint32_t);
> DECL_CHECKASM_CHECK_FUNC(int16_t);
> DECL_CHECKASM_CHECK_FUNC(int32_t);
>
> +int checkasm_check_float_ulp(const char *file, int line,
> +                             const float *buf1, ptrdiff_t stride1,
> +                             const float *buf2, ptrdiff_t stride2,
> +                             int w, int h, const char *name,
> +                             unsigned max_elp, int align_w, int align_h,

Typo - max_ulp?


Other than that, thanks, this looks reasonable!

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [FFmpeg-devel] [PATCH v3 15/17] tests/checkasm: add checkasm tests for swscale ops
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 15/17] tests/checkasm: add checkasm tests for swscale ops Niklas Haas
@ 2025-05-27  8:25   ` Martin Storsjö
  0 siblings, 0 replies; 34+ messages in thread
From: Martin Storsjö @ 2025-05-27  8:25 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Niklas Haas

On Tue, 27 May 2025, Niklas Haas wrote:

> From: Niklas Haas <git@haasn.dev>
>
> Because of the lack of an external ABI on low-level kernels, we cannot
> directly test internal functions. Instead, we construct a minimal op chain
> consisting of a read, the op to be tested, and a write.
>
> The bigger complication arises from the fact that the backend may generate
> arbitrary internal state that needs to be passed back to the implementation,
> which means we cannot directly call `func_ref` on the generated chain. To get
> around this, always compile the op chain twice - once using the backend to be
> tested, and once using the reference C backend.
>
> The actual entry point may also just be a shared wrapper, so we need to
> be very careful to run checkasm_check_func() on a pseudo-pointer that will
> actually be unique for each combination of backend and active CPU flags.
> ---
> tests/checkasm/Makefile   |   8 +-
> tests/checkasm/checkasm.c |   1 +
> tests/checkasm/checkasm.h |   1 +
> tests/checkasm/sw_ops.c   | 776 ++++++++++++++++++++++++++++++++++++++
> 4 files changed, 785 insertions(+), 1 deletion(-)
> create mode 100644 tests/checkasm/sw_ops.c
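
To make the quoted approach concrete, here is a rough sketch of such a
minimal chain, using the ops API names that appear earlier in this series
(SwsOp, ff_sws_op_list_alloc/append, SWS_OP_READ/WRITE, SWS_PIXEL_F32);
the read/write op fields and the compile/compare step are assumptions,
not the actual contents of sw_ops.c:

/* Sketch only: 'op_under_test' is the SwsOp being verified, and the
 * .rw field names are guessed. Error checks elided. */
SwsOpList *ops = ff_sws_op_list_alloc();

ff_sws_op_list_append(ops, &(SwsOp) {      /* load input pixels   */
    .op   = SWS_OP_READ,
    .type = SWS_PIXEL_F32,
    .rw   = { .elems = 3 },                /* assumed field names */
});
ff_sws_op_list_append(ops, &op_under_test);
ff_sws_op_list_append(ops, &(SwsOp) {      /* store the result    */
    .op   = SWS_OP_WRITE,
    .type = SWS_PIXEL_F32,
    .rw   = { .elems = 3 },
});
/* ...then compile 'ops' once with the backend under test and once with
 * the reference C backend, run both on identical inputs, and compare
 * the outputs. */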

When adding a new checkasm test group like this, add it to 
tests/fate/checkasm.mak too, otherwise it is missed by "make fate" and 
"make fate-checkasm", which run all the individual test groups separately.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [FFmpeg-devel] (no subject)
  2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
                   ` (16 preceding siblings ...)
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 17/17] swscale/graph: allow experimental use of new format handler Niklas Haas
@ 2025-05-27  8:29 ` Kieran Kunhya via ffmpeg-devel
  2025-05-27  8:51   ` Niklas Haas
  17 siblings, 1 reply; 34+ messages in thread
From: Kieran Kunhya via ffmpeg-devel @ 2025-05-27  8:29 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Kieran Kunhya

>
> - adding vzeroupper: ~12%
>

This seems quite suspicious.
Can you explain what you are doing here?

Kieran

>
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [FFmpeg-devel] (no subject)
  2025-05-27  8:29 ` [FFmpeg-devel] (no subject) Kieran Kunhya via ffmpeg-devel
@ 2025-05-27  8:51   ` Niklas Haas
  0 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-27  8:51 UTC (permalink / raw)
  To: Kieran Kunhya via ffmpeg-devel,
	FFmpeg development discussions and patches
  Cc: Kieran Kunhya

On Tue, 27 May 2025 16:29:20 +0800 Kieran Kunhya via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
> >
> > - adding vzeroupper: ~12%
> >
>
> This seems quite suspicious.
> Can you explain what you are doing here?

I added a vzeroupper call whenever the code transitions from AVX to SSE. For
example:

Conversion pass for yuv444p -> rgba:
  [ u8 XXXX -> +++X] SWS_OP_READ         : 3 elem(s) planar >> 0
  [ u8 ...X -> +++X] SWS_OP_CONVERT      : u8 -> f32
  [f32 ...X -> ...X] SWS_OP_LINEAR       : matrix3+off3 [[85/73 0 1.596027 0 -222.921566] [85/73 -0.391762 -0.812968 0 135.575295] [85/73 2.017232 0 0 -276.835851] [0 0 0 1 0]]
  [f32 ...X -> ...X] SWS_OP_DITHER       : 16x16 matrix
  [f32 ...X -> ...X] SWS_OP_MAX          : {0 0 0 0} <= x
  [f32 ...X -> ...X] SWS_OP_MIN          : x <= {255 255 255 255}
  [f32 ...X -> +++X] SWS_OP_CONVERT      : f32 -> u8
                     ^-------- vzeroupper call added here
  [ u8 ...X -> ++++] SWS_OP_CLEAR        : {_ _ _ 255}
  [ u8 .... -> ++++] SWS_OP_WRITE        : 4 elem(s) packed >> 0

yuv444p 1920x1080 -> rgba 1920x1080, flags=0x100000 dither=1, SSIM {Y=1.000000 U=0.999999 V=0.999997 A=1.000000}
  time=911 us, ref=4257 us, speedup=4.669x faster

With the vzeroupper commented out:

yuv444p 1920x1080 -> rgba 1920x1080, flags=0x100000 dither=1, SSIM {Y=1.000000 U=0.999999 V=0.999997 A=1.000000}
  time=1361 us, ref=4265 us, speedup=3.133x faster

In most other cases it does not matter, but in cases like this one,
omitting the vzeroupper call introduces false dependencies.

Another example is grayf32 -> yuv444p, which goes from 268 us to 296 us if I
remove the vzeroupper calls. In general, anything involving switching between
32-bit floats (512 bits per block) and 8-bit integers (128 bits per block)
sees an effect.
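
For readers unfamiliar with the issue: on many x86 CPUs, legacy-encoded
128-bit SSE instructions executed while the upper halves of the YMM
registers are dirty pick up a false dependency on those upper halves, so
hand-written kernels that drop from 256-bit AVX work down to 128-bit work
need an explicit vzeroupper at the transition. Below is a rough
intrinsics illustration of the pattern -- this is not the actual swscale
kernel (which is assembly), and a C compiler built with AVX enabled
would normally manage this itself; it only mirrors what the generated op
chains must do by hand:

#include <immintrin.h>
#include <stdint.h>

/* Sketch: 256-bit float work followed by 128-bit integer packing,
 * roughly the f32 -> u8 step from the op list above. */
static void f32_to_u8_line(const float *src, uint8_t *dst, int w)
{
    for (int i = 0; i + 8 <= w; i += 8) {
        __m256  vf = _mm256_loadu_ps(src + i);        /* AVX, 256-bit */
        __m256i vi = _mm256_cvtps_epi32(vf);          /* f32 -> i32   */
        __m128i lo = _mm256_castsi256_si128(vi);
        __m128i hi = _mm256_extracti128_si256(vi, 1); /* AVX2         */

        _mm256_zeroupper();  /* break the dependency on dirty YMM upper
                                halves before the 128-bit section     */

        __m128i p16 = _mm_packus_epi32(lo, hi);       /* i32 -> u16   */
        __m128i p8  = _mm_packus_epi16(p16, p16);     /* u16 -> u8    */
        _mm_storel_epi64((__m128i *)(dst + i), p8);
    }
}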

>
> Kieran
>
> >
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [FFmpeg-devel] [PATCH v3 14/17] swscale/x86: add SIMD backend
  2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 14/17] swscale/x86: add SIMD backend Niklas Haas
@ 2025-05-30  2:23   ` Michael Niedermayer
  2025-05-30 10:34     ` Niklas Haas
  0 siblings, 1 reply; 34+ messages in thread
From: Michael Niedermayer @ 2025-05-30  2:23 UTC (permalink / raw)
  To: FFmpeg development discussions and patches


[-- Attachment #1.1: Type: text/plain, Size: 3314 bytes --]

On Tue, May 27, 2025 at 09:55:33AM +0200, Niklas Haas wrote:
> From: Niklas Haas <git@haasn.dev>
> 
> This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
> floating point operations. While this is not yet 100% coverage, it's good
> enough for the vast majority of formats out there.
> 
> Of special note is the packed shuffle fast path, which uses pshufb at vector
> sizes up to AVX512.
> ---
>  libswscale/ops.c              |    4 +
>  libswscale/x86/Makefile       |    3 +
>  libswscale/x86/ops.c          |  722 +++++++++++++++++++++++
>  libswscale/x86/ops_common.asm |  305 ++++++++++
>  libswscale/x86/ops_float.asm  |  389 ++++++++++++
>  libswscale/x86/ops_int.asm    | 1049 +++++++++++++++++++++++++++++++++
>  6 files changed, 2472 insertions(+)
>  create mode 100644 libswscale/x86/ops.c
>  create mode 100644 libswscale/x86/ops_common.asm
>  create mode 100644 libswscale/x86/ops_float.asm
>  create mode 100644 libswscale/x86/ops_int.asm

seems to break on x86-32 linux

...
src/libswscale/x86/ops_float.asm:389: error: symbol `m9' undefined
src/libswscale/x86/ops_float.asm:378: ... from macro `linear_fns' defined here
src/libswscale/x86/ops_float.asm:339: ... from macro `linear_mask' defined here
src/libswscale/x86/ops_float.asm:330: ... from macro `linear_inner' defined here
src/libswscale/x86/ops_common.asm:296: ... from macro `IF' defined here
src//libavutil/x86/x86inc.asm:1639: ... from macro `movdqa' defined here
src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
src//libavutil/x86/x86inc.asm:1996: ... from macro `vmovdqa' defined here
src/libswscale/x86/ops_float.asm:389: error: symbol `m10' undefined
src/libswscale/x86/ops_float.asm:378: ... from macro `linear_fns' defined here
src/libswscale/x86/ops_float.asm:339: ... from macro `linear_mask' defined here
src/libswscale/x86/ops_float.asm:331: ... from macro `linear_inner' defined here
src/libswscale/x86/ops_common.asm:296: ... from macro `IF' defined here
src//libavutil/x86/x86inc.asm:1639: ... from macro `movdqa' defined here
src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
src//libavutil/x86/x86inc.asm:1996: ... from macro `vmovdqa' defined here
src/libswscale/x86/ops_float.asm:389: error: symbol `m11' undefined
src/libswscale/x86/ops_float.asm:378: ... from macro `linear_fns' defined here
src/libswscale/x86/ops_float.asm:339: ... from macro `linear_mask' defined here
src/libswscale/x86/ops_float.asm:332: ... from macro `linear_inner' defined here
src/libswscale/x86/ops_common.asm:296: ... from macro `IF' defined here
src//libavutil/x86/x86inc.asm:1639: ... from macro `movdqa' defined here
src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
src//libavutil/x86/x86inc.asm:1996: ... from macro `vmovdqa' defined here
make: *** [src/ffbuild/common.mak:103: libswscale/x86/ops_float.o] Error 1



[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

Old school: Use the lowest level language in which you can solve the problem
            conveniently.
New school: Use the highest level language in which the latest supercomputer
            can solve the problem without the user falling asleep waiting.

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]

[-- Attachment #2: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [FFmpeg-devel] [PATCH v3 14/17] swscale/x86: add SIMD backend
  2025-05-30  2:23   ` Michael Niedermayer
@ 2025-05-30 10:34     ` Niklas Haas
  0 siblings, 0 replies; 34+ messages in thread
From: Niklas Haas @ 2025-05-30 10:34 UTC (permalink / raw)
  To: FFmpeg development discussions and patches

On Fri, 30 May 2025 04:23:12 +0200 Michael Niedermayer <michael@niedermayer.cc> wrote:
> On Tue, May 27, 2025 at 09:55:33AM +0200, Niklas Haas wrote:
> > From: Niklas Haas <git@haasn.dev>
> >
> > This covers most 8-bit and 16-bit ops, and some 32-bit ops. It also covers all
> > floating point operations. While this is not yet 100% coverage, it's good
> > enough for the vast majority of formats out there.
> >
> > Of special note is the packed shuffle fast path, which uses pshufb at vector
> > sizes up to AVX512.
> > ---
> >  libswscale/ops.c              |    4 +
> >  libswscale/x86/Makefile       |    3 +
> >  libswscale/x86/ops.c          |  722 +++++++++++++++++++++++
> >  libswscale/x86/ops_common.asm |  305 ++++++++++
> >  libswscale/x86/ops_float.asm  |  389 ++++++++++++
> >  libswscale/x86/ops_int.asm    | 1049 +++++++++++++++++++++++++++++++++
> >  6 files changed, 2472 insertions(+)
> >  create mode 100644 libswscale/x86/ops.c
> >  create mode 100644 libswscale/x86/ops_common.asm
> >  create mode 100644 libswscale/x86/ops_float.asm
> >  create mode 100644 libswscale/x86/ops_int.asm
>
> seems to break on x86-32 linux

There was no intent to support x86-32 as part of this series. I will fix it by
adding the appropriate build condition.
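
For reference, a minimal sketch of what such a build condition could look
like on the C side -- ARCH_X86_64 is the usual configure-provided macro,
and whether the final patch guards the Makefile, the .asm files (via the
matching %if ARCH_X86_64), or both, is an open choice here:

/* libswscale/x86/ops.c -- hypothetical guard: the kernels assume the
 * x86-64 register file (xmm8 and up, hence the m9/m10/m11 errors on
 * 32-bit), so compile the backend out entirely elsewhere. */
#include "config.h"

#if ARCH_X86_64
/* ... full SIMD backend ... */
#else
/* x86-32: the generic C backend is used instead */
#endif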

>
> ...
> src/libswscale/x86/ops_float.asm:389: error: symbol `m9' undefined
> src/libswscale/x86/ops_float.asm:378: ... from macro `linear_fns' defined here
> src/libswscale/x86/ops_float.asm:339: ... from macro `linear_mask' defined here
> src/libswscale/x86/ops_float.asm:330: ... from macro `linear_inner' defined here
> src/libswscale/x86/ops_common.asm:296: ... from macro `IF' defined here
> src//libavutil/x86/x86inc.asm:1639: ... from macro `movdqa' defined here
> src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
> src//libavutil/x86/x86inc.asm:1996: ... from macro `vmovdqa' defined here
> src/libswscale/x86/ops_float.asm:389: error: symbol `m10' undefined
> src/libswscale/x86/ops_float.asm:378: ... from macro `linear_fns' defined here
> src/libswscale/x86/ops_float.asm:339: ... from macro `linear_mask' defined here
> src/libswscale/x86/ops_float.asm:331: ... from macro `linear_inner' defined here
> src/libswscale/x86/ops_common.asm:296: ... from macro `IF' defined here
> src//libavutil/x86/x86inc.asm:1639: ... from macro `movdqa' defined here
> src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
> src//libavutil/x86/x86inc.asm:1996: ... from macro `vmovdqa' defined here
> src/libswscale/x86/ops_float.asm:389: error: symbol `m11' undefined
> src/libswscale/x86/ops_float.asm:378: ... from macro `linear_fns' defined here
> src/libswscale/x86/ops_float.asm:339: ... from macro `linear_mask' defined here
> src/libswscale/x86/ops_float.asm:332: ... from macro `linear_inner' defined here
> src/libswscale/x86/ops_common.asm:296: ... from macro `IF' defined here
> src//libavutil/x86/x86inc.asm:1639: ... from macro `movdqa' defined here
> src//libavutil/x86/x86inc.asm:1501: ... from macro `RUN_AVX_INSTR' defined here
> src//libavutil/x86/x86inc.asm:1996: ... from macro `vmovdqa' defined here
> make: *** [src/ffbuild/common.mak:103: libswscale/x86/ops_float.o] Error 1
>
>
>
> [...]
> --
> Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
>
> Old school: Use the lowest level language in which you can solve the problem
>             conveniently.
> New school: Use the highest level language in which the latest supercomputer
>             can solve the problem without the user falling asleep waiting.
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
@ 2024-08-07 15:58 cyfdel-at-hotmail.com
  0 siblings, 0 replies; 34+ messages in thread
From: cyfdel-at-hotmail.com @ 2024-08-07 15:58 UTC (permalink / raw)
  To: ffmpeg-devel


What the patch does:
        fix gdigrab: capturing a window by hwnd shows "Invalid window
        handle x, must be a valid integer" although a valid integer was
        given

why:
        at line 284 of libavdevice/gdigrab.c, one of the conditions that
        makes the check fail is p[0] == '\0'. For an integer-only string,
        strtoull() leaves p pointing at the terminating NUL, so p[0]
        equals '\0' and the check wrongly rejects the input; a string
        with trailing non-integer characters leaves p[0] non-NUL and
        passes the check

how:
        changing p[0] == '\0' to p[0] != '\0' fixes the check, with no
        side effects

reproduce and verify:
        a simple command: ffmpeg -f gdigrab -i hwnd=12345
        * although a workaround command currently works:
        *       ffmpeg -f gdigrab -i hwnd=12345x  (x can be any char)
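
A minimal standalone sketch of the corrected check, modeled on the
description above (the exact surrounding code in gdigrab.c may differ):

#include <stdio.h>
#include <stdlib.h>

/* strtoull() sets endptr to the first unconsumed character, so for a
 * valid integer-only string it ends up pointing at the terminating NUL.
 * Only trailing garbage should be rejected. */
static int parse_hwnd(const char *name, unsigned long long *out)
{
    char *p;
    unsigned long long v = strtoull(name, &p, 0);

    if (p == name || p[0] != '\0') {  /* the bug was p[0] == '\0' here */
        fprintf(stderr,
                "Invalid window handle '%s', must be a valid integer\n",
                name);
        return -1;
    }
    *out = v;
    return 0;
}

With this check, parse_hwnd("12345", &h) succeeds while
parse_hwnd("12345x", &h) fails, which is exactly the behavior the report
asks for.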
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
@ 2024-04-18  9:42 pengxu
  0 siblings, 0 replies; 34+ messages in thread
From: pengxu @ 2024-04-18  9:42 UTC (permalink / raw)
  To: ffmpeg-devel

v2: Fixed fate errors in [Patch 2/2]
v3: Fixed fate errors in [Patch 2/2] 
Subject: [PATCH V3][Loongarch] Optimize aac decode/encode for Loongarch by LSX


_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
@ 2024-04-18  7:36 pengxu
  0 siblings, 0 replies; 34+ messages in thread
From: pengxu @ 2024-04-18  7:36 UTC (permalink / raw)
  To: ffmpeg-devel

v2: Fixed build errors in [PATCH 2/2]
 
Subject: [PATCH V2][Loongarch] Optimize aac decode/encode for Loongarch by LSX


_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
@ 2023-10-14  8:40 Logan.Lyu
  0 siblings, 0 replies; 34+ messages in thread
From: Logan.Lyu @ 2023-10-14  8:40 UTC (permalink / raw)
  To: ffmpeg-devel

[-- Attachment #1: Type: text/plain, Size: 21092 bytes --]

checkasm bench:
put_hevc_qpel_hv4_8_c: 422.1
put_hevc_qpel_hv4_8_i8mm: 101.6
put_hevc_qpel_hv6_8_c: 756.4
put_hevc_qpel_hv6_8_i8mm: 225.9
put_hevc_qpel_hv8_8_c: 1189.9
put_hevc_qpel_hv8_8_i8mm: 296.6
put_hevc_qpel_hv12_8_c: 2407.4
put_hevc_qpel_hv12_8_i8mm: 552.4
put_hevc_qpel_hv16_8_c: 4021.4
put_hevc_qpel_hv16_8_i8mm: 886.6
put_hevc_qpel_hv24_8_c: 8992.1
put_hevc_qpel_hv24_8_i8mm: 1968.9
put_hevc_qpel_hv32_8_c: 15197.9
put_hevc_qpel_hv32_8_i8mm: 3209.4
put_hevc_qpel_hv48_8_c: 32811.1
put_hevc_qpel_hv48_8_i8mm: 7442.1
put_hevc_qpel_hv64_8_c: 58106.1
put_hevc_qpel_hv64_8_i8mm: 12423.9

Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Logan Lyu <Logan.Lyu@myais.com.cn>
---
  libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
  libavcodec/aarch64/hevcdsp_qpel_neon.S    | 397 ++++++++++++++++++++++
  2 files changed, 402 insertions(+)


[-- Attachment #2: 0004-lavc-aarch64-new-optimization-for-8-bit-hevc_qpel_hv.patch --]
[-- Type: text/plain, Size: 21203 bytes --]

From 6a7f049fd0382c04297fb9cefd9f5ce022abbe5f Mon Sep 17 00:00:00 2001
From: Logan Lyu <Logan.Lyu@myais.com.cn>
Date: Sat, 9 Sep 2023 22:40:51 +0800
Subject: [PATCH 4/4] lavc/aarch64: new optimization for 8-bit hevc_qpel_hv

checkasm bench:
put_hevc_qpel_hv4_8_c: 422.1
put_hevc_qpel_hv4_8_i8mm: 101.6
put_hevc_qpel_hv6_8_c: 756.4
put_hevc_qpel_hv6_8_i8mm: 225.9
put_hevc_qpel_hv8_8_c: 1189.9
put_hevc_qpel_hv8_8_i8mm: 296.6
put_hevc_qpel_hv12_8_c: 2407.4
put_hevc_qpel_hv12_8_i8mm: 552.4
put_hevc_qpel_hv16_8_c: 4021.4
put_hevc_qpel_hv16_8_i8mm: 886.6
put_hevc_qpel_hv24_8_c: 8992.1
put_hevc_qpel_hv24_8_i8mm: 1968.9
put_hevc_qpel_hv32_8_c: 15197.9
put_hevc_qpel_hv32_8_i8mm: 3209.4
put_hevc_qpel_hv48_8_c: 32811.1
put_hevc_qpel_hv48_8_i8mm: 7442.1
put_hevc_qpel_hv64_8_c: 58106.1
put_hevc_qpel_hv64_8_i8mm: 12423.9

Co-Authored-By: J. Dekker <jdek@itanimul.li>
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 397 ++++++++++++++++++++++
 2 files changed, 402 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index f6b4c31d17..7d889efe68 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -208,6 +208,10 @@ NEON8_FNPROTO(qpel_v, (int16_t *dst,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(qpel_hv, (int16_t *dst,
+        const uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width), _i8mm);
+
 NEON8_FNPROTO(qpel_uni_v, (uint8_t *dst,  ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
@@ -335,6 +339,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
             NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h ,_i8mm);
             NEON8_FNASSIGN(c->put_hevc_qpel, 0, 1, qpel_h, _i8mm);
+            NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index eff70d70a4..e4475ba920 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -3070,6 +3070,403 @@ function ff_hevc_put_hevc_qpel_h64_8_neon_i8mm, export=1
         ret
 endfunc
 
+
+function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        mov             x7, #128
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2, lsl #1
+        add             x3, x3, #7
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_qpel_filterh x5, x4
+        ldr             d16, [sp]
+        ldr             d17, [sp, x7]
+        add             sp, sp, x7, lsl #1
+        ldr             d18, [sp]
+        ldr             d19, [sp, x7]
+        add             sp, sp, x7, lsl #1
+        ldr             d20, [sp]
+        ldr             d21, [sp, x7]
+        add             sp, sp, x7, lsl #1
+        ldr             d22, [sp]
+        add             sp, sp, x7
+.macro calc tmp, src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\tmp\().4h}, [sp], x7
+        calc_qpelh      v1, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7, sqshrn
+        subs            w3, w3, #1
+        st1             {v1.4h}, [x0], x7
+.endm
+1:      calc_all
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        mov             x7, #128
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2, lsl #1
+        add             x3, x3, #7
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        mov             x8, #120
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_qpel_filterh x5, x4
+        ldr             q16, [sp]
+        ldr             q17, [sp, x7]
+        add             sp, sp, x7, lsl #1
+        ldr             q18, [sp]
+        ldr             q19, [sp, x7]
+        add             sp, sp, x7, lsl #1
+        ldr             q20, [sp]
+        ldr             q21, [sp, x7]
+        add             sp, sp, x7, lsl #1
+        ldr             q22, [sp]
+        add             sp, sp, x7
+.macro calc tmp, src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\tmp\().8h}, [sp], x7
+        calc_qpelh      v1, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7, sqshrn
+        calc_qpelh2     v1, v2, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7, sqshrn2
+        st1             {v1.4h}, [x0], #8
+        subs            w3, w3, #1
+        st1             {v1.s}[2], [x0], x8
+.endm
+1:      calc_all
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        lsl             x10, x10, #7
+        sub             x1, x1, x2, lsl #1
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        add             x3, x3, #7
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        mov             x7, #128
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_qpel_filterh x5, x4
+        ldr             q16, [sp]
+        ldr             q17, [sp, x7]
+        add             sp, sp, x7, lsl #1
+        ldr             q18, [sp]
+        ldr             q19, [sp, x7]
+        add             sp, sp, x7, lsl #1
+        ldr             q20, [sp]
+        ldr             q21, [sp, x7]
+        add             sp, sp, x7, lsl #1
+        ldr             q22, [sp]
+        add             sp, sp, x7
+.macro calc tmp, src0, src1, src2, src3, src4, src5, src6, src7
+        ld1            {\tmp\().8h}, [sp], x7
+        calc_qpelh      v1, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7, sqshrn
+        calc_qpelh2     v1, v2, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7, sqshrn2
+        subs            w3, w3, #1
+        st1            {v1.8h}, [x0], x7
+.endm
+1:      calc_all
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        lsl             x10, x10, #7
+        sub             x1, x1, x2, lsl #1
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        add             x3, x3, #7
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        mov             x7, #128
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_qpel_filterh x5, x4
+        mov             x8, #112
+        ld1             {v16.8h, v17.8h}, [sp], x7
+        ld1             {v18.8h, v19.8h}, [sp], x7
+        ld1             {v20.8h, v21.8h}, [sp], x7
+        ld1             {v22.8h, v23.8h}, [sp], x7
+        ld1             {v24.8h, v25.8h}, [sp], x7
+        ld1             {v26.8h, v27.8h}, [sp], x7
+        ld1             {v28.8h, v29.8h}, [sp], x7
+.macro calc tmp0, tmp1, src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15
+        ld1             {\tmp0\().8h, \tmp1\().8h}, [sp], x7
+        calc_qpelh      v1,     \src0,  \src1, \src2,  \src3,  \src4,  \src5,  \src6,  \src7, sqshrn
+        calc_qpelh2     v1, v2, \src0, \src1,  \src2,  \src3,  \src4,  \src5,  \src6,  \src7, sqshrn2
+        calc_qpelh      v2,     \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15, sqshrn
+        st1             {v1.8h}, [x0], #16
+        subs            w3, w3, #1
+        st1             {v2.4h}, [x0], x8
+.endm
+1:      calc_all2
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv16_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        lsl             x10, x10, #7
+        sub             x1, x1, x2, lsl #1
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x3, x3, #7
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        mov             x7, #128
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_qpel_filterh x5, x4
+        ld1             {v16.8h, v17.8h}, [sp], x7
+        ld1             {v18.8h, v19.8h}, [sp], x7
+        ld1             {v20.8h, v21.8h}, [sp], x7
+        ld1             {v22.8h, v23.8h}, [sp], x7
+        ld1             {v24.8h, v25.8h}, [sp], x7
+        ld1             {v26.8h, v27.8h}, [sp], x7
+        ld1             {v28.8h, v29.8h}, [sp], x7
+.macro calc tmp0, tmp1, src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15
+        ld1             {\tmp0\().8h, \tmp1\().8h}, [sp], x7
+        calc_qpelh      v1,     \src0,  \src1, \src2,  \src3,  \src4,  \src5,  \src6,  \src7, sqshrn
+        calc_qpelh2     v1, v2, \src0, \src1,  \src2,  \src3,  \src4,  \src5,  \src6,  \src7, sqshrn2
+        calc_qpelh      v2,     \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15, sqshrn
+        calc_qpelh2     v2, v3, \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15, sqshrn2
+        subs            w3, w3, #1
+        st1             {v1.8h, v2.8h}, [x0], x7
+.endm
+1:      calc_all2
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm, export=1
+        sub             sp, sp, #32
+        st1             {v8.8b-v11.8b}, [sp]
+        sub             x1, x1, x2, lsl #1
+        sub             sp, sp, #32
+        add             w10, w3, #7
+        st1             {v12.8b-v15.8b}, [sp]
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        add             x3, x3, #7
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h24_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        mov             x7, #128
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_qpel_filterh x5, x4
+        ld1             {v8.8h-v10.8h}, [sp], x7
+        ld1             {v11.8h-v13.8h}, [sp], x7
+        ld1             {v14.8h-v16.8h}, [sp], x7
+        ld1             {v17.8h-v19.8h}, [sp], x7
+        ld1             {v20.8h-v22.8h}, [sp], x7
+        ld1             {v23.8h-v25.8h}, [sp], x7
+        ld1             {v26.8h-v28.8h}, [sp], x7
+1:      ld1             {v29.8h-v31.8h}, [sp], x7
+        calc_qpelh      v1, v8, v11, v14, v17, v20, v23, v26, v29, sqshrn
+        calc_qpelh2     v1, v2, v8, v11, v14, v17, v20, v23, v26, v29, sqshrn2
+        calc_qpelh      v2, v9, v12, v15, v18, v21, v24, v27, v30, sqshrn
+        calc_qpelh2     v2, v3, v9, v12, v15, v18, v21, v24, v27, v30, sqshrn2
+        calc_qpelh      v3, v10, v13, v16, v19, v22, v25, v28, v31, sqshrn
+        calc_qpelh2     v3, v4, v10, v13, v16, v19, v22, v25, v28, v31, sqshrn2
+        subs            w3, w3, #1
+        st1             {v1.8h-v3.8h}, [x0], x7
+        b.eq            2f
+
+        ld1             {v8.8h-v10.8h}, [sp], x7
+        calc_qpelh      v1, v11, v14, v17, v20, v23, v26, v29, v8, sqshrn
+        calc_qpelh2     v1, v2, v11, v14, v17, v20, v23, v26, v29, v8, sqshrn2
+        calc_qpelh      v2, v12, v15, v18, v21, v24, v27, v30, v9, sqshrn
+        calc_qpelh2     v2, v3, v12, v15, v18, v21, v24, v27, v30, v9, sqshrn2
+        calc_qpelh      v3, v13, v16, v19, v22, v25, v28, v31, v10, sqshrn
+        calc_qpelh2     v3, v4, v13, v16, v19, v22, v25, v28, v31, v10, sqshrn2
+        subs            w3, w3, #1
+        st1             {v1.8h-v3.8h}, [x0], x7
+        b.eq            2f
+
+        ld1             {v11.8h-v13.8h}, [sp], x7
+        calc_qpelh      v1, v14, v17, v20, v23, v26, v29, v8, v11, sqshrn
+        calc_qpelh2     v1, v2, v14, v17, v20, v23, v26, v29, v8, v11, sqshrn2
+        calc_qpelh      v2, v15, v18, v21, v24, v27, v30, v9, v12, sqshrn
+        calc_qpelh2     v2, v3, v15, v18, v21, v24, v27, v30, v9, v12, sqshrn2
+        calc_qpelh      v3, v16, v19, v22, v25, v28, v31, v10, v13, sqshrn
+        calc_qpelh2     v3, v4, v16, v19, v22, v25, v28, v31, v10, v13, sqshrn2
+        subs            w3, w3, #1
+        st1             {v1.8h-v3.8h}, [x0], x7
+        b.eq            2f
+
+        ld1             {v14.8h-v16.8h}, [sp], x7
+        calc_qpelh      v1, v17, v20, v23, v26, v29, v8, v11, v14, sqshrn
+        calc_qpelh2     v1, v2, v17, v20, v23, v26, v29, v8, v11, v14, sqshrn2
+        calc_qpelh      v2, v18, v21, v24, v27, v30, v9, v12, v15, sqshrn
+        calc_qpelh2     v2, v3, v18, v21, v24, v27, v30, v9, v12, v15, sqshrn2
+        calc_qpelh      v3, v19, v22, v25, v28, v31, v10, v13, v16, sqshrn
+        calc_qpelh2     v3, v4, v19, v22, v25, v28, v31, v10, v13, v16, sqshrn2
+        subs            w3, w3, #1
+        st1             {v1.8h-v3.8h}, [x0], x7
+        b.eq            2f
+
+        ld1             {v17.8h-v19.8h}, [sp], x7
+        calc_qpelh      v1, v20, v23, v26, v29, v8, v11, v14, v17, sqshrn
+        calc_qpelh2     v1, v2, v20, v23, v26, v29, v8, v11, v14, v17, sqshrn2
+        calc_qpelh      v2, v21, v24, v27, v30, v9, v12, v15, v18, sqshrn
+        calc_qpelh2     v2, v3, v21, v24, v27, v30, v9, v12, v15, v18, sqshrn2
+        calc_qpelh      v3, v22, v25, v28, v31, v10, v13, v16, v19, sqshrn
+        calc_qpelh2     v3, v4, v22, v25, v28, v31, v10, v13, v16, v19, sqshrn2
+        subs            w3, w3, #1
+        st1             {v1.8h-v3.8h}, [x0], x7
+        b.eq            2f
+
+        ld1             {v20.8h-v22.8h}, [sp], x7
+        calc_qpelh      v1, v23, v26, v29, v8, v11, v14, v17, v20, sqshrn
+        calc_qpelh2     v1, v2, v23, v26, v29, v8, v11, v14, v17, v20, sqshrn2
+        calc_qpelh      v2, v24, v27, v30, v9, v12, v15, v18, v21, sqshrn
+        calc_qpelh2     v2, v3, v24, v27, v30, v9, v12, v15, v18, v21, sqshrn2
+        calc_qpelh      v3, v25, v28, v31, v10, v13, v16, v19, v22, sqshrn
+        calc_qpelh2     v3, v4, v25, v28, v31, v10, v13, v16, v19, v22, sqshrn2
+        subs            w3, w3, #1
+        st1             {v1.8h-v3.8h}, [x0], x7
+        b.eq            2f
+
+        ld1             {v23.8h-v25.8h}, [sp], x7
+        calc_qpelh      v1, v26, v29, v8, v11, v14, v17, v20, v23, sqshrn
+        calc_qpelh2     v1, v2, v26, v29, v8, v11, v14, v17, v20, v23, sqshrn2
+        calc_qpelh      v2, v27, v30, v9, v12, v15, v18, v21, v24, sqshrn
+        calc_qpelh2     v2, v3, v27, v30, v9, v12, v15, v18, v21, v24, sqshrn2
+        calc_qpelh      v3, v28, v31, v10, v13, v16, v19, v22, v25, sqshrn
+        calc_qpelh2     v3, v4, v28, v31, v10, v13, v16, v19, v22, v25, sqshrn2
+        subs            w3, w3, #1
+        st1             {v1.8h-v3.8h}, [x0], x7
+        b.eq            2f
+
+        ld1             {v26.8h-v28.8h}, [sp], x7
+        calc_qpelh      v1, v29, v8, v11, v14, v17, v20, v23, v26, sqshrn
+        calc_qpelh2     v1, v2, v29, v8, v11, v14, v17, v20, v23, v26, sqshrn2
+        calc_qpelh      v2, v30, v9, v12, v15, v18, v21, v24, v27, sqshrn
+        calc_qpelh2     v2, v3, v30, v9, v12, v15, v18, v21, v24, v27, sqshrn2
+        calc_qpelh      v3, v31, v10, v13, v16, v19, v22, v25, v28, sqshrn
+        calc_qpelh2     v3, v4, v31, v10, v13, v16, v19, v22, v25, v28, sqshrn2
+        subs            w3, w3, #1
+        st1             {v1.8h-v3.8h}, [x0], x7
+        b.hi            1b
+2:      ld1             {v12.8b-v15.8b}, [sp], #32
+        ld1             {v8.8b-v11.8b}, [sp], #32
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        sub             x1, x1, x2, lsl #1
+        lsl             x10, x10, #7
+        sub             x1, x1, x2
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x3, x3, #7
+        add             x0, sp, #32
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        mov             x7, #128
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_qpel_filterh x5, x4
+0:      mov             x8, sp          // src
+        ld1             {v16.8h, v17.8h}, [x8], x7
+        mov             w9, w3          // height
+        ld1             {v18.8h, v19.8h}, [x8], x7
+        mov             x5, x0          // dst
+        ld1             {v20.8h, v21.8h}, [x8], x7
+        ld1             {v22.8h, v23.8h}, [x8], x7
+        ld1             {v24.8h, v25.8h}, [x8], x7
+        ld1             {v26.8h, v27.8h}, [x8], x7
+        ld1             {v28.8h, v29.8h}, [x8], x7
+.macro calc tmp0, tmp1, src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15
+        ld1             {\tmp0\().8h, \tmp1\().8h}, [x8], x7
+        calc_qpelh      v1,     \src0,  \src1, \src2,  \src3,  \src4,  \src5,  \src6,  \src7, sqshrn
+        calc_qpelh2     v1, v2, \src0, \src1,  \src2,  \src3,  \src4,  \src5,  \src6,  \src7, sqshrn2
+        calc_qpelh      v2,     \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15, sqshrn
+        calc_qpelh2     v2, v3, \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15, sqshrn2
+        subs            x9, x9, #1
+        st1             {v1.8h, v2.8h}, [x5], x7
+.endm
+1:      calc_all2
+.purgem calc
+2:      add             x0, x0, #32
+        add             sp, sp, #32
+        subs            w6, w6, #16
+        b.hi            0b
+        add             w10, w3, #6
+        add             sp, sp, #64          // discard rest of first line
+        lsl             x10, x10, #7
+        add             sp, sp, x10         // tmp_array without first line
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv48_8_neon_i8mm, export=1
+        stp             x4, x5, [sp, #-64]!
+        stp             x2, x3, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        str             x30, [sp, #48]
+        bl              X(ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm)
+        ldp             x4, x5, [sp]
+        ldp             x2, x3, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        add             sp, sp, #48
+        add             x1, x1, #24
+        add             x0, x0, #48
+        bl              X(ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm)
+        ldr             x30, [sp], #16
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv64_8_neon_i8mm, export=1
+        stp             x4, x5, [sp, #-64]!
+        stp             x2, x3, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        str             x30, [sp, #48]
+        mov             x6, #32
+        bl              X(ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm)
+        ldp             x4, x5, [sp]
+        ldp             x2, x3, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        add             sp, sp, #48
+        add             x1, x1, #32
+        add             x0, x0, #64
+        mov             x6, #32
+        bl              X(ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm)
+        ldr             x30, [sp], #16
+        ret
+endfunc
+
 .macro QPEL_UNI_W_HV_HEADER width
         ldp             x14, x15, [sp]          // mx, my
         ldr             w13, [sp, #16]          // width
-- 
2.38.0.windows.1
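
The hv functions above compute the 2-D interpolation separably: each one branch-and-links to the matching _h kernel, which fills a temporary array on the stack (row stride 128 bytes, i.e. MAX_PB_SIZE * 2, kept in x7), and the 8-tap vertical filter then runs over that array as the stack allocation is consumed. Below is a rough C model of that two-pass structure; the helper name is hypothetical and the exact filter semantics (the final >> 6 narrowing, no clipping) are inferred from the asm's sqshrn, not taken from FFmpeg's C code, so treat it as a sketch:

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_PB_SIZE 64

    /* Sketch of the separable hv scheme (hypothetical helper, not
     * FFmpeg's implementation; shift behaviour inferred from the asm). */
    static void qpel_hv_model(int16_t *dst, const uint8_t *src,
                              ptrdiff_t srcstride, int height, int width,
                              const int8_t *fh, const int8_t *fv)
    {
        int16_t tmp[(MAX_PB_SIZE + 8) * MAX_PB_SIZE]; /* (h + 7) rows */
        int16_t *t = tmp;

        src -= 3 * srcstride + 3;                /* 3 rows up, 3 columns left */
        for (int y = 0; y < height + 7; y++) {   /* horizontal pass */
            for (int x = 0; x < width; x++) {
                int sum = 0;
                for (int k = 0; k < 8; k++)
                    sum += fh[k] * src[x + k];
                t[x] = sum;                      /* full 16-bit intermediate */
            }
            src += srcstride;
            t   += MAX_PB_SIZE;                  /* 128-byte row stride (x7) */
        }

        t = tmp;
        for (int y = 0; y < height; y++) {       /* vertical pass */
            for (int x = 0; x < width; x++) {
                int sum = 0;
                for (int k = 0; k < 8; k++)
                    sum += fv[k] * t[x + k * MAX_PB_SIZE];
                dst[x] = sum >> 6;               /* asm narrows with sqshrn */
            }
            t   += MAX_PB_SIZE;
            dst += MAX_PB_SIZE;
        }
    }

This is also why these functions spill x30: the horizontal pass is reached with bl, and its entire output lives and dies inside the temporary stack allocation.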



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
@ 2023-10-14  8:40 Logan.Lyu
  0 siblings, 0 replies; 34+ messages in thread
From: Logan.Lyu @ 2023-10-14  8:40 UTC (permalink / raw)
  To: ffmpeg-devel

[-- Attachment #1: Type: text/plain, Size: 17977 bytes --]

checkasm bench:
put_hevc_qpel_v4_8_c: 138.1
put_hevc_qpel_v4_8_neon: 41.1
put_hevc_qpel_v6_8_c: 276.6
put_hevc_qpel_v6_8_neon: 60.9
put_hevc_qpel_v8_8_c: 478.9
put_hevc_qpel_v8_8_neon: 72.9
put_hevc_qpel_v12_8_c: 1072.6
put_hevc_qpel_v12_8_neon: 203.9
put_hevc_qpel_v16_8_c: 1852.1
put_hevc_qpel_v16_8_neon: 264.1
put_hevc_qpel_v24_8_c: 4137.6
put_hevc_qpel_v24_8_neon: 586.9
put_hevc_qpel_v32_8_c: 7579.1
put_hevc_qpel_v32_8_neon: 1036.6
put_hevc_qpel_v48_8_c: 16355.6
put_hevc_qpel_v48_8_neon: 2326.4
put_hevc_qpel_v64_8_c: 33545.1
put_hevc_qpel_v64_8_neon: 4126.4
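
As a rough reading of these numbers (checkasm prints decimal cycle counts, so the ratios are approximate): 33545.1 / 4126.4 ≈ 8.1x for v64 against 138.1 / 41.1 ≈ 3.4x for v4, i.e. the NEON win grows with block size as fixed per-call overhead is amortized over more rows.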

Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Logan Lyu <Logan.Lyu@myais.com.cn>
---
  libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
  libavcodec/aarch64/hevcdsp_qpel_neon.S    | 347 +++++++++++++++++++---
  2 files changed, 314 insertions(+), 38 deletions(-)


[-- Attachment #2: 0003-lavc-aarch64-new-optimization-for-8-bit-hevc_qpel_v.patch --]
[-- Type: text/plain, Size: 18085 bytes --]

From 3cb075a5fcf0e696a55bcce8fa6415c1d2830fad Mon Sep 17 00:00:00 2001
From: Logan Lyu <Logan.Lyu@myais.com.cn>
Date: Sat, 9 Sep 2023 21:54:48 +0800
Subject: [PATCH 3/4] lavc/aarch64: new optimization for 8-bit hevc_qpel_v

checkasm bench:
put_hevc_qpel_v4_8_c: 138.1
put_hevc_qpel_v4_8_neon: 41.1
put_hevc_qpel_v6_8_c: 276.6
put_hevc_qpel_v6_8_neon: 60.9
put_hevc_qpel_v8_8_c: 478.9
put_hevc_qpel_v8_8_neon: 72.9
put_hevc_qpel_v12_8_c: 1072.6
put_hevc_qpel_v12_8_neon: 203.9
put_hevc_qpel_v16_8_c: 1852.1
put_hevc_qpel_v16_8_neon: 264.1
put_hevc_qpel_v24_8_c: 4137.6
put_hevc_qpel_v24_8_neon: 586.9
put_hevc_qpel_v32_8_c: 7579.1
put_hevc_qpel_v32_8_neon: 1036.6
put_hevc_qpel_v48_8_c: 16355.6
put_hevc_qpel_v48_8_neon: 2326.4
put_hevc_qpel_v64_8_c: 33545.1
put_hevc_qpel_v64_8_neon: 4126.4

Co-Authored-By: J. Dekker <jdek@itanimul.li>
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 347 +++++++++++++++++++---
 2 files changed, 314 insertions(+), 38 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index e9a341ecb9..f6b4c31d17 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -204,6 +204,10 @@ NEON8_FNPROTO(qpel_h, (int16_t *dst,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
 
+NEON8_FNPROTO(qpel_v, (int16_t *dst,
+        const uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_uni_v, (uint8_t *dst,  ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
@@ -315,6 +319,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels,);
         NEON8_FNASSIGN(c->put_hevc_epel, 1, 0, epel_v,);
         NEON8_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels,);
+        NEON8_FNASSIGN(c->put_hevc_qpel, 1, 0, qpel_v,);
         NEON8_FNASSIGN(c->put_hevc_epel_uni, 0, 0, pel_uni_pixels,);
         NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 0, epel_uni_v,);
         NEON8_FNASSIGN(c->put_hevc_qpel_uni, 0, 0, pel_uni_pixels,);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 4132d7a8a9..eff70d70a4 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -112,6 +112,44 @@ endconst
 .endif
 .endm
 
+.macro calc_all
+        calc            v23, v16, v17, v18, v19, v20, v21, v22, v23
+        b.eq            2f
+        calc            v16, v17, v18, v19, v20, v21, v22, v23, v16
+        b.eq            2f
+        calc            v17, v18, v19, v20, v21, v22, v23, v16, v17
+        b.eq            2f
+        calc            v18, v19, v20, v21, v22, v23, v16, v17, v18
+        b.eq            2f
+        calc            v19, v20, v21, v22, v23, v16, v17, v18, v19
+        b.eq            2f
+        calc            v20, v21, v22, v23, v16, v17, v18, v19, v20
+        b.eq            2f
+        calc            v21, v22, v23, v16, v17, v18, v19, v20, v21
+        b.eq            2f
+        calc            v22, v23, v16, v17, v18, v19, v20, v21, v22
+        b.hi            1b
+.endm
+
+.macro calc_all2
+        calc v30, v31, v16, v18, v20, v22, v24, v26, v28, v30, v17, v19, v21, v23, v25, v27, v29, v31
+        b.eq            2f
+        calc v16, v17, v18, v20, v22, v24, v26, v28, v30, v16, v19, v21, v23, v25, v27, v29, v31, v17
+        b.eq            2f
+        calc v18, v19, v20, v22, v24, v26, v28, v30, v16, v18, v21, v23, v25, v27, v29, v31, v17, v19
+        b.eq            2f
+        calc v20, v21, v22, v24, v26, v28, v30, v16, v18, v20, v23, v25, v27, v29, v31, v17, v19, v21
+        b.eq            2f
+        calc v22, v23, v24, v26, v28, v30, v16, v18, v20, v22, v25, v27, v29, v31, v17, v19, v21, v23
+        b.eq            2f
+        calc v24, v25, v26, v28, v30, v16, v18, v20, v22, v24, v27, v29, v31, v17, v19, v21, v23, v25
+        b.eq            2f
+        calc v26, v27, v28, v30, v16, v18, v20, v22, v24, v26, v29, v31, v17, v19, v21, v23, v25, v27
+        b.eq            2f
+        calc v28, v29, v30, v16, v18, v20, v22, v24, v26, v28, v31, v17, v19, v21, v23, v25, v27, v29
+        b.hi            1b
+.endm
+
 .macro put_hevc type
 .ifc \type, qpel
         // void put_hevc_qpel_h(int16_t *dst,
@@ -558,6 +596,277 @@ put_hevc qpel
 put_hevc qpel_uni
 put_hevc qpel_bi
 
+function ff_hevc_put_hevc_qpel_v4_8_neon, export=1
+        load_qpel_filterb x5, x4
+        sub             x1, x1, x2, lsl #1
+        mov             x9, #(MAX_PB_SIZE * 2)
+        sub             x1, x1, x2
+        ldr             s16, [x1]
+        ldr             s17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             s18, [x1]
+        ldr             s19, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             s20, [x1]
+        ldr             s21, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             s22, [x1]
+        add             x1, x1, x2
+.macro calc tmp, src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\tmp\().s}[0], [x1], x2
+        movi            v24.8h, #0
+        calc_qpelb      v24, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7
+        st1             {v24.4h}, [x0], x9
+        subs            w3, w3, #1
+        b.eq            2f
+.endm
+1:      calc_all
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_v6_8_neon, export=1
+        load_qpel_filterb x5, x4
+        sub             x1, x1, x2, lsl #1
+        mov             x9, #(MAX_PB_SIZE * 2 - 8)
+        sub             x1, x1, x2
+        ldr             d16, [x1]
+        ldr             d17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             d18, [x1]
+        ldr             d19, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             d20, [x1]
+        ldr             d21, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             d22, [x1]
+        add             x1, x1, x2
+.macro calc tmp, src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\tmp\().8b}, [x1], x2
+        movi            v24.8h, #0
+        calc_qpelb      v24, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7
+        st1             {v24.4h}, [x0], #8
+        st1             {v24.s}[2], [x0], x9
+        subs            w3, w3, #1
+.endm
+1:      calc_all
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_v8_8_neon, export=1
+        load_qpel_filterb x5, x4
+        sub             x1, x1, x2, lsl #1
+        mov             x9, #(MAX_PB_SIZE * 2)
+        sub             x1, x1, x2
+        ldr             d16, [x1]
+        ldr             d17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             d18, [x1]
+        ldr             d19, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             d20, [x1]
+        ldr             d21, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             d22, [x1]
+        add             x1, x1, x2
+.macro calc tmp, src0, src1, src2, src3, src4, src5, src6, src7
+        ld1            {\tmp\().8b}, [x1], x2
+        movi            v24.8h, #0
+        calc_qpelb      v24, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7
+        st1            {v24.8h}, [x0], x9
+        subs            w3, w3, #1
+.endm
+1:      calc_all
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_v12_8_neon, export=1
+        load_qpel_filterb x5, x4
+        sub             x1, x1, x2, lsl #1
+        mov             x9, #(MAX_PB_SIZE * 2 - 16)
+        sub             x1, x1, x2
+        ldr             q16, [x1]
+        ldr             q17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             q18, [x1]
+        ldr             q19, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             q20, [x1]
+        ldr             q21, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             q22, [x1]
+        add             x1, x1, x2
+.macro calc tmp, src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\tmp\().16b}, [x1], x2
+        movi            v24.8h, #0
+        movi            v25.8h, #0
+        calc_qpelb      v24, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7
+        calc_qpelb2     v25, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7
+        st1             {v24.8h}, [x0], #16
+        subs            w3, w3, #1
+        st1             {v25.4h}, [x0], x9
+.endm
+1:      calc_all
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_v16_8_neon, export=1
+        load_qpel_filterb x5, x4
+        sub             x1, x1, x2, lsl #1
+        mov             x9, #(MAX_PB_SIZE * 2)
+        sub             x1, x1, x2
+        ldr             q16, [x1]
+        ldr             q17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             q18, [x1]
+        ldr             q19, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             q20, [x1]
+        ldr             q21, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ldr             q22, [x1]
+        add             x1, x1, x2
+.macro calc tmp, src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\tmp\().16b}, [x1], x2
+        movi            v24.8h, #0
+        movi            v25.8h, #0
+        calc_qpelb      v24, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7
+        calc_qpelb2     v25, \src0, \src1, \src2, \src3, \src4, \src5, \src6, \src7
+        subs            w3, w3, #1
+        st1             {v24.8h, v25.8h}, [x0], x9
+.endm
+1:      calc_all
+.purgem calc
+2:      ret
+endfunc
+
+// todo: loads 32 bytes per row although only 24 are needed
+function ff_hevc_put_hevc_qpel_v24_8_neon, export=1
+        sub             sp, sp, #32
+        st1             {v8.8b, v9.8b, v10.8b}, [sp]
+        load_qpel_filterb x5, x4
+        sub             x1, x1, x2, lsl #1
+        sub             x1, x1, x2
+        mov             x9, #(MAX_PB_SIZE * 2)
+        ld1             {v16.16b, v17.16b}, [x1], x2
+        ld1             {v18.16b, v19.16b}, [x1], x2
+        ld1             {v20.16b, v21.16b}, [x1], x2
+        ld1             {v22.16b, v23.16b}, [x1], x2
+        ld1             {v24.16b, v25.16b}, [x1], x2
+        ld1             {v26.16b, v27.16b}, [x1], x2
+        ld1             {v28.16b, v29.16b}, [x1], x2
+.macro calc tmp0, tmp1, src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15
+        ld1             {\tmp0\().16b, \tmp1\().16b}, [x1], x2
+        movi            v8.8h, #0
+        movi            v9.8h, #0
+        movi            v10.8h, #0
+        calc_qpelb      v8,  \src0, \src1, \src2,  \src3,  \src4,  \src5,  \src6,  \src7
+        calc_qpelb2     v9,  \src0, \src1, \src2,  \src3,  \src4,  \src5,  \src6,  \src7
+        calc_qpelb      v10, \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15
+        subs            w3, w3, #1
+        st1             {v8.8h, v9.8h, v10.8h}, [x0], x9
+.endm
+1:      calc_all2
+.purgem calc
+2:      ld1             {v8.8b, v9.8b, v10.8b}, [sp]
+        add             sp, sp, #32
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_v32_8_neon, export=1
+        sub             sp, sp, #32
+        st1             {v8.8b-v11.8b}, [sp]
+        load_qpel_filterb x5, x4
+        sub             x1, x1, x2, lsl #1
+        mov             x9, #(MAX_PB_SIZE * 2)
+        sub             x1, x1, x2
+        ld1             {v16.16b, v17.16b}, [x1], x2
+        ld1             {v18.16b, v19.16b}, [x1], x2
+        ld1             {v20.16b, v21.16b}, [x1], x2
+        ld1             {v22.16b, v23.16b}, [x1], x2
+        ld1             {v24.16b, v25.16b}, [x1], x2
+        ld1             {v26.16b, v27.16b}, [x1], x2
+        ld1             {v28.16b, v29.16b}, [x1], x2
+.macro calc tmp0, tmp1, src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15
+        ld1             {\tmp0\().16b, \tmp1\().16b}, [x1], x2
+        movi            v8.8h, #0
+        movi            v9.8h, #0
+        movi            v10.8h, #0
+        movi            v11.8h, #0
+        calc_qpelb      v8,  \src0, \src1, \src2,  \src3,  \src4,  \src5,  \src6,  \src7
+        calc_qpelb2     v9,  \src0, \src1, \src2,  \src3,  \src4,  \src5,  \src6,  \src7
+        calc_qpelb      v10, \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15
+        calc_qpelb2     v11, \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15
+        subs            w3, w3, #1
+        st1             {v8.8h-v11.8h}, [x0], x9
+.endm
+1:      calc_all2
+.purgem calc
+2:      ld1             {v8.8b-v11.8b}, [sp], #32
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_v48_8_neon, export=1
+        stp             x2, x3, [sp, #-48]!
+        stp             x0, x1, [sp, #16]
+        stp             x5, x30, [sp, #32]
+        bl              X(ff_hevc_put_hevc_qpel_v24_8_neon)
+        ldp             x2, x3, [sp]
+        ldp             x0, x1, [sp, #16]
+        ldr             x5, [sp, #32]
+        add             sp, sp, #32
+        add             x0, x0, #48
+        add             x1, x1, #24
+        bl              X(ff_hevc_put_hevc_qpel_v24_8_neon)
+        ldr             x30, [sp, #8]
+        add             sp, sp, #16
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_v64_8_neon, export=1
+        sub             sp, sp, #32
+        st1             {v8.8b-v11.8b}, [sp]
+        load_qpel_filterb x5, x4
+        sub             x1, x1, x2, lsl #1
+        sub             x1, x1, x2
+        mov             x9, #(MAX_PB_SIZE * 2)
+0:      mov             x8, x1          // src
+        ld1             {v16.16b, v17.16b}, [x8], x2
+        mov             w11, w3         // height
+        ld1             {v18.16b, v19.16b}, [x8], x2
+        mov             x10, x0         // dst
+        ld1             {v20.16b, v21.16b}, [x8], x2
+        ld1             {v22.16b, v23.16b}, [x8], x2
+        ld1             {v24.16b, v25.16b}, [x8], x2
+        ld1             {v26.16b, v27.16b}, [x8], x2
+        ld1             {v28.16b, v29.16b}, [x8], x2
+.macro calc tmp0, tmp1, src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15
+        ld1             {\tmp0\().16b, \tmp1\().16b}, [x8], x2
+        movi            v8.8h, #0
+        movi            v9.8h, #0
+        movi            v10.8h, #0
+        movi            v11.8h, #0
+        calc_qpelb      v8,  \src0, \src1, \src2,  \src3,  \src4,  \src5,  \src6,  \src7
+        calc_qpelb2     v9,  \src0, \src1, \src2,  \src3,  \src4,  \src5,  \src6,  \src7
+        calc_qpelb      v10, \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15
+        calc_qpelb2     v11, \src8, \src9, \src10, \src11, \src12, \src13, \src14, \src15
+        subs            x11, x11, #1
+        st1             {v8.8h-v11.8h}, [x10], x9
+.endm
+1:      calc_all2
+.purgem calc
+2:      add             x0, x0, #64
+        add             x1, x1, #32
+        subs            w6, w6, #32
+        b.hi            0b
+        ld1             {v8.8b-v11.8b}, [sp], #32
+        ret
+endfunc
+
+
 function ff_hevc_put_hevc_pel_uni_pixels4_8_neon, export=1
 1:
         ldr             s0, [x2]
@@ -663,25 +972,6 @@ function ff_hevc_put_hevc_pel_uni_pixels64_8_neon, export=1
         ret
 endfunc
 
-.macro calc_all
-        calc            v23, v16, v17, v18, v19, v20, v21, v22, v23
-        b.eq            2f
-        calc            v16, v17, v18, v19, v20, v21, v22, v23, v16
-        b.eq            2f
-        calc            v17, v18, v19, v20, v21, v22, v23, v16, v17
-        b.eq            2f
-        calc            v18, v19, v20, v21, v22, v23, v16, v17, v18
-        b.eq            2f
-        calc            v19, v20, v21, v22, v23, v16, v17, v18, v19
-        b.eq            2f
-        calc            v20, v21, v22, v23, v16, v17, v18, v19, v20
-        b.eq            2f
-        calc            v21, v22, v23, v16, v17, v18, v19, v20, v21
-        b.eq            2f
-        calc            v22, v23, v16, v17, v18, v19, v20, v21, v22
-        b.hi            1b
-.endm
-
 function ff_hevc_put_hevc_qpel_uni_v4_8_neon, export=1
         load_qpel_filterb x6, x5
         sub             x2, x2, x3, lsl #1
@@ -1559,25 +1849,6 @@ endfunc
 
 #if HAVE_I8MM
 
-.macro calc_all2
-        calc v30, v31, v16, v18, v20, v22, v24, v26, v28, v30, v17, v19, v21, v23, v25, v27, v29, v31
-        b.eq            2f
-        calc v16, v17, v18, v20, v22, v24, v26, v28, v30, v16, v19, v21, v23, v25, v27, v29, v31, v17
-        b.eq            2f
-        calc v18, v19, v20, v22, v24, v26, v28, v30, v16, v18, v21, v23, v25, v27, v29, v31, v17, v19
-        b.eq            2f
-        calc v20, v21, v22, v24, v26, v28, v30, v16, v18, v20, v23, v25, v27, v29, v31, v17, v19, v21
-        b.eq            2f
-        calc v22, v23, v24, v26, v28, v30, v16, v18, v20, v22, v25, v27, v29, v31, v17, v19, v21, v23
-        b.eq            2f
-        calc v24, v25, v26, v28, v30, v16, v18, v20, v22, v24, v27, v29, v31, v17, v19, v21, v23, v25
-        b.eq            2f
-        calc v26, v27, v28, v30, v16, v18, v20, v22, v24, v26, v29, v31, v17, v19, v21, v23, v25, v27
-        b.eq            2f
-        calc v28, v29, v30, v16, v18, v20, v22, v24, v26, v28, v31, v17, v19, v21, v23, v25, v27, v29
-        b.hi            1b
-.endm
-
 function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
         add             w10, w4, #7
         lsl             x10, x10, #7
-- 
2.38.0.windows.1
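
A note on the calc_all / calc_all2 macros this patch adds: the eight most recent source rows stay pinned in v16..v23 (register pairs up to v31 in the two-register variant), and instead of moving data down one register per output line, the loop body is instantiated eight times with the register names rotated, so each line costs exactly one new row load. A scalar sketch of the same idea follows; the function is a hypothetical model, not FFmpeg code, and the "no final shift" detail is read off the 8-bit put path in the asm:

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_PB_SIZE 64

    /* Scalar model of the calc_all rotation (hypothetical, for
     * illustration): the 8 newest source rows sit in a fixed 8-slot
     * window, indexed modulo 8 instead of being copied down a slot
     * per output line. */
    static void qpel_v_model(int16_t *dst, const uint8_t *src,
                             ptrdiff_t srcstride, int height, int width,
                             const int8_t *fv)
    {
        const uint8_t *rows[8];            /* stands in for v16..v23 */

        src -= 3 * srcstride;              /* window starts 3 rows up */
        for (int k = 0; k < 7; k++)        /* prime 7 rows (the ldr run) */
            rows[k] = src + k * srcstride;

        for (int y = 0; y < height; y++) {
            /* exactly one new row load per output line */
            rows[(y + 7) % 8] = src + (y + 7) * srcstride;
            for (int x = 0; x < width; x++) {
                int sum = 0;
                for (int k = 0; k < 8; k++)
                    sum += fv[k] * rows[(y + k) % 8][x];
                dst[x] = (int16_t)sum;     /* 8-bit put: no final shift */
            }
            dst += MAX_PB_SIZE;            /* x9 = MAX_PB_SIZE * 2 bytes */
        }
    }

In the asm the modulo-8 indexing happens at assembly time: each of the eight macro instantiations permutes which register plays src0..src7, and the last one branches back to the top of the pattern with b.hi 1b.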



^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
@ 2023-10-14  8:39 Logan.Lyu
  0 siblings, 0 replies; 34+ messages in thread
From: Logan.Lyu @ 2023-10-14  8:39 UTC (permalink / raw)
  To: ffmpeg-devel

[-- Attachment #1: Type: text/plain, Size: 10960 bytes --]

checkasm bench:
put_hevc_epel_v4_8_c: 79.9
put_hevc_epel_v4_8_neon: 25.7
put_hevc_epel_v6_8_c: 151.4
put_hevc_epel_v6_8_neon: 46.4
put_hevc_epel_v8_8_c: 250.9
put_hevc_epel_v8_8_neon: 41.7
put_hevc_epel_v12_8_c: 542.7
put_hevc_epel_v12_8_neon: 108.7
put_hevc_epel_v16_8_c: 939.4
put_hevc_epel_v16_8_neon: 169.2
put_hevc_epel_v24_8_c: 2104.9
put_hevc_epel_v24_8_neon: 307.9
put_hevc_epel_v32_8_c: 3713.9
put_hevc_epel_v32_8_neon: 524.2
put_hevc_epel_v48_8_c: 8175.2
put_hevc_epel_v48_8_neon: 1197.2
put_hevc_epel_v64_8_c: 16049.4
put_hevc_epel_v64_8_neon: 2094.9
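
Read against the qpel_v numbers in the sibling patch, the shape is the same: 16049.4 / 2094.9 ≈ 7.7x at v64 versus 79.9 / 25.7 ≈ 3.1x at v4. epel is HEVC's 4-tap chroma filter, so the vertical window in this patch is only four rows (calc_all4 rotating v16..v19), roughly halving the per-line arithmetic relative to qpel's eight taps.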

Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Logan Lyu <Logan.Lyu@myais.com.cn>
---
  libavcodec/aarch64/hevcdsp_epel_neon.S    | 223 ++++++++++++++++++++++
  libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
  2 files changed, 228 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index b4ca1e4c20..e541db5430 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -243,6 +243,229 @@ function ff_hevc_put_hevc_pel_pixels64_8_neon, export=1
          ret
  endfunc
  +
+function ff_hevc_put_hevc_epel_v4_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             s16, [x1]
+        ldr             s17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.s}[0], [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().s}[0], [x1], x2
+        movi            v4.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        subs            w3, w3, #1
+        st1             {v4.4h}, [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v6_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2 - 8)
+        ldr             d16, [x1]
+        ldr             d17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.8b}, [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().8b}, [x1], x2
+        movi            v4.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        st1             {v4.d}[0], [x0], #8
+        subs            w3, w3, #1
+        st1             {v4.s}[2], [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v8_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             d16, [x1]
+        ldr             d17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.8b}, [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().8b}, [x1], x2
+        movi            v4.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        subs            w3, w3, #1
+        st1             {v4.8h}, [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v12_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             q16, [x1]
+        ldr             q17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.16b}, [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().16b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        calc_epelb2     v5, \src0, \src1, \src2, \src3
+        str             q4, [x0]
+        subs            w3, w3, #1
+        str             d5, [x0, #16]
+        add             x0, x0, x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v16_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             q16, [x1]
+        ldr             q17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.16b}, [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1            {\src3\().16b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        calc_epelb2     v5, \src0, \src1, \src2, \src3
+        subs            w3, w3, #1
+        st1             {v4.8h, v5.8h}, [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v24_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.8b, v17.8b, v18.8b}, [x1], x2
+        ld1             {v19.8b, v20.8b, v21.8b}, [x1], x2
+        ld1             {v22.8b, v23.8b, v24.8b}, [x1], x2
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, 
src10, src11
+        ld1             {\src9\().8b, \src10\().8b, \src11\().8b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        movi            v6.8h, #0
+        calc_epelb      v4, \src0, \src3, \src6, \src9
+        calc_epelb      v5, \src1, \src4, \src7, \src10
+        calc_epelb      v6, \src2, \src5, \src8, \src11
+        subs            w3, w3, #1
+        st1             {v4.8h-v6.8h}, [x0], x10
+.endm
+1:      calc_all12
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v32_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.16b, v17.16b}, [x1], x2
+        ld1             {v18.16b, v19.16b}, [x1], x2
+        ld1             {v20.16b, v21.16b}, [x1], x2
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\src6\().16b, \src7\().16b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        movi            v6.8h, #0
+        movi            v7.8h, #0
+        calc_epelb      v4, \src0, \src2, \src4, \src6
+        calc_epelb2     v5, \src0, \src2, \src4, \src6
+        calc_epelb      v6, \src1, \src3, \src5, \src7
+        calc_epelb2     v7, \src1, \src3, \src5, \src7
+        subs            w3, w3, #1
+        st1             {v4.8h-v7.8h}, [x0], x10
+.endm
+1:      calc_all8
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v48_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #64
+        ld1             {v16.16b, v17.16b, v18.16b}, [x1], x2
+        ld1             {v19.16b, v20.16b, v21.16b}, [x1], x2
+        ld1             {v22.16b, v23.16b, v24.16b}, [x1], x2
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, 
src10, src11
+        ld1             {\src9\().16b, \src10\().16b, \src11\().16b}, 
[x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        movi            v6.8h, #0
+        movi            v7.8h, #0
+        movi            v28.8h, #0
+        movi            v29.8h, #0
+        calc_epelb      v4,  \src0, \src3, \src6, \src9
+        calc_epelb2     v5,  \src0, \src3, \src6, \src9
+        calc_epelb      v6,  \src1, \src4, \src7, \src10
+        calc_epelb2     v7,  \src1, \src4, \src7, \src10
+        calc_epelb      v28, \src2, \src5, \src8, \src11
+        calc_epelb2     v29, \src2, \src5, \src8, \src11
+        st1             {v4.8h-v7.8h}, [x0], #64
+        subs            w3, w3, #1
+        st1             {v28.8h-v29.8h}, [x0], x10
+.endm
+1:      calc_all12
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v64_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             sp, sp, #32
+        st1             {v8.8b-v11.8b}, [sp]
+        sub             x1, x1, x2
+        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x1], x2
+        ld1             {v20.16b, v21.16b, v22.16b, v23.16b}, [x1], x2
+        ld1             {v24.16b, v25.16b, v26.16b, v27.16b}, [x1], x2
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, 
src10, src11, src12, src13, src14, src15
+        ld1             {\src12\().16b-\src15\().16b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        movi            v6.8h, #0
+        movi            v7.8h, #0
+        movi            v8.8h, #0
+        movi            v9.8h, #0
+        movi            v10.8h, #0
+        movi            v11.8h, #0
+        calc_epelb      v4,  \src0, \src4, \src8,  \src12
+        calc_epelb2     v5,  \src0, \src4, \src8,  \src12
+        calc_epelb      v6,  \src1, \src5, \src9,  \src13
+        calc_epelb2     v7,  \src1, \src5, \src9,  \src13
+        calc_epelb      v8,  \src2, \src6, \src10, \src14
+        calc_epelb2     v9,  \src2, \src6, \src10, \src14
+        calc_epelb      v10, \src3, \src7, \src11, \src15
+        calc_epelb2     v11, \src3, \src7, \src11, \src15
+        st1             {v4.8h-v7.8h}, [x0], #64
+        subs            w3, w3, #1
+        st1             {v8.8h-v11.8h}, [x0], #64
+.endm
+1:      calc_all16
+.purgem calc
+2:     	ld1             {v8.8b-v11.8b}, [sp]
+        add             sp, sp, #32
+        ret
+endfunc
+
  function ff_hevc_put_hevc_epel_uni_v4_8_neon, export=1
          load_epel_filterb x6, x5
          sub             x2, x2, x3
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c 
b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 4c377a7940..82e1623a67 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -156,6 +156,10 @@ NEON8_FNPROTO(pel_pixels, (int16_t *dst,
          const uint8_t *src, ptrdiff_t srcstride,
          int height, intptr_t mx, intptr_t my, int width),);
  +NEON8_FNPROTO(epel_v, (int16_t *dst,
+        const uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width),);
+
  NEON8_FNPROTO(pel_uni_pixels, (uint8_t *_dst, ptrdiff_t _dststride,
          const uint8_t *_src, ptrdiff_t _srcstride,
          int height, intptr_t mx, intptr_t my, int width),);
@@ -305,6 +309,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext 
*c, const int bit_depth)
          c->put_hevc_qpel_bi[9][0][1]   = 
ff_hevc_put_hevc_qpel_bi_h16_8_neon;
           NEON8_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels,);
+        NEON8_FNASSIGN(c->put_hevc_epel, 1, 0, epel_v,);
          NEON8_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels,);
          NEON8_FNASSIGN(c->put_hevc_epel_uni, 0, 0, pel_uni_pixels,);
          NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 0, epel_uni_v,);
-- 
2.38.0.windows.1

[-- Attachment #2: 0001-lavc-aarch64-new-optimization-for-8-bit-hevc_epel_v.patch --]
[-- Type: text/plain, Size: 11109 bytes --]

From dfaaddf97b86817bc7adb50fdf0d29634b365bb1 Mon Sep 17 00:00:00 2001
From: Logan Lyu <Logan.Lyu@myais.com.cn>
Date: Sat, 9 Sep 2023 16:50:29 +0800
Subject: [PATCH 1/4] lavc/aarch64: new optimization for 8-bit hevc_epel_v

checkasm bench:
put_hevc_epel_v4_8_c: 79.9
put_hevc_epel_v4_8_neon: 25.7
put_hevc_epel_v6_8_c: 151.4
put_hevc_epel_v6_8_neon: 46.4
put_hevc_epel_v8_8_c: 250.9
put_hevc_epel_v8_8_neon: 41.7
put_hevc_epel_v12_8_c: 542.7
put_hevc_epel_v12_8_neon: 108.7
put_hevc_epel_v16_8_c: 939.4
put_hevc_epel_v16_8_neon: 169.2
put_hevc_epel_v24_8_c: 2104.9
put_hevc_epel_v24_8_neon: 307.9
put_hevc_epel_v32_8_c: 3713.9
put_hevc_epel_v32_8_neon: 524.2
put_hevc_epel_v48_8_c: 8175.2
put_hevc_epel_v48_8_neon: 1197.2
put_hevc_epel_v64_8_c: 16049.4
put_hevc_epel_v64_8_neon: 2094.9

Co-Authored-By: J. Dekker <jdek@itanimul.li>
---
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 223 ++++++++++++++++++++++
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 2 files changed, 228 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index b4ca1e4c20..e541db5430 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -243,6 +243,229 @@ function ff_hevc_put_hevc_pel_pixels64_8_neon, export=1
         ret
 endfunc
 
+
+function ff_hevc_put_hevc_epel_v4_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             s16, [x1]
+        ldr             s17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.s}[0], [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().s}[0], [x1], x2
+        movi            v4.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        subs            w3, w3, #1
+        st1             {v4.4h}, [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v6_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2 - 8)
+        ldr             d16, [x1]
+        ldr             d17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.8b}, [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().8b}, [x1], x2
+        movi            v4.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        st1             {v4.d}[0], [x0], #8
+        subs            w3, w3, #1
+        st1             {v4.s}[2], [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v8_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             d16, [x1]
+        ldr             d17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.8b}, [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().8b}, [x1], x2
+        movi            v4.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        subs            w3, w3, #1
+        st1             {v4.8h}, [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v12_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             q16, [x1]
+        ldr             q17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.16b}, [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().16b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        calc_epelb2     v5, \src0, \src1, \src2, \src3
+        str             q4, [x0]
+        subs            w3, w3, #1
+        str             d5, [x0, #16]
+        add             x0, x0, x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v16_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             q16, [x1]
+        ldr             q17, [x1, x2]
+        add             x1, x1, x2, lsl #1
+        ld1             {v18.16b}, [x1], x2
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().16b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        calc_epelb      v4, \src0, \src1, \src2, \src3
+        calc_epelb2     v5, \src0, \src1, \src2, \src3
+        subs            w3, w3, #1
+        st1             {v4.8h, v5.8h}, [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v24_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.8b, v17.8b, v18.8b}, [x1], x2
+        ld1             {v19.8b, v20.8b, v21.8b}, [x1], x2
+        ld1             {v22.8b, v23.8b, v24.8b}, [x1], x2
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11
+        ld1             {\src9\().8b, \src10\().8b, \src11\().8b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        movi            v6.8h, #0
+        calc_epelb      v4, \src0, \src3, \src6, \src9
+        calc_epelb      v5, \src1, \src4, \src7, \src10
+        calc_epelb      v6, \src2, \src5, \src8, \src11
+        subs            w3, w3, #1
+        st1             {v4.8h-v6.8h}, [x0], x10
+.endm
+1:      calc_all12
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v32_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.16b, v17.16b}, [x1], x2
+        ld1             {v18.16b, v19.16b}, [x1], x2
+        ld1             {v20.16b, v21.16b}, [x1], x2
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\src6\().16b, \src7\().16b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        movi            v6.8h, #0
+        movi            v7.8h, #0
+        calc_epelb      v4, \src0, \src2, \src4, \src6
+        calc_epelb2     v5, \src0, \src2, \src4, \src6
+        calc_epelb      v6, \src1, \src3, \src5, \src7
+        calc_epelb2     v7, \src1, \src3, \src5, \src7
+        subs            w3, w3, #1
+        st1             {v4.8h-v7.8h}, [x0], x10
+.endm
+1:      calc_all8
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v48_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             x1, x1, x2
+        mov             x10, #64
+        ld1             {v16.16b, v17.16b, v18.16b}, [x1], x2
+        ld1             {v19.16b, v20.16b, v21.16b}, [x1], x2
+        ld1             {v22.16b, v23.16b, v24.16b}, [x1], x2
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11
+        ld1             {\src9\().16b, \src10\().16b, \src11\().16b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        movi            v6.8h, #0
+        movi            v7.8h, #0
+        movi            v28.8h, #0
+        movi            v29.8h, #0
+        calc_epelb      v4,  \src0, \src3, \src6, \src9
+        calc_epelb2     v5,  \src0, \src3, \src6, \src9
+        calc_epelb      v6,  \src1, \src4, \src7, \src10
+        calc_epelb2     v7,  \src1, \src4, \src7, \src10
+        calc_epelb      v28, \src2, \src5, \src8, \src11
+        calc_epelb2     v29, \src2, \src5, \src8, \src11
+        st1             {v4.8h-v7.8h}, [x0], #64
+        subs            w3, w3, #1
+        st1             {v28.8h-v29.8h}, [x0], x10
+.endm
+1:      calc_all12
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_v64_8_neon, export=1
+        load_epel_filterb x5, x4
+        sub             sp, sp, #32
+        st1             {v8.8b-v11.8b}, [sp]
+        sub             x1, x1, x2
+        ld1             {v16.16b, v17.16b, v18.16b, v19.16b}, [x1], x2
+        ld1             {v20.16b, v21.16b, v22.16b, v23.16b}, [x1], x2
+        ld1             {v24.16b, v25.16b, v26.16b, v27.16b}, [x1], x2
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11, src12, src13, src14, src15
+        ld1             {\src12\().16b-\src15\().16b}, [x1], x2
+        movi            v4.8h, #0
+        movi            v5.8h, #0
+        movi            v6.8h, #0
+        movi            v7.8h, #0
+        movi            v8.8h, #0
+        movi            v9.8h, #0
+        movi            v10.8h, #0
+        movi            v11.8h, #0
+        calc_epelb      v4,  \src0, \src4, \src8,  \src12
+        calc_epelb2     v5,  \src0, \src4, \src8,  \src12
+        calc_epelb      v6,  \src1, \src5, \src9,  \src13
+        calc_epelb2     v7,  \src1, \src5, \src9,  \src13
+        calc_epelb      v8,  \src2, \src6, \src10, \src14
+        calc_epelb2     v9,  \src2, \src6, \src10, \src14
+        calc_epelb      v10, \src3, \src7, \src11, \src15
+        calc_epelb2     v11, \src3, \src7, \src11, \src15
+        st1             {v4.8h-v7.8h}, [x0], #64
+        subs            w3, w3, #1
+        st1             {v8.8h-v11.8h}, [x0], #64
+.endm
+1:      calc_all16
+.purgem calc
+2:      ld1             {v8.8b-v11.8b}, [sp]
+        add             sp, sp, #32
+        ret
+endfunc
+
 function ff_hevc_put_hevc_epel_uni_v4_8_neon, export=1
         load_epel_filterb x6, x5
         sub             x2, x2, x3
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 4c377a7940..82e1623a67 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -156,6 +156,10 @@ NEON8_FNPROTO(pel_pixels, (int16_t *dst,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(epel_v, (int16_t *dst,
+        const uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(pel_uni_pixels, (uint8_t *_dst, ptrdiff_t _dststride,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
@@ -305,6 +309,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         c->put_hevc_qpel_bi[9][0][1]   = ff_hevc_put_hevc_qpel_bi_h16_8_neon;
 
         NEON8_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels,);
+        NEON8_FNASSIGN(c->put_hevc_epel, 1, 0, epel_v,);
         NEON8_FNASSIGN(c->put_hevc_qpel, 0, 0, pel_pixels,);
         NEON8_FNASSIGN(c->put_hevc_epel_uni, 0, 0, pel_uni_pixels,);
         NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 0, epel_uni_v,);
-- 
2.38.0.windows.1


[-- Attachment #3: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
@ 2023-10-14  8:39 Logan.Lyu
  0 siblings, 0 replies; 34+ messages in thread
From: Logan.Lyu @ 2023-10-14  8:39 UTC (permalink / raw)
  To: ffmpeg-devel

[-- Attachment #1: Type: text/plain, Size: 13033 bytes --]

checkasm bench:
put_hevc_epel_hv4_8_c: 213.7
put_hevc_epel_hv4_8_i8mm: 59.4
put_hevc_epel_hv6_8_c: 350.9
put_hevc_epel_hv6_8_i8mm: 130.2
put_hevc_epel_hv8_8_c: 548.7
put_hevc_epel_hv8_8_i8mm: 136.9
put_hevc_epel_hv12_8_c: 1126.7
put_hevc_epel_hv12_8_i8mm: 302.2
put_hevc_epel_hv16_8_c: 1925.2
put_hevc_epel_hv16_8_i8mm: 459.9
put_hevc_epel_hv24_8_c: 4301.9
put_hevc_epel_hv24_8_i8mm: 1024.9
put_hevc_epel_hv32_8_c: 7509.2
put_hevc_epel_hv32_8_i8mm: 1680.4
put_hevc_epel_hv48_8_c: 16566.9
put_hevc_epel_hv48_8_i8mm: 3945.4
put_hevc_epel_hv64_8_c: 29134.2
put_hevc_epel_hv64_8_i8mm: 6567.7
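
Relative to the C reference, that is a speedup of roughly 2.7x for hv6 up to about 4.5x for hv32, e.g. 29134.2 / 6567.7 ~= 4.4 for hv64.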

Co-Authored-By: J. Dekker <jdek@itanimul.li>
Signed-off-by: Logan Lyu <Logan.Lyu@myais.com.cn>
---
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 265 ++++++++++++++++++++++
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 2 files changed, 270 insertions(+)

[-- Attachment #2: 0002-lavc-aarch64-new-optimization-for-8-bit-hevc_epel_hv.patch --]
[-- Type: text/plain, Size: 13185 bytes --]

From 83af7e79cf004c244ad3c771a0ca0e2357bbe944 Mon Sep 17 00:00:00 2001
From: Logan Lyu <Logan.Lyu@myais.com.cn>
Date: Sat, 9 Sep 2023 21:29:51 +0800
Subject: [PATCH 2/4] lavc/aarch64: new optimization for 8-bit hevc_epel_hv

checkasm bench:
put_hevc_epel_hv4_8_c: 213.7
put_hevc_epel_hv4_8_i8mm: 59.4
put_hevc_epel_hv6_8_c: 350.9
put_hevc_epel_hv6_8_i8mm: 130.2
put_hevc_epel_hv8_8_c: 548.7
put_hevc_epel_hv8_8_i8mm: 136.9
put_hevc_epel_hv12_8_c: 1126.7
put_hevc_epel_hv12_8_i8mm: 302.2
put_hevc_epel_hv16_8_c: 1925.2
put_hevc_epel_hv16_8_i8mm: 459.9
put_hevc_epel_hv24_8_c: 4301.9
put_hevc_epel_hv24_8_i8mm: 1024.9
put_hevc_epel_hv32_8_c: 7509.2
put_hevc_epel_hv32_8_i8mm: 1680.4
put_hevc_epel_hv48_8_c: 16566.9
put_hevc_epel_hv48_8_i8mm: 3945.4
put_hevc_epel_hv64_8_c: 29134.2
put_hevc_epel_hv64_8_i8mm: 6567.7

Co-Authored-By: J. Dekker <jdek@itanimul.li>
---
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 265 ++++++++++++++++++++++
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 2 files changed, 270 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index e541db5430..ebc16da5b6 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -1018,6 +1018,271 @@ function ff_hevc_put_hevc_epel_h64_8_neon_i8mm, export=1
         ret
 endfunc
 
+
+function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_epel_filterh x5, x4
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             d16, [sp]
+        ldr             d17, [sp, x10]
+        add             sp, sp, x10, lsl #1
+        ld1             {v18.4h}, [sp], x10
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().4h}, [sp], x10
+        calc_epelh      v4, \src0, \src1, \src2, \src3
+        subs            w3, w3, #1
+        st1             {v4.4h}, [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        ldp             x0,  x3, [sp, #16]
+        add             sp, sp, #32
+        load_epel_filterh x5, x4
+        mov             x5, #120
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             q16, [sp]
+        ldr             q17, [sp, x10]
+        add             sp, sp, x10, lsl #1
+        ld1             {v18.8h}, [sp], x10
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().8h}, [sp], x10
+        calc_epelh      v4,     \src0, \src1, \src2, \src3
+        calc_epelh2     v4, v5, \src0, \src1, \src2, \src3
+        st1             {v4.d}[0], [x0], #8
+        subs            w3, w3, #1
+        st1             {v4.s}[2], [x0], x5
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_epel_filterh x5, x4
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ldr             q16, [sp]
+        ldr             q17, [sp, x10]
+        add             sp, sp, x10, lsl #1
+        ld1             {v18.8h}, [sp], x10
+.macro calc src0, src1, src2, src3
+        ld1             {\src3\().8h}, [sp], x10
+        calc_epelh      v4,     \src0, \src1, \src2, \src3
+        calc_epelh2     v4, v5, \src0, \src1, \src2, \src3
+        subs            w3, w3, #1
+        st1             {v4.8h}, [x0], x10
+.endm
+1:      calc_all4
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_epel_filterh x5, x4
+        mov             x5, #112
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.8h, v17.8h}, [sp], x10
+        ld1             {v18.8h, v19.8h}, [sp], x10
+        ld1             {v20.8h, v21.8h}, [sp], x10
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\src6\().8h, \src7\().8h}, [sp], x10
+        calc_epelh      v4,     \src0, \src2, \src4, \src6
+        calc_epelh2     v4, v5, \src0, \src2, \src4, \src6
+        calc_epelh      v5,     \src1, \src3, \src5, \src7
+        st1             {v4.8h}, [x0], #16
+        subs            w3, w3, #1
+        st1             {v5.4h}, [x0], x5
+.endm
+1:      calc_all8
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_epel_filterh x5, x4
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.8h, v17.8h}, [sp], x10
+        ld1             {v18.8h, v19.8h}, [sp], x10
+        ld1             {v20.8h, v21.8h}, [sp], x10
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7
+        ld1             {\src6\().8h, \src7\().8h}, [sp], x10
+        calc_epelh      v4,     \src0, \src2, \src4, \src6
+        calc_epelh2     v4, v5, \src0, \src2, \src4, \src6
+        calc_epelh      v5,     \src1, \src3, \src5, \src7
+        calc_epelh2     v5, v6, \src1, \src3, \src5, \src7
+        subs            w3, w3, #1
+        st1             {v4.8h, v5.8h}, [x0], x10
+.endm
+1:      calc_all8
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+        ldp             x5, x30, [sp]
+        ldp             x0, x3, [sp, #16]
+        add             sp, sp, #32
+        load_epel_filterh x5, x4
+        mov             x10, #(MAX_PB_SIZE * 2)
+        ld1             {v16.8h, v17.8h, v18.8h}, [sp], x10
+        ld1             {v19.8h, v20.8h, v21.8h}, [sp], x10
+        ld1             {v22.8h, v23.8h, v24.8h}, [sp], x10
+.macro calc src0, src1, src2, src3, src4, src5, src6, src7, src8, src9, src10, src11
+        ld1             {\src9\().8h-\src11\().8h}, [sp], x10
+        calc_epelh      v4,     \src0, \src3, \src6, \src9
+        calc_epelh2     v4, v5, \src0, \src3, \src6, \src9
+        calc_epelh      v5,     \src1, \src4, \src7, \src10
+        calc_epelh2     v5, v6, \src1, \src4, \src7, \src10
+        calc_epelh      v6,     \src2, \src5, \src8, \src11
+        calc_epelh2     v6, v7, \src2, \src5, \src8, \src11
+        subs            w3, w3, #1
+        st1             {v4.8h-v6.8h}, [x0], x10
+.endm
+1:      calc_all12
+.purgem calc
+2:      ret
+endfunc
+
+function ff_hevc_put_hevc_epel_hv32_8_neon_i8mm, export=1
+        stp             x4, x5, [sp, #-64]!
+        stp             x2, x3, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        str             x30, [sp, #48]
+        mov             x6, #16
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        ldp             x4, x5, [sp]
+        ldp             x2, x3, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        add             sp, sp, #48
+        add             x0, x0, #32
+        add             x1, x1, #16
+        mov             x6, #16
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        ldr             x30, [sp], #16
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_hv48_8_neon_i8mm, export=1
+        stp             x4, x5, [sp, #-64]!
+        stp             x2, x3, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        str             x30, [sp, #48]
+        mov             x6, #24
+        bl              X(ff_hevc_put_hevc_epel_hv24_8_neon_i8mm)
+        ldp             x4, x5, [sp]
+        ldp             x2, x3, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        add             sp, sp, #48
+        add             x0, x0, #48
+        add             x1, x1, #24
+        mov             x6, #24
+        bl              X(ff_hevc_put_hevc_epel_hv24_8_neon_i8mm)
+        ldr             x30, [sp], #16
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_hv64_8_neon_i8mm, export=1
+        stp             x4, x5, [sp, #-64]!
+        stp             x2, x3, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        str             x30, [sp, #48]
+        mov             x6, #16
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        ldp             x4, x5, [sp]
+        ldp             x2, x3, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        add             x0, x0, #32
+        add             x1, x1, #16
+        mov             x6, #16
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        ldp             x4, x5, [sp]
+        ldp             x2, x3, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        add             x0, x0, #64
+        add             x1, x1, #32
+        mov             x6, #16
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        ldp             x4, x5, [sp]
+        ldp             x2, x3, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        add             sp, sp, #48
+        add             x0, x0, #96
+        add             x1, x1, #48
+        mov             x6, #16
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        ldr             x30, [sp], #16
+        ret
+endfunc
+
 function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
         add             w10, w4, #3
         lsl             x10, x10, #7
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 82e1623a67..e9a341ecb9 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -191,6 +191,10 @@ NEON8_FNPROTO(epel_h, (int16_t *dst,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
 
+NEON8_FNPROTO(epel_hv, (int16_t *dst,
+        const uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width), _i8mm);
+
 NEON8_FNPROTO(epel_uni_w_h, (uint8_t *_dst,  ptrdiff_t _dststride,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, int denom, int wx, int ox,
@@ -322,6 +326,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
+            NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h ,_i8mm);
             NEON8_FNASSIGN(c->put_hevc_qpel, 0, 1, qpel_h, _i8mm);
-- 
2.38.0.windows.1


[-- Attachment #3: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
@ 2023-07-17  7:08 Водянников Александр
  0 siblings, 0 replies; 34+ messages in thread
From: Водянников Александр @ 2023-07-17  7:08 UTC (permalink / raw)
  To: ffmpeg-devel

[-- Attachment #1: Type: text/plain, Size: 1 bytes --]



[-- Attachment #2: 0001-Fixed-crash-when-using-hardware-acceleration-in-thir.txt --]
[-- Type: text/plain, Size: 1766 bytes --]

From 0fe666c4e3d10a689f4c6854a58eec3e7ff3c922 Mon Sep 17 00:00:00 2001
From: Aleksoid <Aleksoid1978@mail.ru>
Date: Mon, 17 Jul 2023 17:04:43 +1000
Subject: [PATCH] Fixed crash when using hardware acceleration in third party
 projects without using hw_frames_ctx.

---
 libavcodec/decode.c | 27 +++++++++++++++------------
 1 file changed, 15 insertions(+), 12 deletions(-)

diff --git a/libavcodec/decode.c b/libavcodec/decode.c
index a19cca1a7c..f34f169910 100644
--- a/libavcodec/decode.c
+++ b/libavcodec/decode.c
@@ -1802,18 +1802,21 @@ AVBufferRef *ff_hwaccel_frame_priv_alloc(AVCodecContext *avctx,
                                          const AVHWAccel *hwaccel)
 {
     AVBufferRef *ref;
-    AVHWFramesContext *frames_ctx = (AVHWFramesContext *)avctx->hw_frames_ctx->data;
-    uint8_t *data = av_mallocz(hwaccel->frame_priv_data_size);
-    if (!data)
-        return NULL;
-
-    ref = av_buffer_create(data, hwaccel->frame_priv_data_size,
-                           hwaccel->free_frame_priv,
-                           frames_ctx->device_ctx, 0);
-    if (!ref) {
-        av_free(data);
-        return NULL;
-    }
+    if (avctx->hw_frames_ctx) {
+        AVHWFramesContext *frames_ctx = (AVHWFramesContext *)avctx->hw_frames_ctx->data;
+        uint8_t *data = av_mallocz(hwaccel->frame_priv_data_size);
+        if (!data)
+            return NULL;
+
+        ref = av_buffer_create(data, hwaccel->frame_priv_data_size,
+                               hwaccel->free_frame_priv,
+                               frames_ctx->device_ctx, 0);
+        if (!ref) {
+            av_free(data);
+            return NULL;
+        }
+    } else
+        ref = av_buffer_allocz(hwaccel->frame_priv_data_size);
 
     return ref;
 }
-- 
2.41.0.windows.1


[-- Attachment #3: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
  2023-02-09 14:25 [FFmpeg-devel] [PATCH] avdevice/xcbgrab: enable window resizing aline.gondimsantos
@ 2023-02-09 18:19 ` Aline Gondim Santos
  0 siblings, 0 replies; 34+ messages in thread
From: Aline Gondim Santos @ 2023-02-09 18:19 UTC (permalink / raw)
  To: ffmpeg-devel



Hello Nicolas,
Below you can find the benchmarks using the `ffmpeg -benchmark` option.

1 - master

./ffmpeg -benchmark -t 10 -framerate 25 -f x11grab -i ":1+0,0 1920x1080"  output1master.mp4
ffmpeg version N-109782-g458ae405ef Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.3.0-1ubuntu1~22.04)
  configuration: 
  libavutil      57. 44.100 / 57. 44.100
  libavcodec     59. 63.100 / 59. 63.100
  libavformat    59. 38.100 / 59. 38.100
  libavdevice    59.  8.101 / 59.  8.101
  libavfilter     8. 56.100 /  8. 56.100
  libswscale      6.  8.112 /  6.  8.112
  libswresample   4.  9.100 /  4.  9.100
[x11grab @ 0x564d03e165c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
Input #0, x11grab, from ':1+0,0 1920x1080':
  Duration: N/A, start: 1675963927.428661, bitrate: 3379200 kb/s
  Stream #0:0: Video: rawvideo (BGR[0] / 0x524742), bgr0, 3520x1200, 3379200 kb/s, 25 fps, 1000k tbr, 1000k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (rawvideo (native) -> mpeg4 (native))
Press [q] to stop, [?] for help
Output #0, mp4, to 'output1master.mp4':
  Metadata:
    encoder         : Lavf59.38.100
  Stream #0:0: Video: mpeg4 (mp4v / 0x7634706D), yuv420p(tv, progressive), 3520x1200, q=2-31, 200 kb/s, 25 fps, 12800 tbn
    Metadata:
      encoder         : Lavc59.63.100 mpeg4
    Side data:
      cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: N/A
frame=  251 fps= 25 q=31.0 Lsize=    5720kB time=00:00:10.00 bitrate=4686.0kbits/s speed=0.996x    
video:5718kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.034719%
bench: utime=13.142s stime=0.307s rtime=10.039s
bench: maxrss=207576kB

2 - master

./ffmpeg -benchmark -t 10 -framerate 25 -f x11grab -window_id 0x5600008 -i ":1+0,0 1920x1080"  output2master.mp4
ffmpeg version N-109782-g458ae405ef Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.3.0-1ubuntu1~22.04)
  configuration: 
  libavutil      57. 44.100 / 57. 44.100
  libavcodec     59. 63.100 / 59. 63.100
  libavformat    59. 38.100 / 59. 38.100
  libavdevice    59.  8.101 / 59.  8.101
  libavfilter     8. 56.100 /  8. 56.100
  libswscale      6.  8.112 /  6.  8.112
  libswresample   4.  9.100 /  4.  9.100
Input #0, x11grab, from ':1+0,0 1920x1080':
  Duration: N/A, start: 1675963986.581500, bitrate: 472305 kb/s
  Stream #0:0: Video: rawvideo (BGR[0] / 0x524742), bgr0, 841x702, 472305 kb/s, 25 fps, 25 tbr, 1000k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (rawvideo (native) -> mpeg4 (native))
Press [q] to stop, [?] for help
Output #0, mp4, to 'output2master.mp4':
  Metadata:
    encoder         : Lavf59.38.100
  Stream #0:0: Video: mpeg4 (mp4v / 0x7634706D), yuv420p(tv, progressive), 841x702, q=2-31, 200 kb/s, 25 fps, 12800 tbn
    Metadata:
      encoder         : Lavc59.63.100 mpeg4
    Side data:
      cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: N/A
frame=  250 fps= 25 q=31.0 Lsize=    1274kB time=00:00:09.96 bitrate=1047.9kbits/s speed=   1x    
video:1272kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.151768%
bench: utime=0.628s stime=1.465s rtime=9.920s
bench: maxrss=52268kB   


3 - patch applied

./ffmpeg -benchmark -t 10 -framerate 25 -f x11grab -i ":1+0,0 1920x1080"  output1.mp4
ffmpeg version N-109783-g2352934f8b Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.3.0-1ubuntu1~22.04)
  configuration: 
  libavutil      57. 44.100 / 57. 44.100
  libavcodec     59. 63.100 / 59. 63.100
  libavformat    59. 38.100 / 59. 38.100
  libavdevice    59.  8.101 / 59.  8.101
  libavfilter     8. 56.100 /  8. 56.100
  libswscale      6.  8.112 /  6.  8.112
  libswresample   4.  9.100 /  4.  9.100
[x11grab @ 0x55e86905b5c0] Stream #0: not enough frames to estimate rate; consider increasing probesize
Input #0, x11grab, from ':1+0,0 1920x1080':
  Duration: N/A, start: 1675964519.431271, bitrate: 3379200 kb/s
  Stream #0:0: Video: rawvideo (BGR[0] / 0x524742), bgr0, 3520x1200, 3379200 kb/s, 25 fps, 1000k tbr, 1000k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (rawvideo (native) -> mpeg4 (native))
Press [q] to stop, [?] for help
Output #0, mp4, to 'output1.mp4':
  Metadata:
    encoder         : Lavf59.38.100
  Stream #0:0: Video: mpeg4 (mp4v / 0x7634706D), yuv420p(tv, progressive), 3520x1200, q=2-31, 200 kb/s, 25 fps, 12800 tbn
    Metadata:
      encoder         : Lavc59.63.100 mpeg4
    Side data:
      cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: N/A
frame=  250 fps= 25 q=31.0 Lsize=    5723kB time=00:00:09.96 bitrate=4706.8kbits/s speed=0.996x    
video:5721kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.034227%
bench: utime=14.005s stime=0.168s rtime=9.998s
bench: maxrss=207828kB


4 - patch applied


./ffmpeg -benchmark -t 10 -framerate 25 -f x11grab -window_id 0x5600008 -i ":1+0,0 1920x1080"  output2.mp4
ffmpeg version N-109783-g2352934f8b Copyright (c) 2000-2023 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.3.0-1ubuntu1~22.04)
  configuration: 
  libavutil      57. 44.100 / 57. 44.100
  libavcodec     59. 63.100 / 59. 63.100
  libavformat    59. 38.100 / 59. 38.100
  libavdevice    59.  8.101 / 59.  8.101
  libavfilter     8. 56.100 /  8. 56.100
  libswscale      6.  8.112 /  6.  8.112
  libswresample   4.  9.100 /  4.  9.100
Input #0, x11grab, from ':1+0,0 1920x1080':
  Duration: N/A, start: 1675964455.191272, bitrate: 472305 kb/s
  Stream #0:0: Video: rawvideo (BGR[0] / 0x524742), bgr0, 841x702, 472305 kb/s, 25 fps, 24.83 tbr, 1000k tbn
Stream mapping:
  Stream #0:0 -> #0:0 (rawvideo (native) -> mpeg4 (native))
Press [q] to stop, [?] for help
Output #0, mp4, to 'output2.mp4':
  Metadata:
    encoder         : Lavf59.38.100
  Stream #0:0: Video: mpeg4 (mp4v / 0x7634706D), yuv420p(tv, progressive), 841x702, q=2-31, 200 kb/s, 24.83 fps, 19072 tbn
    Metadata:
      encoder         : Lavc59.63.100 mpeg4
    Side data:
      cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: N/A
frame=  250 fps= 25 q=31.0 Lsize=     961kB time=00:00:10.02 bitrate= 785.0kbits/s speed=1.01x    
video:959kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 0.199722%
bench: utime=1.624s stime=0.049s rtime=9.920s
bench: maxrss=51996kB
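
Summing utime and stime across the runs above: the full-desktop grab costs 13.449s of CPU time on master vs 14.173s with the patch, while the window grab drops from 2.093s on master to 1.673s with the patch.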

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [FFmpeg-devel] (no subject)
@ 2022-07-17 13:32 facefunk
  0 siblings, 0 replies; 34+ messages in thread
From: facefunk @ 2022-07-17 13:32 UTC (permalink / raw)
  To: ffmpeg-devel

Hi FFMDevs,

I've managed to get forced mov_text subtitles working in VLC Player. -disposition:s:0 +forced is honored, but I'm not 100% sure about my approach.
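
For illustration, a command along these lines exercises that disposition (the file names here are just placeholders):

  ffmpeg -i video.mp4 -i subs.srt -map 0 -map 1 \
      -c:v copy -c:a copy -c:s mov_text \
      -disposition:s:0 +forced out.mp4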

The attached patch represents the best idea I came up with so far, as the code is minimal and it doesn't require the user to set any extra parameters. However, it does puncture an abstraction boundary ever so slightly by copying stream data to the codec; perhaps this isn't a problem.

If there's anybody who could look over my patch and let me know if there's a better way of going about this, that would be greatly appreciated. 

Love your work!

Kind regards,

facefunk


end of thread, other threads:[~2025-05-30 10:34 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-05-27  7:55 [FFmpeg-devel] (no subject) Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 01/17] swscale/format: rename legacy format conversion table Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 02/17] swscale/format: add ff_fmt_clear() Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 03/17] tests/checkasm: increase number of runs in between measurements Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 04/17] tests/checkasm: generalize DEF_CHECKASM_CHECK_FUNC to floats Niklas Haas
2025-05-27  8:24   ` Martin Storsjö
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 05/17] swscale: add SWS_UNSTABLE flag Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 06/17] swscale/ops: introduce new low level framework Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 07/17] swscale/optimizer: add high-level ops optimizer Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 08/17] swscale/ops_internal: add internal ops backend API Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 09/17] swscale/ops: add dispatch layer Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 10/17] swscale/optimizer: add packed shuffle solver Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 11/17] swscale/ops_chain: add internal abstraction for kernel linking Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 12/17] swscale/ops_backend: add reference backend based on C templates Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 13/17] swscale/ops_memcpy: add 'memcpy' backend for plane->plane copies Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 14/17] swscale/x86: add SIMD backend Niklas Haas
2025-05-30  2:23   ` Michael Niedermayer
2025-05-30 10:34     ` Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 15/17] tests/checkasm: add checkasm tests for swscale ops Niklas Haas
2025-05-27  8:25   ` Martin Storsjö
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 16/17] swscale/format: add new format decode/encode logic Niklas Haas
2025-05-27  7:55 ` [FFmpeg-devel] [PATCH v3 17/17] swscale/graph: allow experimental use of new format handler Niklas Haas
2025-05-27  8:29 ` [FFmpeg-devel] (no subject) Kieran Kunhya via ffmpeg-devel
2025-05-27  8:51   ` Niklas Haas
  -- strict thread matches above, loose matches on Subject: below --
2024-08-07 15:58 cyfdel-at-hotmail.com
2024-04-18  9:42 pengxu
2024-04-18  7:36 pengxu
2023-10-14  8:40 Logan.Lyu
2023-10-14  8:40 Logan.Lyu
2023-10-14  8:39 Logan.Lyu
2023-10-14  8:39 Logan.Lyu
2023-07-17  7:08 Водянников Александр
2023-02-09 14:25 [FFmpeg-devel] [PATCH] avdevice/xcbgrab: enable window resizing aline.gondimsantos
2023-02-09 18:19 ` [FFmpeg-devel] (no subject) Aline Gondim Santos
2022-07-17 13:32 facefunk

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev

Example config snippet for mirrors.
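
For reference, a minimal sketch of the entry that public-inbox-init writes to
~/.public-inbox/config (the inboxdir path is a placeholder and should point at
the ffmpegdev/ directory initialized above; pre-1.2 releases use the key
mainrepo instead of inboxdir):

	[publicinbox "ffmpegdev"]
		inboxdir = /path/to/ffmpegdev
		address = ffmpegdev@gitmailbox.com
		url = https://master.gitmailbox.com/ffmpegdev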


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git