Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH 1/2] swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion
@ 2026-02-06  6:45 David Christle via ffmpeg-devel
  2026-02-06  6:46 ` [FFmpeg-devel] [PATCH 2/2] tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded strides David Christle via ffmpeg-devel
  2026-02-08 21:15 ` [FFmpeg-devel] [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix David Christle via ffmpeg-devel
  0 siblings, 2 replies; 7+ messages in thread
From: David Christle via ffmpeg-devel @ 2026-02-06  6:45 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: David Christle

Add ARM64 NEON-accelerated unscaled YUV-to-RGB conversion for planar
YUV input formats. This extends the existing NV12/NV21 NEON paths with
YUV420P, YUV422P, and YUVA420P support for all packed RGB output
formats (ARGB, RGBA, ABGR, BGRA, RGB24, BGR24) and planar GBRP.

Register the new kernels through ff_yuv2rgb_init_aarch64() so that the
scaled path is covered as well.

checkasm: all 42 sw_yuv2rgb tests pass.
Speedup vs C at 1920px width (Apple M3 Max, avg of 20 runs):
  yuv420p->rgb24:   4.3x    yuv420p->argb:   3.1x
  yuv422p->rgb24:   5.5x    yuv422p->argb:   4.1x
  yuva420p->argb:   3.5x    yuva420p->rgba:  3.5x
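
The fixed-point math these NEON kernels implement corresponds in spirit to the
standard limited-range BT.601 integer conversion. A scalar C sketch, using the
common 8.8 fixed-point approximations rather than FFmpeg's exact table-driven
coefficients (the names and constants here are illustrative only):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative limited-range BT.601 integer YUV->RGB. The NEON code does
 * the equivalent with sqdmulh on pre-shifted values; the 298/409/100/208/516
 * coefficients below are the textbook 8.8 approximations, not FFmpeg's
 * runtime-computed tables. */
static uint8_t clip8(int v) { return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; }

static void yuv2rgb_px(uint8_t y, uint8_t u, uint8_t v,
                       uint8_t *r, uint8_t *g, uint8_t *b)
{
    int c = 298 * (y - 16);          /* y_coeff applied to offset luma */
    int d = u - 128, e = v - 128;    /* center chroma, as the asm does */
    *r = clip8((c           + 409 * e + 128) >> 8);
    *g = clip8((c - 100 * d - 208 * e + 128) >> 8);
    *b = clip8((c + 516 * d           + 128) >> 8);
}
```

Limited-range black (Y=16) and white (Y=235) map to 0 and 255, and neutral
chroma yields gray, which is the sanity check the real kernels must also pass.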

Signed-off-by: David Christle <dev@christle.is>
---
 libswscale/aarch64/swscale_unscaled.c |  90 ++++++++++++++++++
 libswscale/aarch64/yuv2rgb_neon.S     | 130 +++++++++++++++++++++++---
 libswscale/swscale_internal.h         |   1 +
 libswscale/yuv2rgb.c                  |   2 +
 4 files changed, 208 insertions(+), 15 deletions(-)

diff --git a/libswscale/aarch64/swscale_unscaled.c b/libswscale/aarch64/swscale_unscaled.c
index fdecafd94b..8fe4ae9953 100644
--- a/libswscale/aarch64/swscale_unscaled.c
+++ b/libswscale/aarch64/swscale_unscaled.c
@@ -89,10 +89,45 @@ DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, rgba)
 DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, abgr)                                                   \
 DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, bgra)                                                   \
 DECLARE_FF_YUVX_TO_GBRP_FUNCS(yuvx, gbrp)                                                   \
+DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, rgb24)                                                  \
+DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, bgr24)                                                  \
 
 DECLARE_FF_YUVX_TO_ALL_RGBX_FUNCS(yuv420p)
 DECLARE_FF_YUVX_TO_ALL_RGBX_FUNCS(yuv422p)
 
+#define DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(ofmt)                                             \
+int ff_yuva420p_to_##ofmt##_neon(int w, int h,                                               \
+                                 uint8_t *dst, int linesize,                                 \
+                                 const uint8_t *srcY, int linesizeY,                         \
+                                 const uint8_t *srcU, int linesizeU,                         \
+                                 const uint8_t *srcV, int linesizeV,                         \
+                                 const int16_t *table,                                       \
+                                 int y_offset, int y_coeff,                                  \
+                                 const uint8_t *srcA, int linesizeA);                        \
+                                                                                             \
+static int yuva420p_to_##ofmt##_neon_wrapper(SwsInternal *c,                                 \
+                                             const uint8_t *const src[],                     \
+                                             const int srcStride[], int srcSliceY,           \
+                                             int srcSliceH, uint8_t *const dst[],            \
+                                             const int dstStride[]) {                        \
+    const int16_t yuv2rgb_table[] = { YUV_TO_RGB_TABLE };                                    \
+                                                                                             \
+    return ff_yuva420p_to_##ofmt##_neon(c->opts.src_w, srcSliceH,                            \
+                                        dst[0] + srcSliceY * dstStride[0], dstStride[0],     \
+                                        src[0], srcStride[0],                                \
+                                        src[1], srcStride[1],                                \
+                                        src[2], srcStride[2],                                \
+                                        yuv2rgb_table,                                       \
+                                        c->yuv2rgb_y_offset >> 6,                            \
+                                        c->yuv2rgb_y_coeff,                                  \
+                                        src[3], srcStride[3]);                               \
+}
+
+DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(argb)
+DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(rgba)
+DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(abgr)
+DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(bgra)
+
 #define DECLARE_FF_NVX_TO_RGBX_FUNCS(ifmt, ofmt)                                            \
 int ff_##ifmt##_to_##ofmt##_neon(int w, int h,                                              \
                                  uint8_t *dst, int linesize,                                \
@@ -176,6 +211,8 @@ DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, rgba)
 DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, abgr)                                                     \
 DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, bgra)                                                     \
 DECLARE_FF_NVX_TO_GBRP_FUNCS(nvx, gbrp)                                                     \
+DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, rgb24)                                                    \
+DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, bgr24)                                                    \
 
 DECLARE_FF_NVX_TO_ALL_RGBX_FUNCS(nv12)
 DECLARE_FF_NVX_TO_ALL_RGBX_FUNCS(nv21)
@@ -199,6 +236,8 @@ DECLARE_FF_NVX_TO_ALL_RGBX_FUNCS(nv21)
     SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, abgr, ABGR, accurate_rnd);                            \
     SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, bgra, BGRA, accurate_rnd);                            \
     SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, gbrp, GBRP, accurate_rnd);                            \
+    SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, rgb24, RGB24, accurate_rnd);                          \
+    SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, bgr24, BGR24, accurate_rnd);                          \
 } while (0)
 
 static void get_unscaled_swscale_neon(SwsInternal *c) {
@@ -208,6 +247,13 @@ static void get_unscaled_swscale_neon(SwsInternal *c) {
     SET_FF_NVX_TO_ALL_RGBX_FUNC(nv21, NV21, accurate_rnd);
     SET_FF_NVX_TO_ALL_RGBX_FUNC(yuv420p, YUV420P, accurate_rnd);
     SET_FF_NVX_TO_ALL_RGBX_FUNC(yuv422p, YUV422P, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuva420p, YUVA420P, argb, ARGB, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuva420p, YUVA420P, rgba, RGBA, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuva420p, YUVA420P, abgr, ABGR, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuva420p, YUVA420P, bgra, BGRA, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuv420p, YUVA420P, rgb24, RGB24, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuv420p, YUVA420P, bgr24, BGR24, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuv420p, YUVA420P, gbrp,  GBRP,  accurate_rnd);
 
     if (c->opts.dst_format == AV_PIX_FMT_YUV420P &&
         (c->opts.src_format == AV_PIX_FMT_NV24 || c->opts.src_format == AV_PIX_FMT_NV42) &&
@@ -221,3 +267,47 @@ void ff_get_unscaled_swscale_aarch64(SwsInternal *c)
     if (have_neon(cpu_flags))
         get_unscaled_swscale_neon(c);
 }
+
+av_cold SwsFunc ff_yuv2rgb_init_aarch64(SwsInternal *c)
+{
+    int cpu_flags = av_get_cpu_flags();
+    if (!have_neon(cpu_flags) ||
+        (c->opts.src_h & 1) || (c->opts.src_w & 15) ||
+        (c->opts.flags & SWS_ACCURATE_RND))
+        return NULL;
+
+    if (c->opts.src_format == AV_PIX_FMT_YUV420P) {
+        switch (c->opts.dst_format) {
+        case AV_PIX_FMT_ARGB:  return yuv420p_to_argb_neon_wrapper;
+        case AV_PIX_FMT_RGBA:  return yuv420p_to_rgba_neon_wrapper;
+        case AV_PIX_FMT_ABGR:  return yuv420p_to_abgr_neon_wrapper;
+        case AV_PIX_FMT_BGRA:  return yuv420p_to_bgra_neon_wrapper;
+        case AV_PIX_FMT_RGB24: return yuv420p_to_rgb24_neon_wrapper;
+        case AV_PIX_FMT_BGR24: return yuv420p_to_bgr24_neon_wrapper;
+        case AV_PIX_FMT_GBRP:  return yuv420p_to_gbrp_neon_wrapper;
+        }
+    } else if (c->opts.src_format == AV_PIX_FMT_YUVA420P) {
+        switch (c->opts.dst_format) {
+#if CONFIG_SWSCALE_ALPHA
+        case AV_PIX_FMT_ARGB:  return yuva420p_to_argb_neon_wrapper;
+        case AV_PIX_FMT_RGBA:  return yuva420p_to_rgba_neon_wrapper;
+        case AV_PIX_FMT_ABGR:  return yuva420p_to_abgr_neon_wrapper;
+        case AV_PIX_FMT_BGRA:  return yuva420p_to_bgra_neon_wrapper;
+#endif
+        case AV_PIX_FMT_RGB24: return yuv420p_to_rgb24_neon_wrapper;
+        case AV_PIX_FMT_BGR24: return yuv420p_to_bgr24_neon_wrapper;
+        case AV_PIX_FMT_GBRP:  return yuv420p_to_gbrp_neon_wrapper;
+        }
+    } else if (c->opts.src_format == AV_PIX_FMT_YUV422P) {
+        switch (c->opts.dst_format) {
+        case AV_PIX_FMT_ARGB:  return yuv422p_to_argb_neon_wrapper;
+        case AV_PIX_FMT_RGBA:  return yuv422p_to_rgba_neon_wrapper;
+        case AV_PIX_FMT_ABGR:  return yuv422p_to_abgr_neon_wrapper;
+        case AV_PIX_FMT_BGRA:  return yuv422p_to_bgra_neon_wrapper;
+        case AV_PIX_FMT_RGB24: return yuv422p_to_rgb24_neon_wrapper;
+        case AV_PIX_FMT_BGR24: return yuv422p_to_bgr24_neon_wrapper;
+        case AV_PIX_FMT_GBRP:  return yuv422p_to_gbrp_neon_wrapper;
+        }
+    }
+    return NULL;
+}
diff --git a/libswscale/aarch64/yuv2rgb_neon.S b/libswscale/aarch64/yuv2rgb_neon.S
index 0797a6d5e0..19f750545f 100644
--- a/libswscale/aarch64/yuv2rgb_neon.S
+++ b/libswscale/aarch64/yuv2rgb_neon.S
@@ -55,7 +55,17 @@
         load_dst1_dst2  24, 32, 40, 48
         sub             w3, w3, w0                                      // w3 = linesize  - width     (padding)
 .else
+ .ifc \ofmt,rgb24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+ .else
+  .ifc \ofmt,bgr24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+  .else
         sub             w3, w3, w0, lsl #2                              // w3 = linesize  - width * 4 (padding)
+  .endif
+ .endif
 .endif
         sub             w5, w5, w0                                      // w5 = linesizeY - width     (paddingY)
         sub             w7, w7, w0                                      // w7 = linesizeC - width     (paddingC)
@@ -78,7 +88,17 @@
         load_dst1_dst2  40, 48, 56, 64
         sub             w3, w3, w0                                      // w3 = linesize  - width     (padding)
 .else
+ .ifc \ofmt,rgb24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+ .else
+  .ifc \ofmt,bgr24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+  .else
         sub             w3, w3, w0, lsl #2                              // w3 = linesize  - width * 4 (padding)
+  .endif
+ .endif
 .endif
         sub             w5, w5, w0                                      // w5 = linesizeY - width     (paddingY)
         sub             w7,  w7,  w0, lsr #1                            // w7  = linesizeU - width / 2 (paddingU)
@@ -87,6 +107,18 @@
         neg             w11, w11
 .endm
 
+.macro load_args_yuva420p ofmt
+        load_args_yuv420p \ofmt
+#if defined(__APPLE__)
+        ldr             x15, [sp, #32]                                  // srcA
+        ldr             w16, [sp, #40]                                  // linesizeA
+#else
+        ldr             x15, [sp, #40]                                  // srcA
+        ldr             w16, [sp, #48]                                  // linesizeA
+#endif
+        sub             w16, w16, w0                                    // w16 = linesizeA - width    (paddingA)
+.endm
+
 .macro load_args_yuv422p ofmt
         ldr             x13, [sp]                                       // srcV
         ldr             w14, [sp, #8]                                   // linesizeV
@@ -99,7 +131,17 @@
         load_dst1_dst2  40, 48, 56, 64
         sub             w3, w3, w0                                      // w3 = linesize  - width     (padding)
 .else
+ .ifc \ofmt,rgb24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+ .else
+  .ifc \ofmt,bgr24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+  .else
         sub             w3, w3, w0, lsl #2                              // w3 = linesize  - width * 4 (padding)
+  .endif
+ .endif
 .endif
         sub             w5, w5, w0                                      // w5 = linesizeY - width     (paddingY)
         sub             w7,  w7,  w0, lsr #1                            // w7  = linesizeU - width / 2 (paddingU)
@@ -125,6 +167,10 @@
         ushll           v19.8h, v17.8b, #3
 .endm
 
+.macro load_chroma_yuva420p
+        load_chroma_yuv420p
+.endm
+
 .macro load_chroma_yuv422p
         load_chroma_yuv420p
 .endm
@@ -147,6 +193,11 @@
         add             x13, x13, w17, sxtw                             // srcV += incV
 .endm
 
+.macro increment_yuva420p
+        increment_yuv420p
+        add             x15, x15, w16, sxtw                             // srcA += paddingA (every row)
+.endm
+
 .macro increment_yuv422p
         add             x6,  x6,  w7, sxtw                              // srcU += incU
         add             x13, x13, w14, sxtw                             // srcV += incV
@@ -169,65 +220,103 @@
 
 .macro compute_rgba r1 g1 b1 a1 r2 g2 b2 a2
         compute_rgb     \r1, \g1, \b1, \r2, \g2, \b2
-        movi            \a1, #255
-        movi            \a2, #255
+        mov             \a1, v30.8b
+        mov             \a2, v30.8b
+.endm
+
+.macro compute_rgba_alpha r1 g1 b1 a1 r2 g2 b2 a2
+        compute_rgb     \r1, \g1, \b1, \r2, \g2, \b2
+        mov             \a1, v28.8b                                     // real alpha (first 8 pixels)
+        mov             \a2, v29.8b                                     // real alpha (next 8 pixels)
 .endm
 
 .macro declare_func ifmt ofmt
 function ff_\ifmt\()_to_\ofmt\()_neon, export=1
         load_args_\ifmt \ofmt
 
+        movi            v31.8h, #4, lsl #8                              // 128 * (1<<3) (loop-invariant)
+        movi            v30.8b, #255                                    // alpha = 255  (loop-invariant)
         mov             w9, w1
 1:
         mov             w8, w0                                          // w8 = width
 2:
-        movi            v5.8h, #4, lsl #8                               // 128 * (1<<3)
         load_chroma_\ifmt
-        sub             v18.8h, v18.8h, v5.8h                           // U*(1<<3) - 128*(1<<3)
-        sub             v19.8h, v19.8h, v5.8h                           // V*(1<<3) - 128*(1<<3)
+        sub             v18.8h, v18.8h, v31.8h                          // U*(1<<3) - 128*(1<<3)
+        sub             v19.8h, v19.8h, v31.8h                          // V*(1<<3) - 128*(1<<3)
         sqdmulh         v20.8h, v19.8h, v1.h[0]                         // V * v2r            (R)
         sqdmulh         v22.8h, v18.8h, v1.h[1]                         // U * u2g
+        ld1             {v2.16b}, [x4], #16                             // load luma (interleaved)
+.ifc \ifmt,yuva420p
+        ld1             {v28.8b, v29.8b}, [x15], #16                    // load 16 alpha bytes
+.endif
         sqdmulh         v19.8h, v19.8h, v1.h[2]                         //           V * v2g
-        add             v22.8h, v22.8h, v19.8h                          // U * u2g + V * v2g  (G)
         sqdmulh         v24.8h, v18.8h, v1.h[3]                         // U * u2b            (B)
-        zip2            v21.8h, v20.8h, v20.8h                          // R2
-        zip1            v20.8h, v20.8h, v20.8h                          // R1
-        zip2            v23.8h, v22.8h, v22.8h                          // G2
-        zip1            v22.8h, v22.8h, v22.8h                          // G1
-        zip2            v25.8h, v24.8h, v24.8h                          // B2
-        zip1            v24.8h, v24.8h, v24.8h                          // B1
-        ld1             {v2.16b}, [x4], #16                             // load luma
         ushll           v26.8h, v2.8b,  #3                              // Y1*(1<<3)
         ushll2          v27.8h, v2.16b, #3                              // Y2*(1<<3)
+        add             v22.8h, v22.8h, v19.8h                          // U * u2g + V * v2g  (G)
         sub             v26.8h, v26.8h, v3.8h                           // Y1*(1<<3) - y_offset
         sub             v27.8h, v27.8h, v3.8h                           // Y2*(1<<3) - y_offset
+        zip2            v21.8h, v20.8h, v20.8h                          // R2
+        zip1            v20.8h, v20.8h, v20.8h                          // R1
         sqdmulh         v26.8h, v26.8h, v0.8h                           // ((Y1*(1<<3) - y_offset) * y_coeff) >> 15
         sqdmulh         v27.8h, v27.8h, v0.8h                           // ((Y2*(1<<3) - y_offset) * y_coeff) >> 15
+        zip2            v23.8h, v22.8h, v22.8h                          // G2
+        zip1            v22.8h, v22.8h, v22.8h                          // G1
+        zip2            v25.8h, v24.8h, v24.8h                          // B2
+        zip1            v24.8h, v24.8h, v24.8h                          // B1
 
 .ifc \ofmt,argb // 1 2 3 0
+ .ifc \ifmt,yuva420p
+        compute_rgba_alpha v5.8b,v6.8b,v7.8b,v4.8b, v17.8b,v18.8b,v19.8b,v16.8b
+ .else
         compute_rgba    v5.8b,v6.8b,v7.8b,v4.8b, v17.8b,v18.8b,v19.8b,v16.8b
+ .endif
 .endif
 
 .ifc \ofmt,rgba // 0 1 2 3
+ .ifc \ifmt,yuva420p
+        compute_rgba_alpha v4.8b,v5.8b,v6.8b,v7.8b, v16.8b,v17.8b,v18.8b,v19.8b
+ .else
         compute_rgba    v4.8b,v5.8b,v6.8b,v7.8b, v16.8b,v17.8b,v18.8b,v19.8b
+ .endif
 .endif
 
 .ifc \ofmt,abgr // 3 2 1 0
+ .ifc \ifmt,yuva420p
+        compute_rgba_alpha v7.8b,v6.8b,v5.8b,v4.8b, v19.8b,v18.8b,v17.8b,v16.8b
+ .else
         compute_rgba    v7.8b,v6.8b,v5.8b,v4.8b, v19.8b,v18.8b,v17.8b,v16.8b
+ .endif
 .endif
 
 .ifc \ofmt,bgra // 2 1 0 3
+ .ifc \ifmt,yuva420p
+        compute_rgba_alpha v6.8b,v5.8b,v4.8b,v7.8b, v18.8b,v17.8b,v16.8b,v19.8b
+ .else
         compute_rgba    v6.8b,v5.8b,v4.8b,v7.8b, v18.8b,v17.8b,v16.8b,v19.8b
+ .endif
 .endif
 
-.ifc \ofmt,gbrp
+.ifc \ofmt,rgb24
+        compute_rgb     v4.8b,v5.8b,v6.8b, v16.8b,v17.8b,v18.8b
+        st3             { v4.8b, v5.8b, v6.8b}, [x2], #24
+        st3             {v16.8b,v17.8b,v18.8b}, [x2], #24
+.else
+ .ifc \ofmt,bgr24
+        compute_rgb     v6.8b,v5.8b,v4.8b, v18.8b,v17.8b,v16.8b
+        st3             { v4.8b, v5.8b, v6.8b}, [x2], #24
+        st3             {v16.8b,v17.8b,v18.8b}, [x2], #24
+ .else
+  .ifc \ofmt,gbrp
         compute_rgb     v18.8b,v4.8b,v6.8b, v19.8b,v5.8b,v7.8b
         st1             {  v4.8b,  v5.8b }, [x2],  #16
         st1             {  v6.8b,  v7.8b }, [x10], #16
         st1             { v18.8b, v19.8b }, [x15], #16
-.else
+  .else
         st4             { v4.8b, v5.8b, v6.8b, v7.8b}, [x2], #32
         st4             {v16.8b,v17.8b,v18.8b,v19.8b}, [x2], #32
+  .endif
+ .endif
 .endif
         subs            w8, w8, #16                                     // width -= 16
         b.gt            2b
@@ -251,9 +340,20 @@ endfunc
         declare_func    \ifmt, abgr
         declare_func    \ifmt, bgra
         declare_func    \ifmt, gbrp
+        declare_func    \ifmt, rgb24
+        declare_func    \ifmt, bgr24
 .endm
 
 declare_rgb_funcs nv12
 declare_rgb_funcs nv21
 declare_rgb_funcs yuv420p
 declare_rgb_funcs yuv422p
+
+.macro declare_yuva_funcs ifmt
+        declare_func    \ifmt, argb
+        declare_func    \ifmt, rgba
+        declare_func    \ifmt, abgr
+        declare_func    \ifmt, bgra
+.endm
+
+declare_yuva_funcs yuva420p
diff --git a/libswscale/swscale_internal.h b/libswscale/swscale_internal.h
index 5c58272664..c671f1c7cd 100644
--- a/libswscale/swscale_internal.h
+++ b/libswscale/swscale_internal.h
@@ -739,6 +739,7 @@ av_cold int ff_sws_fill_xyztables(SwsInternal *c);
 SwsFunc ff_yuv2rgb_init_x86(SwsInternal *c);
 SwsFunc ff_yuv2rgb_init_ppc(SwsInternal *c);
 SwsFunc ff_yuv2rgb_init_loongarch(SwsInternal *c);
+SwsFunc ff_yuv2rgb_init_aarch64(SwsInternal *c);
 
 static av_always_inline int is16BPS(enum AVPixelFormat pix_fmt)
 {
diff --git a/libswscale/yuv2rgb.c b/libswscale/yuv2rgb.c
index 48089760f5..c62201856d 100644
--- a/libswscale/yuv2rgb.c
+++ b/libswscale/yuv2rgb.c
@@ -568,6 +568,8 @@ SwsFunc ff_yuv2rgb_get_func_ptr(SwsInternal *c)
     t = ff_yuv2rgb_init_x86(c);
 #elif ARCH_LOONGARCH64
     t = ff_yuv2rgb_init_loongarch(c);
+#elif ARCH_AARCH64
+    t = ff_yuv2rgb_init_aarch64(c);
 #endif
 
     if (t)
-- 
2.52.0


_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org


* [FFmpeg-devel] [PATCH 2/2] tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded strides
  2026-02-06  6:45 [FFmpeg-devel] [PATCH 1/2] swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion David Christle via ffmpeg-devel
@ 2026-02-06  6:46 ` David Christle via ffmpeg-devel
  2026-02-08 21:15 ` [FFmpeg-devel] [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix David Christle via ffmpeg-devel
  1 sibling, 0 replies; 7+ messages in thread
From: David Christle via ffmpeg-devel @ 2026-02-06  6:46 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: David Christle

Increase test height from 2 to 4 rows and add 32 bytes of source stride
padding. This exercises chroma row sharing across multiple luma row pairs
in YUV420P and the stride increment arithmetic in SIMD implementations,
both of which were previously untested.
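
The addressing this exercises can be sketched in C: with a padded stride, row
r of a plane starts at base + r * stride, and in 4:2:0 each chroma row is
shared by a pair of luma rows. The helpers below are hypothetical, written
only to mirror the stride arithmetic the expanded test now covers:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical offset helpers mirroring what the test exercises: padded
 * strides mean stride != width, and for 4:2:0 (log2_chroma_h == 1) two
 * consecutive luma rows map to the same chroma row. */
static size_t luma_off(int row, int stride)   { return (size_t)row * stride; }
static size_t chroma_off(int row, int stride) { return (size_t)(row >> 1) * stride; }
```

With 4 rows, luma rows 0/1 resolve to chroma row 0 and rows 2/3 to chroma
row 1, which is the "chroma row sharing across multiple luma row pairs" case
that a 2-row test could never hit.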

Signed-off-by: David Christle <dev@christle.is>
---
 tests/checkasm/sw_yuv2rgb.c | 96 +++++++++++++++++++------------------
 1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/tests/checkasm/sw_yuv2rgb.c b/tests/checkasm/sw_yuv2rgb.c
index c6c1ad934b..ba579d0545 100644
--- a/tests/checkasm/sw_yuv2rgb.c
+++ b/tests/checkasm/sw_yuv2rgb.c
@@ -104,6 +104,8 @@ static void check_yuv2rgb(int src_pix_fmt)
 {
     const AVPixFmtDescriptor *src_desc = av_pix_fmt_desc_get(src_pix_fmt);
 #define MAX_LINE_SIZE 1920
+#define SRC_STRIDE_PAD 32
+#define NUM_LINES 4
     static const int input_sizes[] = {8, 128, 1080, MAX_LINE_SIZE};
 
     declare_func_emms(AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT,
@@ -111,36 +113,26 @@ static void check_yuv2rgb(int src_pix_fmt)
                            const int srcStride[], int srcSliceY, int srcSliceH,
                            uint8_t *const dst[], const int dstStride[]);
 
-    LOCAL_ALIGNED_8(uint8_t, src_y, [MAX_LINE_SIZE * 2]);
-    LOCAL_ALIGNED_8(uint8_t, src_u, [MAX_LINE_SIZE]);
-    LOCAL_ALIGNED_8(uint8_t, src_v, [MAX_LINE_SIZE]);
-    LOCAL_ALIGNED_8(uint8_t, src_a, [MAX_LINE_SIZE * 2]);
+    LOCAL_ALIGNED_8(uint8_t, src_y, [(MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES]);
+    LOCAL_ALIGNED_8(uint8_t, src_u, [(MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES]);
+    LOCAL_ALIGNED_8(uint8_t, src_v, [(MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES]);
+    LOCAL_ALIGNED_8(uint8_t, src_a, [(MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES]);
     const uint8_t *src[4] = { src_y, src_u, src_v, src_a };
 
-    LOCAL_ALIGNED_8(uint8_t, dst0_0, [2 * MAX_LINE_SIZE * 6]);
-    LOCAL_ALIGNED_8(uint8_t, dst0_1, [2 * MAX_LINE_SIZE]);
-    LOCAL_ALIGNED_8(uint8_t, dst0_2, [2 * MAX_LINE_SIZE]);
+    LOCAL_ALIGNED_8(uint8_t, dst0_0, [NUM_LINES * MAX_LINE_SIZE * 6]);
+    LOCAL_ALIGNED_8(uint8_t, dst0_1, [NUM_LINES * MAX_LINE_SIZE]);
+    LOCAL_ALIGNED_8(uint8_t, dst0_2, [NUM_LINES * MAX_LINE_SIZE]);
     uint8_t *dst0[4] = { dst0_0, dst0_1, dst0_2 };
-    uint8_t *lines0[4][2] = {
-        { dst0_0, dst0_0 + MAX_LINE_SIZE * 6 },
-        { dst0_1, dst0_1 + MAX_LINE_SIZE },
-        { dst0_2, dst0_2 + MAX_LINE_SIZE }
-    };
-
-    LOCAL_ALIGNED_8(uint8_t, dst1_0, [2 * MAX_LINE_SIZE * 6]);
-    LOCAL_ALIGNED_8(uint8_t, dst1_1, [2 * MAX_LINE_SIZE]);
-    LOCAL_ALIGNED_8(uint8_t, dst1_2, [2 * MAX_LINE_SIZE]);
+
+    LOCAL_ALIGNED_8(uint8_t, dst1_0, [NUM_LINES * MAX_LINE_SIZE * 6]);
+    LOCAL_ALIGNED_8(uint8_t, dst1_1, [NUM_LINES * MAX_LINE_SIZE]);
+    LOCAL_ALIGNED_8(uint8_t, dst1_2, [NUM_LINES * MAX_LINE_SIZE]);
     uint8_t *dst1[4] = { dst1_0, dst1_1, dst1_2 };
-    uint8_t *lines1[4][2] = {
-        { dst1_0, dst1_0 + MAX_LINE_SIZE * 6 },
-        { dst1_1, dst1_1 + MAX_LINE_SIZE },
-        { dst1_2, dst1_2 + MAX_LINE_SIZE }
-    };
 
-    randomize_buffers(src_y, MAX_LINE_SIZE * 2);
-    randomize_buffers(src_u, MAX_LINE_SIZE);
-    randomize_buffers(src_v, MAX_LINE_SIZE);
-    randomize_buffers(src_a, MAX_LINE_SIZE * 2);
+    randomize_buffers(src_y, (MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES);
+    randomize_buffers(src_u, (MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES);
+    randomize_buffers(src_v, (MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES);
+    randomize_buffers(src_a, (MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES);
 
     for (int dfi = 0; dfi < FF_ARRAY_ELEMS(dst_fmts); dfi++) {
         int dst_pix_fmt = dst_fmts[dfi];
@@ -152,12 +144,12 @@ static void check_yuv2rgb(int src_pix_fmt)
             int log_level;
             int width = input_sizes[isi];
             int srcSliceY = 0;
-            int srcSliceH = 2;
+            int srcSliceH = NUM_LINES;
             int srcStride[4] = {
-                width,
-                width >> src_desc->log2_chroma_w,
-                width >> src_desc->log2_chroma_w,
-                width,
+                width + SRC_STRIDE_PAD,
+                (width >> src_desc->log2_chroma_w) + SRC_STRIDE_PAD,
+                (width >> src_desc->log2_chroma_w) + SRC_STRIDE_PAD,
+                width + SRC_STRIDE_PAD,
             };
             int dstStride[4] = {
                 MAX_LINE_SIZE * 6,
@@ -178,13 +170,13 @@ static void check_yuv2rgb(int src_pix_fmt)
 
             c = sws_internal(sws);
             if (check_func(c->convert_unscaled, "%s_%s_%d", src_desc->name, dst_desc->name, width)) {
-                memset(dst0_0, 0xFF, 2 * MAX_LINE_SIZE * 6);
-                memset(dst1_0, 0xFF, 2 * MAX_LINE_SIZE * 6);
+                memset(dst0_0, 0xFF, NUM_LINES * MAX_LINE_SIZE * 6);
+                memset(dst1_0, 0xFF, NUM_LINES * MAX_LINE_SIZE * 6);
                 if (dst_pix_fmt == AV_PIX_FMT_GBRP) {
-                    memset(dst0_1, 0xFF, MAX_LINE_SIZE);
-                    memset(dst0_2, 0xFF, MAX_LINE_SIZE);
-                    memset(dst1_1, 0xFF, MAX_LINE_SIZE);
-                    memset(dst1_2, 0xFF, MAX_LINE_SIZE);
+                    memset(dst0_1, 0xFF, NUM_LINES * MAX_LINE_SIZE);
+                    memset(dst0_2, 0xFF, NUM_LINES * MAX_LINE_SIZE);
+                    memset(dst1_1, 0xFF, NUM_LINES * MAX_LINE_SIZE);
+                    memset(dst1_2, 0xFF, NUM_LINES * MAX_LINE_SIZE);
                 }
 
                 call_ref(c, src, srcStride, srcSliceY,
@@ -198,23 +190,31 @@ static void check_yuv2rgb(int src_pix_fmt)
                     dst_pix_fmt == AV_PIX_FMT_BGRA  ||
                     dst_pix_fmt == AV_PIX_FMT_RGB24 ||
                     dst_pix_fmt == AV_PIX_FMT_BGR24) {
-                    if (cmp_off_by_n(lines0[0][0], lines1[0][0], width * sample_size, 3) ||
-                        cmp_off_by_n(lines0[0][1], lines1[0][1], width * sample_size, 3))
-                        fail();
+                    for (int row = 0; row < srcSliceH; row++)
+                        if (cmp_off_by_n(dst0_0 + row * dstStride[0],
+                                         dst1_0 + row * dstStride[0],
+                                         width * sample_size, 3))
+                            fail();
                 } else if (dst_pix_fmt == AV_PIX_FMT_RGB565 ||
                            dst_pix_fmt == AV_PIX_FMT_BGR565) {
-                    if (cmp_565_by_n(lines0[0][0], lines1[0][0], width, 2) ||
-                        cmp_565_by_n(lines0[0][1], lines1[0][1], width, 2))
-                        fail();
+                    for (int row = 0; row < srcSliceH; row++)
+                        if (cmp_565_by_n(dst0_0 + row * dstStride[0],
+                                         dst1_0 + row * dstStride[0],
+                                         width, 2))
+                            fail();
                 } else if (dst_pix_fmt == AV_PIX_FMT_RGB555 ||
                            dst_pix_fmt == AV_PIX_FMT_BGR555) {
-                    if (cmp_555_by_n(lines0[0][0], lines1[0][0], width, 2) ||
-                        cmp_555_by_n(lines0[0][1], lines1[0][1], width, 2))
-                        fail();
+                    for (int row = 0; row < srcSliceH; row++)
+                        if (cmp_555_by_n(dst0_0 + row * dstStride[0],
+                                         dst1_0 + row * dstStride[0],
+                                         width, 2))
+                            fail();
                 } else if (dst_pix_fmt == AV_PIX_FMT_GBRP) {
                     for (int p = 0; p < 3; p++)
-                        for (int l = 0; l < 2; l++)
-                            if (cmp_off_by_n(lines0[p][l], lines1[p][l], width, 3))
+                        for (int row = 0; row < srcSliceH; row++)
+                            if (cmp_off_by_n(dst0[p] + row * dstStride[p],
+                                             dst1[p] + row * dstStride[p],
+                                             width, 3))
                                 fail();
                 } else {
                     fail();
@@ -228,6 +228,8 @@ static void check_yuv2rgb(int src_pix_fmt)
     }
 }
 
+#undef NUM_LINES
+#undef SRC_STRIDE_PAD
 #undef MAX_LINE_SIZE
 
 void checkasm_check_sw_yuv2rgb(void)
-- 
2.52.0




* [FFmpeg-devel] [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix
  2026-02-06  6:45 [FFmpeg-devel] [PATCH 1/2] swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion David Christle via ffmpeg-devel
  2026-02-06  6:46 ` [FFmpeg-devel] [PATCH 2/2] tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded strides David Christle via ffmpeg-devel
@ 2026-02-08 21:15 ` David Christle via ffmpeg-devel
  2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 1/3] swscale/loongarch: fix LASX YUV2RGB residual for multi-row slices David Christle via ffmpeg-devel
                     ` (3 more replies)
  1 sibling, 4 replies; 7+ messages in thread
From: David Christle via ffmpeg-devel @ 2026-02-08 21:15 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: David Christle

The v1 versions of patches 2 and 3 (NEON YUV2RGB + checkasm expansion)
exposed a pre-existing bug in the LoongArch LASX YUV2RGB path: the res
variable (residual pixel count for widths not divisible by 16) is
destructively modified by DEALYUV2RGBLINERES/DEALYUV2RGBLINERES32 inside the
row loop, producing wrong output when srcSliceH > 2. v2 prepends a fix.

Changes since v1:
- NEW patch 1/3: fix LoongArch LASX res variable across row iterations
- Patches 2/3 and 3/3 unchanged from v1

David Christle (3):
  swscale/loongarch: fix LASX YUV2RGB residual for multi-row slices
  swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion
  tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded
    strides

 libswscale/aarch64/swscale_unscaled.c |  90 ++++++++++++++++++
 libswscale/aarch64/yuv2rgb_neon.S     | 130 +++++++++++++++++++++++---
 libswscale/loongarch/yuv2rgb_lasx.c   |   2 +
 libswscale/swscale_internal.h         |   1 +
 libswscale/yuv2rgb.c                  |   2 +
 tests/checkasm/sw_yuv2rgb.c           |  96 +++++++++----------
 6 files changed, 259 insertions(+), 62 deletions(-)

-- 
2.52.0



* [FFmpeg-devel] [PATCH v2 1/3] swscale/loongarch: fix LASX YUV2RGB residual for multi-row slices
  2026-02-08 21:15 ` [FFmpeg-devel] [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix David Christle via ffmpeg-devel
@ 2026-02-08 21:16   ` David Christle via ffmpeg-devel
  2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 2/3] swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion David Christle via ffmpeg-devel
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 7+ messages in thread
From: David Christle via ffmpeg-devel @ 2026-02-08 21:16 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: David Christle

The res variable (pixel residual count for widths not divisible by 16)
is computed once before the row loop, but DEALYUV2RGBLINERES and
DEALYUV2RGBLINERES32 destructively subtract 8 from it inside the loop
body. When srcSliceH > 2, subsequent row pairs get an incorrect
residual count, producing wrong output for the tail pixels.

Fix by recomputing res from the constant c->opts.dst_w at the top of
each row-pair iteration.

Signed-off-by: David Christle <dev@christle.is>
---
 libswscale/loongarch/yuv2rgb_lasx.c | 2 ++
 1 file changed, 2 insertions(+)
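The failure mode can be sketched in plain C (a hedged model only; the
helper names below are illustrative, not the real LASX macros):

```c
#include <assert.h>

/* Models the tail handling: processes 8 pixels if at least 8 remain,
 * destructively subtracting 8 from *res as DEALYUV2RGBLINERES does,
 * then handles the final < 8 pixels. Returns pixels written. */
static int tail_pixels(int *res)
{
    int written = 0;
    if (*res >= 8) {
        written += 8;
        *res -= 8;          /* destructive update carried into the next row pair */
    }
    written += *res;        /* remaining < 8 tail pixels */
    return written;
}

/* Buggy shape: res computed once before the row loop. */
static int last_row_tail_buggy(int dst_w, int row_pairs)
{
    int res = dst_w & 15, written = 0;
    for (int i = 0; i < row_pairs; i++)
        written = tail_pixels(&res);
    return written;
}

/* Fixed shape: res recomputed at the top of each row-pair iteration. */
static int last_row_tail_fixed(int dst_w, int row_pairs)
{
    int written = 0;
    for (int i = 0; i < row_pairs; i++) {
        int res = dst_w & 15;
        written = tail_pixels(&res);
    }
    return written;
}
```

With dst_w = 29 (res = 13), the first row pair writes all 13 tail
pixels in both versions, but the buggy version enters the second row
pair with res already reduced to 5 and drops 8 tail pixels per row.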

diff --git a/libswscale/loongarch/yuv2rgb_lasx.c b/libswscale/loongarch/yuv2rgb_lasx.c
index 9032887ff8..d08cf10d4b 100644
--- a/libswscale/loongarch/yuv2rgb_lasx.c
+++ b/libswscale/loongarch/yuv2rgb_lasx.c
@@ -185,6 +185,7 @@
         const uint8_t *py_2 = py_1   +                   srcStride[0];              \
         const uint8_t *pu   = src[1] +   (y >> vshift) * srcStride[1];              \
         const uint8_t *pv   = src[2] +   (y >> vshift) * srcStride[2];              \
+        res = c->opts.dst_w & 15;                                                   \
         for(x = 0; x < h_size; x++) {                                               \
 
 #define YUV2RGBFUNC32(func_name, dst_type, alpha)                                   \
@@ -213,6 +214,7 @@
         const uint8_t *py_2 = py_1   +                   srcStride[0];              \
         const uint8_t *pu   = src[1] +   (y >> vshift) * srcStride[1];              \
         const uint8_t *pv   = src[2] +   (y >> vshift) * srcStride[2];              \
+        res = c->opts.dst_w & 15;                                                   \
         for(x = 0; x < h_size; x++) {                                               \
 
 #define DEALYUV2RGBLINE                                                             \
-- 
2.52.0



* [FFmpeg-devel] [PATCH v2 2/3] swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion
  2026-02-08 21:15 ` [FFmpeg-devel] [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix David Christle via ffmpeg-devel
  2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 1/3] swscale/loongarch: fix LASX YUV2RGB residual for multi-row slices David Christle via ffmpeg-devel
@ 2026-02-08 21:16   ` David Christle via ffmpeg-devel
  2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 3/3] tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded strides David Christle via ffmpeg-devel
  2026-02-08 21:35   ` [FFmpeg-devel] Re: [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix Martin Storsjö via ffmpeg-devel
  3 siblings, 0 replies; 7+ messages in thread
From: David Christle via ffmpeg-devel @ 2026-02-08 21:16 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: David Christle

Add ARM64 NEON-accelerated unscaled YUV-to-RGB conversion for planar
YUV input formats. This extends the existing NV12/NV21 NEON paths with
YUV420P, YUV422P, and YUVA420P support for all packed RGB output
formats (ARGB, RGBA, ABGR, BGRA, RGB24, BGR24) and planar GBRP.

Register with ff_yuv2rgb_init_aarch64() to also cover the scaled path.

checkasm: all 42 sw_yuv2rgb tests pass.
Speedup vs C at 1920px width (Apple M3 Max, avg of 20 runs):
  yuv420p->rgb24:   4.3x    yuv420p->argb:   3.1x
  yuv422p->rgb24:   5.5x    yuv422p->argb:   4.1x
  yuva420p->argb:   3.5x    yuva420p->rgba:  3.5x

Signed-off-by: David Christle <dev@christle.is>
---
 libswscale/aarch64/swscale_unscaled.c |  90 ++++++++++++++++++
 libswscale/aarch64/yuv2rgb_neon.S     | 130 +++++++++++++++++++++++---
 libswscale/swscale_internal.h         |   1 +
 libswscale/yuv2rgb.c                  |   2 +
 4 files changed, 208 insertions(+), 15 deletions(-)
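One note on the rgb24/bgr24 prologue arithmetic in the assembly below:
`add w17, w0, w0, lsl #1` forms width * 3, so w3 ends up holding
linesize - width * 3, the per-row padding for 3-byte formats. A C model
of that computation (illustrative only):

```c
#include <assert.h>

/* w0 + (w0 << 1) == 3 * w0, as "add w17, w0, w0, lsl #1" computes. */
static int width_times_3(int w)
{
    return w + (w << 1);
}

/* Per-row padding left in w3 for 3-bytes-per-pixel output formats. */
static int row_padding_rgb24(int linesize, int width)
{
    return linesize - width_times_3(width);
}
```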

diff --git a/libswscale/aarch64/swscale_unscaled.c b/libswscale/aarch64/swscale_unscaled.c
index fdecafd94b..ba24775210 100644
--- a/libswscale/aarch64/swscale_unscaled.c
+++ b/libswscale/aarch64/swscale_unscaled.c
@@ -89,10 +89,45 @@ DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, rgba)
 DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, abgr)                                                   \
 DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, bgra)                                                   \
 DECLARE_FF_YUVX_TO_GBRP_FUNCS(yuvx, gbrp)                                                   \
+DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, rgb24)                                                  \
+DECLARE_FF_YUVX_TO_RGBX_FUNCS(yuvx, bgr24)                                                  \
 
 DECLARE_FF_YUVX_TO_ALL_RGBX_FUNCS(yuv420p)
 DECLARE_FF_YUVX_TO_ALL_RGBX_FUNCS(yuv422p)
 
+#define DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(ofmt)                                            \
+int ff_yuva420p_to_##ofmt##_neon(int w, int h,                                              \
+                                 uint8_t *dst, int linesize,                                \
+                                 const uint8_t *srcY, int linesizeY,                        \
+                                 const uint8_t *srcU, int linesizeU,                        \
+                                 const uint8_t *srcV, int linesizeV,                        \
+                                 const int16_t *table,                                      \
+                                 int y_offset, int y_coeff,                                 \
+                                 const uint8_t *srcA, int linesizeA);                       \
+                                                                                            \
+static int yuva420p_to_##ofmt##_neon_wrapper(SwsInternal *c,                                \
+                                             const uint8_t *const src[],                    \
+                                             const int srcStride[], int srcSliceY,          \
+                                             int srcSliceH, uint8_t *const dst[],           \
+                                             const int dstStride[]) {                       \
+    const int16_t yuv2rgb_table[] = { YUV_TO_RGB_TABLE };                                   \
+                                                                                            \
+    return ff_yuva420p_to_##ofmt##_neon(c->opts.src_w, srcSliceH,                           \
+                                        dst[0] + srcSliceY * dstStride[0], dstStride[0],    \
+                                        src[0], srcStride[0],                               \
+                                        src[1], srcStride[1],                               \
+                                        src[2], srcStride[2],                               \
+                                        yuv2rgb_table,                                      \
+                                        c->yuv2rgb_y_offset >> 6,                           \
+                                        c->yuv2rgb_y_coeff,                                 \
+                                        src[3], srcStride[3]);                              \
+}
+
+DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(argb)
+DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(rgba)
+DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(abgr)
+DECLARE_FF_YUVA420P_TO_RGBX_FUNCS(bgra)
+
 #define DECLARE_FF_NVX_TO_RGBX_FUNCS(ifmt, ofmt)                                            \
 int ff_##ifmt##_to_##ofmt##_neon(int w, int h,                                              \
                                  uint8_t *dst, int linesize,                                \
@@ -176,6 +211,8 @@ DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, rgba)
 DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, abgr)                                                     \
 DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, bgra)                                                     \
 DECLARE_FF_NVX_TO_GBRP_FUNCS(nvx, gbrp)                                                     \
+DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, rgb24)                                                    \
+DECLARE_FF_NVX_TO_RGBX_FUNCS(nvx, bgr24)                                                    \
 
 DECLARE_FF_NVX_TO_ALL_RGBX_FUNCS(nv12)
 DECLARE_FF_NVX_TO_ALL_RGBX_FUNCS(nv21)
@@ -199,6 +236,8 @@ DECLARE_FF_NVX_TO_ALL_RGBX_FUNCS(nv21)
     SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, abgr, ABGR, accurate_rnd);                            \
     SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, bgra, BGRA, accurate_rnd);                            \
     SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, gbrp, GBRP, accurate_rnd);                            \
+    SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, rgb24, RGB24, accurate_rnd);                          \
+    SET_FF_NVX_TO_RGBX_FUNC(nvx, NVX, bgr24, BGR24, accurate_rnd);                          \
 } while (0)
 
 static void get_unscaled_swscale_neon(SwsInternal *c) {
@@ -208,6 +247,13 @@ static void get_unscaled_swscale_neon(SwsInternal *c) {
     SET_FF_NVX_TO_ALL_RGBX_FUNC(nv21, NV21, accurate_rnd);
     SET_FF_NVX_TO_ALL_RGBX_FUNC(yuv420p, YUV420P, accurate_rnd);
     SET_FF_NVX_TO_ALL_RGBX_FUNC(yuv422p, YUV422P, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuva420p, YUVA420P, argb, ARGB, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuva420p, YUVA420P, rgba, RGBA, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuva420p, YUVA420P, abgr, ABGR, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuva420p, YUVA420P, bgra, BGRA, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuv420p, YUVA420P, rgb24, RGB24, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuv420p, YUVA420P, bgr24, BGR24, accurate_rnd);
+    SET_FF_NVX_TO_RGBX_FUNC(yuv420p, YUVA420P, gbrp,  GBRP,  accurate_rnd);
 
     if (c->opts.dst_format == AV_PIX_FMT_YUV420P &&
         (c->opts.src_format == AV_PIX_FMT_NV24 || c->opts.src_format == AV_PIX_FMT_NV42) &&
@@ -221,3 +267,47 @@ void ff_get_unscaled_swscale_aarch64(SwsInternal *c)
     if (have_neon(cpu_flags))
         get_unscaled_swscale_neon(c);
 }
+
+av_cold SwsFunc ff_yuv2rgb_init_aarch64(SwsInternal *c)
+{
+    int cpu_flags = av_get_cpu_flags();
+    if (!have_neon(cpu_flags) ||
+        (c->opts.src_h & 1) || (c->opts.src_w & 15) ||
+        (c->opts.flags & SWS_ACCURATE_RND))
+        return NULL;
+
+    if (c->opts.src_format == AV_PIX_FMT_YUV420P) {
+        switch (c->opts.dst_format) {
+        case AV_PIX_FMT_ARGB:  return yuv420p_to_argb_neon_wrapper;
+        case AV_PIX_FMT_RGBA:  return yuv420p_to_rgba_neon_wrapper;
+        case AV_PIX_FMT_ABGR:  return yuv420p_to_abgr_neon_wrapper;
+        case AV_PIX_FMT_BGRA:  return yuv420p_to_bgra_neon_wrapper;
+        case AV_PIX_FMT_RGB24: return yuv420p_to_rgb24_neon_wrapper;
+        case AV_PIX_FMT_BGR24: return yuv420p_to_bgr24_neon_wrapper;
+        case AV_PIX_FMT_GBRP:  return yuv420p_to_gbrp_neon_wrapper;
+        }
+    } else if (c->opts.src_format == AV_PIX_FMT_YUVA420P) {
+        switch (c->opts.dst_format) {
+#if CONFIG_SWSCALE_ALPHA
+        case AV_PIX_FMT_ARGB:  return yuva420p_to_argb_neon_wrapper;
+        case AV_PIX_FMT_RGBA:  return yuva420p_to_rgba_neon_wrapper;
+        case AV_PIX_FMT_ABGR:  return yuva420p_to_abgr_neon_wrapper;
+        case AV_PIX_FMT_BGRA:  return yuva420p_to_bgra_neon_wrapper;
+#endif
+        case AV_PIX_FMT_RGB24: return yuv420p_to_rgb24_neon_wrapper;
+        case AV_PIX_FMT_BGR24: return yuv420p_to_bgr24_neon_wrapper;
+        case AV_PIX_FMT_GBRP:  return yuv420p_to_gbrp_neon_wrapper;
+        }
+    } else if (c->opts.src_format == AV_PIX_FMT_YUV422P) {
+        switch (c->opts.dst_format) {
+        case AV_PIX_FMT_ARGB:  return yuv422p_to_argb_neon_wrapper;
+        case AV_PIX_FMT_RGBA:  return yuv422p_to_rgba_neon_wrapper;
+        case AV_PIX_FMT_ABGR:  return yuv422p_to_abgr_neon_wrapper;
+        case AV_PIX_FMT_BGRA:  return yuv422p_to_bgra_neon_wrapper;
+        case AV_PIX_FMT_RGB24: return yuv422p_to_rgb24_neon_wrapper;
+        case AV_PIX_FMT_BGR24: return yuv422p_to_bgr24_neon_wrapper;
+        case AV_PIX_FMT_GBRP:  return yuv422p_to_gbrp_neon_wrapper;
+        }
+    }
+    return NULL;
+}
diff --git a/libswscale/aarch64/yuv2rgb_neon.S b/libswscale/aarch64/yuv2rgb_neon.S
index 0797a6d5e0..19f750545f 100644
--- a/libswscale/aarch64/yuv2rgb_neon.S
+++ b/libswscale/aarch64/yuv2rgb_neon.S
@@ -55,7 +55,17 @@
         load_dst1_dst2  24, 32, 40, 48
         sub             w3, w3, w0                                      // w3 = linesize  - width     (padding)
 .else
+ .ifc \ofmt,rgb24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+ .else
+  .ifc \ofmt,bgr24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+  .else
         sub             w3, w3, w0, lsl #2                              // w3 = linesize  - width * 4 (padding)
+  .endif
+ .endif
 .endif
         sub             w5, w5, w0                                      // w5 = linesizeY - width     (paddingY)
         sub             w7, w7, w0                                      // w7 = linesizeC - width     (paddingC)
@@ -78,7 +88,17 @@
         load_dst1_dst2  40, 48, 56, 64
         sub             w3, w3, w0                                      // w3 = linesize  - width     (padding)
 .else
+ .ifc \ofmt,rgb24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+ .else
+  .ifc \ofmt,bgr24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+  .else
         sub             w3, w3, w0, lsl #2                              // w3 = linesize  - width * 4 (padding)
+  .endif
+ .endif
 .endif
         sub             w5, w5, w0                                      // w5 = linesizeY - width     (paddingY)
         sub             w7,  w7,  w0, lsr #1                            // w7  = linesizeU - width / 2 (paddingU)
@@ -87,6 +107,18 @@
         neg             w11, w11
 .endm
 
+.macro load_args_yuva420p ofmt
+        load_args_yuv420p \ofmt
+#if defined(__APPLE__)
+        ldr             x15, [sp, #32]                                  // srcA
+        ldr             w16, [sp, #40]                                  // linesizeA
+#else
+        ldr             x15, [sp, #40]                                  // srcA
+        ldr             w16, [sp, #48]                                  // linesizeA
+#endif
+        sub             w16, w16, w0                                    // w16 = linesizeA - width    (paddingA)
+.endm
+
 .macro load_args_yuv422p ofmt
         ldr             x13, [sp]                                       // srcV
         ldr             w14, [sp, #8]                                   // linesizeV
@@ -99,7 +131,17 @@
         load_dst1_dst2  40, 48, 56, 64
         sub             w3, w3, w0                                      // w3 = linesize  - width     (padding)
 .else
+ .ifc \ofmt,rgb24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+ .else
+  .ifc \ofmt,bgr24
+        add             w17, w0, w0, lsl #1
+        sub             w3, w3, w17                                     // w3 = linesize  - width * 3 (padding)
+  .else
         sub             w3, w3, w0, lsl #2                              // w3 = linesize  - width * 4 (padding)
+  .endif
+ .endif
 .endif
         sub             w5, w5, w0                                      // w5 = linesizeY - width     (paddingY)
         sub             w7,  w7,  w0, lsr #1                            // w7  = linesizeU - width / 2 (paddingU)
@@ -125,6 +167,10 @@
         ushll           v19.8h, v17.8b, #3
 .endm
 
+.macro load_chroma_yuva420p
+        load_chroma_yuv420p
+.endm
+
 .macro load_chroma_yuv422p
         load_chroma_yuv420p
 .endm
@@ -147,6 +193,11 @@
         add             x13, x13, w17, sxtw                             // srcV += incV
 .endm
 
+.macro increment_yuva420p
+        increment_yuv420p
+        add             x15, x15, w16, sxtw                             // srcA += paddingA (every row)
+.endm
+
 .macro increment_yuv422p
         add             x6,  x6,  w7, sxtw                              // srcU += incU
         add             x13, x13, w14, sxtw                             // srcV += incV
@@ -169,65 +220,103 @@
 
 .macro compute_rgba r1 g1 b1 a1 r2 g2 b2 a2
         compute_rgb     \r1, \g1, \b1, \r2, \g2, \b2
-        movi            \a1, #255
-        movi            \a2, #255
+        mov             \a1, v30.8b
+        mov             \a2, v30.8b
+.endm
+
+.macro compute_rgba_alpha r1 g1 b1 a1 r2 g2 b2 a2
+        compute_rgb     \r1, \g1, \b1, \r2, \g2, \b2
+        mov             \a1, v28.8b                                     // real alpha (first 8 pixels)
+        mov             \a2, v29.8b                                     // real alpha (next 8 pixels)
 .endm
 
 .macro declare_func ifmt ofmt
 function ff_\ifmt\()_to_\ofmt\()_neon, export=1
         load_args_\ifmt \ofmt
 
+        movi            v31.8h, #4, lsl #8                              // 128 * (1<<3) (loop-invariant)
+        movi            v30.8b, #255                                    // alpha = 255  (loop-invariant)
         mov             w9, w1
 1:
         mov             w8, w0                                          // w8 = width
 2:
-        movi            v5.8h, #4, lsl #8                               // 128 * (1<<3)
         load_chroma_\ifmt
-        sub             v18.8h, v18.8h, v5.8h                           // U*(1<<3) - 128*(1<<3)
-        sub             v19.8h, v19.8h, v5.8h                           // V*(1<<3) - 128*(1<<3)
+        sub             v18.8h, v18.8h, v31.8h                          // U*(1<<3) - 128*(1<<3)
+        sub             v19.8h, v19.8h, v31.8h                          // V*(1<<3) - 128*(1<<3)
         sqdmulh         v20.8h, v19.8h, v1.h[0]                         // V * v2r            (R)
         sqdmulh         v22.8h, v18.8h, v1.h[1]                         // U * u2g
+        ld1             {v2.16b}, [x4], #16                             // load luma (interleaved)
+.ifc \ifmt,yuva420p
+        ld1             {v28.8b, v29.8b}, [x15], #16                    // load 16 alpha bytes
+.endif
         sqdmulh         v19.8h, v19.8h, v1.h[2]                         //           V * v2g
-        add             v22.8h, v22.8h, v19.8h                          // U * u2g + V * v2g  (G)
         sqdmulh         v24.8h, v18.8h, v1.h[3]                         // U * u2b            (B)
-        zip2            v21.8h, v20.8h, v20.8h                          // R2
-        zip1            v20.8h, v20.8h, v20.8h                          // R1
-        zip2            v23.8h, v22.8h, v22.8h                          // G2
-        zip1            v22.8h, v22.8h, v22.8h                          // G1
-        zip2            v25.8h, v24.8h, v24.8h                          // B2
-        zip1            v24.8h, v24.8h, v24.8h                          // B1
-        ld1             {v2.16b}, [x4], #16                             // load luma
         ushll           v26.8h, v2.8b,  #3                              // Y1*(1<<3)
         ushll2          v27.8h, v2.16b, #3                              // Y2*(1<<3)
+        add             v22.8h, v22.8h, v19.8h                          // U * u2g + V * v2g  (G)
         sub             v26.8h, v26.8h, v3.8h                           // Y1*(1<<3) - y_offset
         sub             v27.8h, v27.8h, v3.8h                           // Y2*(1<<3) - y_offset
+        zip2            v21.8h, v20.8h, v20.8h                          // R2
+        zip1            v20.8h, v20.8h, v20.8h                          // R1
         sqdmulh         v26.8h, v26.8h, v0.8h                           // ((Y1*(1<<3) - y_offset) * y_coeff) >> 15
         sqdmulh         v27.8h, v27.8h, v0.8h                           // ((Y2*(1<<3) - y_offset) * y_coeff) >> 15
+        zip2            v23.8h, v22.8h, v22.8h                          // G2
+        zip1            v22.8h, v22.8h, v22.8h                          // G1
+        zip2            v25.8h, v24.8h, v24.8h                          // B2
+        zip1            v24.8h, v24.8h, v24.8h                          // B1
 
 .ifc \ofmt,argb // 1 2 3 0
+ .ifc \ifmt,yuva420p
+        compute_rgba_alpha v5.8b,v6.8b,v7.8b,v4.8b, v17.8b,v18.8b,v19.8b,v16.8b
+ .else
         compute_rgba    v5.8b,v6.8b,v7.8b,v4.8b, v17.8b,v18.8b,v19.8b,v16.8b
+ .endif
 .endif
 
 .ifc \ofmt,rgba // 0 1 2 3
+ .ifc \ifmt,yuva420p
+        compute_rgba_alpha v4.8b,v5.8b,v6.8b,v7.8b, v16.8b,v17.8b,v18.8b,v19.8b
+ .else
         compute_rgba    v4.8b,v5.8b,v6.8b,v7.8b, v16.8b,v17.8b,v18.8b,v19.8b
+ .endif
 .endif
 
 .ifc \ofmt,abgr // 3 2 1 0
+ .ifc \ifmt,yuva420p
+        compute_rgba_alpha v7.8b,v6.8b,v5.8b,v4.8b, v19.8b,v18.8b,v17.8b,v16.8b
+ .else
         compute_rgba    v7.8b,v6.8b,v5.8b,v4.8b, v19.8b,v18.8b,v17.8b,v16.8b
+ .endif
 .endif
 
 .ifc \ofmt,bgra // 2 1 0 3
+ .ifc \ifmt,yuva420p
+        compute_rgba_alpha v6.8b,v5.8b,v4.8b,v7.8b, v18.8b,v17.8b,v16.8b,v19.8b
+ .else
         compute_rgba    v6.8b,v5.8b,v4.8b,v7.8b, v18.8b,v17.8b,v16.8b,v19.8b
+ .endif
 .endif
 
-.ifc \ofmt,gbrp
+.ifc \ofmt,rgb24
+        compute_rgb     v4.8b,v5.8b,v6.8b, v16.8b,v17.8b,v18.8b
+        st3             { v4.8b, v5.8b, v6.8b}, [x2], #24
+        st3             {v16.8b,v17.8b,v18.8b}, [x2], #24
+.else
+ .ifc \ofmt,bgr24
+        compute_rgb     v6.8b,v5.8b,v4.8b, v18.8b,v17.8b,v16.8b
+        st3             { v4.8b, v5.8b, v6.8b}, [x2], #24
+        st3             {v16.8b,v17.8b,v18.8b}, [x2], #24
+ .else
+  .ifc \ofmt,gbrp
         compute_rgb     v18.8b,v4.8b,v6.8b, v19.8b,v5.8b,v7.8b
         st1             {  v4.8b,  v5.8b }, [x2],  #16
         st1             {  v6.8b,  v7.8b }, [x10], #16
         st1             { v18.8b, v19.8b }, [x15], #16
-.else
+  .else
         st4             { v4.8b, v5.8b, v6.8b, v7.8b}, [x2], #32
         st4             {v16.8b,v17.8b,v18.8b,v19.8b}, [x2], #32
+  .endif
+ .endif
 .endif
         subs            w8, w8, #16                                     // width -= 16
         b.gt            2b
@@ -251,9 +340,20 @@ endfunc
         declare_func    \ifmt, abgr
         declare_func    \ifmt, bgra
         declare_func    \ifmt, gbrp
+        declare_func    \ifmt, rgb24
+        declare_func    \ifmt, bgr24
 .endm
 
 declare_rgb_funcs nv12
 declare_rgb_funcs nv21
 declare_rgb_funcs yuv420p
 declare_rgb_funcs yuv422p
+
+.macro declare_yuva_funcs ifmt
+        declare_func    \ifmt, argb
+        declare_func    \ifmt, rgba
+        declare_func    \ifmt, abgr
+        declare_func    \ifmt, bgra
+.endm
+
+declare_yuva_funcs yuva420p
diff --git a/libswscale/swscale_internal.h b/libswscale/swscale_internal.h
index 5c58272664..c671f1c7cd 100644
--- a/libswscale/swscale_internal.h
+++ b/libswscale/swscale_internal.h
@@ -739,6 +739,7 @@ av_cold int ff_sws_fill_xyztables(SwsInternal *c);
 SwsFunc ff_yuv2rgb_init_x86(SwsInternal *c);
 SwsFunc ff_yuv2rgb_init_ppc(SwsInternal *c);
 SwsFunc ff_yuv2rgb_init_loongarch(SwsInternal *c);
+SwsFunc ff_yuv2rgb_init_aarch64(SwsInternal *c);
 
 static av_always_inline int is16BPS(enum AVPixelFormat pix_fmt)
 {
diff --git a/libswscale/yuv2rgb.c b/libswscale/yuv2rgb.c
index 48089760f5..c62201856d 100644
--- a/libswscale/yuv2rgb.c
+++ b/libswscale/yuv2rgb.c
@@ -568,6 +568,8 @@ SwsFunc ff_yuv2rgb_get_func_ptr(SwsInternal *c)
     t = ff_yuv2rgb_init_x86(c);
 #elif ARCH_LOONGARCH64
     t = ff_yuv2rgb_init_loongarch(c);
+#elif ARCH_AARCH64
+    t = ff_yuv2rgb_init_aarch64(c);
 #endif
 
     if (t)
-- 
2.52.0



* [FFmpeg-devel] [PATCH v2 3/3] tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded strides
  2026-02-08 21:15 ` [FFmpeg-devel] [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix David Christle via ffmpeg-devel
  2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 1/3] swscale/loongarch: fix LASX YUV2RGB residual for multi-row slices David Christle via ffmpeg-devel
  2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 2/3] swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion David Christle via ffmpeg-devel
@ 2026-02-08 21:16   ` David Christle via ffmpeg-devel
  2026-02-08 21:35   ` [FFmpeg-devel] Re: [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix Martin Storsjö via ffmpeg-devel
  3 siblings, 0 replies; 7+ messages in thread
From: David Christle via ffmpeg-devel @ 2026-02-08 21:16 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: David Christle

Increase test height from 2 to 4 rows and add 32 bytes of source stride
padding. This exercises chroma row sharing across multiple luma row pairs
in YUV420P and the stride increment arithmetic in SIMD implementations,
both of which were previously untested.
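
The two properties being exercised can be stated as small helpers
(a sketch; the names are illustrative, not checkasm API):

```c
#include <assert.h>
#include <stddef.h>

/* With a padded stride (stride = width + SRC_STRIDE_PAD), row `row`
 * of a plane starts at byte offset row * stride, not row * width;
 * advancing by width would read into the padding. */
static size_t row_offset(int stride, int row)
{
    return (size_t)row * (size_t)stride;
}

/* YUV420P: one chroma row serves each pair of luma rows, so luma
 * rows 0-1 share chroma row 0 and rows 2-3 share chroma row 1 --
 * only visible once the test height exceeds 2. */
static int chroma_row_for_luma(int luma_row)
{
    return luma_row >> 1;
}
```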

Signed-off-by: David Christle <dev@christle.is>
---
 tests/checkasm/sw_yuv2rgb.c | 96 +++++++++++++++++++------------------
 1 file changed, 49 insertions(+), 47 deletions(-)

diff --git a/tests/checkasm/sw_yuv2rgb.c b/tests/checkasm/sw_yuv2rgb.c
index c6c1ad934b..ba579d0545 100644
--- a/tests/checkasm/sw_yuv2rgb.c
+++ b/tests/checkasm/sw_yuv2rgb.c
@@ -104,6 +104,8 @@ static void check_yuv2rgb(int src_pix_fmt)
 {
     const AVPixFmtDescriptor *src_desc = av_pix_fmt_desc_get(src_pix_fmt);
 #define MAX_LINE_SIZE 1920
+#define SRC_STRIDE_PAD 32
+#define NUM_LINES 4
     static const int input_sizes[] = {8, 128, 1080, MAX_LINE_SIZE};
 
     declare_func_emms(AV_CPU_FLAG_MMX | AV_CPU_FLAG_MMXEXT,
@@ -111,36 +113,26 @@ static void check_yuv2rgb(int src_pix_fmt)
                            const int srcStride[], int srcSliceY, int srcSliceH,
                            uint8_t *const dst[], const int dstStride[]);
 
-    LOCAL_ALIGNED_8(uint8_t, src_y, [MAX_LINE_SIZE * 2]);
-    LOCAL_ALIGNED_8(uint8_t, src_u, [MAX_LINE_SIZE]);
-    LOCAL_ALIGNED_8(uint8_t, src_v, [MAX_LINE_SIZE]);
-    LOCAL_ALIGNED_8(uint8_t, src_a, [MAX_LINE_SIZE * 2]);
+    LOCAL_ALIGNED_8(uint8_t, src_y, [(MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES]);
+    LOCAL_ALIGNED_8(uint8_t, src_u, [(MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES]);
+    LOCAL_ALIGNED_8(uint8_t, src_v, [(MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES]);
+    LOCAL_ALIGNED_8(uint8_t, src_a, [(MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES]);
     const uint8_t *src[4] = { src_y, src_u, src_v, src_a };
 
-    LOCAL_ALIGNED_8(uint8_t, dst0_0, [2 * MAX_LINE_SIZE * 6]);
-    LOCAL_ALIGNED_8(uint8_t, dst0_1, [2 * MAX_LINE_SIZE]);
-    LOCAL_ALIGNED_8(uint8_t, dst0_2, [2 * MAX_LINE_SIZE]);
+    LOCAL_ALIGNED_8(uint8_t, dst0_0, [NUM_LINES * MAX_LINE_SIZE * 6]);
+    LOCAL_ALIGNED_8(uint8_t, dst0_1, [NUM_LINES * MAX_LINE_SIZE]);
+    LOCAL_ALIGNED_8(uint8_t, dst0_2, [NUM_LINES * MAX_LINE_SIZE]);
     uint8_t *dst0[4] = { dst0_0, dst0_1, dst0_2 };
-    uint8_t *lines0[4][2] = {
-        { dst0_0, dst0_0 + MAX_LINE_SIZE * 6 },
-        { dst0_1, dst0_1 + MAX_LINE_SIZE },
-        { dst0_2, dst0_2 + MAX_LINE_SIZE }
-    };
-
-    LOCAL_ALIGNED_8(uint8_t, dst1_0, [2 * MAX_LINE_SIZE * 6]);
-    LOCAL_ALIGNED_8(uint8_t, dst1_1, [2 * MAX_LINE_SIZE]);
-    LOCAL_ALIGNED_8(uint8_t, dst1_2, [2 * MAX_LINE_SIZE]);
+
+    LOCAL_ALIGNED_8(uint8_t, dst1_0, [NUM_LINES * MAX_LINE_SIZE * 6]);
+    LOCAL_ALIGNED_8(uint8_t, dst1_1, [NUM_LINES * MAX_LINE_SIZE]);
+    LOCAL_ALIGNED_8(uint8_t, dst1_2, [NUM_LINES * MAX_LINE_SIZE]);
     uint8_t *dst1[4] = { dst1_0, dst1_1, dst1_2 };
-    uint8_t *lines1[4][2] = {
-        { dst1_0, dst1_0 + MAX_LINE_SIZE * 6 },
-        { dst1_1, dst1_1 + MAX_LINE_SIZE },
-        { dst1_2, dst1_2 + MAX_LINE_SIZE }
-    };
 
-    randomize_buffers(src_y, MAX_LINE_SIZE * 2);
-    randomize_buffers(src_u, MAX_LINE_SIZE);
-    randomize_buffers(src_v, MAX_LINE_SIZE);
-    randomize_buffers(src_a, MAX_LINE_SIZE * 2);
+    randomize_buffers(src_y, (MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES);
+    randomize_buffers(src_u, (MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES);
+    randomize_buffers(src_v, (MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES);
+    randomize_buffers(src_a, (MAX_LINE_SIZE + SRC_STRIDE_PAD) * NUM_LINES);
 
     for (int dfi = 0; dfi < FF_ARRAY_ELEMS(dst_fmts); dfi++) {
         int dst_pix_fmt = dst_fmts[dfi];
@@ -152,12 +144,12 @@ static void check_yuv2rgb(int src_pix_fmt)
             int log_level;
             int width = input_sizes[isi];
             int srcSliceY = 0;
-            int srcSliceH = 2;
+            int srcSliceH = NUM_LINES;
             int srcStride[4] = {
-                width,
-                width >> src_desc->log2_chroma_w,
-                width >> src_desc->log2_chroma_w,
-                width,
+                width + SRC_STRIDE_PAD,
+                (width >> src_desc->log2_chroma_w) + SRC_STRIDE_PAD,
+                (width >> src_desc->log2_chroma_w) + SRC_STRIDE_PAD,
+                width + SRC_STRIDE_PAD,
             };
             int dstStride[4] = {
                 MAX_LINE_SIZE * 6,
@@ -178,13 +170,13 @@ static void check_yuv2rgb(int src_pix_fmt)
 
             c = sws_internal(sws);
             if (check_func(c->convert_unscaled, "%s_%s_%d", src_desc->name, dst_desc->name, width)) {
-                memset(dst0_0, 0xFF, 2 * MAX_LINE_SIZE * 6);
-                memset(dst1_0, 0xFF, 2 * MAX_LINE_SIZE * 6);
+                memset(dst0_0, 0xFF, NUM_LINES * MAX_LINE_SIZE * 6);
+                memset(dst1_0, 0xFF, NUM_LINES * MAX_LINE_SIZE * 6);
                 if (dst_pix_fmt == AV_PIX_FMT_GBRP) {
-                    memset(dst0_1, 0xFF, MAX_LINE_SIZE);
-                    memset(dst0_2, 0xFF, MAX_LINE_SIZE);
-                    memset(dst1_1, 0xFF, MAX_LINE_SIZE);
-                    memset(dst1_2, 0xFF, MAX_LINE_SIZE);
+                    memset(dst0_1, 0xFF, NUM_LINES * MAX_LINE_SIZE);
+                    memset(dst0_2, 0xFF, NUM_LINES * MAX_LINE_SIZE);
+                    memset(dst1_1, 0xFF, NUM_LINES * MAX_LINE_SIZE);
+                    memset(dst1_2, 0xFF, NUM_LINES * MAX_LINE_SIZE);
                 }
 
                 call_ref(c, src, srcStride, srcSliceY,
@@ -198,23 +190,31 @@ static void check_yuv2rgb(int src_pix_fmt)
                     dst_pix_fmt == AV_PIX_FMT_BGRA  ||
                     dst_pix_fmt == AV_PIX_FMT_RGB24 ||
                     dst_pix_fmt == AV_PIX_FMT_BGR24) {
-                    if (cmp_off_by_n(lines0[0][0], lines1[0][0], width * sample_size, 3) ||
-                        cmp_off_by_n(lines0[0][1], lines1[0][1], width * sample_size, 3))
-                        fail();
+                    for (int row = 0; row < srcSliceH; row++)
+                        if (cmp_off_by_n(dst0_0 + row * dstStride[0],
+                                         dst1_0 + row * dstStride[0],
+                                         width * sample_size, 3))
+                            fail();
                 } else if (dst_pix_fmt == AV_PIX_FMT_RGB565 ||
                            dst_pix_fmt == AV_PIX_FMT_BGR565) {
-                    if (cmp_565_by_n(lines0[0][0], lines1[0][0], width, 2) ||
-                        cmp_565_by_n(lines0[0][1], lines1[0][1], width, 2))
-                        fail();
+                    for (int row = 0; row < srcSliceH; row++)
+                        if (cmp_565_by_n(dst0_0 + row * dstStride[0],
+                                         dst1_0 + row * dstStride[0],
+                                         width, 2))
+                            fail();
                 } else if (dst_pix_fmt == AV_PIX_FMT_RGB555 ||
                            dst_pix_fmt == AV_PIX_FMT_BGR555) {
-                    if (cmp_555_by_n(lines0[0][0], lines1[0][0], width, 2) ||
-                        cmp_555_by_n(lines0[0][1], lines1[0][1], width, 2))
-                        fail();
+                    for (int row = 0; row < srcSliceH; row++)
+                        if (cmp_555_by_n(dst0_0 + row * dstStride[0],
+                                         dst1_0 + row * dstStride[0],
+                                         width, 2))
+                            fail();
                 } else if (dst_pix_fmt == AV_PIX_FMT_GBRP) {
                     for (int p = 0; p < 3; p++)
-                        for (int l = 0; l < 2; l++)
-                            if (cmp_off_by_n(lines0[p][l], lines1[p][l], width, 3))
+                        for (int row = 0; row < srcSliceH; row++)
+                            if (cmp_off_by_n(dst0[p] + row * dstStride[p],
+                                             dst1[p] + row * dstStride[p],
+                                             width, 3))
                                 fail();
                 } else {
                     fail();
@@ -228,6 +228,8 @@ static void check_yuv2rgb(int src_pix_fmt)
     }
 }
 
+#undef NUM_LINES
+#undef SRC_STRIDE_PAD
 #undef MAX_LINE_SIZE
 
 void checkasm_check_sw_yuv2rgb(void)
-- 
2.52.0


_______________________________________________
ffmpeg-devel mailing list -- ffmpeg-devel@ffmpeg.org
To unsubscribe send an email to ffmpeg-devel-leave@ffmpeg.org
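
The multi-row, padded-stride checking pattern introduced by the patch above can be illustrated with a small standalone sketch. The names and constants here are hypothetical stand-ins, not the actual checkasm code; `rows_match` mimics the role of `cmp_off_by_n`, comparing stride-addressed rows with a per-byte tolerance so the asm under test must honor the stride rather than assume tightly packed lines:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants mirroring the test's layout: each row is
 * allocated wider than the visible width, so the pixels between
 * `WIDTH` and `WIDTH + PAD` are padding that must be skipped. */
enum { WIDTH = 8, PAD = 4, ROWS = 3 };

/* Compare two stride-addressed buffers row by row, allowing each
 * byte to differ by at most `tol` (the role cmp_off_by_n plays in
 * the checkasm test). Padding bytes are never compared. */
static int rows_match(const uint8_t *a, const uint8_t *b,
                      int stride, int width, int rows, int tol)
{
    for (int r = 0; r < rows; r++)
        for (int x = 0; x < width; x++) {
            int d = a[r * stride + x] - b[r * stride + x];
            if (d < -tol || d > tol)
                return 0;
        }
    return 1;
}
```

Comparing whole rows at `dst + row * dstStride[p]` like this, rather than only two fixed line pointers, is what lets the expanded test catch implementations that mishandle strides once `srcSliceH` grows past 2.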

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [FFmpeg-devel] Re: [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix
  2026-02-08 21:15 ` [FFmpeg-devel] [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix David Christle via ffmpeg-devel
                     ` (2 preceding siblings ...)
  2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 3/3] tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded strides David Christle via ffmpeg-devel
@ 2026-02-08 21:35   ` Martin Storsjö via ffmpeg-devel
  3 siblings, 0 replies; 7+ messages in thread
From: Martin Storsjö via ffmpeg-devel @ 2026-02-08 21:35 UTC (permalink / raw)
  To: David Christle via ffmpeg-devel; +Cc: David Christle, Martin Storsjö

On Sun, 8 Feb 2026, David Christle via ffmpeg-devel wrote:

> v1 of patches 2-3 (NEON YUV2RGB + checkasm expansion) exposed a
> pre-existing bug in the LoongArch LASX YUV2RGB path: the res variable
> (residual pixel count for widths not divisible by 16) is destructively
> modified by DEALYUV2RGBLINERES/DEALYUV2RGBLINERES32 inside the row
> loop, producing wrong output when srcSliceH > 2. v2 prepends a fix.
>
> Changes since v1:
> - NEW patch 1/3: fix LoongArch LASX res variable across row iterations
> - Patches 2/3 and 3/3 unchanged from v1
>
> David Christle (3):
>  swscale/loongarch: fix LASX YUV2RGB residual for multi-row slices
>  swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion
>  tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded
>    strides
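
The failure mode described in the quoted cover letter, a residual pixel count destructively consumed inside the row loop, can be sketched generically. This is illustrative C under stated assumptions, not the actual LASX macros: `process_residual` stands in for a kernel that, like `DEALYUV2RGBLINERES`, decrements its counter while working, and the buggy/fixed pair shows why only the first iteration handles its tail pixels unless the count is reloaded per row:

```c
#include <assert.h>

/* Hypothetical stand-in for the residual kernel: handles `count`
 * leftover pixels and destructively zeroes the counter as it goes. */
static int process_residual(int *count, int *pixels_done)
{
    while (*count > 0) {   /* consumes the counter */
        (*pixels_done)++;
        (*count)--;
    }
    return *pixels_done;
}

/* Buggy pattern: `res` is computed once before the row loop, so it is
 * already zero on every iteration after the first and later rows skip
 * their tail pixels entirely. */
static int convert_buggy(int width, int rows)
{
    int res = width % 16;  /* residual pixels per row */
    int done = 0;
    for (int r = 0; r < rows; r++)
        process_residual(&res, &done);  /* res == 0 after row 0 */
    return done;
}

/* Fixed pattern (conceptually what the v2 fix does): reload the
 * residual count at the top of each row iteration. */
static int convert_fixed(int width, int rows)
{
    int done = 0;
    for (int r = 0; r < rows; r++) {
        int res = width % 16;           /* fresh count per row */
        process_residual(&res, &done);
    }
    return done;
}
```

With a single loop iteration the two patterns agree, which is why the bug stayed hidden until the checkasm test started feeding slices taller than one loop pass.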

Please submit these patches at https://code.ffmpeg.org/FFmpeg/FFmpeg

// Martin


end of thread, other threads:[~2026-02-08 21:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-02-06  6:45 [FFmpeg-devel] [PATCH 1/2] swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion David Christle via ffmpeg-devel
2026-02-06  6:46 ` [FFmpeg-devel] [PATCH 2/2] tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded strides David Christle via ffmpeg-devel
2026-02-08 21:15 ` [FFmpeg-devel] [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix David Christle via ffmpeg-devel
2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 1/3] swscale/loongarch: fix LASX YUV2RGB residual for multi-row slices David Christle via ffmpeg-devel
2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 2/3] swscale/aarch64: add NEON YUV420P/YUV422P/YUVA420P to RGB conversion David Christle via ffmpeg-devel
2026-02-08 21:16   ` [FFmpeg-devel] [PATCH v2 3/3] tests/checkasm/sw_yuv2rgb: test multi-row conversion with padded strides David Christle via ffmpeg-devel
2026-02-08 21:35   ` [FFmpeg-devel] Re: [PATCH v2 0/3] swscale: NEON YUV2RGB + LoongArch LASX multi-row fix Martin Storsjö via ffmpeg-devel
