* [FFmpeg-devel] [PATCH v2 1/1] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()
@ 2025-07-02 7:31 Logaprakash Ramajayam
2025-07-02 9:10 ` Logaprakash Ramajayam
2025-07-02 9:27 ` Logaprakash Ramajayam
0 siblings, 2 replies; 3+ messages in thread
From: Logaprakash Ramajayam @ 2025-07-02 7:31 UTC (permalink / raw)
To: FFmpeg development discussions and patches
[-- Attachment #1: Type: text/plain, Size: 2374 bytes --]
Addressed all the review comments and updated checkasm to cover yuv2planeX_10_c().
Checkasm Benchmark results:
yuv2yuvX_10_LE_16_0_512_accurate_c: 7836.9 ( 1.00x)
yuv2yuvX_10_LE_16_0_512_accurate_neon: 840.4 ( 9.33x)
yuv2yuvX_10_LE_16_0_512_approximate_c: 7930.8 ( 1.00x)
yuv2yuvX_10_LE_16_0_512_approximate_neon: 838.5 ( 9.46x)
yuv2yuvX_10_LE_16_16_512_accurate_c: 7594.3 ( 1.00x)
yuv2yuvX_10_LE_16_16_512_accurate_neon: 815.2 ( 9.32x)
yuv2yuvX_10_LE_16_16_512_approximate_c: 7687.0 ( 1.00x)
yuv2yuvX_10_LE_16_16_512_approximate_neon: 811.9 ( 9.47x)
yuv2yuvX_10_LE_16_32_512_accurate_c: 7366.4 ( 1.00x)
yuv2yuvX_10_LE_16_32_512_accurate_neon: 785.8 ( 9.37x)
yuv2yuvX_10_LE_16_32_512_approximate_c: 7426.5 ( 1.00x)
yuv2yuvX_10_LE_16_32_512_approximate_neon: 786.4 ( 9.44x)
yuv2yuvX_10_LE_16_48_512_accurate_c: 7123.1 ( 1.00x)
yuv2yuvX_10_LE_16_48_512_accurate_neon: 761.7 ( 9.35x)
yuv2yuvX_10_LE_16_48_512_approximate_c: 7182.7 ( 1.00x)
yuv2yuvX_10_LE_16_48_512_approximate_neon: 763.0 ( 9.41x)
yuv2yuvX_10_BE_16_0_512_accurate_c: 8092.6 ( 1.00x)
yuv2yuvX_10_BE_16_0_512_accurate_neon: 860.2 ( 9.41x)
yuv2yuvX_10_BE_16_0_512_approximate_c: 8183.5 ( 1.00x)
yuv2yuvX_10_BE_16_0_512_approximate_neon: 861.4 ( 9.50x)
yuv2yuvX_10_BE_16_16_512_accurate_c: 7837.4 ( 1.00x)
yuv2yuvX_10_BE_16_16_512_accurate_neon: 834.0 ( 9.40x)
yuv2yuvX_10_BE_16_16_512_approximate_c: 7927.9 ( 1.00x)
yuv2yuvX_10_BE_16_16_512_approximate_neon: 834.6 ( 9.50x)
yuv2yuvX_10_BE_16_32_512_accurate_c: 7605.1 ( 1.00x)
yuv2yuvX_10_BE_16_32_512_accurate_neon: 807.5 ( 9.42x)
yuv2yuvX_10_BE_16_32_512_approximate_c: 7691.4 ( 1.00x)
yuv2yuvX_10_BE_16_32_512_approximate_neon: 807.3 ( 9.53x)
yuv2yuvX_10_BE_16_48_512_accurate_c: 7344.3 ( 1.00x)
yuv2yuvX_10_BE_16_48_512_accurate_neon: 782.7 ( 9.38x)
yuv2yuvX_10_BE_16_48_512_approximate_c: 7440.1 ( 1.00x)
yuv2yuvX_10_BE_16_48_512_approximate_neon: 781.9 ( 9.51x)
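For anyone reading the table: the labels appear to follow the check_func() format string in the updated tests/checkasm/sw_scale.c (bit depth, endianness, filter size, offset, dstW, rounding mode), and the figure in parentheses is the C time divided by the NEON time. A minimal sketch in C, where bench_label() and speedup() are hypothetical helper names used only for illustration and are not part of the patch:

```c
#include <stdio.h>

/* Hypothetical helper: rebuilds a benchmark label from its fields,
 * using the same field order as the check_func() call in the patch. */
static int bench_label(char *buf, size_t len, int bit_depth, int is_be,
                       int filter_size, int offset, int dst_w, int accurate)
{
    return snprintf(buf, len, "yuv2yuvX_%d_%s_%d_%d_%d_%s",
                    bit_depth, is_be ? "BE" : "LE",
                    filter_size, offset, dst_w,
                    accurate ? "accurate" : "approximate");
}

/* Hypothetical helper: speedup factor as shown in parentheses,
 * i.e. reference (C) time divided by optimized (NEON) time. */
static double speedup(double c_time, double simd_time)
{
    return c_time / simd_time;
}
```

For the first pair of rows, speedup(7836.9, 840.4) is roughly 9.33, matching the reported factor.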
[-- Attachment #2: Swscale-Aarch64-Implement-neon-assembly-yuv2planeX_10_c_template.patch --]
[-- Type: application/octet-stream, Size: 25716 bytes --]
[-- Attachment #3: Type: text/plain, Size: 251 bytes --]
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
* Re: [FFmpeg-devel] [PATCH v2 1/1] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()
2025-07-02 7:31 [FFmpeg-devel] [PATCH v2 1/1] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template() Logaprakash Ramajayam
@ 2025-07-02 9:10 ` Logaprakash Ramajayam
2025-07-02 9:27 ` Logaprakash Ramajayam
1 sibling, 0 replies; 3+ messages in thread
From: Logaprakash Ramajayam @ 2025-07-02 9:10 UTC (permalink / raw)
To: FFmpeg development discussions and patches
[-- Attachment #1: Type: text/plain, Size: 2759 bytes --]
Attaching the patch with the assembly implementation of yuv2planeX_10_c() in text format.
[-- Attachment #2: Swscale-Aarch64-Implement-neon-assembly-yuv2planeX_10_c_template.patch --]
[-- Type: application/octet-stream, Size: 25716 bytes --]
[-- Attachment #3: Type: text/plain, Size: 251 bytes --]
* Re: [FFmpeg-devel] [PATCH v2 1/1] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()
2025-07-02 7:31 [FFmpeg-devel] [PATCH v2 1/1] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template() Logaprakash Ramajayam
2025-07-02 9:10 ` Logaprakash Ramajayam
@ 2025-07-02 9:27 ` Logaprakash Ramajayam
1 sibling, 0 replies; 3+ messages in thread
From: Logaprakash Ramajayam @ 2025-07-02 9:27 UTC (permalink / raw)
To: FFmpeg development discussions and patches
From 3e14b4c2e763d2d0c8979e3e99578f5492b7130c Mon Sep 17 00:00:00 2001
From: Logaprakash Ramajayam <logaprakash.ramajayam@multicorewareinc.com>
Date: Tue, 1 Jul 2025 23:48:36 -0700
Subject: [PATCH v2 1/1] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()
---
libswscale/aarch64/output.S | 189 +++++++++++++++++++++++++++++++++++
libswscale/aarch64/swscale.c | 38 +++++++
tests/checkasm/sw_scale.c | 170 ++++++++++++++++++++-----------
3 files changed, 337 insertions(+), 60 deletions(-)
diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
index 190c438870..2aad420db2 100644
--- a/libswscale/aarch64/output.S
+++ b/libswscale/aarch64/output.S
@@ -20,6 +20,195 @@
#include "libavutil/aarch64/asm.S"
+function ff_yuv2planeX_10_neon, export=1
+// x0 = filter (int16_t*)
+// w1 = filterSize
+// x2 = src (int16_t**)
+// x3 = dest (uint16_t*)
+// w4 = dstW
+// w5 = big_endian
+// w6 = output_bits
+
+ mov w8, #27
+ sub w8, w8, w6 // shift = 11 + 16 - output_bits
+
+ sub w9, w8, #1
+ mov w10, #1
+ lsl w9, w10, w9 // val = 1 << (shift - 1)
+
+ dup v1.4s, w9
+ dup v2.4s, w9 // Create vectors with val
+
+ neg w16, w8
+ dup v20.4s, w16 // Create (-shift) vector for right shift
+
+ mov w10, #1
+ lsl w10, w10, w6
+ sub w10, w10, #1 // (1U << output_bits) - 1
+ dup v21.4s, w10 // Create Clip vector for upper bound
+
+ mov x7, #0 // i = 0
+
+1:
+ cmp w4, #16 // Process 16-pixels if available
+ blt 4f
+
+ mov v3.16b, v1.16b
+ mov v4.16b, v2.16b
+ mov v5.16b, v1.16b
+ mov v6.16b, v2.16b
+
+ mov w11, w1 // tmpfilterSize = filterSize
+ mov x12, x2 // srcp = src
+ mov x13, x0 // filterp = filter
+
+2: // Filter loop
+
+ ldp x14, x15, [x12], #16 // get 2 pointers: src[j] and src[j+1]
+ ldr s7, [x13], #4 // load filter coefficients
+ add x14, x14, x7, lsl #1
+ add x15, x15, x7, lsl #1
+ ld1 {v16.8h, v17.8h}, [x14]
+ ld1 {v18.8h, v19.8h}, [x15]
+
+ // Multiply-accumulate
+ smlal v3.4s, v16.4h, v7.h[0]
+ smlal2 v4.4s, v16.8h, v7.h[0]
+ smlal v5.4s, v17.4h, v7.h[0]
+ smlal2 v6.4s, v17.8h, v7.h[0]
+
+ smlal v3.4s, v18.4h, v7.h[1]
+ smlal2 v4.4s, v18.8h, v7.h[1]
+ smlal v5.4s, v19.4h, v7.h[1]
+ smlal2 v6.4s, v19.8h, v7.h[1]
+
+ subs w11, w11, #2 // tmpfilterSize -= 2
+ b.gt 2b // continue filter loop
+
+ // Shift results
+ sshl v3.4s, v3.4s, v20.4s
+ sshl v4.4s, v4.4s, v20.4s
+ sshl v5.4s, v5.4s, v20.4s
+ sshl v6.4s, v6.4s, v20.4s
+
+ // Clamp to upper bound
+ smin v3.4s, v3.4s, v21.4s
+ smin v4.4s, v4.4s, v21.4s
+ smin v5.4s, v5.4s, v21.4s
+ smin v6.4s, v6.4s, v21.4s
+
+ // Narrow and clamp to 0
+ sqxtun v23.4h, v3.4s
+ sqxtun2 v23.8h, v4.4s
+ sqxtun v24.4h, v5.4s
+ sqxtun2 v24.8h, v6.4s
+
+ cbz w5, 3f // Check if big endian
+ rev16 v23.16b, v23.16b
+ rev16 v24.16b, v24.16b // Swap bytes for big endian
+3:
+ st1 {v23.8h, v24.8h}, [x3], #32
+
+ subs w4, w4, #16 // dstW = dstW - 16
+ add x7, x7, #16 // i = i + 16
+ b 1b // Continue loop
+
+4:
+ cmp w4, #8 // Process 8-pixels if available
+ blt 8f
+5:
+ mov v3.16b, v1.16b
+ mov v4.16b, v2.16b
+
+ mov w11, w1 // tmpfilterSize = filterSize
+ mov x12, x2 // srcp = src
+ mov x13, x0 // filterp = filter
+
+6: // Filter loop
+
+ ldp x14, x15, [x12], #16
+ ldr s7, [x13], #4
+ add x14, x14, x7, lsl #1
+ add x15, x15, x7, lsl #1
+ ld1 {v5.8h}, [x14]
+ ld1 {v6.8h}, [x15]
+
+ // Multiply-accumulate
+ smlal v3.4s, v5.4h, v7.h[0]
+ smlal2 v4.4s, v5.8h, v7.h[0]
+ smlal v3.4s, v6.4h, v7.h[1]
+ smlal2 v4.4s, v6.8h, v7.h[1]
+
+ subs w11, w11, #2 // tmpfilterSize -= 2
+ b.gt 6b // loop until filterSize consumed
+
+ // Shift results
+ sshl v3.4s, v3.4s, v20.4s
+ sshl v4.4s, v4.4s, v20.4s
+
+ // Clip upper bound
+ smin v3.4s, v3.4s, v21.4s
+ smin v4.4s, v4.4s, v21.4s
+
+ // Narrow and clamp to 0
+ sqxtun v25.4h, v3.4s
+ sqxtun v26.4h, v4.4s
+
+ cbz w5, 7f // Check if big endian
+ rev16 v25.8b, v25.8b
+ rev16 v26.8b, v26.8b // Swap bytes for big endian
+
+7:
+ // Store 8 pixels
+ st1 {v25.4h, v26.4h}, [x3], #16
+
+ subs w4, w4, #8 // dstW = dstW - 8
+ add x7, x7, #8 // i = i + 8
+
+8:
+ cbz w4, 12f // Scalar loop for remaining pixels
+9:
+ mov w11, w1 // tmpfilterSize = filterSize
+ mov x12, x2 // srcp = src
+ mov x13, x0 // filterp = filter
+ sxtw x9, w9
+ mov x17, x9
+
+10: // Filter loop
+ ldr x14, [x12], #8 // Load src pointer
+ ldrsh w15, [x13], #2 // Load filter coefficient
+ add x14, x14, x7, lsl #1 // Add pixel offset
+ ldrsh w16, [x14] // Load src sample (signed 16-bit)
+
+ sxtw x16, w16
+ sxtw x15, w15
+ madd x17, x16, x15, x17
+
+ subs w11, w11, #1 // tmpfilterSize -= 1
+ b.gt 10b // loop until filterSize consumed
+
+ sxtw x8, w8
+ asr x17, x17, x8
+ cmp x17, #0
+ csel x17, x17, xzr, ge // Clamp to 0 if negative
+
+ sxtw x10, w10
+ cmp x17, x10
+ csel x17, x10, x17, gt // Clamp to max if greater than max
+
+ cbz w5, 11f // Check if big endian
+ rev16 x17, x17 // Swap bytes for big endian
+11:
+ strh w17, [x3], #2
+
+ subs w4, w4, #1 // dstW = dstW - 1
+ add x7, x7, #1 // i = i + 1
+ b.gt 9b // Loop if more pixels
+
+12:
+ ret
+endfunc
+
function ff_yuv2planeX_8_neon, export=1
// x0 - const int16_t *filter,
// x1 - int filterSize,
diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
index 6e5a721c1f..2c3f096a84 100644
--- a/libswscale/aarch64/swscale.c
+++ b/libswscale/aarch64/swscale.c
@@ -158,6 +158,29 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
ALL_SCALE_FUNCS(neon);
+void ff_yuv2planeX_10_neon(const int16_t *filter, int filterSize,
+ const int16_t **src, uint16_t *dest, int dstW,
+ int big_endian, int output_bits);
+
+#define yuv2NBPS(bits, BE_LE, is_be, template_size, typeX_t) \
+static void yuv2planeX_ ## bits ## BE_LE ## _neon(const int16_t *filter, int filterSize, \
+ const int16_t **src, uint8_t *dest, int dstW, \
+ const uint8_t *dither, int offset) \
+{ \
+ ff_yuv2planeX_## template_size ## _neon(filter, \
+ filterSize, (const typeX_t **) src, \
+ (uint16_t *) dest, dstW, is_be, bits); \
+}
+
+yuv2NBPS( 9, BE, 1, 10, int16_t)
+yuv2NBPS( 9, LE, 0, 10, int16_t)
+yuv2NBPS(10, BE, 1, 10, int16_t)
+yuv2NBPS(10, LE, 0, 10, int16_t)
+yuv2NBPS(12, BE, 1, 10, int16_t)
+yuv2NBPS(12, LE, 0, 10, int16_t)
+yuv2NBPS(14, BE, 1, 10, int16_t)
+yuv2NBPS(14, LE, 0, 10, int16_t)
+
void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
const int16_t **src, uint8_t *dest, int dstW,
const uint8_t *dither, int offset);
@@ -268,6 +291,8 @@ av_cold void ff_sws_init_range_convert_aarch64(SwsInternal *c)
av_cold void ff_sws_init_swscale_aarch64(SwsInternal *c)
{
int cpu_flags = av_get_cpu_flags();
+ enum AVPixelFormat dstFormat = c->opts.dst_format;
+ const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(dstFormat);
if (have_neon(cpu_flags)) {
ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon);
@@ -276,6 +301,19 @@ av_cold void ff_sws_init_swscale_aarch64(SwsInternal *c)
if (c->dstBpc == 8) {
c->yuv2planeX = ff_yuv2planeX_8_neon;
}
+
+ if (isNBPS(dstFormat) && !isSemiPlanarYUV(dstFormat)) {
+ if (desc->comp[0].depth == 9) {
+ c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_9BE_neon : yuv2planeX_9LE_neon;
+ } else if (desc->comp[0].depth == 10) {
+ c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_10BE_neon : yuv2planeX_10LE_neon;
+ } else if (desc->comp[0].depth == 12) {
+ c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_12BE_neon : yuv2planeX_12LE_neon;
+ } else if (desc->comp[0].depth == 14) {
+ c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_14BE_neon : yuv2planeX_14LE_neon;
+ } else
+ av_assert0(0);
+ }
switch (c->opts.src_format) {
case AV_PIX_FMT_ABGR:
c->lumToYV12 = ff_abgr32ToY_neon;
diff --git a/tests/checkasm/sw_scale.c b/tests/checkasm/sw_scale.c
index 11c9174a6b..5a659571df 100644
--- a/tests/checkasm/sw_scale.c
+++ b/tests/checkasm/sw_scale.c
@@ -52,50 +52,59 @@ static void yuv2planeX_8_ref(const int16_t *filter, int filterSize,
}
}
-static int cmp_off_by_n(const uint8_t *ref, const uint8_t *test, size_t n, int accuracy)
-{
- for (size_t i = 0; i < n; i++) {
- if (abs(ref[i] - test[i]) > accuracy)
- return 1;
- }
- return 0;
+#define CMP_FUNC(bits) \
+static int cmp_off_by_##bits(const uint##bits##_t *ref, const uint##bits##_t *test, \
+ size_t n, int accuracy) \
+{ \
+ for (size_t i = 0; i < n; i++) { \
+ if (abs((int)ref[i] - (int)test[i]) > accuracy) \
+ return 1; \
+ } \
+ return 0; \
}
-static void print_data(uint8_t *p, size_t len, size_t offset)
-{
- size_t i = 0;
- for (; i < len; i++) {
- if (i % 8 == 0) {
- printf("0x%04zx: ", i+offset);
- }
- printf("0x%02x ", (uint32_t) p[i]);
- if (i % 8 == 7) {
- printf("\n");
- }
- }
- if (i % 8 != 0) {
- printf("\n");
- }
+CMP_FUNC(8)
+CMP_FUNC(16)
+
+#define SHOW_DIFF_FUNC(bits) \
+static void print_data_##bits(const uint##bits##_t *p, size_t len, size_t offset) \
+{ \
+ size_t i = 0; \
+ for (; i < len; i++) { \
+ if (i % 8 == 0) { \
+ printf("0x%04zx: ", i+offset); \
+ } \
+ printf("0x%02x ", (uint32_t) p[i]); \
+ if (i % 8 == 7) { \
+ printf("\n"); \
+ } \
+ } \
+ if (i % 8 != 0) { \
+ printf("\n"); \
+ } \
+} \
+static size_t show_differences_##bits(const uint##bits##_t *a, const uint##bits##_t *b, \
+ size_t len) \
+{ \
+ for (size_t i = 0; i < len; i++) { \
+ if (a[i] != b[i]) { \
+ size_t offset_of_mismatch = i; \
+ size_t offset; \
+ if (i >= 8) i-=8; \
+ offset = i & (~7); \
+ printf("test a:\n"); \
+ print_data_##bits(&a[offset], 32, offset); \
+ printf("\ntest b:\n"); \
+ print_data_##bits(&b[offset], 32, offset); \
+ printf("\n"); \
+ return offset_of_mismatch; \
+ } \
+ } \
+ return len; \
}
-static size_t show_differences(uint8_t *a, uint8_t *b, size_t len)
-{
- for (size_t i = 0; i < len; i++) {
- if (a[i] != b[i]) {
- size_t offset_of_mismatch = i;
- size_t offset;
- if (i >= 8) i-=8;
- offset = i & (~7);
- printf("test a:\n");
- print_data(&a[offset], 32, offset);
- printf("\ntest b:\n");
- print_data(&b[offset], 32, offset);
- printf("\n");
- return offset_of_mismatch;
- }
- }
- return len;
-}
+SHOW_DIFF_FUNC(8)
+SHOW_DIFF_FUNC(16)
static void check_yuv2yuv1(int accurate)
{
@@ -140,10 +149,10 @@ static void check_yuv2yuv1(int accurate)
call_ref(src_pixels, dst0, dstW, dither, offset);
call_new(src_pixels, dst1, dstW, dither, offset);
- if (cmp_off_by_n(dst0, dst1, dstW * sizeof(dst0[0]), accurate ? 0 : 2)) {
+ if (cmp_off_by_8(dst0, dst1, dstW * sizeof(dst0[0]), accurate ? 0 : 2)) {
fail();
printf("failed: yuv2yuv1_%d_%di_%s\n", offset, dstW, accurate_str);
- fail_offset = show_differences(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
+ fail_offset = show_differences_8(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
printf("failing values: src: 0x%04x dither: 0x%02x dst-c: %02x dst-asm: %02x\n",
(int) src_pixels[fail_offset],
(int) dither[(fail_offset + fail_offset) & 7],
@@ -158,7 +167,7 @@ static void check_yuv2yuv1(int accurate)
sws_freeContext(sws);
}
-static void check_yuv2yuvX(int accurate)
+static void check_yuv2yuvX(int accurate, int bit_depth, int dst_pix_format)
{
SwsContext *sws;
SwsInternal *c;
@@ -179,8 +188,8 @@ static void check_yuv2yuvX(int accurate)
const int16_t **src;
LOCAL_ALIGNED_16(int16_t, src_pixels, [LARGEST_FILTER * LARGEST_INPUT_SIZE]);
LOCAL_ALIGNED_16(int16_t, filter_coeff, [LARGEST_FILTER]);
- LOCAL_ALIGNED_16(uint8_t, dst0, [LARGEST_INPUT_SIZE]);
- LOCAL_ALIGNED_16(uint8_t, dst1, [LARGEST_INPUT_SIZE]);
+ LOCAL_ALIGNED_16(uint16_t, dst0, [LARGEST_INPUT_SIZE]);
+ LOCAL_ALIGNED_16(uint16_t, dst1, [LARGEST_INPUT_SIZE]);
LOCAL_ALIGNED_16(uint8_t, dither, [LARGEST_INPUT_SIZE]);
union VFilterData{
const int16_t *src;
@@ -190,12 +199,14 @@ static void check_yuv2yuvX(int accurate)
memset(dither, d_val, LARGEST_INPUT_SIZE);
randomize_buffers((uint8_t*)src_pixels, LARGEST_FILTER * LARGEST_INPUT_SIZE * sizeof(int16_t));
sws = sws_alloc_context();
+ sws->dst_format = dst_pix_format;
if (accurate)
sws->flags |= SWS_ACCURATE_RND;
if (sws_init_context(sws, NULL, NULL) < 0)
fail();
c = sws_internal(sws);
+ c->dstBpc = bit_depth;
ff_sws_init_scale(c);
for(isi = 0; isi < FF_ARRAY_ELEMS(input_sizes); ++isi){
dstW = input_sizes[isi];
@@ -227,24 +238,39 @@ static void check_yuv2yuvX(int accurate)
for(j = 0; j < 4; ++j)
vFilterData[i].coeff[j + 4] = filter_coeff[i];
}
- if (check_func(c->yuv2planeX, "yuv2yuvX_%d_%d_%d_%s", filter_sizes[fsi], osi, dstW, accurate_str)){
+ if (check_func(c->yuv2planeX, "yuv2yuvX_%d_%s_%d_%d_%d_%s", bit_depth, isBE(dst_pix_format) ? "BE" : "LE", filter_sizes[fsi], osi, dstW, accurate_str)){
// use vFilterData for the mmx function
const int16_t *filter = c->use_mmx_vfilter ? (const int16_t*)vFilterData : &filter_coeff[0];
memset(dst0, 0, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
memset(dst1, 0, LARGEST_INPUT_SIZE * sizeof(dst1[0]));
- // We can't use call_ref here, because we don't know if use_mmx_vfilter was set for that
- // function or not, so we can't pass it the parameters correctly.
- yuv2planeX_8_ref(&filter_coeff[0], filter_sizes[fsi], src, dst0, dstW - osi, dither, osi);
+ if(c->dstBpc == 8)
+ {
+ // We can't use call_ref here, because we don't know if use_mmx_vfilter was set for that
+ // function or not, so we can't pass it the parameters correctly.
- call_new(filter, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
- if (cmp_off_by_n(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]), accurate ? 0 : 2)) {
- fail();
- printf("failed: yuv2yuvX_%d_%d_%d_%s\n", filter_sizes[fsi], osi, dstW, accurate_str);
- show_differences(dst0, dst1, LARGEST_INPUT_SIZE * sizeof(dst0[0]));
+ yuv2planeX_8_ref(&filter_coeff[0], filter_sizes[fsi], src, (uint8_t*)dst0, dstW - osi, dither, osi);
+ call_new(filter, filter_sizes[fsi], src, (uint8_t*)dst1, dstW - osi, dither, osi);
+
+ if (cmp_off_by_8((uint8_t*)dst0, (uint8_t*)dst1, LARGEST_INPUT_SIZE, accurate ? 0 : 2)) {
+ fail();
+ printf("failed: yuv2yuvX_%d_%s_%d_%d_%d_%s\n", bit_depth, isBE(dst_pix_format) ? "BE" : "LE", filter_sizes[fsi], osi, dstW, accurate_str);
+ show_differences_8((uint8_t*)dst0, (uint8_t*)dst1, LARGEST_INPUT_SIZE);
+ }
+ }
+ else
+ {
+ call_ref(&filter_coeff[0], filter_sizes[fsi], src, (uint8_t*)dst0, dstW - osi, dither, osi);
+ call_new(&filter_coeff[0], filter_sizes[fsi], src, (uint8_t*)dst1, dstW - osi, dither, osi);
+
+ if (cmp_off_by_16(dst0, dst1, LARGEST_INPUT_SIZE, accurate ? 0 : 2)) {
+ fail();
+ printf("failed: yuv2yuvX_%d_%s_%d_%d_%d_%s\n", bit_depth, isBE(dst_pix_format) ? "BE" : "LE", filter_sizes[fsi], osi, dstW, accurate_str);
+ show_differences_16(dst0, dst1, LARGEST_INPUT_SIZE);
+ }
}
if(dstW == LARGEST_INPUT_SIZE)
- bench_new((const int16_t*)vFilterData, filter_sizes[fsi], src, dst1, dstW - osi, dither, osi);
+ bench_new(filter, filter_sizes[fsi], src, (uint8_t*)dst1, dstW - osi, dither, osi);
}
av_freep(&src);
@@ -311,10 +337,10 @@ static void check_yuv2nv12cX(int accurate)
call_ref(sws->dst_format, dither, &filter_coeff[0], filter_size, srcU, srcV, dst0, dstW);
call_new(sws->dst_format, dither, &filter_coeff[0], filter_size, srcU, srcV, dst1, dstW);
- if (cmp_off_by_n(dst0, dst1, dstW * 2 * sizeof(dst0[0]), accurate ? 0 : 2)) {
+ if (cmp_off_by_8(dst0, dst1, dstW * 2 * sizeof(dst0[0]), accurate ? 0 : 2)) {
fail();
printf("failed: yuv2nv12wX_%d_%d_%s\n", filter_size, dstW, accurate_str);
- show_differences(dst0, dst1, dstW * 2 * sizeof(dst0[0]));
+ show_differences_8(dst0, dst1, dstW * 2 * sizeof(dst0[0]));
}
if (dstW == LARGEST_INPUT_SIZE)
bench_new(sws->dst_format, dither, &filter_coeff[0], filter_size, srcU, srcV, dst1, dstW);
@@ -441,9 +467,33 @@ void checkasm_check_sw_scale(void)
check_yuv2yuv1(0);
check_yuv2yuv1(1);
report("yuv2yuv1");
- check_yuv2yuvX(0);
- check_yuv2yuvX(1);
- report("yuv2yuvX");
+ check_yuv2yuvX(0, 8, AV_PIX_FMT_YUV420P);
+ check_yuv2yuvX(1, 8, AV_PIX_FMT_YUV420P);
+ report("yuv2yuvX_8");
+ check_yuv2yuvX(0, 9, AV_PIX_FMT_YUV420P9LE);
+ check_yuv2yuvX(1, 9, AV_PIX_FMT_YUV420P9LE);
+ report("yuv2yuvX_9LE");
+ check_yuv2yuvX(0, 9, AV_PIX_FMT_YUV420P9BE);
+ check_yuv2yuvX(1, 9, AV_PIX_FMT_YUV420P9BE);
+ report("yuv2yuvX_9BE");
+ check_yuv2yuvX(0, 10, AV_PIX_FMT_YUV420P10LE);
+ check_yuv2yuvX(1, 10, AV_PIX_FMT_YUV420P10LE);
+ report("yuv2yuvX_10LE");
+ check_yuv2yuvX(0, 10, AV_PIX_FMT_YUV420P10BE);
+ check_yuv2yuvX(1, 10, AV_PIX_FMT_YUV420P10BE);
+ report("yuv2yuvX_10BE");
+ check_yuv2yuvX(0, 12, AV_PIX_FMT_YUV420P12LE);
+ check_yuv2yuvX(1, 12, AV_PIX_FMT_YUV420P12LE);
+ report("yuv2yuvX_12LE");
+ check_yuv2yuvX(0, 12, AV_PIX_FMT_YUV420P12BE);
+ check_yuv2yuvX(1, 12, AV_PIX_FMT_YUV420P12BE);
+ report("yuv2yuvX_12BE");
+ check_yuv2yuvX(0, 14, AV_PIX_FMT_YUV420P14LE);
+ check_yuv2yuvX(1, 14, AV_PIX_FMT_YUV420P14LE);
+ report("yuv2yuvX_14LE");
+ check_yuv2yuvX(0, 14, AV_PIX_FMT_YUV420P14BE);
+ check_yuv2yuvX(1, 14, AV_PIX_FMT_YUV420P14BE);
+ report("yuv2yuvX_14BE");
check_yuv2nv12cX(0);
check_yuv2nv12cX(1);
report("yuv2nv12cX");
--
2.34.1
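
As a reading aid, the per-pixel arithmetic described by the comments in ff_yuv2planeX_10_neon can be modelled in plain C roughly as follows. This is a sketch under assumptions, not the FFmpeg reference code: planeX_model() is a hypothetical name, the output is written byte by byte instead of through FFmpeg's own write macros, and the accumulator is assumed not to overflow 32 bits for the inputs used.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the routine: accumulate with a rounding
 * bias, shift down, clamp to [0, (1 << output_bits) - 1], then store
 * each 16-bit sample in the requested byte order. */
static void planeX_model(const int16_t *filter, int filter_size,
                         const int16_t **src, uint8_t *dest, int dst_w,
                         int big_endian, int output_bits)
{
    assert(output_bits >= 9 && output_bits <= 14);

    const int shift   = 11 + 16 - output_bits;
    const int32_t max = (1 << output_bits) - 1;

    for (int i = 0; i < dst_w; i++) {
        int32_t val = 1 << (shift - 1);          /* rounding bias */

        for (int j = 0; j < filter_size; j++)
            val += src[j][i] * filter[j];        /* signed 16x16 -> 32 */

        /* Clamp negatives before shifting so the result does not depend
         * on how the compiler shifts negative values. */
        int32_t out = val < 0 ? 0 : val >> shift;
        if (out > max)
            out = max;

        if (big_endian) {
            dest[2 * i]     = (uint8_t)(out >> 8);
            dest[2 * i + 1] = (uint8_t)(out & 0xff);
        } else {
            dest[2 * i]     = (uint8_t)(out & 0xff);
            dest[2 * i + 1] = (uint8_t)(out >> 8);
        }
    }
}
```

With a two-tap filter of {2048, 2048} and both source rows holding 16384, the 10-bit result is 512, stored as 00 02 in little-endian order and 02 00 in big-endian order.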