* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  From: Krzysztof Pyrkosz via ffmpeg-devel
  Date: 2025-02-19 17:40 UTC
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---

This patch replaces integer widening with halving addition, and the
multi-step "emulated" rounding shift with a single asm instruction doing
exactly that. This pattern repeats in other functions in this file; I
fixed some of them in the succeeding patch. There's a lot of performance
to be gained there.

I didn't modify the existing function because it adds a few extra steps
solely for the shared w_avg implementation (every cycle matters), but
also because I find this linear version easier to digest and understand.

Besides, I noticed that removing the smin and smax instructions used for
clamping the values in the 10 and 12 bit_depth instantiations does not
affect the checkasm result, but it breaks FATE.

Benchmarks before and after:

A78
before:
avg_8_2x2_neon: 21.0 ( 1.55x)
avg_8_4x4_neon: 25.8 ( 3.05x)
avg_8_8x8_neon: 45.0 ( 5.86x)
avg_8_16x16_neon: 178.5 ( 5.49x)
avg_8_32x32_neon: 709.2 ( 6.20x)
avg_8_64x64_neon: 2686.2 ( 6.12x)
avg_8_128x128_neon: 10734.2 ( 5.88x)
avg_10_2x2_neon: 19.0 ( 1.75x)
avg_10_4x4_neon: 28.2 ( 2.76x)
avg_10_8x8_neon: 44.0 ( 5.82x)
avg_10_16x16_neon: 179.5 ( 4.81x)
avg_10_32x32_neon: 680.8 ( 5.58x)
avg_10_64x64_neon: 2536.8 ( 5.40x)
avg_10_128x128_neon: 10079.0 ( 5.22x)
avg_12_2x2_neon: 20.8 ( 1.59x)
avg_12_4x4_neon: 25.2 ( 3.09x)
avg_12_8x8_neon: 44.0 ( 5.79x)
avg_12_16x16_neon: 182.2 ( 4.80x)
avg_12_32x32_neon: 696.2 ( 5.46x)
avg_12_64x64_neon: 2548.2 ( 5.38x)
avg_12_128x128_neon: 10133.8 ( 5.19x)

after:
avg_8_2x2_neon: 16.5 ( 1.98x)
avg_8_4x4_neon: 26.2 ( 2.93x)
avg_8_8x8_neon: 31.8 ( 8.55x)
avg_8_16x16_neon: 82.0 (12.02x)
avg_8_32x32_neon: 310.2 (14.12x)
avg_8_64x64_neon: 897.8 (18.26x)
avg_8_128x128_neon: 3608.5 (17.37x)
avg_10_2x2_neon: 19.5 ( 1.69x)
avg_10_4x4_neon: 28.0 ( 2.79x)
avg_10_8x8_neon: 34.8 ( 7.32x)
avg_10_16x16_neon: 119.8 ( 7.35x)
avg_10_32x32_neon: 444.2 ( 8.51x)
avg_10_64x64_neon: 1711.8 ( 8.00x)
avg_10_128x128_neon: 7065.2 ( 7.43x)
avg_12_2x2_neon: 19.5 ( 1.71x)
avg_12_4x4_neon: 24.2 ( 3.22x)
avg_12_8x8_neon: 33.8 ( 7.57x)
avg_12_16x16_neon: 120.2 ( 7.33x)
avg_12_32x32_neon: 442.5 ( 8.53x)
avg_12_64x64_neon: 1706.2 ( 8.02x)
avg_12_128x128_neon: 7010.0 ( 7.46x)

A72
before:
avg_8_2x2_neon: 30.2 ( 1.48x)
avg_8_4x4_neon: 40.0 ( 3.10x)
avg_8_8x8_neon: 91.0 ( 4.14x)
avg_8_16x16_neon: 340.4 ( 3.92x)
avg_8_32x32_neon: 1220.7 ( 4.67x)
avg_8_64x64_neon: 5823.4 ( 3.88x)
avg_8_128x128_neon: 17430.5 ( 4.73x)
avg_10_2x2_neon: 34.0 ( 1.66x)
avg_10_4x4_neon: 45.2 ( 2.73x)
avg_10_8x8_neon: 97.5 ( 3.87x)
avg_10_16x16_neon: 317.7 ( 3.90x)
avg_10_32x32_neon: 1376.2 ( 4.21x)
avg_10_64x64_neon: 5228.1 ( 3.71x)
avg_10_128x128_neon: 16722.2 ( 4.17x)
avg_12_2x2_neon: 31.7 ( 1.76x)
avg_12_4x4_neon: 36.0 ( 3.44x)
avg_12_8x8_neon: 91.7 ( 4.10x)
avg_12_16x16_neon: 297.2 ( 4.13x)
avg_12_32x32_neon: 1400.5 ( 4.14x)
avg_12_64x64_neon: 5379.1 ( 3.51x)
avg_12_128x128_neon: 16715.7 ( 4.17x)

after:
avg_8_2x2_neon: 33.7 ( 1.72x)
avg_8_4x4_neon: 45.5 ( 2.84x)
avg_8_8x8_neon: 65.0 ( 5.98x)
avg_8_16x16_neon: 171.0 ( 7.81x)
avg_8_32x32_neon: 558.2 (10.05x)
avg_8_64x64_neon: 2006.5 (10.61x)
avg_8_128x128_neon: 9158.7 ( 8.96x)
avg_10_2x2_neon: 38.0 ( 1.92x)
avg_10_4x4_neon: 53.2 ( 2.69x)
avg_10_8x8_neon: 95.2 ( 4.08x)
avg_10_16x16_neon: 243.0 ( 5.02x)
avg_10_32x32_neon: 891.7 ( 5.64x)
avg_10_64x64_neon: 3357.7 ( 5.60x)
avg_10_128x128_neon: 12411.7 ( 5.56x)
avg_12_2x2_neon: 34.7 ( 1.97x)
avg_12_4x4_neon: 53.2 ( 2.68x)
avg_12_8x8_neon: 91.7 ( 4.22x)
avg_12_16x16_neon: 239.0 ( 5.08x)
avg_12_32x32_neon: 895.7 ( 5.62x)
avg_12_64x64_neon: 3317.5 ( 5.67x)
avg_12_128x128_neon: 12358.5 ( 5.58x)

A53
before:
avg_8_2x2_neon: 58.3 ( 1.41x)
avg_8_4x4_neon: 101.8 ( 2.21x)
avg_8_8x8_neon: 178.6 ( 4.53x)
avg_8_16x16_neon: 569.5 ( 5.01x)
avg_8_32x32_neon: 1962.5 ( 5.50x)
avg_8_64x64_neon: 8327.8 ( 5.18x)
avg_8_128x128_neon: 31631.3 ( 5.34x)
avg_10_2x2_neon: 54.5 ( 1.56x)
avg_10_4x4_neon: 88.8 ( 2.53x)
avg_10_8x8_neon: 163.6 ( 4.97x)
avg_10_16x16_neon: 550.5 ( 5.16x)
avg_10_32x32_neon: 1942.5 ( 5.64x)
avg_10_64x64_neon: 8783.5 ( 4.98x)
avg_10_128x128_neon: 32617.0 ( 5.25x)
avg_12_2x2_neon: 53.3 ( 1.66x)
avg_12_4x4_neon: 86.8 ( 2.61x)
avg_12_8x8_neon: 156.6 ( 5.12x)
avg_12_16x16_neon: 541.3 ( 5.25x)
avg_12_32x32_neon: 1955.3 ( 5.59x)
avg_12_64x64_neon: 8686.0 ( 5.06x)
avg_12_128x128_neon: 32487.5 ( 5.25x)

after:
avg_8_2x2_neon: 39.5 ( 1.96x)
avg_8_4x4_neon: 65.3 ( 3.41x)
avg_8_8x8_neon: 168.8 ( 4.79x)
avg_8_16x16_neon: 348.0 ( 8.20x)
avg_8_32x32_neon: 1207.5 ( 8.98x)
avg_8_64x64_neon: 6032.3 ( 7.17x)
avg_8_128x128_neon: 22008.5 ( 7.69x)
avg_10_2x2_neon: 55.5 ( 1.52x)
avg_10_4x4_neon: 73.8 ( 3.08x)
avg_10_8x8_neon: 157.8 ( 5.12x)
avg_10_16x16_neon: 445.0 ( 6.43x)
avg_10_32x32_neon: 1587.3 ( 6.87x)
avg_10_64x64_neon: 7738.0 ( 5.68x)
avg_10_128x128_neon: 27813.8 ( 6.14x)
avg_12_2x2_neon: 48.3 ( 1.80x)
avg_12_4x4_neon: 77.0 ( 2.95x)
avg_12_8x8_neon: 161.5 ( 4.98x)
avg_12_16x16_neon: 433.5 ( 6.59x)
avg_12_32x32_neon: 1622.0 ( 6.75x)
avg_12_64x64_neon: 7844.5 ( 5.60x)
avg_12_128x128_neon: 26999.5 ( 6.34x)

Krzysztof

 libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++-
 1 file changed, 121 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 0edc861f97..c9d698ee29 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -217,13 +217,131 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1
 endfunc
 .endm
 
-vvc_avg avg, 8
-vvc_avg avg, 10
-vvc_avg avg, 12
 vvc_avg w_avg, 8
 vvc_avg w_avg, 10
 vvc_avg w_avg, 12
 
+.macro vvc_avg2 bit_depth
+function ff_vvc_avg_\bit_depth\()_neon, export=1
+        mov             x10, #(VVC_MAX_PB_SIZE * 2)
+        movi            v16.8h, #0
+        movi            v17.16b, #255
+        ushr            v17.8h, v17.8h, #(16 - \bit_depth)
+
+        cmp             w4, #8
+        b.gt            16f
+        b.eq            8f
+        cmp             w4, #4
+        b.eq            4f
+
+2:      // width == 2
+        ldr             s0, [x2]
+        subs            w5, w5, #1
+        ldr             s1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             h0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             s0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            2b
+        ret
+
+4:      // width == 4
+        ldr             d0, [x2]
+        subs            w5, w5, #1
+        ldr             d1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             s0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             d0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            4b
+        ret
+
+8:      // width == 8
+        ldr             q0, [x2]
+        subs            w5, w5, #1
+        ldr             q1, [x3]
+.if \bit_depth == 8
+        shadd           v0.8h, v0.8h, v1.8h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             d0, [x0]
+.else
+        shadd           v0.8h, v0.8h, v1.8h
+        srshr           v0.8h, v0.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        str             q0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            8b
+        ret
+
+16:     // width >= 16
+.if \bit_depth == 8
+        sub             x1, x1, w4, sxtw
+.else
+        sub             x1, x1, w4, sxtw #1
+.endif
+        sub             x10, x10, w4, sxtw #1
+3:
+        mov             w6, w4          // width
+1:
+        ldp             q0, q1, [x2], #32
+        subs            w6, w6, #16
+        ldp             q2, q3, [x3], #32
+.if \bit_depth == 8
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        sqrshrun        v0.8b, v4.8h, #6
+        sqrshrun2       v0.16b, v5.8h, #6
+        st1             {v0.16b}, [x0], #16
+.else
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        srshr           v0.8h, v4.8h, #(15 - 1 - \bit_depth)
+        srshr           v1.8h, v5.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smax            v1.8h, v1.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        smin            v1.8h, v1.8h, v17.8h
+        stp             q0, q1, [x0], #32
+.endif
+        b.ne            1b
+
+        subs            w5, w5, #1
+        add             x2, x2, x10
+        add             x3, x3, x10
+        add             x0, x0, x1
+        b.ne            3b
+        ret
+endfunc
+.endm
+
+vvc_avg2 8
+vvc_avg2 10
+vvc_avg2 12
+
 /* x0: int16_t *dst
  * x1: const uint8_t *_src
  * x2: ptrdiff_t _src_stride
-- 
2.47.2

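To see the arithmetic the new code relies on, here is a minimal C sketch
(not part of the patch; names and values are illustrative) of what the
shadd/srshr/smax/smin sequence in the 10/12-bit path computes per sample;
the 8-bit path instead lets sqrshrun do the rounding shift and saturation
in one step:

    #include <stdint.h>
    #include <stdio.h>

    /* SHADD: signed halving add, (a + b) >> 1 with no intermediate overflow. */
    static int16_t shadd(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a + b) >> 1);
    }

    /* SRSHR #n: signed rounding shift right, (x + (1 << (n - 1))) >> n. */
    static int16_t srshr(int16_t x, int n)
    {
        return (int16_t)(((int32_t)x + (1 << (n - 1))) >> n);
    }

    int main(void)
    {
        const int bd = 10;                /* bit depth */
        int16_t src0 = 8191, src1 = 8188; /* two intermediate predictions */
        int16_t avg = srshr(shadd(src0, src1), 15 - 1 - bd);
        /* smax/smin then clamp to the legal sample range [0, (1 << bd) - 1] */
        if (avg < 0)             avg = 0;
        if (avg > (1 << bd) - 1) avg = (1 << bd) - 1;
        printf("%d\n", avg);     /* prints 512 */
        return 0;
    }
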
* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  From: Krzysztof Pyrkosz via ffmpeg-devel
  Date: 2025-02-19 17:40 UTC
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---

Before and after on A78:

before:
dmvr_8_12x20_neon: 86.2 ( 6.90x)
dmvr_8_20x12_neon: 94.8 ( 5.93x)
dmvr_8_20x20_neon: 141.5 ( 6.50x)
dmvr_12_12x20_neon: 158.0 ( 3.76x)
dmvr_12_20x12_neon: 151.2 ( 3.73x)
dmvr_12_20x20_neon: 247.2 ( 3.71x)
dmvr_hv_8_12x20_neon: 423.2 ( 3.75x)
dmvr_hv_8_20x12_neon: 434.0 ( 3.69x)
dmvr_hv_8_20x20_neon: 706.0 ( 3.69x)

after:
dmvr_8_12x20_neon: 77.2 ( 7.70x)
dmvr_8_20x12_neon: 66.5 ( 8.49x)
dmvr_8_20x20_neon: 92.2 ( 9.90x)
dmvr_12_12x20_neon: 80.2 ( 7.38x)
dmvr_12_20x12_neon: 58.2 ( 9.59x)
dmvr_12_20x20_neon: 90.0 (10.15x)
dmvr_hv_8_12x20_neon: 369.0 ( 4.34x)
dmvr_hv_8_20x12_neon: 355.8 ( 4.49x)
dmvr_hv_8_20x20_neon: 574.2 ( 4.51x)

 libavcodec/aarch64/vvc/inter.S | 72 ++++++++++------------------------
 1 file changed, 20 insertions(+), 52 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index c9d698ee29..45add44b6e 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1
 1:
         cbz             w15, 2f
         ldr             q0, [src], #16
-        uxtl            v1.8h, v0.8b
-        uxtl2           v2.8h, v0.16b
-        ushl            v1.8h, v1.8h, v16.8h
-        ushl            v2.8h, v2.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
+        ushll2          v2.8h, v0.16b, #2
         stp             q1, q2, [dst], #32
         b               3f
 2:
         ldr             d0, [src], #8
-        uxtl            v1.8h, v0.8b
-        ushl            v1.8h, v1.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
         str             q1, [dst], #16
 3:
         subs            height, height, #1
         ldr             s3, [src], #4
-        uxtl            v4.8h, v3.8b
-        ushl            v4.4h, v4.4h, v16.4h
+        ushll           v4.8h, v3.8b, #2
         st1             {v4.4h}, [dst], x7
 
         add             src, src, src_stride
@@ -399,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1
         cmp             width, #16
         sub             src_stride, src_stride, x6, lsl #1
         cset            w15, gt                 // width > 16
-        movi            v16.8h, #2              // offset4
         sub             x7, x7, x6, lsl #1
 1:
         cbz             w15, 2f
         ldp             q0, q1, [src], #32
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        uaddl           v4.4s, v1.4h, v16.4h
-        uaddl2          v5.4s, v1.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        ushr            v4.4s, v4.4s, #2
-        ushr            v5.4s, v5.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        uqxtn           v4.4h, v4.4s
-        uqxtn2          v4.8h, v5.4s
-
-        stp             q2, q4, [dst], #32
+        urshr           v0.8h, v0.8h, #2
+        urshr           v1.8h, v1.8h, #2
+
+        stp             q0, q1, [dst], #32
         b               3f
 2:
         ldr             q0, [src], #16
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        str             q2, [dst], #16
+        urshr           v0.8h, v0.8h, #2
+        str             q0, [dst], #16
 3:
         subs            height, height, #1
         ldr             d0, [src], #8
-        uaddl           v3.4s, v0.4h, v16.4h
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v3.4h, v3.4s
-        st1             {v3.4h}, [dst], x7
+        urshr           v0.4h, v0.4h, #2
+        st1             {v0.4h}, [dst], x7
 
         add             src, src, src_stride
         b.ne            1b
@@ -462,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldrb            w10, [x12]
         ldrb            w11, [x12, #1]
         sxtw            x6, w6
-        movi            v30.8h, #(1 << (8 - 7)) // offset1
-        movi            v31.8h, #8              // offset2
         dup             v2.8h, w10              // filter_y[0]
         dup             v3.8h, w11              // filter_y[1]
 
@@ -491,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v16.8h, v16.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
         mla             v16.8h, v17.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        add             v16.8h, v16.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
-        ushr            v7.8h, v16.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
+        urshr           v7.8h, v16.8h, #(8 - 6)
         stp             q6, q7, [x13], #32
 
         cbz             w10, 3f
@@ -504,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v17.8h, v17.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
         mla             v17.8h, v7.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        add             v17.8h, v17.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
-        ushr            v17.8h, v17.8h, #4
+        urshr           v16.8h, v16.8h, #4
+        urshr           v17.8h, v17.8h, #4
         stp             q16, q17, [x14], #32
         b               3f
 2:
@@ -518,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.8h, v6.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
         str             q6, [x13], #16
 
         cbz             w10, 3f
@@ -527,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             q16, [x12], #16
         mul             v16.8h, v16.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
+        urshr           v16.8h, v16.8h, #4
         str             q16, [x14], #16
 3:
         ldur            s5, [src, #1]
@@ -537,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.4h, v6.4h, v0.4h
         mla             v6.4h, v7.4h, v1.4h
-        add             v6.4h, v6.4h, v30.4h
-        ushr            v6.4h, v6.4h, #(8 - 6)
+        urshr           v6.4h, v6.4h, #(8 - 6)
         str             d6, [x13], #8
 
         cbz             w10, 4f
@@ -546,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             d16, [x12], #8
         mul             v16.4h, v16.4h, v2.4h
         mla             v16.4h, v6.4h, v3.4h
-        add             v16.4h, v16.4h, v31.4h
-        ushr            v16.4h, v16.4h, #4
+        urshr           v16.4h, v16.4h, #4
         str             d16, [x14], #8
 4:
         subs            height, height, #1
-- 
2.47.2

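The dmvr_12 part of this diff leans on the fact that a 16-bit rounding
shift cannot overflow here, so the widen/add/shift/saturating-narrow
chain is redundant. A small C sketch of that equivalence (helper names
are made up; this checks arbitrary 16-bit inputs, which is already wider
than what dmvr_12 produces):

    #include <assert.h>
    #include <stdint.h>

    /* Old sequence: UADDL (+2), USHR #2, UQXTN, all done in 32 bits. */
    static uint16_t old_way(uint16_t x)
    {
        uint32_t t = ((uint32_t)x + 2) >> 2;
        return t > 0xFFFF ? 0xFFFF : (uint16_t)t; /* UQXTN saturation */
    }

    /* New sequence: URSHR #2 rounds and shifts directly in 16-bit lanes. */
    static uint16_t new_way(uint16_t x)
    {
        return (uint16_t)(((uint32_t)x + 2) >> 2); /* max 0x4000, no wrap */
    }

    int main(void)
    {
        /* (x + 2) >> 2 <= 0x4000 for any 16-bit x, so the old sequence's
         * saturation never fires and the two agree everywhere. */
        for (uint32_t x = 0; x <= 0xFFFF; x++)
            assert(old_way((uint16_t)x) == new_way((uint16_t)x));
        return 0;
    }
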
* Re: [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  From: Zhao Zhili
  Date: 2025-02-20  8:08 UTC
  To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz

> On Feb 20, 2025, at 01:40, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
>
> [...]
> @@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1
>  1:
>          cbz             w15, 2f
>          ldr             q0, [src], #16
> -        uxtl            v1.8h, v0.8b
> -        uxtl2           v2.8h, v0.16b
> -        ushl            v1.8h, v1.8h, v16.8h
> -        ushl            v2.8h, v2.8h, v16.8h

Please remove the assignment to v16. LGTM otherwise.

> +        ushll           v1.8h, v0.8b, #2
> +        ushll2          v2.8h, v0.16b, #2
>          stp             q1, q2, [dst], #32
> [...]

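The request above follows from the switch to an immediate shift: USHLL
folds the zero-extend and the shift into one instruction, so the v16
register that previously held the shift amount for USHL becomes dead. A
sketch of the per-lane equivalence (illustrative C, not project code):

    #include <assert.h>
    #include <stdint.h>

    /* UXTL, then USHL by a register holding 2: widen, then shift. */
    static uint16_t uxtl_then_ushl(uint8_t x, int16_t shift)
    {
        return (uint16_t)((uint16_t)x << shift);
    }

    /* USHLL #2: the same widen-and-shift, with the amount encoded in
     * the instruction instead of in a register. */
    static uint16_t ushll_imm(uint8_t x)
    {
        return (uint16_t)((uint16_t)x << 2);
    }

    int main(void)
    {
        for (int x = 0; x < 256; x++)
            assert(uxtl_then_ushl((uint8_t)x, 2) == ushll_imm((uint8_t)x));
        return 0;
    }
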
* Re: [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  From: Martin Storsjö
  Date: 2025-03-01 22:34 UTC
  To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz

On Wed, 19 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> [...]
> @@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1
>  1:
>          cbz             w15, 2f
>          ldr             q0, [src], #16
> -        uxtl            v1.8h, v0.8b
> -        uxtl2           v2.8h, v0.16b
> -        ushl            v1.8h, v1.8h, v16.8h
> -        ushl            v2.8h, v2.8h, v16.8h
> +        ushll           v1.8h, v0.8b, #2
> +        ushll2          v2.8h, v0.16b, #2

In addition to what's mentioned in the commit message, this bit is
semantically a different change, so we should probably mention that in
the commit message as well.

If you're reposting patch 1/2 of this set, can you update the commit
message on this one to mention this (and move the measurements into the
actual commit message)?

Other than that, this patch looks very good to me, thanks!

// Martin

* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  From: Zhao Zhili
  Date: 2025-02-20  7:20 UTC
  To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz

> On Feb 20, 2025, at 01:40, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
>
> [...]
> +.macro vvc_avg2 bit_depth
> +function ff_vvc_avg_\bit_depth\()_neon, export=1
> +        mov             x10, #(VVC_MAX_PB_SIZE * 2)
> +        movi            v16.8h, #0
> +        movi            v17.16b, #255
> +        ushr            v17.8h, v17.8h, #(16 - \bit_depth)

Please set v16 and v17 only for bit_depth > 8. LGTM otherwise.

> [...]

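The suggestion above works because the 8-bit path never reads v16 or
v17: SQRSHRUN already rounds, shifts, and saturates into the unsigned
8-bit range in a single instruction, so the explicit clamp constants are
only needed for 10 and 12 bit. A sketch of SQRSHRUN's per-lane behaviour
(paraphrased in C for illustration, not project code):

    #include <stdint.h>
    #include <stdio.h>

    /* SQRSHRUN #n: signed saturating rounded shift right unsigned narrow.
     * Assumes arithmetic right shift of negative values, as on mainstream
     * compilers. */
    static uint8_t sqrshrun(int16_t x, int n)
    {
        int32_t t = ((int32_t)x + (1 << (n - 1))) >> n;
        return t < 0 ? 0 : t > 255 ? 255 : (uint8_t)t;
    }

    int main(void)
    {
        /* Negative input clamps to 0, oversized input clamps to 255. */
        printf("%u %u %u\n", sqrshrun(-500, 6), sqrshrun(8000, 6),
               sqrshrun(30000, 6));  /* prints: 0 125 255 */
        return 0;
    }
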
* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  From: Krzysztof Pyrkosz via ffmpeg-devel
  Date: 2025-02-20 18:49 UTC
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---
 libavcodec/aarch64/vvc/inter.S | 125 ++++++++++++++++++++++++++++++++-
 1 file changed, 122 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 0edc861f97..b65920e640 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -217,13 +217,132 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1
 endfunc
 .endm
 
-vvc_avg avg, 8
-vvc_avg avg, 10
-vvc_avg avg, 12
 vvc_avg w_avg, 8
 vvc_avg w_avg, 10
 vvc_avg w_avg, 12
 
+.macro vvc_avg2 bit_depth
+function ff_vvc_avg_\bit_depth\()_neon, export=1
+        mov             x10, #(VVC_MAX_PB_SIZE * 2)
+.if \bit_depth != 8
+        movi            v16.8h, #0
+        movi            v17.16b, #255
+        ushr            v17.8h, v17.8h, #(16 - \bit_depth)
+.endif
+        cmp             w4, #8
+        b.gt            16f
+        b.eq            8f
+        cmp             w4, #4
+        b.eq            4f
+
+2:      // width == 2
+        ldr             s0, [x2]
+        subs            w5, w5, #1
+        ldr             s1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             h0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             s0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            2b
+        ret
+
+4:      // width == 4
+        ldr             d0, [x2]
+        subs            w5, w5, #1
+        ldr             d1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             s0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             d0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            4b
+        ret
+
+8:      // width == 8
+        ldr             q0, [x2]
+        subs            w5, w5, #1
+        ldr             q1, [x3]
+.if \bit_depth == 8
+        shadd           v0.8h, v0.8h, v1.8h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             d0, [x0]
+.else
+        shadd           v0.8h, v0.8h, v1.8h
+        srshr           v0.8h, v0.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        str             q0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            8b
+        ret
+
+16:     // width >= 16
+.if \bit_depth == 8
+        sub             x1, x1, w4, sxtw
+.else
+        sub             x1, x1, w4, sxtw #1
+.endif
+        sub             x10, x10, w4, sxtw #1
+3:
+        mov             w6, w4          // width
+1:
+        ldp             q0, q1, [x2], #32
+        subs            w6, w6, #16
+        ldp             q2, q3, [x3], #32
+.if \bit_depth == 8
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        sqrshrun        v0.8b, v4.8h, #6
+        sqrshrun2       v0.16b, v5.8h, #6
+        st1             {v0.16b}, [x0], #16
+.else
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        srshr           v0.8h, v4.8h, #(15 - 1 - \bit_depth)
+        srshr           v1.8h, v5.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smax            v1.8h, v1.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        smin            v1.8h, v1.8h, v17.8h
+        stp             q0, q1, [x0], #32
+.endif
+        b.ne            1b
+
+        subs            w5, w5, #1
+        add             x2, x2, x10
+        add             x3, x3, x10
+        add             x0, x0, x1
+        b.ne            3b
+        ret
+endfunc
+.endm
+
+vvc_avg2 8
+vvc_avg2 10
+vvc_avg2 12
+
 /* x0: int16_t *dst
  * x1: const uint8_t *_src
  * x2: ptrdiff_t _src_stride
-- 
2.47.2

* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  From: Krzysztof Pyrkosz via ffmpeg-devel
  Date: 2025-02-20 18:49 UTC
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---
 libavcodec/aarch64/vvc/inter.S | 73 ++++++++++------------------------
 1 file changed, 20 insertions(+), 53 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index b65920e640..09f0627b20 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -365,27 +365,22 @@ function ff_vvc_dmvr_8_neon, export=1
         cmp             width, #16
         sub             src_stride, src_stride, x6
         cset            w15, gt                 // width > 16
-        movi            v16.8h, #2              // DMVR_SHIFT
         sub             x7, x7, x6, lsl #1
 1:
         cbz             w15, 2f
         ldr             q0, [src], #16
-        uxtl            v1.8h, v0.8b
-        uxtl2           v2.8h, v0.16b
-        ushl            v1.8h, v1.8h, v16.8h
-        ushl            v2.8h, v2.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
+        ushll2          v2.8h, v0.16b, #2
         stp             q1, q2, [dst], #32
         b               3f
 2:
         ldr             d0, [src], #8
-        uxtl            v1.8h, v0.8b
-        ushl            v1.8h, v1.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
         str             q1, [dst], #16
 3:
         subs            height, height, #1
         ldr             s3, [src], #4
-        uxtl            v4.8h, v3.8b
-        ushl            v4.4h, v4.4h, v16.4h
+        ushll           v4.8h, v3.8b, #2
         st1             {v4.4h}, [dst], x7
 
         add             src, src, src_stride
@@ -400,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1
         cmp             width, #16
         sub             src_stride, src_stride, x6, lsl #1
         cset            w15, gt                 // width > 16
-        movi            v16.8h, #2              // offset4
         sub             x7, x7, x6, lsl #1
 1:
         cbz             w15, 2f
         ldp             q0, q1, [src], #32
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        uaddl           v4.4s, v1.4h, v16.4h
-        uaddl2          v5.4s, v1.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        ushr            v4.4s, v4.4s, #2
-        ushr            v5.4s, v5.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        uqxtn           v4.4h, v4.4s
-        uqxtn2          v4.8h, v5.4s
-
-        stp             q2, q4, [dst], #32
+        urshr           v0.8h, v0.8h, #2
+        urshr           v1.8h, v1.8h, #2
+
+        stp             q0, q1, [dst], #32
         b               3f
 2:
         ldr             q0, [src], #16
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        str             q2, [dst], #16
+        urshr           v0.8h, v0.8h, #2
+        str             q0, [dst], #16
 3:
         subs            height, height, #1
         ldr             d0, [src], #8
-        uaddl           v3.4s, v0.4h, v16.4h
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v3.4h, v3.4s
-        st1             {v3.4h}, [dst], x7
+        urshr           v0.4h, v0.4h, #2
+        st1             {v0.4h}, [dst], x7
 
         add             src, src, src_stride
         b.ne            1b
@@ -463,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldrb            w10, [x12]
         ldrb            w11, [x12, #1]
         sxtw            x6, w6
-        movi            v30.8h, #(1 << (8 - 7)) // offset1
-        movi            v31.8h, #8              // offset2
         dup             v2.8h, w10              // filter_y[0]
         dup             v3.8h, w11              // filter_y[1]
 
@@ -492,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v16.8h, v16.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
         mla             v16.8h, v17.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        add             v16.8h, v16.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
-        ushr            v7.8h, v16.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
+        urshr           v7.8h, v16.8h, #(8 - 6)
         stp             q6, q7, [x13], #32
 
         cbz             w10, 3f
@@ -505,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v17.8h, v17.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
         mla             v17.8h, v7.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        add             v17.8h, v17.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
-        ushr            v17.8h, v17.8h, #4
+        urshr           v16.8h, v16.8h, #4
+        urshr           v17.8h, v17.8h, #4
         stp             q16, q17, [x14], #32
         b               3f
 2:
@@ -519,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.8h, v6.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
         str             q6, [x13], #16
 
         cbz             w10, 3f
@@ -528,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             q16, [x12], #16
         mul             v16.8h, v16.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
+        urshr           v16.8h, v16.8h, #4
         str             q16, [x14], #16
 3:
         ldur            s5, [src, #1]
@@ -538,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.4h, v6.4h, v0.4h
         mla             v6.4h, v7.4h, v1.4h
-        add             v6.4h, v6.4h, v30.4h
-        ushr            v6.4h, v6.4h, #(8 - 6)
+        urshr           v6.4h, v6.4h, #(8 - 6)
         str             d6, [x13], #8
 
         cbz             w10, 4f
@@ -547,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             d16, [x12], #8
         mul             v16.4h, v16.4h, v2.4h
         mla             v16.4h, v6.4h, v3.4h
-        add             v16.4h, v16.4h, v31.4h
-        ushr            v16.4h, v16.4h, #4
+        urshr           v16.4h, v16.4h, #4
         str             d16, [x14], #8
 4:
         subs            height, height, #1
-- 
2.47.2

* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  From: Zhao Zhili
  Date: 2025-02-26  8:54 UTC
  To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz

> On Feb 21, 2025, at 02:49, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
>
> ---
>  libavcodec/aarch64/vvc/inter.S | 125 ++++++++++++++++++++++++++++++++-
>  1 file changed, 122 insertions(+), 3 deletions(-)

The patchset LGTM.

* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  From: Martin Storsjö
  Date: 2025-03-01 22:21 UTC
  To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz

On Wed, 19 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---

As you've noticed in later patches, most of this commentary _is_ valuable
to keep in the commit message, so I'd keep most of this, including the
performance diff, in the commit message (i.e. above the ---).

> This patch replaces integer widening with halving addition, and the
> multi-step "emulated" rounding shift with a single asm instruction doing
> exactly that. This pattern repeats in other functions in this file; I
> fixed some of them in the succeeding patch. There's a lot of performance
> to be gained there.
>
> I didn't modify the existing function because it adds a few extra steps
> solely for the shared w_avg implementation (every cycle matters), but
> also because I find this linear version easier to digest and understand.

That's probably reasonable - but if the avg codepath in vvc_avg is unused
now, we should remove it; that makes the change clearer, since we see the
removed old codepath together with the newly added one in the same patch.

> Besides, I noticed that removing the smin and smax instructions used for
> clamping the values in the 10 and 12 bit_depth instantiations does not
> affect the checkasm result, but it breaks FATE.

It would probably be good if we could improve the checkasm test to hit
those cases too, but that's of course a separate question.

> Benchmarks before and after:
> [...]
>
> libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++-
> 1 file changed, 121 insertions(+), 3 deletions(-)

Overall the change looks reasonable to me, thanks, but remove the now
unused parts and update the patch to include the valuable comments and
benchmarks above the "---" bit.

// Martin

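The checkasm gap Martin mentions can be reasoned about numerically: the
smax/smin clamps only fire when the rounded average leaves the range
[0, (1 << bd) - 1], and whether a random test reaches that depends
entirely on how it fills the intermediate buffers. A hedged C sketch of
inputs that would exercise both clamps at bit depth 10 (values
illustrative; this is not the actual test harness):

    #include <stdint.h>
    #include <stdio.h>

    static int16_t shadd(int16_t a, int16_t b)
    {
        return (int16_t)(((int32_t)a + b) >> 1);
    }

    static int16_t srshr(int16_t x, int n)
    {
        /* assumes arithmetic right shift of negative values */
        return (int16_t)(((int32_t)x + (1 << (n - 1))) >> n);
    }

    int main(void)
    {
        const int bd = 10, max = (1 << bd) - 1;
        /* Negative intermediates need the smax clamp... */
        int16_t lo = srshr(shadd(-300, -300), 15 - 1 - bd);
        /* ...and large positive ones need the smin clamp. */
        int16_t hi = srshr(shadd(20000, 20000), 15 - 1 - bd);
        printf("%d and %d need clamping into 0..%d\n", lo, hi, max);
        return 0;
    }
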