* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
@ 2025-02-19 17:40 Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
  0 siblings, 2 replies; 6+ messages in thread

From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-19 17:40 UTC
To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---

This patch replaces integer widening with halving addition, and the
multi-step "emulated" rounding shift with a single asm instruction that
does exactly that. The same pattern repeats in other functions in this
file; I fixed some of them in the succeeding patch. There's a lot of
performance to be gained there.

I didn't modify the existing function, because it adds a few extra steps
solely for the shared w_avg implementation (every cycle matters), and
also because I find this linear version easier to digest and understand.

Besides, I noticed that removing the smin and smax instructions used for
clamping the values in the 10 and 12 bit_depth instantiations does not
affect the checkasm result, but it does break FATE.

Benchmarks before and after:

A78
avg_8_2x2_neon:      21.0 ( 1.55x)
avg_8_4x4_neon:      25.8 ( 3.05x)
avg_8_8x8_neon:      45.0 ( 5.86x)
avg_8_16x16_neon:    178.5 ( 5.49x)
avg_8_32x32_neon:    709.2 ( 6.20x)
avg_8_64x64_neon:    2686.2 ( 6.12x)
avg_8_128x128_neon:  10734.2 ( 5.88x)
avg_10_2x2_neon:     19.0 ( 1.75x)
avg_10_4x4_neon:     28.2 ( 2.76x)
avg_10_8x8_neon:     44.0 ( 5.82x)
avg_10_16x16_neon:   179.5 ( 4.81x)
avg_10_32x32_neon:   680.8 ( 5.58x)
avg_10_64x64_neon:   2536.8 ( 5.40x)
avg_10_128x128_neon: 10079.0 ( 5.22x)
avg_12_2x2_neon:     20.8 ( 1.59x)
avg_12_4x4_neon:     25.2 ( 3.09x)
avg_12_8x8_neon:     44.0 ( 5.79x)
avg_12_16x16_neon:   182.2 ( 4.80x)
avg_12_32x32_neon:   696.2 ( 5.46x)
avg_12_64x64_neon:   2548.2 ( 5.38x)
avg_12_128x128_neon: 10133.8 ( 5.19x)

avg_8_2x2_neon:      16.5 ( 1.98x)
avg_8_4x4_neon:      26.2 ( 2.93x)
avg_8_8x8_neon:      31.8 ( 8.55x)
avg_8_16x16_neon:    82.0 (12.02x)
avg_8_32x32_neon:    310.2 (14.12x)
avg_8_64x64_neon:    897.8 (18.26x)
avg_8_128x128_neon:  3608.5 (17.37x)
avg_10_2x2_neon:     19.5 ( 1.69x)
avg_10_4x4_neon:     28.0 ( 2.79x)
avg_10_8x8_neon:     34.8 ( 7.32x)
avg_10_16x16_neon:   119.8 ( 7.35x)
avg_10_32x32_neon:   444.2 ( 8.51x)
avg_10_64x64_neon:   1711.8 ( 8.00x)
avg_10_128x128_neon: 7065.2 ( 7.43x)
avg_12_2x2_neon:     19.5 ( 1.71x)
avg_12_4x4_neon:     24.2 ( 3.22x)
avg_12_8x8_neon:     33.8 ( 7.57x)
avg_12_16x16_neon:   120.2 ( 7.33x)
avg_12_32x32_neon:   442.5 ( 8.53x)
avg_12_64x64_neon:   1706.2 ( 8.02x)
avg_12_128x128_neon: 7010.0 ( 7.46x)

A72
avg_8_2x2_neon:      30.2 ( 1.48x)
avg_8_4x4_neon:      40.0 ( 3.10x)
avg_8_8x8_neon:      91.0 ( 4.14x)
avg_8_16x16_neon:    340.4 ( 3.92x)
avg_8_32x32_neon:    1220.7 ( 4.67x)
avg_8_64x64_neon:    5823.4 ( 3.88x)
avg_8_128x128_neon:  17430.5 ( 4.73x)
avg_10_2x2_neon:     34.0 ( 1.66x)
avg_10_4x4_neon:     45.2 ( 2.73x)
avg_10_8x8_neon:     97.5 ( 3.87x)
avg_10_16x16_neon:   317.7 ( 3.90x)
avg_10_32x32_neon:   1376.2 ( 4.21x)
avg_10_64x64_neon:   5228.1 ( 3.71x)
avg_10_128x128_neon: 16722.2 ( 4.17x)
avg_12_2x2_neon:     31.7 ( 1.76x)
avg_12_4x4_neon:     36.0 ( 3.44x)
avg_12_8x8_neon:     91.7 ( 4.10x)
avg_12_16x16_neon:   297.2 ( 4.13x)
avg_12_32x32_neon:   1400.5 ( 4.14x)
avg_12_64x64_neon:   5379.1 ( 3.51x)
avg_12_128x128_neon: 16715.7 ( 4.17x)

avg_8_2x2_neon:      33.7 ( 1.72x)
avg_8_4x4_neon:      45.5 ( 2.84x)
avg_8_8x8_neon:      65.0 ( 5.98x)
avg_8_16x16_neon:    171.0 ( 7.81x)
avg_8_32x32_neon:    558.2 (10.05x)
avg_8_64x64_neon:    2006.5 (10.61x)
avg_8_128x128_neon:  9158.7 ( 8.96x)
avg_10_2x2_neon:     38.0 ( 1.92x)
avg_10_4x4_neon:     53.2 ( 2.69x)
avg_10_8x8_neon:     95.2 ( 4.08x)
avg_10_16x16_neon:   243.0 ( 5.02x)
avg_10_32x32_neon:   891.7 ( 5.64x)
avg_10_64x64_neon:   3357.7 ( 5.60x)
avg_10_128x128_neon: 12411.7 ( 5.56x)
avg_12_2x2_neon:     34.7 ( 1.97x)
avg_12_4x4_neon:     53.2 ( 2.68x)
avg_12_8x8_neon:     91.7 ( 4.22x)
avg_12_16x16_neon:   239.0 ( 5.08x)
avg_12_32x32_neon:   895.7 ( 5.62x)
avg_12_64x64_neon:   3317.5 ( 5.67x)
avg_12_128x128_neon: 12358.5 ( 5.58x)

A53
avg_8_2x2_neon:      58.3 ( 1.41x)
avg_8_4x4_neon:      101.8 ( 2.21x)
avg_8_8x8_neon:      178.6 ( 4.53x)
avg_8_16x16_neon:    569.5 ( 5.01x)
avg_8_32x32_neon:    1962.5 ( 5.50x)
avg_8_64x64_neon:    8327.8 ( 5.18x)
avg_8_128x128_neon:  31631.3 ( 5.34x)
avg_10_2x2_neon:     54.5 ( 1.56x)
avg_10_4x4_neon:     88.8 ( 2.53x)
avg_10_8x8_neon:     163.6 ( 4.97x)
avg_10_16x16_neon:   550.5 ( 5.16x)
avg_10_32x32_neon:   1942.5 ( 5.64x)
avg_10_64x64_neon:   8783.5 ( 4.98x)
avg_10_128x128_neon: 32617.0 ( 5.25x)
avg_12_2x2_neon:     53.3 ( 1.66x)
avg_12_4x4_neon:     86.8 ( 2.61x)
avg_12_8x8_neon:     156.6 ( 5.12x)
avg_12_16x16_neon:   541.3 ( 5.25x)
avg_12_32x32_neon:   1955.3 ( 5.59x)
avg_12_64x64_neon:   8686.0 ( 5.06x)
avg_12_128x128_neon: 32487.5 ( 5.25x)

avg_8_2x2_neon:      39.5 ( 1.96x)
avg_8_4x4_neon:      65.3 ( 3.41x)
avg_8_8x8_neon:      168.8 ( 4.79x)
avg_8_16x16_neon:    348.0 ( 8.20x)
avg_8_32x32_neon:    1207.5 ( 8.98x)
avg_8_64x64_neon:    6032.3 ( 7.17x)
avg_8_128x128_neon:  22008.5 ( 7.69x)
avg_10_2x2_neon:     55.5 ( 1.52x)
avg_10_4x4_neon:     73.8 ( 3.08x)
avg_10_8x8_neon:     157.8 ( 5.12x)
avg_10_16x16_neon:   445.0 ( 6.43x)
avg_10_32x32_neon:   1587.3 ( 6.87x)
avg_10_64x64_neon:   7738.0 ( 5.68x)
avg_10_128x128_neon: 27813.8 ( 6.14x)
avg_12_2x2_neon:     48.3 ( 1.80x)
avg_12_4x4_neon:     77.0 ( 2.95x)
avg_12_8x8_neon:     161.5 ( 4.98x)
avg_12_16x16_neon:   433.5 ( 6.59x)
avg_12_32x32_neon:   1622.0 ( 6.75x)
avg_12_64x64_neon:   7844.5 ( 5.60x)
avg_12_128x128_neon: 26999.5 ( 6.34x)

Krzysztof

 libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++-
 1 file changed, 121 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 0edc861f97..c9d698ee29 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -217,13 +217,131 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1
 endfunc
 .endm
 
-vvc_avg avg, 8
-vvc_avg avg, 10
-vvc_avg avg, 12
 vvc_avg w_avg, 8
 vvc_avg w_avg, 10
 vvc_avg w_avg, 12
 
+.macro vvc_avg2 bit_depth
+function ff_vvc_avg_\bit_depth\()_neon, export=1
+        mov             x10, #(VVC_MAX_PB_SIZE * 2)
+        movi            v16.8h, #0
+        movi            v17.16b, #255
+        ushr            v17.8h, v17.8h, #(16 - \bit_depth)
+
+        cmp             w4, #8
+        b.gt            16f
+        b.eq            8f
+        cmp             w4, #4
+        b.eq            4f
+
+2:      // width == 2
+        ldr             s0, [x2]
+        subs            w5, w5, #1
+        ldr             s1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             h0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             s0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            2b
+        ret
+
+4:      // width == 4
+        ldr             d0, [x2]
+        subs            w5, w5, #1
+        ldr             d1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             s0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             d0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            4b
+        ret
+
+8:      // width == 8
+        ldr             q0, [x2]
+        subs            w5, w5, #1
+        ldr             q1, [x3]
+.if \bit_depth == 8
+        shadd           v0.8h, v0.8h, v1.8h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             d0, [x0]
+.else
+        shadd           v0.8h, v0.8h, v1.8h
+        srshr           v0.8h, v0.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        str             q0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            8b
+        ret
+
+16:     // width >= 16
+.if \bit_depth == 8
+        sub             x1, x1, w4, sxtw
+.else
+        sub             x1, x1, w4, sxtw #1
+.endif
+        sub             x10, x10, w4, sxtw #1
+3:
+        mov             w6, w4 // width
+1:
+        ldp             q0, q1, [x2], #32
+        subs            w6, w6, #16
+        ldp             q2, q3, [x3], #32
+.if \bit_depth == 8
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        sqrshrun        v0.8b, v4.8h, #6
+        sqrshrun2       v0.16b, v5.8h, #6
+        st1             {v0.16b}, [x0], #16
+.else
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        srshr           v0.8h, v4.8h, #(15 - 1 - \bit_depth)
+        srshr           v1.8h, v5.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smax            v1.8h, v1.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        smin            v1.8h, v1.8h, v17.8h
+        stp             q0, q1, [x0], #32
+.endif
+        b.ne            1b
+
+        subs            w5, w5, #1
+        add             x2, x2, x10
+        add             x3, x3, x10
+        add             x0, x0, x1
+        b.ne            3b
+        ret
+endfunc
+.endm
+
+vvc_avg2 8
+vvc_avg2 10
+vvc_avg2 12
+
 /* x0: int16_t *dst
  * x1: const uint8_t *_src
  * x2: ptrdiff_t _src_stride
-- 
2.47.2
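For readers without the NEON reference at hand, the core of the change is that shadd computes (a + b) >> 1 directly on 16-bit lanes, so the sum never has to be widened to 32 bits, and srshr/sqrshrun fold the "add rounding offset, then shift" sequence into one instruction. Below is a rough C-intrinsics sketch of one 8-sample step, assuming bit_depth == 10; the helper names and the standalone-function framing are illustrative only, the real implementation is the assembly in the patch above.

#include <arm_neon.h>
#include <stdint.h>

/* Sketch of the high-bit-depth path (bit_depth = 10 assumed):
 * shadd -> vhaddq_s16, srshr -> vrshrq_n_s16, smax/smin -> clamp. */
static inline int16x8_t vvc_avg10_step_sketch(int16x8_t a, int16x8_t b)
{
    int16x8_t sum = vhaddq_s16(a, b);                  /* (a + b) >> 1, no widening */
    int16x8_t out = vrshrq_n_s16(sum, 15 - 1 - 10);    /* single rounding shift     */
    out = vmaxq_s16(out, vdupq_n_s16(0));              /* clamp to [0, 2^10 - 1];   */
    return vminq_s16(out, vdupq_n_s16((1 << 10) - 1)); /* dropping this breaks FATE */
}

/* Sketch of the 8-bit path: sqrshrun performs the rounding shift, the clamp
 * to [0, 255] and the narrowing to bytes in a single instruction. */
static inline uint8x8_t vvc_avg8_step_sketch(int16x8_t a, int16x8_t b)
{
    return vqrshrun_n_s16(vhaddq_s16(a, b), 15 - 1 - 8);
}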
* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction 2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-19 17:40 ` Krzysztof Pyrkosz via ffmpeg-devel 2025-02-20 8:08 ` Zhao Zhili 2025-02-20 7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili 1 sibling, 1 reply; 6+ messages in thread From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-19 17:40 UTC (permalink / raw) To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz --- Before and after on A78 dmvr_8_12x20_neon: 86.2 ( 6.90x) dmvr_8_20x12_neon: 94.8 ( 5.93x) dmvr_8_20x20_neon: 141.5 ( 6.50x) dmvr_12_12x20_neon: 158.0 ( 3.76x) dmvr_12_20x12_neon: 151.2 ( 3.73x) dmvr_12_20x20_neon: 247.2 ( 3.71x) dmvr_hv_8_12x20_neon: 423.2 ( 3.75x) dmvr_hv_8_20x12_neon: 434.0 ( 3.69x) dmvr_hv_8_20x20_neon: 706.0 ( 3.69x) dmvr_8_12x20_neon: 77.2 ( 7.70x) dmvr_8_20x12_neon: 66.5 ( 8.49x) dmvr_8_20x20_neon: 92.2 ( 9.90x) dmvr_12_12x20_neon: 80.2 ( 7.38x) dmvr_12_20x12_neon: 58.2 ( 9.59x) dmvr_12_20x20_neon: 90.0 (10.15x) dmvr_hv_8_12x20_neon: 369.0 ( 4.34x) dmvr_hv_8_20x12_neon: 355.8 ( 4.49x) dmvr_hv_8_20x20_neon: 574.2 ( 4.51x) libavcodec/aarch64/vvc/inter.S | 72 ++++++++++------------------------ 1 file changed, 20 insertions(+), 52 deletions(-) diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S index c9d698ee29..45add44b6e 100644 --- a/libavcodec/aarch64/vvc/inter.S +++ b/libavcodec/aarch64/vvc/inter.S @@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1 1: cbz w15, 2f ldr q0, [src], #16 - uxtl v1.8h, v0.8b - uxtl2 v2.8h, v0.16b - ushl v1.8h, v1.8h, v16.8h - ushl v2.8h, v2.8h, v16.8h + ushll v1.8h, v0.8b, #2 + ushll2 v2.8h, v0.16b, #2 stp q1, q2, [dst], #32 b 3f 2: ldr d0, [src], #8 - uxtl v1.8h, v0.8b - ushl v1.8h, v1.8h, v16.8h + ushll v1.8h, v0.8b, #2 str q1, [dst], #16 3: subs height, height, #1 ldr s3, [src], #4 - uxtl v4.8h, v3.8b - ushl v4.4h, v4.4h, v16.4h + ushll v4.8h, v3.8b, #2 st1 {v4.4h}, [dst], x7 add src, src, src_stride @@ -399,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1 cmp width, #16 sub src_stride, src_stride, x6, lsl #1 cset w15, gt // width > 16 - movi v16.8h, #2 // offset4 sub x7, x7, x6, lsl #1 1: cbz w15, 2f ldp q0, q1, [src], #32 - uaddl v2.4s, v0.4h, v16.4h - uaddl2 v3.4s, v0.8h, v16.8h - uaddl v4.4s, v1.4h, v16.4h - uaddl2 v5.4s, v1.8h, v16.8h - ushr v2.4s, v2.4s, #2 - ushr v3.4s, v3.4s, #2 - ushr v4.4s, v4.4s, #2 - ushr v5.4s, v5.4s, #2 - uqxtn v2.4h, v2.4s - uqxtn2 v2.8h, v3.4s - uqxtn v4.4h, v4.4s - uqxtn2 v4.8h, v5.4s - - stp q2, q4, [dst], #32 + urshr v0.8h, v0.8h, #2 + urshr v1.8h, v1.8h, #2 + + stp q0, q1, [dst], #32 b 3f 2: ldr q0, [src], #16 - uaddl v2.4s, v0.4h, v16.4h - uaddl2 v3.4s, v0.8h, v16.8h - ushr v2.4s, v2.4s, #2 - ushr v3.4s, v3.4s, #2 - uqxtn v2.4h, v2.4s - uqxtn2 v2.8h, v3.4s - str q2, [dst], #16 + urshr v0.8h, v0.8h, #2 + str q0, [dst], #16 3: subs height, height, #1 ldr d0, [src], #8 - uaddl v3.4s, v0.4h, v16.4h - ushr v3.4s, v3.4s, #2 - uqxtn v3.4h, v3.4s - st1 {v3.4h}, [dst], x7 + urshr v0.4h, v0.4h, #2 + st1 {v0.4h}, [dst], x7 add src, src, src_stride b.ne 1b @@ -462,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1 ldrb w10, [x12] ldrb w11, [x12, #1] sxtw x6, w6 - movi v30.8h, #(1 << (8 - 7)) // offset1 - movi v31.8h, #8 // offset2 dup v2.8h, w10 // filter_y[0] dup v3.8h, w11 // filter_y[1] @@ -491,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1 mul v16.8h, v16.8h, v0.8h mla v6.8h, v7.8h, v1.8h mla v16.8h, 
v17.8h, v1.8h - add v6.8h, v6.8h, v30.8h - add v16.8h, v16.8h, v30.8h - ushr v6.8h, v6.8h, #(8 - 6) - ushr v7.8h, v16.8h, #(8 - 6) + urshr v6.8h, v6.8h, #(8 - 6) + urshr v7.8h, v16.8h, #(8 - 6) stp q6, q7, [x13], #32 cbz w10, 3f @@ -504,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1 mul v17.8h, v17.8h, v2.8h mla v16.8h, v6.8h, v3.8h mla v17.8h, v7.8h, v3.8h - add v16.8h, v16.8h, v31.8h - add v17.8h, v17.8h, v31.8h - ushr v16.8h, v16.8h, #4 - ushr v17.8h, v17.8h, #4 + urshr v16.8h, v16.8h, #4 + urshr v17.8h, v17.8h, #4 stp q16, q17, [x14], #32 b 3f 2: @@ -518,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 uxtl v6.8h, v4.8b mul v6.8h, v6.8h, v0.8h mla v6.8h, v7.8h, v1.8h - add v6.8h, v6.8h, v30.8h - ushr v6.8h, v6.8h, #(8 - 6) + urshr v6.8h, v6.8h, #(8 - 6) str q6, [x13], #16 cbz w10, 3f @@ -527,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 ldr q16, [x12], #16 mul v16.8h, v16.8h, v2.8h mla v16.8h, v6.8h, v3.8h - add v16.8h, v16.8h, v31.8h - ushr v16.8h, v16.8h, #4 + urshr v16.8h, v16.8h, #4 str q16, [x14], #16 3: ldur s5, [src, #1] @@ -537,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 uxtl v6.8h, v4.8b mul v6.4h, v6.4h, v0.4h mla v6.4h, v7.4h, v1.4h - add v6.4h, v6.4h, v30.4h - ushr v6.4h, v6.4h, #(8 - 6) + urshr v6.4h, v6.4h, #(8 - 6) str d6, [x13], #8 cbz w10, 4f @@ -546,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 ldr d16, [x12], #8 mul v16.4h, v16.4h, v2.4h mla v16.4h, v6.4h, v3.4h - add v16.4h, v16.4h, v31.4h - ushr v16.4h, v16.4h, #4 + urshr v16.4h, v16.4h, #4 str d16, [x14], #8 4: subs height, height, #1 -- 2.47.2 _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". ^ permalink raw reply [flat|nested] 6+ messages in thread
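The dmvr changes in this patch follow the same idea: fold a multi-instruction widen/offset/shift/narrow sequence into one immediate-shift instruction. A rough C-intrinsics sketch of the two per-vector simplifications follows; the helper names are made up for illustration and the surrounding loop handling is omitted.

#include <arm_neon.h>
#include <stdint.h>

/* ff_vvc_dmvr_8: uxtl + ushl (shift amount held in a vector register)
 * becomes a single ushll with an immediate shift of 2. */
static inline uint16x8_t dmvr8_scale_sketch(uint8x8_t px)
{
    return vshll_n_u8(px, 2);            /* ushll v1.8h, v0.8b, #2 */
}

/* ff_vvc_dmvr_12: widen to 32 bits, add offset 2, shift right by 2 and
 * saturating-narrow back becomes a single urshr, i.e. (x + 2) >> 2 per
 * 16-bit lane.  The rounded result always fits in 16 bits, so the old
 * saturating narrow could never clip anyway. */
static inline uint16x8_t dmvr12_round_sketch(uint16x8_t px)
{
    return vrshrq_n_u16(px, 2);          /* urshr v0.8h, v0.8h, #2 */
}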
* Re: [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction 2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-20 8:08 ` Zhao Zhili 0 siblings, 0 replies; 6+ messages in thread From: Zhao Zhili @ 2025-02-20 8:08 UTC (permalink / raw) To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz > On Feb 20, 2025, at 01:40, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote: > > --- > > Before and after on A78 > > dmvr_8_12x20_neon: 86.2 ( 6.90x) > dmvr_8_20x12_neon: 94.8 ( 5.93x) > dmvr_8_20x20_neon: 141.5 ( 6.50x) > dmvr_12_12x20_neon: 158.0 ( 3.76x) > dmvr_12_20x12_neon: 151.2 ( 3.73x) > dmvr_12_20x20_neon: 247.2 ( 3.71x) > dmvr_hv_8_12x20_neon: 423.2 ( 3.75x) > dmvr_hv_8_20x12_neon: 434.0 ( 3.69x) > dmvr_hv_8_20x20_neon: 706.0 ( 3.69x) > > dmvr_8_12x20_neon: 77.2 ( 7.70x) > dmvr_8_20x12_neon: 66.5 ( 8.49x) > dmvr_8_20x20_neon: 92.2 ( 9.90x) > dmvr_12_12x20_neon: 80.2 ( 7.38x) > dmvr_12_20x12_neon: 58.2 ( 9.59x) > dmvr_12_20x20_neon: 90.0 (10.15x) > dmvr_hv_8_12x20_neon: 369.0 ( 4.34x) > dmvr_hv_8_20x12_neon: 355.8 ( 4.49x) > dmvr_hv_8_20x20_neon: 574.2 ( 4.51x) > > libavcodec/aarch64/vvc/inter.S | 72 ++++++++++------------------------ > 1 file changed, 20 insertions(+), 52 deletions(-) > > diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S > index c9d698ee29..45add44b6e 100644 > --- a/libavcodec/aarch64/vvc/inter.S > +++ b/libavcodec/aarch64/vvc/inter.S > @@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1 > 1: > cbz w15, 2f > ldr q0, [src], #16 > - uxtl v1.8h, v0.8b > - uxtl2 v2.8h, v0.16b > - ushl v1.8h, v1.8h, v16.8h > - ushl v2.8h, v2.8h, v16.8h Please remove assignment to v16. LGTM otherwise. 
> + ushll v1.8h, v0.8b, #2 > + ushll2 v2.8h, v0.16b, #2 > stp q1, q2, [dst], #32 > b 3f > 2: > ldr d0, [src], #8 > - uxtl v1.8h, v0.8b > - ushl v1.8h, v1.8h, v16.8h > + ushll v1.8h, v0.8b, #2 > str q1, [dst], #16 > 3: > subs height, height, #1 > ldr s3, [src], #4 > - uxtl v4.8h, v3.8b > - ushl v4.4h, v4.4h, v16.4h > + ushll v4.8h, v3.8b, #2 > st1 {v4.4h}, [dst], x7 > > add src, src, src_stride > @@ -399,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1 > cmp width, #16 > sub src_stride, src_stride, x6, lsl #1 > cset w15, gt // width > 16 > - movi v16.8h, #2 // offset4 > sub x7, x7, x6, lsl #1 > 1: > cbz w15, 2f > ldp q0, q1, [src], #32 > - uaddl v2.4s, v0.4h, v16.4h > - uaddl2 v3.4s, v0.8h, v16.8h > - uaddl v4.4s, v1.4h, v16.4h > - uaddl2 v5.4s, v1.8h, v16.8h > - ushr v2.4s, v2.4s, #2 > - ushr v3.4s, v3.4s, #2 > - ushr v4.4s, v4.4s, #2 > - ushr v5.4s, v5.4s, #2 > - uqxtn v2.4h, v2.4s > - uqxtn2 v2.8h, v3.4s > - uqxtn v4.4h, v4.4s > - uqxtn2 v4.8h, v5.4s > - > - stp q2, q4, [dst], #32 > + urshr v0.8h, v0.8h, #2 > + urshr v1.8h, v1.8h, #2 > + > + stp q0, q1, [dst], #32 > b 3f > 2: > ldr q0, [src], #16 > - uaddl v2.4s, v0.4h, v16.4h > - uaddl2 v3.4s, v0.8h, v16.8h > - ushr v2.4s, v2.4s, #2 > - ushr v3.4s, v3.4s, #2 > - uqxtn v2.4h, v2.4s > - uqxtn2 v2.8h, v3.4s > - str q2, [dst], #16 > + urshr v0.8h, v0.8h, #2 > + str q0, [dst], #16 > 3: > subs height, height, #1 > ldr d0, [src], #8 > - uaddl v3.4s, v0.4h, v16.4h > - ushr v3.4s, v3.4s, #2 > - uqxtn v3.4h, v3.4s > - st1 {v3.4h}, [dst], x7 > + urshr v0.4h, v0.4h, #2 > + st1 {v0.4h}, [dst], x7 > > add src, src, src_stride > b.ne 1b > @@ -462,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1 > ldrb w10, [x12] > ldrb w11, [x12, #1] > sxtw x6, w6 > - movi v30.8h, #(1 << (8 - 7)) // offset1 > - movi v31.8h, #8 // offset2 > dup v2.8h, w10 // filter_y[0] > dup v3.8h, w11 // filter_y[1] > > @@ -491,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1 > mul v16.8h, v16.8h, v0.8h > mla v6.8h, v7.8h, v1.8h > mla v16.8h, v17.8h, v1.8h > - add v6.8h, v6.8h, v30.8h > - add v16.8h, v16.8h, v30.8h > - ushr v6.8h, v6.8h, #(8 - 6) > - ushr v7.8h, v16.8h, #(8 - 6) > + urshr v6.8h, v6.8h, #(8 - 6) > + urshr v7.8h, v16.8h, #(8 - 6) > stp q6, q7, [x13], #32 > > cbz w10, 3f > @@ -504,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1 > mul v17.8h, v17.8h, v2.8h > mla v16.8h, v6.8h, v3.8h > mla v17.8h, v7.8h, v3.8h > - add v16.8h, v16.8h, v31.8h > - add v17.8h, v17.8h, v31.8h > - ushr v16.8h, v16.8h, #4 > - ushr v17.8h, v17.8h, #4 > + urshr v16.8h, v16.8h, #4 > + urshr v17.8h, v17.8h, #4 > stp q16, q17, [x14], #32 > b 3f > 2: > @@ -518,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 > uxtl v6.8h, v4.8b > mul v6.8h, v6.8h, v0.8h > mla v6.8h, v7.8h, v1.8h > - add v6.8h, v6.8h, v30.8h > - ushr v6.8h, v6.8h, #(8 - 6) > + urshr v6.8h, v6.8h, #(8 - 6) > str q6, [x13], #16 > > cbz w10, 3f > @@ -527,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 > ldr q16, [x12], #16 > mul v16.8h, v16.8h, v2.8h > mla v16.8h, v6.8h, v3.8h > - add v16.8h, v16.8h, v31.8h > - ushr v16.8h, v16.8h, #4 > + urshr v16.8h, v16.8h, #4 > str q16, [x14], #16 > 3: > ldur s5, [src, #1] > @@ -537,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 > uxtl v6.8h, v4.8b > mul v6.4h, v6.4h, v0.4h > mla v6.4h, v7.4h, v1.4h > - add v6.4h, v6.4h, v30.4h > - ushr v6.4h, v6.4h, #(8 - 6) > + urshr v6.4h, v6.4h, #(8 - 6) > str d6, [x13], #8 > > cbz w10, 4f > @@ -546,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 > ldr d16, [x12], #8 > mul v16.4h, v16.4h, v2.4h > mla v16.4h, v6.4h, v3.4h > - 
add v16.4h, v16.4h, v31.4h > - ushr v16.4h, v16.4h, #4 > + urshr v16.4h, v16.4h, #4 > str d16, [x14], #8 > 4: > subs height, height, #1 > -- > 2.47.2 > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} 2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel 2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-20 7:20 ` Zhao Zhili 2025-02-20 18:49 ` Krzysztof Pyrkosz via ffmpeg-devel 1 sibling, 1 reply; 6+ messages in thread From: Zhao Zhili @ 2025-02-20 7:20 UTC (permalink / raw) To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz > On Feb 20, 2025, at 01:40, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote: > > --- > > This patch replaces integer widening with halving addition, and > multi-step "emulated" rounding shift with a single asm instruction doing > exactly that. This pattern repeats in other functions in this file, I > fixed some in the succeeding patch. There's a lot of performance to be > gained there. > > I didn't modify the existing function because it adds a few extra steps > solely for the shared w_avg implementation (every cycle matters), but also > because I find this linear version easier to digest and understand. > > Besides, I noticed that removing smin and smax instructions used for > clamping the values for 10 and 12 bit_depth instantiations does not > affect the checkasm result, but it breaks FATE. > > Benchmarks before and after: > A78 > avg_8_2x2_neon: 21.0 ( 1.55x) > avg_8_4x4_neon: 25.8 ( 3.05x) > avg_8_8x8_neon: 45.0 ( 5.86x) > avg_8_16x16_neon: 178.5 ( 5.49x) > avg_8_32x32_neon: 709.2 ( 6.20x) > avg_8_64x64_neon: 2686.2 ( 6.12x) > avg_8_128x128_neon: 10734.2 ( 5.88x) > avg_10_2x2_neon: 19.0 ( 1.75x) > avg_10_4x4_neon: 28.2 ( 2.76x) > avg_10_8x8_neon: 44.0 ( 5.82x) > avg_10_16x16_neon: 179.5 ( 4.81x) > avg_10_32x32_neon: 680.8 ( 5.58x) > avg_10_64x64_neon: 2536.8 ( 5.40x) > avg_10_128x128_neon: 10079.0 ( 5.22x) > avg_12_2x2_neon: 20.8 ( 1.59x) > avg_12_4x4_neon: 25.2 ( 3.09x) > avg_12_8x8_neon: 44.0 ( 5.79x) > avg_12_16x16_neon: 182.2 ( 4.80x) > avg_12_32x32_neon: 696.2 ( 5.46x) > avg_12_64x64_neon: 2548.2 ( 5.38x) > avg_12_128x128_neon: 10133.8 ( 5.19x) > > avg_8_2x2_neon: 16.5 ( 1.98x) > avg_8_4x4_neon: 26.2 ( 2.93x) > avg_8_8x8_neon: 31.8 ( 8.55x) > avg_8_16x16_neon: 82.0 (12.02x) > avg_8_32x32_neon: 310.2 (14.12x) > avg_8_64x64_neon: 897.8 (18.26x) > avg_8_128x128_neon: 3608.5 (17.37x) > avg_10_2x2_neon: 19.5 ( 1.69x) > avg_10_4x4_neon: 28.0 ( 2.79x) > avg_10_8x8_neon: 34.8 ( 7.32x) > avg_10_16x16_neon: 119.8 ( 7.35x) > avg_10_32x32_neon: 444.2 ( 8.51x) > avg_10_64x64_neon: 1711.8 ( 8.00x) > avg_10_128x128_neon: 7065.2 ( 7.43x) > avg_12_2x2_neon: 19.5 ( 1.71x) > avg_12_4x4_neon: 24.2 ( 3.22x) > avg_12_8x8_neon: 33.8 ( 7.57x) > avg_12_16x16_neon: 120.2 ( 7.33x) > avg_12_32x32_neon: 442.5 ( 8.53x) > avg_12_64x64_neon: 1706.2 ( 8.02x) > avg_12_128x128_neon: 7010.0 ( 7.46x) > > A72 > avg_8_2x2_neon: 30.2 ( 1.48x) > avg_8_4x4_neon: 40.0 ( 3.10x) > avg_8_8x8_neon: 91.0 ( 4.14x) > avg_8_16x16_neon: 340.4 ( 3.92x) > avg_8_32x32_neon: 1220.7 ( 4.67x) > avg_8_64x64_neon: 5823.4 ( 3.88x) > avg_8_128x128_neon: 17430.5 ( 4.73x) > avg_10_2x2_neon: 34.0 ( 1.66x) > avg_10_4x4_neon: 45.2 ( 2.73x) > avg_10_8x8_neon: 97.5 ( 3.87x) > avg_10_16x16_neon: 317.7 ( 3.90x) > avg_10_32x32_neon: 1376.2 ( 4.21x) > avg_10_64x64_neon: 5228.1 ( 3.71x) > avg_10_128x128_neon: 16722.2 ( 4.17x) > avg_12_2x2_neon: 31.7 ( 1.76x) > avg_12_4x4_neon: 36.0 ( 3.44x) > avg_12_8x8_neon: 91.7 ( 4.10x) > 
avg_12_16x16_neon: 297.2 ( 4.13x) > avg_12_32x32_neon: 1400.5 ( 4.14x) > avg_12_64x64_neon: 5379.1 ( 3.51x) > avg_12_128x128_neon: 16715.7 ( 4.17x) > > avg_8_2x2_neon: 33.7 ( 1.72x) > avg_8_4x4_neon: 45.5 ( 2.84x) > avg_8_8x8_neon: 65.0 ( 5.98x) > avg_8_16x16_neon: 171.0 ( 7.81x) > avg_8_32x32_neon: 558.2 (10.05x) > avg_8_64x64_neon: 2006.5 (10.61x) > avg_8_128x128_neon: 9158.7 ( 8.96x) > avg_10_2x2_neon: 38.0 ( 1.92x) > avg_10_4x4_neon: 53.2 ( 2.69x) > avg_10_8x8_neon: 95.2 ( 4.08x) > avg_10_16x16_neon: 243.0 ( 5.02x) > avg_10_32x32_neon: 891.7 ( 5.64x) > avg_10_64x64_neon: 3357.7 ( 5.60x) > avg_10_128x128_neon: 12411.7 ( 5.56x) > avg_12_2x2_neon: 34.7 ( 1.97x) > avg_12_4x4_neon: 53.2 ( 2.68x) > avg_12_8x8_neon: 91.7 ( 4.22x) > avg_12_16x16_neon: 239.0 ( 5.08x) > avg_12_32x32_neon: 895.7 ( 5.62x) > avg_12_64x64_neon: 3317.5 ( 5.67x) > avg_12_128x128_neon: 12358.5 ( 5.58x) > > > A53 > avg_8_2x2_neon: 58.3 ( 1.41x) > avg_8_4x4_neon: 101.8 ( 2.21x) > avg_8_8x8_neon: 178.6 ( 4.53x) > avg_8_16x16_neon: 569.5 ( 5.01x) > avg_8_32x32_neon: 1962.5 ( 5.50x) > avg_8_64x64_neon: 8327.8 ( 5.18x) > avg_8_128x128_neon: 31631.3 ( 5.34x) > avg_10_2x2_neon: 54.5 ( 1.56x) > avg_10_4x4_neon: 88.8 ( 2.53x) > avg_10_8x8_neon: 163.6 ( 4.97x) > avg_10_16x16_neon: 550.5 ( 5.16x) > avg_10_32x32_neon: 1942.5 ( 5.64x) > avg_10_64x64_neon: 8783.5 ( 4.98x) > avg_10_128x128_neon: 32617.0 ( 5.25x) > avg_12_2x2_neon: 53.3 ( 1.66x) > avg_12_4x4_neon: 86.8 ( 2.61x) > avg_12_8x8_neon: 156.6 ( 5.12x) > avg_12_16x16_neon: 541.3 ( 5.25x) > avg_12_32x32_neon: 1955.3 ( 5.59x) > avg_12_64x64_neon: 8686.0 ( 5.06x) > avg_12_128x128_neon: 32487.5 ( 5.25x) > > avg_8_2x2_neon: 39.5 ( 1.96x) > avg_8_4x4_neon: 65.3 ( 3.41x) > avg_8_8x8_neon: 168.8 ( 4.79x) > avg_8_16x16_neon: 348.0 ( 8.20x) > avg_8_32x32_neon: 1207.5 ( 8.98x) > avg_8_64x64_neon: 6032.3 ( 7.17x) > avg_8_128x128_neon: 22008.5 ( 7.69x) > avg_10_2x2_neon: 55.5 ( 1.52x) > avg_10_4x4_neon: 73.8 ( 3.08x) > avg_10_8x8_neon: 157.8 ( 5.12x) > avg_10_16x16_neon: 445.0 ( 6.43x) > avg_10_32x32_neon: 1587.3 ( 6.87x) > avg_10_64x64_neon: 7738.0 ( 5.68x) > avg_10_128x128_neon: 27813.8 ( 6.14x) > avg_12_2x2_neon: 48.3 ( 1.80x) > avg_12_4x4_neon: 77.0 ( 2.95x) > avg_12_8x8_neon: 161.5 ( 4.98x) > avg_12_16x16_neon: 433.5 ( 6.59x) > avg_12_32x32_neon: 1622.0 ( 6.75x) > avg_12_64x64_neon: 7844.5 ( 5.60x) > avg_12_128x128_neon: 26999.5 ( 6.34x) > > Krzysztof > > libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++- > 1 file changed, 121 insertions(+), 3 deletions(-) > > diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S > index 0edc861f97..c9d698ee29 100644 > --- a/libavcodec/aarch64/vvc/inter.S > +++ b/libavcodec/aarch64/vvc/inter.S > @@ -217,13 +217,131 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1 > endfunc > .endm > > -vvc_avg avg, 8 > -vvc_avg avg, 10 > -vvc_avg avg, 12 > vvc_avg w_avg, 8 > vvc_avg w_avg, 10 > vvc_avg w_avg, 12 > > +.macro vvc_avg2 bit_depth > +function ff_vvc_avg_\bit_depth\()_neon, export=1 > + mov x10, #(VVC_MAX_PB_SIZE * 2) > + movi v16.8h, #0 > + movi v17.16b, #255 > + ushr v17.8h, v17.8h, #(16 - \bit_depth) Please set v16 v17 only for bit_depth > 8. LGTM otherwise. 
> + > + cmp w4, #8 > + b.gt 16f > + b.eq 8f > + cmp w4, #4 > + b.eq 4f > + > +2: // width == 2 > + ldr s0, [x2] > + subs w5, w5, #1 > + ldr s1, [x3] > +.if \bit_depth == 8 > + shadd v0.4h, v0.4h, v1.4h > + sqrshrun v0.8b, v0.8h, #(15 - 1 - \bit_depth) > + str h0, [x0] > +.else > + shadd v0.4h, v0.4h, v1.4h > + srshr v0.4h, v0.4h, #(15 - 1 - \bit_depth) > + smax v0.4h, v0.4h, v16.4h > + smin v0.4h, v0.4h, v17.4h > + str s0, [x0] > +.endif > + add x2, x2, #(VVC_MAX_PB_SIZE * 2) > + add x3, x3, #(VVC_MAX_PB_SIZE * 2) > + add x0, x0, x1 > + b.ne 2b > + ret > + > +4: // width == 4 > + ldr d0, [x2] > + subs w5, w5, #1 > + ldr d1, [x3] > +.if \bit_depth == 8 > + shadd v0.4h, v0.4h, v1.4h > + sqrshrun v0.8b, v0.8h, #(15 - 1 - \bit_depth) > + str s0, [x0] > +.else > + shadd v0.4h, v0.4h, v1.4h > + srshr v0.4h, v0.4h, #(15 - 1 - \bit_depth) > + smax v0.4h, v0.4h, v16.4h > + smin v0.4h, v0.4h, v17.4h > + str d0, [x0] > +.endif > + add x2, x2, #(VVC_MAX_PB_SIZE * 2) > + add x3, x3, #(VVC_MAX_PB_SIZE * 2) > + add x0, x0, x1 > + b.ne 4b > + ret > + > +8: // width == 8 > + ldr q0, [x2] > + subs w5, w5, #1 > + ldr q1, [x3] > +.if \bit_depth == 8 > + shadd v0.8h, v0.8h, v1.8h > + sqrshrun v0.8b, v0.8h, #(15 - 1 - \bit_depth) > + str d0, [x0] > +.else > + shadd v0.8h, v0.8h, v1.8h > + srshr v0.8h, v0.8h, #(15 - 1 - \bit_depth) > + smax v0.8h, v0.8h, v16.8h > + smin v0.8h, v0.8h, v17.8h > + str q0, [x0] > +.endif > + add x2, x2, #(VVC_MAX_PB_SIZE * 2) > + add x3, x3, #(VVC_MAX_PB_SIZE * 2) > + add x0, x0, x1 > + b.ne 8b > + ret > + > +16: // width >= 16 > +.if \bit_depth == 8 > + sub x1, x1, w4, sxtw > +.else > + sub x1, x1, w4, sxtw #1 > +.endif > + sub x10, x10, w4, sxtw #1 > +3: > + mov w6, w4 // width > +1: > + ldp q0, q1, [x2], #32 > + subs w6, w6, #16 > + ldp q2, q3, [x3], #32 > +.if \bit_depth == 8 > + shadd v4.8h, v0.8h, v2.8h > + shadd v5.8h, v1.8h, v3.8h > + sqrshrun v0.8b, v4.8h, #6 > + sqrshrun2 v0.16b, v5.8h, #6 > + st1 {v0.16b}, [x0], #16 > +.else > + shadd v4.8h, v0.8h, v2.8h > + shadd v5.8h, v1.8h, v3.8h > + srshr v0.8h, v4.8h, #(15 - 1 - \bit_depth) > + srshr v1.8h, v5.8h, #(15 - 1 - \bit_depth) > + smax v0.8h, v0.8h, v16.8h > + smax v1.8h, v1.8h, v16.8h > + smin v0.8h, v0.8h, v17.8h > + smin v1.8h, v1.8h, v17.8h > + stp q0, q1, [x0], #32 > +.endif > + b.ne 1b > + > + subs w5, w5, #1 > + add x2, x2, x10 > + add x3, x3, x10 > + add x0, x0, x1 > + b.ne 3b > + ret > +endfunc > +.endm > + > +vvc_avg2 8 > +vvc_avg2 10 > +vvc_avg2 12 > + > /* x0: int16_t *dst > * x1: const uint8_t *_src > * x2: ptrdiff_t _src_stride > -- > 2.47.2 > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". ^ permalink raw reply [flat|nested] 6+ messages in thread
* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} 2025-02-20 7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili @ 2025-02-20 18:49 ` Krzysztof Pyrkosz via ffmpeg-devel 2025-02-20 18:49 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel 0 siblings, 1 reply; 6+ messages in thread From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-20 18:49 UTC (permalink / raw) To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz --- libavcodec/aarch64/vvc/inter.S | 125 ++++++++++++++++++++++++++++++++- 1 file changed, 122 insertions(+), 3 deletions(-) diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S index 0edc861f97..b65920e640 100644 --- a/libavcodec/aarch64/vvc/inter.S +++ b/libavcodec/aarch64/vvc/inter.S @@ -217,13 +217,132 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1 endfunc .endm -vvc_avg avg, 8 -vvc_avg avg, 10 -vvc_avg avg, 12 vvc_avg w_avg, 8 vvc_avg w_avg, 10 vvc_avg w_avg, 12 +.macro vvc_avg2 bit_depth +function ff_vvc_avg_\bit_depth\()_neon, export=1 + mov x10, #(VVC_MAX_PB_SIZE * 2) +.if \bit_depth != 8 + movi v16.8h, #0 + movi v17.16b, #255 + ushr v17.8h, v17.8h, #(16 - \bit_depth) +.endif + cmp w4, #8 + b.gt 16f + b.eq 8f + cmp w4, #4 + b.eq 4f + +2: // width == 2 + ldr s0, [x2] + subs w5, w5, #1 + ldr s1, [x3] +.if \bit_depth == 8 + shadd v0.4h, v0.4h, v1.4h + sqrshrun v0.8b, v0.8h, #(15 - 1 - \bit_depth) + str h0, [x0] +.else + shadd v0.4h, v0.4h, v1.4h + srshr v0.4h, v0.4h, #(15 - 1 - \bit_depth) + smax v0.4h, v0.4h, v16.4h + smin v0.4h, v0.4h, v17.4h + str s0, [x0] +.endif + add x2, x2, #(VVC_MAX_PB_SIZE * 2) + add x3, x3, #(VVC_MAX_PB_SIZE * 2) + add x0, x0, x1 + b.ne 2b + ret + +4: // width == 4 + ldr d0, [x2] + subs w5, w5, #1 + ldr d1, [x3] +.if \bit_depth == 8 + shadd v0.4h, v0.4h, v1.4h + sqrshrun v0.8b, v0.8h, #(15 - 1 - \bit_depth) + str s0, [x0] +.else + shadd v0.4h, v0.4h, v1.4h + srshr v0.4h, v0.4h, #(15 - 1 - \bit_depth) + smax v0.4h, v0.4h, v16.4h + smin v0.4h, v0.4h, v17.4h + str d0, [x0] +.endif + add x2, x2, #(VVC_MAX_PB_SIZE * 2) + add x3, x3, #(VVC_MAX_PB_SIZE * 2) + add x0, x0, x1 + b.ne 4b + ret + +8: // width == 8 + ldr q0, [x2] + subs w5, w5, #1 + ldr q1, [x3] +.if \bit_depth == 8 + shadd v0.8h, v0.8h, v1.8h + sqrshrun v0.8b, v0.8h, #(15 - 1 - \bit_depth) + str d0, [x0] +.else + shadd v0.8h, v0.8h, v1.8h + srshr v0.8h, v0.8h, #(15 - 1 - \bit_depth) + smax v0.8h, v0.8h, v16.8h + smin v0.8h, v0.8h, v17.8h + str q0, [x0] +.endif + add x2, x2, #(VVC_MAX_PB_SIZE * 2) + add x3, x3, #(VVC_MAX_PB_SIZE * 2) + add x0, x0, x1 + b.ne 8b + ret + +16: // width >= 16 +.if \bit_depth == 8 + sub x1, x1, w4, sxtw +.else + sub x1, x1, w4, sxtw #1 +.endif + sub x10, x10, w4, sxtw #1 +3: + mov w6, w4 // width +1: + ldp q0, q1, [x2], #32 + subs w6, w6, #16 + ldp q2, q3, [x3], #32 +.if \bit_depth == 8 + shadd v4.8h, v0.8h, v2.8h + shadd v5.8h, v1.8h, v3.8h + sqrshrun v0.8b, v4.8h, #6 + sqrshrun2 v0.16b, v5.8h, #6 + st1 {v0.16b}, [x0], #16 +.else + shadd v4.8h, v0.8h, v2.8h + shadd v5.8h, v1.8h, v3.8h + srshr v0.8h, v4.8h, #(15 - 1 - \bit_depth) + srshr v1.8h, v5.8h, #(15 - 1 - \bit_depth) + smax v0.8h, v0.8h, v16.8h + smax v1.8h, v1.8h, v16.8h + smin v0.8h, v0.8h, v17.8h + smin v1.8h, v1.8h, v17.8h + stp q0, q1, [x0], #32 +.endif + b.ne 1b + + subs w5, w5, #1 + add x2, x2, x10 + add x3, x3, x10 + add x0, x0, x1 + b.ne 3b + ret +endfunc +.endm + +vvc_avg2 8 +vvc_avg2 10 +vvc_avg2 12 + /* x0: int16_t *dst * x1: const uint8_t 
*_src * x2: ptrdiff_t _src_stride -- 2.47.2 _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". ^ permalink raw reply [flat|nested] 6+ messages in thread
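The v2 above only skips setting up the clamp bounds (v16/v17) in the 8-bit case, where sqrshrun already saturates; for 10 and 12 bit the smax/smin pair stays, in line with the observation in the first message that dropping it passes checkasm but breaks FATE. As a reference point, here is a rough scalar model of the per-sample operation the NEON code implements; the helper name and the exact shift/offset expressions are reconstructed for illustration, not copied from the FFmpeg C code.

#include <stdint.h>

/* Rough scalar model (assumed): a and b are intermediate 16-bit prediction
 * samples, bd is 8, 10 or 12.  Real bitstreams can drive the shifted sum
 * outside [0, (1 << bd) - 1], which is why the final clip cannot be removed
 * even if a particular random test run never trips it. */
static inline uint16_t vvc_avg_sample_model(int16_t a, int16_t b, int bd)
{
    int shift  = 15 - bd;                  /* shadd (>>1) plus srshr/sqrshrun */
    int offset = 1 << (shift - 1);         /* rounding offset                 */
    int v      = (a + b + offset) >> shift;
    int max    = (1 << bd) - 1;
    return (uint16_t)(v < 0 ? 0 : v > max ? max : v);
}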
* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction 2025-02-20 18:49 ` Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-20 18:49 ` Krzysztof Pyrkosz via ffmpeg-devel 0 siblings, 0 replies; 6+ messages in thread From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-20 18:49 UTC (permalink / raw) To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz --- libavcodec/aarch64/vvc/inter.S | 73 ++++++++++------------------------ 1 file changed, 20 insertions(+), 53 deletions(-) diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S index b65920e640..09f0627b20 100644 --- a/libavcodec/aarch64/vvc/inter.S +++ b/libavcodec/aarch64/vvc/inter.S @@ -365,27 +365,22 @@ function ff_vvc_dmvr_8_neon, export=1 cmp width, #16 sub src_stride, src_stride, x6 cset w15, gt // width > 16 - movi v16.8h, #2 // DMVR_SHIFT sub x7, x7, x6, lsl #1 1: cbz w15, 2f ldr q0, [src], #16 - uxtl v1.8h, v0.8b - uxtl2 v2.8h, v0.16b - ushl v1.8h, v1.8h, v16.8h - ushl v2.8h, v2.8h, v16.8h + ushll v1.8h, v0.8b, #2 + ushll2 v2.8h, v0.16b, #2 stp q1, q2, [dst], #32 b 3f 2: ldr d0, [src], #8 - uxtl v1.8h, v0.8b - ushl v1.8h, v1.8h, v16.8h + ushll v1.8h, v0.8b, #2 str q1, [dst], #16 3: subs height, height, #1 ldr s3, [src], #4 - uxtl v4.8h, v3.8b - ushl v4.4h, v4.4h, v16.4h + ushll v4.8h, v3.8b, #2 st1 {v4.4h}, [dst], x7 add src, src, src_stride @@ -400,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1 cmp width, #16 sub src_stride, src_stride, x6, lsl #1 cset w15, gt // width > 16 - movi v16.8h, #2 // offset4 sub x7, x7, x6, lsl #1 1: cbz w15, 2f ldp q0, q1, [src], #32 - uaddl v2.4s, v0.4h, v16.4h - uaddl2 v3.4s, v0.8h, v16.8h - uaddl v4.4s, v1.4h, v16.4h - uaddl2 v5.4s, v1.8h, v16.8h - ushr v2.4s, v2.4s, #2 - ushr v3.4s, v3.4s, #2 - ushr v4.4s, v4.4s, #2 - ushr v5.4s, v5.4s, #2 - uqxtn v2.4h, v2.4s - uqxtn2 v2.8h, v3.4s - uqxtn v4.4h, v4.4s - uqxtn2 v4.8h, v5.4s - - stp q2, q4, [dst], #32 + urshr v0.8h, v0.8h, #2 + urshr v1.8h, v1.8h, #2 + + stp q0, q1, [dst], #32 b 3f 2: ldr q0, [src], #16 - uaddl v2.4s, v0.4h, v16.4h - uaddl2 v3.4s, v0.8h, v16.8h - ushr v2.4s, v2.4s, #2 - ushr v3.4s, v3.4s, #2 - uqxtn v2.4h, v2.4s - uqxtn2 v2.8h, v3.4s - str q2, [dst], #16 + urshr v0.8h, v0.8h, #2 + str q0, [dst], #16 3: subs height, height, #1 ldr d0, [src], #8 - uaddl v3.4s, v0.4h, v16.4h - ushr v3.4s, v3.4s, #2 - uqxtn v3.4h, v3.4s - st1 {v3.4h}, [dst], x7 + urshr v0.4h, v0.4h, #2 + st1 {v0.4h}, [dst], x7 add src, src, src_stride b.ne 1b @@ -463,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1 ldrb w10, [x12] ldrb w11, [x12, #1] sxtw x6, w6 - movi v30.8h, #(1 << (8 - 7)) // offset1 - movi v31.8h, #8 // offset2 dup v2.8h, w10 // filter_y[0] dup v3.8h, w11 // filter_y[1] @@ -492,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1 mul v16.8h, v16.8h, v0.8h mla v6.8h, v7.8h, v1.8h mla v16.8h, v17.8h, v1.8h - add v6.8h, v6.8h, v30.8h - add v16.8h, v16.8h, v30.8h - ushr v6.8h, v6.8h, #(8 - 6) - ushr v7.8h, v16.8h, #(8 - 6) + urshr v6.8h, v6.8h, #(8 - 6) + urshr v7.8h, v16.8h, #(8 - 6) stp q6, q7, [x13], #32 cbz w10, 3f @@ -505,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1 mul v17.8h, v17.8h, v2.8h mla v16.8h, v6.8h, v3.8h mla v17.8h, v7.8h, v3.8h - add v16.8h, v16.8h, v31.8h - add v17.8h, v17.8h, v31.8h - ushr v16.8h, v16.8h, #4 - ushr v17.8h, v17.8h, #4 + urshr v16.8h, v16.8h, #4 + urshr v17.8h, v17.8h, #4 stp q16, q17, [x14], #32 b 3f 2: @@ -519,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 uxtl v6.8h, v4.8b mul v6.8h, v6.8h, v0.8h mla v6.8h, v7.8h, v1.8h - add v6.8h, v6.8h, v30.8h - ushr 
v6.8h, v6.8h, #(8 - 6) + urshr v6.8h, v6.8h, #(8 - 6) str q6, [x13], #16 cbz w10, 3f @@ -528,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 ldr q16, [x12], #16 mul v16.8h, v16.8h, v2.8h mla v16.8h, v6.8h, v3.8h - add v16.8h, v16.8h, v31.8h - ushr v16.8h, v16.8h, #4 + urshr v16.8h, v16.8h, #4 str q16, [x14], #16 3: ldur s5, [src, #1] @@ -538,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 uxtl v6.8h, v4.8b mul v6.4h, v6.4h, v0.4h mla v6.4h, v7.4h, v1.4h - add v6.4h, v6.4h, v30.4h - ushr v6.4h, v6.4h, #(8 - 6) + urshr v6.4h, v6.4h, #(8 - 6) str d6, [x13], #8 cbz w10, 4f @@ -547,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1 ldr d16, [x12], #8 mul v16.4h, v16.4h, v2.4h mla v16.4h, v6.4h, v3.4h - add v16.4h, v16.4h, v31.4h - ushr v16.4h, v16.4h, #4 + urshr v16.4h, v16.4h, #4 str d16, [x14], #8 4: subs height, height, #1 -- 2.47.2 _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". ^ permalink raw reply [flat|nested] 6+ messages in thread
Thread overview: 6+ messages
2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel
2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
2025-02-20  8:08   ` Zhao Zhili
2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
2025-02-20 18:49     ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel