Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
@ 2025-02-19 17:40 Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-19 17:40 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---

This patch replaces the integer widening with a halving addition, and the
multi-step "emulated" rounding shift with a single asm instruction that
does exactly that. The same pattern repeats in other functions in this
file; I fixed some of them in the succeeding patch. There's a lot of
performance to be gained there.
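
Roughly, the per-sample arithmetic the new code path computes is the
following (illustrative C only, not FFmpeg's reference implementation;
the split into a halving add plus a rounding shift is exact, because the
bit dropped by shadd cannot change the rounded result):

    /* shift is 15 - bit_depth for the bit depths handled here (8/10/12) */
    static inline int avg_sample(int s0, int s1, int bit_depth)
    {
        const int shift = 15 - bit_depth;
        /* shadd: halving add (assumes arithmetic >> on negative sums) */
        const int half  = (s0 + s1) >> 1;
        /* srshr / sqrshrun: rounding shift by the remaining amount */
        const int val   = (half + (1 << (shift - 2))) >> (shift - 1);
        /* smax / smin clamp; sqrshrun already saturates in the 8-bit case */
        const int pmax  = (1 << bit_depth) - 1;
        return val < 0 ? 0 : val > pmax ? pmax : val;
    }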

I didn't modify the existing function because it adds a few extra steps
solely for the shared w_avg implementation (every cycle matters), but also
because I find this linear version easier to digest and understand.

As an aside, I noticed that removing the smin and smax instructions used
for clamping the values in the 10 and 12 bit_depth instantiations does
not affect the checkasm result, but it does break FATE.
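
For reference, the smax/smin bounds kept in v16/v17 are just 0 and the
pixel maximum; movi #255 followed by ushr #(16 - bit_depth) builds the
latter in every lane. A trivial scalar equivalent, purely illustrative:

    /* 255 for 8-bit, 1023 for 10-bit, 4095 for 12-bit */
    static inline unsigned pixel_max(int bit_depth)
    {
        return 0xFFFFu >> (16 - bit_depth);
    }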

Benchmarks before and after:
A78
avg_8_2x2_neon:                                         21.0 ( 1.55x)
avg_8_4x4_neon:                                         25.8 ( 3.05x)
avg_8_8x8_neon:                                         45.0 ( 5.86x)
avg_8_16x16_neon:                                      178.5 ( 5.49x)
avg_8_32x32_neon:                                      709.2 ( 6.20x)
avg_8_64x64_neon:                                     2686.2 ( 6.12x)
avg_8_128x128_neon:                                  10734.2 ( 5.88x)
avg_10_2x2_neon:                                        19.0 ( 1.75x)
avg_10_4x4_neon:                                        28.2 ( 2.76x)
avg_10_8x8_neon:                                        44.0 ( 5.82x)
avg_10_16x16_neon:                                     179.5 ( 4.81x)
avg_10_32x32_neon:                                     680.8 ( 5.58x)
avg_10_64x64_neon:                                    2536.8 ( 5.40x)
avg_10_128x128_neon:                                 10079.0 ( 5.22x)
avg_12_2x2_neon:                                        20.8 ( 1.59x)
avg_12_4x4_neon:                                        25.2 ( 3.09x)
avg_12_8x8_neon:                                        44.0 ( 5.79x)
avg_12_16x16_neon:                                     182.2 ( 4.80x)
avg_12_32x32_neon:                                     696.2 ( 5.46x)
avg_12_64x64_neon:                                    2548.2 ( 5.38x)
avg_12_128x128_neon:                                 10133.8 ( 5.19x)

avg_8_2x2_neon:                                         16.5 ( 1.98x)
avg_8_4x4_neon:                                         26.2 ( 2.93x)
avg_8_8x8_neon:                                         31.8 ( 8.55x)
avg_8_16x16_neon:                                       82.0 (12.02x)
avg_8_32x32_neon:                                      310.2 (14.12x)
avg_8_64x64_neon:                                      897.8 (18.26x)
avg_8_128x128_neon:                                   3608.5 (17.37x)
avg_10_2x2_neon:                                        19.5 ( 1.69x)
avg_10_4x4_neon:                                        28.0 ( 2.79x)
avg_10_8x8_neon:                                        34.8 ( 7.32x)
avg_10_16x16_neon:                                     119.8 ( 7.35x)
avg_10_32x32_neon:                                     444.2 ( 8.51x)
avg_10_64x64_neon:                                    1711.8 ( 8.00x)
avg_10_128x128_neon:                                  7065.2 ( 7.43x)
avg_12_2x2_neon:                                        19.5 ( 1.71x)
avg_12_4x4_neon:                                        24.2 ( 3.22x)
avg_12_8x8_neon:                                        33.8 ( 7.57x)
avg_12_16x16_neon:                                     120.2 ( 7.33x)
avg_12_32x32_neon:                                     442.5 ( 8.53x)
avg_12_64x64_neon:                                    1706.2 ( 8.02x)
avg_12_128x128_neon:                                  7010.0 ( 7.46x)

A72
avg_8_2x2_neon:                                         30.2 ( 1.48x)
avg_8_4x4_neon:                                         40.0 ( 3.10x)
avg_8_8x8_neon:                                         91.0 ( 4.14x)
avg_8_16x16_neon:                                      340.4 ( 3.92x)
avg_8_32x32_neon:                                     1220.7 ( 4.67x)
avg_8_64x64_neon:                                     5823.4 ( 3.88x)
avg_8_128x128_neon:                                  17430.5 ( 4.73x)
avg_10_2x2_neon:                                        34.0 ( 1.66x)
avg_10_4x4_neon:                                        45.2 ( 2.73x)
avg_10_8x8_neon:                                        97.5 ( 3.87x)
avg_10_16x16_neon:                                     317.7 ( 3.90x)
avg_10_32x32_neon:                                    1376.2 ( 4.21x)
avg_10_64x64_neon:                                    5228.1 ( 3.71x)
avg_10_128x128_neon:                                 16722.2 ( 4.17x)
avg_12_2x2_neon:                                        31.7 ( 1.76x)
avg_12_4x4_neon:                                        36.0 ( 3.44x)
avg_12_8x8_neon:                                        91.7 ( 4.10x)
avg_12_16x16_neon:                                     297.2 ( 4.13x)
avg_12_32x32_neon:                                    1400.5 ( 4.14x)
avg_12_64x64_neon:                                    5379.1 ( 3.51x)
avg_12_128x128_neon:                                 16715.7 ( 4.17x)

avg_8_2x2_neon:                                         33.7 ( 1.72x)
avg_8_4x4_neon:                                         45.5 ( 2.84x)
avg_8_8x8_neon:                                         65.0 ( 5.98x)
avg_8_16x16_neon:                                      171.0 ( 7.81x)
avg_8_32x32_neon:                                      558.2 (10.05x)
avg_8_64x64_neon:                                     2006.5 (10.61x)
avg_8_128x128_neon:                                   9158.7 ( 8.96x)
avg_10_2x2_neon:                                        38.0 ( 1.92x)
avg_10_4x4_neon:                                        53.2 ( 2.69x)
avg_10_8x8_neon:                                        95.2 ( 4.08x)
avg_10_16x16_neon:                                     243.0 ( 5.02x)
avg_10_32x32_neon:                                     891.7 ( 5.64x)
avg_10_64x64_neon:                                    3357.7 ( 5.60x)
avg_10_128x128_neon:                                 12411.7 ( 5.56x)
avg_12_2x2_neon:                                        34.7 ( 1.97x)
avg_12_4x4_neon:                                        53.2 ( 2.68x)
avg_12_8x8_neon:                                        91.7 ( 4.22x)
avg_12_16x16_neon:                                     239.0 ( 5.08x)
avg_12_32x32_neon:                                     895.7 ( 5.62x)
avg_12_64x64_neon:                                    3317.5 ( 5.67x)
avg_12_128x128_neon:                                 12358.5 ( 5.58x)


A53
avg_8_2x2_neon:                                         58.3 ( 1.41x)
avg_8_4x4_neon:                                        101.8 ( 2.21x)
avg_8_8x8_neon:                                        178.6 ( 4.53x)
avg_8_16x16_neon:                                      569.5 ( 5.01x)
avg_8_32x32_neon:                                     1962.5 ( 5.50x)
avg_8_64x64_neon:                                     8327.8 ( 5.18x)
avg_8_128x128_neon:                                  31631.3 ( 5.34x)
avg_10_2x2_neon:                                        54.5 ( 1.56x)
avg_10_4x4_neon:                                        88.8 ( 2.53x)
avg_10_8x8_neon:                                       163.6 ( 4.97x)
avg_10_16x16_neon:                                     550.5 ( 5.16x)
avg_10_32x32_neon:                                    1942.5 ( 5.64x)
avg_10_64x64_neon:                                    8783.5 ( 4.98x)
avg_10_128x128_neon:                                 32617.0 ( 5.25x)
avg_12_2x2_neon:                                        53.3 ( 1.66x)
avg_12_4x4_neon:                                        86.8 ( 2.61x)
avg_12_8x8_neon:                                       156.6 ( 5.12x)
avg_12_16x16_neon:                                     541.3 ( 5.25x)
avg_12_32x32_neon:                                    1955.3 ( 5.59x)
avg_12_64x64_neon:                                    8686.0 ( 5.06x)
avg_12_128x128_neon:                                 32487.5 ( 5.25x)

avg_8_2x2_neon:                                         39.5 ( 1.96x)
avg_8_4x4_neon:                                         65.3 ( 3.41x)
avg_8_8x8_neon:                                        168.8 ( 4.79x)
avg_8_16x16_neon:                                      348.0 ( 8.20x)
avg_8_32x32_neon:                                     1207.5 ( 8.98x)
avg_8_64x64_neon:                                     6032.3 ( 7.17x)
avg_8_128x128_neon:                                  22008.5 ( 7.69x)
avg_10_2x2_neon:                                        55.5 ( 1.52x)
avg_10_4x4_neon:                                        73.8 ( 3.08x)
avg_10_8x8_neon:                                       157.8 ( 5.12x)
avg_10_16x16_neon:                                     445.0 ( 6.43x)
avg_10_32x32_neon:                                    1587.3 ( 6.87x)
avg_10_64x64_neon:                                    7738.0 ( 5.68x)
avg_10_128x128_neon:                                 27813.8 ( 6.14x)
avg_12_2x2_neon:                                        48.3 ( 1.80x)
avg_12_4x4_neon:                                        77.0 ( 2.95x)
avg_12_8x8_neon:                                       161.5 ( 4.98x)
avg_12_16x16_neon:                                     433.5 ( 6.59x)
avg_12_32x32_neon:                                    1622.0 ( 6.75x)
avg_12_64x64_neon:                                    7844.5 ( 5.60x)
avg_12_128x128_neon:                                 26999.5 ( 6.34x)

Krzysztof

 libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++-
 1 file changed, 121 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 0edc861f97..c9d698ee29 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -217,13 +217,131 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1
 endfunc
 .endm
 
-vvc_avg avg, 8
-vvc_avg avg, 10
-vvc_avg avg, 12
 vvc_avg w_avg, 8
 vvc_avg w_avg, 10
 vvc_avg w_avg, 12
 
+.macro vvc_avg2 bit_depth
+function ff_vvc_avg_\bit_depth\()_neon, export=1
+        mov             x10, #(VVC_MAX_PB_SIZE * 2)
+        movi            v16.8h, #0
+        movi            v17.16b, #255
+        ushr            v17.8h, v17.8h, #(16 - \bit_depth)
+
+        cmp             w4, #8
+        b.gt            16f
+        b.eq            8f
+        cmp             w4, #4
+        b.eq            4f
+
+2: // width == 2
+        ldr             s0, [x2]
+        subs            w5, w5, #1
+        ldr             s1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             h0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             s0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            2b
+        ret
+
+4: // width == 4
+        ldr             d0, [x2]
+        subs            w5, w5, #1
+        ldr             d1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             s0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             d0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            4b
+        ret
+
+8: // width == 8
+        ldr             q0, [x2]
+        subs            w5, w5, #1
+        ldr             q1, [x3]
+.if \bit_depth == 8
+        shadd           v0.8h, v0.8h, v1.8h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             d0, [x0]
+.else
+        shadd           v0.8h, v0.8h, v1.8h
+        srshr           v0.8h, v0.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        str             q0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            8b
+        ret
+
+16: // width >= 16
+.if \bit_depth == 8
+        sub             x1, x1, w4, sxtw
+.else
+        sub             x1, x1, w4, sxtw #1
+.endif
+        sub             x10, x10, w4, sxtw #1
+3:
+        mov             w6, w4 // width
+1:
+        ldp             q0, q1, [x2], #32
+        subs            w6, w6, #16
+        ldp             q2, q3, [x3], #32
+.if \bit_depth == 8
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        sqrshrun        v0.8b, v4.8h, #6
+        sqrshrun2       v0.16b, v5.8h, #6
+        st1             {v0.16b}, [x0], #16
+.else
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        srshr           v0.8h, v4.8h, #(15 - 1 - \bit_depth)
+        srshr           v1.8h, v5.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smax            v1.8h, v1.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        smin            v1.8h, v1.8h, v17.8h
+        stp             q0, q1, [x0], #32
+.endif
+        b.ne            1b
+
+        subs            w5, w5, #1
+        add             x2, x2, x10
+        add             x3, x3, x10
+        add             x0, x0, x1
+        b.ne            3b
+        ret
+endfunc
+.endm
+
+vvc_avg2 8
+vvc_avg2 10
+vvc_avg2 12
+
 /* x0: int16_t *dst
  * x1: const uint8_t *_src
  * x2: ptrdiff_t _src_stride
-- 
2.47.2

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-19 17:40 ` Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-20  8:08   ` Zhao Zhili
  2025-03-01 22:34   ` Martin Storsjö
  2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
  2025-03-01 22:21 ` Martin Storsjö
  2 siblings, 2 replies; 9+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-19 17:40 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---

Before and after on A78

dmvr_8_12x20_neon:                                      86.2 ( 6.90x)
dmvr_8_20x12_neon:                                      94.8 ( 5.93x)
dmvr_8_20x20_neon:                                     141.5 ( 6.50x)
dmvr_12_12x20_neon:                                    158.0 ( 3.76x)
dmvr_12_20x12_neon:                                    151.2 ( 3.73x)
dmvr_12_20x20_neon:                                    247.2 ( 3.71x)
dmvr_hv_8_12x20_neon:                                  423.2 ( 3.75x)
dmvr_hv_8_20x12_neon:                                  434.0 ( 3.69x)
dmvr_hv_8_20x20_neon:                                  706.0 ( 3.69x)

dmvr_8_12x20_neon:                                      77.2 ( 7.70x)
dmvr_8_20x12_neon:                                      66.5 ( 8.49x)
dmvr_8_20x20_neon:                                      92.2 ( 9.90x)
dmvr_12_12x20_neon:                                     80.2 ( 7.38x)
dmvr_12_20x12_neon:                                     58.2 ( 9.59x)
dmvr_12_20x20_neon:                                     90.0 (10.15x)
dmvr_hv_8_12x20_neon:                                  369.0 ( 4.34x)
dmvr_hv_8_20x12_neon:                                  355.8 ( 4.49x)
dmvr_hv_8_20x20_neon:                                  574.2 ( 4.51x)

 libavcodec/aarch64/vvc/inter.S | 72 ++++++++++------------------------
 1 file changed, 20 insertions(+), 52 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index c9d698ee29..45add44b6e 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1
 1:
         cbz             w15, 2f
         ldr             q0, [src], #16
-        uxtl            v1.8h, v0.8b
-        uxtl2           v2.8h, v0.16b
-        ushl            v1.8h, v1.8h, v16.8h
-        ushl            v2.8h, v2.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
+        ushll2          v2.8h, v0.16b, #2
         stp             q1, q2, [dst], #32
         b               3f
 2:
         ldr             d0, [src], #8
-        uxtl            v1.8h, v0.8b
-        ushl            v1.8h, v1.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
         str             q1, [dst], #16
 3:
         subs            height, height, #1
         ldr             s3, [src], #4
-        uxtl            v4.8h, v3.8b
-        ushl            v4.4h, v4.4h, v16.4h
+        ushll           v4.8h, v3.8b, #2
         st1             {v4.4h}, [dst], x7
 
         add             src, src, src_stride
@@ -399,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1
         cmp             width, #16
         sub             src_stride, src_stride, x6, lsl #1
         cset            w15, gt                     // width > 16
-        movi            v16.8h, #2                  // offset4
         sub             x7, x7, x6, lsl #1
 1:
         cbz             w15, 2f
         ldp             q0, q1, [src], #32
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        uaddl           v4.4s, v1.4h, v16.4h
-        uaddl2          v5.4s, v1.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        ushr            v4.4s, v4.4s, #2
-        ushr            v5.4s, v5.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        uqxtn           v4.4h, v4.4s
-        uqxtn2          v4.8h, v5.4s
-
-        stp             q2, q4, [dst], #32
+        urshr           v0.8h, v0.8h, #2
+        urshr           v1.8h, v1.8h, #2
+
+        stp             q0, q1, [dst], #32
         b               3f
 2:
         ldr             q0, [src], #16
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        str             q2, [dst], #16
+        urshr           v0.8h, v0.8h, #2
+        str             q0, [dst], #16
 3:
         subs            height, height, #1
         ldr             d0, [src], #8
-        uaddl           v3.4s, v0.4h, v16.4h
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v3.4h, v3.4s
-        st1             {v3.4h}, [dst], x7
+        urshr           v0.4h, v0.4h, #2
+        st1             {v0.4h}, [dst], x7
 
         add             src, src, src_stride
         b.ne            1b
@@ -462,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldrb            w10, [x12]
         ldrb            w11, [x12, #1]
         sxtw            x6, w6
-        movi            v30.8h, #(1 << (8 - 7))     // offset1
-        movi            v31.8h, #8                  // offset2
         dup             v2.8h, w10                  // filter_y[0]
         dup             v3.8h, w11                  // filter_y[1]
 
@@ -491,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v16.8h, v16.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
         mla             v16.8h, v17.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        add             v16.8h, v16.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
-        ushr            v7.8h, v16.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
+        urshr           v7.8h, v16.8h, #(8 - 6)
         stp             q6, q7, [x13], #32
 
         cbz             w10, 3f
@@ -504,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v17.8h, v17.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
         mla             v17.8h, v7.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        add             v17.8h, v17.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
-        ushr            v17.8h, v17.8h, #4
+        urshr           v16.8h, v16.8h, #4
+        urshr           v17.8h, v17.8h, #4
         stp             q16, q17, [x14], #32
         b               3f
 2:
@@ -518,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.8h, v6.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
         str             q6, [x13], #16
 
         cbz             w10, 3f
@@ -527,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             q16, [x12], #16
         mul             v16.8h, v16.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
+        urshr           v16.8h, v16.8h, #4
         str             q16, [x14], #16
 3:
         ldur            s5, [src, #1]
@@ -537,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.4h, v6.4h, v0.4h
         mla             v6.4h, v7.4h, v1.4h
-        add             v6.4h, v6.4h, v30.4h
-        ushr            v6.4h, v6.4h, #(8 - 6)
+        urshr           v6.4h, v6.4h, #(8 - 6)
         str             d6, [x13], #8
 
         cbz             w10, 4f
@@ -546,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             d16, [x12], #8
         mul             v16.4h, v16.4h, v2.4h
         mla             v16.4h, v6.4h, v3.4h
-        add             v16.4h, v16.4h, v31.4h
-        ushr            v16.4h, v16.4h, #4
+        urshr           v16.4h, v16.4h, #4
         str             d16, [x14], #8
 4:
         subs            height, height, #1
-- 
2.47.2

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-20  7:20 ` Zhao Zhili
  2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
  2025-03-01 22:21 ` Martin Storsjö
  2 siblings, 1 reply; 9+ messages in thread
From: Zhao Zhili @ 2025-02-20  7:20 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz



> On Feb 20, 2025, at 01:40, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
> 
> ---
> 
> This patch replaces the integer widening with a halving addition, and the
> multi-step "emulated" rounding shift with a single asm instruction that
> does exactly that. The same pattern repeats in other functions in this
> file; I fixed some of them in the succeeding patch. There's a lot of
> performance to be gained there.
> 
> I didn't modify the existing function because it adds a few extra steps
> solely for the shared w_avg implementation (every cycle matters), but also
> because I find this linear version easier to digest and understand.
> 
> As an aside, I noticed that removing the smin and smax instructions used
> for clamping the values in the 10 and 12 bit_depth instantiations does
> not affect the checkasm result, but it does break FATE.
> 
> Benchmarks before and after:
> A78
> avg_8_2x2_neon:                                         21.0 ( 1.55x)
> avg_8_4x4_neon:                                         25.8 ( 3.05x)
> avg_8_8x8_neon:                                         45.0 ( 5.86x)
> avg_8_16x16_neon:                                      178.5 ( 5.49x)
> avg_8_32x32_neon:                                      709.2 ( 6.20x)
> avg_8_64x64_neon:                                     2686.2 ( 6.12x)
> avg_8_128x128_neon:                                  10734.2 ( 5.88x)
> avg_10_2x2_neon:                                        19.0 ( 1.75x)
> avg_10_4x4_neon:                                        28.2 ( 2.76x)
> avg_10_8x8_neon:                                        44.0 ( 5.82x)
> avg_10_16x16_neon:                                     179.5 ( 4.81x)
> avg_10_32x32_neon:                                     680.8 ( 5.58x)
> avg_10_64x64_neon:                                    2536.8 ( 5.40x)
> avg_10_128x128_neon:                                 10079.0 ( 5.22x)
> avg_12_2x2_neon:                                        20.8 ( 1.59x)
> avg_12_4x4_neon:                                        25.2 ( 3.09x)
> avg_12_8x8_neon:                                        44.0 ( 5.79x)
> avg_12_16x16_neon:                                     182.2 ( 4.80x)
> avg_12_32x32_neon:                                     696.2 ( 5.46x)
> avg_12_64x64_neon:                                    2548.2 ( 5.38x)
> avg_12_128x128_neon:                                 10133.8 ( 5.19x)
> 
> avg_8_2x2_neon:                                         16.5 ( 1.98x)
> avg_8_4x4_neon:                                         26.2 ( 2.93x)
> avg_8_8x8_neon:                                         31.8 ( 8.55x)
> avg_8_16x16_neon:                                       82.0 (12.02x)
> avg_8_32x32_neon:                                      310.2 (14.12x)
> avg_8_64x64_neon:                                      897.8 (18.26x)
> avg_8_128x128_neon:                                   3608.5 (17.37x)
> avg_10_2x2_neon:                                        19.5 ( 1.69x)
> avg_10_4x4_neon:                                        28.0 ( 2.79x)
> avg_10_8x8_neon:                                        34.8 ( 7.32x)
> avg_10_16x16_neon:                                     119.8 ( 7.35x)
> avg_10_32x32_neon:                                     444.2 ( 8.51x)
> avg_10_64x64_neon:                                    1711.8 ( 8.00x)
> avg_10_128x128_neon:                                  7065.2 ( 7.43x)
> avg_12_2x2_neon:                                        19.5 ( 1.71x)
> avg_12_4x4_neon:                                        24.2 ( 3.22x)
> avg_12_8x8_neon:                                        33.8 ( 7.57x)
> avg_12_16x16_neon:                                     120.2 ( 7.33x)
> avg_12_32x32_neon:                                     442.5 ( 8.53x)
> avg_12_64x64_neon:                                    1706.2 ( 8.02x)
> avg_12_128x128_neon:                                  7010.0 ( 7.46x)
> 
> A72
> avg_8_2x2_neon:                                         30.2 ( 1.48x)
> avg_8_4x4_neon:                                         40.0 ( 3.10x)
> avg_8_8x8_neon:                                         91.0 ( 4.14x)
> avg_8_16x16_neon:                                      340.4 ( 3.92x)
> avg_8_32x32_neon:                                     1220.7 ( 4.67x)
> avg_8_64x64_neon:                                     5823.4 ( 3.88x)
> avg_8_128x128_neon:                                  17430.5 ( 4.73x)
> avg_10_2x2_neon:                                        34.0 ( 1.66x)
> avg_10_4x4_neon:                                        45.2 ( 2.73x)
> avg_10_8x8_neon:                                        97.5 ( 3.87x)
> avg_10_16x16_neon:                                     317.7 ( 3.90x)
> avg_10_32x32_neon:                                    1376.2 ( 4.21x)
> avg_10_64x64_neon:                                    5228.1 ( 3.71x)
> avg_10_128x128_neon:                                 16722.2 ( 4.17x)
> avg_12_2x2_neon:                                        31.7 ( 1.76x)
> avg_12_4x4_neon:                                        36.0 ( 3.44x)
> avg_12_8x8_neon:                                        91.7 ( 4.10x)
> avg_12_16x16_neon:                                     297.2 ( 4.13x)
> avg_12_32x32_neon:                                    1400.5 ( 4.14x)
> avg_12_64x64_neon:                                    5379.1 ( 3.51x)
> avg_12_128x128_neon:                                 16715.7 ( 4.17x)
> 
> avg_8_2x2_neon:                                         33.7 ( 1.72x)
> avg_8_4x4_neon:                                         45.5 ( 2.84x)
> avg_8_8x8_neon:                                         65.0 ( 5.98x)
> avg_8_16x16_neon:                                      171.0 ( 7.81x)
> avg_8_32x32_neon:                                      558.2 (10.05x)
> avg_8_64x64_neon:                                     2006.5 (10.61x)
> avg_8_128x128_neon:                                   9158.7 ( 8.96x)
> avg_10_2x2_neon:                                        38.0 ( 1.92x)
> avg_10_4x4_neon:                                        53.2 ( 2.69x)
> avg_10_8x8_neon:                                        95.2 ( 4.08x)
> avg_10_16x16_neon:                                     243.0 ( 5.02x)
> avg_10_32x32_neon:                                     891.7 ( 5.64x)
> avg_10_64x64_neon:                                    3357.7 ( 5.60x)
> avg_10_128x128_neon:                                 12411.7 ( 5.56x)
> avg_12_2x2_neon:                                        34.7 ( 1.97x)
> avg_12_4x4_neon:                                        53.2 ( 2.68x)
> avg_12_8x8_neon:                                        91.7 ( 4.22x)
> avg_12_16x16_neon:                                     239.0 ( 5.08x)
> avg_12_32x32_neon:                                     895.7 ( 5.62x)
> avg_12_64x64_neon:                                    3317.5 ( 5.67x)
> avg_12_128x128_neon:                                 12358.5 ( 5.58x)
> 
> 
> A53
> avg_8_2x2_neon:                                         58.3 ( 1.41x)
> avg_8_4x4_neon:                                        101.8 ( 2.21x)
> avg_8_8x8_neon:                                        178.6 ( 4.53x)
> avg_8_16x16_neon:                                      569.5 ( 5.01x)
> avg_8_32x32_neon:                                     1962.5 ( 5.50x)
> avg_8_64x64_neon:                                     8327.8 ( 5.18x)
> avg_8_128x128_neon:                                  31631.3 ( 5.34x)
> avg_10_2x2_neon:                                        54.5 ( 1.56x)
> avg_10_4x4_neon:                                        88.8 ( 2.53x)
> avg_10_8x8_neon:                                       163.6 ( 4.97x)
> avg_10_16x16_neon:                                     550.5 ( 5.16x)
> avg_10_32x32_neon:                                    1942.5 ( 5.64x)
> avg_10_64x64_neon:                                    8783.5 ( 4.98x)
> avg_10_128x128_neon:                                 32617.0 ( 5.25x)
> avg_12_2x2_neon:                                        53.3 ( 1.66x)
> avg_12_4x4_neon:                                        86.8 ( 2.61x)
> avg_12_8x8_neon:                                       156.6 ( 5.12x)
> avg_12_16x16_neon:                                     541.3 ( 5.25x)
> avg_12_32x32_neon:                                    1955.3 ( 5.59x)
> avg_12_64x64_neon:                                    8686.0 ( 5.06x)
> avg_12_128x128_neon:                                 32487.5 ( 5.25x)
> 
> avg_8_2x2_neon:                                         39.5 ( 1.96x)
> avg_8_4x4_neon:                                         65.3 ( 3.41x)
> avg_8_8x8_neon:                                        168.8 ( 4.79x)
> avg_8_16x16_neon:                                      348.0 ( 8.20x)
> avg_8_32x32_neon:                                     1207.5 ( 8.98x)
> avg_8_64x64_neon:                                     6032.3 ( 7.17x)
> avg_8_128x128_neon:                                  22008.5 ( 7.69x)
> avg_10_2x2_neon:                                        55.5 ( 1.52x)
> avg_10_4x4_neon:                                        73.8 ( 3.08x)
> avg_10_8x8_neon:                                       157.8 ( 5.12x)
> avg_10_16x16_neon:                                     445.0 ( 6.43x)
> avg_10_32x32_neon:                                    1587.3 ( 6.87x)
> avg_10_64x64_neon:                                    7738.0 ( 5.68x)
> avg_10_128x128_neon:                                 27813.8 ( 6.14x)
> avg_12_2x2_neon:                                        48.3 ( 1.80x)
> avg_12_4x4_neon:                                        77.0 ( 2.95x)
> avg_12_8x8_neon:                                       161.5 ( 4.98x)
> avg_12_16x16_neon:                                     433.5 ( 6.59x)
> avg_12_32x32_neon:                                    1622.0 ( 6.75x)
> avg_12_64x64_neon:                                    7844.5 ( 5.60x)
> avg_12_128x128_neon:                                 26999.5 ( 6.34x)
> 
> Krzysztof
> 
> libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++-
> 1 file changed, 121 insertions(+), 3 deletions(-)
> 
> diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
> index 0edc861f97..c9d698ee29 100644
> --- a/libavcodec/aarch64/vvc/inter.S
> +++ b/libavcodec/aarch64/vvc/inter.S
> @@ -217,13 +217,131 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1
> endfunc
> .endm
> 
> -vvc_avg avg, 8
> -vvc_avg avg, 10
> -vvc_avg avg, 12
> vvc_avg w_avg, 8
> vvc_avg w_avg, 10
> vvc_avg w_avg, 12
> 
> +.macro vvc_avg2 bit_depth
> +function ff_vvc_avg_\bit_depth\()_neon, export=1
> +        mov             x10, #(VVC_MAX_PB_SIZE * 2)
> +        movi            v16.8h, #0
> +        movi            v17.16b, #255
> +        ushr            v17.8h, v17.8h, #(16 - \bit_depth)

Please set v16 and v17 only for bit_depth > 8. LGTM otherwise.
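
Something along these lines around the constant setup (this is what the
v2 revision later in the thread ends up doing):

    .if \bit_depth != 8
            movi            v16.8h, #0
            movi            v17.16b, #255
            ushr            v17.8h, v17.8h, #(16 - \bit_depth)
    .endif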

> +
> +        cmp             w4, #8
> +        b.gt            16f
> +        b.eq            8f
> +        cmp             w4, #4
> +        b.eq            4f
> +
> +2: // width == 2
> +        ldr             s0, [x2]
> +        subs            w5, w5, #1
> +        ldr             s1, [x3]
> +.if \bit_depth == 8
> +        shadd           v0.4h, v0.4h, v1.4h
> +        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
> +        str             h0, [x0]
> +.else
> +        shadd           v0.4h, v0.4h, v1.4h
> +        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
> +        smax            v0.4h, v0.4h, v16.4h
> +        smin            v0.4h, v0.4h, v17.4h
> +        str             s0, [x0]
> +.endif
> +        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
> +        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
> +        add             x0, x0, x1
> +        b.ne            2b
> +        ret
> +
> +4: // width == 4
> +        ldr             d0, [x2]
> +        subs            w5, w5, #1
> +        ldr             d1, [x3]
> +.if \bit_depth == 8
> +        shadd           v0.4h, v0.4h, v1.4h
> +        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
> +        str             s0, [x0]
> +.else
> +        shadd           v0.4h, v0.4h, v1.4h
> +        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
> +        smax            v0.4h, v0.4h, v16.4h
> +        smin            v0.4h, v0.4h, v17.4h
> +        str             d0, [x0]
> +.endif
> +        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
> +        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
> +        add             x0, x0, x1
> +        b.ne            4b
> +        ret
> +
> +8: // width == 8
> +        ldr             q0, [x2]
> +        subs            w5, w5, #1
> +        ldr             q1, [x3]
> +.if \bit_depth == 8
> +        shadd           v0.8h, v0.8h, v1.8h
> +        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
> +        str             d0, [x0]
> +.else
> +        shadd           v0.8h, v0.8h, v1.8h
> +        srshr           v0.8h, v0.8h, #(15 - 1 - \bit_depth)
> +        smax            v0.8h, v0.8h, v16.8h
> +        smin            v0.8h, v0.8h, v17.8h
> +        str             q0, [x0]
> +.endif
> +        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
> +        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
> +        add             x0, x0, x1
> +        b.ne            8b
> +        ret
> +
> +16: // width >= 16
> +.if \bit_depth == 8
> +        sub             x1, x1, w4, sxtw
> +.else
> +        sub             x1, x1, w4, sxtw #1
> +.endif
> +        sub             x10, x10, w4, sxtw #1
> +3:
> +        mov             w6, w4 // width
> +1:
> +        ldp             q0, q1, [x2], #32
> +        subs            w6, w6, #16
> +        ldp             q2, q3, [x3], #32
> +.if \bit_depth == 8
> +        shadd           v4.8h, v0.8h, v2.8h
> +        shadd           v5.8h, v1.8h, v3.8h
> +        sqrshrun        v0.8b, v4.8h, #6
> +        sqrshrun2       v0.16b, v5.8h, #6
> +        st1             {v0.16b}, [x0], #16
> +.else
> +        shadd           v4.8h, v0.8h, v2.8h
> +        shadd           v5.8h, v1.8h, v3.8h
> +        srshr           v0.8h, v4.8h, #(15 - 1 - \bit_depth)
> +        srshr           v1.8h, v5.8h, #(15 - 1 - \bit_depth)
> +        smax            v0.8h, v0.8h, v16.8h
> +        smax            v1.8h, v1.8h, v16.8h
> +        smin            v0.8h, v0.8h, v17.8h
> +        smin            v1.8h, v1.8h, v17.8h
> +        stp             q0, q1, [x0], #32
> +.endif
> +        b.ne            1b
> +
> +        subs            w5, w5, #1
> +        add             x2, x2, x10
> +        add             x3, x3, x10
> +        add             x0, x0, x1
> +        b.ne            3b
> +        ret
> +endfunc
> +.endm
> +
> +vvc_avg2 8
> +vvc_avg2 10
> +vvc_avg2 12
> +
> /* x0: int16_t *dst
>  * x1: const uint8_t *_src
>  * x2: ptrdiff_t _src_stride
> -- 
> 2.47.2
> 
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-20  8:08   ` Zhao Zhili
  2025-03-01 22:34   ` Martin Storsjö
  1 sibling, 0 replies; 9+ messages in thread
From: Zhao Zhili @ 2025-02-20  8:08 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz



> On Feb 20, 2025, at 01:40, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
> 
> ---
> 
> Before and after on A78
> 
> dmvr_8_12x20_neon:                                      86.2 ( 6.90x)
> dmvr_8_20x12_neon:                                      94.8 ( 5.93x)
> dmvr_8_20x20_neon:                                     141.5 ( 6.50x)
> dmvr_12_12x20_neon:                                    158.0 ( 3.76x)
> dmvr_12_20x12_neon:                                    151.2 ( 3.73x)
> dmvr_12_20x20_neon:                                    247.2 ( 3.71x)
> dmvr_hv_8_12x20_neon:                                  423.2 ( 3.75x)
> dmvr_hv_8_20x12_neon:                                  434.0 ( 3.69x)
> dmvr_hv_8_20x20_neon:                                  706.0 ( 3.69x)
> 
> dmvr_8_12x20_neon:                                      77.2 ( 7.70x)
> dmvr_8_20x12_neon:                                      66.5 ( 8.49x)
> dmvr_8_20x20_neon:                                      92.2 ( 9.90x)
> dmvr_12_12x20_neon:                                     80.2 ( 7.38x)
> dmvr_12_20x12_neon:                                     58.2 ( 9.59x)
> dmvr_12_20x20_neon:                                     90.0 (10.15x)
> dmvr_hv_8_12x20_neon:                                  369.0 ( 4.34x)
> dmvr_hv_8_20x12_neon:                                  355.8 ( 4.49x)
> dmvr_hv_8_20x20_neon:                                  574.2 ( 4.51x)
> 
> libavcodec/aarch64/vvc/inter.S | 72 ++++++++++------------------------
> 1 file changed, 20 insertions(+), 52 deletions(-)
> 
> diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
> index c9d698ee29..45add44b6e 100644
> --- a/libavcodec/aarch64/vvc/inter.S
> +++ b/libavcodec/aarch64/vvc/inter.S
> @@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1
> 1:
>         cbz             w15, 2f
>         ldr             q0, [src], #16
> -        uxtl            v1.8h, v0.8b
> -        uxtl2           v2.8h, v0.16b
> -        ushl            v1.8h, v1.8h, v16.8h
> -        ushl            v2.8h, v2.8h, v16.8h

Please remove the assignment to v16. LGTM otherwise.
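
That is, the setup near the top of ff_vvc_dmvr_8_neon is dead once ushll
encodes the shift; the v2 later in the thread drops this line:

    movi            v16.8h, #2                  // DMVR_SHIFT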

> +        ushll           v1.8h, v0.8b, #2
> +        ushll2          v2.8h, v0.16b, #2
>         stp             q1, q2, [dst], #32
>         b               3f
> 2:
>         ldr             d0, [src], #8
> -        uxtl            v1.8h, v0.8b
> -        ushl            v1.8h, v1.8h, v16.8h
> +        ushll           v1.8h, v0.8b, #2
>         str             q1, [dst], #16
> 3:
>         subs            height, height, #1
>         ldr             s3, [src], #4
> -        uxtl            v4.8h, v3.8b
> -        ushl            v4.4h, v4.4h, v16.4h
> +        ushll           v4.8h, v3.8b, #2
>         st1             {v4.4h}, [dst], x7
> 
>         add             src, src, src_stride
> @@ -399,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1
>         cmp             width, #16
>         sub             src_stride, src_stride, x6, lsl #1
>         cset            w15, gt                     // width > 16
> -        movi            v16.8h, #2                  // offset4
>         sub             x7, x7, x6, lsl #1
> 1:
>         cbz             w15, 2f
>         ldp             q0, q1, [src], #32
> -        uaddl           v2.4s, v0.4h, v16.4h
> -        uaddl2          v3.4s, v0.8h, v16.8h
> -        uaddl           v4.4s, v1.4h, v16.4h
> -        uaddl2          v5.4s, v1.8h, v16.8h
> -        ushr            v2.4s, v2.4s, #2
> -        ushr            v3.4s, v3.4s, #2
> -        ushr            v4.4s, v4.4s, #2
> -        ushr            v5.4s, v5.4s, #2
> -        uqxtn           v2.4h, v2.4s
> -        uqxtn2          v2.8h, v3.4s
> -        uqxtn           v4.4h, v4.4s
> -        uqxtn2          v4.8h, v5.4s
> -
> -        stp             q2, q4, [dst], #32
> +        urshr           v0.8h, v0.8h, #2
> +        urshr           v1.8h, v1.8h, #2
> +
> +        stp             q0, q1, [dst], #32
>         b               3f
> 2:
>         ldr             q0, [src], #16
> -        uaddl           v2.4s, v0.4h, v16.4h
> -        uaddl2          v3.4s, v0.8h, v16.8h
> -        ushr            v2.4s, v2.4s, #2
> -        ushr            v3.4s, v3.4s, #2
> -        uqxtn           v2.4h, v2.4s
> -        uqxtn2          v2.8h, v3.4s
> -        str             q2, [dst], #16
> +        urshr           v0.8h, v0.8h, #2
> +        str             q0, [dst], #16
> 3:
>         subs            height, height, #1
>         ldr             d0, [src], #8
> -        uaddl           v3.4s, v0.4h, v16.4h
> -        ushr            v3.4s, v3.4s, #2
> -        uqxtn           v3.4h, v3.4s
> -        st1             {v3.4h}, [dst], x7
> +        urshr           v0.4h, v0.4h, #2
> +        st1             {v0.4h}, [dst], x7
> 
>         add             src, src, src_stride
>         b.ne            1b
> @@ -462,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1
>         ldrb            w10, [x12]
>         ldrb            w11, [x12, #1]
>         sxtw            x6, w6
> -        movi            v30.8h, #(1 << (8 - 7))     // offset1
> -        movi            v31.8h, #8                  // offset2
>         dup             v2.8h, w10                  // filter_y[0]
>         dup             v3.8h, w11                  // filter_y[1]
> 
> @@ -491,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
>         mul             v16.8h, v16.8h, v0.8h
>         mla             v6.8h, v7.8h, v1.8h
>         mla             v16.8h, v17.8h, v1.8h
> -        add             v6.8h, v6.8h, v30.8h
> -        add             v16.8h, v16.8h, v30.8h
> -        ushr            v6.8h, v6.8h, #(8 - 6)
> -        ushr            v7.8h, v16.8h, #(8 - 6)
> +        urshr           v6.8h, v6.8h, #(8 - 6)
> +        urshr           v7.8h, v16.8h, #(8 - 6)
>         stp             q6, q7, [x13], #32
> 
>         cbz             w10, 3f
> @@ -504,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
>         mul             v17.8h, v17.8h, v2.8h
>         mla             v16.8h, v6.8h, v3.8h
>         mla             v17.8h, v7.8h, v3.8h
> -        add             v16.8h, v16.8h, v31.8h
> -        add             v17.8h, v17.8h, v31.8h
> -        ushr            v16.8h, v16.8h, #4
> -        ushr            v17.8h, v17.8h, #4
> +        urshr           v16.8h, v16.8h, #4
> +        urshr           v17.8h, v17.8h, #4
>         stp             q16, q17, [x14], #32
>         b               3f
> 2:
> @@ -518,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
>         uxtl            v6.8h, v4.8b
>         mul             v6.8h, v6.8h, v0.8h
>         mla             v6.8h, v7.8h, v1.8h
> -        add             v6.8h, v6.8h, v30.8h
> -        ushr            v6.8h, v6.8h, #(8 - 6)
> +        urshr           v6.8h, v6.8h, #(8 - 6)
>         str             q6, [x13], #16
> 
>         cbz             w10, 3f
> @@ -527,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
>         ldr             q16, [x12], #16
>         mul             v16.8h, v16.8h, v2.8h
>         mla             v16.8h, v6.8h, v3.8h
> -        add             v16.8h, v16.8h, v31.8h
> -        ushr            v16.8h, v16.8h, #4
> +        urshr           v16.8h, v16.8h, #4
>         str             q16, [x14], #16
> 3:
>         ldur            s5, [src, #1]
> @@ -537,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
>         uxtl            v6.8h, v4.8b
>         mul             v6.4h, v6.4h, v0.4h
>         mla             v6.4h, v7.4h, v1.4h
> -        add             v6.4h, v6.4h, v30.4h
> -        ushr            v6.4h, v6.4h, #(8 - 6)
> +        urshr           v6.4h, v6.4h, #(8 - 6)
>         str             d6, [x13], #8
> 
>         cbz             w10, 4f
> @@ -546,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
>         ldr             d16, [x12], #8
>         mul             v16.4h, v16.4h, v2.4h
>         mla             v16.4h, v6.4h, v3.4h
> -        add             v16.4h, v16.4h, v31.4h
> -        ushr            v16.4h, v16.4h, #4
> +        urshr           v16.4h, v16.4h, #4
>         str             d16, [x14], #8
> 4:
>         subs            height, height, #1
> -- 
> 2.47.2
> 
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
> 
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
@ 2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-20 18:49     ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-26  8:54     ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
  0 siblings, 2 replies; 9+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-20 18:49 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---
 libavcodec/aarch64/vvc/inter.S | 125 ++++++++++++++++++++++++++++++++-
 1 file changed, 122 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 0edc861f97..b65920e640 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -217,13 +217,132 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1
 endfunc
 .endm
 
-vvc_avg avg, 8
-vvc_avg avg, 10
-vvc_avg avg, 12
 vvc_avg w_avg, 8
 vvc_avg w_avg, 10
 vvc_avg w_avg, 12
 
+.macro vvc_avg2 bit_depth
+function ff_vvc_avg_\bit_depth\()_neon, export=1
+        mov             x10, #(VVC_MAX_PB_SIZE * 2)
+.if \bit_depth != 8
+        movi            v16.8h, #0
+        movi            v17.16b, #255
+        ushr            v17.8h, v17.8h, #(16 - \bit_depth)
+.endif
+        cmp             w4, #8
+        b.gt            16f
+        b.eq            8f
+        cmp             w4, #4
+        b.eq            4f
+
+2: // width == 2
+        ldr             s0, [x2]
+        subs            w5, w5, #1
+        ldr             s1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             h0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             s0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            2b
+        ret
+
+4: // width == 4
+        ldr             d0, [x2]
+        subs            w5, w5, #1
+        ldr             d1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             s0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             d0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            4b
+        ret
+
+8: // width == 8
+        ldr             q0, [x2]
+        subs            w5, w5, #1
+        ldr             q1, [x3]
+.if \bit_depth == 8
+        shadd           v0.8h, v0.8h, v1.8h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             d0, [x0]
+.else
+        shadd           v0.8h, v0.8h, v1.8h
+        srshr           v0.8h, v0.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        str             q0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            8b
+        ret
+
+16: // width >= 16
+.if \bit_depth == 8
+        sub             x1, x1, w4, sxtw
+.else
+        sub             x1, x1, w4, sxtw #1
+.endif
+        sub             x10, x10, w4, sxtw #1
+3:
+        mov             w6, w4 // width
+1:
+        ldp             q0, q1, [x2], #32
+        subs            w6, w6, #16
+        ldp             q2, q3, [x3], #32
+.if \bit_depth == 8
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        sqrshrun        v0.8b, v4.8h, #6
+        sqrshrun2       v0.16b, v5.8h, #6
+        st1             {v0.16b}, [x0], #16
+.else
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        srshr           v0.8h, v4.8h, #(15 - 1 - \bit_depth)
+        srshr           v1.8h, v5.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smax            v1.8h, v1.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        smin            v1.8h, v1.8h, v17.8h
+        stp             q0, q1, [x0], #32
+.endif
+        b.ne            1b
+
+        subs            w5, w5, #1
+        add             x2, x2, x10
+        add             x3, x3, x10
+        add             x0, x0, x1
+        b.ne            3b
+        ret
+endfunc
+.endm
+
+vvc_avg2 8
+vvc_avg2 10
+vvc_avg2 12
+
 /* x0: int16_t *dst
  * x1: const uint8_t *_src
  * x2: ptrdiff_t _src_stride
-- 
2.47.2

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-20 18:49     ` Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-26  8:54     ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
  1 sibling, 0 replies; 9+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-20 18:49 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---
 libavcodec/aarch64/vvc/inter.S | 73 ++++++++++------------------------
 1 file changed, 20 insertions(+), 53 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index b65920e640..09f0627b20 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -365,27 +365,22 @@ function ff_vvc_dmvr_8_neon, export=1
         cmp             width, #16
         sub             src_stride, src_stride, x6
         cset            w15, gt                     // width > 16
-        movi            v16.8h, #2                  // DMVR_SHIFT
         sub             x7, x7, x6, lsl #1
 1:
         cbz             w15, 2f
         ldr             q0, [src], #16
-        uxtl            v1.8h, v0.8b
-        uxtl2           v2.8h, v0.16b
-        ushl            v1.8h, v1.8h, v16.8h
-        ushl            v2.8h, v2.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
+        ushll2          v2.8h, v0.16b, #2
         stp             q1, q2, [dst], #32
         b               3f
 2:
         ldr             d0, [src], #8
-        uxtl            v1.8h, v0.8b
-        ushl            v1.8h, v1.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
         str             q1, [dst], #16
 3:
         subs            height, height, #1
         ldr             s3, [src], #4
-        uxtl            v4.8h, v3.8b
-        ushl            v4.4h, v4.4h, v16.4h
+        ushll           v4.8h, v3.8b, #2
         st1             {v4.4h}, [dst], x7
 
         add             src, src, src_stride
@@ -400,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1
         cmp             width, #16
         sub             src_stride, src_stride, x6, lsl #1
         cset            w15, gt                     // width > 16
-        movi            v16.8h, #2                  // offset4
         sub             x7, x7, x6, lsl #1
 1:
         cbz             w15, 2f
         ldp             q0, q1, [src], #32
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        uaddl           v4.4s, v1.4h, v16.4h
-        uaddl2          v5.4s, v1.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        ushr            v4.4s, v4.4s, #2
-        ushr            v5.4s, v5.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        uqxtn           v4.4h, v4.4s
-        uqxtn2          v4.8h, v5.4s
-
-        stp             q2, q4, [dst], #32
+        urshr           v0.8h, v0.8h, #2
+        urshr           v1.8h, v1.8h, #2
+
+        stp             q0, q1, [dst], #32
         b               3f
 2:
         ldr             q0, [src], #16
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        str             q2, [dst], #16
+        urshr           v0.8h, v0.8h, #2
+        str             q0, [dst], #16
 3:
         subs            height, height, #1
         ldr             d0, [src], #8
-        uaddl           v3.4s, v0.4h, v16.4h
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v3.4h, v3.4s
-        st1             {v3.4h}, [dst], x7
+        urshr           v0.4h, v0.4h, #2
+        st1             {v0.4h}, [dst], x7
 
         add             src, src, src_stride
         b.ne            1b
@@ -463,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldrb            w10, [x12]
         ldrb            w11, [x12, #1]
         sxtw            x6, w6
-        movi            v30.8h, #(1 << (8 - 7))     // offset1
-        movi            v31.8h, #8                  // offset2
         dup             v2.8h, w10                  // filter_y[0]
         dup             v3.8h, w11                  // filter_y[1]
 
@@ -492,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v16.8h, v16.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
         mla             v16.8h, v17.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        add             v16.8h, v16.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
-        ushr            v7.8h, v16.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
+        urshr           v7.8h, v16.8h, #(8 - 6)
         stp             q6, q7, [x13], #32
 
         cbz             w10, 3f
@@ -505,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v17.8h, v17.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
         mla             v17.8h, v7.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        add             v17.8h, v17.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
-        ushr            v17.8h, v17.8h, #4
+        urshr           v16.8h, v16.8h, #4
+        urshr           v17.8h, v17.8h, #4
         stp             q16, q17, [x14], #32
         b               3f
 2:
@@ -519,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.8h, v6.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
         str             q6, [x13], #16
 
         cbz             w10, 3f
@@ -528,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             q16, [x12], #16
         mul             v16.8h, v16.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
+        urshr           v16.8h, v16.8h, #4
         str             q16, [x14], #16
 3:
         ldur            s5, [src, #1]
@@ -538,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.4h, v6.4h, v0.4h
         mla             v6.4h, v7.4h, v1.4h
-        add             v6.4h, v6.4h, v30.4h
-        ushr            v6.4h, v6.4h, #(8 - 6)
+        urshr           v6.4h, v6.4h, #(8 - 6)
         str             d6, [x13], #8
 
         cbz             w10, 4f
@@ -547,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             d16, [x12], #8
         mul             v16.4h, v16.4h, v2.4h
         mla             v16.4h, v6.4h, v3.4h
-        add             v16.4h, v16.4h, v31.4h
-        ushr            v16.4h, v16.4h, #4
+        urshr           v16.4h, v16.4h, #4
         str             d16, [x14], #8
 4:
         subs            height, height, #1
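
The whole conversion rests on one identity: the add-an-offset-then-shift
sequences being removed compute exactly what a single rounding shift does
per lane. A hedged scalar sketch (names are illustrative):

#include <stdint.h>

/* What the removed uaddl/add + ushr (+ uqxtn) sequences compute, and what
 * URSHR #n does in one instruction per lane: round-to-nearest shift right.
 * Illustrative only; for n >= 1 the rounded result always fits back into
 * 16 bits, which is why the explicit widening and saturating-narrow steps
 * can be dropped. */
static inline uint16_t round_shift_u16(uint16_t x, unsigned n)
{
    uint32_t wide = (uint32_t)x + (1u << (n - 1));  /* offset = 1 << (n - 1) */
    return (uint16_t)(wide >> n);
}
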
-- 
2.47.2


* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-20 18:49     ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-26  8:54     ` Zhao Zhili
  1 sibling, 0 replies; 9+ messages in thread
From: Zhao Zhili @ 2025-02-26  8:54 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz



> On Feb 21, 2025, at 02:49, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
> 
> ---
> libavcodec/aarch64/vvc/inter.S | 125 ++++++++++++++++++++++++++++++++-
> 1 file changed, 122 insertions(+), 3 deletions(-)
> 

The patchset LGTM.


* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
@ 2025-03-01 22:21 ` Martin Storsjö
  2 siblings, 0 replies; 9+ messages in thread
From: Martin Storsjö @ 2025-03-01 22:21 UTC (permalink / raw)
  To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz

On Wed, 19 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---

As you've noticed in later patches, most of this commentary _is_ valuable 
to keep in the commit message, so I'd keep most of it, including the 
performance diff, in the commit message (i.e. above the ---).

> This patch replaces integer widening with halving addition, and
> multi-step "emulated" rounding shift with a single asm instruction doing
> exactly that. This pattern repeats in other functions in this file, I
> fixed some in the succeeding patch. There's a lot of performance to be
> gained there.
>
> I didn't modify the existing function because it adds a few extra steps
> solely for the shared w_avg implementation (every cycle matters), but also
> because I find this linear version easier to digest and understand.

That's probably reasonable - but if the avg codepath in vvc_avg is unused 
now, we should remove it; the change is clearer when the removed old 
codepath and the newly added one appear in the same patch.

> Besides, I noticed that removing smin and smax instructions used for
> clamping the values for 10 and 12 bit_depth instantiations does not
> affect the checkasm result, but it breaks FATE.

It would probably be good if we could improve the checkasm to hit those 
cases too, but that's of course a separate question.
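
To make that concrete (hedged, made-up numbers): the clamp only changes the
result when the halved-and-rounded sum lands outside the pixel range, which
test inputs confined to in-range values never trigger:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* Hypothetical 10-bit case, illustrative values only (not from checkasm). */
    int16_t a = 32000, b = 32000;       /* large intermediate prediction samples */
    int v = (a + b) >> 1;               /* shadd                  -> 32000       */
    v = (v + 8) >> 4;                   /* srshr #(15 - 1 - 10)   ->  2000       */
    int clamped = v < 0 ? 0 : v > 1023 ? 1023 : v;      /* smax/smin             */
    printf("unclamped %d, clamped %d\n", v, clamped);   /* 2000 vs 1023          */
    return 0;
}
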

>
> Benchmarks before and after:
> A78
> avg_8_2x2_neon:                                         21.0 ( 1.55x)
> avg_8_4x4_neon:                                         25.8 ( 3.05x)
> avg_8_8x8_neon:                                         45.0 ( 5.86x)
> avg_8_16x16_neon:                                      178.5 ( 5.49x)
> avg_8_32x32_neon:                                      709.2 ( 6.20x)
> avg_8_64x64_neon:                                     2686.2 ( 6.12x)
> avg_8_128x128_neon:                                  10734.2 ( 5.88x)
> avg_10_2x2_neon:                                        19.0 ( 1.75x)
> avg_10_4x4_neon:                                        28.2 ( 2.76x)
> avg_10_8x8_neon:                                        44.0 ( 5.82x)
> avg_10_16x16_neon:                                     179.5 ( 4.81x)
> avg_10_32x32_neon:                                     680.8 ( 5.58x)
> avg_10_64x64_neon:                                    2536.8 ( 5.40x)
> avg_10_128x128_neon:                                 10079.0 ( 5.22x)
> avg_12_2x2_neon:                                        20.8 ( 1.59x)
> avg_12_4x4_neon:                                        25.2 ( 3.09x)
> avg_12_8x8_neon:                                        44.0 ( 5.79x)
> avg_12_16x16_neon:                                     182.2 ( 4.80x)
> avg_12_32x32_neon:                                     696.2 ( 5.46x)
> avg_12_64x64_neon:                                    2548.2 ( 5.38x)
> avg_12_128x128_neon:                                 10133.8 ( 5.19x)
>
> avg_8_2x2_neon:                                         16.5 ( 1.98x)
> avg_8_4x4_neon:                                         26.2 ( 2.93x)
> avg_8_8x8_neon:                                         31.8 ( 8.55x)
> avg_8_16x16_neon:                                       82.0 (12.02x)
> avg_8_32x32_neon:                                      310.2 (14.12x)
> avg_8_64x64_neon:                                      897.8 (18.26x)
> avg_8_128x128_neon:                                   3608.5 (17.37x)
> avg_10_2x2_neon:                                        19.5 ( 1.69x)
> avg_10_4x4_neon:                                        28.0 ( 2.79x)
> avg_10_8x8_neon:                                        34.8 ( 7.32x)
> avg_10_16x16_neon:                                     119.8 ( 7.35x)
> avg_10_32x32_neon:                                     444.2 ( 8.51x)
> avg_10_64x64_neon:                                    1711.8 ( 8.00x)
> avg_10_128x128_neon:                                  7065.2 ( 7.43x)
> avg_12_2x2_neon:                                        19.5 ( 1.71x)
> avg_12_4x4_neon:                                        24.2 ( 3.22x)
> avg_12_8x8_neon:                                        33.8 ( 7.57x)
> avg_12_16x16_neon:                                     120.2 ( 7.33x)
> avg_12_32x32_neon:                                     442.5 ( 8.53x)
> avg_12_64x64_neon:                                    1706.2 ( 8.02x)
> avg_12_128x128_neon:                                  7010.0 ( 7.46x)
>
> A72
> avg_8_2x2_neon:                                         30.2 ( 1.48x)
> avg_8_4x4_neon:                                         40.0 ( 3.10x)
> avg_8_8x8_neon:                                         91.0 ( 4.14x)
> avg_8_16x16_neon:                                      340.4 ( 3.92x)
> avg_8_32x32_neon:                                     1220.7 ( 4.67x)
> avg_8_64x64_neon:                                     5823.4 ( 3.88x)
> avg_8_128x128_neon:                                  17430.5 ( 4.73x)
> avg_10_2x2_neon:                                        34.0 ( 1.66x)
> avg_10_4x4_neon:                                        45.2 ( 2.73x)
> avg_10_8x8_neon:                                        97.5 ( 3.87x)
> avg_10_16x16_neon:                                     317.7 ( 3.90x)
> avg_10_32x32_neon:                                    1376.2 ( 4.21x)
> avg_10_64x64_neon:                                    5228.1 ( 3.71x)
> avg_10_128x128_neon:                                 16722.2 ( 4.17x)
> avg_12_2x2_neon:                                        31.7 ( 1.76x)
> avg_12_4x4_neon:                                        36.0 ( 3.44x)
> avg_12_8x8_neon:                                        91.7 ( 4.10x)
> avg_12_16x16_neon:                                     297.2 ( 4.13x)
> avg_12_32x32_neon:                                    1400.5 ( 4.14x)
> avg_12_64x64_neon:                                    5379.1 ( 3.51x)
> avg_12_128x128_neon:                                 16715.7 ( 4.17x)
>
> avg_8_2x2_neon:                                         33.7 ( 1.72x)
> avg_8_4x4_neon:                                         45.5 ( 2.84x)
> avg_8_8x8_neon:                                         65.0 ( 5.98x)
> avg_8_16x16_neon:                                      171.0 ( 7.81x)
> avg_8_32x32_neon:                                      558.2 (10.05x)
> avg_8_64x64_neon:                                     2006.5 (10.61x)
> avg_8_128x128_neon:                                   9158.7 ( 8.96x)
> avg_10_2x2_neon:                                        38.0 ( 1.92x)
> avg_10_4x4_neon:                                        53.2 ( 2.69x)
> avg_10_8x8_neon:                                        95.2 ( 4.08x)
> avg_10_16x16_neon:                                     243.0 ( 5.02x)
> avg_10_32x32_neon:                                     891.7 ( 5.64x)
> avg_10_64x64_neon:                                    3357.7 ( 5.60x)
> avg_10_128x128_neon:                                 12411.7 ( 5.56x)
> avg_12_2x2_neon:                                        34.7 ( 1.97x)
> avg_12_4x4_neon:                                        53.2 ( 2.68x)
> avg_12_8x8_neon:                                        91.7 ( 4.22x)
> avg_12_16x16_neon:                                     239.0 ( 5.08x)
> avg_12_32x32_neon:                                     895.7 ( 5.62x)
> avg_12_64x64_neon:                                    3317.5 ( 5.67x)
> avg_12_128x128_neon:                                 12358.5 ( 5.58x)
>
>
> A53
> avg_8_2x2_neon:                                         58.3 ( 1.41x)
> avg_8_4x4_neon:                                        101.8 ( 2.21x)
> avg_8_8x8_neon:                                        178.6 ( 4.53x)
> avg_8_16x16_neon:                                      569.5 ( 5.01x)
> avg_8_32x32_neon:                                     1962.5 ( 5.50x)
> avg_8_64x64_neon:                                     8327.8 ( 5.18x)
> avg_8_128x128_neon:                                  31631.3 ( 5.34x)
> avg_10_2x2_neon:                                        54.5 ( 1.56x)
> avg_10_4x4_neon:                                        88.8 ( 2.53x)
> avg_10_8x8_neon:                                       163.6 ( 4.97x)
> avg_10_16x16_neon:                                     550.5 ( 5.16x)
> avg_10_32x32_neon:                                    1942.5 ( 5.64x)
> avg_10_64x64_neon:                                    8783.5 ( 4.98x)
> avg_10_128x128_neon:                                 32617.0 ( 5.25x)
> avg_12_2x2_neon:                                        53.3 ( 1.66x)
> avg_12_4x4_neon:                                        86.8 ( 2.61x)
> avg_12_8x8_neon:                                       156.6 ( 5.12x)
> avg_12_16x16_neon:                                     541.3 ( 5.25x)
> avg_12_32x32_neon:                                    1955.3 ( 5.59x)
> avg_12_64x64_neon:                                    8686.0 ( 5.06x)
> avg_12_128x128_neon:                                 32487.5 ( 5.25x)
>
> avg_8_2x2_neon:                                         39.5 ( 1.96x)
> avg_8_4x4_neon:                                         65.3 ( 3.41x)
> avg_8_8x8_neon:                                        168.8 ( 4.79x)
> avg_8_16x16_neon:                                      348.0 ( 8.20x)
> avg_8_32x32_neon:                                     1207.5 ( 8.98x)
> avg_8_64x64_neon:                                     6032.3 ( 7.17x)
> avg_8_128x128_neon:                                  22008.5 ( 7.69x)
> avg_10_2x2_neon:                                        55.5 ( 1.52x)
> avg_10_4x4_neon:                                        73.8 ( 3.08x)
> avg_10_8x8_neon:                                       157.8 ( 5.12x)
> avg_10_16x16_neon:                                     445.0 ( 6.43x)
> avg_10_32x32_neon:                                    1587.3 ( 6.87x)
> avg_10_64x64_neon:                                    7738.0 ( 5.68x)
> avg_10_128x128_neon:                                 27813.8 ( 6.14x)
> avg_12_2x2_neon:                                        48.3 ( 1.80x)
> avg_12_4x4_neon:                                        77.0 ( 2.95x)
> avg_12_8x8_neon:                                       161.5 ( 4.98x)
> avg_12_16x16_neon:                                     433.5 ( 6.59x)
> avg_12_32x32_neon:                                    1622.0 ( 6.75x)
> avg_12_64x64_neon:                                    7844.5 ( 5.60x)
> avg_12_128x128_neon:                                 26999.5 ( 6.34x)
>
> Krzysztof
>
> libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++-
> 1 file changed, 121 insertions(+), 3 deletions(-)

Overall the change looks reasonable to me, thanks, but remove the now 
unused parts and update the patch to include the valuable comments and 
benchmarks above the "---" bit.

// Martin


* Re: [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-20  8:08   ` Zhao Zhili
@ 2025-03-01 22:34   ` Martin Storsjö
  1 sibling, 0 replies; 9+ messages in thread
From: Martin Storsjö @ 2025-03-01 22:34 UTC (permalink / raw)
  To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz

On Wed, 19 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---
>
> Before and after on A78
>
> dmvr_8_12x20_neon:                                      86.2 ( 6.90x)
> dmvr_8_20x12_neon:                                      94.8 ( 5.93x)
> dmvr_8_20x20_neon:                                     141.5 ( 6.50x)
> dmvr_12_12x20_neon:                                    158.0 ( 3.76x)
> dmvr_12_20x12_neon:                                    151.2 ( 3.73x)
> dmvr_12_20x20_neon:                                    247.2 ( 3.71x)
> dmvr_hv_8_12x20_neon:                                  423.2 ( 3.75x)
> dmvr_hv_8_20x12_neon:                                  434.0 ( 3.69x)
> dmvr_hv_8_20x20_neon:                                  706.0 ( 3.69x)
>
> dmvr_8_12x20_neon:                                      77.2 ( 7.70x)
> dmvr_8_20x12_neon:                                      66.5 ( 8.49x)
> dmvr_8_20x20_neon:                                      92.2 ( 9.90x)
> dmvr_12_12x20_neon:                                     80.2 ( 7.38x)
> dmvr_12_20x12_neon:                                     58.2 ( 9.59x)
> dmvr_12_20x20_neon:                                     90.0 (10.15x)
> dmvr_hv_8_12x20_neon:                                  369.0 ( 4.34x)
> dmvr_hv_8_20x12_neon:                                  355.8 ( 4.49x)
> dmvr_hv_8_20x20_neon:                                  574.2 ( 4.51x)
>
> libavcodec/aarch64/vvc/inter.S | 72 ++++++++++------------------------
> 1 file changed, 20 insertions(+), 52 deletions(-)
>
> diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
> index c9d698ee29..45add44b6e 100644
> --- a/libavcodec/aarch64/vvc/inter.S
> +++ b/libavcodec/aarch64/vvc/inter.S
> @@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1
> 1:
>         cbz             w15, 2f
>         ldr             q0, [src], #16
> -        uxtl            v1.8h, v0.8b
> -        uxtl2           v2.8h, v0.16b
> -        ushl            v1.8h, v1.8h, v16.8h
> -        ushl            v2.8h, v2.8h, v16.8h
> +        ushll           v1.8h, v0.8b, #2
> +        ushll2          v2.8h, v0.16b, #2

In addition to what's mentioned in the commit message, this bit is a 
semantically different change, so we should probably mention that in the 
commit message as well. If you're reposting patch 1/2 of this set, can you 
update the commit message on this one to mention this (and move the 
measurements into the actual commit message)?
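
For what it's worth, the computed result is the same for that hunk - a
hedged scalar sketch of what both the old uxtl + ushl pair and the new
ushll #2 produce per input byte (the fusion just saves an instruction and
the vector register that previously held the DMVR shift amount):

#include <stdint.h>

/* Hedged sketch: both instruction sequences zero-extend a byte to 16 bits
 * and shift it left by 2 (DMVR_SHIFT in the old code), so the output is
 * identical; ushll #2 simply does it in one step with an immediate. */
static inline uint16_t widen_shl2(uint8_t x)
{
    return (uint16_t)((uint16_t)x << 2);
}
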

Other than that, this patch looks very good to me, thanks!

// Martin


end of thread, other threads:[~2025-03-01 22:35 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel
2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
2025-02-20  8:08   ` Zhao Zhili
2025-03-01 22:34   ` Martin Storsjö
2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
2025-02-20 18:49     ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
2025-02-26  8:54     ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
2025-03-01 22:21 ` Martin Storsjö
