Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
@ 2025-02-19 17:40 Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
  0 siblings, 2 replies; 6+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-19 17:40 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---

This patch replaces the integer widening with a halving addition, and
the multi-step "emulated" rounding shift with a single asm instruction
that does exactly that. The same pattern repeats in other functions in
this file; I fixed some of them in the succeeding patch. There's a lot
of performance to be gained there.
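
For reference, a standalone C sketch (not part of the patch) of the
arithmetic, assuming the usual VVC bi-prediction rounding of
shift = 15 - bit_depth with offset = 1 << (shift - 1), and assuming
arithmetic right shifts of negative values: a signed halving add
followed by a rounding shift right by shift - 1 (the rounding-shift
part of what shadd plus srshr/sqrshrun compute per lane, ignoring the
final clamp/narrow) yields the same value as the reference
add-offset-then-shift.

#include <assert.h>
#include <stdio.h>

/* Reference rounding: (a + b + offset) >> shift, clamping omitted. */
static int ref_avg(int a, int b, int shift)
{
    return (a + b + (1 << (shift - 1))) >> shift;
}

/* shadd then srshr: the halving add drops the low bit of a + b, but
 * that bit can never change the result because the following rounding
 * shift is by at least one more bit. */
static int shadd_srshr(int a, int b, int shift)
{
    int h = (a + b) >> 1;                           /* shadd: (a + b) >> 1 */
    return (h + (1 << (shift - 2))) >> (shift - 1); /* srshr #(shift - 1)  */
}

int main(void)
{
    for (int bd = 8; bd <= 12; bd += 2) {
        int shift = 15 - bd;                        /* 7, 5, 3 */
        for (int a = -32768; a < 32768; a += 257)
            for (int b = -32768; b < 32768; b += 263)
                assert(ref_avg(a, b, shift) == shadd_srshr(a, b, shift));
    }
    puts("halving add + rounding shift == add offset + shift");
    return 0;
}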

I didn't modify the existing function because it carries a few extra
steps needed only by the shared w_avg implementation (every cycle
matters), and because I find this linear version easier to digest and
understand.

Additionally, I noticed that removing the smin and smax instructions
used for clamping the values in the 10 and 12 bit_depth instantiations
does not affect the checkasm result, but it does break FATE.

Benchmarks before and after:
A78
before:
avg_8_2x2_neon:                                         21.0 ( 1.55x)
avg_8_4x4_neon:                                         25.8 ( 3.05x)
avg_8_8x8_neon:                                         45.0 ( 5.86x)
avg_8_16x16_neon:                                      178.5 ( 5.49x)
avg_8_32x32_neon:                                      709.2 ( 6.20x)
avg_8_64x64_neon:                                     2686.2 ( 6.12x)
avg_8_128x128_neon:                                  10734.2 ( 5.88x)
avg_10_2x2_neon:                                        19.0 ( 1.75x)
avg_10_4x4_neon:                                        28.2 ( 2.76x)
avg_10_8x8_neon:                                        44.0 ( 5.82x)
avg_10_16x16_neon:                                     179.5 ( 4.81x)
avg_10_32x32_neon:                                     680.8 ( 5.58x)
avg_10_64x64_neon:                                    2536.8 ( 5.40x)
avg_10_128x128_neon:                                 10079.0 ( 5.22x)
avg_12_2x2_neon:                                        20.8 ( 1.59x)
avg_12_4x4_neon:                                        25.2 ( 3.09x)
avg_12_8x8_neon:                                        44.0 ( 5.79x)
avg_12_16x16_neon:                                     182.2 ( 4.80x)
avg_12_32x32_neon:                                     696.2 ( 5.46x)
avg_12_64x64_neon:                                    2548.2 ( 5.38x)
avg_12_128x128_neon:                                 10133.8 ( 5.19x)

after:
avg_8_2x2_neon:                                         16.5 ( 1.98x)
avg_8_4x4_neon:                                         26.2 ( 2.93x)
avg_8_8x8_neon:                                         31.8 ( 8.55x)
avg_8_16x16_neon:                                       82.0 (12.02x)
avg_8_32x32_neon:                                      310.2 (14.12x)
avg_8_64x64_neon:                                      897.8 (18.26x)
avg_8_128x128_neon:                                   3608.5 (17.37x)
avg_10_2x2_neon:                                        19.5 ( 1.69x)
avg_10_4x4_neon:                                        28.0 ( 2.79x)
avg_10_8x8_neon:                                        34.8 ( 7.32x)
avg_10_16x16_neon:                                     119.8 ( 7.35x)
avg_10_32x32_neon:                                     444.2 ( 8.51x)
avg_10_64x64_neon:                                    1711.8 ( 8.00x)
avg_10_128x128_neon:                                  7065.2 ( 7.43x)
avg_12_2x2_neon:                                        19.5 ( 1.71x)
avg_12_4x4_neon:                                        24.2 ( 3.22x)
avg_12_8x8_neon:                                        33.8 ( 7.57x)
avg_12_16x16_neon:                                     120.2 ( 7.33x)
avg_12_32x32_neon:                                     442.5 ( 8.53x)
avg_12_64x64_neon:                                    1706.2 ( 8.02x)
avg_12_128x128_neon:                                  7010.0 ( 7.46x)

A72
before:
avg_8_2x2_neon:                                         30.2 ( 1.48x)
avg_8_4x4_neon:                                         40.0 ( 3.10x)
avg_8_8x8_neon:                                         91.0 ( 4.14x)
avg_8_16x16_neon:                                      340.4 ( 3.92x)
avg_8_32x32_neon:                                     1220.7 ( 4.67x)
avg_8_64x64_neon:                                     5823.4 ( 3.88x)
avg_8_128x128_neon:                                  17430.5 ( 4.73x)
avg_10_2x2_neon:                                        34.0 ( 1.66x)
avg_10_4x4_neon:                                        45.2 ( 2.73x)
avg_10_8x8_neon:                                        97.5 ( 3.87x)
avg_10_16x16_neon:                                     317.7 ( 3.90x)
avg_10_32x32_neon:                                    1376.2 ( 4.21x)
avg_10_64x64_neon:                                    5228.1 ( 3.71x)
avg_10_128x128_neon:                                 16722.2 ( 4.17x)
avg_12_2x2_neon:                                        31.7 ( 1.76x)
avg_12_4x4_neon:                                        36.0 ( 3.44x)
avg_12_8x8_neon:                                        91.7 ( 4.10x)
avg_12_16x16_neon:                                     297.2 ( 4.13x)
avg_12_32x32_neon:                                    1400.5 ( 4.14x)
avg_12_64x64_neon:                                    5379.1 ( 3.51x)
avg_12_128x128_neon:                                 16715.7 ( 4.17x)

after:
avg_8_2x2_neon:                                         33.7 ( 1.72x)
avg_8_4x4_neon:                                         45.5 ( 2.84x)
avg_8_8x8_neon:                                         65.0 ( 5.98x)
avg_8_16x16_neon:                                      171.0 ( 7.81x)
avg_8_32x32_neon:                                      558.2 (10.05x)
avg_8_64x64_neon:                                     2006.5 (10.61x)
avg_8_128x128_neon:                                   9158.7 ( 8.96x)
avg_10_2x2_neon:                                        38.0 ( 1.92x)
avg_10_4x4_neon:                                        53.2 ( 2.69x)
avg_10_8x8_neon:                                        95.2 ( 4.08x)
avg_10_16x16_neon:                                     243.0 ( 5.02x)
avg_10_32x32_neon:                                     891.7 ( 5.64x)
avg_10_64x64_neon:                                    3357.7 ( 5.60x)
avg_10_128x128_neon:                                 12411.7 ( 5.56x)
avg_12_2x2_neon:                                        34.7 ( 1.97x)
avg_12_4x4_neon:                                        53.2 ( 2.68x)
avg_12_8x8_neon:                                        91.7 ( 4.22x)
avg_12_16x16_neon:                                     239.0 ( 5.08x)
avg_12_32x32_neon:                                     895.7 ( 5.62x)
avg_12_64x64_neon:                                    3317.5 ( 5.67x)
avg_12_128x128_neon:                                 12358.5 ( 5.58x)


A53
before:
avg_8_2x2_neon:                                         58.3 ( 1.41x)
avg_8_4x4_neon:                                        101.8 ( 2.21x)
avg_8_8x8_neon:                                        178.6 ( 4.53x)
avg_8_16x16_neon:                                      569.5 ( 5.01x)
avg_8_32x32_neon:                                     1962.5 ( 5.50x)
avg_8_64x64_neon:                                     8327.8 ( 5.18x)
avg_8_128x128_neon:                                  31631.3 ( 5.34x)
avg_10_2x2_neon:                                        54.5 ( 1.56x)
avg_10_4x4_neon:                                        88.8 ( 2.53x)
avg_10_8x8_neon:                                       163.6 ( 4.97x)
avg_10_16x16_neon:                                     550.5 ( 5.16x)
avg_10_32x32_neon:                                    1942.5 ( 5.64x)
avg_10_64x64_neon:                                    8783.5 ( 4.98x)
avg_10_128x128_neon:                                 32617.0 ( 5.25x)
avg_12_2x2_neon:                                        53.3 ( 1.66x)
avg_12_4x4_neon:                                        86.8 ( 2.61x)
avg_12_8x8_neon:                                       156.6 ( 5.12x)
avg_12_16x16_neon:                                     541.3 ( 5.25x)
avg_12_32x32_neon:                                    1955.3 ( 5.59x)
avg_12_64x64_neon:                                    8686.0 ( 5.06x)
avg_12_128x128_neon:                                 32487.5 ( 5.25x)

after:
avg_8_2x2_neon:                                         39.5 ( 1.96x)
avg_8_4x4_neon:                                         65.3 ( 3.41x)
avg_8_8x8_neon:                                        168.8 ( 4.79x)
avg_8_16x16_neon:                                      348.0 ( 8.20x)
avg_8_32x32_neon:                                     1207.5 ( 8.98x)
avg_8_64x64_neon:                                     6032.3 ( 7.17x)
avg_8_128x128_neon:                                  22008.5 ( 7.69x)
avg_10_2x2_neon:                                        55.5 ( 1.52x)
avg_10_4x4_neon:                                        73.8 ( 3.08x)
avg_10_8x8_neon:                                       157.8 ( 5.12x)
avg_10_16x16_neon:                                     445.0 ( 6.43x)
avg_10_32x32_neon:                                    1587.3 ( 6.87x)
avg_10_64x64_neon:                                    7738.0 ( 5.68x)
avg_10_128x128_neon:                                 27813.8 ( 6.14x)
avg_12_2x2_neon:                                        48.3 ( 1.80x)
avg_12_4x4_neon:                                        77.0 ( 2.95x)
avg_12_8x8_neon:                                       161.5 ( 4.98x)
avg_12_16x16_neon:                                     433.5 ( 6.59x)
avg_12_32x32_neon:                                    1622.0 ( 6.75x)
avg_12_64x64_neon:                                    7844.5 ( 5.60x)
avg_12_128x128_neon:                                 26999.5 ( 6.34x)

Krzysztof

 libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++-
 1 file changed, 121 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 0edc861f97..c9d698ee29 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -217,13 +217,131 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1
 endfunc
 .endm
 
-vvc_avg avg, 8
-vvc_avg avg, 10
-vvc_avg avg, 12
 vvc_avg w_avg, 8
 vvc_avg w_avg, 10
 vvc_avg w_avg, 12
 
+.macro vvc_avg2 bit_depth
+function ff_vvc_avg_\bit_depth\()_neon, export=1
+        mov             x10, #(VVC_MAX_PB_SIZE * 2)
+        movi            v16.8h, #0
+        movi            v17.16b, #255
+        ushr            v17.8h, v17.8h, #(16 - \bit_depth)
+
+        cmp             w4, #8
+        b.gt            16f
+        b.eq            8f
+        cmp             w4, #4
+        b.eq            4f
+
+2: // width == 2
+        ldr             s0, [x2]
+        subs            w5, w5, #1
+        ldr             s1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             h0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             s0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            2b
+        ret
+
+4: // width == 4
+        ldr             d0, [x2]
+        subs            w5, w5, #1
+        ldr             d1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             s0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             d0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            4b
+        ret
+
+8: // width == 8
+        ldr             q0, [x2]
+        subs            w5, w5, #1
+        ldr             q1, [x3]
+.if \bit_depth == 8
+        shadd           v0.8h, v0.8h, v1.8h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             d0, [x0]
+.else
+        shadd           v0.8h, v0.8h, v1.8h
+        srshr           v0.8h, v0.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        str             q0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            8b
+        ret
+
+16: // width >= 16
+.if \bit_depth == 8
+        sub             x1, x1, w4, sxtw
+.else
+        sub             x1, x1, w4, sxtw #1
+.endif
+        sub             x10, x10, w4, sxtw #1
+3:
+        mov             w6, w4 // width
+1:
+        ldp             q0, q1, [x2], #32
+        subs            w6, w6, #16
+        ldp             q2, q3, [x3], #32
+.if \bit_depth == 8
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        sqrshrun        v0.8b, v4.8h, #6
+        sqrshrun2       v0.16b, v5.8h, #6
+        st1             {v0.16b}, [x0], #16
+.else
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        srshr           v0.8h, v4.8h, #(15 - 1 - \bit_depth)
+        srshr           v1.8h, v5.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smax            v1.8h, v1.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        smin            v1.8h, v1.8h, v17.8h
+        stp             q0, q1, [x0], #32
+.endif
+        b.ne            1b
+
+        subs            w5, w5, #1
+        add             x2, x2, x10
+        add             x3, x3, x10
+        add             x0, x0, x1
+        b.ne            3b
+        ret
+endfunc
+.endm
+
+vvc_avg2 8
+vvc_avg2 10
+vvc_avg2 12
+
 /* x0: int16_t *dst
  * x1: const uint8_t *_src
  * x2: ptrdiff_t _src_stride
-- 
2.47.2


* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-19 17:40 ` Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-20  8:08   ` Zhao Zhili
  2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
  1 sibling, 1 reply; 6+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-19 17:40 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---

Before and after on A78

before:
dmvr_8_12x20_neon:                                      86.2 ( 6.90x)
dmvr_8_20x12_neon:                                      94.8 ( 5.93x)
dmvr_8_20x20_neon:                                     141.5 ( 6.50x)
dmvr_12_12x20_neon:                                    158.0 ( 3.76x)
dmvr_12_20x12_neon:                                    151.2 ( 3.73x)
dmvr_12_20x20_neon:                                    247.2 ( 3.71x)
dmvr_hv_8_12x20_neon:                                  423.2 ( 3.75x)
dmvr_hv_8_20x12_neon:                                  434.0 ( 3.69x)
dmvr_hv_8_20x20_neon:                                  706.0 ( 3.69x)

after:
dmvr_8_12x20_neon:                                      77.2 ( 7.70x)
dmvr_8_20x12_neon:                                      66.5 ( 8.49x)
dmvr_8_20x20_neon:                                      92.2 ( 9.90x)
dmvr_12_12x20_neon:                                     80.2 ( 7.38x)
dmvr_12_20x12_neon:                                     58.2 ( 9.59x)
dmvr_12_20x20_neon:                                     90.0 (10.15x)
dmvr_hv_8_12x20_neon:                                  369.0 ( 4.34x)
dmvr_hv_8_20x12_neon:                                  355.8 ( 4.49x)
dmvr_hv_8_20x20_neon:                                  574.2 ( 4.51x)
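
For reference, a standalone C sketch (not part of the patch) of why a
single urshr can replace the old widen + add offset + shift +
saturating-narrow sequence in ff_vvc_dmvr_12_neon (offset 2, shift 2 as
in the code); the 8-bit path similarly fuses uxtl + ushl #2 into a
single ushll #2.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Old path: uaddl (widen + add offset 2), ushr #2, uqxtn (saturating
 * narrow back to 16 bits). */
static uint16_t old_path(uint16_t x)
{
    uint32_t t = ((uint32_t)x + 2) >> 2;
    return t > UINT16_MAX ? UINT16_MAX : (uint16_t)t;
}

/* New path: urshr #2 on the 16-bit lane; the rounding constant is added
 * before the shift without wrapping the lane. */
static uint16_t new_path(uint16_t x)
{
    return (uint16_t)(((uint32_t)x + 2) >> 2);
}

int main(void)
{
    for (uint32_t x = 0; x <= UINT16_MAX; x++) {
        uint32_t wide = (x + 2) >> 2;  /* value before any narrowing */
        assert(wide <= UINT16_MAX);    /* so uqxtn never saturated   */
        assert(new_path((uint16_t)x) == old_path((uint16_t)x));
    }
    puts("urshr #2 on 16-bit lanes matches the widened sequence");
    return 0;
}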

 libavcodec/aarch64/vvc/inter.S | 72 ++++++++++------------------------
 1 file changed, 20 insertions(+), 52 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index c9d698ee29..45add44b6e 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1
 1:
         cbz             w15, 2f
         ldr             q0, [src], #16
-        uxtl            v1.8h, v0.8b
-        uxtl2           v2.8h, v0.16b
-        ushl            v1.8h, v1.8h, v16.8h
-        ushl            v2.8h, v2.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
+        ushll2          v2.8h, v0.16b, #2
         stp             q1, q2, [dst], #32
         b               3f
 2:
         ldr             d0, [src], #8
-        uxtl            v1.8h, v0.8b
-        ushl            v1.8h, v1.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
         str             q1, [dst], #16
 3:
         subs            height, height, #1
         ldr             s3, [src], #4
-        uxtl            v4.8h, v3.8b
-        ushl            v4.4h, v4.4h, v16.4h
+        ushll           v4.8h, v3.8b, #2
         st1             {v4.4h}, [dst], x7
 
         add             src, src, src_stride
@@ -399,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1
         cmp             width, #16
         sub             src_stride, src_stride, x6, lsl #1
         cset            w15, gt                     // width > 16
-        movi            v16.8h, #2                  // offset4
         sub             x7, x7, x6, lsl #1
 1:
         cbz             w15, 2f
         ldp             q0, q1, [src], #32
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        uaddl           v4.4s, v1.4h, v16.4h
-        uaddl2          v5.4s, v1.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        ushr            v4.4s, v4.4s, #2
-        ushr            v5.4s, v5.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        uqxtn           v4.4h, v4.4s
-        uqxtn2          v4.8h, v5.4s
-
-        stp             q2, q4, [dst], #32
+        urshr           v0.8h, v0.8h, #2
+        urshr           v1.8h, v1.8h, #2
+
+        stp             q0, q1, [dst], #32
         b               3f
 2:
         ldr             q0, [src], #16
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        str             q2, [dst], #16
+        urshr           v0.8h, v0.8h, #2
+        str             q0, [dst], #16
 3:
         subs            height, height, #1
         ldr             d0, [src], #8
-        uaddl           v3.4s, v0.4h, v16.4h
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v3.4h, v3.4s
-        st1             {v3.4h}, [dst], x7
+        urshr           v0.4h, v0.4h, #2
+        st1             {v0.4h}, [dst], x7
 
         add             src, src, src_stride
         b.ne            1b
@@ -462,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldrb            w10, [x12]
         ldrb            w11, [x12, #1]
         sxtw            x6, w6
-        movi            v30.8h, #(1 << (8 - 7))     // offset1
-        movi            v31.8h, #8                  // offset2
         dup             v2.8h, w10                  // filter_y[0]
         dup             v3.8h, w11                  // filter_y[1]
 
@@ -491,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v16.8h, v16.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
         mla             v16.8h, v17.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        add             v16.8h, v16.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
-        ushr            v7.8h, v16.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
+        urshr           v7.8h, v16.8h, #(8 - 6)
         stp             q6, q7, [x13], #32
 
         cbz             w10, 3f
@@ -504,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v17.8h, v17.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
         mla             v17.8h, v7.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        add             v17.8h, v17.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
-        ushr            v17.8h, v17.8h, #4
+        urshr           v16.8h, v16.8h, #4
+        urshr           v17.8h, v17.8h, #4
         stp             q16, q17, [x14], #32
         b               3f
 2:
@@ -518,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.8h, v6.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
         str             q6, [x13], #16
 
         cbz             w10, 3f
@@ -527,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             q16, [x12], #16
         mul             v16.8h, v16.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
+        urshr           v16.8h, v16.8h, #4
         str             q16, [x14], #16
 3:
         ldur            s5, [src, #1]
@@ -537,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.4h, v6.4h, v0.4h
         mla             v6.4h, v7.4h, v1.4h
-        add             v6.4h, v6.4h, v30.4h
-        ushr            v6.4h, v6.4h, #(8 - 6)
+        urshr           v6.4h, v6.4h, #(8 - 6)
         str             d6, [x13], #8
 
         cbz             w10, 4f
@@ -546,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             d16, [x12], #8
         mul             v16.4h, v16.4h, v2.4h
         mla             v16.4h, v6.4h, v3.4h
-        add             v16.4h, v16.4h, v31.4h
-        ushr            v16.4h, v16.4h, #4
+        urshr           v16.4h, v16.4h, #4
         str             d16, [x14], #8
 4:
         subs            height, height, #1
-- 
2.47.2


* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-20  7:20 ` Zhao Zhili
  2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
  1 sibling, 1 reply; 6+ messages in thread
From: Zhao Zhili @ 2025-02-20  7:20 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz



> On Feb 20, 2025, at 01:40, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
> 
> ---
> 
> [...]
> 
> +.macro vvc_avg2 bit_depth
> +function ff_vvc_avg_\bit_depth\()_neon, export=1
> +        mov             x10, #(VVC_MAX_PB_SIZE * 2)
> +        movi            v16.8h, #0
> +        movi            v17.16b, #255
> +        ushr            v17.8h, v17.8h, #(16 - \bit_depth)

Please set v16 and v17 only for bit_depth > 8. LGTM otherwise.

> [...]


* Re: [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-20  8:08   ` Zhao Zhili
  0 siblings, 0 replies; 6+ messages in thread
From: Zhao Zhili @ 2025-02-20  8:08 UTC (permalink / raw)
  To: FFmpeg development discussions and patches; +Cc: Krzysztof Pyrkosz



> On Feb 20, 2025, at 01:40, Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> wrote:
> 
> ---
> 
> [...]
> 
> diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
> index c9d698ee29..45add44b6e 100644
> --- a/libavcodec/aarch64/vvc/inter.S
> +++ b/libavcodec/aarch64/vvc/inter.S
> @@ -369,22 +369,18 @@ function ff_vvc_dmvr_8_neon, export=1
> 1:
>         cbz             w15, 2f
>         ldr             q0, [src], #16
> -        uxtl            v1.8h, v0.8b
> -        uxtl2           v2.8h, v0.16b
> -        ushl            v1.8h, v1.8h, v16.8h
> -        ushl            v2.8h, v2.8h, v16.8h

Please remove the assignment to v16. LGTM otherwise.

> [...]


* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
  2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
@ 2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-20 18:49     ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
  0 siblings, 1 reply; 6+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-20 18:49 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---
 libavcodec/aarch64/vvc/inter.S | 125 ++++++++++++++++++++++++++++++++-
 1 file changed, 122 insertions(+), 3 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index 0edc861f97..b65920e640 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -217,13 +217,132 @@ function ff_vvc_\type\()_\bit_depth\()_neon, export=1
 endfunc
 .endm
 
-vvc_avg avg, 8
-vvc_avg avg, 10
-vvc_avg avg, 12
 vvc_avg w_avg, 8
 vvc_avg w_avg, 10
 vvc_avg w_avg, 12
 
+.macro vvc_avg2 bit_depth
+function ff_vvc_avg_\bit_depth\()_neon, export=1
+        mov             x10, #(VVC_MAX_PB_SIZE * 2)
+.if \bit_depth != 8
+        movi            v16.8h, #0
+        movi            v17.16b, #255
+        ushr            v17.8h, v17.8h, #(16 - \bit_depth)
+.endif
+        cmp             w4, #8
+        b.gt            16f
+        b.eq            8f
+        cmp             w4, #4
+        b.eq            4f
+
+2: // width == 2
+        ldr             s0, [x2]
+        subs            w5, w5, #1
+        ldr             s1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             h0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             s0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            2b
+        ret
+
+4: // width == 4
+        ldr             d0, [x2]
+        subs            w5, w5, #1
+        ldr             d1, [x3]
+.if \bit_depth == 8
+        shadd           v0.4h, v0.4h, v1.4h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             s0, [x0]
+.else
+        shadd           v0.4h, v0.4h, v1.4h
+        srshr           v0.4h, v0.4h, #(15 - 1 - \bit_depth)
+        smax            v0.4h, v0.4h, v16.4h
+        smin            v0.4h, v0.4h, v17.4h
+        str             d0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            4b
+        ret
+
+8: // width == 8
+        ldr             q0, [x2]
+        subs            w5, w5, #1
+        ldr             q1, [x3]
+.if \bit_depth == 8
+        shadd           v0.8h, v0.8h, v1.8h
+        sqrshrun        v0.8b, v0.8h, #(15 - 1 - \bit_depth)
+        str             d0, [x0]
+.else
+        shadd           v0.8h, v0.8h, v1.8h
+        srshr           v0.8h, v0.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        str             q0, [x0]
+.endif
+        add             x2, x2, #(VVC_MAX_PB_SIZE * 2)
+        add             x3, x3, #(VVC_MAX_PB_SIZE * 2)
+        add             x0, x0, x1
+        b.ne            8b
+        ret
+
+16: // width >= 16
+.if \bit_depth == 8
+        sub             x1, x1, w4, sxtw
+.else
+        sub             x1, x1, w4, sxtw #1
+.endif
+        sub             x10, x10, w4, sxtw #1
+3:
+        mov             w6, w4 // width
+1:
+        ldp             q0, q1, [x2], #32
+        subs            w6, w6, #16
+        ldp             q2, q3, [x3], #32
+.if \bit_depth == 8
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        sqrshrun        v0.8b, v4.8h, #6
+        sqrshrun2       v0.16b, v5.8h, #6
+        st1             {v0.16b}, [x0], #16
+.else
+        shadd           v4.8h, v0.8h, v2.8h
+        shadd           v5.8h, v1.8h, v3.8h
+        srshr           v0.8h, v4.8h, #(15 - 1 - \bit_depth)
+        srshr           v1.8h, v5.8h, #(15 - 1 - \bit_depth)
+        smax            v0.8h, v0.8h, v16.8h
+        smax            v1.8h, v1.8h, v16.8h
+        smin            v0.8h, v0.8h, v17.8h
+        smin            v1.8h, v1.8h, v17.8h
+        stp             q0, q1, [x0], #32
+.endif
+        b.ne            1b
+
+        subs            w5, w5, #1
+        add             x2, x2, x10
+        add             x3, x3, x10
+        add             x0, x0, x1
+        b.ne            3b
+        ret
+endfunc
+.endm
+
+vvc_avg2 8
+vvc_avg2 10
+vvc_avg2 12
+
 /* x0: int16_t *dst
  * x1: const uint8_t *_src
  * x2: ptrdiff_t _src_stride
-- 
2.47.2


* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction
  2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-20 18:49     ` Krzysztof Pyrkosz via ffmpeg-devel
  0 siblings, 0 replies; 6+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-20 18:49 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---
 libavcodec/aarch64/vvc/inter.S | 73 ++++++++++------------------------
 1 file changed, 20 insertions(+), 53 deletions(-)

diff --git a/libavcodec/aarch64/vvc/inter.S b/libavcodec/aarch64/vvc/inter.S
index b65920e640..09f0627b20 100644
--- a/libavcodec/aarch64/vvc/inter.S
+++ b/libavcodec/aarch64/vvc/inter.S
@@ -365,27 +365,22 @@ function ff_vvc_dmvr_8_neon, export=1
         cmp             width, #16
         sub             src_stride, src_stride, x6
         cset            w15, gt                     // width > 16
-        movi            v16.8h, #2                  // DMVR_SHIFT
         sub             x7, x7, x6, lsl #1
 1:
         cbz             w15, 2f
         ldr             q0, [src], #16
-        uxtl            v1.8h, v0.8b
-        uxtl2           v2.8h, v0.16b
-        ushl            v1.8h, v1.8h, v16.8h
-        ushl            v2.8h, v2.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
+        ushll2          v2.8h, v0.16b, #2
         stp             q1, q2, [dst], #32
         b               3f
 2:
         ldr             d0, [src], #8
-        uxtl            v1.8h, v0.8b
-        ushl            v1.8h, v1.8h, v16.8h
+        ushll           v1.8h, v0.8b, #2
         str             q1, [dst], #16
 3:
         subs            height, height, #1
         ldr             s3, [src], #4
-        uxtl            v4.8h, v3.8b
-        ushl            v4.4h, v4.4h, v16.4h
+        ushll           v4.8h, v3.8b, #2
         st1             {v4.4h}, [dst], x7
 
         add             src, src, src_stride
@@ -400,42 +395,24 @@ function ff_vvc_dmvr_12_neon, export=1
         cmp             width, #16
         sub             src_stride, src_stride, x6, lsl #1
         cset            w15, gt                     // width > 16
-        movi            v16.8h, #2                  // offset4
         sub             x7, x7, x6, lsl #1
 1:
         cbz             w15, 2f
         ldp             q0, q1, [src], #32
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        uaddl           v4.4s, v1.4h, v16.4h
-        uaddl2          v5.4s, v1.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        ushr            v4.4s, v4.4s, #2
-        ushr            v5.4s, v5.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        uqxtn           v4.4h, v4.4s
-        uqxtn2          v4.8h, v5.4s
-
-        stp             q2, q4, [dst], #32
+        urshr           v0.8h, v0.8h, #2
+        urshr           v1.8h, v1.8h, #2
+
+        stp             q0, q1, [dst], #32
         b               3f
 2:
         ldr             q0, [src], #16
-        uaddl           v2.4s, v0.4h, v16.4h
-        uaddl2          v3.4s, v0.8h, v16.8h
-        ushr            v2.4s, v2.4s, #2
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v2.4h, v2.4s
-        uqxtn2          v2.8h, v3.4s
-        str             q2, [dst], #16
+        urshr           v0.8h, v0.8h, #2
+        str             q0, [dst], #16
 3:
         subs            height, height, #1
         ldr             d0, [src], #8
-        uaddl           v3.4s, v0.4h, v16.4h
-        ushr            v3.4s, v3.4s, #2
-        uqxtn           v3.4h, v3.4s
-        st1             {v3.4h}, [dst], x7
+        urshr           v0.4h, v0.4h, #2
+        st1             {v0.4h}, [dst], x7
 
         add             src, src, src_stride
         b.ne            1b
@@ -463,8 +440,6 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldrb            w10, [x12]
         ldrb            w11, [x12, #1]
         sxtw            x6, w6
-        movi            v30.8h, #(1 << (8 - 7))     // offset1
-        movi            v31.8h, #8                  // offset2
         dup             v2.8h, w10                  // filter_y[0]
         dup             v3.8h, w11                  // filter_y[1]
 
@@ -492,10 +467,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v16.8h, v16.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
         mla             v16.8h, v17.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        add             v16.8h, v16.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
-        ushr            v7.8h, v16.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
+        urshr           v7.8h, v16.8h, #(8 - 6)
         stp             q6, q7, [x13], #32
 
         cbz             w10, 3f
@@ -505,10 +478,8 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         mul             v17.8h, v17.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
         mla             v17.8h, v7.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        add             v17.8h, v17.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
-        ushr            v17.8h, v17.8h, #4
+        urshr           v16.8h, v16.8h, #4
+        urshr           v17.8h, v17.8h, #4
         stp             q16, q17, [x14], #32
         b               3f
 2:
@@ -519,8 +490,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.8h, v6.8h, v0.8h
         mla             v6.8h, v7.8h, v1.8h
-        add             v6.8h, v6.8h, v30.8h
-        ushr            v6.8h, v6.8h, #(8 - 6)
+        urshr           v6.8h, v6.8h, #(8 - 6)
         str             q6, [x13], #16
 
         cbz             w10, 3f
@@ -528,8 +498,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             q16, [x12], #16
         mul             v16.8h, v16.8h, v2.8h
         mla             v16.8h, v6.8h, v3.8h
-        add             v16.8h, v16.8h, v31.8h
-        ushr            v16.8h, v16.8h, #4
+        urshr           v16.8h, v16.8h, #4
         str             q16, [x14], #16
 3:
         ldur            s5, [src, #1]
@@ -538,8 +507,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         uxtl            v6.8h, v4.8b
         mul             v6.4h, v6.4h, v0.4h
         mla             v6.4h, v7.4h, v1.4h
-        add             v6.4h, v6.4h, v30.4h
-        ushr            v6.4h, v6.4h, #(8 - 6)
+        urshr           v6.4h, v6.4h, #(8 - 6)
         str             d6, [x13], #8
 
         cbz             w10, 4f
@@ -547,8 +515,7 @@ function ff_vvc_dmvr_hv_8_neon, export=1
         ldr             d16, [x12], #8
         mul             v16.4h, v16.4h, v2.4h
         mla             v16.4h, v6.4h, v3.4h
-        add             v16.4h, v16.4h, v31.4h
-        ushr            v16.4h, v16.4h, #4
+        urshr           v16.4h, v16.4h, #4
         str             d16, [x14], #8
 4:
         subs            height, height, #1
-- 
2.47.2


Thread overview: 6 messages
2025-02-19 17:40 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Krzysztof Pyrkosz via ffmpeg-devel
2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
2025-02-20  8:08   ` Zhao Zhili
2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
2025-02-20 18:49     ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
