Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

* [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
@ 2024-03-25 15:02 Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 01/21] aarch64: hevc: Reorder a misplaced function init line Martin Storsjö
                   ` (21 more replies)
  0 siblings, 22 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

Hi,

For some time now, we have had fairly complete AArch64 NEON coverage
for the hevc decoder.

However, some of these functions require the I8MM instruction set
extension, and many of them (but not all) lack a plain NEON
version.

This patchset fills in a regular NEON version of all functions
where we have an I8MM function.

For context: the I8MM instruction set extension is a mandatory
part of armv8.6-a. For example, Apple M2 and AWS Graviton 3 have it,
but Apple M1 and Ampere Altra don't.
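
The effect on function selection can be sketched in C roughly as
follows. This is only an illustration: the flag bits, the enum and
select_impl() are hypothetical names made up for this sketch, not
FFmpeg's actual API (the real assignments are in
libavcodec/aarch64/hevcdsp_init_aarch64.c).

```c
/* Hypothetical CPU feature bits, for illustration only. */
enum { CPU_NEON = 1 << 0, CPU_I8MM = 1 << 1 };

/* Which implementation of a DSP function ends up being used. */
enum impl { IMPL_C, IMPL_NEON, IMPL_I8MM };

/*
 * Later assignments override earlier ones, so the most specific
 * implementation available wins. Without a plain NEON version,
 * a CPU lacking I8MM would fall all the way back to the C code.
 */
enum impl select_impl(unsigned cpu_flags)
{
    enum impl impl = IMPL_C;

    if (cpu_flags & CPU_NEON)
        impl = IMPL_NEON;
    if ((cpu_flags & CPU_NEON) && (cpu_flags & CPU_I8MM))
        impl = IMPL_I8MM;

    return impl;
}
```

With a plain NEON version present, select_impl(CPU_NEON) picks
IMPL_NEON on a core without I8MM (such as the Apple M1 mentioned
above) instead of the C fallback.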

This patchset takes decoding of a 1080p HEVC clip from 402
fps to 649 fps on an Apple M1.

Patch #2 also fixes a subtle bug in the existing implementation;
two functions relied on the contents of the stack, below the
stack pointer, being untouched within a function. If a signal
gets delivered, those parts of the stack could be clobbered.

// Martin

Martin Storsjö (21):
  aarch64: hevc: Reorder a misplaced function init line
  aarch64: hevc: Don't iterate with sp in
    ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm
  aarch64: hevc: Merge consecutive stores in
    put_hevc_\type\()_h16_8_neon
  aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal
    looping
  aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h
  aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8
  aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8
  aarch64: hevc: Split the epel_*_hv functions into two parts
  aarch64: hevc: Reorder epel_hv functions to prepare for templating
  aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm
  aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm
  aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm
  aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm
  aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8
  aarch64: hevc: Split the qpel_*_hv functions into two parts
  aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon
    functions
  aarch64: hevc: Reorder qpel_hv functions to prepare for templating
  aarch64: hevc: Produce plain neon versions of qpel_hv
  aarch64: hevc: Produce plain neon versions of qpel_uni_hv
  aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv
  aarch64: hevc: Produce plain neon versions of qpel_bi_hv

 libavcodec/aarch64/hevcdsp_epel_neon.S    | 1529 +++++++++++------
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   96 +-
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 1804 +++++++++++++--------
 3 files changed, 2291 insertions(+), 1138 deletions(-)

-- 
2.39.3 (Apple Git-146)

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

* [FFmpeg-devel] [PATCH 01/21] aarch64: hevc: Reorder a misplaced function init line
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 02/21] aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm Martin Storsjö
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

Group the epel and qpel functions together.
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 04692aa98e..d2f2a3681f 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -381,12 +381,12 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
             NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h ,_i8mm);
+            NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_qpel, 0, 1, qpel_h, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h, _i8mm);
-            NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv, _i8mm);
             NEON8_FNASSIGN_PARTIAL_5(c->put_hevc_qpel_uni_w, 1, 1, qpel_uni_w_hv, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 1, qpel_bi_hv, _i8mm);
         }
-- 
2.39.3 (Apple Git-146)

* [FFmpeg-devel] [PATCH 02/21] aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 01/21] aarch64: hevc: Reorder a misplaced function init line Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 03/21] aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon Martin Storsjö
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

Many of the routines within hevcdsp_epel_neon and hevcdsp_qpel_neon
store temporary buffers on the stack. When consuming them,
many of these functions use the stack pointer as an incrementing pointer
for reading the data (instead of keeping it in another register),
which is rather unusual.

Technically, this is fine as long as the pointer remains properly
aligned.

However in the case of ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm,
after incrementing sp when reading data (within each 16 pixel
wide stripe) it would then reset the stack pointer back to a lower
value, for reading the next 16 pixel wide stripe, expecting the
data to remain untouched.

This can't be assumed; data on the stack below the stack pointer
can be clobbered (e.g. by a signal handler). Some OS ABIs
allow for a small margin that won't be touched, known as a red zone,
but not all do. Those that do guarantee only 16 or 128 bytes, not
9 KB.

Convert this function to use a separate pointer register to
iterate through the data, retaining the stack pointer to point
at the bottom of the data we require to remain untouched.
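
The hazard can be modelled in plain C as a toy simulation. Everything
here is an assumption made for illustration: ToyStack, toy_signal()
and the buffer sizes are invented, and the "signal" is simply a
function that overwrites everything below the modelled sp. It is not
how the assembly is written, only a sketch of why the reset is unsafe.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ROWS = 4, STRIDE = 64, STRIPE = 32, GUARD = 256 };

/* Toy downward-growing stack: indices below sp are free to be clobbered. */
typedef struct {
    uint8_t mem[GUARD + ROWS * STRIDE];
    size_t  sp;
} ToyStack;

void toy_init(ToyStack *s)
{
    memset(s->mem, 0, GUARD);
    memset(s->mem + GUARD, 1, ROWS * STRIDE); /* temp buffer holds 1s */
    s->sp = GUARD;                            /* sp at bottom of buffer */
}

/* Simulated signal delivery: overwrite everything below sp. */
void toy_signal(ToyStack *s)
{
    size_t n = s->sp < sizeof(s->mem) ? s->sp : sizeof(s->mem);
    memset(s->mem, 0xAA, n);
}

static unsigned sum_row(const ToyStack *s, size_t pos)
{
    unsigned sum = 0;
    for (int i = 0; i < STRIPE; i++)
        sum += s->mem[pos + (size_t)i];
    return sum;
}

/* Unsafe: sp itself walks up the buffer, then is reset to a lower value. */
unsigned read_with_sp(ToyStack *s)
{
    const size_t base = s->sp;
    unsigned sum = 0;
    for (int x = 0; x < STRIDE; x += STRIPE) {
        s->sp = base + (size_t)x;   /* reset sp for the next stripe */
        for (int y = 0; y < ROWS; y++) {
            sum += sum_row(s, s->sp);
            s->sp += STRIDE;
            toy_signal(s);          /* data below sp is now lost */
        }
    }
    s->sp = base;
    return sum;
}

/* Safe: sp stays at the bottom of the buffer, a separate index walks. */
unsigned read_with_separate_pointer(ToyStack *s)
{
    unsigned sum = 0;
    for (int x = 0; x < STRIDE; x += STRIPE) {
        size_t p = s->sp + (size_t)x;
        for (int y = 0; y < ROWS; y++) {
            sum += sum_row(s, p);
            p += STRIDE;
            toy_signal(s);          /* only touches memory below the buffer */
        }
    }
    return sum;
}
```

In this model read_with_separate_pointer() always sums the intact
buffer, while read_with_sp() reads overwritten bytes as soon as it
starts the second stripe.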
---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 130 +++++++++++++------------
 1 file changed, 66 insertions(+), 64 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 9be29cafe2..815d897094 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -3981,24 +3981,25 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         mov             x11, sp
         mov             w12, w22
         mov             x13, x20
+        mov             x14, sp
 3:
-        ldp             q16, q1, [sp]
-        add             sp, sp, x10
-        ldp             q17, q2, [sp]
-        add             sp, sp, x10
-        ldp             q18, q3, [sp]
-        add             sp, sp, x10
-        ldp             q19, q4, [sp]
-        add             sp, sp, x10
-        ldp             q20, q5, [sp]
-        add             sp, sp, x10
-        ldp             q21, q6, [sp]
-        add             sp, sp, x10
-        ldp             q22, q7, [sp]
-        add             sp, sp, x10
+        ldp             q16, q1, [x11]
+        add             x11, x11, x10
+        ldp             q17, q2, [x11]
+        add             x11, x11, x10
+        ldp             q18, q3, [x11]
+        add             x11, x11, x10
+        ldp             q19, q4, [x11]
+        add             x11, x11, x10
+        ldp             q20, q5, [x11]
+        add             x11, x11, x10
+        ldp             q21, q6, [x11]
+        add             x11, x11, x10
+        ldp             q22, q7, [x11]
+        add             x11, x11, x10
 1:
-        ldp             q23, q31, [sp]
-        add             sp, sp, x10
+        ldp             q23, q31, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v16, v17, v18, v19, v20, v21, v22, v23
         QPEL_FILTER_H2  v25, v16, v17, v18, v19, v20, v21, v22, v23
         QPEL_FILTER_H   v26,  v1,  v2,  v3,  v4,  v5,  v6,  v7, v31
@@ -4007,8 +4008,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q16, q1, [sp]
-        add             sp, sp, x10
+        ldp             q16, q1, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v17, v18, v19, v20, v21, v22, v23, v16
         QPEL_FILTER_H2  v25, v17, v18, v19, v20, v21, v22, v23, v16
         QPEL_FILTER_H   v26,  v2,  v3,  v4,  v5,  v6,  v7, v31,  v1
@@ -4017,8 +4018,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q17, q2, [sp]
-        add             sp, sp, x10
+        ldp             q17, q2, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v18, v19, v20, v21, v22, v23, v16, v17
         QPEL_FILTER_H2  v25, v18, v19, v20, v21, v22, v23, v16, v17
         QPEL_FILTER_H   v26,  v3,  v4,  v5,  v6,  v7, v31,  v1,  v2
@@ -4027,8 +4028,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q18, q3, [sp]
-        add             sp, sp, x10
+        ldp             q18, q3, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v19, v20, v21, v22, v23, v16, v17, v18
         QPEL_FILTER_H2  v25, v19, v20, v21, v22, v23, v16, v17, v18
         QPEL_FILTER_H   v26,  v4,  v5,  v6,  v7, v31,  v1,  v2,  v3
@@ -4037,8 +4038,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q19, q4, [sp]
-        add             sp, sp, x10
+        ldp             q19, q4, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v20, v21, v22, v23, v16, v17, v18, v19
         QPEL_FILTER_H2  v25, v20, v21, v22, v23, v16, v17, v18, v19
         QPEL_FILTER_H   v26,  v5,  v6,  v7, v31,  v1,  v2,  v3,  v4
@@ -4047,8 +4048,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q20, q5, [sp]
-        add             sp, sp, x10
+        ldp             q20, q5, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v21, v22, v23, v16, v17, v18, v19, v20
         QPEL_FILTER_H2  v25, v21, v22, v23, v16, v17, v18, v19, v20
         QPEL_FILTER_H   v26,  v6,  v7, v31,  v1,  v2,  v3,  v4,  v5
@@ -4057,8 +4058,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q21, q6, [sp]
-        add             sp, sp, x10
+        ldp             q21, q6, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v22, v23, v16, v17, v18, v19, v20, v21
         QPEL_FILTER_H2  v25, v22, v23, v16, v17, v18, v19, v20, v21
         QPEL_FILTER_H   v26,  v7, v31,  v1,  v2,  v3,  v4,  v5,  v6
@@ -4067,8 +4068,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q22, q7, [sp]
-        add             sp, sp, x10
+        ldp             q22, q7, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v23, v16, v17, v18, v19, v20, v21, v22
         QPEL_FILTER_H2  v25, v23, v16, v17, v18, v19, v20, v21, v22
         QPEL_FILTER_H   v26, v31,  v1,  v2,  v3,  v4,  v5,  v6,  v7
@@ -4078,10 +4079,10 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         b.hi            1b
 2:
         subs            w27, w27, #16
-        add             sp, x11, #32
+        add             x11, x14, #32
         add             x20, x13, #16
         mov             w22, w12
-        mov             x11, sp
+        mov             x14, x11
         mov             x13, x20
         b.hi            3b
         QPEL_UNI_W_HV_END
@@ -4093,24 +4094,25 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         mov             x11, sp
         mov             w12, w22
         mov             x13, x20
+        mov             x14, sp
 3:
-        ldp             q16, q1, [sp]
-        add             sp, sp, x10
-        ldp             q17, q2, [sp]
-        add             sp, sp, x10
-        ldp             q18, q3, [sp]
-        add             sp, sp, x10
-        ldp             q19, q4, [sp]
-        add             sp, sp, x10
-        ldp             q20, q5, [sp]
-        add             sp, sp, x10
-        ldp             q21, q6, [sp]
-        add             sp, sp, x10
-        ldp             q22, q7, [sp]
-        add             sp, sp, x10
+        ldp             q16, q1, [x11]
+        add             x11, x11, x10
+        ldp             q17, q2, [x11]
+        add             x11, x11, x10
+        ldp             q18, q3, [x11]
+        add             x11, x11, x10
+        ldp             q19, q4, [x11]
+        add             x11, x11, x10
+        ldp             q20, q5, [x11]
+        add             x11, x11, x10
+        ldp             q21, q6, [x11]
+        add             x11, x11, x10
+        ldp             q22, q7, [x11]
+        add             x11, x11, x10
 1:
-        ldp             q23, q31, [sp]
-        add             sp, sp, x10
+        ldp             q23, q31, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v16, v17, v18, v19, v20, v21, v22, v23
         QPEL_FILTER_H2  v25, v16, v17, v18, v19, v20, v21, v22, v23
         QPEL_FILTER_H   v26,  v1,  v2,  v3,  v4,  v5,  v6,  v7, v31
@@ -4119,8 +4121,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q16, q1, [sp]
-        add             sp, sp, x10
+        ldp             q16, q1, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v17, v18, v19, v20, v21, v22, v23, v16
         QPEL_FILTER_H2  v25, v17, v18, v19, v20, v21, v22, v23, v16
         QPEL_FILTER_H   v26,  v2,  v3,  v4,  v5,  v6,  v7, v31,  v1
@@ -4129,8 +4131,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q17, q2, [sp]
-        add             sp, sp, x10
+        ldp             q17, q2, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v18, v19, v20, v21, v22, v23, v16, v17
         QPEL_FILTER_H2  v25, v18, v19, v20, v21, v22, v23, v16, v17
         QPEL_FILTER_H   v26,  v3,  v4,  v5,  v6,  v7, v31,  v1,  v2
@@ -4139,8 +4141,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q18, q3, [sp]
-        add             sp, sp, x10
+        ldp             q18, q3, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v19, v20, v21, v22, v23, v16, v17, v18
         QPEL_FILTER_H2  v25, v19, v20, v21, v22, v23, v16, v17, v18
         QPEL_FILTER_H   v26,  v4,  v5,  v6,  v7, v31,  v1,  v2,  v3
@@ -4149,8 +4151,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q19, q4, [sp]
-        add             sp, sp, x10
+        ldp             q19, q4, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v20, v21, v22, v23, v16, v17, v18, v19
         QPEL_FILTER_H2  v25, v20, v21, v22, v23, v16, v17, v18, v19
         QPEL_FILTER_H   v26,  v5,  v6,  v7, v31,  v1,  v2,  v3,  v4
@@ -4159,8 +4161,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q20, q5, [sp]
-        add             sp, sp, x10
+        ldp             q20, q5, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v21, v22, v23, v16, v17, v18, v19, v20
         QPEL_FILTER_H2  v25, v21, v22, v23, v16, v17, v18, v19, v20
         QPEL_FILTER_H   v26,  v6,  v7, v31,  v1,  v2,  v3,  v4,  v5
@@ -4169,8 +4171,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q21, q6, [sp]
-        add             sp, sp, x10
+        ldp             q21, q6, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v22, v23, v16, v17, v18, v19, v20, v21
         QPEL_FILTER_H2  v25, v22, v23, v16, v17, v18, v19, v20, v21
         QPEL_FILTER_H   v26,  v7, v31,  v1,  v2,  v3,  v4,  v5,  v6
@@ -4179,8 +4181,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         subs            w22, w22, #1
         b.eq            2f
 
-        ldp             q22, q7, [sp]
-        add             sp, sp, x10
+        ldp             q22, q7, [x11]
+        add             x11, x11, x10
         QPEL_FILTER_H   v24, v23, v16, v17, v18, v19, v20, v21, v22
         QPEL_FILTER_H2  v25, v23, v16, v17, v18, v19, v20, v21, v22
         QPEL_FILTER_H   v26, v31,  v1,  v2,  v3,  v4,  v5,  v6,  v7
@@ -4190,10 +4192,10 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         b.hi            1b
 2:
         subs            w27, w27, #16
-        add             sp, x11, #32
+        add             x11, x14, #32
         add             x20, x13, #16
         mov             w22, w12
-        mov             x11, sp
+        mov             x14, x11
         mov             x13, x20
         b.hi            3b
         QPEL_UNI_W_HV_END
-- 
2.39.3 (Apple Git-146)

* [FFmpeg-devel] [PATCH 03/21] aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 01/21] aarch64: hevc: Reorder a misplaced function init line Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 02/21] aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 04/21] aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping Martin Storsjö
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

This gets rid of a couple of instructions, but the actual performance
is almost identical on Cortex A72/A73. On Cortex A53, it is a
handful of cycles faster.
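
The pointer arithmetic behind the merge can be sketched in C. The two
helpers below are hypothetical stand-ins for the store sequences, with
the stride given in int16_t elements rather than bytes; they only
illustrate that one store of both halves followed by a full-stride
increment lands in the same place as two stores with a split
increment.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Old pattern: store 8 elements, advance by 8, store 8 more,
 * then advance by the remainder of the stride. */
int16_t *store_row_split(int16_t *dst, const int16_t lo[8],
                         const int16_t hi[8], size_t stride)
{
    memcpy(dst, lo, 8 * sizeof(*dst));
    dst += 8;
    memcpy(dst, hi, 8 * sizeof(*dst));
    dst += stride - 8;
    return dst;
}

/* New pattern: store both halves at once, advance by the full stride. */
int16_t *store_row_merged(int16_t *dst, const int16_t lo[8],
                          const int16_t hi[8], size_t stride)
{
    memcpy(dst,     lo, 8 * sizeof(*dst));
    memcpy(dst + 8, hi, 8 * sizeof(*dst));
    return dst + stride;
}
```

Both helpers write the same 16 elements and return the same next-row
pointer, which is why the stride adjustment can be dropped.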
---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 815d897094..432558bb95 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -512,11 +512,10 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
 .ifc \type, qpel
         mov             dststride, #(MAX_PB_SIZE << 1)
         lsl             x13, srcstride, #1 // srcstridel
-        mov             x14, #((MAX_PB_SIZE << 2) - 16)
+        mov             x14, #(MAX_PB_SIZE << 2)
 .else
         lsl             x14, dststride, #1 // dststridel
         lsl             x13, srcstride, #1 // srcstridel
-        sub             x14, x14, #8
 .endif
         add             x10, dst, dststride // dstb
         add             x12, src, srcstride // srcb
@@ -527,10 +526,8 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
         bl              ff_hevc_put_hevc_h16_8_neon
 
 .ifc \type, qpel
-        st1             {v26.8h}, [dst], #16
-        st1             {v28.8h}, [x10], #16
-        st1             {v27.8h}, [dst], x14
-        st1             {v29.8h}, [x10], x14
+        st1             {v26.8h, v27.8h}, [dst], x14
+        st1             {v28.8h, v29.8h}, [x10], x14
 .else
 .ifc \type, qpel_bi
         ld1             {v16.8h, v17.8h}, [ x4], x16
@@ -549,10 +546,8 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
         sqrshrun        v28.8b, v28.8h, #6
         sqrshrun        v29.8b, v29.8h, #6
 .endif
-        st1             {v26.8b}, [dst], #8
-        st1             {v28.8b}, [x10], #8
-        st1             {v27.8b}, [dst], x14
-        st1             {v29.8b}, [x10], x14
+        st1             {v26.8b, v27.8b}, [dst], x14
+        st1             {v28.8b, v29.8b}, [x10], x14
 .endif
         b.gt            1b // double line
         subs            width, width, #16
-- 
2.39.3 (Apple Git-146)

* [FFmpeg-devel] [PATCH 04/21] aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (2 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 03/21] aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 05/21] aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h Martin Storsjö
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

For widths of 32 pixels and more, loop first horizontally,
then vertically.

Previously, this function would process a 16 pixel wide slice
of the block, looping vertically. After processing the whole
height, it would backtrack and process the next 16 pixel wide
slice.

When doing 8-tap filtering horizontally, the function must load
7 more pixels (in practice, 8) following the actual inputs, and
this was done for each slice.

By iterating first horizontally throughout each line, then
vertically, we access data in a more cache-friendly order, and
we don't need to reload data unnecessarily.
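
As a rough back-of-the-envelope model of the redundant loads, consider
the following C sketch. The function names and the simple counting
model are assumptions made for illustration; it counts input pixels
loaded and says nothing about actual cycle counts.

```c
#include <stddef.h>

enum {
    SLICE     = 16, /* output pixels produced per slice */
    OVERFETCH = 8   /* extra input pixels loaded (7 needed, 8 in practice) */
};

/* Old order: each 16 pixel wide slice loads 16 + 8 input pixels per row. */
size_t loads_slice_first(size_t width, size_t height)
{
    return (width / SLICE) * height * (SLICE + OVERFETCH);
}

/* New order: each row is traversed once, loading width + 8 input pixels. */
size_t loads_row_first(size_t width, size_t height)
{
    return height * (width + OVERFETCH);
}
```

For a 64x64 block this model gives 6144 loaded pixels for the old
order against 4608 for the new one, while for width 16 both orders
load the same amount.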

Keep the original order in put_hevc_\type\()_h12_8_neon; the
only suboptimal case there is for width=24. But specializing
an optimal variant for that would require more code, which
might not be worth it.

For the h16 case, this implementation would give a slowdown,
as it now loads the first 8 pixels separately from the rest, but
for larger widths, it is a gain. Therefore, keep the h16 case
as it was (but remove the outer loop), and create a new specialized
version for horizontal looping with 16 pixels at a time.

Before:                  Cortex A53      A72      A73  Graviton 3
put_hevc_qpel_h16_8_neon:     710.5    667.7    692.5       211.0
put_hevc_qpel_h32_8_neon:    2791.5   2643.5   2732.0       883.5
put_hevc_qpel_h64_8_neon:   10954.0  10657.0  10874.2      3241.5
After:
put_hevc_qpel_h16_8_neon:     697.5    663.5    705.7       212.5
put_hevc_qpel_h32_8_neon:    2767.2   2684.5   2791.2       920.5
put_hevc_qpel_h64_8_neon:   10559.2  10471.5  10932.2      3051.7
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  20 +++--
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 103 +++++++++++++++++-----
 2 files changed, 94 insertions(+), 29 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index d2f2a3681f..1e9f5e32db 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -109,6 +109,8 @@ void ff_hevc_put_hevc_qpel_h12_8_neon(int16_t *dst, const uint8_t *_src, ptrdiff
                                       intptr_t mx, intptr_t my, int width);
 void ff_hevc_put_hevc_qpel_h16_8_neon(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
                                       intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_h32_8_neon(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
+                                      intptr_t mx, intptr_t my, int width);
 void ff_hevc_put_hevc_qpel_uni_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
                                          ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my,
                                          int width);
@@ -124,6 +126,9 @@ void ff_hevc_put_hevc_qpel_uni_h12_8_neon(uint8_t *_dst, ptrdiff_t _dststride, c
 void ff_hevc_put_hevc_qpel_uni_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
                                           ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t
                                           my, int width);
+void ff_hevc_put_hevc_qpel_uni_h32_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+                                          ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t
+                                          my, int width);
 void ff_hevc_put_hevc_qpel_bi_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
                                         ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
                                         mx, intptr_t my, int width);
@@ -139,6 +144,9 @@ void ff_hevc_put_hevc_qpel_bi_h12_8_neon(uint8_t *_dst, ptrdiff_t _dststride, co
 void ff_hevc_put_hevc_qpel_bi_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
                                          ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
                                          mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_bi_h32_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+                                         ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
+                                         mx, intptr_t my, int width);
 
 #define NEON8_FNPROTO(fn, args, ext) \
     void ff_hevc_put_hevc_##fn##4_8_neon##ext args; \
@@ -335,28 +343,28 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         c->put_hevc_qpel[3][0][1]      = ff_hevc_put_hevc_qpel_h8_8_neon;
         c->put_hevc_qpel[4][0][1]      =
         c->put_hevc_qpel[6][0][1]      = ff_hevc_put_hevc_qpel_h12_8_neon;
-        c->put_hevc_qpel[5][0][1]      =
+        c->put_hevc_qpel[5][0][1]      = ff_hevc_put_hevc_qpel_h16_8_neon;
         c->put_hevc_qpel[7][0][1]      =
         c->put_hevc_qpel[8][0][1]      =
-        c->put_hevc_qpel[9][0][1]      = ff_hevc_put_hevc_qpel_h16_8_neon;
+        c->put_hevc_qpel[9][0][1]      = ff_hevc_put_hevc_qpel_h32_8_neon;
         c->put_hevc_qpel_uni[1][0][1]  = ff_hevc_put_hevc_qpel_uni_h4_8_neon;
         c->put_hevc_qpel_uni[2][0][1]  = ff_hevc_put_hevc_qpel_uni_h6_8_neon;
         c->put_hevc_qpel_uni[3][0][1]  = ff_hevc_put_hevc_qpel_uni_h8_8_neon;
         c->put_hevc_qpel_uni[4][0][1]  =
         c->put_hevc_qpel_uni[6][0][1]  = ff_hevc_put_hevc_qpel_uni_h12_8_neon;
-        c->put_hevc_qpel_uni[5][0][1]  =
+        c->put_hevc_qpel_uni[5][0][1]  = ff_hevc_put_hevc_qpel_uni_h16_8_neon;
         c->put_hevc_qpel_uni[7][0][1]  =
         c->put_hevc_qpel_uni[8][0][1]  =
-        c->put_hevc_qpel_uni[9][0][1]  = ff_hevc_put_hevc_qpel_uni_h16_8_neon;
+        c->put_hevc_qpel_uni[9][0][1]  = ff_hevc_put_hevc_qpel_uni_h32_8_neon;
         c->put_hevc_qpel_bi[1][0][1]   = ff_hevc_put_hevc_qpel_bi_h4_8_neon;
         c->put_hevc_qpel_bi[2][0][1]   = ff_hevc_put_hevc_qpel_bi_h6_8_neon;
         c->put_hevc_qpel_bi[3][0][1]   = ff_hevc_put_hevc_qpel_bi_h8_8_neon;
         c->put_hevc_qpel_bi[4][0][1]   =
         c->put_hevc_qpel_bi[6][0][1]   = ff_hevc_put_hevc_qpel_bi_h12_8_neon;
-        c->put_hevc_qpel_bi[5][0][1]   =
+        c->put_hevc_qpel_bi[5][0][1]   = ff_hevc_put_hevc_qpel_bi_h16_8_neon;
         c->put_hevc_qpel_bi[7][0][1]   =
         c->put_hevc_qpel_bi[8][0][1]   =
-        c->put_hevc_qpel_bi[9][0][1]   = ff_hevc_put_hevc_qpel_bi_h16_8_neon;
+        c->put_hevc_qpel_bi[9][0][1]   = ff_hevc_put_hevc_qpel_bi_h32_8_neon;
 
         NEON8_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels,);
         NEON8_FNASSIGN(c->put_hevc_epel, 1, 0, epel_v,);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 432558bb95..0fcded344b 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -383,11 +383,9 @@ endfunc
 
 .ifc \type, qpel
 function ff_hevc_put_hevc_h16_8_neon, export=0
-        uxtl            v16.8h,  v16.8b
         uxtl            v17.8h,  v17.8b
         uxtl            v18.8h,  v18.8b
 
-        uxtl            v19.8h,  v19.8b
         uxtl            v20.8h,  v20.8b
         uxtl            v21.8h,  v21.8b
 
@@ -408,7 +406,6 @@ function ff_hevc_put_hevc_h16_8_neon, export=0
         mla             v28.8h,  v24.8h, v0.h[\i]
         mla             v29.8h,  v25.8h, v0.h[\i]
 .endr
-        subs            x9, x9, #2
         ret
 endfunc
 .endif
@@ -439,7 +436,10 @@ function ff_hevc_put_hevc_\type\()_h12_8_neon, export=1
 1:      ld1             {v16.8b-v18.8b}, [src], x13
         ld1             {v19.8b-v21.8b}, [x12], x13
 
+        uxtl            v16.8h,  v16.8b
+        uxtl            v19.8h,  v19.8b
         bl              ff_hevc_put_hevc_h16_8_neon
+        subs            x9, x9, #2
 
 .ifc \type, qpel
         st1             {v26.8h}, [dst], #16
@@ -504,7 +504,6 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
 .ifc \type, qpel_bi
         ldrh            w8, [sp] // width
         mov             x16, #(MAX_PB_SIZE << 2) // src2bstridel
-        lsl             x17, x5, #7 // src2b reset
         add             x15, x4, #(MAX_PB_SIZE << 1) // src2b
 .endif
         sub             src, src, #3
@@ -519,11 +518,14 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
 .endif
         add             x10, dst, dststride // dstb
         add             x12, src, srcstride // srcb
-0:      mov             x9, height
+
 1:      ld1             {v16.8b-v18.8b}, [src], x13
         ld1             {v19.8b-v21.8b}, [x12], x13
 
+        uxtl            v16.8h,  v16.8b
+        uxtl            v19.8h,  v19.8b
         bl              ff_hevc_put_hevc_h16_8_neon
+        subs            height, height, #2
 
 .ifc \type, qpel
         st1             {v26.8h, v27.8h}, [dst], x14
@@ -550,28 +552,83 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
         st1             {v28.8b, v29.8b}, [x10], x14
 .endif
         b.gt            1b // double line
-        subs            width, width, #16
-        // reset src
-        msub            src, srcstride, height, src
-        msub            x12, srcstride, height, x12
-        // reset dst
-        msub            dst, dststride, height, dst
-        msub            x10, dststride, height, x10
+        ret             mx
+endfunc
+
+function ff_hevc_put_hevc_\type\()_h32_8_neon, export=1
+        load_filter     mx
+        sxtw            height, heightw
+        mov             mx, x30
 .ifc \type, qpel_bi
-        // reset xsrc
-        sub             x4,  x4,  x17
-        sub             x15, x15, x17
-        add             x4,  x4,  #32
-        add             x15, x15, #32
+        ldrh            w8, [sp] // width
+        mov             x16, #(MAX_PB_SIZE << 2) // src2bstridel
+        lsl             x17, x5, #7 // src2b reset
+        add             x15, x4, #(MAX_PB_SIZE << 1) // src2b
+        sub             x16, x16, width, uxtw #1
 .endif
-        add             src, src, #16
-        add             x12, x12, #16
+        sub             src, src, #3
+        mov             mx, x30
+.ifc \type, qpel
+        mov             dststride, #(MAX_PB_SIZE << 1)
+        lsl             x13, srcstride, #1 // srcstridel
+        mov             x14, #(MAX_PB_SIZE << 2)
+        sub             x14, x14, width, uxtw #1
+.else
+        lsl             x14, dststride, #1 // dststridel
+        lsl             x13, srcstride, #1 // srcstridel
+        sub             x14, x14, width, uxtw
+.endif
+        sub             x13, x13, width, uxtw
+        sub             x13, x13, #8
+        add             x10, dst, dststride // dstb
+        add             x12, src, srcstride // srcb
+0:      mov             w9, width
+        ld1             {v16.8b}, [src], #8
+        ld1             {v19.8b}, [x12], #8
+        uxtl            v16.8h, v16.8b
+        uxtl            v19.8h, v19.8b
+1:
+        ld1             {v17.8b-v18.8b}, [src], #16
+        ld1             {v20.8b-v21.8b}, [x12], #16
+
+        bl              ff_hevc_put_hevc_h16_8_neon
+        subs            w9, w9, #16
+
+        mov             v16.16b, v18.16b
+        mov             v19.16b, v21.16b
 .ifc \type, qpel
-        add             dst, dst, #32
-        add             x10, x10, #32
+        st1             {v26.8h, v27.8h}, [dst], #32
+        st1             {v28.8h, v29.8h}, [x10], #32
+.else
+.ifc \type, qpel_bi
+        ld1             {v20.8h, v21.8h}, [ x4], #32
+        ld1             {v22.8h, v23.8h}, [x15], #32
+        sqadd           v26.8h, v26.8h, v20.8h
+        sqadd           v27.8h, v27.8h, v21.8h
+        sqadd           v28.8h, v28.8h, v22.8h
+        sqadd           v29.8h, v29.8h, v23.8h
+        sqrshrun        v26.8b, v26.8h, #7
+        sqrshrun        v27.8b, v27.8h, #7
+        sqrshrun        v28.8b, v28.8h, #7
+        sqrshrun        v29.8b, v29.8h, #7
 .else
-        add             dst, dst, #16
-        add             x10, x10, #16
+        sqrshrun        v26.8b, v26.8h, #6
+        sqrshrun        v27.8b, v27.8h, #6
+        sqrshrun        v28.8b, v28.8h, #6
+        sqrshrun        v29.8b, v29.8h, #6
+.endif
+        st1             {v26.8b, v27.8b}, [dst], #16
+        st1             {v28.8b, v29.8b}, [x10], #16
+.endif
+        b.gt            1b // double line
+        subs            height, height, #2
+        add             src, src, x13
+        add             x12, x12, x13
+        add             dst, dst, x14
+        add             x10, x10, x14
+.ifc \type, qpel_bi
+        add             x4,  x4,  x16
+        add             x15, x15, x16
 .endif
         b.gt            0b
         ret             mx
-- 
2.39.3 (Apple Git-146)
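
Reader's note on the h32 rewrite above: the old code walked one 16-pixel
column over the full height and then rewound the pointers with msub, while
the new function finishes a row pair across the whole width before stepping
down. The strides in x13/x14 therefore have the bytes already consumed by
the post-incrementing loads and stores subtracted up front. Below is a
minimal C sketch of that bookkeeping only (no filtering). The function names
are hypothetical, and MAX_PB_SIZE = 64 is assumed to match FFmpeg's
definition.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PB_SIZE 64  /* assumption: same value as in FFmpeg's hevc headers */

/* Net movement (in bytes) of src, or of x12, after one pass over a row pair. */
static ptrdiff_t src_advance_per_row_pair(ptrdiff_t srcstride, int width)
{
    assert(width > 0 && width % 16 == 0);
    ptrdiff_t x13 = 2 * srcstride - width - 8; /* lsl #1; sub width; sub #8 */
    ptrdiff_t pos = 8;                         /* ld1 {v16.8b}, [src], #8 */
    for (int w9 = width; w9 > 0; w9 -= 16)
        pos += 16;                             /* ld1 {v17.8b-v18.8b}, [src], #16 */
    return pos + x13;                          /* add src, src, x13 */
}

/* Same for dst in the plain qpel case, which stores int16_t samples. */
static ptrdiff_t dst_advance_per_row_pair_qpel(int width)
{
    assert(width > 0 && width % 16 == 0);
    ptrdiff_t x14 = (MAX_PB_SIZE << 2) - 2 * width; /* mov; sub width, uxtw #1 */
    ptrdiff_t pos = 0;
    for (int w9 = width; w9 > 0; w9 -= 16)
        pos += 32;                             /* st1 {v26.8h, v27.8h}, [dst], #32 */
    return pos + x14;                          /* add dst, dst, x14 */
}
```

In both cases the net movement is exactly two rows, so no multiply is needed
to reset the pointers between passes.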

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


* [FFmpeg-devel] [PATCH 05/21] aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (3 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 04/21] aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 06/21] aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8 Martin Storsjö
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)
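
Reader's note: this patch carries no description, so here is a rough model of
why the two sequences are interchangeable. `ld1r {v28.2d}, [x9]` loads one
64-bit element and replicates it into both lanes, which is what `ldr x11` plus
`dup v28.2d, x11` did through a general-purpose register. The sketch below is
portable C, not NEON. The `vec128` type and the function names are
hypothetical, and the filter table is the HEVC luma filter set with an assumed
all-zero row at index 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t d[2]; } vec128; /* stand-in for a 128-bit register */

/* Assumed layout: 8 bytes per entry, matching "add x9, x9, x12, lsl #3". */
static const int8_t qpel_filters[4][8] = {
    {  0, 0,   0,  0,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* Old sequence: ldr x11, [x9] ; dup v28.2d, x11 */
static vec128 load_then_dup(const int8_t *p)
{
    uint64_t x11;
    vec128 v;
    memcpy(&x11, p, sizeof(x11));
    v.d[0] = x11;
    v.d[1] = x11;
    return v;
}

/* New sequence: ld1r {v28.2d}, [x9] */
static vec128 load_replicate(const int8_t *p)
{
    vec128 v;
    memcpy(&v.d[0], p, 8);
    memcpy(&v.d[1], p, 8);
    return v;
}

/* Returns 1 if both sequences leave the same bytes in the register. */
static int sequences_agree(int mx)
{
    assert(mx >= 0 && mx < 4);
    vec128 a = load_then_dup(qpel_filters[mx]);
    vec128 b = load_replicate(qpel_filters[mx]);
    return memcmp(&a, &b, sizeof(a)) == 0 && a.d[0] == a.d[1];
}
```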

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 0fcded344b..062b7d4d0f 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2462,8 +2462,7 @@ endfunc
         sub             x2, x2, #3
         movrel          x9, qpel_filters
         add             x9, x9, x12, lsl #3
-        ldr             x11, [x9]
-        dup             v28.2d, x11
+        ld1r            {v28.2d}, [x9]
         mov             w10, #-6
         sub             w10, w10, w5
         dup             v30.4s, w6              // wx
-- 
2.39.3 (Apple Git-146)


* [FFmpeg-devel] [PATCH 06/21] aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (4 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 05/21] aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 07/21] aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8 Martin Storsjö
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

AWS Graviton 3:
put_hevc_epel_h4_8_c: 64.7
put_hevc_epel_h4_8_neon: 25.0
put_hevc_epel_h4_8_i8mm: 21.2
put_hevc_epel_h6_8_c: 130.0
put_hevc_epel_h6_8_neon: 40.7
put_hevc_epel_h6_8_i8mm: 36.5
put_hevc_epel_h8_8_c: 209.0
put_hevc_epel_h8_8_neon: 45.2
put_hevc_epel_h8_8_i8mm: 41.2
put_hevc_epel_h12_8_c: 465.5
put_hevc_epel_h12_8_neon: 104.5
put_hevc_epel_h12_8_i8mm: 86.5
put_hevc_epel_h16_8_c: 830.7
put_hevc_epel_h16_8_neon: 134.2
put_hevc_epel_h16_8_i8mm: 114.0
put_hevc_epel_h24_8_c: 1844.7
put_hevc_epel_h24_8_neon: 282.2
put_hevc_epel_h24_8_i8mm: 277.2
put_hevc_epel_h32_8_c: 3227.5
put_hevc_epel_h32_8_neon: 501.5
put_hevc_epel_h32_8_i8mm: 396.0
put_hevc_epel_h48_8_c: 7229.2
put_hevc_epel_h48_8_neon: 1120.2
put_hevc_epel_h48_8_i8mm: 901.2
put_hevc_epel_h64_8_c: 12869.0
put_hevc_epel_h64_8_neon: 1999.2
put_hevc_epel_h64_8_i8mm: 1610.5
---
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 194 +++++++++++++++++++++-
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  17 ++
 2 files changed, 209 insertions(+), 2 deletions(-)
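
Reader's note: a scalar C sketch of what the new put_hevc_epel_h*_8_neon
functions compute, as far as it can be read from the diff below. Each output
is a 4-tap filter over src[x-1..x+2], stored as int16_t with a row pitch of
MAX_PB_SIZE samples. The NEON code gets the shifted inputs by widening to
16 bits and using ext by 2, 4 and 6 bytes. The table is the HEVC chroma filter
set with an assumed zero row at index 0. The function names are hypothetical,
and the one-pixel left offset is assumed from the filter definition, since the
full EPEL_H_HEADER body is not visible in this hunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* assumption: same value as in FFmpeg's hevc headers */

static const int8_t epel_filters[8][4] = {
    {  0,  0,  0,  0 },
    { -2, 58, 10, -2 },
    { -4, 54, 16, -2 },
    { -6, 46, 28, -4 },
    { -4, 36, 36, -4 },
    { -4, 28, 46, -6 },
    { -2, 16, 54, -4 },
    { -2, 10, 58, -2 },
};

/* Scalar reference for the 8-bit horizontal epel "put" (no rounding shift). */
static void put_epel_h_ref(int16_t *dst, const uint8_t *src,
                           ptrdiff_t srcstride, int height, int mx, int width)
{
    assert(mx >= 0 && mx < 8 && width > 0 && width <= MAX_PB_SIZE);
    const int8_t *f = epel_filters[mx];
    for (int y = 0; y < height; y++) {
        const uint8_t *s = src + y * srcstride;
        int16_t *d = dst + y * MAX_PB_SIZE;
        for (int x = 0; x < width; x++)
            d[x] = f[0] * s[x - 1] + f[1] * s[x] +
                   f[2] * s[x + 1] + f[3] * s[x + 2];
    }
}

/* Helper: filter one flat row of 8 pixels and return the first output. */
static int epel_h_flat(uint8_t pixel, int mx)
{
    uint8_t src[1 + 8 + 2]; /* 1 px left margin, 8 px, 2 px right margin */
    int16_t dst[MAX_PB_SIZE];
    for (size_t i = 0; i < sizeof(src); i++)
        src[i] = pixel;
    put_epel_h_ref(dst, src + 1, 0, 1, mx, 8);
    return dst[0];
}
```

Every non-zero filter row sums to 64, so a flat input of value p yields 64 * p.
That makes a quick sanity check when comparing against the NEON output.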

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index d3f0a26f79..419e83529a 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -1321,8 +1321,6 @@ function ff_hevc_put_hevc_epel_uni_v64_8_neon, export=1
         ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
 
 .macro EPEL_H_HEADER
         movrel          x5, epel_filters
@@ -1332,6 +1330,198 @@ ENABLE_I8MM
         mov             x10, #(MAX_PB_SIZE * 2)
 .endm
 
+function ff_hevc_put_hevc_epel_h4_8_neon, export=1
+        EPEL_H_HEADER
+        sxtl            v0.8h,   v30.8b
+1:      ld1             {v4.8b}, [x1], x2
+        subs            w3,  w3,  #1   // height
+        uxtl            v4.8h,   v4.8b
+        ext             v5.16b,  v4.16b,  v4.16b,  #2
+        ext             v6.16b,  v4.16b,  v4.16b,  #4
+        ext             v7.16b,  v4.16b,  v4.16b,  #6
+        mul             v16.4h,  v4.4h,   v0.h[0]
+        mla             v16.4h,  v5.4h,   v0.h[1]
+        mla             v16.4h,  v6.4h,   v0.h[2]
+        mla             v16.4h,  v7.4h,   v0.h[3]
+        st1             {v16.4h}, [x0], x10
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h6_8_neon, export=1
+        EPEL_H_HEADER
+        sxtl            v0.8h,   v30.8b
+        add             x6,  x0,  #8
+1:      ld1             {v3.16b},  [x1], x2
+        subs            w3,  w3,  #1   // height
+        uxtl2           v4.8h,   v3.16b
+        uxtl            v3.8h,   v3.8b
+        ext             v5.16b,  v3.16b,  v4.16b,  #2
+        ext             v6.16b,  v3.16b,  v4.16b,  #4
+        ext             v7.16b,  v3.16b,  v4.16b,  #6
+        mul             v16.8h,  v3.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        st1             {v16.4h},   [x0], x10
+        st1             {v16.s}[2], [x6], x10
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h8_8_neon, export=1
+        EPEL_H_HEADER
+        sxtl            v0.8h,   v30.8b
+1:      ld1             {v3.16b},  [x1], x2
+        subs            w3,  w3,  #1   // height
+        uxtl2           v4.8h,   v3.16b
+        uxtl            v3.8h,   v3.8b
+        ext             v5.16b,  v3.16b,  v4.16b,  #2
+        ext             v6.16b,  v3.16b,  v4.16b,  #4
+        ext             v7.16b,  v3.16b,  v4.16b,  #6
+        mul             v16.8h,  v3.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        st1             {v16.8h},   [x0], x10
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h12_8_neon, export=1
+        EPEL_H_HEADER
+        add             x6,  x0,  #16
+        sxtl            v0.8h,   v30.8b
+1:      ld1             {v3.16b}, [x1], x2
+        subs            w3,  w3,  #1   // height
+        uxtl2           v4.8h,   v3.16b
+        uxtl            v3.8h,   v3.8b
+        ext             v5.16b,  v3.16b,  v4.16b,  #2
+        ext             v6.16b,  v3.16b,  v4.16b,  #4
+        ext             v7.16b,  v3.16b,  v4.16b,  #6
+        ext             v20.16b, v4.16b,  v4.16b,  #2
+        ext             v21.16b, v4.16b,  v4.16b,  #4
+        ext             v22.16b, v4.16b,  v4.16b,  #6
+        mul             v16.8h,  v3.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        mul             v17.4h,  v4.4h,   v0.h[0]
+        mla             v17.4h,  v20.4h,  v0.h[1]
+        mla             v17.4h,  v21.4h,  v0.h[2]
+        mla             v17.4h,  v22.4h,  v0.h[3]
+        st1             {v16.8h}, [x0], x10
+        st1             {v17.4h}, [x6], x10
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h16_8_neon, export=1
+        EPEL_H_HEADER
+        sxtl            v0.8h,   v30.8b
+1:      ld1             {v1.8b, v2.8b, v3.8b}, [x1], x2
+        subs            w3,  w3,  #1   // height
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        ext             v5.16b,  v1.16b,  v2.16b,  #2
+        ext             v6.16b,  v1.16b,  v2.16b,  #4
+        ext             v7.16b,  v1.16b,  v2.16b,  #6
+        ext             v20.16b, v2.16b,  v3.16b,  #2
+        ext             v21.16b, v2.16b,  v3.16b,  #4
+        ext             v22.16b, v2.16b,  v3.16b,  #6
+        mul             v16.8h,  v1.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        mul             v17.8h,  v2.8h,   v0.h[0]
+        mla             v17.8h,  v20.8h,  v0.h[1]
+        mla             v17.8h,  v21.8h,  v0.h[2]
+        mla             v17.8h,  v22.8h,  v0.h[3]
+        st1             {v16.8h, v17.8h}, [x0], x10
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h24_8_neon, export=1
+        EPEL_H_HEADER
+        sxtl            v0.8h,   v30.8b
+1:      ld1             {v1.8b, v2.8b, v3.8b, v4.8b}, [x1], x2
+        subs            w3,  w3,  #1   // height
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        uxtl            v4.8h,   v4.8b
+        ext             v5.16b,  v1.16b,  v2.16b,  #2
+        ext             v6.16b,  v1.16b,  v2.16b,  #4
+        ext             v7.16b,  v1.16b,  v2.16b,  #6
+        ext             v20.16b, v2.16b,  v3.16b,  #2
+        ext             v21.16b, v2.16b,  v3.16b,  #4
+        ext             v22.16b, v2.16b,  v3.16b,  #6
+        ext             v23.16b, v3.16b,  v4.16b,  #2
+        ext             v24.16b, v3.16b,  v4.16b,  #4
+        ext             v25.16b, v3.16b,  v4.16b,  #6
+        mul             v16.8h,  v1.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        mul             v17.8h,  v2.8h,   v0.h[0]
+        mla             v17.8h,  v20.8h,  v0.h[1]
+        mla             v17.8h,  v21.8h,  v0.h[2]
+        mla             v17.8h,  v22.8h,  v0.h[3]
+        mul             v18.8h,  v3.8h,   v0.h[0]
+        mla             v18.8h,  v23.8h,  v0.h[1]
+        mla             v18.8h,  v24.8h,  v0.h[2]
+        mla             v18.8h,  v25.8h,  v0.h[3]
+        st1             {v16.8h, v17.8h, v18.8h}, [x0], x10
+        b.ne            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h32_8_neon, export=1
+        EPEL_H_HEADER
+        ld1             {v1.8b}, [x1], #8
+        sub             x2,  x2,  w6, uxtw    // decrement src stride
+        mov             w7,  w6               // original width
+        sub             x2,  x2,  #8          // decrement src stride
+        sub             x10, x10, w6, uxtw #1 // decrement dst stride
+        sxtl            v0.8h,   v30.8b
+        uxtl            v1.8h,   v1.8b
+1:      ld1             {v2.8b, v3.8b}, [x1], #16
+        subs            w6,  w6,  #16   // width
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        ext             v5.16b,  v1.16b,  v2.16b,  #2
+        ext             v6.16b,  v1.16b,  v2.16b,  #4
+        ext             v7.16b,  v1.16b,  v2.16b,  #6
+        ext             v20.16b, v2.16b,  v3.16b,  #2
+        ext             v21.16b, v2.16b,  v3.16b,  #4
+        ext             v22.16b, v2.16b,  v3.16b,  #6
+        mul             v16.8h,  v1.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        mul             v17.8h,  v2.8h,   v0.h[0]
+        mla             v17.8h,  v20.8h,  v0.h[1]
+        mla             v17.8h,  v21.8h,  v0.h[2]
+        mla             v17.8h,  v22.8h,  v0.h[3]
+        st1             {v16.8h, v17.8h}, [x0], #32
+        mov             v1.16b,  v3.16b
+        b.gt            1b
+        subs            w3,  w3,  #1   // height
+        add             x1,  x1,  x2
+        b.le            9f
+        ld1             {v1.8b}, [x1], #8
+        mov             w6,  w7
+        add             x0,  x0,  x10
+        uxtl            v1.8h,   v1.8b
+        b               1b
+9:
+        ret
+endfunc
+
+#if HAVE_I8MM
+ENABLE_I8MM
 function ff_hevc_put_hevc_epel_h4_8_neon_i8mm, export=1
         EPEL_H_HEADER
 1:      ld1             {v4.8b}, [x1], x2
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 1e9f5e32db..ece911b8d4 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -223,6 +223,10 @@ NEON8_FNPROTO_PARTIAL_4(qpel_uni_w_v, (uint8_t *_dst,  ptrdiff_t _dststride,
         int height, int denom, int wx, int ox,
         intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(epel_h, (int16_t *dst,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(epel_h, (int16_t *dst,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -290,6 +294,17 @@ NEON8_FNPROTO(qpel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
         member[8][v][h] = ff_hevc_put_hevc_##fn##48_8_neon##ext; \
         member[9][v][h] = ff_hevc_put_hevc_##fn##64_8_neon##ext;
 
+#define NEON8_FNASSIGN_SHARED_32(member, v, h, fn, ext) \
+        member[1][v][h] = ff_hevc_put_hevc_##fn##4_8_neon##ext;  \
+        member[2][v][h] = ff_hevc_put_hevc_##fn##6_8_neon##ext;  \
+        member[3][v][h] = ff_hevc_put_hevc_##fn##8_8_neon##ext;  \
+        member[4][v][h] = ff_hevc_put_hevc_##fn##12_8_neon##ext; \
+        member[5][v][h] = ff_hevc_put_hevc_##fn##16_8_neon##ext; \
+        member[6][v][h] = ff_hevc_put_hevc_##fn##24_8_neon##ext; \
+        member[7][v][h] =                                        \
+        member[8][v][h] =                                        \
+        member[9][v][h] = ff_hevc_put_hevc_##fn##32_8_neon##ext;
+
 #define NEON8_FNASSIGN_PARTIAL_4(member, v, h, fn, ext) \
         member[1][v][h] = ff_hevc_put_hevc_##fn##4_8_neon##ext;  \
         member[3][v][h] = ff_hevc_put_hevc_##fn##8_8_neon##ext;  \
@@ -384,6 +399,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 0, epel_uni_w_v,);
         NEON8_FNASSIGN_PARTIAL_4(c->put_hevc_qpel_uni_w, 1, 0, qpel_uni_w_v,);
 
+        NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel, 0, 1, epel_h,);
+
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
-- 
2.39.3 (Apple Git-146)


* [FFmpeg-devel] [PATCH 07/21] aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (5 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 06/21] aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8 Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 08/21] aarch64: hevc: Split the epel_*_hv functions into two parts Martin Storsjö
                   ` (14 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

AWS Graviton 3:
put_hevc_epel_uni_w_h4_8_c: 97.2
put_hevc_epel_uni_w_h4_8_neon: 41.2
put_hevc_epel_uni_w_h4_8_i8mm: 35.2
put_hevc_epel_uni_w_h6_8_c: 203.7
put_hevc_epel_uni_w_h6_8_neon: 84.7
put_hevc_epel_uni_w_h6_8_i8mm: 74.7
put_hevc_epel_uni_w_h8_8_c: 345.7
put_hevc_epel_uni_w_h8_8_neon: 94.0
put_hevc_epel_uni_w_h8_8_i8mm: 80.7
put_hevc_epel_uni_w_h12_8_c: 768.7
put_hevc_epel_uni_w_h12_8_neon: 196.7
put_hevc_epel_uni_w_h12_8_i8mm: 169.7
put_hevc_epel_uni_w_h16_8_c: 1313.0
put_hevc_epel_uni_w_h16_8_neon: 290.7
put_hevc_epel_uni_w_h16_8_i8mm: 238.0
put_hevc_epel_uni_w_h24_8_c: 2877.5
put_hevc_epel_uni_w_h24_8_neon: 650.0
put_hevc_epel_uni_w_h24_8_i8mm: 512.0
put_hevc_epel_uni_w_h32_8_c: 5113.5
put_hevc_epel_uni_w_h32_8_neon: 1129.5
put_hevc_epel_uni_w_h32_8_i8mm: 739.2
put_hevc_epel_uni_w_h48_8_c: 11757.0
put_hevc_epel_uni_w_h48_8_neon: 2518.7
put_hevc_epel_uni_w_h48_8_i8mm: 1688.5
put_hevc_epel_uni_w_h64_8_c: 20478.0
put_hevc_epel_uni_w_h64_8_neon: 4411.7
put_hevc_epel_uni_w_h64_8_i8mm: 2884.0
---
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 326 +++++++++++++++++++++-
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   6 +
 2 files changed, 319 insertions(+), 13 deletions(-)
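
Reader's note: the uni_w functions below run the same 4-tap horizontal filter
as put_hevc_epel_h and then apply the weighted-prediction step per pixel. A
scalar sketch of that step for 8-bit content, read from the instruction
sequence (smull by wx, sqrshl by -(denom + 6), sqadd ox, saturating narrow to
8 bits). The function name is hypothetical, and the sketch ignores the 32-bit
saturation that the NEON instructions would apply for out-of-range inputs.

```c
#include <assert.h>
#include <stdint.h>

/* filtered: 16-bit intermediate from the horizontal filter (64 * pixel for
 * a flat area). Returns the weighted, offset and clipped 8-bit pixel. */
static uint8_t uni_w_pixel(int filtered, int denom, int wx, int ox)
{
    assert(denom >= 0 && denom <= 7);
    int shift = denom + 6;                 /* mov w10, #-6 ; sub w10, w10, w5 */
    int v = filtered * wx;                 /* smull  */
    v = (v + (1 << (shift - 1))) >> shift; /* sqrshl by a negative amount:
                                              rounding right shift; assumes an
                                              arithmetic >> for negative v */
    v += ox;                               /* sqadd  */
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v; /* sqxtn + sqxtun */
}
```

For example, a flat area of value 100 gives filtered = 6400. With denom = 6,
wx = 64 and ox = 0 the weight is the identity and the result is 100 again.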

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 419e83529a..0e49491a81 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -1520,6 +1520,319 @@ function ff_hevc_put_hevc_epel_h32_8_neon, export=1
         ret
 endfunc
 
+.macro EPEL_UNI_W_H_HEADER elems=4s
+        ldr             x12, [sp]
+        sub             x2, x2, #1
+        movrel          x9, epel_filters
+        add             x9, x9, x12, lsl #2
+        ld1r            {v28.4s}, [x9]
+        mov             w10, #-6
+        sub             w10, w10, w5
+        dup             v30.\elems, w6
+        dup             v31.4s, w10
+        dup             v29.4s, w7
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_h4_8_neon, export=1
+        EPEL_UNI_W_H_HEADER 4h
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v4.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v4.8h,   v4.8b
+        ext             v5.16b,  v4.16b,  v4.16b,  #2
+        ext             v6.16b,  v4.16b,  v4.16b,  #4
+        ext             v7.16b,  v4.16b,  v4.16b,  #6
+        mul             v16.4h,  v4.4h,   v0.h[0]
+        mla             v16.4h,  v5.4h,   v0.h[1]
+        mla             v16.4h,  v6.4h,   v0.h[2]
+        mla             v16.4h,  v7.4h,   v0.h[3]
+        smull           v16.4s,  v16.4h,  v30.4h
+        sqrshl          v16.4s,  v16.4s,  v31.4s
+        sqadd           v16.4s,  v16.4s,  v29.4s
+        sqxtn           v16.4h,  v16.4s
+        sqxtun          v16.8b,  v16.8h
+        str             s16, [x0]
+        add             x0,  x0,  x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_neon, export=1
+        EPEL_UNI_W_H_HEADER 8h
+        sub             x1,  x1,  #4
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v3.8b, v4.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v3.8h,   v3.8b
+        uxtl            v4.8h,   v4.8b
+        ext             v5.16b,  v3.16b,  v4.16b,  #2
+        ext             v6.16b,  v3.16b,  v4.16b,  #4
+        ext             v7.16b,  v3.16b,  v4.16b,  #6
+        mul             v16.8h,  v3.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        smull           v17.4s,  v16.4h,  v30.4h
+        smull2          v18.4s,  v16.8h,  v30.8h
+        sqrshl          v17.4s,  v17.4s,  v31.4s
+        sqrshl          v18.4s,  v18.4s,  v31.4s
+        sqadd           v17.4s,  v17.4s,  v29.4s
+        sqadd           v18.4s,  v18.4s,  v29.4s
+        sqxtn           v16.4h,  v17.4s
+        sqxtn2          v16.8h,  v18.4s
+        sqxtun          v16.8b,  v16.8h
+        str             s16, [x0], #4
+        st1             {v16.h}[2], [x0], x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_neon, export=1
+        EPEL_UNI_W_H_HEADER 8h
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v3.8b, v4.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v3.8h,   v3.8b
+        uxtl            v4.8h,   v4.8b
+        ext             v5.16b,  v3.16b,  v4.16b,  #2
+        ext             v6.16b,  v3.16b,  v4.16b,  #4
+        ext             v7.16b,  v3.16b,  v4.16b,  #6
+        mul             v16.8h,  v3.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        smull           v17.4s,  v16.4h,  v30.4h
+        smull2          v18.4s,  v16.8h,  v30.8h
+        sqrshl          v17.4s,  v17.4s,  v31.4s
+        sqrshl          v18.4s,  v18.4s,  v31.4s
+        sqadd           v17.4s,  v17.4s,  v29.4s
+        sqadd           v18.4s,  v18.4s,  v29.4s
+        sqxtn           v16.4h,  v17.4s
+        sqxtn2          v16.8h,  v18.4s
+        sqxtun          v16.8b,  v16.8h
+        st1             {v16.8b}, [x0], x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_neon, export=1
+        EPEL_UNI_W_H_HEADER 8h
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v3.8b, v4.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v3.8h,   v3.8b
+        uxtl            v4.8h,   v4.8b
+        ext             v5.16b,  v3.16b,  v4.16b,  #2
+        ext             v6.16b,  v3.16b,  v4.16b,  #4
+        ext             v7.16b,  v3.16b,  v4.16b,  #6
+        ext             v20.16b, v4.16b,  v4.16b,  #2
+        ext             v21.16b, v4.16b,  v4.16b,  #4
+        ext             v22.16b, v4.16b,  v4.16b,  #6
+        mul             v16.8h,  v3.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        mul             v17.4h,  v4.4h,   v0.h[0]
+        mla             v17.4h,  v20.4h,  v0.h[1]
+        mla             v17.4h,  v21.4h,  v0.h[2]
+        mla             v17.4h,  v22.4h,  v0.h[3]
+        smull           v18.4s,  v16.4h,  v30.4h
+        smull2          v19.4s,  v16.8h,  v30.8h
+        smull           v20.4s,  v17.4h,  v30.4h
+        sqrshl          v18.4s,  v18.4s,  v31.4s
+        sqrshl          v19.4s,  v19.4s,  v31.4s
+        sqrshl          v20.4s,  v20.4s,  v31.4s
+        sqadd           v18.4s,  v18.4s,  v29.4s
+        sqadd           v19.4s,  v19.4s,  v29.4s
+        sqadd           v20.4s,  v20.4s,  v29.4s
+        sqxtn           v16.4h,  v18.4s
+        sqxtn2          v16.8h,  v19.4s
+        sqxtn           v17.4h,  v20.4s
+        sqxtun          v16.8b,  v16.8h
+        sqxtun          v17.8b,  v17.8h
+        str             d16, [x0]
+        str             s17, [x0, #8]
+        add             x0,  x0,  x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_neon, export=1
+        EPEL_UNI_W_H_HEADER 8h
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v1.8b, v2.8b, v3.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        ext             v5.16b,  v1.16b,  v2.16b,  #2
+        ext             v6.16b,  v1.16b,  v2.16b,  #4
+        ext             v7.16b,  v1.16b,  v2.16b,  #6
+        ext             v20.16b, v2.16b,  v3.16b,  #2
+        ext             v21.16b, v2.16b,  v3.16b,  #4
+        ext             v22.16b, v2.16b,  v3.16b,  #6
+        mul             v16.8h,  v1.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        mul             v17.8h,  v2.8h,   v0.h[0]
+        mla             v17.8h,  v20.8h,  v0.h[1]
+        mla             v17.8h,  v21.8h,  v0.h[2]
+        mla             v17.8h,  v22.8h,  v0.h[3]
+        smull           v18.4s,  v16.4h,  v30.4h
+        smull2          v19.4s,  v16.8h,  v30.8h
+        smull           v20.4s,  v17.4h,  v30.4h
+        smull2          v21.4s,  v17.8h,  v30.8h
+        sqrshl          v18.4s,  v18.4s,  v31.4s
+        sqrshl          v19.4s,  v19.4s,  v31.4s
+        sqrshl          v20.4s,  v20.4s,  v31.4s
+        sqrshl          v21.4s,  v21.4s,  v31.4s
+        sqadd           v18.4s,  v18.4s,  v29.4s
+        sqadd           v19.4s,  v19.4s,  v29.4s
+        sqadd           v20.4s,  v20.4s,  v29.4s
+        sqadd           v21.4s,  v21.4s,  v29.4s
+        sqxtn           v16.4h,  v18.4s
+        sqxtn2          v16.8h,  v19.4s
+        sqxtn           v17.4h,  v20.4s
+        sqxtn2          v17.8h,  v21.4s
+        sqxtun          v16.8b,  v16.8h
+        sqxtun          v17.8b,  v17.8h
+        st1             {v16.8b, v17.8b}, [x0], x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_neon, export=1
+        EPEL_UNI_W_H_HEADER 8h
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v1.8b, v2.8b, v3.8b, v4.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        uxtl            v4.8h,   v4.8b
+        ext             v5.16b,  v1.16b,  v2.16b,  #2
+        ext             v6.16b,  v1.16b,  v2.16b,  #4
+        ext             v7.16b,  v1.16b,  v2.16b,  #6
+        ext             v20.16b, v2.16b,  v3.16b,  #2
+        ext             v21.16b, v2.16b,  v3.16b,  #4
+        ext             v22.16b, v2.16b,  v3.16b,  #6
+        ext             v23.16b, v3.16b,  v4.16b,  #2
+        ext             v24.16b, v3.16b,  v4.16b,  #4
+        ext             v25.16b, v3.16b,  v4.16b,  #6
+        mul             v16.8h,  v1.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        mul             v17.8h,  v2.8h,   v0.h[0]
+        mla             v17.8h,  v20.8h,  v0.h[1]
+        mla             v17.8h,  v21.8h,  v0.h[2]
+        mla             v17.8h,  v22.8h,  v0.h[3]
+        mul             v18.8h,  v3.8h,   v0.h[0]
+        mla             v18.8h,  v23.8h,  v0.h[1]
+        mla             v18.8h,  v24.8h,  v0.h[2]
+        mla             v18.8h,  v25.8h,  v0.h[3]
+        smull           v20.4s,  v16.4h,  v30.4h
+        smull2          v21.4s,  v16.8h,  v30.8h
+        smull           v22.4s,  v17.4h,  v30.4h
+        smull2          v23.4s,  v17.8h,  v30.8h
+        smull           v24.4s,  v18.4h,  v30.4h
+        smull2          v25.4s,  v18.8h,  v30.8h
+        sqrshl          v20.4s,  v20.4s,  v31.4s
+        sqrshl          v21.4s,  v21.4s,  v31.4s
+        sqrshl          v22.4s,  v22.4s,  v31.4s
+        sqrshl          v23.4s,  v23.4s,  v31.4s
+        sqrshl          v24.4s,  v24.4s,  v31.4s
+        sqrshl          v25.4s,  v25.4s,  v31.4s
+        sqadd           v20.4s,  v20.4s,  v29.4s
+        sqadd           v21.4s,  v21.4s,  v29.4s
+        sqadd           v22.4s,  v22.4s,  v29.4s
+        sqadd           v23.4s,  v23.4s,  v29.4s
+        sqadd           v24.4s,  v24.4s,  v29.4s
+        sqadd           v25.4s,  v25.4s,  v29.4s
+        sqxtn           v16.4h,  v20.4s
+        sqxtn2          v16.8h,  v21.4s
+        sqxtn           v17.4h,  v22.4s
+        sqxtn2          v17.8h,  v23.4s
+        sqxtn           v18.4h,  v24.4s
+        sqxtn2          v18.8h,  v25.4s
+        sqxtun          v16.8b,  v16.8h
+        sqxtun          v17.8b,  v17.8h
+        sqxtun          v18.8b,  v18.8h
+        st1             {v16.8b, v17.8b, v18.8b}, [x0], x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_neon, export=1
+        EPEL_UNI_W_H_HEADER 8h
+        ldr             w10, [sp, #16]        // width
+        ld1             {v1.8b}, [x2], #8
+        sub             x3,  x3,  w10, uxtw   // decrement src stride
+        mov             w11, w10              // original width
+        sub             x3,  x3,  #8          // decrement src stride
+        sub             x1,  x1,  w10, uxtw   // decrement dst stride
+        sxtl            v0.8h,   v28.8b
+        uxtl            v1.8h,   v1.8b
+1:
+        ld1             {v2.8b, v3.8b}, [x2], #16
+        subs            w10, w10, #16         // width
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        ext             v5.16b,  v1.16b,  v2.16b,  #2
+        ext             v6.16b,  v1.16b,  v2.16b,  #4
+        ext             v7.16b,  v1.16b,  v2.16b,  #6
+        ext             v20.16b, v2.16b,  v3.16b,  #2
+        ext             v21.16b, v2.16b,  v3.16b,  #4
+        ext             v22.16b, v2.16b,  v3.16b,  #6
+        mul             v16.8h,  v1.8h,   v0.h[0]
+        mla             v16.8h,  v5.8h,   v0.h[1]
+        mla             v16.8h,  v6.8h,   v0.h[2]
+        mla             v16.8h,  v7.8h,   v0.h[3]
+        mul             v17.8h,  v2.8h,   v0.h[0]
+        mla             v17.8h,  v20.8h,  v0.h[1]
+        mla             v17.8h,  v21.8h,  v0.h[2]
+        mla             v17.8h,  v22.8h,  v0.h[3]
+        smull           v18.4s,  v16.4h,  v30.4h
+        smull2          v19.4s,  v16.8h,  v30.8h
+        smull           v20.4s,  v17.4h,  v30.4h
+        smull2          v21.4s,  v17.8h,  v30.8h
+        sqrshl          v18.4s,  v18.4s,  v31.4s
+        sqrshl          v19.4s,  v19.4s,  v31.4s
+        sqrshl          v20.4s,  v20.4s,  v31.4s
+        sqrshl          v21.4s,  v21.4s,  v31.4s
+        sqadd           v18.4s,  v18.4s,  v29.4s
+        sqadd           v19.4s,  v19.4s,  v29.4s
+        sqadd           v20.4s,  v20.4s,  v29.4s
+        sqadd           v21.4s,  v21.4s,  v29.4s
+        sqxtn           v16.4h,  v18.4s
+        sqxtn2          v16.8h,  v19.4s
+        sqxtn           v17.4h,  v20.4s
+        sqxtn2          v17.8h,  v21.4s
+        sqxtun          v16.8b,  v16.8h
+        sqxtun          v17.8b,  v17.8h
+        st1             {v16.8b, v17.8b}, [x0], #16
+        mov             v1.16b,  v3.16b
+        b.gt            1b
+        subs            w4,  w4,  #1          // height
+        add             x2,  x2,  x3
+        b.le            9f
+        ld1             {v1.8b}, [x2], #8
+        mov             w10, w11
+        add             x0,  x0,  x1
+        uxtl            v1.8h,   v1.8b
+        b               1b
+9:
+        ret
+endfunc
+
+
 #if HAVE_I8MM
 ENABLE_I8MM
 function ff_hevc_put_hevc_epel_h4_8_neon_i8mm, export=1
@@ -2410,19 +2723,6 @@ function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
         ret
 endfunc
 
-.macro EPEL_UNI_W_H_HEADER
-        ldr             x12, [sp]
-        sub             x2, x2, #1
-        movrel          x9, epel_filters
-        add             x9, x9, x12, lsl #2
-        ld1r            {v28.4s}, [x9]
-        mov             w10, #-6
-        sub             w10, w10, w5
-        dup             v30.4s, w6
-        dup             v31.4s, w10
-        dup             v29.4s, w7
-.endm
-
 
 function ff_hevc_put_hevc_epel_uni_w_h4_8_neon_i8mm, export=1
         EPEL_UNI_W_H_HEADER
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index ece911b8d4..be24737c9c 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -235,6 +235,11 @@ NEON8_FNPROTO(epel_hv, (int16_t *dst,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
 
+NEON8_FNPROTO(epel_uni_w_h, (uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(epel_uni_w_h, (uint8_t *_dst,  ptrdiff_t _dststride,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, int denom, int wx, int ox,
@@ -400,6 +405,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN_PARTIAL_4(c->put_hevc_qpel_uni_w, 1, 0, qpel_uni_w_v,);
 
         NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel, 0, 1, epel_h,);
+        NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h,);
 
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
-- 
2.39.3 (Apple Git-146)

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


* [FFmpeg-devel] [PATCH 08/21] aarch64: hevc: Split the epel_*_hv functions into two parts
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (6 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 07/21] aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8 Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 09/21] aarch64: hevc: Reorder epel_hv functions to prepare for templating Martin Storsjö
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

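Split each epel_*_hv function into two parts. The first part, the
horizontal filter, can use either the i8mm or the plain neon version,
while the second part, the vertical filter, is a pure neon
implementation that both variants can share.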
---
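For readers who want the shape of the split without reading the
assembly, here is a rough scalar C sketch of the same idea. It is not
FFmpeg's code: the names epel_h_fn, epel_h_scalar, epel_hv_end and
epel_hv are hypothetical, the taps are passed in directly instead of
being looked up in a filter table, and the destination stride of
MAX_PB_SIZE is an assumption. The first pass is reached through a
function pointer (standing in for the i8mm or plain neon horizontal
filter), and the second pass is one shared routine.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* row stride of the intermediate buffer, in int16_t units */
#define EPEL_EXTRA  3   /* extra rows a 4-tap vertical filter needs */

/* Type of the first (horizontal) pass; hypothetical name. */
typedef void (*epel_h_fn)(int16_t *tmp, const uint8_t *src, ptrdiff_t srcstride,
                          int rows, int width, const int8_t taps[4]);

/* Scalar stand-in for either horizontal implementation (i8mm or plain neon).
 * Reads one pixel to the left and two to the right of each output column. */
static void epel_h_scalar(int16_t *tmp, const uint8_t *src, ptrdiff_t srcstride,
                          int rows, int width, const int8_t taps[4])
{
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < width; x++)
            tmp[x] = (int16_t)(taps[0] * src[x - 1] + taps[1] * src[x] +
                               taps[2] * src[x + 1] + taps[3] * src[x + 2]);
        src += srcstride;
        tmp += MAX_PB_SIZE;
    }
}

/* Shared second (vertical) pass: 4 taps down each column of tmp, then >> 6. */
static void epel_hv_end(int16_t *dst, const int16_t *tmp,
                        int height, int width, const int8_t taps[4])
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = taps[0] * tmp[x] +
                      taps[1] * tmp[x + MAX_PB_SIZE] +
                      taps[2] * tmp[x + 2 * MAX_PB_SIZE] +
                      taps[3] * tmp[x + 3 * MAX_PB_SIZE];
            dst[x] = (int16_t)(sum >> 6);
        }
        tmp += MAX_PB_SIZE;
        dst += MAX_PB_SIZE;  /* assumed destination stride */
    }
}

/* Entry point: run the chosen horizontal pass into a temporary array of
 * height + 3 rows, starting one row above the block, then hand over to
 * the shared vertical pass. The caller must provide padded source data. */
static void epel_hv(epel_h_fn h_pass, int16_t *dst,
                    const uint8_t *src, ptrdiff_t srcstride,
                    int height, int width,
                    const int8_t taps_h[4], const int8_t taps_v[4])
{
    int16_t tmp[(MAX_PB_SIZE + EPEL_EXTRA) * MAX_PB_SIZE];

    h_pass(tmp, src - srcstride, srcstride, height + EPEL_EXTRA, width, taps_h);
    epel_hv_end(dst, tmp, height, width, taps_v);
}
```

In the patch itself the handover is a plain branch ("b") at the end of
each existing hv function, so the tail behaves like a tail call and
returns directly to the original caller. Only the first half still
depends on which horizontal filter is called.
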
 libavcodec/aarch64/hevcdsp_epel_neon.S | 100 +++++++++++++++++++++++++
 1 file changed, 100 insertions(+)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 0e49491a81..6be171ece1 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2186,6 +2186,10 @@ function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv4_8_end_neon
         load_epel_filterh x5, x4
         mov             x10, #(MAX_PB_SIZE * 2)
         ldr             d16, [sp]
@@ -2215,6 +2219,10 @@ function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv6_8_end_neon
         load_epel_filterh x5, x4
         mov             x5, #120
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2247,6 +2255,10 @@ function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv8_8_end_neon
         load_epel_filterh x5, x4
         mov             x10, #(MAX_PB_SIZE * 2)
         ldr             q16, [sp]
@@ -2277,6 +2289,10 @@ function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv12_8_end_neon
         load_epel_filterh x5, x4
         mov             x5, #112
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2309,6 +2325,10 @@ function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv16_8_end_neon
         load_epel_filterh x5, x4
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h, v17.8h}, [sp], x10
@@ -2340,6 +2360,10 @@ function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv24_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv24_8_end_neon
         load_epel_filterh x5, x4
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h, v17.8h, v18.8h}, [sp], x10
@@ -2445,6 +2469,10 @@ function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv4_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.4h}, [sp], x10
@@ -2478,6 +2506,10 @@ function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv6_8_end_neon
         load_epel_filterh x6, x5
         sub             x1, x1, #4
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2514,6 +2546,10 @@ function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv8_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h}, [sp], x10
@@ -2548,6 +2584,10 @@ function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv12_8_end_neon
         load_epel_filterh x6, x5
         sub             x1, x1, #8
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2586,6 +2626,10 @@ function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv16_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h, v17.8h}, [sp], x10
@@ -2623,6 +2667,10 @@ function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv24_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv24_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h, v17.8h, v18.8h}, [sp], x10
@@ -3173,6 +3221,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv4_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.4h}, [sp], x10
@@ -3240,6 +3292,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv6_8_end_neon
         load_epel_filterh x6, x5
         sub             x1, x1, #4
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3312,6 +3368,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv8_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h}, [sp], x10
@@ -3379,6 +3439,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv12_8_end_neon
         load_epel_filterh x6, x5
         sub             x1, x1, #8
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3459,6 +3523,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv16_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h, v17.8h}, [sp], x10
@@ -3538,6 +3606,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv24_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv24_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h, v17.8h, v18.8h}, [sp], x10
@@ -3715,6 +3787,10 @@ function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv4_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.4h}, [sp], x10
@@ -3751,6 +3827,10 @@ function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv6_8_end_neon
         load_epel_filterh x7, x6
         sub             x1, x1, #4
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3790,6 +3870,10 @@ function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv8_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h}, [sp], x10
@@ -3827,6 +3911,10 @@ function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv12_8_end_neon
         load_epel_filterh x7, x6
         sub             x1, x1, #8
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3869,6 +3957,10 @@ function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv16_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h, v17.8h}, [sp], x10
@@ -3910,6 +4002,10 @@ function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv24_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv24_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h, v17.8h, v18.8h}, [sp], x10
@@ -3956,6 +4052,10 @@ function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv32_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv32_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
         ld1             {v16.8h, v17.8h, v18.8h, v19.8h}, [sp], x10
-- 
2.39.3 (Apple Git-146)



* [FFmpeg-devel] [PATCH 09/21] aarch64: hevc: Reorder epel_hv functions to prepare for templating
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (7 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 08/21] aarch64: hevc: Split the epel_*_hv functions into two parts Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 10/21] aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm Martin Storsjö
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

This is a pure reordering of code without changing anything in
the individual functions.
---
 libavcodec/aarch64/hevcdsp_epel_neon.S | 971 +++++++++++++------------
 1 file changed, 497 insertions(+), 474 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 6be171ece1..2088630da1 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2173,21 +2173,9 @@ function ff_hevc_put_hevc_epel_h64_8_neon_i8mm, export=1
         ret
 endfunc
 
+DISABLE_I8MM
+#endif
 
-function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
-        add             w10, w3, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        sub             x1, x1, x2
-        add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_epel_hv4_8_end_neon
-endfunc
 
 function hevc_put_hevc_epel_hv4_8_end_neon
         load_epel_filterh x5, x4
@@ -2207,21 +2195,6 @@ function hevc_put_hevc_epel_hv4_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
-        add             w10, w3, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        sub             x1, x1, x2
-        add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_epel_hv6_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_hv6_8_end_neon
         load_epel_filterh x5, x4
         mov             x5, #120
@@ -2243,21 +2216,6 @@ function hevc_put_hevc_epel_hv6_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
-        add             w10, w3, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        sub             x1, x1, x2
-        add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_epel_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_hv8_8_end_neon
         load_epel_filterh x5, x4
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2277,21 +2235,6 @@ function hevc_put_hevc_epel_hv8_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
-        add             w10, w3, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        sub             x1, x1, x2
-        add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_epel_hv12_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_hv12_8_end_neon
         load_epel_filterh x5, x4
         mov             x5, #112
@@ -2313,21 +2256,6 @@ function hevc_put_hevc_epel_hv12_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
-        add             w10, w3, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        sub             x1, x1, x2
-        add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_epel_hv16_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_hv16_8_end_neon
         load_epel_filterh x5, x4
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2348,21 +2276,6 @@ function hevc_put_hevc_epel_hv16_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
-        add             w10, w3, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        sub             x1, x1, x2
-        add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_epel_hv24_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_hv24_8_end_neon
         load_epel_filterh x5, x4
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2385,6 +2298,99 @@ function hevc_put_hevc_epel_hv24_8_end_neon
 2:      ret
 endfunc
 
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
+        add             w10, w3, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        add             w3, w3, #3
+        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_epel_hv24_8_end_neon
+endfunc
+
 function ff_hevc_put_hevc_epel_hv32_8_neon_i8mm, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
@@ -2453,24 +2459,8 @@ function ff_hevc_put_hevc_epel_hv64_8_neon_i8mm, export=1
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
-        add             w10, w4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_hv4_8_end_neon
-endfunc
+DISABLE_I8MM
+#endif
 
 function hevc_put_hevc_epel_uni_hv4_8_end_neon
         load_epel_filterh x6, x5
@@ -2490,25 +2480,6 @@ function hevc_put_hevc_epel_uni_hv4_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
-        add             w10, w4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_hv6_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_hv6_8_end_neon
         load_epel_filterh x6, x5
         sub             x1, x1, #4
@@ -2530,25 +2501,6 @@ function hevc_put_hevc_epel_uni_hv6_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
-        add             w10, w4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_hv8_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2568,25 +2520,6 @@ function hevc_put_hevc_epel_uni_hv8_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
-        add             w10, w4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_hv12_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_hv12_8_end_neon
         load_epel_filterh x6, x5
         sub             x1, x1, #8
@@ -2610,25 +2543,6 @@ function hevc_put_hevc_epel_uni_hv12_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
-        add             w10, w4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_hv16_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_hv16_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2651,25 +2565,6 @@ function hevc_put_hevc_epel_uni_hv16_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
-        add             w10, w4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_hv24_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_hv24_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2695,6 +2590,123 @@ function hevc_put_hevc_epel_uni_hv24_8_end_neon
 2:      ret
 endfunc
 
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
+        add             w10, w4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
+        add             w10, w4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
+        add             w10, w4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
+        add             w10, w4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
+        add             w10, w4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
+        add             w10, w4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_hv24_8_end_neon
+endfunc
+
 function ff_hevc_put_hevc_epel_uni_hv32_8_neon_i8mm, export=1
         stp             x5, x6, [sp, #-64]!
         stp             x3, x4, [sp, #16]
@@ -3098,6 +3110,8 @@ function ff_hevc_put_hevc_epel_uni_w_h64_8_neon_i8mm, export=1
         b.hi            1b
         ret
 endfunc
+DISABLE_I8MM
+#endif
 
 .macro epel_uni_w_hv_start
         mov             x15, x5         //denom
@@ -3202,28 +3216,6 @@ endfunc
 
 
 
-function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
-        epel_uni_w_hv_start
-        sxtw            x4, w4
-
-        add             x10, x4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10     // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             x3, x4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_w_hv4_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_w_hv4_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3273,28 +3265,6 @@ function hevc_put_hevc_epel_uni_w_hv4_8_end_neon
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
-        epel_uni_w_hv_start
-        sxtw            x4, w4
-
-        add             x10, x4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10     // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             x3, x4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_w_hv6_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_w_hv6_8_end_neon
         load_epel_filterh x6, x5
         sub             x1, x1, #4
@@ -3349,28 +3319,6 @@ function hevc_put_hevc_epel_uni_w_hv6_8_end_neon
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
-        epel_uni_w_hv_start
-        sxtw            x4, w4
-
-        add             x10, x4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10     // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             x3, x4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_w_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_w_hv8_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3420,28 +3368,6 @@ function hevc_put_hevc_epel_uni_w_hv8_8_end_neon
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
-        epel_uni_w_hv_start
-        sxtw            x4, w4
-
-        add             x10, x4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10     // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             x3, x4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_w_hv12_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_w_hv12_8_end_neon
         load_epel_filterh x6, x5
         sub             x1, x1, #8
@@ -3504,28 +3430,6 @@ function hevc_put_hevc_epel_uni_w_hv12_8_end_neon
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
-        epel_uni_w_hv_start
-        sxtw            x4, w4
-
-        add             x10, x4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10     // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             x3, x4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_w_hv16_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_w_hv16_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3587,28 +3491,6 @@ function hevc_put_hevc_epel_uni_w_hv16_8_end_neon
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
-        epel_uni_w_hv_start
-        sxtw            x4, w4
-
-        add             x10, x4, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10     // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             x3, x4, #3
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_epel_uni_w_hv24_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_uni_w_hv24_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3686,6 +3568,141 @@ function hevc_put_hevc_epel_uni_w_hv24_8_end_neon
         ret
 endfunc
 
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
+        epel_uni_w_hv_start
+        sxtw            x4, w4
+
+        add             x10, x4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10     // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             x3, x4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
+        epel_uni_w_hv_start
+        sxtw            x4, w4
+
+        add             x10, x4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10     // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             x3, x4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
+        epel_uni_w_hv_start
+        sxtw            x4, w4
+
+        add             x10, x4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10     // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             x3, x4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
+        epel_uni_w_hv_start
+        sxtw            x4, w4
+
+        add             x10, x4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10     // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             x3, x4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
+        epel_uni_w_hv_start
+        sxtw            x4, w4
+
+        add             x10, x4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10     // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             x3, x4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
+        epel_uni_w_hv_start
+        sxtw            x4, w4
+
+        add             x10, x4, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10     // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             x3, x4, #3
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_epel_uni_w_hv24_8_end_neon
+endfunc
+
 function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
         ldp             x15, x16, [sp]
         mov             x17, #16
@@ -3769,26 +3786,9 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_neon_i8mm, export=1
         ret
 endfunc
 
+DISABLE_I8MM
+#endif
 
-function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
-        add             w10, w5, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w5, #3
-        mov             x4, x6
-        mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_epel_bi_hv4_8_end_neon
-endfunc
 
 function hevc_put_hevc_epel_bi_hv4_8_end_neon
         load_epel_filterh x7, x6
@@ -3810,26 +3810,6 @@ function hevc_put_hevc_epel_bi_hv4_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
-        add             w10, w5, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w5, #3
-        mov             x4, x6
-        mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_epel_bi_hv6_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_bi_hv6_8_end_neon
         load_epel_filterh x7, x6
         sub             x1, x1, #4
@@ -3853,26 +3833,6 @@ function hevc_put_hevc_epel_bi_hv6_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
-        add             w10, w5, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w5, #3
-        mov             x4, x6
-        mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_epel_bi_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_bi_hv8_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3894,26 +3854,6 @@ function hevc_put_hevc_epel_bi_hv8_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
-        add             w10, w5, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w5, #3
-        mov             x4, x6
-        mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_epel_bi_hv12_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_bi_hv12_8_end_neon
         load_epel_filterh x7, x6
         sub             x1, x1, #8
@@ -3940,26 +3880,6 @@ function hevc_put_hevc_epel_bi_hv12_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
-        add             w10, w5, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w5, #3
-        mov             x4, x6
-        mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_epel_bi_hv16_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_bi_hv16_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3985,26 +3905,6 @@ function hevc_put_hevc_epel_bi_hv16_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
-        add             w10, w5, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w5, #3
-        mov             x4, x6
-        mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_epel_bi_hv24_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_bi_hv24_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -4034,27 +3934,6 @@ function hevc_put_hevc_epel_bi_hv24_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
-        str             d8, [sp, #-16]!
-        add             w10, w5, #3
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3
-        mov             x2, x3
-        add             w3, w5, #3
-        mov             x4, x6
-        mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h32_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_epel_bi_hv32_8_end_neon
-endfunc
-
 function hevc_put_hevc_epel_bi_hv32_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -4089,6 +3968,150 @@ function hevc_put_hevc_epel_bi_hv32_8_end_neon
         ret
 endfunc
 
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv24_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
+        str             d8, [sp, #-16]!
+        add             w10, w5, #3
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3
+        mov             x2, x3
+        add             w3, w5, #3
+        mov             x4, x6
+        mov             x5, x7
+        bl              X(ff_hevc_put_hevc_epel_h32_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_epel_bi_hv32_8_end_neon
+endfunc
+
 function ff_hevc_put_hevc_epel_bi_hv48_8_neon_i8mm, export=1
         stp             x6, x7, [sp, #-80]!
         stp             x4, x5, [sp, #16]
-- 
2.39.3 (Apple Git-146)
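
Note on the "tmp_array" allocation used by the hv wrappers in this patch: the sequence "add w10, w5, #3; lsl x10, x10, #7; sub sp, sp, x10" reserves (height + 3) rows of 128 bytes each on the stack. The C sketch below restates that arithmetic. It assumes MAX_PB_SIZE is 64 (so one int16_t row is 128 bytes) and that the 3 extra rows are the ones the 4-tap vertical filter needs; the DEMO_* names and both helper functions are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values: MAX_PB_SIZE taken to be 64, and 3 extra rows for the
 * 4-tap epel filter (1 above the block, 2 below). Both are labelled DEMO_
 * because they are restated here, not taken from the patch itself. */
#define DEMO_MAX_PB_SIZE 64
#define DEMO_EPEL_EXTRA   3

/* Bytes per row of the int16_t intermediate buffer: 64 * 2 = 128 = 1 << 7. */
static size_t demo_tmp_row_bytes(void)
{
    return DEMO_MAX_PB_SIZE * sizeof(int16_t);
}

/* Mirrors "add w10, w5, #3; lsl x10, x10, #7": (height + 3) * 128 bytes. */
static size_t demo_tmp_array_bytes(int height)
{
    assert(height > 0);
    return (size_t)(height + DEMO_EPEL_EXTRA) * demo_tmp_row_bytes();
}
```

For a block of height 4 this gives 7 rows, i.e. 896 bytes. Since the size is always a multiple of 128, subtracting it keeps sp 16-byte aligned.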

* [FFmpeg-devel] [PATCH 10/21] aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (8 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 09/21] aarch64: hevc: Reorder epel_hv functions to prepare for templating Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 11/21] aarch64: hevc: Produce epel_uni_hv functions for both " Martin Storsjö
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

Benchmark results on AWS Graviton 3 (checkasm timings, lower is better):
put_hevc_epel_hv4_8_c: 163.7
put_hevc_epel_hv4_8_neon: 52.5
put_hevc_epel_hv4_8_i8mm: 49.5
put_hevc_epel_hv6_8_c: 292.2
put_hevc_epel_hv6_8_neon: 97.7
put_hevc_epel_hv6_8_i8mm: 101.2
put_hevc_epel_hv8_8_c: 471.0
put_hevc_epel_hv8_8_neon: 106.7
put_hevc_epel_hv8_8_i8mm: 102.5
put_hevc_epel_hv12_8_c: 1030.2
put_hevc_epel_hv12_8_neon: 240.5
put_hevc_epel_hv12_8_i8mm: 215.0
put_hevc_epel_hv16_8_c: 1711.5
put_hevc_epel_hv16_8_neon: 340.2
put_hevc_epel_hv16_8_i8mm: 319.2
put_hevc_epel_hv24_8_c: 3670.0
put_hevc_epel_hv24_8_neon: 702.0
put_hevc_epel_hv24_8_i8mm: 666.5
put_hevc_epel_hv32_8_c: 6785.5
put_hevc_epel_hv32_8_neon: 1247.0
put_hevc_epel_hv32_8_i8mm: 1169.0
put_hevc_epel_hv48_8_c: 14689.7
put_hevc_epel_hv48_8_neon: 2665.2
put_hevc_epel_hv48_8_i8mm: 2740.0
put_hevc_epel_hv64_8_c: 25899.2
put_hevc_epel_hv64_8_neon: 4801.2
put_hevc_epel_hv64_8_i8mm: 4487.7
---
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 58 +++++++++++++----------
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  6 +++
 2 files changed, 38 insertions(+), 26 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 2088630da1..024464723b 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2298,10 +2298,8 @@ function hevc_put_hevc_epel_hv24_8_end_neon
 2:      ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
+.macro epel_hv suffix
+function ff_hevc_put_hevc_epel_hv4_8_\suffix, export=1
         add             w10, w3, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2310,13 +2308,13 @@ function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
         add             x0, sp, #32
         sub             x1, x1, x2
         add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h4_8_\suffix)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
         b               hevc_put_hevc_epel_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv6_8_\suffix, export=1
         add             w10, w3, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2325,13 +2323,13 @@ function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
         add             x0, sp, #32
         sub             x1, x1, x2
         add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h6_8_\suffix)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
         b               hevc_put_hevc_epel_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv8_8_\suffix, export=1
         add             w10, w3, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2340,13 +2338,13 @@ function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
         add             x0, sp, #32
         sub             x1, x1, x2
         add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h8_8_\suffix)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
         b               hevc_put_hevc_epel_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv12_8_\suffix, export=1
         add             w10, w3, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2355,13 +2353,13 @@ function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
         add             x0, sp, #32
         sub             x1, x1, x2
         add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h12_8_\suffix)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
         b               hevc_put_hevc_epel_hv12_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv16_8_\suffix, export=1
         add             w10, w3, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2370,13 +2368,13 @@ function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
         add             x0, sp, #32
         sub             x1, x1, x2
         add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h16_8_\suffix)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
         b               hevc_put_hevc_epel_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv24_8_\suffix, export=1
         add             w10, w3, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2385,79 +2383,87 @@ function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
         add             x0, sp, #32
         sub             x1, x1, x2
         add             w3, w3, #3
-        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h24_8_\suffix)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
         b               hevc_put_hevc_epel_hv24_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_hv32_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv32_8_\suffix, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
         stp             x0, x1, [sp, #32]
         str             x30, [sp, #48]
         mov             x6, #16
-        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
         ldp             x0, x1, [sp, #32]
         ldp             x2, x3, [sp, #16]
         ldp             x4, x5, [sp], #48
         add             x0, x0, #32
         add             x1, x1, #16
         mov             x6, #16
-        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv48_8_\suffix, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
         stp             x0, x1, [sp, #32]
         str             x30, [sp, #48]
         mov             x6, #24
-        bl              X(ff_hevc_put_hevc_epel_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_hv24_8_\suffix)
         ldp             x0, x1, [sp, #32]
         ldp             x2, x3, [sp, #16]
         ldp             x4, x5, [sp], #48
         add             x0, x0, #48
         add             x1, x1, #24
         mov             x6, #24
-        bl              X(ff_hevc_put_hevc_epel_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_hv24_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv64_8_\suffix, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
         stp             x0, x1, [sp, #32]
         str             x30, [sp, #48]
         mov             x6, #16
-        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
         ldp             x4, x5, [sp]
         ldp             x2, x3, [sp, #16]
         ldp             x0, x1, [sp, #32]
         add             x0, x0, #32
         add             x1, x1, #16
         mov             x6, #16
-        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
         ldp             x4, x5, [sp]
         ldp             x2, x3, [sp, #16]
         ldp             x0, x1, [sp, #32]
         add             x0, x0, #64
         add             x1, x1, #32
         mov             x6, #16
-        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
         ldp             x0, x1, [sp, #32]
         ldp             x2, x3, [sp, #16]
         ldp             x4, x5, [sp], #48
         add             x0, x0, #96
         add             x1, x1, #48
         mov             x6, #16
-        bl              X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
+.endm
+
+epel_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+epel_hv neon_i8mm
 
 DISABLE_I8MM
 #endif
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index be24737c9c..87e321da71 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -227,6 +227,10 @@ NEON8_FNPROTO(epel_h, (int16_t *dst,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(epel_hv, (int16_t *dst,
+        const uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width), );
+
 NEON8_FNPROTO(epel_h, (int16_t *dst,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -407,6 +411,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel, 0, 1, epel_h,);
         NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h,);
 
+        NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv,);
+
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
-- 
2.39.3 (Apple Git-146)
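
Note on the hv32/hv48/hv64 wrappers in this patch: they do no filtering of their own, they call the 16- or 24-wide function repeatedly on adjacent column strips. For hv32 the second call uses "add x0, x0, #32" (dst moves 16 int16_t elements) and "add x1, x1, #16" (src moves 16 bytes). A rough C sketch of that bookkeeping follows. The demo_* functions are hypothetical, the inner kernel is a plain widening copy rather than the real filter, and the dst row stride of 64 elements is an assumption about MAX_PB_SIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_PB_SIZE 64  /* assumed dst row stride, in int16_t units */

/* Hypothetical stand-in for a 16-wide kernel: it only widens pixels (<< 6)
 * so that the column bookkeeping is easy to check; no filtering is done. */
static void demo_put_w16(int16_t *dst, const uint8_t *src,
                         ptrdiff_t srcstride, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 16; x++)
            dst[x] = (int16_t)(src[x] << 6);
        dst += DEMO_MAX_PB_SIZE;
        src += srcstride;
    }
}

/* 32-wide wrapper built from two 16-wide calls. The second call starts
 * 16 int16_t further into dst (32 bytes) and 16 bytes further into src. */
static void demo_put_w32(int16_t *dst, const uint8_t *src,
                         ptrdiff_t srcstride, int height)
{
    assert(height > 0);
    demo_put_w16(dst,      src,      srcstride, height);
    demo_put_w16(dst + 16, src + 16, srcstride, height);
}
```

The same shape covers hv48 (two 24-wide strips) and hv64 (four 16-wide strips). The cost is that each strip call redoes its own setup, which may be part of why the 48-wide neon and i8mm timings in the commit message are so close.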

* [FFmpeg-devel] [PATCH 11/21] aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (9 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 10/21] aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 12/21] aarch64: hevc: Produce epel_uni_w_hv " Martin Storsjö
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

Benchmark results on AWS Graviton 3 (checkasm timings, lower is better):
put_hevc_epel_uni_hv4_8_c: 163.5
put_hevc_epel_uni_hv4_8_neon: 59.7
put_hevc_epel_uni_hv4_8_i8mm: 57.5
put_hevc_epel_uni_hv6_8_c: 344.7
put_hevc_epel_uni_hv6_8_neon: 105.0
put_hevc_epel_uni_hv6_8_i8mm: 102.7
put_hevc_epel_uni_hv8_8_c: 552.2
put_hevc_epel_uni_hv8_8_neon: 111.2
put_hevc_epel_uni_hv8_8_i8mm: 104.0
put_hevc_epel_uni_hv12_8_c: 1195.0
put_hevc_epel_uni_hv12_8_neon: 248.7
put_hevc_epel_uni_hv12_8_i8mm: 229.5
put_hevc_epel_uni_hv16_8_c: 1910.2
put_hevc_epel_uni_hv16_8_neon: 339.5
put_hevc_epel_uni_hv16_8_i8mm: 323.2
put_hevc_epel_uni_hv24_8_c: 4048.2
put_hevc_epel_uni_hv24_8_neon: 737.7
put_hevc_epel_uni_hv24_8_i8mm: 713.7
put_hevc_epel_uni_hv32_8_c: 6865.7
put_hevc_epel_uni_hv32_8_neon: 1285.0
put_hevc_epel_uni_hv32_8_i8mm: 1206.0
put_hevc_epel_uni_hv48_8_c: 15830.5
put_hevc_epel_uni_hv48_8_neon: 2844.7
put_hevc_epel_uni_hv48_8_i8mm: 2914.0
put_hevc_epel_uni_hv64_8_c: 27912.7
put_hevc_epel_uni_hv64_8_neon: 4970.5
put_hevc_epel_uni_hv64_8_i8mm: 4653.7
---
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 67 +++++++++++------------
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  5 ++
 2 files changed, 38 insertions(+), 34 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 024464723b..876db9d449 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2460,14 +2460,6 @@ endfunc
 
 epel_hv neon
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-epel_hv neon_i8mm
-
-DISABLE_I8MM
-#endif
-
 function hevc_put_hevc_epel_uni_hv4_8_end_neon
         load_epel_filterh x6, x5
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -2596,10 +2588,8 @@ function hevc_put_hevc_epel_uni_hv24_8_end_neon
 2:      ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
+.macro epel_uni_hv suffix
+function ff_hevc_put_hevc_epel_uni_hv4_8_\suffix, export=1
         add             w10, w4, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2611,14 +2601,14 @@ function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
         mov             x2, x3
         add             w3, w4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h4_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv6_8_\suffix, export=1
         add             w10, w4, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2630,14 +2620,14 @@ function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
         mov             x2, x3
         add             w3, w4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h6_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv8_8_\suffix, export=1
         add             w10, w4, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2649,14 +2639,14 @@ function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
         mov             x2, x3
         add             w3, w4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h8_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv12_8_\suffix, export=1
         add             w10, w4, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2668,14 +2658,14 @@ function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
         mov             x2, x3
         add             w3, w4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h12_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_hv12_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv16_8_\suffix, export=1
         add             w10, w4, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2687,14 +2677,14 @@ function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
         mov             x2, x3
         add             w3, w4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h16_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv24_8_\suffix, export=1
         add             w10, w4, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -2706,20 +2696,20 @@ function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
         mov             x2, x3
         add             w3, w4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h24_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_hv24_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv32_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv32_8_\suffix, export=1
         stp             x5, x6, [sp, #-64]!
         stp             x3, x4, [sp, #16]
         stp             x1, x2, [sp, #32]
         stp             x0, x30, [sp, #48]
         mov             x7, #16
-        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
         ldp             x5, x6, [sp]
         ldp             x3, x4, [sp, #16]
         ldp             x1, x2, [sp, #32]
@@ -2727,19 +2717,19 @@ function ff_hevc_put_hevc_epel_uni_hv32_8_neon_i8mm, export=1
         add             x0, x0, #16
         add             x2, x2, #16
         mov             x7, #16
-        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
         ldr             x30, [sp, #56]
         add             sp, sp, #64
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv48_8_\suffix, export=1
         stp             x5, x6, [sp, #-64]!
         stp             x3, x4, [sp, #16]
         stp             x1, x2, [sp, #32]
         stp             x0, x30, [sp, #48]
         mov             x7, #24
-        bl              X(ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_hv24_8_\suffix)
         ldp             x5, x6, [sp]
         ldp             x3, x4, [sp, #16]
         ldp             x1, x2, [sp, #32]
@@ -2747,19 +2737,19 @@ function ff_hevc_put_hevc_epel_uni_hv48_8_neon_i8mm, export=1
         add             x0, x0, #24
         add             x2, x2, #24
         mov             x7, #24
-        bl              X(ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_hv24_8_\suffix)
         ldr             x30, [sp, #56]
         add             sp, sp, #64
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv64_8_\suffix, export=1
         stp             x5, x6, [sp, #-64]!
         stp             x3, x4, [sp, #16]
         stp             x1, x2, [sp, #32]
         stp             x0, x30, [sp, #48]
         mov             x7, #16
-        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
         ldp             x5, x6, [sp]
         ldp             x3, x4, [sp, #16]
         ldp             x1, x2, [sp, #32]
@@ -2767,7 +2757,7 @@ function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
         add             x0, x0, #16
         add             x2, x2, #16
         mov             x7, #16
-        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
         ldp             x5, x6, [sp]
         ldp             x3, x4, [sp, #16]
         ldp             x1, x2, [sp, #32]
@@ -2775,7 +2765,7 @@ function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
         add             x0, x0, #32
         add             x2, x2, #32
         mov             x7, #16
-        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
         ldp             x5, x6, [sp]
         ldp             x3, x4, [sp, #16]
         ldp             x1, x2, [sp, #32]
@@ -2783,12 +2773,21 @@ function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
         add             x0, x0, #48
         add             x2, x2, #48
         mov             x7, #16
-        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
         ldr             x30, [sp, #56]
         add             sp, sp, #64
         ret
 endfunc
+.endm
+
+epel_uni_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+epel_hv neon_i8mm
 
+epel_uni_hv neon_i8mm
 
 function ff_hevc_put_hevc_epel_uni_w_h4_8_neon_i8mm, export=1
         EPEL_UNI_W_H_HEADER
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 87e321da71..447ae80bfb 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -209,6 +209,10 @@ NEON8_FNPROTO(epel_uni_v, (uint8_t *dst,  ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(epel_uni_hv, (uint8_t *dst, ptrdiff_t _dststride,
+        const uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(epel_uni_hv, (uint8_t *dst, ptrdiff_t _dststride,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -412,6 +416,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h,);
 
         NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv,);
+        NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv,);
 
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
-- 
2.39.3 (Apple Git-146)
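
Note on the init-side hunk in this patch: the plain NEON pointer is installed unconditionally, and the existing have_i8mm() branch then overwrites it on CPUs that support I8MM. A minimal C sketch of that "baseline first, override later" pattern follows. The DemoDSP struct, the DEMO_FLAG_* bits and the stub functions are hypothetical stand-ins, not FFmpeg's HEVCDSPContext, NEON8_FNASSIGN or cpu-flag API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU feature bits for this demo only. */
#define DEMO_FLAG_NEON (1 << 0)
#define DEMO_FLAG_I8MM (1 << 1)

enum demo_impl { DEMO_IMPL_C, DEMO_IMPL_NEON, DEMO_IMPL_I8MM };

typedef enum demo_impl (*demo_fn)(void);

/* Hypothetical one-entry dispatch table standing in for the DSP context. */
typedef struct DemoDSP {
    demo_fn put_epel_uni_hv;
} DemoDSP;

static enum demo_impl epel_uni_hv_c(void)    { return DEMO_IMPL_C; }
static enum demo_impl epel_uni_hv_neon(void) { return DEMO_IMPL_NEON; }
static enum demo_impl epel_uni_hv_i8mm(void) { return DEMO_IMPL_I8MM; }

/* Install the most generic version first, then let more specific
 * CPU features overwrite the same slot. */
static void demo_dsp_init(DemoDSP *c, int cpu_flags)
{
    assert(c != NULL);
    c->put_epel_uni_hv = epel_uni_hv_c;
    if (cpu_flags & DEMO_FLAG_NEON) {
        c->put_epel_uni_hv = epel_uni_hv_neon;
        if (cpu_flags & DEMO_FLAG_I8MM)
            c->put_epel_uni_hv = epel_uni_hv_i8mm;
    }
}

/* Demo helper: report which implementation a given CPU would end up with. */
static enum demo_impl demo_selected(int cpu_flags)
{
    DemoDSP c;
    demo_dsp_init(&c, cpu_flags);
    return c.put_epel_uni_hv();
}
```

With only the NEON bit set, demo_selected() reports the NEON version; with both bits set it reports the I8MM one. That matches the intent of the series, where CPUs without I8MM (the cover letter names Apple M1 and Ampere Altra) get the plain NEON path instead of falling back to C.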

* [FFmpeg-devel] [PATCH 12/21] aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (10 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 11/21] aarch64: hevc: Produce epel_uni_hv functions for both " Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 13/21] aarch64: hevc: Produce epel_bi_hv " Martin Storsjö
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

Benchmark results on AWS Graviton 3 (checkasm timings, lower is better):
put_hevc_epel_uni_w_hv4_8_c: 191.2
put_hevc_epel_uni_w_hv4_8_neon: 87.7
put_hevc_epel_uni_w_hv4_8_i8mm: 83.2
put_hevc_epel_uni_w_hv6_8_c: 349.5
put_hevc_epel_uni_w_hv6_8_neon: 153.0
put_hevc_epel_uni_w_hv6_8_i8mm: 148.5
put_hevc_epel_uni_w_hv8_8_c: 581.2
put_hevc_epel_uni_w_hv8_8_neon: 166.7
put_hevc_epel_uni_w_hv8_8_i8mm: 163.5
put_hevc_epel_uni_w_hv12_8_c: 1230.0
put_hevc_epel_uni_w_hv12_8_neon: 387.7
put_hevc_epel_uni_w_hv12_8_i8mm: 370.2
put_hevc_epel_uni_w_hv16_8_c: 2003.2
put_hevc_epel_uni_w_hv16_8_neon: 501.5
put_hevc_epel_uni_w_hv16_8_i8mm: 490.2
put_hevc_epel_uni_w_hv24_8_c: 4448.7
put_hevc_epel_uni_w_hv24_8_neon: 1092.2
put_hevc_epel_uni_w_hv24_8_i8mm: 1069.7
put_hevc_epel_uni_w_hv32_8_c: 7817.2
put_hevc_epel_uni_w_hv32_8_neon: 1916.2
put_hevc_epel_uni_w_hv32_8_i8mm: 1829.5
put_hevc_epel_uni_w_hv48_8_c: 16728.2
put_hevc_epel_uni_w_hv48_8_neon: 4263.7
put_hevc_epel_uni_w_hv48_8_i8mm: 4342.7
put_hevc_epel_uni_w_hv64_8_c: 29563.2
put_hevc_epel_uni_w_hv64_8_neon: 7474.2
put_hevc_epel_uni_w_hv64_8_i8mm: 7128.5
---
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 55 ++++++++++++-----------
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  6 +++
 2 files changed, 36 insertions(+), 25 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 876db9d449..d0c6205e1c 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -3573,10 +3573,8 @@ function hevc_put_hevc_epel_uni_w_hv24_8_end_neon
         ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
+.macro epel_uni_w_hv suffix
+function ff_hevc_put_hevc_epel_uni_w_hv4_8_\suffix, export=1
         epel_uni_w_hv_start
         sxtw            x4, w4
 
@@ -3591,14 +3589,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
         mov             x2, x3
         add             x3, x4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h4_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_w_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_\suffix, export=1
         epel_uni_w_hv_start
         sxtw            x4, w4
 
@@ -3613,14 +3611,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
         mov             x2, x3
         add             x3, x4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h6_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_w_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_\suffix, export=1
         epel_uni_w_hv_start
         sxtw            x4, w4
 
@@ -3635,14 +3633,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
         mov             x2, x3
         add             x3, x4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h8_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_w_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_\suffix, export=1
         epel_uni_w_hv_start
         sxtw            x4, w4
 
@@ -3657,14 +3655,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
         mov             x2, x3
         add             x3, x4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h12_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_w_hv12_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_\suffix, export=1
         epel_uni_w_hv_start
         sxtw            x4, w4
 
@@ -3679,14 +3677,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
         mov             x2, x3
         add             x3, x4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h16_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_w_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_\suffix, export=1
         epel_uni_w_hv_start
         sxtw            x4, w4
 
@@ -3701,14 +3699,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
         mov             x2, x3
         add             x3, x4, #3
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h24_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
         b               hevc_put_hevc_epel_uni_w_hv24_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv32_8_\suffix, export=1
         ldp             x15, x16, [sp]
         mov             x17, #16
         stp             x15, x16, [sp, #-96]!
@@ -3718,7 +3716,7 @@ function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
         stp             x5, x6, [sp, #64]
         stp             x17, x7, [sp, #80]
 
-        bl              X(ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_w_hv16_8_\suffix)
         ldp             x0, x30, [sp, #16]
         ldp             x1, x2, [sp, #32]
         ldp             x3, x4, [sp, #48]
@@ -3730,13 +3728,13 @@ function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
         mov             x17, #16
         stp             x15, x16, [sp, #-32]!
         stp             x17, x30, [sp, #16]
-        bl              X(ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_w_hv16_8_\suffix)
         ldp             x17, x30, [sp, #16]
         ldp             x15, x16, [sp], #32
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv48_8_\suffix, export=1
         ldp             x15, x16, [sp]
         mov             x17, #24
         stp             x15, x16, [sp, #-96]!
@@ -3745,7 +3743,7 @@ function ff_hevc_put_hevc_epel_uni_w_hv48_8_neon_i8mm, export=1
         stp             x3, x4, [sp, #48]
         stp             x5, x6, [sp, #64]
         stp             x17, x7, [sp, #80]
-        bl              X(ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_w_hv24_8_\suffix)
         ldp             x0, x30, [sp, #16]
         ldp             x1, x2, [sp, #32]
         ldp             x3, x4, [sp, #48]
@@ -3757,13 +3755,13 @@ function ff_hevc_put_hevc_epel_uni_w_hv48_8_neon_i8mm, export=1
         mov             x17, #24
         stp             x15, x16, [sp, #-32]!
         stp             x17, x30, [sp, #16]
-        bl              X(ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_w_hv24_8_\suffix)
         ldp             x17, x30, [sp, #16]
         ldp             x15, x16, [sp], #32
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_uni_w_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv64_8_\suffix, export=1
         ldp             x15, x16, [sp]
         mov             x17, #32
         stp             x15, x16, [sp, #-96]!
@@ -3773,7 +3771,7 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_neon_i8mm, export=1
         stp             x5, x6, [sp, #64]
         stp             x17, x7, [sp, #80]
 
-        bl              X(ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_w_hv32_8_\suffix)
         ldp             x0, x30, [sp, #16]
         ldp             x1, x2, [sp, #32]
         ldp             x3, x4, [sp, #48]
@@ -3785,16 +3783,23 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_neon_i8mm, export=1
         mov             x17, #32
         stp             x15, x16, [sp, #-32]!
         stp             x17, x30, [sp, #16]
-        bl              X(ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_uni_w_hv32_8_\suffix)
         ldp             x17, x30, [sp, #16]
         ldp             x15, x16, [sp], #32
         ret
 endfunc
+.endm
+
+epel_uni_w_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+epel_uni_w_hv neon_i8mm
 
 DISABLE_I8MM
 #endif
 
-
 function hevc_put_hevc_epel_bi_hv4_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 447ae80bfb..948103aa09 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -278,6 +278,11 @@ NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst,  ptrdiff_t _dststride,
         int height, int denom, int wx, int ox,
         intptr_t mx, intptr_t my, int width), _i8mm);
 
+NEON8_FNPROTO(epel_uni_w_hv, (uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(epel_uni_w_hv, (uint8_t *_dst,  ptrdiff_t _dststride,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, int denom, int wx, int ox,
@@ -417,6 +422,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 
         NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv,);
         NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv,);
+        NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv,);
 
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
-- 
2.39.3 (Apple Git-146)

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".


* [FFmpeg-devel] [PATCH 13/21] aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (11 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 12/21] aarch64: hevc: Produce epel_uni_w_hv " Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 14/21] aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8 Martin Storsjö
                   ` (8 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

Besides templating the functions, this makes one functional change:
ff_hevc_put_hevc_epel_bi_hv32_8 now sets the w6 register to 32 before
the call, since ff_hevc_put_hevc_epel_h32_8_neon requires it.

AWS Graviton 3:
put_hevc_epel_bi_hv4_8_c: 176.5
put_hevc_epel_bi_hv4_8_neon: 62.0
put_hevc_epel_bi_hv4_8_i8mm: 58.0
put_hevc_epel_bi_hv6_8_c: 343.7
put_hevc_epel_bi_hv6_8_neon: 109.7
put_hevc_epel_bi_hv6_8_i8mm: 105.7
put_hevc_epel_bi_hv8_8_c: 536.0
put_hevc_epel_bi_hv8_8_neon: 112.7
put_hevc_epel_bi_hv8_8_i8mm: 111.7
put_hevc_epel_bi_hv12_8_c: 1107.7
put_hevc_epel_bi_hv12_8_neon: 254.7
put_hevc_epel_bi_hv12_8_i8mm: 239.0
put_hevc_epel_bi_hv16_8_c: 1927.7
put_hevc_epel_bi_hv16_8_neon: 356.2
put_hevc_epel_bi_hv16_8_i8mm: 334.2
put_hevc_epel_bi_hv24_8_c: 4195.2
put_hevc_epel_bi_hv24_8_neon: 736.7
put_hevc_epel_bi_hv24_8_i8mm: 715.5
put_hevc_epel_bi_hv32_8_c: 7280.5
put_hevc_epel_bi_hv32_8_neon: 1287.7
put_hevc_epel_bi_hv32_8_i8mm: 1162.2
put_hevc_epel_bi_hv48_8_c: 16857.7
put_hevc_epel_bi_hv48_8_neon: 2836.2
put_hevc_epel_bi_hv48_8_i8mm: 2908.5
put_hevc_epel_bi_hv64_8_c: 29248.2
put_hevc_epel_bi_hv64_8_neon: 5051.7
put_hevc_epel_bi_hv64_8_i8mm: 4491.5
---
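For readers less familiar with the .macro templating used in this patch,
here is a rough C analogue, offered only as a sketch. All names in it are
hypothetical stand-ins: the helpers return arbitrary tag values so the two
instantiations can be told apart, whereas the real NEON and I8MM functions
are expected to produce the same output.

```c
/* Hypothetical stand-ins for the two horizontal helpers; the real ones are
 * the ff_hevc_put_hevc_epel_h*_8_neon and ..._neon_i8mm assembly functions. */
static int epel_h_neon(int x)      { return x * 10 + 1; }
static int epel_h_neon_i8mm(int x) { return x * 10 + 2; }

/* One body, instantiated once per suffix, like ".macro epel_bi_hv suffix". */
#define EPEL_BI_HV(suffix)                                  \
    static int epel_bi_hv_##suffix(int x)                   \
    {                                                       \
        /* calls the helper with the matching suffix */     \
        return epel_h_##suffix(x);                          \
    }

EPEL_BI_HV(neon)
EPEL_BI_HV(neon_i8mm)

typedef int (*epel_fn)(int);

/* Roughly mirrors the init logic: plain NEON by default, I8MM if available. */
static epel_fn select_epel_bi_hv(int have_i8mm)
{
    return have_i8mm ? epel_bi_hv_neon_i8mm : epel_bi_hv_neon;
}
```

Each instantiation calls the helper with the same suffix, which is what
replacing the hardcoded _neon_i8mm names with \suffix achieves in the
assembly.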
 libavcodec/aarch64/hevcdsp_epel_neon.S    | 62 +++++++++++------------
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  5 ++
 2 files changed, 36 insertions(+), 31 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index d0c6205e1c..cb17758a72 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -3792,14 +3792,6 @@ endfunc
 
 epel_uni_w_hv neon
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-epel_uni_w_hv neon_i8mm
-
-DISABLE_I8MM
-#endif
-
 function hevc_put_hevc_epel_bi_hv4_8_end_neon
         load_epel_filterh x7, x6
         mov             x10, #(MAX_PB_SIZE * 2)
@@ -3978,10 +3970,8 @@ function hevc_put_hevc_epel_bi_hv32_8_end_neon
         ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
+.macro epel_bi_hv suffix
+function ff_hevc_put_hevc_epel_bi_hv4_8_\suffix, export=1
         add             w10, w5, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -3994,14 +3984,14 @@ function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
         add             w3, w5, #3
         mov             x4, x6
         mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h4_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         b               hevc_put_hevc_epel_bi_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv6_8_\suffix, export=1
         add             w10, w5, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -4014,14 +4004,14 @@ function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
         add             w3, w5, #3
         mov             x4, x6
         mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h6_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         b               hevc_put_hevc_epel_bi_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv8_8_\suffix, export=1
         add             w10, w5, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -4034,14 +4024,14 @@ function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
         add             w3, w5, #3
         mov             x4, x6
         mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h8_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         b               hevc_put_hevc_epel_bi_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv12_8_\suffix, export=1
         add             w10, w5, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -4054,14 +4044,14 @@ function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
         add             w3, w5, #3
         mov             x4, x6
         mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h12_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         b               hevc_put_hevc_epel_bi_hv12_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv16_8_\suffix, export=1
         add             w10, w5, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -4074,14 +4064,14 @@ function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
         add             w3, w5, #3
         mov             x4, x6
         mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h16_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         b               hevc_put_hevc_epel_bi_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv24_8_\suffix, export=1
         add             w10, w5, #3
         lsl             x10, x10, #7
         sub             sp, sp, x10 // tmp_array
@@ -4094,14 +4084,14 @@ function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
         add             w3, w5, #3
         mov             x4, x6
         mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_h24_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         b               hevc_put_hevc_epel_bi_hv24_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv32_8_\suffix, export=1
         str             d8, [sp, #-16]!
         add             w10, w5, #3
         lsl             x10, x10, #7
@@ -4115,20 +4105,21 @@ function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
         add             w3, w5, #3
         mov             x4, x6
         mov             x5, x7
-        bl              X(ff_hevc_put_hevc_epel_h32_8_neon_i8mm)
+        mov             w6, #32
+        bl              X(ff_hevc_put_hevc_epel_h32_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         b               hevc_put_hevc_epel_bi_hv32_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv48_8_\suffix, export=1
         stp             x6, x7, [sp, #-80]!
         stp             x4, x5, [sp, #16]
         stp             x2, x3, [sp, #32]
         stp             x0, x1, [sp, #48]
         str             x30, [sp, #64]
-        bl              X(ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_bi_hv24_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x2, x3, [sp, #32]
         ldp             x0, x1, [sp, #48]
@@ -4136,18 +4127,18 @@ function ff_hevc_put_hevc_epel_bi_hv48_8_neon_i8mm, export=1
         add             x0, x0, #24
         add             x2, x2, #24
         add             x4, x4, #48
-        bl              X(ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_bi_hv24_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
 
-function ff_hevc_put_hevc_epel_bi_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv64_8_\suffix, export=1
         stp             x6, x7, [sp, #-80]!
         stp             x4, x5, [sp, #16]
         stp             x2, x3, [sp, #32]
         stp             x0, x1, [sp, #48]
         str             x30, [sp, #64]
-        bl              X(ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_bi_hv32_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x2, x3, [sp, #32]
         ldp             x0, x1, [sp, #48]
@@ -4155,10 +4146,19 @@ function ff_hevc_put_hevc_epel_bi_hv64_8_neon_i8mm, export=1
         add             x0, x0, #32
         add             x2, x2, #32
         add             x4, x4, #64
-        bl              X(ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_epel_bi_hv32_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
+.endm
+
+epel_bi_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+epel_uni_w_hv neon_i8mm
+epel_bi_hv neon_i8mm
 
 DISABLE_I8MM
 #endif
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 948103aa09..6110a360d8 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -188,6 +188,10 @@ NEON8_FNPROTO(epel_bi_v, (uint8_t *dst, ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
         int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(epel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
+        const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
+        int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(epel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -423,6 +427,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv,);
         NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv,);
         NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv,);
+        NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv,);
 
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
-- 
2.39.3 (Apple Git-146)



* [FFmpeg-devel] [PATCH 14/21] aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (12 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 13/21] aarch64: hevc: Produce epel_bi_hv " Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 15/21] aarch64: hevc: Split the qpel_*_hv functions into two parts Martin Storsjö
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

AWS Graviton 3:
put_hevc_qpel_uni_w_h4_8_c: 159.0
put_hevc_qpel_uni_w_h4_8_neon: 64.2
put_hevc_qpel_uni_w_h4_8_i8mm: 40.0
put_hevc_qpel_uni_w_h6_8_c: 344.7
put_hevc_qpel_uni_w_h6_8_neon: 114.5
put_hevc_qpel_uni_w_h6_8_i8mm: 82.0
put_hevc_qpel_uni_w_h8_8_c: 596.2
put_hevc_qpel_uni_w_h8_8_neon: 132.2
put_hevc_qpel_uni_w_h8_8_i8mm: 106.0
put_hevc_qpel_uni_w_h12_8_c: 1325.0
put_hevc_qpel_uni_w_h12_8_neon: 299.0
put_hevc_qpel_uni_w_h12_8_i8mm: 211.5
put_hevc_qpel_uni_w_h16_8_c: 2300.0
put_hevc_qpel_uni_w_h16_8_neon: 422.0
put_hevc_qpel_uni_w_h16_8_i8mm: 286.2
put_hevc_qpel_uni_w_h24_8_c: 5059.0
put_hevc_qpel_uni_w_h24_8_neon: 912.2
put_hevc_qpel_uni_w_h24_8_i8mm: 664.2
put_hevc_qpel_uni_w_h32_8_c: 9198.2
put_hevc_qpel_uni_w_h32_8_neon: 1638.2
put_hevc_qpel_uni_w_h32_8_i8mm: 1033.7
put_hevc_qpel_uni_w_h48_8_c: 20754.7
put_hevc_qpel_uni_w_h48_8_neon: 3633.7
put_hevc_qpel_uni_w_h48_8_i8mm: 2300.7
put_hevc_qpel_uni_w_h64_8_c: 36854.7
put_hevc_qpel_uni_w_h64_8_neon: 6435.7
put_hevc_qpel_uni_w_h64_8_i8mm: 4039.2
---
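As a reading aid for the assembly below, here is a scalar sketch of what the
8-bit qpel_uni_w_h functions compute per pixel: an 8-tap horizontal filter,
a multiply by wx, a rounding right shift by denom + 6, an add of ox, and a
clip to 8 bits. The function name, the clip helper and the taps parameter
are assumptions made for illustration; this is not the C reference from
FFmpeg, and it takes the filter taps directly rather than an mx index.

```c
#include <stddef.h>
#include <stdint.h>

/* HEVC luma half-sample interpolation taps; they sum to 64. */
static const int8_t halfpel_taps[8] = { -1, 4, -11, 40, 40, -11, 4, -1 };

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar model, 8-bit only. src points at the first output
 * pixel; the filter reads from src[x - 3] to src[x + 4]. */
static void qpel_uni_w_h_ref(uint8_t *dst, ptrdiff_t dststride,
                             const uint8_t *src, ptrdiff_t srcstride,
                             int height, int denom, int wx, int ox,
                             const int8_t taps[8], int width)
{
    int shift = denom + 6;          /* the assembly uses -(6 + denom) with sqrshl */
    int round = 1 << (shift - 1);   /* rounding term for the right shift */

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 8; k++)
                sum += taps[k] * src[x + k - 3];
            /* Assumes arithmetic right shift for negative values. */
            dst[x] = clip_u8(((sum * wx + round) >> shift) + ox);
        }
        src += srcstride;
        dst += dststride;
    }
}
```

With a flat source, wx = 1 and denom = 0, the taps sum to 64 and the shift
by 6 undoes that gain, so the output equals the input plus ox (before
clipping).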
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   7 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 405 +++++++++++++++++++++-
 2 files changed, 410 insertions(+), 2 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 6110a360d8..ea0d26c019 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -277,6 +277,11 @@ NEON8_FNPROTO(qpel_uni_hv, (uint8_t *dst,  ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
 
+NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst,  ptrdiff_t _dststride,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, int denom, int wx, int ox,
@@ -429,6 +434,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv,);
         NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv,);
 
+        NEON8_FNASSIGN_SHARED_32(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h,);
+
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 062b7d4d0f..fba063186c 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2456,8 +2456,10 @@ function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, export=1
         ldp             x7, x30, [sp], #48
         b               .Lqpel_uni_hv16_loop
 endfunc
+DISABLE_I8MM
+#endif
 
-.macro QPEL_UNI_W_H_HEADER
+.macro QPEL_UNI_W_H_HEADER elems=4s
         ldr             x12, [sp]
         sub             x2, x2, #3
         movrel          x9, qpel_filters
@@ -2465,11 +2467,410 @@ endfunc
         ld1r            {v28.2d}, [x9]
         mov             w10, #-6
         sub             w10, w10, w5
-        dup             v30.4s, w6              // wx
+        dup             v30.\elems, w6          // wx
         dup             v31.4s, w10             // shift
         dup             v29.4s, w7              // ox
 .endm
 
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_neon, export=1
+        QPEL_UNI_W_H_HEADER 4h
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v1.8b, v2.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        ext             v3.16b,  v1.16b,  v2.16b,  #2
+        ext             v4.16b,  v1.16b,  v2.16b,  #4
+        ext             v5.16b,  v1.16b,  v2.16b,  #6
+        ext             v6.16b,  v1.16b,  v2.16b,  #8
+        ext             v7.16b,  v1.16b,  v2.16b,  #10
+        ext             v16.16b, v1.16b,  v2.16b,  #12
+        ext             v17.16b, v1.16b,  v2.16b,  #14
+        mul             v18.4h,  v1.4h,   v0.h[0]
+        mla             v18.4h,  v3.4h,   v0.h[1]
+        mla             v18.4h,  v4.4h,   v0.h[2]
+        mla             v18.4h,  v5.4h,   v0.h[3]
+        mla             v18.4h,  v6.4h,   v0.h[4]
+        mla             v18.4h,  v7.4h,   v0.h[5]
+        mla             v18.4h,  v16.4h,  v0.h[6]
+        mla             v18.4h,  v17.4h,  v0.h[7]
+        smull           v16.4s,  v18.4h,  v30.4h
+        sqrshl          v16.4s,  v16.4s,  v31.4s
+        sqadd           v16.4s,  v16.4s,  v29.4s
+        sqxtn           v16.4h,  v16.4s
+        sqxtun          v16.8b,  v16.8h
+        str             s16, [x0]
+        add             x0,  x0,  x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_neon, export=1
+        QPEL_UNI_W_H_HEADER 8h
+        sub             x1,  x1,  #4
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v1.8b, v2.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        ext             v3.16b,  v1.16b,  v2.16b,  #2
+        ext             v4.16b,  v1.16b,  v2.16b,  #4
+        ext             v5.16b,  v1.16b,  v2.16b,  #6
+        ext             v6.16b,  v1.16b,  v2.16b,  #8
+        ext             v7.16b,  v1.16b,  v2.16b,  #10
+        ext             v16.16b, v1.16b,  v2.16b,  #12
+        ext             v17.16b, v1.16b,  v2.16b,  #14
+        mul             v18.8h,  v1.8h,   v0.h[0]
+        mla             v18.8h,  v3.8h,   v0.h[1]
+        mla             v18.8h,  v4.8h,   v0.h[2]
+        mla             v18.8h,  v5.8h,   v0.h[3]
+        mla             v18.8h,  v6.8h,   v0.h[4]
+        mla             v18.8h,  v7.8h,   v0.h[5]
+        mla             v18.8h,  v16.8h,  v0.h[6]
+        mla             v18.8h,  v17.8h,  v0.h[7]
+        smull           v16.4s,  v18.4h,  v30.4h
+        smull2          v17.4s,  v18.8h,  v30.8h
+        sqrshl          v16.4s,  v16.4s,  v31.4s
+        sqrshl          v17.4s,  v17.4s,  v31.4s
+        sqadd           v16.4s,  v16.4s,  v29.4s
+        sqadd           v17.4s,  v17.4s,  v29.4s
+        sqxtn           v16.4h,  v16.4s
+        sqxtn2          v16.8h,  v17.4s
+        sqxtun          v16.8b,  v16.8h
+        str             s16, [x0], #4
+        st1             {v16.h}[2], [x0], x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_neon, export=1
+        QPEL_UNI_W_H_HEADER 8h
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v1.8b, v2.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        ext             v3.16b,  v1.16b,  v2.16b,  #2
+        ext             v4.16b,  v1.16b,  v2.16b,  #4
+        ext             v5.16b,  v1.16b,  v2.16b,  #6
+        ext             v6.16b,  v1.16b,  v2.16b,  #8
+        ext             v7.16b,  v1.16b,  v2.16b,  #10
+        ext             v16.16b, v1.16b,  v2.16b,  #12
+        ext             v17.16b, v1.16b,  v2.16b,  #14
+        mul             v18.8h,  v1.8h,   v0.h[0]
+        mla             v18.8h,  v3.8h,   v0.h[1]
+        mla             v18.8h,  v4.8h,   v0.h[2]
+        mla             v18.8h,  v5.8h,   v0.h[3]
+        mla             v18.8h,  v6.8h,   v0.h[4]
+        mla             v18.8h,  v7.8h,   v0.h[5]
+        mla             v18.8h,  v16.8h,  v0.h[6]
+        mla             v18.8h,  v17.8h,  v0.h[7]
+        smull           v16.4s,  v18.4h,  v30.4h
+        smull2          v17.4s,  v18.8h,  v30.8h
+        sqrshl          v16.4s,  v16.4s,  v31.4s
+        sqrshl          v17.4s,  v17.4s,  v31.4s
+        sqadd           v16.4s,  v16.4s,  v29.4s
+        sqadd           v17.4s,  v17.4s,  v29.4s
+        sqxtn           v16.4h,  v16.4s
+        sqxtn2          v16.8h,  v17.4s
+        sqxtun          v16.8b,  v16.8h
+        st1             {v16.8b}, [x0], x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_neon, export=1
+        QPEL_UNI_W_H_HEADER 8h
+        add             x13, x0,  #8
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v1.8b, v2.8b, v3.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        ext             v4.16b,  v1.16b,  v2.16b,  #2
+        ext             v5.16b,  v1.16b,  v2.16b,  #4
+        ext             v6.16b,  v1.16b,  v2.16b,  #6
+        ext             v7.16b,  v1.16b,  v2.16b,  #8
+        ext             v16.16b, v1.16b,  v2.16b,  #10
+        ext             v17.16b, v1.16b,  v2.16b,  #12
+        ext             v18.16b, v1.16b,  v2.16b,  #14
+        mul             v19.8h,  v1.8h,   v0.h[0]
+        mla             v19.8h,  v4.8h,   v0.h[1]
+        mla             v19.8h,  v5.8h,   v0.h[2]
+        mla             v19.8h,  v6.8h,   v0.h[3]
+        mla             v19.8h,  v7.8h,   v0.h[4]
+        mla             v19.8h,  v16.8h,  v0.h[5]
+        mla             v19.8h,  v17.8h,  v0.h[6]
+        mla             v19.8h,  v18.8h,  v0.h[7]
+        ext             v4.16b,  v2.16b,  v3.16b,  #2
+        ext             v5.16b,  v2.16b,  v3.16b,  #4
+        ext             v6.16b,  v2.16b,  v3.16b,  #6
+        ext             v7.16b,  v2.16b,  v3.16b,  #8
+        ext             v16.16b, v2.16b,  v3.16b,  #10
+        ext             v17.16b, v2.16b,  v3.16b,  #12
+        ext             v18.16b, v2.16b,  v3.16b,  #14
+        mul             v20.4h,  v2.4h,   v0.h[0]
+        mla             v20.4h,  v4.4h,   v0.h[1]
+        mla             v20.4h,  v5.4h,   v0.h[2]
+        mla             v20.4h,  v6.4h,   v0.h[3]
+        mla             v20.4h,  v7.4h,   v0.h[4]
+        mla             v20.4h,  v16.4h,  v0.h[5]
+        mla             v20.4h,  v17.4h,  v0.h[6]
+        mla             v20.4h,  v18.4h,  v0.h[7]
+        smull           v16.4s,  v19.4h,  v30.4h
+        smull2          v17.4s,  v19.8h,  v30.8h
+        smull           v18.4s,  v20.4h,  v30.4h
+        sqrshl          v16.4s,  v16.4s,  v31.4s
+        sqrshl          v17.4s,  v17.4s,  v31.4s
+        sqrshl          v18.4s,  v18.4s,  v31.4s
+        sqadd           v16.4s,  v16.4s,  v29.4s
+        sqadd           v17.4s,  v17.4s,  v29.4s
+        sqadd           v18.4s,  v18.4s,  v29.4s
+        sqxtn           v16.4h,  v16.4s
+        sqxtn2          v16.8h,  v17.4s
+        sqxtn           v17.4h,  v18.4s
+        sqxtun          v16.8b,  v16.8h
+        sqxtun          v17.8b,  v17.8h
+        st1             {v16.8b},   [x0],  x1
+        st1             {v17.s}[0], [x13], x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_neon, export=1
+        QPEL_UNI_W_H_HEADER 8h
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v1.8b, v2.8b, v3.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        ext             v4.16b,  v1.16b,  v2.16b,  #2
+        ext             v5.16b,  v1.16b,  v2.16b,  #4
+        ext             v6.16b,  v1.16b,  v2.16b,  #6
+        ext             v7.16b,  v1.16b,  v2.16b,  #8
+        ext             v16.16b, v1.16b,  v2.16b,  #10
+        ext             v17.16b, v1.16b,  v2.16b,  #12
+        ext             v18.16b, v1.16b,  v2.16b,  #14
+        mul             v19.8h,  v1.8h,   v0.h[0]
+        mla             v19.8h,  v4.8h,   v0.h[1]
+        mla             v19.8h,  v5.8h,   v0.h[2]
+        mla             v19.8h,  v6.8h,   v0.h[3]
+        mla             v19.8h,  v7.8h,   v0.h[4]
+        mla             v19.8h,  v16.8h,  v0.h[5]
+        mla             v19.8h,  v17.8h,  v0.h[6]
+        mla             v19.8h,  v18.8h,  v0.h[7]
+        ext             v4.16b,  v2.16b,  v3.16b,  #2
+        ext             v5.16b,  v2.16b,  v3.16b,  #4
+        ext             v6.16b,  v2.16b,  v3.16b,  #6
+        ext             v7.16b,  v2.16b,  v3.16b,  #8
+        ext             v16.16b, v2.16b,  v3.16b,  #10
+        ext             v17.16b, v2.16b,  v3.16b,  #12
+        ext             v18.16b, v2.16b,  v3.16b,  #14
+        mul             v20.8h,  v2.8h,   v0.h[0]
+        mla             v20.8h,  v4.8h,   v0.h[1]
+        mla             v20.8h,  v5.8h,   v0.h[2]
+        mla             v20.8h,  v6.8h,   v0.h[3]
+        mla             v20.8h,  v7.8h,   v0.h[4]
+        mla             v20.8h,  v16.8h,  v0.h[5]
+        mla             v20.8h,  v17.8h,  v0.h[6]
+        mla             v20.8h,  v18.8h,  v0.h[7]
+        smull           v16.4s,  v19.4h,  v30.4h
+        smull2          v17.4s,  v19.8h,  v30.8h
+        smull           v18.4s,  v20.4h,  v30.4h
+        smull2          v19.4s,  v20.8h,  v30.8h
+        sqrshl          v16.4s,  v16.4s,  v31.4s
+        sqrshl          v17.4s,  v17.4s,  v31.4s
+        sqrshl          v18.4s,  v18.4s,  v31.4s
+        sqrshl          v19.4s,  v19.4s,  v31.4s
+        sqadd           v16.4s,  v16.4s,  v29.4s
+        sqadd           v17.4s,  v17.4s,  v29.4s
+        sqadd           v18.4s,  v18.4s,  v29.4s
+        sqadd           v19.4s,  v19.4s,  v29.4s
+        sqxtn           v16.4h,  v16.4s
+        sqxtn2          v16.8h,  v17.4s
+        sqxtn           v17.4h,  v18.4s
+        sqxtn2          v17.8h,  v19.4s
+        sqxtun          v16.8b,  v16.8h
+        sqxtun          v17.8b,  v17.8h
+        st1             {v16.8b, v17.8b}, [x0], x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_neon, export=1
+        QPEL_UNI_W_H_HEADER 8h
+        sxtl            v0.8h,   v28.8b
+1:
+        ld1             {v1.8b, v2.8b, v3.8b, v4.8b}, [x2], x3
+        subs            w4,  w4,  #1
+        uxtl            v1.8h,   v1.8b
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        uxtl            v4.8h,   v4.8b
+        ext             v5.16b,  v1.16b,  v2.16b,  #2
+        ext             v6.16b,  v1.16b,  v2.16b,  #4
+        ext             v7.16b,  v1.16b,  v2.16b,  #6
+        ext             v16.16b, v1.16b,  v2.16b,  #8
+        ext             v17.16b, v1.16b,  v2.16b,  #10
+        ext             v18.16b, v1.16b,  v2.16b,  #12
+        ext             v19.16b, v1.16b,  v2.16b,  #14
+        mul             v20.8h,  v1.8h,   v0.h[0]
+        mla             v20.8h,  v5.8h,   v0.h[1]
+        mla             v20.8h,  v6.8h,   v0.h[2]
+        mla             v20.8h,  v7.8h,   v0.h[3]
+        mla             v20.8h,  v16.8h,  v0.h[4]
+        mla             v20.8h,  v17.8h,  v0.h[5]
+        mla             v20.8h,  v18.8h,  v0.h[6]
+        mla             v20.8h,  v19.8h,  v0.h[7]
+        ext             v5.16b,  v2.16b,  v3.16b,  #2
+        ext             v6.16b,  v2.16b,  v3.16b,  #4
+        ext             v7.16b,  v2.16b,  v3.16b,  #6
+        ext             v16.16b, v2.16b,  v3.16b,  #8
+        ext             v17.16b, v2.16b,  v3.16b,  #10
+        ext             v18.16b, v2.16b,  v3.16b,  #12
+        ext             v19.16b, v2.16b,  v3.16b,  #14
+        mul             v21.8h,  v2.8h,   v0.h[0]
+        mla             v21.8h,  v5.8h,   v0.h[1]
+        mla             v21.8h,  v6.8h,   v0.h[2]
+        mla             v21.8h,  v7.8h,   v0.h[3]
+        mla             v21.8h,  v16.8h,  v0.h[4]
+        mla             v21.8h,  v17.8h,  v0.h[5]
+        mla             v21.8h,  v18.8h,  v0.h[6]
+        mla             v21.8h,  v19.8h,  v0.h[7]
+        ext             v5.16b,  v3.16b,  v4.16b,  #2
+        ext             v6.16b,  v3.16b,  v4.16b,  #4
+        ext             v7.16b,  v3.16b,  v4.16b,  #6
+        ext             v16.16b, v3.16b,  v4.16b,  #8
+        ext             v17.16b, v3.16b,  v4.16b,  #10
+        ext             v18.16b, v3.16b,  v4.16b,  #12
+        ext             v19.16b, v3.16b,  v4.16b,  #14
+        mul             v22.8h,  v3.8h,   v0.h[0]
+        mla             v22.8h,  v5.8h,   v0.h[1]
+        mla             v22.8h,  v6.8h,   v0.h[2]
+        mla             v22.8h,  v7.8h,   v0.h[3]
+        mla             v22.8h,  v16.8h,  v0.h[4]
+        mla             v22.8h,  v17.8h,  v0.h[5]
+        mla             v22.8h,  v18.8h,  v0.h[6]
+        mla             v22.8h,  v19.8h,  v0.h[7]
+        smull           v16.4s,  v20.4h,  v30.4h
+        smull2          v17.4s,  v20.8h,  v30.8h
+        smull           v18.4s,  v21.4h,  v30.4h
+        smull2          v19.4s,  v21.8h,  v30.8h
+        smull           v20.4s,  v22.4h,  v30.4h
+        smull2          v21.4s,  v22.8h,  v30.8h
+        sqrshl          v16.4s,  v16.4s,  v31.4s
+        sqrshl          v17.4s,  v17.4s,  v31.4s
+        sqrshl          v18.4s,  v18.4s,  v31.4s
+        sqrshl          v19.4s,  v19.4s,  v31.4s
+        sqrshl          v20.4s,  v20.4s,  v31.4s
+        sqrshl          v21.4s,  v21.4s,  v31.4s
+        sqadd           v16.4s,  v16.4s,  v29.4s
+        sqadd           v17.4s,  v17.4s,  v29.4s
+        sqadd           v18.4s,  v18.4s,  v29.4s
+        sqadd           v19.4s,  v19.4s,  v29.4s
+        sqadd           v20.4s,  v20.4s,  v29.4s
+        sqadd           v21.4s,  v21.4s,  v29.4s
+        sqxtn           v16.4h,  v16.4s
+        sqxtn2          v16.8h,  v17.4s
+        sqxtn           v17.4h,  v18.4s
+        sqxtn2          v17.8h,  v19.4s
+        sqxtn           v18.4h,  v20.4s
+        sqxtn2          v18.8h,  v21.4s
+        sqxtun          v16.8b,  v16.8h
+        sqxtun          v17.8b,  v17.8h
+        sqxtun          v18.8b,  v18.8h
+        st1             {v16.8b, v17.8b, v18.8b}, [x0], x1
+        b.hi            1b
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_neon, export=1
+        QPEL_UNI_W_H_HEADER 8h
+        ldr             w10, [sp, #16]        // width
+        ld1             {v1.8b}, [x2], #8
+        sub             x3,  x3,  w10, uxtw   // decrement src stride
+        mov             w11, w10              // original width
+        sub             x3,  x3,  #8          // decrement src stride
+        sub             x1,  x1,  w10, uxtw   // decrement dst stride
+        sxtl            v0.8h,   v28.8b
+        uxtl            v1.8h,   v1.8b
+1:
+        ld1             {v2.8b, v3.8b}, [x2], #16
+        subs            w10, w10, #16         // width
+        uxtl            v2.8h,   v2.8b
+        uxtl            v3.8h,   v3.8b
+        ext             v4.16b,  v1.16b,  v2.16b,  #2
+        ext             v5.16b,  v1.16b,  v2.16b,  #4
+        ext             v6.16b,  v1.16b,  v2.16b,  #6
+        ext             v7.16b,  v1.16b,  v2.16b,  #8
+        ext             v16.16b, v1.16b,  v2.16b,  #10
+        ext             v17.16b, v1.16b,  v2.16b,  #12
+        ext             v18.16b, v1.16b,  v2.16b,  #14
+        mul             v19.8h,  v1.8h,   v0.h[0]
+        mla             v19.8h,  v4.8h,   v0.h[1]
+        mla             v19.8h,  v5.8h,   v0.h[2]
+        mla             v19.8h,  v6.8h,   v0.h[3]
+        mla             v19.8h,  v7.8h,   v0.h[4]
+        mla             v19.8h,  v16.8h,  v0.h[5]
+        mla             v19.8h,  v17.8h,  v0.h[6]
+        mla             v19.8h,  v18.8h,  v0.h[7]
+        ext             v4.16b,  v2.16b,  v3.16b,  #2
+        ext             v5.16b,  v2.16b,  v3.16b,  #4
+        ext             v6.16b,  v2.16b,  v3.16b,  #6
+        ext             v7.16b,  v2.16b,  v3.16b,  #8
+        ext             v16.16b, v2.16b,  v3.16b,  #10
+        ext             v17.16b, v2.16b,  v3.16b,  #12
+        ext             v18.16b, v2.16b,  v3.16b,  #14
+        mul             v20.8h,  v2.8h,   v0.h[0]
+        mla             v20.8h,  v4.8h,   v0.h[1]
+        mla             v20.8h,  v5.8h,   v0.h[2]
+        mla             v20.8h,  v6.8h,   v0.h[3]
+        mla             v20.8h,  v7.8h,   v0.h[4]
+        mla             v20.8h,  v16.8h,  v0.h[5]
+        mla             v20.8h,  v17.8h,  v0.h[6]
+        mla             v20.8h,  v18.8h,  v0.h[7]
+        smull           v16.4s,  v19.4h,  v30.4h
+        smull2          v17.4s,  v19.8h,  v30.8h
+        smull           v18.4s,  v20.4h,  v30.4h
+        smull2          v19.4s,  v20.8h,  v30.8h
+        sqrshl          v16.4s,  v16.4s,  v31.4s
+        sqrshl          v17.4s,  v17.4s,  v31.4s
+        sqrshl          v18.4s,  v18.4s,  v31.4s
+        sqrshl          v19.4s,  v19.4s,  v31.4s
+        sqadd           v16.4s,  v16.4s,  v29.4s
+        sqadd           v17.4s,  v17.4s,  v29.4s
+        sqadd           v18.4s,  v18.4s,  v29.4s
+        sqadd           v19.4s,  v19.4s,  v29.4s
+        sqxtn           v16.4h,  v16.4s
+        sqxtn2          v16.8h,  v17.4s
+        sqxtn           v17.4h,  v18.4s
+        sqxtn2          v17.8h,  v19.4s
+        sqxtun          v16.8b,  v16.8h
+        sqxtun          v17.8b,  v17.8h
+        st1             {v16.8b, v17.8b}, [x0], #16
+        mov             v1.16b,  v3.16b
+        b.gt            1b
+        subs            w4,  w4,  #1          // height
+        add             x2,  x2,  x3
+        b.le            9f
+        ld1             {v1.8b}, [x2], #8
+        mov             w10, w11
+        add             x0,  x0,  x1
+        uxtl            v1.8h,   v1.8b
+        b               1b
+9:
+        ret
+endfunc
+
+#if HAVE_I8MM
+ENABLE_I8MM
 function ff_hevc_put_hevc_qpel_uni_w_h4_8_neon_i8mm, export=1
         QPEL_UNI_W_H_HEADER
 1:
-- 
2.39.3 (Apple Git-146)


* [FFmpeg-devel] [PATCH 15/21] aarch64: hevc: Split the qpel_*_hv functions into two parts
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (13 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 14/21] aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8 Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 16/21] aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions Martin Storsjö
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 94 +++++++++++++++++++++++---
 1 file changed, 86 insertions(+), 8 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index fba063186c..c04e8dbea8 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2166,6 +2166,10 @@ function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_qpel_uni_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv4_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
         ldr             d16, [sp]
@@ -2208,6 +2212,10 @@ function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_qpel_uni_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv6_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
         sub             x1, x1, #4
@@ -2253,6 +2261,10 @@ function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldr             x30, [sp], #48
+        b               hevc_put_hevc_qpel_uni_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv8_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
         ldr             q16, [sp]
@@ -2296,6 +2308,10 @@ function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_qpel_uni_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv12_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
         sub             x1, x1, #8
@@ -2339,7 +2355,10 @@ function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
-.Lqpel_uni_hv16_loop:
+        b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv16_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
         sub             w12, w9, w7, lsl #1
@@ -2414,7 +2433,7 @@ function ff_hevc_put_hevc_qpel_uni_hv32_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
-        b               .Lqpel_uni_hv16_loop
+        b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
 
 function ff_hevc_put_hevc_qpel_uni_hv48_8_neon_i8mm, export=1
@@ -2434,7 +2453,7 @@ function ff_hevc_put_hevc_qpel_uni_hv48_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
-        b               .Lqpel_uni_hv16_loop
+        b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
 
 function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, export=1
@@ -2454,7 +2473,7 @@ function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, export=1
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
-        b               .Lqpel_uni_hv16_loop
+        b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
 DISABLE_I8MM
 #endif
@@ -3776,6 +3795,10 @@ function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv4_8_end_neon
         load_qpel_filterh x5, x4
         ldr             d16, [sp]
         ldr             d17, [sp, x7]
@@ -3813,6 +3836,10 @@ function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv6_8_end_neon
         mov             x8, #120
         load_qpel_filterh x5, x4
         ldr             q16, [sp]
@@ -3852,6 +3879,10 @@ function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv8_8_end_neon
         mov             x7, #128
         load_qpel_filterh x5, x4
         ldr             q16, [sp]
@@ -3890,6 +3921,10 @@ function ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv12_8_end_neon
         mov             x7, #128
         load_qpel_filterh x5, x4
         mov             x8, #112
@@ -3927,6 +3962,10 @@ function ff_hevc_put_hevc_qpel_hv16_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv16_8_end_neon
         mov             x7, #128
         load_qpel_filterh x5, x4
         ld1             {v16.8h, v17.8h}, [sp], x7
@@ -3979,6 +4018,10 @@ function ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm, export=1
         bl              X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
         ldp             x0, x3, [sp, #16]
         ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv32_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv32_8_end_neon
         mov             x7, #128
         load_qpel_filterh x5, x4
 0:      mov             x8, sp          // src
@@ -4127,6 +4170,10 @@ endfunc
 
 function ff_hevc_put_hevc_qpel_uni_w_hv4_8_neon_i8mm, export=1
         QPEL_UNI_W_HV_HEADER 4
+        b               hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
         ldr             d16, [sp]
         ldr             d17, [sp, x10]
         add             sp, sp, x10, lsl #1
@@ -4217,6 +4264,10 @@ endfunc
 
 function ff_hevc_put_hevc_qpel_uni_w_hv8_8_neon_i8mm, export=1
         QPEL_UNI_W_HV_HEADER 8
+        b               hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
         ldr             q16, [sp]
         ldr             q17, [sp, x10]
         add             sp, sp, x10, lsl #1
@@ -4327,6 +4378,10 @@ endfunc
 
 function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
         QPEL_UNI_W_HV_HEADER 16
+        b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
         ldp             q16, q1, [sp]
         add             sp, sp, x10
         ldp             q17, q2, [sp]
@@ -4430,6 +4485,10 @@ endfunc
 
 function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         QPEL_UNI_W_HV_HEADER 32
+        b               hevc_put_hevc_qpel_uni_w_hv32_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv32_8_end_neon
         mov             x11, sp
         mov             w12, w22
         mov             x13, x20
@@ -4543,6 +4602,10 @@ endfunc
 
 function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         QPEL_UNI_W_HV_HEADER 64
+        b               hevc_put_hevc_qpel_uni_w_hv64_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv64_8_end_neon
         mov             x11, sp
         mov             w12, w22
         mov             x13, x20
@@ -4671,6 +4734,10 @@ function ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_qpel_bi_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_bi_hv4_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x7, x6
         ld1             {v16.4h}, [sp], x9
@@ -4712,6 +4779,10 @@ function ff_hevc_put_hevc_qpel_bi_hv6_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_qpel_bi_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_bi_hv6_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x7, x6
         sub             x1, x1, #4
@@ -4758,6 +4829,10 @@ function ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm, export=1
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_qpel_bi_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_bi_hv8_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x7, x6
         ld1             {v16.8h}, [sp], x9
@@ -4822,7 +4897,10 @@ function ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm, export=1
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         mov             x6, #16          // width
-.Lqpel_bi_hv16_loop:
+        b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_bi_hv16_8_end_neon
         load_qpel_filterh x7, x8
         mov             x9, #(MAX_PB_SIZE * 2)
         mov             x10, x6
@@ -4908,7 +4986,7 @@ function ff_hevc_put_hevc_qpel_bi_hv32_8_neon_i8mm, export=1
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         mov             x6, #32 // width
-        b               .Lqpel_bi_hv16_loop
+        b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
 endfunc
 
 function ff_hevc_put_hevc_qpel_bi_hv48_8_neon_i8mm, export=1
@@ -4929,7 +5007,7 @@ function ff_hevc_put_hevc_qpel_bi_hv48_8_neon_i8mm, export=1
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         mov             x6, #48 // width
-        b               .Lqpel_bi_hv16_loop
+        b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
 endfunc
 
 function ff_hevc_put_hevc_qpel_bi_hv64_8_neon_i8mm, export=1
@@ -4950,7 +5028,7 @@ function ff_hevc_put_hevc_qpel_bi_hv64_8_neon_i8mm, export=1
         ldp             x0, x1, [sp, #32]
         ldp             x7, x30, [sp], #48
         mov             x6, #64          // width
-        b               .Lqpel_bi_hv16_loop
+        b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
 endfunc
 
 DISABLE_I8MM
-- 
2.39.3 (Apple Git-146)


* [FFmpeg-devel] [PATCH 16/21] aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (14 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 15/21] aarch64: hevc: Split the qpel_*_hv functions into two parts Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 17/21] aarch64: hevc: Reorder qpel_hv functions to prepare for templating Martin Storsjö
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

The hv32 and hv64 functions were identical: both loop over the
width and process 16 pixels at a time.

The hv16 function was nearly identical, except that it lacked the
outer loop (and iterated with sp instead of a separate register).

Given the size of these functions, the extra cost of the outer
loop is negligible, so use the same function for hv16 as well.

This removes over 200 lines of duplicated assembly, and over 4 KB
of binary size.
---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 220 +------------------------
 1 file changed, 3 insertions(+), 217 deletions(-)
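
For readers less familiar with the assembly, the shape of the shared
tail can be sketched in C. This is only an illustration under stated
assumptions, not the actual implementation: process_strip() and
process_block() are hypothetical names, and the per-strip work is
replaced by a plain 8-row sum instead of the weighted qpel filter.
What it keeps from the patch is the control flow: an outer loop that
walks the block in 16-column strips (the "subs w27, w27, #16" loop in
the assembly) and an inner loop over the rows.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STRIP 16  /* columns handled per pass, as in the assembly */

/* Hypothetical stand-in for the per-strip work: for each of 16 columns,
 * sum 8 consecutive rows of the intermediate buffer. The real code
 * applies the weighted vertical qpel filter here instead. */
static void process_strip(int32_t *dst, ptrdiff_t dst_stride,
                          const int16_t *tmp, ptrdiff_t tmp_stride,
                          int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < STRIP; x++) {
            int32_t sum = 0;
            for (int t = 0; t < 8; t++)
                sum += tmp[(y + t) * tmp_stride + x];
            dst[y * dst_stride + x] = sum;
        }
    }
}

/* Shared tail: the outer loop advances by one 16-column strip per
 * iteration. For width == 16 it runs exactly once, so the only extra
 * cost compared to a dedicated hv16 version is the loop bookkeeping. */
static void process_block(int32_t *dst, ptrdiff_t dst_stride,
                          const int16_t *tmp, ptrdiff_t tmp_stride,
                          int width, int height)
{
    assert(width > 0 && width % STRIP == 0);
    for (int x0 = 0; x0 < width; x0 += STRIP)
        process_strip(dst + x0, dst_stride, tmp + x0, tmp_stride, height);
}
```

Calling process_block() with width 16 gives the same result as a single
process_strip() call, which is the property that lets hv16 branch to the
same tail as hv32 and hv64.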

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index c04e8dbea8..06832603d9 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -4381,231 +4381,17 @@ function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
         b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
 
-function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
-        ldp             q16, q1, [sp]
-        add             sp, sp, x10
-        ldp             q17, q2, [sp]
-        add             sp, sp, x10
-        ldp             q18, q3, [sp]
-        add             sp, sp, x10
-        ldp             q19, q4, [sp]
-        add             sp, sp, x10
-        ldp             q20, q5, [sp]
-        add             sp, sp, x10
-        ldp             q21, q6, [sp]
-        add             sp, sp, x10
-        ldp             q22, q7, [sp]
-        add             sp, sp, x10
-1:
-        ldp             q23, q31, [sp]
-        add             sp, sp, x10
-        QPEL_FILTER_H   v24, v16, v17, v18, v19, v20, v21, v22, v23
-        QPEL_FILTER_H2  v25, v16, v17, v18, v19, v20, v21, v22, v23
-        QPEL_FILTER_H   v26,  v1,  v2,  v3,  v4,  v5,  v6,  v7, v31
-        QPEL_FILTER_H2  v27,  v1,  v2,  v3,  v4,  v5,  v6,  v7, v31
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q16, q1, [sp]
-        add             sp, sp, x10
-        QPEL_FILTER_H   v24, v17, v18, v19, v20, v21, v22, v23, v16
-        QPEL_FILTER_H2  v25, v17, v18, v19, v20, v21, v22, v23, v16
-        QPEL_FILTER_H   v26,  v2,  v3,  v4,  v5,  v6,  v7, v31,  v1
-        QPEL_FILTER_H2  v27,  v2,  v3,  v4,  v5,  v6,  v7, v31,  v1
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q17, q2, [sp]
-        add             sp, sp, x10
-        QPEL_FILTER_H   v24, v18, v19, v20, v21, v22, v23, v16, v17
-        QPEL_FILTER_H2  v25, v18, v19, v20, v21, v22, v23, v16, v17
-        QPEL_FILTER_H   v26,  v3,  v4,  v5,  v6,  v7, v31,  v1,  v2
-        QPEL_FILTER_H2  v27,  v3,  v4,  v5,  v6,  v7, v31,  v1,  v2
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q18, q3, [sp]
-        add             sp, sp, x10
-        QPEL_FILTER_H   v24, v19, v20, v21, v22, v23, v16, v17, v18
-        QPEL_FILTER_H2  v25, v19, v20, v21, v22, v23, v16, v17, v18
-        QPEL_FILTER_H   v26,  v4,  v5,  v6,  v7, v31,  v1,  v2,  v3
-        QPEL_FILTER_H2  v27,  v4,  v5,  v6,  v7, v31,  v1,  v2,  v3
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q19, q4, [sp]
-        add             sp, sp, x10
-        QPEL_FILTER_H   v24, v20, v21, v22, v23, v16, v17, v18, v19
-        QPEL_FILTER_H2  v25, v20, v21, v22, v23, v16, v17, v18, v19
-        QPEL_FILTER_H   v26,  v5,  v6,  v7, v31,  v1,  v2,  v3,  v4
-        QPEL_FILTER_H2  v27,  v5,  v6,  v7, v31,  v1,  v2,  v3,  v4
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q20, q5, [sp]
-        add             sp, sp, x10
-        QPEL_FILTER_H   v24, v21, v22, v23, v16, v17, v18, v19, v20
-        QPEL_FILTER_H2  v25, v21, v22, v23, v16, v17, v18, v19, v20
-        QPEL_FILTER_H   v26,  v6,  v7, v31,  v1,  v2,  v3,  v4,  v5
-        QPEL_FILTER_H2  v27,  v6,  v7, v31,  v1,  v2,  v3,  v4,  v5
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q21, q6, [sp]
-        add             sp, sp, x10
-        QPEL_FILTER_H   v24, v22, v23, v16, v17, v18, v19, v20, v21
-        QPEL_FILTER_H2  v25, v22, v23, v16, v17, v18, v19, v20, v21
-        QPEL_FILTER_H   v26,  v7, v31,  v1,  v2,  v3,  v4,  v5,  v6
-        QPEL_FILTER_H2  v27,  v7, v31,  v1,  v2,  v3,  v4,  v5,  v6
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q22, q7, [sp]
-        add             sp, sp, x10
-        QPEL_FILTER_H   v24, v23, v16, v17, v18, v19, v20, v21, v22
-        QPEL_FILTER_H2  v25, v23, v16, v17, v18, v19, v20, v21, v22
-        QPEL_FILTER_H   v26, v31,  v1,  v2,  v3,  v4,  v5,  v6,  v7
-        QPEL_FILTER_H2  v27, v31,  v1,  v2,  v3,  v4,  v5,  v6,  v7
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.hi            1b
-
-2:
-        QPEL_UNI_W_HV_END
-        ret
-endfunc
-
-
 function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
         QPEL_UNI_W_HV_HEADER 32
-        b               hevc_put_hevc_qpel_uni_w_hv32_8_end_neon
-endfunc
-
-function hevc_put_hevc_qpel_uni_w_hv32_8_end_neon
-        mov             x11, sp
-        mov             w12, w22
-        mov             x13, x20
-        mov             x14, sp
-3:
-        ldp             q16, q1, [x11]
-        add             x11, x11, x10
-        ldp             q17, q2, [x11]
-        add             x11, x11, x10
-        ldp             q18, q3, [x11]
-        add             x11, x11, x10
-        ldp             q19, q4, [x11]
-        add             x11, x11, x10
-        ldp             q20, q5, [x11]
-        add             x11, x11, x10
-        ldp             q21, q6, [x11]
-        add             x11, x11, x10
-        ldp             q22, q7, [x11]
-        add             x11, x11, x10
-1:
-        ldp             q23, q31, [x11]
-        add             x11, x11, x10
-        QPEL_FILTER_H   v24, v16, v17, v18, v19, v20, v21, v22, v23
-        QPEL_FILTER_H2  v25, v16, v17, v18, v19, v20, v21, v22, v23
-        QPEL_FILTER_H   v26,  v1,  v2,  v3,  v4,  v5,  v6,  v7, v31
-        QPEL_FILTER_H2  v27,  v1,  v2,  v3,  v4,  v5,  v6,  v7, v31
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q16, q1, [x11]
-        add             x11, x11, x10
-        QPEL_FILTER_H   v24, v17, v18, v19, v20, v21, v22, v23, v16
-        QPEL_FILTER_H2  v25, v17, v18, v19, v20, v21, v22, v23, v16
-        QPEL_FILTER_H   v26,  v2,  v3,  v4,  v5,  v6,  v7, v31,  v1
-        QPEL_FILTER_H2  v27,  v2,  v3,  v4,  v5,  v6,  v7, v31,  v1
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q17, q2, [x11]
-        add             x11, x11, x10
-        QPEL_FILTER_H   v24, v18, v19, v20, v21, v22, v23, v16, v17
-        QPEL_FILTER_H2  v25, v18, v19, v20, v21, v22, v23, v16, v17
-        QPEL_FILTER_H   v26,  v3,  v4,  v5,  v6,  v7, v31,  v1,  v2
-        QPEL_FILTER_H2  v27,  v3,  v4,  v5,  v6,  v7, v31,  v1,  v2
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q18, q3, [x11]
-        add             x11, x11, x10
-        QPEL_FILTER_H   v24, v19, v20, v21, v22, v23, v16, v17, v18
-        QPEL_FILTER_H2  v25, v19, v20, v21, v22, v23, v16, v17, v18
-        QPEL_FILTER_H   v26,  v4,  v5,  v6,  v7, v31,  v1,  v2,  v3
-        QPEL_FILTER_H2  v27,  v4,  v5,  v6,  v7, v31,  v1,  v2,  v3
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q19, q4, [x11]
-        add             x11, x11, x10
-        QPEL_FILTER_H   v24, v20, v21, v22, v23, v16, v17, v18, v19
-        QPEL_FILTER_H2  v25, v20, v21, v22, v23, v16, v17, v18, v19
-        QPEL_FILTER_H   v26,  v5,  v6,  v7, v31,  v1,  v2,  v3,  v4
-        QPEL_FILTER_H2  v27,  v5,  v6,  v7, v31,  v1,  v2,  v3,  v4
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q20, q5, [x11]
-        add             x11, x11, x10
-        QPEL_FILTER_H   v24, v21, v22, v23, v16, v17, v18, v19, v20
-        QPEL_FILTER_H2  v25, v21, v22, v23, v16, v17, v18, v19, v20
-        QPEL_FILTER_H   v26,  v6,  v7, v31,  v1,  v2,  v3,  v4,  v5
-        QPEL_FILTER_H2  v27,  v6,  v7, v31,  v1,  v2,  v3,  v4,  v5
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q21, q6, [x11]
-        add             x11, x11, x10
-        QPEL_FILTER_H   v24, v22, v23, v16, v17, v18, v19, v20, v21
-        QPEL_FILTER_H2  v25, v22, v23, v16, v17, v18, v19, v20, v21
-        QPEL_FILTER_H   v26,  v7, v31,  v1,  v2,  v3,  v4,  v5,  v6
-        QPEL_FILTER_H2  v27,  v7, v31,  v1,  v2,  v3,  v4,  v5,  v6
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.eq            2f
-
-        ldp             q22, q7, [x11]
-        add             x11, x11, x10
-        QPEL_FILTER_H   v24, v23, v16, v17, v18, v19, v20, v21, v22
-        QPEL_FILTER_H2  v25, v23, v16, v17, v18, v19, v20, v21, v22
-        QPEL_FILTER_H   v26, v31,  v1,  v2,  v3,  v4,  v5,  v6,  v7
-        QPEL_FILTER_H2  v27, v31,  v1,  v2,  v3,  v4,  v5,  v6,  v7
-        QPEL_UNI_W_HV_16
-        subs            w22, w22, #1
-        b.hi            1b
-2:
-        subs            w27, w27, #16
-        add             x11, x14, #32
-        add             x20, x13, #16
-        mov             w22, w12
-        mov             x14, x11
-        mov             x13, x20
-        b.hi            3b
-        QPEL_UNI_W_HV_END
-        ret
+        b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
 
 function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
         QPEL_UNI_W_HV_HEADER 64
-        b               hevc_put_hevc_qpel_uni_w_hv64_8_end_neon
+        b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
 
-function hevc_put_hevc_qpel_uni_w_hv64_8_end_neon
+function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
         mov             x11, sp
         mov             w12, w22
         mov             x13, x20
-- 
2.39.3 (Apple Git-146)

* [FFmpeg-devel] [PATCH 17/21] aarch64: hevc: Reorder qpel_hv functions to prepare for templating
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (15 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 16/21] aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 18/21] aarch64: hevc: Produce plain neon versions of qpel_hv Martin Storsjö
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

---
 libavcodec/aarch64/hevcdsp_qpel_neon.S | 695 +++++++++++++------------
 1 file changed, 355 insertions(+), 340 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 06832603d9..ad568e415b 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2146,29 +2146,6 @@ function ff_hevc_put_hevc_qpel_uni_w_v64_8_neon, export=1
         ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
-        add             w10, w4, #7
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        sub             x1, x2, x3, lsl #1
-        sub             x1, x1, x3
-        add             x0, sp, #48
-        mov             x2, x3
-        add             x3, x4, #7
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_qpel_uni_hv4_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_hv4_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
@@ -2195,26 +2172,6 @@ function hevc_put_hevc_qpel_uni_hv4_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
-        add             w10, w4, #7
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        sub             x1, x2, x3, lsl #1
-        sub             x1, x1, x3
-        add             x0, sp, #48
-        mov             x2, x3
-        add             w3, w4, #7
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_qpel_uni_hv6_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_hv6_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
@@ -2244,26 +2201,6 @@ function hevc_put_hevc_qpel_uni_hv6_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
-        add             w10, w4, #7
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        str             x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        sub             x1, x2, x3, lsl #1
-        sub             x1, x1, x3
-        add             x0, sp, #48
-        mov             x2, x3
-        add             w3, w4, #7
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
-        b               hevc_put_hevc_qpel_uni_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_hv8_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
@@ -2291,26 +2228,6 @@ function hevc_put_hevc_qpel_uni_hv8_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, export=1
-        add             w10, w4, #7
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        sub             x1, x2, x3, lsl #1
-        sub             x1, x1, x3
-        mov             x2, x3
-        add             x0, sp, #48
-        add             w3, w4, #7
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_qpel_uni_hv12_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_hv12_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
@@ -2338,26 +2255,6 @@ function hevc_put_hevc_qpel_uni_hv12_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, export=1
-        add             w10, w4, #7
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x6, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3, lsl #1
-        sub             x1, x1, x3
-        mov             x2, x3
-        add             w3, w4, #7
-        mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
-        ldp             x4, x6, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_hv16_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x6, x5
@@ -2396,6 +2293,109 @@ function hevc_put_hevc_qpel_uni_hv16_8_end_neon
         ret
 endfunc
 
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
+        add             w10, w4, #7
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        sub             x1, x2, x3, lsl #1
+        sub             x1, x1, x3
+        add             x0, sp, #48
+        mov             x2, x3
+        add             x3, x4, #7
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_qpel_uni_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
+        add             w10, w4, #7
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        sub             x1, x2, x3, lsl #1
+        sub             x1, x1, x3
+        add             x0, sp, #48
+        mov             x2, x3
+        add             w3, w4, #7
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_qpel_uni_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
+        add             w10, w4, #7
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        str             x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        sub             x1, x2, x3, lsl #1
+        sub             x1, x1, x3
+        add             x0, sp, #48
+        mov             x2, x3
+        add             w3, w4, #7
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldr             x30, [sp], #48
+        b               hevc_put_hevc_qpel_uni_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, export=1
+        add             w10, w4, #7
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        sub             x1, x2, x3, lsl #1
+        sub             x1, x1, x3
+        mov             x2, x3
+        add             x0, sp, #48
+        add             w3, w4, #7
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_qpel_uni_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, export=1
+        add             w10, w4, #7
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x6, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3, lsl #1
+        sub             x1, x1, x3
+        mov             x2, x3
+        add             w3, w4, #7
+        mov             x4, x5
+        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+        ldp             x4, x6, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
+endfunc
+
 function ff_hevc_put_hevc_qpel_uni_hv24_8_neon_i8mm, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
@@ -3779,25 +3779,10 @@ function ff_hevc_put_hevc_qpel_h64_8_neon_i8mm, export=1
         b.ne            1b
         ret
 endfunc
+DISABLE_I8MM
+#endif
 
 
-function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
-        add             w10, w3, #7
-        mov             x7, #128
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        sub             x1, x1, x2, lsl #1
-        add             x3, x3, #7
-        sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_qpel_hv4_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_hv4_8_end_neon
         load_qpel_filterh x5, x4
         ldr             d16, [sp]
@@ -3822,23 +3807,6 @@ function hevc_put_hevc_qpel_hv4_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
-        add             w10, w3, #7
-        mov             x7, #128
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        sub             x1, x1, x2, lsl #1
-        add             x3, x3, #7
-        sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_qpel_hv6_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_hv6_8_end_neon
         mov             x8, #120
         load_qpel_filterh x5, x4
@@ -3866,22 +3834,6 @@ function hevc_put_hevc_qpel_hv6_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
-        add             w10, w3, #7
-        lsl             x10, x10, #7
-        sub             x1, x1, x2, lsl #1
-        sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        add             x3, x3, #7
-        sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_qpel_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_hv8_8_end_neon
         mov             x7, #128
         load_qpel_filterh x5, x4
@@ -3908,22 +3860,6 @@ function hevc_put_hevc_qpel_hv8_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm, export=1
-        add             w10, w3, #7
-        lsl             x10, x10, #7
-        sub             x1, x1, x2, lsl #1
-        sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
-        add             x3, x3, #7
-        sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_qpel_hv12_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_hv12_8_end_neon
         mov             x7, #128
         load_qpel_filterh x5, x4
@@ -3949,22 +3885,6 @@ function hevc_put_hevc_qpel_hv12_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv16_8_neon_i8mm, export=1
-        add             w10, w3, #7
-        lsl             x10, x10, #7
-        sub             x1, x1, x2, lsl #1
-        sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x3, x3, #7
-        add             x0, sp, #32
-        sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_qpel_hv16_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_hv16_8_end_neon
         mov             x7, #128
         load_qpel_filterh x5, x4
@@ -3989,38 +3909,6 @@ function hevc_put_hevc_qpel_hv16_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm, export=1
-        stp             x4, x5, [sp, #-64]!
-        stp             x2, x3, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        str             x30, [sp, #48]
-        bl              X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
-        ldp             x0, x1, [sp, #32]
-        ldp             x2, x3, [sp, #16]
-        ldp             x4, x5, [sp], #48
-        add             x1, x1, #12
-        add             x0, x0, #24
-        bl              X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
-        ldr             x30, [sp], #16
-        ret
-endfunc
-
-function ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm, export=1
-        add             w10, w3, #7
-        sub             x1, x1, x2, lsl #1
-        lsl             x10, x10, #7
-        sub             x1, x1, x2
-        sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x3, x3, #7
-        add             x0, sp, #32
-        bl              X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
-        b               hevc_put_hevc_qpel_hv32_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_hv32_8_end_neon
         mov             x7, #128
         load_qpel_filterh x5, x4
@@ -4056,6 +3944,122 @@ function hevc_put_hevc_qpel_hv32_8_end_neon
         ret
 endfunc
 
+#if HAVE_I8MM
+ENABLE_I8MM
+function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        mov             x7, #128
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2, lsl #1
+        add             x3, x3, #7
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        mov             x7, #128
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        sub             x1, x1, x2, lsl #1
+        add             x3, x3, #7
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        lsl             x10, x10, #7
+        sub             x1, x1, x2, lsl #1
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        add             x3, x3, #7
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        lsl             x10, x10, #7
+        sub             x1, x1, x2, lsl #1
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x0, sp, #32
+        add             x3, x3, #7
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv16_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        lsl             x10, x10, #7
+        sub             x1, x1, x2, lsl #1
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x3, x3, #7
+        add             x0, sp, #32
+        sub             x1, x1, x2
+        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm, export=1
+        stp             x4, x5, [sp, #-64]!
+        stp             x2, x3, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        str             x30, [sp, #48]
+        bl              X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
+        ldp             x0, x1, [sp, #32]
+        ldp             x2, x3, [sp, #16]
+        ldp             x4, x5, [sp], #48
+        add             x1, x1, #12
+        add             x0, x0, #24
+        bl              X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
+        ldr             x30, [sp], #16
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm, export=1
+        add             w10, w3, #7
+        sub             x1, x1, x2, lsl #1
+        lsl             x10, x10, #7
+        sub             x1, x1, x2
+        sub             sp, sp, x10         // tmp_array
+        stp             x5, x30, [sp, #-32]!
+        stp             x0, x3, [sp, #16]
+        add             x3, x3, #7
+        add             x0, sp, #32
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
+        ldp             x0, x3, [sp, #16]
+        ldp             x5, x30, [sp], #32
+        b               hevc_put_hevc_qpel_hv32_8_end_neon
+endfunc
+
 function ff_hevc_put_hevc_qpel_hv48_8_neon_i8mm, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
@@ -4089,6 +4093,8 @@ function ff_hevc_put_hevc_qpel_hv64_8_neon_i8mm, export=1
         ldr             x30, [sp], #16
         ret
 endfunc
+DISABLE_I8MM
+#endif
 
 .macro QPEL_UNI_W_HV_HEADER width
         ldp             x14, x15, [sp]          // mx, my
@@ -4168,11 +4174,6 @@ endfunc
         smlal2          \dst\().4s, \src7\().8h, v0.h[7]
 .endm
 
-function ff_hevc_put_hevc_qpel_uni_w_hv4_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 4
-        b               hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
         ldr             d16, [sp]
         ldr             d17, [sp, x10]
@@ -4262,11 +4263,6 @@ endfunc
         st1             {v24.d}[0], [x20], x21
 .endm
 
-function ff_hevc_put_hevc_qpel_uni_w_hv8_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 8
-        b               hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
         ldr             q16, [sp]
         ldr             q17, [sp, x10]
@@ -4376,21 +4372,6 @@ endfunc
         st1             {v24.16b}, [x20], x21
 .endm
 
-function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 16
-        b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
-endfunc
-
-function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 32
-        b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
-endfunc
-
-function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 64
-        b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
         mov             x11, sp
         mov             w12, w22
@@ -4503,26 +4484,37 @@ function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
         ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm, export=1
-        add             w10, w5, #7
-        lsl             x10, x10, #7
-        sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        sub             x1, x2, x3, lsl #1
-        sub             x1, x1, x3
-        add             x0, sp, #48
-        mov             x2, x3
-        add             w3, w5, #7
-        mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_qpel_bi_hv4_8_end_neon
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_qpel_uni_w_hv4_8_neon_i8mm, export=1
+        QPEL_UNI_W_HV_HEADER 4
+        b               hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_hv8_8_neon_i8mm, export=1
+        QPEL_UNI_W_HV_HEADER 8
+        b               hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
+        QPEL_UNI_W_HV_HEADER 16
+        b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
+        QPEL_UNI_W_HV_HEADER 32
+        b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
 
+function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
+        QPEL_UNI_W_HV_HEADER 64
+        b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
+endfunc
+
+DISABLE_I8MM
+#endif
+
 function hevc_put_hevc_qpel_bi_hv4_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x7, x6
@@ -4548,26 +4540,6 @@ function hevc_put_hevc_qpel_bi_hv4_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv6_8_neon_i8mm, export=1
-        add             w10, w5, #7
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        sub             x1, x2, x3, lsl #1
-        sub             x1, x1, x3
-        add             x0, sp, #48
-        mov             x2, x3
-        add             x3, x5, #7
-        mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_qpel_bi_hv6_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_bi_hv6_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x7, x6
@@ -4598,26 +4570,6 @@ function hevc_put_hevc_qpel_bi_hv6_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm, export=1
-        add             w10, w5, #7
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        sub             x1, x2, x3, lsl #1
-        sub             x1, x1, x3
-        add             x0, sp, #48
-        mov             x2, x3
-        add             x3, x5, #7
-        mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        b               hevc_put_hevc_qpel_bi_hv8_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_bi_hv8_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x7, x6
@@ -4646,46 +4598,6 @@ function hevc_put_hevc_qpel_bi_hv8_8_end_neon
 2:      ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv12_8_neon_i8mm, export=1
-        stp             x6, x7, [sp, #-80]!
-        stp             x4, x5, [sp, #16]
-        stp             x2, x3, [sp, #32]
-        stp             x0, x1, [sp, #48]
-        str             x30, [sp, #64]
-        bl              X(ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x2, x3, [sp, #32]
-        ldp             x0, x1, [sp, #48]
-        ldp             x6, x7, [sp], #64
-        add             x4, x4, #16
-        add             x2, x2, #8
-        add             x0, x0, #8
-        bl              X(ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm)
-        ldr             x30, [sp], #16
-        ret
-endfunc
-
-function ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm, export=1
-        add             w10, w5, #7
-        lsl             x10, x10, #7
-        sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
-        stp             x4, x5, [sp, #16]
-        stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
-        sub             x1, x2, x3, lsl #1
-        sub             x1, x1, x3
-        mov             x2, x3
-        add             w3, w5, #7
-        mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
-        ldp             x4, x5, [sp, #16]
-        ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
-        mov             x6, #16          // width
-        b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
-endfunc
-
 function hevc_put_hevc_qpel_bi_hv16_8_end_neon
         load_qpel_filterh x7, x8
         mov             x9, #(MAX_PB_SIZE * 2)
@@ -4735,6 +4647,109 @@ function hevc_put_hevc_qpel_bi_hv16_8_end_neon
         ret
 endfunc
 
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm, export=1
+        add             w10, w5, #7
+        lsl             x10, x10, #7
+        sub             sp, sp, x10 // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        sub             x1, x2, x3, lsl #1
+        sub             x1, x1, x3
+        add             x0, sp, #48
+        mov             x2, x3
+        add             w3, w5, #7
+        mov             x4, x6
+        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_qpel_bi_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_bi_hv6_8_neon_i8mm, export=1
+        add             w10, w5, #7
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        sub             x1, x2, x3, lsl #1
+        sub             x1, x1, x3
+        add             x0, sp, #48
+        mov             x2, x3
+        add             x3, x5, #7
+        mov             x4, x6
+        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_qpel_bi_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm, export=1
+        add             w10, w5, #7
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        sub             x1, x2, x3, lsl #1
+        sub             x1, x1, x3
+        add             x0, sp, #48
+        mov             x2, x3
+        add             x3, x5, #7
+        mov             x4, x6
+        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        b               hevc_put_hevc_qpel_bi_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_bi_hv12_8_neon_i8mm, export=1
+        stp             x6, x7, [sp, #-80]!
+        stp             x4, x5, [sp, #16]
+        stp             x2, x3, [sp, #32]
+        stp             x0, x1, [sp, #48]
+        str             x30, [sp, #64]
+        bl              X(ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x2, x3, [sp, #32]
+        ldp             x0, x1, [sp, #48]
+        ldp             x6, x7, [sp], #64
+        add             x4, x4, #16
+        add             x2, x2, #8
+        add             x0, x0, #8
+        bl              X(ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm)
+        ldr             x30, [sp], #16
+        ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm, export=1
+        add             w10, w5, #7
+        lsl             x10, x10, #7
+        sub             sp, sp, x10         // tmp_array
+        stp             x7, x30, [sp, #-48]!
+        stp             x4, x5, [sp, #16]
+        stp             x0, x1, [sp, #32]
+        add             x0, sp, #48
+        sub             x1, x2, x3, lsl #1
+        sub             x1, x1, x3
+        mov             x2, x3
+        add             w3, w5, #7
+        mov             x4, x6
+        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+        ldp             x4, x5, [sp, #16]
+        ldp             x0, x1, [sp, #32]
+        ldp             x7, x30, [sp], #48
+        mov             x6, #16          // width
+        b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
+endfunc
+
 function ff_hevc_put_hevc_qpel_bi_hv24_8_neon_i8mm, export=1
         stp             x6, x7, [sp, #-80]!
         stp             x4, x5, [sp, #16]
-- 
2.39.3 (Apple Git-146)

* [FFmpeg-devel] [PATCH 18/21] aarch64: hevc: Produce plain neon versions of qpel_hv
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (16 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 17/21] aarch64: hevc: Reorder qpel_hv functions to prepare for templating Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 19/21] aarch64: hevc: Produce plain neon versions of qpel_uni_hv Martin Storsjö
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

With storage allocated for h+8 rows, incrementing the stack
pointer no longer ends up at the right spot in the end. Instead,
keep the intended final stack pointer value in register x14, save
it on the stack across the call to the horizontal function, and
restore sp from it in the end functions.

AWS Graviton 3:
put_hevc_qpel_hv4_8_c: 386.0
put_hevc_qpel_hv4_8_neon: 125.7
put_hevc_qpel_hv4_8_i8mm: 83.2
put_hevc_qpel_hv6_8_c: 749.0
put_hevc_qpel_hv6_8_neon: 207.0
put_hevc_qpel_hv6_8_i8mm: 166.0
put_hevc_qpel_hv8_8_c: 1305.2
put_hevc_qpel_hv8_8_neon: 216.5
put_hevc_qpel_hv8_8_i8mm: 213.0
put_hevc_qpel_hv12_8_c: 2570.5
put_hevc_qpel_hv12_8_neon: 480.0
put_hevc_qpel_hv12_8_i8mm: 398.2
put_hevc_qpel_hv16_8_c: 4158.7
put_hevc_qpel_hv16_8_neon: 659.7
put_hevc_qpel_hv16_8_i8mm: 593.5
put_hevc_qpel_hv24_8_c: 8626.7
put_hevc_qpel_hv24_8_neon: 1653.5
put_hevc_qpel_hv24_8_i8mm: 1398.7
put_hevc_qpel_hv32_8_c: 14646.0
put_hevc_qpel_hv32_8_neon: 2566.2
put_hevc_qpel_hv32_8_i8mm: 2287.5
put_hevc_qpel_hv48_8_c: 31072.5
put_hevc_qpel_hv48_8_neon: 6228.5
put_hevc_qpel_hv48_8_i8mm: 5291.0
put_hevc_qpel_hv64_8_c: 53847.2
put_hevc_qpel_hv64_8_neon: 9856.7
put_hevc_qpel_hv64_8_i8mm: 8831.0
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   6 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 166 +++++++++++++---------
 2 files changed, 104 insertions(+), 68 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index ea0d26c019..105c26017b 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -265,6 +265,10 @@ NEON8_FNPROTO(qpel_v, (int16_t *dst,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(qpel_hv, (int16_t *dst,
+        const uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_hv, (int16_t *dst,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -436,6 +440,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 
         NEON8_FNASSIGN_SHARED_32(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h,);
 
+        NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
+
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
             NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index ad568e415b..7bffb991a7 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -3804,7 +3804,8 @@ function hevc_put_hevc_qpel_hv4_8_end_neon
 .endm
 1:      calc_all
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_hv6_8_end_neon
@@ -3831,7 +3832,8 @@ function hevc_put_hevc_qpel_hv6_8_end_neon
 .endm
 1:      calc_all
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_hv8_8_end_neon
@@ -3857,7 +3859,8 @@ function hevc_put_hevc_qpel_hv8_8_end_neon
 .endm
 1:      calc_all
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_hv12_8_end_neon
@@ -3882,7 +3885,8 @@ function hevc_put_hevc_qpel_hv12_8_end_neon
 .endm
 1:      calc_all2
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_hv16_8_end_neon
@@ -3906,7 +3910,8 @@ function hevc_put_hevc_qpel_hv16_8_end_neon
 .endm
 1:      calc_all2
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_hv32_8_end_neon
@@ -3937,162 +3942,187 @@ function hevc_put_hevc_qpel_hv32_8_end_neon
         add             sp, sp, #32
         subs            w6, w6, #16
         b.hi            0b
-        add             w10, w3, #6
-        add             sp, sp, #64          // discard rest of first line
-        lsl             x10, x10, #7
-        add             sp, sp, x10         // tmp_array without first line
+        mov             sp, x14
         ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
-        add             w10, w3, #7
+.macro qpel_hv suffix
+function ff_hevc_put_hevc_qpel_hv4_8_\suffix, export=1
+        add             w10, w3, #8
         mov             x7, #128
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
+        stp             x5,  x30, [sp, #-48]!
+        stp             x0,  x3,  [sp, #16]
+        str             x14,      [sp, #32]
+        add             x0, sp, #48
         sub             x1, x1, x2, lsl #1
         add             x3, x3, #7
         sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
+        bl              X(ff_hevc_put_hevc_qpel_h4_8_\suffix)
+        ldr             x14,      [sp, #32]
+        ldp             x0,  x3,  [sp, #16]
+        ldp             x5,  x30, [sp], #48
         b               hevc_put_hevc_qpel_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
-        add             w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv6_8_\suffix, export=1
+        add             w10, w3, #8
         mov             x7, #128
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
+        stp             x5,  x30, [sp, #-48]!
+        stp             x0,  x3,  [sp, #16]
+        str             x14,      [sp, #32]
+        add             x0, sp, #48
         sub             x1, x1, x2, lsl #1
         add             x3, x3, #7
         sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
+        bl              X(ff_hevc_put_hevc_qpel_h6_8_\suffix)
+        ldr             x14,      [sp, #32]
+        ldp             x0,  x3,  [sp, #16]
+        ldp             x5,  x30, [sp], #48
         b               hevc_put_hevc_qpel_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
-        add             w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv8_8_\suffix, export=1
+        add             w10, w3, #8
         lsl             x10, x10, #7
         sub             x1, x1, x2, lsl #1
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
+        stp             x5,  x30, [sp, #-48]!
+        stp             x0,  x3,  [sp, #16]
+        str             x14,      [sp, #32]
+        add             x0, sp, #48
         add             x3, x3, #7
         sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
+        bl              X(ff_hevc_put_hevc_qpel_h8_8_\suffix)
+        ldr             x14,      [sp, #32]
+        ldp             x0,  x3,  [sp, #16]
+        ldp             x5,  x30, [sp], #48
         b               hevc_put_hevc_qpel_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm, export=1
-        add             w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv12_8_\suffix, export=1
+        add             w10, w3, #8
         lsl             x10, x10, #7
         sub             x1, x1, x2, lsl #1
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
-        add             x0, sp, #32
+        stp             x5,  x30, [sp, #-48]!
+        stp             x0,  x3,  [sp, #16]
+        str             x14,      [sp, #32]
+        add             x0, sp, #48
         add             x3, x3, #7
         sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
+        mov             w6, #12
+        bl              X(ff_hevc_put_hevc_qpel_h12_8_\suffix)
+        ldr             x14,      [sp, #32]
+        ldp             x0,  x3,  [sp, #16]
+        ldp             x5,  x30, [sp], #48
         b               hevc_put_hevc_qpel_hv12_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv16_8_neon_i8mm, export=1
-        add             w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv16_8_\suffix, export=1
+        add             w10, w3, #8
         lsl             x10, x10, #7
         sub             x1, x1, x2, lsl #1
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
+        stp             x5,  x30, [sp, #-48]!
+        stp             x0,  x3,  [sp, #16]
+        str             x14,      [sp, #32]
         add             x3, x3, #7
-        add             x0, sp, #32
+        add             x0, sp, #48
         sub             x1, x1, x2
-        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
+        bl              X(ff_hevc_put_hevc_qpel_h16_8_\suffix)
+        ldr             x14,      [sp, #32]
+        ldp             x0,  x3,  [sp, #16]
+        ldp             x5,  x30, [sp], #48
         b               hevc_put_hevc_qpel_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_hv24_8_\suffix, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
         stp             x0, x1, [sp, #32]
         str             x30, [sp, #48]
-        bl              X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_hv12_8_\suffix)
         ldp             x0, x1, [sp, #32]
         ldp             x2, x3, [sp, #16]
         ldp             x4, x5, [sp], #48
         add             x1, x1, #12
         add             x0, x0, #24
-        bl              X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_hv12_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm, export=1
-        add             w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv32_8_\suffix, export=1
+        add             w10, w3, #8
         sub             x1, x1, x2, lsl #1
         lsl             x10, x10, #7
         sub             x1, x1, x2
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x5, x30, [sp, #-32]!
-        stp             x0, x3, [sp, #16]
+        stp             x5,  x30, [sp, #-48]!
+        stp             x0,  x3,  [sp, #16]
+        str             x14,      [sp, #32]
         add             x3, x3, #7
-        add             x0, sp, #32
-        bl              X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
-        ldp             x0, x3, [sp, #16]
-        ldp             x5, x30, [sp], #32
+        add             x0, sp, #48
+        mov             w6, #32
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+        ldr             x14,      [sp, #32]
+        ldp             x0,  x3,  [sp, #16]
+        ldp             x5,  x30, [sp], #48
         b               hevc_put_hevc_qpel_hv32_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_hv48_8_\suffix, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
         stp             x0, x1, [sp, #32]
         str             x30, [sp, #48]
-        bl              X(ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_hv24_8_\suffix)
         ldp             x0, x1, [sp, #32]
         ldp             x2, x3, [sp, #16]
         ldp             x4, x5, [sp], #48
         add             x1, x1, #24
         add             x0, x0, #48
-        bl              X(ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_hv24_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_hv64_8_\suffix, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
         stp             x0, x1, [sp, #32]
         str             x30, [sp, #48]
         mov             x6, #32
-        bl              X(ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_hv32_8_\suffix)
         ldp             x0, x1, [sp, #32]
         ldp             x2, x3, [sp, #16]
         ldp             x4, x5, [sp], #48
         add             x1, x1, #32
         add             x0, x0, #64
         mov             x6, #32
-        bl              X(ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_hv32_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
+.endm
+
+qpel_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+qpel_hv neon_i8mm
+
 DISABLE_I8MM
 #endif
 
-- 
2.39.3 (Apple Git-146)

* [FFmpeg-devel] [PATCH 19/21] aarch64: hevc: Produce plain neon versions of qpel_uni_hv
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (17 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 18/21] aarch64: hevc: Produce plain neon versions of qpel_hv Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 20/21] aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv Martin Storsjö
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

With storage allocated for h+8 rows, incrementing the stack
pointer no longer ends up at the right spot in the end. Instead,
keep the intended final stack pointer value in register x14, save
it on the stack across the call to the horizontal function, and
restore sp from it in the end functions.

AWS Graviton 3:
put_hevc_qpel_uni_hv4_8_c: 384.2
put_hevc_qpel_uni_hv4_8_neon: 127.5
put_hevc_qpel_uni_hv4_8_i8mm: 85.5
put_hevc_qpel_uni_hv6_8_c: 705.5
put_hevc_qpel_uni_hv6_8_neon: 224.5
put_hevc_qpel_uni_hv6_8_i8mm: 176.2
put_hevc_qpel_uni_hv8_8_c: 1136.5
put_hevc_qpel_uni_hv8_8_neon: 216.5
put_hevc_qpel_uni_hv8_8_i8mm: 214.0
put_hevc_qpel_uni_hv12_8_c: 2259.5
put_hevc_qpel_uni_hv12_8_neon: 498.5
put_hevc_qpel_uni_hv12_8_i8mm: 410.7
put_hevc_qpel_uni_hv16_8_c: 3824.7
put_hevc_qpel_uni_hv16_8_neon: 670.0
put_hevc_qpel_uni_hv16_8_i8mm: 603.7
put_hevc_qpel_uni_hv24_8_c: 8113.5
put_hevc_qpel_uni_hv24_8_neon: 1474.7
put_hevc_qpel_uni_hv24_8_i8mm: 1351.5
put_hevc_qpel_uni_hv32_8_c: 14744.5
put_hevc_qpel_uni_hv32_8_neon: 2599.7
put_hevc_qpel_uni_hv32_8_i8mm: 2266.0
put_hevc_qpel_uni_hv48_8_c: 32800.0
put_hevc_qpel_uni_hv48_8_neon: 5650.0
put_hevc_qpel_uni_hv48_8_i8mm: 5011.7
put_hevc_qpel_uni_hv64_8_c: 57856.2
put_hevc_qpel_uni_hv64_8_neon: 9863.5
put_hevc_qpel_uni_hv64_8_i8mm: 8767.7
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 156 ++++++++++++++--------
 2 files changed, 102 insertions(+), 59 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 105c26017b..0531db027b 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -277,6 +277,10 @@ NEON8_FNPROTO(qpel_uni_v, (uint8_t *dst,  ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(qpel_uni_hv, (uint8_t *dst,  ptrdiff_t dststride,
+        const uint8_t *src, ptrdiff_t srcstride,
+        int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_uni_hv, (uint8_t *dst,  ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -441,6 +445,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN_SHARED_32(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h,);
 
         NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
+        NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv,);
 
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 7bffb991a7..f285ab7461 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2169,7 +2169,8 @@ function hevc_put_hevc_qpel_uni_hv4_8_end_neon
 .endm
 1:      calc_all
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_uni_hv6_8_end_neon
@@ -2198,7 +2199,8 @@ function hevc_put_hevc_qpel_uni_hv6_8_end_neon
 .endm
 1:      calc_all
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_uni_hv8_8_end_neon
@@ -2225,7 +2227,8 @@ function hevc_put_hevc_qpel_uni_hv8_8_end_neon
 .endm
 1:      calc_all
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_uni_hv12_8_end_neon
@@ -2252,7 +2255,8 @@ function hevc_put_hevc_qpel_uni_hv12_8_end_neon
 .endm
 1:      calc_all2
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_uni_hv16_8_end_neon
@@ -2286,21 +2290,17 @@ function hevc_put_hevc_qpel_uni_hv16_8_end_neon
         add             sp, sp, #32
         subs            w7, w7, #16
         b.ne            0b
-        add             w10, w4, #6
-        add             sp, sp, x12         // discard rest of first line
-        lsl             x10, x10, #7
-        add             sp, sp, x10         // tmp_array without first line
+        mov             sp, x14
         ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
-        add             w10, w4, #7
+.macro qpel_uni_hv suffix
+function ff_hevc_put_hevc_qpel_uni_hv4_8_\suffix, export=1
+        add             w10, w4, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        str             x30, [sp, #-48]!
+        stp             x30, x14,[sp, #-48]!
         stp             x4, x6, [sp, #16]
         stp             x0, x1, [sp, #32]
         sub             x1, x2, x3, lsl #1
@@ -2309,18 +2309,19 @@ function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
         mov             x2, x3
         add             x3, x4, #7
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_h4_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
+        ldp             x30, x14, [sp], #48
         b               hevc_put_hevc_qpel_uni_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
-        add             w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv6_8_\suffix, export=1
+        add             w10, w4, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        str             x30, [sp, #-48]!
+        stp             x30, x14,[sp, #-48]!
         stp             x4, x6, [sp, #16]
         stp             x0, x1, [sp, #32]
         sub             x1, x2, x3, lsl #1
@@ -2329,18 +2330,19 @@ function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
         mov             x2, x3
         add             w3, w4, #7
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_h6_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
+        ldp             x30, x14, [sp], #48
         b               hevc_put_hevc_qpel_uni_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
-        add             w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv8_8_\suffix, export=1
+        add             w10, w4, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        str             x30, [sp, #-48]!
+        stp             x30, x14,[sp, #-48]!
         stp             x4, x6, [sp, #16]
         stp             x0, x1, [sp, #32]
         sub             x1, x2, x3, lsl #1
@@ -2349,60 +2351,67 @@ function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
         mov             x2, x3
         add             w3, w4, #7
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_h8_8_\suffix)
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldr             x30, [sp], #48
+        ldp             x30, x14, [sp], #48
         b               hevc_put_hevc_qpel_uni_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, export=1
-        add             w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv12_8_\suffix, export=1
+        add             w10, w4, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x6, [sp, #16]
         stp             x0, x1, [sp, #32]
+        str             x14,    [sp, #48]
         sub             x1, x2, x3, lsl #1
         sub             x1, x1, x3
         mov             x2, x3
-        add             x0, sp, #48
+        add             x0, sp, #64
         add             w3, w4, #7
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
+        mov             w6, #12
+        bl              X(ff_hevc_put_hevc_qpel_h12_8_\suffix)
+        ldr             x14,    [sp, #48]
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldp             x7, x30, [sp], #64
         b               hevc_put_hevc_qpel_uni_hv12_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, export=1
-        add             w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv16_8_\suffix, export=1
+        add             w10, w4, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x6, [sp, #16]
         stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
+        str             x14,    [sp, #48]
+        add             x0, sp, #64
         sub             x1, x2, x3, lsl #1
         sub             x1, x1, x3
         mov             x2, x3
         add             w3, w4, #7
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_h16_8_\suffix)
+        ldr             x14,    [sp, #48]
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldp             x7, x30, [sp], #64
         b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_uni_hv24_8_\suffix, export=1
         stp             x4, x5, [sp, #-64]!
         stp             x2, x3, [sp, #16]
         stp             x0, x1, [sp, #32]
         stp             x6, x30, [sp, #48]
         mov             x7, #16
-        bl              X(ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_uni_hv16_8_\suffix)
         ldp             x2, x3, [sp, #16]
         add             x2, x2, #16
         ldp             x0, x1, [sp, #32]
@@ -2410,71 +2419,100 @@ function ff_hevc_put_hevc_qpel_uni_hv24_8_neon_i8mm, export=1
         mov             x7, #8
         add             x0, x0, #16
         ldr             x6, [sp]
-        bl              X(ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_uni_hv8_8_\suffix)
         ldr             x30, [sp, #8]
         add             sp, sp, #16
         ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv32_8_neon_i8mm, export=1
-        add             w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv32_8_\suffix, export=1
+        add             w10, w4, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x6, [sp, #16]
         stp             x0, x1, [sp, #32]
+        str             x14,    [sp, #48]
         sub             x1, x2, x3, lsl #1
-        add             x0, sp, #48
+        add             x0, sp, #64
         sub             x1, x1, x3
         mov             x2, x3
         add             w3, w4, #7
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
+        mov             w6, #32
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+        ldr             x14,    [sp, #48]
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldp             x7, x30, [sp], #64
         b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv48_8_neon_i8mm, export=1
-        add             w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv48_8_\suffix, export=1
+        add             w10, w4, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x6, [sp, #16]
         stp             x0, x1, [sp, #32]
+        str             x14,    [sp, #48]
         sub             x1, x2, x3, lsl #1
         sub             x1, x1, x3
         mov             x2, x3
-        add             x0, sp, #48
+        add             x0, sp, #64
         add             w3, w4, #7
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h48_8_neon_i8mm)
+.ifc \suffix, neon
+        mov             w6, #48
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+.else
+        bl              X(ff_hevc_put_hevc_qpel_h48_8_\suffix)
+.endif
+        ldr             x14,    [sp, #48]
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldp             x7, x30, [sp], #64
         b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, export=1
-        add             w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv64_8_\suffix, export=1
+        add             w10, w4, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x6, [sp, #16]
         stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
+        str             x14,    [sp, #48]
+        add             x0, sp, #64
         sub             x1, x2, x3, lsl #1
         mov             x2, x3
         sub             x1, x1, x3
         add             w3, w4, #7
         mov             x4, x5
-        bl              X(ff_hevc_put_hevc_qpel_h64_8_neon_i8mm)
+.ifc \suffix, neon
+        mov             w6, #64
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+.else
+        bl              X(ff_hevc_put_hevc_qpel_h64_8_\suffix)
+.endif
+        ldr             x14,    [sp, #48]
         ldp             x4, x6, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldp             x7, x30, [sp], #64
         b               hevc_put_hevc_qpel_uni_hv16_8_end_neon
 endfunc
+.endm
+
+qpel_uni_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+qpel_uni_hv neon_i8mm
+
 DISABLE_I8MM
 #endif
 
-- 
2.39.3 (Apple Git-146)

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

* [FFmpeg-devel] [PATCH 20/21] aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (18 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 19/21] aarch64: hevc: Produce plain neon versions of qpel_uni_hv Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 21/21] aarch64: hevc: Produce plain neon versions of qpel_bi_hv Martin Storsjö
  2024-03-25 21:15 ` [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.
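
A rough C model of that sizing, for illustration only. The helper names are hypothetical, and the "round up to an even row count" rule is an assumed reading of "process two rows at a time", not something taken from the assembly. The h+7 figure is the number of rows the horizontal pass is asked for (the callers pass height + 7), and 128 bytes per row corresponds to MAX_PB_SIZE int16_t entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* row stride of the intermediate buffer, in int16_t units */

/* Rows requested from the horizontal pass: the block height plus 7 rows
 * of context for the 8-tap vertical filter. */
static int rows_requested(int height)
{
    return height + 7;
}

/* Rows that may be written when rows are stored in pairs
 * (assumption: the count is rounded up to the next even number). */
static int rows_written(int height)
{
    return (rows_requested(height) + 1) & ~1;
}

/* Size in bytes of a temporary buffer holding the given number of rows. */
static size_t tmp_bytes(int rows)
{
    return (size_t)rows * MAX_PB_SIZE * sizeof(int16_t);
}
```

For the largest block height (64), tmp_bytes() gives 9088 bytes for h+7 rows, which matches the constant replaced in QPEL_UNI_W_HV_HEADER below, and 9216 bytes for h+8 rows. Under the assumed rounding rule, rows_written() never exceeds h+8.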

AWS Graviton 3:
put_hevc_qpel_uni_w_hv4_8_c: 422.2
put_hevc_qpel_uni_w_hv4_8_neon: 140.7
put_hevc_qpel_uni_w_hv4_8_i8mm: 100.7
put_hevc_qpel_uni_w_hv8_8_c: 1208.0
put_hevc_qpel_uni_w_hv8_8_neon: 268.2
put_hevc_qpel_uni_w_hv8_8_i8mm: 261.5
put_hevc_qpel_uni_w_hv16_8_c: 4297.2
put_hevc_qpel_uni_w_hv16_8_neon: 802.2
put_hevc_qpel_uni_w_hv16_8_i8mm: 731.2
put_hevc_qpel_uni_w_hv32_8_c: 15518.5
put_hevc_qpel_uni_w_hv32_8_neon: 3085.2
put_hevc_qpel_uni_w_hv32_8_i8mm: 2783.2
put_hevc_qpel_uni_w_hv64_8_c: 57254.5
put_hevc_qpel_uni_w_hv64_8_neon: 11787.5
put_hevc_qpel_uni_w_hv64_8_i8mm: 10659.0
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |  6 +++
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 47 +++++++++++++++--------
 2 files changed, 37 insertions(+), 16 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 0531db027b..e9ee901322 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -305,6 +305,11 @@ NEON8_FNPROTO(epel_uni_w_hv, (uint8_t *_dst,  ptrdiff_t _dststride,
         int height, int denom, int wx, int ox,
         intptr_t mx, intptr_t my, int width), _i8mm);
 
+NEON8_FNPROTO_PARTIAL_5(qpel_uni_w_hv, (uint8_t *_dst,  ptrdiff_t _dststride,
+        const uint8_t *_src, ptrdiff_t _srcstride,
+        int height, int denom, int wx, int ox,
+        intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO_PARTIAL_5(qpel_uni_w_hv, (uint8_t *_dst,  ptrdiff_t _dststride,
         const uint8_t *_src, ptrdiff_t _srcstride,
         int height, int denom, int wx, int ox,
@@ -446,6 +451,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
 
         NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
         NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv,);
+        NEON8_FNASSIGN_PARTIAL_5(c->put_hevc_qpel_uni_w, 1, 1, qpel_uni_w_hv,);
 
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index f285ab7461..df7032b692 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -4164,7 +4164,7 @@ qpel_hv neon_i8mm
 DISABLE_I8MM
 #endif
 
-.macro QPEL_UNI_W_HV_HEADER width
+.macro QPEL_UNI_W_HV_HEADER width, suffix
         ldp             x14, x15, [sp]          // mx, my
         ldr             w13, [sp, #16]          // width
         stp             x19, x30, [sp, #-80]!
@@ -4173,7 +4173,7 @@ DISABLE_I8MM
         stp             x24, x25, [sp, #48]
         stp             x26, x27, [sp, #64]
         mov             x19, sp
-        mov             x11, #9088
+        mov             x11, #(MAX_PB_SIZE*(MAX_PB_SIZE+8)*2)
         sub             sp, sp, x11
         mov             x20, x0
         mov             x21, x1
@@ -4190,7 +4190,16 @@ DISABLE_I8MM
         mov             w26, #-6
         sub             w26, w26, w5            // -shift
         mov             w27, w13                // width
-        bl              X(ff_hevc_put_hevc_qpel_h\width\()_8_neon_i8mm)
+.ifc \suffix, neon
+.if \width >= 32
+        mov             w6,  #\width
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_neon)
+.else
+        bl              X(ff_hevc_put_hevc_qpel_h\width\()_8_\suffix)
+.endif
+.else
+        bl              X(ff_hevc_put_hevc_qpel_h\width\()_8_\suffix)
+.endif
         movrel          x9, qpel_filters
         add             x9, x9, x23, lsl #3
         ld1             {v0.8b}, [x9]
@@ -4552,33 +4561,39 @@ function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
         ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_uni_w_hv4_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 4
+.macro qpel_uni_w_hv suffix
+function ff_hevc_put_hevc_qpel_uni_w_hv4_8_\suffix, export=1
+        QPEL_UNI_W_HV_HEADER 4, \suffix
         b               hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_w_hv8_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 8
+function ff_hevc_put_hevc_qpel_uni_w_hv8_8_\suffix, export=1
+        QPEL_UNI_W_HV_HEADER 8, \suffix
         b               hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 16
+function ff_hevc_put_hevc_qpel_uni_w_hv16_8_\suffix, export=1
+        QPEL_UNI_W_HV_HEADER 16, \suffix
         b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 32
+function ff_hevc_put_hevc_qpel_uni_w_hv32_8_\suffix, export=1
+        QPEL_UNI_W_HV_HEADER 32, \suffix
         b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
-        QPEL_UNI_W_HV_HEADER 64
+function ff_hevc_put_hevc_qpel_uni_w_hv64_8_\suffix, export=1
+        QPEL_UNI_W_HV_HEADER 64, \suffix
         b               hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
 endfunc
+.endm
+
+qpel_uni_w_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+qpel_uni_w_hv neon_i8mm
 
 DISABLE_I8MM
 #endif
-- 
2.39.3 (Apple Git-146)

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [FFmpeg-devel] [PATCH 21/21] aarch64: hevc: Produce plain neon versions of qpel_bi_hv
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (19 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 20/21] aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
  2024-03-25 21:15 ` [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
  21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.

By allocating storage for h+8 rows, incrementing the stack
pointer won't end up at the right spot in the end. Store the
intended final stack pointer value in a register x14 which we
store on the stack.
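
A small C model of why the saved value is needed, as a sketch under stated assumptions. The function names are hypothetical, and the "walk sp forward over h+7 rows" tail is an assumption about how the stack pointer gets advanced, based only on the description above. The 64-byte register save area is pushed and popped symmetrically, so it is left out.

```c
#include <assert.h>
#include <stdint.h>

#define ROW_BYTES 128  /* MAX_PB_SIZE (64) * sizeof(int16_t), i.e. "lsl #7" */

/* Prologue: sub sp, sp, (height + rows_extra) << 7 */
static uint64_t reserve_tmp(uint64_t entry_sp, int height, int rows_extra)
{
    return entry_sp - (uint64_t)(height + rows_extra) * ROW_BYTES;
}

/* Assumed tail behaviour: sp is advanced by one row for each of the
 * height + 7 rows that are consumed. */
static uint64_t walk_consumed_rows(uint64_t sp, int height)
{
    return sp + (uint64_t)(height + 7) * ROW_BYTES;
}

/* With this patch: ignore where sp ended up and restore the value
 * captured on entry (mov sp, x14). */
static uint64_t restore_saved(uint64_t saved_entry_sp)
{
    return saved_entry_sp;
}
```

With rows_extra = 7 the walk ends exactly at the entry value; with rows_extra = 8 it ends one row (128 bytes) short, which is the mismatch the saved x14 value avoids.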

AWS Graviton 3:
put_hevc_qpel_bi_hv4_8_c: 385.7
put_hevc_qpel_bi_hv4_8_neon: 131.0
put_hevc_qpel_bi_hv4_8_i8mm: 92.2
put_hevc_qpel_bi_hv6_8_c: 701.0
put_hevc_qpel_bi_hv6_8_neon: 239.5
put_hevc_qpel_bi_hv6_8_i8mm: 191.0
put_hevc_qpel_bi_hv8_8_c: 1162.0
put_hevc_qpel_bi_hv8_8_neon: 228.0
put_hevc_qpel_bi_hv8_8_i8mm: 225.2
put_hevc_qpel_bi_hv12_8_c: 2305.0
put_hevc_qpel_bi_hv12_8_neon: 558.0
put_hevc_qpel_bi_hv12_8_i8mm: 483.2
put_hevc_qpel_bi_hv16_8_c: 3965.2
put_hevc_qpel_bi_hv16_8_neon: 732.7
put_hevc_qpel_bi_hv16_8_i8mm: 656.5
put_hevc_qpel_bi_hv24_8_c: 8709.7
put_hevc_qpel_bi_hv24_8_neon: 1555.2
put_hevc_qpel_bi_hv24_8_i8mm: 1448.7
put_hevc_qpel_bi_hv32_8_c: 14818.0
put_hevc_qpel_bi_hv32_8_neon: 2763.7
put_hevc_qpel_bi_hv32_8_i8mm: 2468.0
put_hevc_qpel_bi_hv48_8_c: 32855.5
put_hevc_qpel_bi_hv48_8_neon: 6107.2
put_hevc_qpel_bi_hv48_8_i8mm: 5452.7
put_hevc_qpel_bi_hv64_8_c: 57591.5
put_hevc_qpel_bi_hv64_8_neon: 10660.2
put_hevc_qpel_bi_hv64_8_i8mm: 9580.0
---
 libavcodec/aarch64/hevcdsp_init_aarch64.c |   5 +
 libavcodec/aarch64/hevcdsp_qpel_neon.S    | 164 +++++++++++++---------
 2 files changed, 103 insertions(+), 66 deletions(-)

diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index e9ee901322..e24dd0cbda 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -319,6 +319,10 @@ NEON8_FNPROTO(qpel_bi_v, (uint8_t *dst, ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
         int height, intptr_t mx, intptr_t my, int width),);
 
+NEON8_FNPROTO(qpel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
+        const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
+        int height, intptr_t mx, intptr_t my, int width),);
+
 NEON8_FNPROTO(qpel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
         const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
         int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -452,6 +456,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
         NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
         NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv,);
         NEON8_FNASSIGN_PARTIAL_5(c->put_hevc_qpel_uni_w, 1, 1, qpel_uni_w_hv,);
+        NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 1, qpel_bi_hv,);
 
         if (have_i8mm(cpu_flags)) {
             NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index df7032b692..8ddaa32b70 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -4590,14 +4590,6 @@ endfunc
 
 qpel_uni_w_hv neon
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-qpel_uni_w_hv neon_i8mm
-
-DISABLE_I8MM
-#endif
-
 function hevc_put_hevc_qpel_bi_hv4_8_end_neon
         mov             x9, #(MAX_PB_SIZE * 2)
         load_qpel_filterh x7, x6
@@ -4620,7 +4612,8 @@ function hevc_put_hevc_qpel_bi_hv4_8_end_neon
 .endm
 1:      calc_all
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_bi_hv6_8_end_neon
@@ -4650,7 +4643,8 @@ function hevc_put_hevc_qpel_bi_hv6_8_end_neon
 .endm
 1:      calc_all
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_bi_hv8_8_end_neon
@@ -4678,7 +4672,8 @@ function hevc_put_hevc_qpel_bi_hv8_8_end_neon
 .endm
 1:      calc_all
 .purgem calc
-2:      ret
+2:      mov             sp, x14
+        ret
 endfunc
 
 function hevc_put_hevc_qpel_bi_hv16_8_end_neon
@@ -4723,83 +4718,87 @@ function hevc_put_hevc_qpel_bi_hv16_8_end_neon
         subs            x10, x10, #16
         add             x4, x4, #32
         b.ne            0b
-        add             w10, w5, #7
-        lsl             x10, x10, #7
-        sub             x10, x10, x6, lsl #1 // part of first line
-        add             sp, sp, x10         // tmp_array without first line
+        mov             sp, x14
         ret
 endfunc
 
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm, export=1
-        add             w10, w5, #7
+.macro qpel_bi_hv suffix
+function ff_hevc_put_hevc_qpel_bi_hv4_8_\suffix, export=1
+        add             w10, w5, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x5, [sp, #16]
         stp             x0, x1, [sp, #32]
+        str             x14,    [sp, #48]
         sub             x1, x2, x3, lsl #1
         sub             x1, x1, x3
-        add             x0, sp, #48
+        add             x0, sp, #64
         mov             x2, x3
         add             w3, w5, #7
         mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_h4_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldr             x14,    [sp, #48]
+        ldp             x7, x30, [sp], #64
         b               hevc_put_hevc_qpel_bi_hv4_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv6_8_neon_i8mm, export=1
-        add             w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv6_8_\suffix, export=1
+        add             w10, w5, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x5, [sp, #16]
         stp             x0, x1, [sp, #32]
+        str             x14,    [sp, #48]
         sub             x1, x2, x3, lsl #1
         sub             x1, x1, x3
-        add             x0, sp, #48
+        add             x0, sp, #64
         mov             x2, x3
         add             x3, x5, #7
         mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_h6_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldr             x14,    [sp, #48]
+        ldp             x7, x30, [sp], #64
         b               hevc_put_hevc_qpel_bi_hv6_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm, export=1
-        add             w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv8_8_\suffix, export=1
+        add             w10, w5, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x5, [sp, #16]
         stp             x0, x1, [sp, #32]
+        str             x14,    [sp, #48]
         sub             x1, x2, x3, lsl #1
         sub             x1, x1, x3
-        add             x0, sp, #48
+        add             x0, sp, #64
         mov             x2, x3
         add             x3, x5, #7
         mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_h8_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldr             x14,    [sp, #48]
+        ldp             x7, x30, [sp], #64
         b               hevc_put_hevc_qpel_bi_hv8_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_bi_hv12_8_\suffix, export=1
         stp             x6, x7, [sp, #-80]!
         stp             x4, x5, [sp, #16]
         stp             x2, x3, [sp, #32]
         stp             x0, x1, [sp, #48]
         str             x30, [sp, #64]
-        bl              X(ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_bi_hv8_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x2, x3, [sp, #32]
         ldp             x0, x1, [sp, #48]
@@ -4807,39 +4806,42 @@ function ff_hevc_put_hevc_qpel_bi_hv12_8_neon_i8mm, export=1
         add             x4, x4, #16
         add             x2, x2, #8
         add             x0, x0, #8
-        bl              X(ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_bi_hv4_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm, export=1
-        add             w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv16_8_\suffix, export=1
+        add             w10, w5, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x5, [sp, #16]
         stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
+        str             x14,    [sp, #48]
+        add             x0, sp, #64
         sub             x1, x2, x3, lsl #1
         sub             x1, x1, x3
         mov             x2, x3
         add             w3, w5, #7
         mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_h16_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldr             x14,    [sp, #48]
+        ldp             x7, x30, [sp], #64
         mov             x6, #16          // width
         b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_bi_hv24_8_\suffix, export=1
         stp             x6, x7, [sp, #-80]!
         stp             x4, x5, [sp, #16]
         stp             x2, x3, [sp, #32]
         stp             x0, x1, [sp, #48]
         str             x30, [sp, #64]
-        bl              X(ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_bi_hv16_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x2, x3, [sp, #32]
         ldp             x0, x1, [sp, #48]
@@ -4847,73 +4849,103 @@ function ff_hevc_put_hevc_qpel_bi_hv24_8_neon_i8mm, export=1
         add             x4, x4, #32
         add             x2, x2, #16
         add             x0, x0, #16
-        bl              X(ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm)
+        bl              X(ff_hevc_put_hevc_qpel_bi_hv8_8_\suffix)
         ldr             x30, [sp], #16
         ret
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv32_8_neon_i8mm, export=1
-        add             w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv32_8_\suffix, export=1
+        add             w10, w5, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10         // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x5, [sp, #16]
         stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
+        str             x14,    [sp, #48]
+        add             x0, sp, #64
         sub             x1, x2, x3, lsl #1
         mov             x2, x3
         sub             x1, x1, x3
         add             w3, w5, #7
         mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
+        mov             w6, #32
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldr             x14,    [sp, #48]
+        ldp             x7, x30, [sp], #64
         mov             x6, #32 // width
         b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv48_8_neon_i8mm, export=1
-        add             w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv48_8_\suffix, export=1
+        add             w10, w5, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x5, [sp, #16]
         stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
+        str             x14,    [sp, #48]
+        add             x0, sp, #64
         sub             x1, x2, x3, lsl #1
         mov             x2, x3
         sub             x1, x1, x3
         add             w3, w5, #7
         mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h48_8_neon_i8mm)
+.ifc \suffix, neon
+        mov             w6, #48
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+.else
+        bl              X(ff_hevc_put_hevc_qpel_h48_8_\suffix)
+.endif
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldr             x14,    [sp, #48]
+        ldp             x7, x30, [sp], #64
         mov             x6, #48 // width
         b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
 endfunc
 
-function ff_hevc_put_hevc_qpel_bi_hv64_8_neon_i8mm, export=1
-        add             w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv64_8_\suffix, export=1
+        add             w10, w5, #8
         lsl             x10, x10, #7
+        mov             x14, sp
         sub             sp, sp, x10 // tmp_array
-        stp             x7, x30, [sp, #-48]!
+        stp             x7, x30, [sp, #-64]!
         stp             x4, x5, [sp, #16]
         stp             x0, x1, [sp, #32]
-        add             x0, sp, #48
+        str             x14,    [sp, #48]
+        add             x0, sp, #64
         sub             x1, x2, x3, lsl #1
         mov             x2, x3
         sub             x1, x1, x3
         add             w3, w5, #7
         mov             x4, x6
-        bl              X(ff_hevc_put_hevc_qpel_h64_8_neon_i8mm)
+.ifc \suffix, neon
+        mov             w6, #64
+        bl              X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+.else
+        bl              X(ff_hevc_put_hevc_qpel_h64_8_\suffix)
+.endif
         ldp             x4, x5, [sp, #16]
         ldp             x0, x1, [sp, #32]
-        ldp             x7, x30, [sp], #48
+        ldr             x14,    [sp, #48]
+        ldp             x7, x30, [sp], #64
         mov             x6, #64          // width
         b               hevc_put_hevc_qpel_bi_hv16_8_end_neon
 endfunc
+.endm
+
+qpel_bi_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+qpel_uni_w_hv neon_i8mm
+
+qpel_bi_hv neon_i8mm
 
 DISABLE_I8MM
 #endif // HAVE_I8MM
-- 
2.39.3 (Apple Git-146)


* Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
  2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
                   ` (20 preceding siblings ...)
  2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 21/21] aarch64: hevc: Produce plain neon versions of qpel_bi_hv Martin Storsjö
@ 2024-03-25 21:15 ` Martin Storsjö
  2024-03-25 21:56   ` J. Dekker
  21 siblings, 1 reply; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 21:15 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker

On Mon, 25 Mar 2024, Martin Storsjö wrote:

> Since some time, we have pretty complete AArch64 NEON coverage
> for the hevc decoder.
>
> However, some of these functions require the I8MM instruction set
> extension, and many of them (but not all) lack a plain NEON
> version.
>
> This patchset fills in a regular NEON version of all functions
> where we have an I8MM function.
>
> For context; the I8MM instruction set extension is a mandatory
> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
> but Apple M1 and Ampere Altra don't.
>
> This patchset takes decoding of a 1080p HEVC clip from 402
> fps to 649 fps on an Apple M1.
>
> Patch #2 also fixes a subtle bug in the existing implementation;
> two functions relied on the contents on the stack, below the
> stack pointer, being untouched within a function. If a signal
> gets delivered, those parts of the stack could be clobbered.

I know this is a bit short notice for a patchset of this size - but, would 
people be OK with merging this patchset before the impending 7.0 branch 
(which is made within the next 24h)?

The patches pass all my tricky build configurations, they give a very 
non-negligible speedup on many common CPUs, and patch #2 fixes a real bug 
in the existing implementations. (A bug fix patch can of course be 
backported after the branch too, but performance optimizations aren't 
generally relevant for backporting.)

// Martin

* Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
  2024-03-25 21:15 ` [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
@ 2024-03-25 21:56   ` J. Dekker
  2024-03-26  6:01     ` Jean-Baptiste Kempf
  0 siblings, 1 reply; 26+ messages in thread
From: J. Dekker @ 2024-03-25 21:56 UTC (permalink / raw)
  To: Martin Storsjö; +Cc: Logan Lyu, J . Dekker, ffmpeg-devel


> On Mon, 25 Mar 2024, Martin Storsjö wrote:
> 
>> Since some time, we have pretty complete AArch64 NEON coverage
>> for the hevc decoder.
>> 
>> However, some of these functions require the I8MM instruction set
>> extension, and many of them (but not all) lack a plain NEON
>> version.
>> 
>> This patchset fills in a regular NEON version of all functions
>> where we have an I8MM function.
>> 
>> For context; the I8MM instruction set extension is a mandatory
>> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
>> but Apple M1 and Ampere Altra don't.
>> 
>> This patchset takes decoding of a 1080p HEVC clip from 402
>> fps to 649 fps on an Apple M1.
>> 
>> Patch #2 also fixes a subtle bug in the existing implementation;
>> two functions relied on the contents on the stack, below the
>> stack pointer, being untouched within a function. If a signal
>> gets delivered, those parts of the stack could be clobbered.
> 
> I know this is a bit short notice for a patchset of this size - but, would people be OK with merging this patchset before the impending 7.0 branch (which is made within the next 24h)?
> 
> The patches pass all my tricky build configurations, they give a very non-negligible speedup on many common CPUs, and patch #2 fixes a real bug in the existing implementations. (A bug fix patch can of course be backported after the branch too, but performance optimizations aren't generally relevant for backporting.)
> 
> // Martin

Yes, please. I will tomorrow morning if you didn’t already push.
-- 
jd

* Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
  2024-03-25 21:56   ` J. Dekker
@ 2024-03-26  6:01     ` Jean-Baptiste Kempf
  2024-03-26  7:09       ` Martin Storsjö
  0 siblings, 1 reply; 26+ messages in thread
From: Jean-Baptiste Kempf @ 2024-03-26  6:01 UTC (permalink / raw)
  To: J. Dekker, ffmpeg-devel, Martin Storsjö; +Cc: myais

On Mon, 25 Mar 2024, at 22:56, J. Dekker wrote:
>> On Mon, 25 Mar 2024, Martin Storsjö wrote:
>> 
>>> Since some time, we have pretty complete AArch64 NEON coverage
>>> for the hevc decoder.
>>> 
>>> However, some of these functions require the I8MM instruction set
>>> extension, and many of them (but not all) lack a plain NEON
>>> version.
>>> 
>>> This patchset fills in a regular NEON version of all functions
>>> where we have an I8MM function.
>>> 
>>> For context; the I8MM instruction set extension is a mandatory
>>> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
>>> but Apple M1 and Ampere Altra don't.
>>> 
>>> This patchset takes decoding of a 1080p HEVC clip from 402
>>> fps to 649 fps on an Apple M1.
>>> 
>>> Patch #2 also fixes a subtle bug in the existing implementation;
>>> two functions relied on the contents on the stack, below the
>>> stack pointer, being untouched within a function. If a signal
>>> gets delivered, those parts of the stack could be clobbered.
>> 
>> I know this is a bit short notice for a patchset of this size - but, would people be OK with merging this patchset before the impending 7.0 branch (which is made within the next 24h)?
>> 
>> The patches pass all my tricky build configurations, they give a very non-negligible speedup on many common CPUs, and patch #2 fixes a real bug in the existing implementations. (A bug fix patch can of course be backported after the branch too, but performance optimizations aren't generally relevant for backporting.)
>> 
>> // Martin
>
> Yes, please. I will tomorrow morning if you didn’t already push.

+1

-- 
Jean-Baptiste Kempf -  President
+33 672 704 734
https://jbkempf.com/

* Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
  2024-03-26  6:01     ` Jean-Baptiste Kempf
@ 2024-03-26  7:09       ` Martin Storsjö
  0 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-26  7:09 UTC (permalink / raw)
  To: Jean-Baptiste Kempf; +Cc: myais, J. Dekker, ffmpeg-devel

On Tue, 26 Mar 2024, Jean-Baptiste Kempf wrote:

> On Mon, 25 Mar 2024, at 22:56, J. Dekker wrote:
>>> On Mon, 25 Mar 2024, Martin Storsjö wrote:
>>>
>>>> Since some time, we have pretty complete AArch64 NEON coverage
>>>> for the hevc decoder.
>>>>
>>>> However, some of these functions require the I8MM instruction set
>>>> extension, and many of them (but not all) lack a plain NEON
>>>> version.
>>>>
>>>> This patchset fills in a regular NEON version of all functions
>>>> where we have an I8MM function.
>>>>
>>>> For context; the I8MM instruction set extension is a mandatory
>>>> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
>>>> but Apple M1 and Ampere Altra don't.
>>>>
>>>> This patchset takes decoding of a 1080p HEVC clip from 402
>>>> fps to 649 fps on an Apple M1.
>>>>
>>>> Patch #2 also fixes a subtle bug in the existing implementation;
>>>> two functions relied on the contents on the stack, below the
>>>> stack pointer, being untouched within a function. If a signal
>>>> gets delivered, those parts of the stack could be clobbered.
>>>
>>> I know this is a bit short notice for a patchset of this size - but, would people be OK with merging this patchset before the impending 7.0 branch (which is made within the next 24h)?
>>>
>>> The patches pass all my tricky build configurations, they give a very non-negligible speedup on many common CPUs, and patch #2 fixes a real bug in the existing implementations. (A bug fix patch can of course be backported after the branch too, but performance optimizations aren't generally relevant for backporting.)
>>>
>>> // Martin
>>
>> Yes, please. I will tomorrow morning if you didn’t already push.
>
> +1

Thanks, I pushed this set now.

// Martin


end of thread, other threads:[~2024-03-26  7:09 UTC | newest]

Thread overview: 26+ messages
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 01/21] aarch64: hevc: Reorder a misplaced function init line Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 02/21] aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 03/21] aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 04/21] aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 05/21] aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 06/21] aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8 Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 07/21] aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8 Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 08/21] aarch64: hevc: Split the epel_*_hv functions into two parts Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 09/21] aarch64: hevc: Reorder epel_hv functions to prepare for templating Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 10/21] aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 11/21] aarch64: hevc: Produce epel_uni_hv functions for both " Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 12/21] aarch64: hevc: Produce epel_uni_w_hv " Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 13/21] aarch64: hevc: Produce epel_bi_hv " Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 14/21] aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8 Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 15/21] aarch64: hevc: Split the qpel_*_hv functions into two parts Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 16/21] aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 17/21] aarch64: hevc: Reorder qpel_hv functions to prepare for templating Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 18/21] aarch64: hevc: Produce plain neon versions of qpel_hv Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 19/21] aarch64: hevc: Produce plain neon versions of qpel_uni_hv Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 20/21] aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 21/21] aarch64: hevc: Produce plain neon versions of qpel_bi_hv Martin Storsjö
2024-03-25 21:15 ` [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
2024-03-25 21:56   ` J. Dekker
2024-03-26  6:01     ` Jean-Baptiste Kempf
2024-03-26  7:09       ` Martin Storsjö


This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev
