* [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
@ 2024-03-25 15:02 Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 01/21] aarch64: hevc: Reorder a misplaced function init line Martin Storsjö
` (21 more replies)
0 siblings, 22 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
Hi,
For some time now, we have had fairly complete AArch64 NEON coverage
for the hevc decoder.
However, some of these functions require the I8MM instruction set
extension, and many of them (but not all) lack a plain NEON
version.
This patchset fills in a regular NEON version of all functions
where we have an I8MM function.
For context: the I8MM instruction set extension is a mandatory
part of armv8.6-a. E.g. Apple M2 and AWS Graviton 3 have it,
but Apple M1 and Ampere Altra don't.
This patchset takes decoding of a 1080p HEVC clip from 402
fps to 649 fps on an Apple M1.
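As a rough illustration of the coverage gap (a sketch only, not the
actual init code; the flag names and stub functions below are
hypothetical), the effect of having a plain NEON fallback can be
modeled like this:

```c
/* Hypothetical feature flags; FFmpeg's real ones are defined in libavutil/cpu.h. */
enum { CPU_NEON = 1 << 0, CPU_I8MM = 1 << 1 };

typedef int (*pel_fn)(void);

/* Stand-ins for the C, NEON and I8MM versions of one hevc_pel function. */
static int epel_h_c(void)    { return 0; }
static int epel_h_neon(void) { return 1; }
static int epel_h_i8mm(void) { return 2; }

/* Later assignments override earlier ones, so the most specialized
 * version supported by the CPU wins. */
static pel_fn select_epel_h(int cpu_flags)
{
    pel_fn fn = epel_h_c;
    if (cpu_flags & CPU_NEON)
        fn = epel_h_neon;   /* what this patchset adds where it was missing */
    if (cpu_flags & CPU_I8MM)
        fn = epel_h_i8mm;   /* what already existed */
    return fn;
}
```

With only CPU_NEON set (an M1-like CPU), select_epel_h() returns the
NEON stub; without a plain NEON version, the same slot would have
fallen back to the C stub.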
Patch #2 also fixes a subtle bug in the existing implementation;
two functions relied on the contents of the stack, below the
stack pointer, being untouched within a function. If a signal
gets delivered, those parts of the stack could be clobbered.
// Martin
Martin Storsjö (21):
aarch64: hevc: Reorder a misplaced function init line
aarch64: hevc: Don't iterate with sp in
ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm
aarch64: hevc: Merge consecutive stores in
put_hevc_\type\()_h16_8_neon
aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal
looping
aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h
aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8
aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8
aarch64: hevc: Split the epel_*_hv functions into two parts
aarch64: hevc: Reorder epel_hv functions to prepare for templating
aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm
aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm
aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm
aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm
aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8
aarch64: hevc: Split the qpel_*_hv functions into two parts
aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon
functions
aarch64: hevc: Reorder qpel_hv functions to prepare for templating
aarch64: hevc: Produce plain neon versions of qpel_hv
aarch64: hevc: Produce plain neon versions of qpel_uni_hv
aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv
aarch64: hevc: Produce plain neon versions of qpel_bi_hv
libavcodec/aarch64/hevcdsp_epel_neon.S | 1529 +++++++++++------
libavcodec/aarch64/hevcdsp_init_aarch64.c | 96 +-
libavcodec/aarch64/hevcdsp_qpel_neon.S | 1804 +++++++++++++--------
3 files changed, 2291 insertions(+), 1138 deletions(-)
--
2.39.3 (Apple Git-146)
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
* [FFmpeg-devel] [PATCH 01/21] aarch64: hevc: Reorder a misplaced function init line
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
Group the epel and qpel functions together.
---
libavcodec/aarch64/hevcdsp_init_aarch64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 04692aa98e..d2f2a3681f 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -381,12 +381,12 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv, _i8mm);
NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h ,_i8mm);
+ NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv, _i8mm);
NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv, _i8mm);
NEON8_FNASSIGN(c->put_hevc_qpel, 0, 1, qpel_h, _i8mm);
NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv, _i8mm);
NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv, _i8mm);
NEON8_FNASSIGN(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h, _i8mm);
- NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv, _i8mm);
NEON8_FNASSIGN_PARTIAL_5(c->put_hevc_qpel_uni_w, 1, 1, qpel_uni_w_hv, _i8mm);
NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 1, qpel_bi_hv, _i8mm);
}
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 02/21] aarch64: hevc: Don't iterate with sp in ff_hevc_put_hevc_qpel_uni_w_hv32/64_8_neon_i8mm
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
Many of the routines within hevcdsp_epel_neon and hevcdsp_qpel_neon
store temporary buffers on the stack. When consuming them,
many of these functions use the stack pointer as an incrementing
pointer for reading the data (instead of keeping it in another
register), which is rather unusual.
Technically, this is fine as long as the pointer remains properly
aligned.
However, in the case of ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm,
after incrementing sp while reading data (within each 16 pixel
wide stripe), it would then reset the stack pointer back to a lower
value for reading the next 16 pixel wide stripe, expecting the
data to remain untouched.
This can't be assumed; data on the stack below the stack pointer
can be clobbered (e.g. by a signal handler). Some OS ABIs
allow for a small margin that won't be touched, known as a red zone,
but not all do. Those that do guarantee only 16 or 128 bytes, not
9 KB.
Convert this function to use a separate pointer register to
iterate through the data, retaining the stack pointer to point
at the bottom of the data we require to remain untouched.
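The access pattern described above can be sketched in C (an
illustration under assumptions, not the assembly itself: MAX_PB_SIZE
is assumed to be 64, the buffer is assumed to hold height + 7 rows,
and the function names are made up for this example). The base
pointer stands in for sp and stays fixed, while a separate cursor
walks each stripe:

```c
#include <stddef.h>
#include <stdint.h>

enum { MAX_PB_SIZE = 64 };  /* assumed value of FFmpeg's MAX_PB_SIZE */

/* Assumed size of the temporary buffer: (height + 7) rows of
 * MAX_PB_SIZE int16_t values each. */
static size_t tmp_bytes(int height)
{
    return (size_t)(height + 7) * MAX_PB_SIZE * sizeof(int16_t);
}

/* Walk the buffer in 16-pixel-wide stripes. `base` plays the role of sp and
 * is never modified; `stripe` and `cur` play the roles of x14 and x11. */
static long sum_stripes(const int16_t *base, int rows, int width)
{
    long sum = 0;
    const int16_t *stripe = base;
    for (int w = width; w > 0; w -= 16) {
        const int16_t *cur = stripe;        /* incrementing read pointer */
        for (int y = 0; y < rows; y++) {
            for (int x = 0; x < 16; x++)
                sum += cur[x];
            cur += MAX_PB_SIZE;             /* next row of the same stripe */
        }
        stripe += 16;                       /* 32 bytes, as in "add x11, x14, #32" */
    }
    return sum;
}
```

Under these assumptions, tmp_bytes(64) is 9088 bytes, i.e. roughly
the 9 KB mentioned above, far more than any red zone covers.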
---
libavcodec/aarch64/hevcdsp_qpel_neon.S | 130 +++++++++++++------------
1 file changed, 66 insertions(+), 64 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 9be29cafe2..815d897094 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -3981,24 +3981,25 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
mov x11, sp
mov w12, w22
mov x13, x20
+ mov x14, sp
3:
- ldp q16, q1, [sp]
- add sp, sp, x10
- ldp q17, q2, [sp]
- add sp, sp, x10
- ldp q18, q3, [sp]
- add sp, sp, x10
- ldp q19, q4, [sp]
- add sp, sp, x10
- ldp q20, q5, [sp]
- add sp, sp, x10
- ldp q21, q6, [sp]
- add sp, sp, x10
- ldp q22, q7, [sp]
- add sp, sp, x10
+ ldp q16, q1, [x11]
+ add x11, x11, x10
+ ldp q17, q2, [x11]
+ add x11, x11, x10
+ ldp q18, q3, [x11]
+ add x11, x11, x10
+ ldp q19, q4, [x11]
+ add x11, x11, x10
+ ldp q20, q5, [x11]
+ add x11, x11, x10
+ ldp q21, q6, [x11]
+ add x11, x11, x10
+ ldp q22, q7, [x11]
+ add x11, x11, x10
1:
- ldp q23, q31, [sp]
- add sp, sp, x10
+ ldp q23, q31, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v16, v17, v18, v19, v20, v21, v22, v23
QPEL_FILTER_H2 v25, v16, v17, v18, v19, v20, v21, v22, v23
QPEL_FILTER_H v26, v1, v2, v3, v4, v5, v6, v7, v31
@@ -4007,8 +4008,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q16, q1, [sp]
- add sp, sp, x10
+ ldp q16, q1, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v17, v18, v19, v20, v21, v22, v23, v16
QPEL_FILTER_H2 v25, v17, v18, v19, v20, v21, v22, v23, v16
QPEL_FILTER_H v26, v2, v3, v4, v5, v6, v7, v31, v1
@@ -4017,8 +4018,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q17, q2, [sp]
- add sp, sp, x10
+ ldp q17, q2, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v18, v19, v20, v21, v22, v23, v16, v17
QPEL_FILTER_H2 v25, v18, v19, v20, v21, v22, v23, v16, v17
QPEL_FILTER_H v26, v3, v4, v5, v6, v7, v31, v1, v2
@@ -4027,8 +4028,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q18, q3, [sp]
- add sp, sp, x10
+ ldp q18, q3, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v19, v20, v21, v22, v23, v16, v17, v18
QPEL_FILTER_H2 v25, v19, v20, v21, v22, v23, v16, v17, v18
QPEL_FILTER_H v26, v4, v5, v6, v7, v31, v1, v2, v3
@@ -4037,8 +4038,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q19, q4, [sp]
- add sp, sp, x10
+ ldp q19, q4, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v20, v21, v22, v23, v16, v17, v18, v19
QPEL_FILTER_H2 v25, v20, v21, v22, v23, v16, v17, v18, v19
QPEL_FILTER_H v26, v5, v6, v7, v31, v1, v2, v3, v4
@@ -4047,8 +4048,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q20, q5, [sp]
- add sp, sp, x10
+ ldp q20, q5, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v21, v22, v23, v16, v17, v18, v19, v20
QPEL_FILTER_H2 v25, v21, v22, v23, v16, v17, v18, v19, v20
QPEL_FILTER_H v26, v6, v7, v31, v1, v2, v3, v4, v5
@@ -4057,8 +4058,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q21, q6, [sp]
- add sp, sp, x10
+ ldp q21, q6, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v22, v23, v16, v17, v18, v19, v20, v21
QPEL_FILTER_H2 v25, v22, v23, v16, v17, v18, v19, v20, v21
QPEL_FILTER_H v26, v7, v31, v1, v2, v3, v4, v5, v6
@@ -4067,8 +4068,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q22, q7, [sp]
- add sp, sp, x10
+ ldp q22, q7, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v23, v16, v17, v18, v19, v20, v21, v22
QPEL_FILTER_H2 v25, v23, v16, v17, v18, v19, v20, v21, v22
QPEL_FILTER_H v26, v31, v1, v2, v3, v4, v5, v6, v7
@@ -4078,10 +4079,10 @@ function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
b.hi 1b
2:
subs w27, w27, #16
- add sp, x11, #32
+ add x11, x14, #32
add x20, x13, #16
mov w22, w12
- mov x11, sp
+ mov x14, x11
mov x13, x20
b.hi 3b
QPEL_UNI_W_HV_END
@@ -4093,24 +4094,25 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
mov x11, sp
mov w12, w22
mov x13, x20
+ mov x14, sp
3:
- ldp q16, q1, [sp]
- add sp, sp, x10
- ldp q17, q2, [sp]
- add sp, sp, x10
- ldp q18, q3, [sp]
- add sp, sp, x10
- ldp q19, q4, [sp]
- add sp, sp, x10
- ldp q20, q5, [sp]
- add sp, sp, x10
- ldp q21, q6, [sp]
- add sp, sp, x10
- ldp q22, q7, [sp]
- add sp, sp, x10
+ ldp q16, q1, [x11]
+ add x11, x11, x10
+ ldp q17, q2, [x11]
+ add x11, x11, x10
+ ldp q18, q3, [x11]
+ add x11, x11, x10
+ ldp q19, q4, [x11]
+ add x11, x11, x10
+ ldp q20, q5, [x11]
+ add x11, x11, x10
+ ldp q21, q6, [x11]
+ add x11, x11, x10
+ ldp q22, q7, [x11]
+ add x11, x11, x10
1:
- ldp q23, q31, [sp]
- add sp, sp, x10
+ ldp q23, q31, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v16, v17, v18, v19, v20, v21, v22, v23
QPEL_FILTER_H2 v25, v16, v17, v18, v19, v20, v21, v22, v23
QPEL_FILTER_H v26, v1, v2, v3, v4, v5, v6, v7, v31
@@ -4119,8 +4121,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q16, q1, [sp]
- add sp, sp, x10
+ ldp q16, q1, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v17, v18, v19, v20, v21, v22, v23, v16
QPEL_FILTER_H2 v25, v17, v18, v19, v20, v21, v22, v23, v16
QPEL_FILTER_H v26, v2, v3, v4, v5, v6, v7, v31, v1
@@ -4129,8 +4131,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q17, q2, [sp]
- add sp, sp, x10
+ ldp q17, q2, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v18, v19, v20, v21, v22, v23, v16, v17
QPEL_FILTER_H2 v25, v18, v19, v20, v21, v22, v23, v16, v17
QPEL_FILTER_H v26, v3, v4, v5, v6, v7, v31, v1, v2
@@ -4139,8 +4141,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q18, q3, [sp]
- add sp, sp, x10
+ ldp q18, q3, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v19, v20, v21, v22, v23, v16, v17, v18
QPEL_FILTER_H2 v25, v19, v20, v21, v22, v23, v16, v17, v18
QPEL_FILTER_H v26, v4, v5, v6, v7, v31, v1, v2, v3
@@ -4149,8 +4151,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q19, q4, [sp]
- add sp, sp, x10
+ ldp q19, q4, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v20, v21, v22, v23, v16, v17, v18, v19
QPEL_FILTER_H2 v25, v20, v21, v22, v23, v16, v17, v18, v19
QPEL_FILTER_H v26, v5, v6, v7, v31, v1, v2, v3, v4
@@ -4159,8 +4161,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q20, q5, [sp]
- add sp, sp, x10
+ ldp q20, q5, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v21, v22, v23, v16, v17, v18, v19, v20
QPEL_FILTER_H2 v25, v21, v22, v23, v16, v17, v18, v19, v20
QPEL_FILTER_H v26, v6, v7, v31, v1, v2, v3, v4, v5
@@ -4169,8 +4171,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q21, q6, [sp]
- add sp, sp, x10
+ ldp q21, q6, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v22, v23, v16, v17, v18, v19, v20, v21
QPEL_FILTER_H2 v25, v22, v23, v16, v17, v18, v19, v20, v21
QPEL_FILTER_H v26, v7, v31, v1, v2, v3, v4, v5, v6
@@ -4179,8 +4181,8 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
subs w22, w22, #1
b.eq 2f
- ldp q22, q7, [sp]
- add sp, sp, x10
+ ldp q22, q7, [x11]
+ add x11, x11, x10
QPEL_FILTER_H v24, v23, v16, v17, v18, v19, v20, v21, v22
QPEL_FILTER_H2 v25, v23, v16, v17, v18, v19, v20, v21, v22
QPEL_FILTER_H v26, v31, v1, v2, v3, v4, v5, v6, v7
@@ -4190,10 +4192,10 @@ function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
b.hi 1b
2:
subs w27, w27, #16
- add sp, x11, #32
+ add x11, x14, #32
add x20, x13, #16
mov w22, w12
- mov x11, sp
+ mov x14, x11
mov x13, x20
b.hi 3b
QPEL_UNI_W_HV_END
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 03/21] aarch64: hevc: Merge consecutive stores in put_hevc_\type\()_h16_8_neon
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
This gets rid of a couple of instructions, but the actual performance
is almost identical on Cortex A72/A73. On Cortex A53, it is a
handful of cycles faster.
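The pointer arithmetic behind the merge, for the qpel case, can be
checked with a small C model (a sketch only; MAX_PB_SIZE is assumed
to be 64 and the helper names are hypothetical). Two 16-byte stores
with post-increments of 16 and (stride - 16) advance the pointer
exactly as far as one 32-byte store with a post-increment of the
full stride:

```c
#include <stdint.h>

enum { MAX_PB_SIZE = 64 };  /* assumed value of FFmpeg's MAX_PB_SIZE */

/* Old sequence: two 16-byte stores per row, with split post-increments. */
static uintptr_t advance_split(uintptr_t dst)
{
    dst += 16;                          /* st1 {v26.8h}, [dst], #16 */
    dst += (MAX_PB_SIZE << 2) - 16;     /* st1 {v27.8h}, [dst], x14 */
    return dst;
}

/* New sequence: one 32-byte store with the full stride as post-increment. */
static uintptr_t advance_merged(uintptr_t dst)
{
    dst += MAX_PB_SIZE << 2;            /* st1 {v26.8h, v27.8h}, [dst], x14 */
    return dst;
}
```

Both helpers return the same address for any starting value, which
is why the "- 16" adjustment of x14 can be dropped.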
---
libavcodec/aarch64/hevcdsp_qpel_neon.S | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 815d897094..432558bb95 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -512,11 +512,10 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
.ifc \type, qpel
mov dststride, #(MAX_PB_SIZE << 1)
lsl x13, srcstride, #1 // srcstridel
- mov x14, #((MAX_PB_SIZE << 2) - 16)
+ mov x14, #(MAX_PB_SIZE << 2)
.else
lsl x14, dststride, #1 // dststridel
lsl x13, srcstride, #1 // srcstridel
- sub x14, x14, #8
.endif
add x10, dst, dststride // dstb
add x12, src, srcstride // srcb
@@ -527,10 +526,8 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
bl ff_hevc_put_hevc_h16_8_neon
.ifc \type, qpel
- st1 {v26.8h}, [dst], #16
- st1 {v28.8h}, [x10], #16
- st1 {v27.8h}, [dst], x14
- st1 {v29.8h}, [x10], x14
+ st1 {v26.8h, v27.8h}, [dst], x14
+ st1 {v28.8h, v29.8h}, [x10], x14
.else
.ifc \type, qpel_bi
ld1 {v16.8h, v17.8h}, [ x4], x16
@@ -549,10 +546,8 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
sqrshrun v28.8b, v28.8h, #6
sqrshrun v29.8b, v29.8h, #6
.endif
- st1 {v26.8b}, [dst], #8
- st1 {v28.8b}, [x10], #8
- st1 {v27.8b}, [dst], x14
- st1 {v29.8b}, [x10], x14
+ st1 {v26.8b, v27.8b}, [dst], x14
+ st1 {v28.8b, v29.8b}, [x10], x14
.endif
b.gt 1b // double line
subs width, width, #16
--
2.39.3 (Apple Git-146)
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 26+ messages in thread
* [FFmpeg-devel] [PATCH 04/21] aarch64: hevc: Specialize put_hevc_\type\()_h*_8_neon for horizontal looping
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
For widths of 32 pixels and more, loop first horizontally,
then vertically.
Previously, this function would process a 16 pixel wide slice
of the block, looping vertically. After processing the whole
height, it would backtrack and process the next 16 pixel wide
slice.
When doing 8-tap filtering horizontally, the function must load
7 more pixels (in practice, 8) following the actual inputs, and
this was done for each slice.
By iterating first horizontally throughout each line, then
vertically, we access data in a more cache-friendly order, and
we don't need to reload data unnecessarily.
Keep the original order in put_hevc_\type\()_h12_8_neon; the
only suboptimal case there is for width=24. But specializing
an optimal variant for that would require more code, which
might not be worth it.
For the h16 case, this implementation would give a slowdown,
as it now loads the first 8 pixels separately from the rest, but
for larger widths, it is a gain. Therefore, keep the h16 case
as it was (but remove the outer loop), and create a new specialized
version for horizontal looping with 16 pixels at a time.
Before:                    Cortex A53      A72      A73  Graviton 3
put_hevc_qpel_h16_8_neon:       710.5    667.7    692.5       211.0
put_hevc_qpel_h32_8_neon:      2791.5   2643.5   2732.0       883.5
put_hevc_qpel_h64_8_neon:     10954.0  10657.0  10874.2      3241.5
After:
put_hevc_qpel_h16_8_neon:       697.5    663.5    705.7       212.5
put_hevc_qpel_h32_8_neon:      2767.2   2684.5   2791.2       920.5
put_hevc_qpel_h64_8_neon:     10559.2  10471.5  10932.2      3051.7
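To make the reload argument concrete, here is a small C model of the
number of source bytes loaded under each loop order (an illustrative
sketch with hypothetical helper names; it only counts loads and does
not model caches or timing). It assumes 24 bytes are loaded per row
for each 16-pixel slice in the old order, and 8 bytes up front plus
16 bytes per horizontal step in the new order:

```c
enum { SLICE = 16, EXTRA = 8 };

/* Old order: each 16-pixel slice loads its own 8 extra bytes per row. */
static long bytes_loaded_vertical_first(int width, int height)
{
    return (long)(width / SLICE) * (SLICE + EXTRA) * height;
}

/* New order: 8 bytes loaded once per row, then 16 fresh bytes per step. */
static long bytes_loaded_horizontal_first(int width, int height)
{
    return (long)(EXTRA + width) * height;
}
```

For a 64x64 block this model gives 6144 bytes loaded in the old order
against 4608 in the new one; for width 16 the two are equal, which
is consistent with keeping the h16 case as it was.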
---
libavcodec/aarch64/hevcdsp_init_aarch64.c | 20 +++--
libavcodec/aarch64/hevcdsp_qpel_neon.S | 103 +++++++++++++++++-----
2 files changed, 94 insertions(+), 29 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index d2f2a3681f..1e9f5e32db 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -109,6 +109,8 @@ void ff_hevc_put_hevc_qpel_h12_8_neon(int16_t *dst, const uint8_t *_src, ptrdiff
intptr_t mx, intptr_t my, int width);
void ff_hevc_put_hevc_qpel_h16_8_neon(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
intptr_t mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_h32_8_neon(int16_t *dst, const uint8_t *_src, ptrdiff_t _srcstride, int height,
+ intptr_t mx, intptr_t my, int width);
void ff_hevc_put_hevc_qpel_uni_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t my,
int width);
@@ -124,6 +126,9 @@ void ff_hevc_put_hevc_qpel_uni_h12_8_neon(uint8_t *_dst, ptrdiff_t _dststride, c
void ff_hevc_put_hevc_qpel_uni_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t
my, int width);
+void ff_hevc_put_hevc_qpel_uni_h32_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+ ptrdiff_t _srcstride, int height, intptr_t mx, intptr_t
+ my, int width);
void ff_hevc_put_hevc_qpel_bi_h4_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
mx, intptr_t my, int width);
@@ -139,6 +144,9 @@ void ff_hevc_put_hevc_qpel_bi_h12_8_neon(uint8_t *_dst, ptrdiff_t _dststride, co
void ff_hevc_put_hevc_qpel_bi_h16_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
mx, intptr_t my, int width);
+void ff_hevc_put_hevc_qpel_bi_h32_8_neon(uint8_t *_dst, ptrdiff_t _dststride, const uint8_t *_src,
+ ptrdiff_t _srcstride, const int16_t *src2, int height, intptr_t
+ mx, intptr_t my, int width);
#define NEON8_FNPROTO(fn, args, ext) \
void ff_hevc_put_hevc_##fn##4_8_neon##ext args; \
@@ -335,28 +343,28 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
c->put_hevc_qpel[3][0][1] = ff_hevc_put_hevc_qpel_h8_8_neon;
c->put_hevc_qpel[4][0][1] =
c->put_hevc_qpel[6][0][1] = ff_hevc_put_hevc_qpel_h12_8_neon;
- c->put_hevc_qpel[5][0][1] =
+ c->put_hevc_qpel[5][0][1] = ff_hevc_put_hevc_qpel_h16_8_neon;
c->put_hevc_qpel[7][0][1] =
c->put_hevc_qpel[8][0][1] =
- c->put_hevc_qpel[9][0][1] = ff_hevc_put_hevc_qpel_h16_8_neon;
+ c->put_hevc_qpel[9][0][1] = ff_hevc_put_hevc_qpel_h32_8_neon;
c->put_hevc_qpel_uni[1][0][1] = ff_hevc_put_hevc_qpel_uni_h4_8_neon;
c->put_hevc_qpel_uni[2][0][1] = ff_hevc_put_hevc_qpel_uni_h6_8_neon;
c->put_hevc_qpel_uni[3][0][1] = ff_hevc_put_hevc_qpel_uni_h8_8_neon;
c->put_hevc_qpel_uni[4][0][1] =
c->put_hevc_qpel_uni[6][0][1] = ff_hevc_put_hevc_qpel_uni_h12_8_neon;
- c->put_hevc_qpel_uni[5][0][1] =
+ c->put_hevc_qpel_uni[5][0][1] = ff_hevc_put_hevc_qpel_uni_h16_8_neon;
c->put_hevc_qpel_uni[7][0][1] =
c->put_hevc_qpel_uni[8][0][1] =
- c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_qpel_uni_h16_8_neon;
+ c->put_hevc_qpel_uni[9][0][1] = ff_hevc_put_hevc_qpel_uni_h32_8_neon;
c->put_hevc_qpel_bi[1][0][1] = ff_hevc_put_hevc_qpel_bi_h4_8_neon;
c->put_hevc_qpel_bi[2][0][1] = ff_hevc_put_hevc_qpel_bi_h6_8_neon;
c->put_hevc_qpel_bi[3][0][1] = ff_hevc_put_hevc_qpel_bi_h8_8_neon;
c->put_hevc_qpel_bi[4][0][1] =
c->put_hevc_qpel_bi[6][0][1] = ff_hevc_put_hevc_qpel_bi_h12_8_neon;
- c->put_hevc_qpel_bi[5][0][1] =
+ c->put_hevc_qpel_bi[5][0][1] = ff_hevc_put_hevc_qpel_bi_h16_8_neon;
c->put_hevc_qpel_bi[7][0][1] =
c->put_hevc_qpel_bi[8][0][1] =
- c->put_hevc_qpel_bi[9][0][1] = ff_hevc_put_hevc_qpel_bi_h16_8_neon;
+ c->put_hevc_qpel_bi[9][0][1] = ff_hevc_put_hevc_qpel_bi_h32_8_neon;
NEON8_FNASSIGN(c->put_hevc_epel, 0, 0, pel_pixels,);
NEON8_FNASSIGN(c->put_hevc_epel, 1, 0, epel_v,);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 432558bb95..0fcded344b 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -383,11 +383,9 @@ endfunc
.ifc \type, qpel
function ff_hevc_put_hevc_h16_8_neon, export=0
- uxtl v16.8h, v16.8b
uxtl v17.8h, v17.8b
uxtl v18.8h, v18.8b
- uxtl v19.8h, v19.8b
uxtl v20.8h, v20.8b
uxtl v21.8h, v21.8b
@@ -408,7 +406,6 @@ function ff_hevc_put_hevc_h16_8_neon, export=0
mla v28.8h, v24.8h, v0.h[\i]
mla v29.8h, v25.8h, v0.h[\i]
.endr
- subs x9, x9, #2
ret
endfunc
.endif
@@ -439,7 +436,10 @@ function ff_hevc_put_hevc_\type\()_h12_8_neon, export=1
1: ld1 {v16.8b-v18.8b}, [src], x13
ld1 {v19.8b-v21.8b}, [x12], x13
+ uxtl v16.8h, v16.8b
+ uxtl v19.8h, v19.8b
bl ff_hevc_put_hevc_h16_8_neon
+ subs x9, x9, #2
.ifc \type, qpel
st1 {v26.8h}, [dst], #16
@@ -504,7 +504,6 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
.ifc \type, qpel_bi
ldrh w8, [sp] // width
mov x16, #(MAX_PB_SIZE << 2) // src2bstridel
- lsl x17, x5, #7 // src2b reset
add x15, x4, #(MAX_PB_SIZE << 1) // src2b
.endif
sub src, src, #3
@@ -519,11 +518,14 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
.endif
add x10, dst, dststride // dstb
add x12, src, srcstride // srcb
-0: mov x9, height
+
1: ld1 {v16.8b-v18.8b}, [src], x13
ld1 {v19.8b-v21.8b}, [x12], x13
+ uxtl v16.8h, v16.8b
+ uxtl v19.8h, v19.8b
bl ff_hevc_put_hevc_h16_8_neon
+ subs height, height, #2
.ifc \type, qpel
st1 {v26.8h, v27.8h}, [dst], x14
@@ -550,28 +552,83 @@ function ff_hevc_put_hevc_\type\()_h16_8_neon, export=1
st1 {v28.8b, v29.8b}, [x10], x14
.endif
b.gt 1b // double line
- subs width, width, #16
- // reset src
- msub src, srcstride, height, src
- msub x12, srcstride, height, x12
- // reset dst
- msub dst, dststride, height, dst
- msub x10, dststride, height, x10
+ ret mx
+endfunc
+
+function ff_hevc_put_hevc_\type\()_h32_8_neon, export=1
+ load_filter mx
+ sxtw height, heightw
+ mov mx, x30
.ifc \type, qpel_bi
- // reset xsrc
- sub x4, x4, x17
- sub x15, x15, x17
- add x4, x4, #32
- add x15, x15, #32
+ ldrh w8, [sp] // width
+ mov x16, #(MAX_PB_SIZE << 2) // src2bstridel
+ lsl x17, x5, #7 // src2b reset
+ add x15, x4, #(MAX_PB_SIZE << 1) // src2b
+ sub x16, x16, width, uxtw #1
.endif
- add src, src, #16
- add x12, x12, #16
+ sub src, src, #3
+ mov mx, x30
+.ifc \type, qpel
+ mov dststride, #(MAX_PB_SIZE << 1)
+ lsl x13, srcstride, #1 // srcstridel
+ mov x14, #(MAX_PB_SIZE << 2)
+ sub x14, x14, width, uxtw #1
+.else
+ lsl x14, dststride, #1 // dststridel
+ lsl x13, srcstride, #1 // srcstridel
+ sub x14, x14, width, uxtw
+.endif
+ sub x13, x13, width, uxtw
+ sub x13, x13, #8
+ add x10, dst, dststride // dstb
+ add x12, src, srcstride // srcb
+0: mov w9, width
+ ld1 {v16.8b}, [src], #8
+ ld1 {v19.8b}, [x12], #8
+ uxtl v16.8h, v16.8b
+ uxtl v19.8h, v19.8b
+1:
+ ld1 {v17.8b-v18.8b}, [src], #16
+ ld1 {v20.8b-v21.8b}, [x12], #16
+
+ bl ff_hevc_put_hevc_h16_8_neon
+ subs w9, w9, #16
+
+ mov v16.16b, v18.16b
+ mov v19.16b, v21.16b
.ifc \type, qpel
- add dst, dst, #32
- add x10, x10, #32
+ st1 {v26.8h, v27.8h}, [dst], #32
+ st1 {v28.8h, v29.8h}, [x10], #32
+.else
+.ifc \type, qpel_bi
+ ld1 {v20.8h, v21.8h}, [ x4], #32
+ ld1 {v22.8h, v23.8h}, [x15], #32
+ sqadd v26.8h, v26.8h, v20.8h
+ sqadd v27.8h, v27.8h, v21.8h
+ sqadd v28.8h, v28.8h, v22.8h
+ sqadd v29.8h, v29.8h, v23.8h
+ sqrshrun v26.8b, v26.8h, #7
+ sqrshrun v27.8b, v27.8h, #7
+ sqrshrun v28.8b, v28.8h, #7
+ sqrshrun v29.8b, v29.8h, #7
.else
- add dst, dst, #16
- add x10, x10, #16
+ sqrshrun v26.8b, v26.8h, #6
+ sqrshrun v27.8b, v27.8h, #6
+ sqrshrun v28.8b, v28.8h, #6
+ sqrshrun v29.8b, v29.8h, #6
+.endif
+ st1 {v26.8b, v27.8b}, [dst], #16
+ st1 {v28.8b, v29.8b}, [x10], #16
+.endif
+ b.gt 1b // double line
+ subs height, height, #2
+ add src, src, x13
+ add x12, x12, x13
+ add dst, dst, x14
+ add x10, x10, x14
+.ifc \type, qpel_bi
+ add x4, x4, x16
+ add x15, x15, x16
.endif
b.gt 0b
ret mx
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 05/21] aarch64: hevc: Use ld1r instead of ldr+dup in hevc_qpel_uni_w_h
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
---
libavcodec/aarch64/hevcdsp_qpel_neon.S | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 0fcded344b..062b7d4d0f 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2462,8 +2462,7 @@ endfunc
sub x2, x2, #3
movrel x9, qpel_filters
add x9, x9, x12, lsl #3
- ldr x11, [x9]
- dup v28.2d, x11
+ ld1r {v28.2d}, [x9]
mov w10, #-6
sub w10, w10, w5
dup v30.4s, w6 // wx
--
2.39.3 (Apple Git-146)
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 26+ messages in thread
* [FFmpeg-devel] [PATCH 06/21] aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
AWS Graviton 3:
put_hevc_epel_h4_8_c: 64.7
put_hevc_epel_h4_8_neon: 25.0
put_hevc_epel_h4_8_i8mm: 21.2
put_hevc_epel_h6_8_c: 130.0
put_hevc_epel_h6_8_neon: 40.7
put_hevc_epel_h6_8_i8mm: 36.5
put_hevc_epel_h8_8_c: 209.0
put_hevc_epel_h8_8_neon: 45.2
put_hevc_epel_h8_8_i8mm: 41.2
put_hevc_epel_h12_8_c: 465.5
put_hevc_epel_h12_8_neon: 104.5
put_hevc_epel_h12_8_i8mm: 86.5
put_hevc_epel_h16_8_c: 830.7
put_hevc_epel_h16_8_neon: 134.2
put_hevc_epel_h16_8_i8mm: 114.0
put_hevc_epel_h24_8_c: 1844.7
put_hevc_epel_h24_8_neon: 282.2
put_hevc_epel_h24_8_i8mm: 277.2
put_hevc_epel_h32_8_c: 3227.5
put_hevc_epel_h32_8_neon: 501.5
put_hevc_epel_h32_8_i8mm: 396.0
put_hevc_epel_h48_8_c: 7229.2
put_hevc_epel_h48_8_neon: 1120.2
put_hevc_epel_h48_8_i8mm: 901.2
put_hevc_epel_h64_8_c: 12869.0
put_hevc_epel_h64_8_neon: 1999.2
put_hevc_epel_h64_8_i8mm: 1610.5
---
libavcodec/aarch64/hevcdsp_epel_neon.S | 194 +++++++++++++++++++++-
libavcodec/aarch64/hevcdsp_init_aarch64.c | 17 ++
2 files changed, 209 insertions(+), 2 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index d3f0a26f79..419e83529a 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -1321,8 +1321,6 @@ function ff_hevc_put_hevc_epel_uni_v64_8_neon, export=1
ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
.macro EPEL_H_HEADER
movrel x5, epel_filters
@@ -1332,6 +1330,198 @@ ENABLE_I8MM
mov x10, #(MAX_PB_SIZE * 2)
.endm
+function ff_hevc_put_hevc_epel_h4_8_neon, export=1
+ EPEL_H_HEADER
+ sxtl v0.8h, v30.8b
+1: ld1 {v4.8b}, [x1], x2
+ subs w3, w3, #1 // height
+ uxtl v4.8h, v4.8b
+ ext v5.16b, v4.16b, v4.16b, #2
+ ext v6.16b, v4.16b, v4.16b, #4
+ ext v7.16b, v4.16b, v4.16b, #6
+ mul v16.4h, v4.4h, v0.h[0]
+ mla v16.4h, v5.4h, v0.h[1]
+ mla v16.4h, v6.4h, v0.h[2]
+ mla v16.4h, v7.4h, v0.h[3]
+ st1 {v16.4h}, [x0], x10
+ b.ne 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h6_8_neon, export=1
+ EPEL_H_HEADER
+ sxtl v0.8h, v30.8b
+ add x6, x0, #8
+1: ld1 {v3.16b}, [x1], x2
+ subs w3, w3, #1 // height
+ uxtl2 v4.8h, v3.16b
+ uxtl v3.8h, v3.8b
+ ext v5.16b, v3.16b, v4.16b, #2
+ ext v6.16b, v3.16b, v4.16b, #4
+ ext v7.16b, v3.16b, v4.16b, #6
+ mul v16.8h, v3.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ st1 {v16.4h}, [x0], x10
+ st1 {v16.s}[2], [x6], x10
+ b.ne 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h8_8_neon, export=1
+ EPEL_H_HEADER
+ sxtl v0.8h, v30.8b
+1: ld1 {v3.16b}, [x1], x2
+ subs w3, w3, #1 // height
+ uxtl2 v4.8h, v3.16b
+ uxtl v3.8h, v3.8b
+ ext v5.16b, v3.16b, v4.16b, #2
+ ext v6.16b, v3.16b, v4.16b, #4
+ ext v7.16b, v3.16b, v4.16b, #6
+ mul v16.8h, v3.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ st1 {v16.8h}, [x0], x10
+ b.ne 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h12_8_neon, export=1
+ EPEL_H_HEADER
+ add x6, x0, #16
+ sxtl v0.8h, v30.8b
+1: ld1 {v3.16b}, [x1], x2
+ subs w3, w3, #1 // height
+ uxtl2 v4.8h, v3.16b
+ uxtl v3.8h, v3.8b
+ ext v5.16b, v3.16b, v4.16b, #2
+ ext v6.16b, v3.16b, v4.16b, #4
+ ext v7.16b, v3.16b, v4.16b, #6
+ ext v20.16b, v4.16b, v4.16b, #2
+ ext v21.16b, v4.16b, v4.16b, #4
+ ext v22.16b, v4.16b, v4.16b, #6
+ mul v16.8h, v3.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ mul v17.4h, v4.4h, v0.h[0]
+ mla v17.4h, v20.4h, v0.h[1]
+ mla v17.4h, v21.4h, v0.h[2]
+ mla v17.4h, v22.4h, v0.h[3]
+ st1 {v16.8h}, [x0], x10
+ st1 {v17.4h}, [x6], x10
+ b.ne 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h16_8_neon, export=1
+ EPEL_H_HEADER
+ sxtl v0.8h, v30.8b
+1: ld1 {v1.8b, v2.8b, v3.8b}, [x1], x2
+ subs w3, w3, #1 // height
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ ext v5.16b, v1.16b, v2.16b, #2
+ ext v6.16b, v1.16b, v2.16b, #4
+ ext v7.16b, v1.16b, v2.16b, #6
+ ext v20.16b, v2.16b, v3.16b, #2
+ ext v21.16b, v2.16b, v3.16b, #4
+ ext v22.16b, v2.16b, v3.16b, #6
+ mul v16.8h, v1.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ mul v17.8h, v2.8h, v0.h[0]
+ mla v17.8h, v20.8h, v0.h[1]
+ mla v17.8h, v21.8h, v0.h[2]
+ mla v17.8h, v22.8h, v0.h[3]
+ st1 {v16.8h, v17.8h}, [x0], x10
+ b.ne 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h24_8_neon, export=1
+ EPEL_H_HEADER
+ sxtl v0.8h, v30.8b
+1: ld1 {v1.8b, v2.8b, v3.8b, v4.8b}, [x1], x2
+ subs w3, w3, #1 // height
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ uxtl v4.8h, v4.8b
+ ext v5.16b, v1.16b, v2.16b, #2
+ ext v6.16b, v1.16b, v2.16b, #4
+ ext v7.16b, v1.16b, v2.16b, #6
+ ext v20.16b, v2.16b, v3.16b, #2
+ ext v21.16b, v2.16b, v3.16b, #4
+ ext v22.16b, v2.16b, v3.16b, #6
+ ext v23.16b, v3.16b, v4.16b, #2
+ ext v24.16b, v3.16b, v4.16b, #4
+ ext v25.16b, v3.16b, v4.16b, #6
+ mul v16.8h, v1.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ mul v17.8h, v2.8h, v0.h[0]
+ mla v17.8h, v20.8h, v0.h[1]
+ mla v17.8h, v21.8h, v0.h[2]
+ mla v17.8h, v22.8h, v0.h[3]
+ mul v18.8h, v3.8h, v0.h[0]
+ mla v18.8h, v23.8h, v0.h[1]
+ mla v18.8h, v24.8h, v0.h[2]
+ mla v18.8h, v25.8h, v0.h[3]
+ st1 {v16.8h, v17.8h, v18.8h}, [x0], x10
+ b.ne 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_h32_8_neon, export=1
+ EPEL_H_HEADER
+ ld1 {v1.8b}, [x1], #8
+ sub x2, x2, w6, uxtw // decrement src stride
+ mov w7, w6 // original width
+ sub x2, x2, #8 // decrement src stride
+ sub x10, x10, w6, uxtw #1 // decrement dst stride
+ sxtl v0.8h, v30.8b
+ uxtl v1.8h, v1.8b
+1: ld1 {v2.8b, v3.8b}, [x1], #16
+ subs w6, w6, #16 // width
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ ext v5.16b, v1.16b, v2.16b, #2
+ ext v6.16b, v1.16b, v2.16b, #4
+ ext v7.16b, v1.16b, v2.16b, #6
+ ext v20.16b, v2.16b, v3.16b, #2
+ ext v21.16b, v2.16b, v3.16b, #4
+ ext v22.16b, v2.16b, v3.16b, #6
+ mul v16.8h, v1.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ mul v17.8h, v2.8h, v0.h[0]
+ mla v17.8h, v20.8h, v0.h[1]
+ mla v17.8h, v21.8h, v0.h[2]
+ mla v17.8h, v22.8h, v0.h[3]
+ st1 {v16.8h, v17.8h}, [x0], #32
+ mov v1.16b, v3.16b
+ b.gt 1b
+ subs w3, w3, #1 // height
+ add x1, x1, x2
+ b.le 9f
+ ld1 {v1.8b}, [x1], #8
+ mov w6, w7
+ add x0, x0, x10
+ uxtl v1.8h, v1.8b
+ b 1b
+9:
+ ret
+endfunc
+
+#if HAVE_I8MM
+ENABLE_I8MM
function ff_hevc_put_hevc_epel_h4_8_neon_i8mm, export=1
EPEL_H_HEADER
1: ld1 {v4.8b}, [x1], x2
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 1e9f5e32db..ece911b8d4 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -223,6 +223,10 @@ NEON8_FNPROTO_PARTIAL_4(qpel_uni_w_v, (uint8_t *_dst, ptrdiff_t _dststride,
int height, int denom, int wx, int ox,
intptr_t mx, intptr_t my, int width),);
+NEON8_FNPROTO(epel_h, (int16_t *dst,
+ const uint8_t *_src, ptrdiff_t _srcstride,
+ int height, intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO(epel_h, (int16_t *dst,
const uint8_t *_src, ptrdiff_t _srcstride,
int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -290,6 +294,17 @@ NEON8_FNPROTO(qpel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
member[8][v][h] = ff_hevc_put_hevc_##fn##48_8_neon##ext; \
member[9][v][h] = ff_hevc_put_hevc_##fn##64_8_neon##ext;
+#define NEON8_FNASSIGN_SHARED_32(member, v, h, fn, ext) \
+ member[1][v][h] = ff_hevc_put_hevc_##fn##4_8_neon##ext; \
+ member[2][v][h] = ff_hevc_put_hevc_##fn##6_8_neon##ext; \
+ member[3][v][h] = ff_hevc_put_hevc_##fn##8_8_neon##ext; \
+ member[4][v][h] = ff_hevc_put_hevc_##fn##12_8_neon##ext; \
+ member[5][v][h] = ff_hevc_put_hevc_##fn##16_8_neon##ext; \
+ member[6][v][h] = ff_hevc_put_hevc_##fn##24_8_neon##ext; \
+ member[7][v][h] = \
+ member[8][v][h] = \
+ member[9][v][h] = ff_hevc_put_hevc_##fn##32_8_neon##ext;
+
#define NEON8_FNASSIGN_PARTIAL_4(member, v, h, fn, ext) \
member[1][v][h] = ff_hevc_put_hevc_##fn##4_8_neon##ext; \
member[3][v][h] = ff_hevc_put_hevc_##fn##8_8_neon##ext; \
@@ -384,6 +399,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 0, epel_uni_w_v,);
NEON8_FNASSIGN_PARTIAL_4(c->put_hevc_qpel_uni_w, 1, 0, qpel_uni_w_v,);
+ NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel, 0, 1, epel_h,);
+
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 07/21] aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (5 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 06/21] aarch64: hevc: Implement a neon version of put_hevc_epel_h*_8 Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 08/21] aarch64: hevc: Split the epel_*_hv functions into two parts Martin Storsjö
` (14 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
AWS Graviton 3:
put_hevc_epel_uni_w_h4_8_c: 97.2
put_hevc_epel_uni_w_h4_8_neon: 41.2
put_hevc_epel_uni_w_h4_8_i8mm: 35.2
put_hevc_epel_uni_w_h6_8_c: 203.7
put_hevc_epel_uni_w_h6_8_neon: 84.7
put_hevc_epel_uni_w_h6_8_i8mm: 74.7
put_hevc_epel_uni_w_h8_8_c: 345.7
put_hevc_epel_uni_w_h8_8_neon: 94.0
put_hevc_epel_uni_w_h8_8_i8mm: 80.7
put_hevc_epel_uni_w_h12_8_c: 768.7
put_hevc_epel_uni_w_h12_8_neon: 196.7
put_hevc_epel_uni_w_h12_8_i8mm: 169.7
put_hevc_epel_uni_w_h16_8_c: 1313.0
put_hevc_epel_uni_w_h16_8_neon: 290.7
put_hevc_epel_uni_w_h16_8_i8mm: 238.0
put_hevc_epel_uni_w_h24_8_c: 2877.5
put_hevc_epel_uni_w_h24_8_neon: 650.0
put_hevc_epel_uni_w_h24_8_i8mm: 512.0
put_hevc_epel_uni_w_h32_8_c: 5113.5
put_hevc_epel_uni_w_h32_8_neon: 1129.5
put_hevc_epel_uni_w_h32_8_i8mm: 739.2
put_hevc_epel_uni_w_h48_8_c: 11757.0
put_hevc_epel_uni_w_h48_8_neon: 2518.7
put_hevc_epel_uni_w_h48_8_i8mm: 1688.5
put_hevc_epel_uni_w_h64_8_c: 20478.0
put_hevc_epel_uni_w_h64_8_neon: 4411.7
put_hevc_epel_uni_w_h64_8_i8mm: 2884.0
---
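The uni_w functions below run the same 4-tap horizontal filter and then apply weighted prediction (smull, sqrshl, sqadd, then the narrowing steps). A scalar sketch of that tail for a single 8-bit sample follows. The helper names are hypothetical, and the mapping to instructions is my reading of the assembly in this patch rather than something the commit message states.

```c
#include <assert.h>
#include <stdint.h>

/* Clamp to the 8-bit pixel range (the sqxtn + sqxtun pair in the asm). */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical model of the weighting tail applied to one filtered sample. */
static uint8_t epel_uni_w_weight(int16_t filtered, int denom, int wx, int ox)
{
    int shift = denom + 6; /* the header loads -(6 + denom) as the shift amount */
    assert(shift >= 1 && shift < 31);
    int32_t prod = (int32_t)filtered * wx; /* smull: widen to 32 bits */
    /* sqrshl with a negative amount is a rounding right shift. This assumes
     * an arithmetic >> on negative values, as on all common compilers. */
    int32_t rounded = (prod + (1 << (shift - 1))) >> shift;
    return clip_u8(rounded + ox); /* sqadd, then narrow with saturation */
}
```

With denom = 0, wx = 1 and ox = 0 this reduces to a rounded division by 64, which undoes the gain of the filter taps.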
libavcodec/aarch64/hevcdsp_epel_neon.S | 326 +++++++++++++++++++++-
libavcodec/aarch64/hevcdsp_init_aarch64.c | 6 +
2 files changed, 319 insertions(+), 13 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 419e83529a..0e49491a81 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -1520,6 +1520,319 @@ function ff_hevc_put_hevc_epel_h32_8_neon, export=1
ret
endfunc
+.macro EPEL_UNI_W_H_HEADER elems=4s
+ ldr x12, [sp]
+ sub x2, x2, #1
+ movrel x9, epel_filters
+ add x9, x9, x12, lsl #2
+ ld1r {v28.4s}, [x9]
+ mov w10, #-6
+ sub w10, w10, w5
+ dup v30.\elems, w6
+ dup v31.4s, w10
+ dup v29.4s, w7
+.endm
+
+function ff_hevc_put_hevc_epel_uni_w_h4_8_neon, export=1
+ EPEL_UNI_W_H_HEADER 4h
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v4.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v4.8h, v4.8b
+ ext v5.16b, v4.16b, v4.16b, #2
+ ext v6.16b, v4.16b, v4.16b, #4
+ ext v7.16b, v4.16b, v4.16b, #6
+ mul v16.4h, v4.4h, v0.h[0]
+ mla v16.4h, v5.4h, v0.h[1]
+ mla v16.4h, v6.4h, v0.h[2]
+ mla v16.4h, v7.4h, v0.h[3]
+ smull v16.4s, v16.4h, v30.4h
+ sqrshl v16.4s, v16.4s, v31.4s
+ sqadd v16.4s, v16.4s, v29.4s
+ sqxtn v16.4h, v16.4s
+ sqxtun v16.8b, v16.8h
+ str s16, [x0]
+ add x0, x0, x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h6_8_neon, export=1
+ EPEL_UNI_W_H_HEADER 8h
+ sub x1, x1, #4
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v3.8b, v4.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v3.8h, v3.8b
+ uxtl v4.8h, v4.8b
+ ext v5.16b, v3.16b, v4.16b, #2
+ ext v6.16b, v3.16b, v4.16b, #4
+ ext v7.16b, v3.16b, v4.16b, #6
+ mul v16.8h, v3.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ smull v17.4s, v16.4h, v30.4h
+ smull2 v18.4s, v16.8h, v30.8h
+ sqrshl v17.4s, v17.4s, v31.4s
+ sqrshl v18.4s, v18.4s, v31.4s
+ sqadd v17.4s, v17.4s, v29.4s
+ sqadd v18.4s, v18.4s, v29.4s
+ sqxtn v16.4h, v17.4s
+ sqxtn2 v16.8h, v18.4s
+ sqxtun v16.8b, v16.8h
+ str s16, [x0], #4
+ st1 {v16.h}[2], [x0], x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h8_8_neon, export=1
+ EPEL_UNI_W_H_HEADER 8h
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v3.8b, v4.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v3.8h, v3.8b
+ uxtl v4.8h, v4.8b
+ ext v5.16b, v3.16b, v4.16b, #2
+ ext v6.16b, v3.16b, v4.16b, #4
+ ext v7.16b, v3.16b, v4.16b, #6
+ mul v16.8h, v3.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ smull v17.4s, v16.4h, v30.4h
+ smull2 v18.4s, v16.8h, v30.8h
+ sqrshl v17.4s, v17.4s, v31.4s
+ sqrshl v18.4s, v18.4s, v31.4s
+ sqadd v17.4s, v17.4s, v29.4s
+ sqadd v18.4s, v18.4s, v29.4s
+ sqxtn v16.4h, v17.4s
+ sqxtn2 v16.8h, v18.4s
+ sqxtun v16.8b, v16.8h
+ st1 {v16.8b}, [x0], x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h12_8_neon, export=1
+ EPEL_UNI_W_H_HEADER 8h
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v3.8b, v4.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v3.8h, v3.8b
+ uxtl v4.8h, v4.8b
+ ext v5.16b, v3.16b, v4.16b, #2
+ ext v6.16b, v3.16b, v4.16b, #4
+ ext v7.16b, v3.16b, v4.16b, #6
+ ext v20.16b, v4.16b, v4.16b, #2
+ ext v21.16b, v4.16b, v4.16b, #4
+ ext v22.16b, v4.16b, v4.16b, #6
+ mul v16.8h, v3.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ mul v17.4h, v4.4h, v0.h[0]
+ mla v17.4h, v20.4h, v0.h[1]
+ mla v17.4h, v21.4h, v0.h[2]
+ mla v17.4h, v22.4h, v0.h[3]
+ smull v18.4s, v16.4h, v30.4h
+ smull2 v19.4s, v16.8h, v30.8h
+ smull v20.4s, v17.4h, v30.4h
+ sqrshl v18.4s, v18.4s, v31.4s
+ sqrshl v19.4s, v19.4s, v31.4s
+ sqrshl v20.4s, v20.4s, v31.4s
+ sqadd v18.4s, v18.4s, v29.4s
+ sqadd v19.4s, v19.4s, v29.4s
+ sqadd v20.4s, v20.4s, v29.4s
+ sqxtn v16.4h, v18.4s
+ sqxtn2 v16.8h, v19.4s
+ sqxtn v17.4h, v20.4s
+ sqxtun v16.8b, v16.8h
+ sqxtun v17.8b, v17.8h
+ str d16, [x0]
+ str s17, [x0, #8]
+ add x0, x0, x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h16_8_neon, export=1
+ EPEL_UNI_W_H_HEADER 8h
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v1.8b, v2.8b, v3.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ ext v5.16b, v1.16b, v2.16b, #2
+ ext v6.16b, v1.16b, v2.16b, #4
+ ext v7.16b, v1.16b, v2.16b, #6
+ ext v20.16b, v2.16b, v3.16b, #2
+ ext v21.16b, v2.16b, v3.16b, #4
+ ext v22.16b, v2.16b, v3.16b, #6
+ mul v16.8h, v1.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ mul v17.8h, v2.8h, v0.h[0]
+ mla v17.8h, v20.8h, v0.h[1]
+ mla v17.8h, v21.8h, v0.h[2]
+ mla v17.8h, v22.8h, v0.h[3]
+ smull v18.4s, v16.4h, v30.4h
+ smull2 v19.4s, v16.8h, v30.8h
+ smull v20.4s, v17.4h, v30.4h
+ smull2 v21.4s, v17.8h, v30.8h
+ sqrshl v18.4s, v18.4s, v31.4s
+ sqrshl v19.4s, v19.4s, v31.4s
+ sqrshl v20.4s, v20.4s, v31.4s
+ sqrshl v21.4s, v21.4s, v31.4s
+ sqadd v18.4s, v18.4s, v29.4s
+ sqadd v19.4s, v19.4s, v29.4s
+ sqadd v20.4s, v20.4s, v29.4s
+ sqadd v21.4s, v21.4s, v29.4s
+ sqxtn v16.4h, v18.4s
+ sqxtn2 v16.8h, v19.4s
+ sqxtn v17.4h, v20.4s
+ sqxtn2 v17.8h, v21.4s
+ sqxtun v16.8b, v16.8h
+ sqxtun v17.8b, v17.8h
+ st1 {v16.8b, v17.8b}, [x0], x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h24_8_neon, export=1
+ EPEL_UNI_W_H_HEADER 8h
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v1.8b, v2.8b, v3.8b, v4.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ uxtl v4.8h, v4.8b
+ ext v5.16b, v1.16b, v2.16b, #2
+ ext v6.16b, v1.16b, v2.16b, #4
+ ext v7.16b, v1.16b, v2.16b, #6
+ ext v20.16b, v2.16b, v3.16b, #2
+ ext v21.16b, v2.16b, v3.16b, #4
+ ext v22.16b, v2.16b, v3.16b, #6
+ ext v23.16b, v3.16b, v4.16b, #2
+ ext v24.16b, v3.16b, v4.16b, #4
+ ext v25.16b, v3.16b, v4.16b, #6
+ mul v16.8h, v1.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ mul v17.8h, v2.8h, v0.h[0]
+ mla v17.8h, v20.8h, v0.h[1]
+ mla v17.8h, v21.8h, v0.h[2]
+ mla v17.8h, v22.8h, v0.h[3]
+ mul v18.8h, v3.8h, v0.h[0]
+ mla v18.8h, v23.8h, v0.h[1]
+ mla v18.8h, v24.8h, v0.h[2]
+ mla v18.8h, v25.8h, v0.h[3]
+ smull v20.4s, v16.4h, v30.4h
+ smull2 v21.4s, v16.8h, v30.8h
+ smull v22.4s, v17.4h, v30.4h
+ smull2 v23.4s, v17.8h, v30.8h
+ smull v24.4s, v18.4h, v30.4h
+ smull2 v25.4s, v18.8h, v30.8h
+ sqrshl v20.4s, v20.4s, v31.4s
+ sqrshl v21.4s, v21.4s, v31.4s
+ sqrshl v22.4s, v22.4s, v31.4s
+ sqrshl v23.4s, v23.4s, v31.4s
+ sqrshl v24.4s, v24.4s, v31.4s
+ sqrshl v25.4s, v25.4s, v31.4s
+ sqadd v20.4s, v20.4s, v29.4s
+ sqadd v21.4s, v21.4s, v29.4s
+ sqadd v22.4s, v22.4s, v29.4s
+ sqadd v23.4s, v23.4s, v29.4s
+ sqadd v24.4s, v24.4s, v29.4s
+ sqadd v25.4s, v25.4s, v29.4s
+ sqxtn v16.4h, v20.4s
+ sqxtn2 v16.8h, v21.4s
+ sqxtn v17.4h, v22.4s
+ sqxtn2 v17.8h, v23.4s
+ sqxtn v18.4h, v24.4s
+ sqxtn2 v18.8h, v25.4s
+ sqxtun v16.8b, v16.8h
+ sqxtun v17.8b, v17.8h
+ sqxtun v18.8b, v18.8h
+ st1 {v16.8b, v17.8b, v18.8b}, [x0], x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_h32_8_neon, export=1
+ EPEL_UNI_W_H_HEADER 8h
+ ldr w10, [sp, #16] // width
+ ld1 {v1.8b}, [x2], #8
+ sub x3, x3, w10, uxtw // decrement src stride
+ mov w11, w10 // original width
+ sub x3, x3, #8 // decrement src stride
+ sub x1, x1, w10, uxtw // decrement dst stride
+ sxtl v0.8h, v28.8b
+ uxtl v1.8h, v1.8b
+1:
+ ld1 {v2.8b, v3.8b}, [x2], #16
+ subs w10, w10, #16 // width
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ ext v5.16b, v1.16b, v2.16b, #2
+ ext v6.16b, v1.16b, v2.16b, #4
+ ext v7.16b, v1.16b, v2.16b, #6
+ ext v20.16b, v2.16b, v3.16b, #2
+ ext v21.16b, v2.16b, v3.16b, #4
+ ext v22.16b, v2.16b, v3.16b, #6
+ mul v16.8h, v1.8h, v0.h[0]
+ mla v16.8h, v5.8h, v0.h[1]
+ mla v16.8h, v6.8h, v0.h[2]
+ mla v16.8h, v7.8h, v0.h[3]
+ mul v17.8h, v2.8h, v0.h[0]
+ mla v17.8h, v20.8h, v0.h[1]
+ mla v17.8h, v21.8h, v0.h[2]
+ mla v17.8h, v22.8h, v0.h[3]
+ smull v18.4s, v16.4h, v30.4h
+ smull2 v19.4s, v16.8h, v30.8h
+ smull v20.4s, v17.4h, v30.4h
+ smull2 v21.4s, v17.8h, v30.8h
+ sqrshl v18.4s, v18.4s, v31.4s
+ sqrshl v19.4s, v19.4s, v31.4s
+ sqrshl v20.4s, v20.4s, v31.4s
+ sqrshl v21.4s, v21.4s, v31.4s
+ sqadd v18.4s, v18.4s, v29.4s
+ sqadd v19.4s, v19.4s, v29.4s
+ sqadd v20.4s, v20.4s, v29.4s
+ sqadd v21.4s, v21.4s, v29.4s
+ sqxtn v16.4h, v18.4s
+ sqxtn2 v16.8h, v19.4s
+ sqxtn v17.4h, v20.4s
+ sqxtn2 v17.8h, v21.4s
+ sqxtun v16.8b, v16.8h
+ sqxtun v17.8b, v17.8h
+ st1 {v16.8b, v17.8b}, [x0], #16
+ mov v1.16b, v3.16b
+ b.gt 1b
+ subs w4, w4, #1 // height
+ add x2, x2, x3
+ b.le 9f
+ ld1 {v1.8b}, [x2], #8
+ mov w10, w11
+ add x0, x0, x1
+ uxtl v1.8h, v1.8b
+ b 1b
+9:
+ ret
+endfunc
+
+
#if HAVE_I8MM
ENABLE_I8MM
function ff_hevc_put_hevc_epel_h4_8_neon_i8mm, export=1
@@ -2410,19 +2723,6 @@ function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
ret
endfunc
-.macro EPEL_UNI_W_H_HEADER
- ldr x12, [sp]
- sub x2, x2, #1
- movrel x9, epel_filters
- add x9, x9, x12, lsl #2
- ld1r {v28.4s}, [x9]
- mov w10, #-6
- sub w10, w10, w5
- dup v30.4s, w6
- dup v31.4s, w10
- dup v29.4s, w7
-.endm
-
function ff_hevc_put_hevc_epel_uni_w_h4_8_neon_i8mm, export=1
EPEL_UNI_W_H_HEADER
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index ece911b8d4..be24737c9c 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -235,6 +235,11 @@ NEON8_FNPROTO(epel_hv, (int16_t *dst,
const uint8_t *src, ptrdiff_t srcstride,
int height, intptr_t mx, intptr_t my, int width), _i8mm);
+NEON8_FNPROTO(epel_uni_w_h, (uint8_t *_dst, ptrdiff_t _dststride,
+ const uint8_t *_src, ptrdiff_t _srcstride,
+ int height, int denom, int wx, int ox,
+ intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO(epel_uni_w_h, (uint8_t *_dst, ptrdiff_t _dststride,
const uint8_t *_src, ptrdiff_t _srcstride,
int height, int denom, int wx, int ox,
@@ -400,6 +405,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN_PARTIAL_4(c->put_hevc_qpel_uni_w, 1, 0, qpel_uni_w_v,);
NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel, 0, 1, epel_h,);
+ NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h,);
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 08/21] aarch64: hevc: Split the epel_*_hv functions into two parts
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (6 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 07/21] aarch64: hevc: Implement a neon version of hevc_epel_uni_w_h*_8 Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 09/21] aarch64: hevc: Reorder epel_hv functions to prepare for templating Martin Storsjö
` (13 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
The first part, the horizontal filter, can use either the i8mm or the
plain neon version, while the second part is a pure neon implementation.
---
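The structure this split produces can be sketched in C as a two-stage filter: a selectable horizontal stage fills a temporary buffer with height + 3 rows, and a shared second stage filters that buffer vertically. All names here are hypothetical, and the function pointer only stands in for the choice between the i8mm and plain neon entry points; in the assembly the first part simply branches to the shared tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64 /* row pitch of the temporary and dst buffers, in elements */

typedef void (*epel_h_fn)(int16_t *dst, const uint8_t *src, ptrdiff_t stride,
                          int height, const int8_t taps[4], int width);

/* Stand-in horizontal stage; the real code has an i8mm and a plain neon variant. */
static void epel_h_plain(int16_t *dst, const uint8_t *src, ptrdiff_t stride,
                         int height, const int8_t taps[4], int width)
{
    for (int y = 0; y < height; y++, src += stride, dst += MAX_PB_SIZE)
        for (int x = 0; x < width; x++)
            dst[x] = (int16_t)(taps[0] * src[x - 1] + taps[1] * src[x] +
                               taps[2] * src[x + 1] + taps[3] * src[x + 2]);
}

/* Shared second stage: vertical 4-tap filter over the temporary rows. */
static void epel_hv_end(int16_t *dst, const int16_t *tmp, int height,
                        const int8_t taps[4], int width)
{
    for (int y = 0; y < height; y++, tmp += MAX_PB_SIZE, dst += MAX_PB_SIZE)
        for (int x = 0; x < width; x++) {
            int sum = taps[0] * tmp[x] + taps[1] * tmp[x + MAX_PB_SIZE] +
                      taps[2] * tmp[x + 2 * MAX_PB_SIZE] +
                      taps[3] * tmp[x + 3 * MAX_PB_SIZE];
            dst[x] = (int16_t)(sum >> 6);
        }
}

/* First stage picks the horizontal variant, then hands over to the shared tail. */
static void epel_hv(epel_h_fn h_stage, int16_t *dst, const uint8_t *src,
                    ptrdiff_t stride, int height, const int8_t taps_h[4],
                    const int8_t taps_v[4], int width)
{
    int16_t tmp[(MAX_PB_SIZE + 3) * MAX_PB_SIZE];
    assert(height >= 1 && height <= MAX_PB_SIZE);
    assert(width >= 1 && width <= MAX_PB_SIZE);
    /* Start one row above the block and filter three extra rows. */
    h_stage(tmp, src - stride, stride, height + 3, taps_h, width);
    epel_hv_end(dst, tmp, height, taps_v, width);
}
```

Because only the first stage differs between instruction sets, the vertical code exists once and both variants reuse it.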
libavcodec/aarch64/hevcdsp_epel_neon.S | 100 +++++++++++++++++++++++++
1 file changed, 100 insertions(+)
diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 0e49491a81..6be171ece1 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2186,6 +2186,10 @@ function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv4_8_end_neon
load_epel_filterh x5, x4
mov x10, #(MAX_PB_SIZE * 2)
ldr d16, [sp]
@@ -2215,6 +2219,10 @@ function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv6_8_end_neon
load_epel_filterh x5, x4
mov x5, #120
mov x10, #(MAX_PB_SIZE * 2)
@@ -2247,6 +2255,10 @@ function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv8_8_end_neon
load_epel_filterh x5, x4
mov x10, #(MAX_PB_SIZE * 2)
ldr q16, [sp]
@@ -2277,6 +2289,10 @@ function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv12_8_end_neon
load_epel_filterh x5, x4
mov x5, #112
mov x10, #(MAX_PB_SIZE * 2)
@@ -2309,6 +2325,10 @@ function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv16_8_end_neon
load_epel_filterh x5, x4
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h, v17.8h}, [sp], x10
@@ -2340,6 +2360,10 @@ function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv24_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_hv24_8_end_neon
load_epel_filterh x5, x4
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h, v17.8h, v18.8h}, [sp], x10
@@ -2445,6 +2469,10 @@ function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv4_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.4h}, [sp], x10
@@ -2478,6 +2506,10 @@ function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv6_8_end_neon
load_epel_filterh x6, x5
sub x1, x1, #4
mov x10, #(MAX_PB_SIZE * 2)
@@ -2514,6 +2546,10 @@ function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv8_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h}, [sp], x10
@@ -2548,6 +2584,10 @@ function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv12_8_end_neon
load_epel_filterh x6, x5
sub x1, x1, #8
mov x10, #(MAX_PB_SIZE * 2)
@@ -2586,6 +2626,10 @@ function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv16_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h, v17.8h}, [sp], x10
@@ -2623,6 +2667,10 @@ function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv24_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_hv24_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h, v17.8h, v18.8h}, [sp], x10
@@ -3173,6 +3221,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv4_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.4h}, [sp], x10
@@ -3240,6 +3292,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv6_8_end_neon
load_epel_filterh x6, x5
sub x1, x1, #4
mov x10, #(MAX_PB_SIZE * 2)
@@ -3312,6 +3368,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv8_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h}, [sp], x10
@@ -3379,6 +3439,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv12_8_end_neon
load_epel_filterh x6, x5
sub x1, x1, #8
mov x10, #(MAX_PB_SIZE * 2)
@@ -3459,6 +3523,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv16_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h, v17.8h}, [sp], x10
@@ -3538,6 +3606,10 @@ function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv24_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_uni_w_hv24_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h, v17.8h, v18.8h}, [sp], x10
@@ -3715,6 +3787,10 @@ function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv4_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.4h}, [sp], x10
@@ -3751,6 +3827,10 @@ function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv6_8_end_neon
load_epel_filterh x7, x6
sub x1, x1, #4
mov x10, #(MAX_PB_SIZE * 2)
@@ -3790,6 +3870,10 @@ function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv8_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h}, [sp], x10
@@ -3827,6 +3911,10 @@ function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv12_8_end_neon
load_epel_filterh x7, x6
sub x1, x1, #8
mov x10, #(MAX_PB_SIZE * 2)
@@ -3869,6 +3957,10 @@ function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv16_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h, v17.8h}, [sp], x10
@@ -3910,6 +4002,10 @@ function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv24_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv24_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h, v17.8h, v18.8h}, [sp], x10
@@ -3956,6 +4052,10 @@ function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv32_8_end_neon
+endfunc
+
+function hevc_put_hevc_epel_bi_hv32_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
ld1 {v16.8h, v17.8h, v18.8h, v19.8h}, [sp], x10
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 09/21] aarch64: hevc: Reorder epel_hv functions to prepare for templating
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (7 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 08/21] aarch64: hevc: Split the epel_*_hv functions into two parts Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 10/21] aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm Martin Storsjö
` (12 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
This is a pure reordering of code without changing anything in
the individual functions.
---
libavcodec/aarch64/hevcdsp_epel_neon.S | 971 +++++++++++++------------
1 file changed, 497 insertions(+), 474 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 6be171ece1..2088630da1 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2173,21 +2173,9 @@ function ff_hevc_put_hevc_epel_h64_8_neon_i8mm, export=1
ret
endfunc
+DISABLE_I8MM
+#endif
-function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
- add w10, w3, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- sub x1, x1, x2
- add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_epel_hv4_8_end_neon
-endfunc
function hevc_put_hevc_epel_hv4_8_end_neon
load_epel_filterh x5, x4
@@ -2207,21 +2195,6 @@ function hevc_put_hevc_epel_hv4_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
- add w10, w3, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- sub x1, x1, x2
- add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_epel_hv6_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_hv6_8_end_neon
load_epel_filterh x5, x4
mov x5, #120
@@ -2243,21 +2216,6 @@ function hevc_put_hevc_epel_hv6_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
- add w10, w3, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- sub x1, x1, x2
- add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_epel_hv8_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_hv8_8_end_neon
load_epel_filterh x5, x4
mov x10, #(MAX_PB_SIZE * 2)
@@ -2277,21 +2235,6 @@ function hevc_put_hevc_epel_hv8_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
- add w10, w3, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- sub x1, x1, x2
- add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_epel_hv12_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_hv12_8_end_neon
load_epel_filterh x5, x4
mov x5, #112
@@ -2313,21 +2256,6 @@ function hevc_put_hevc_epel_hv12_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
- add w10, w3, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- sub x1, x1, x2
- add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_epel_hv16_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_hv16_8_end_neon
load_epel_filterh x5, x4
mov x10, #(MAX_PB_SIZE * 2)
@@ -2348,21 +2276,6 @@ function hevc_put_hevc_epel_hv16_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
- add w10, w3, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- sub x1, x1, x2
- add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_epel_hv24_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_hv24_8_end_neon
load_epel_filterh x5, x4
mov x10, #(MAX_PB_SIZE * 2)
@@ -2385,6 +2298,99 @@ function hevc_put_hevc_epel_hv24_8_end_neon
2: ret
endfunc
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
+ add w10, w3, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ sub x1, x1, x2
+ add w3, w3, #3
+ bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
+ add w10, w3, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ sub x1, x1, x2
+ add w3, w3, #3
+ bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
+ add w10, w3, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ sub x1, x1, x2
+ add w3, w3, #3
+ bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
+ add w10, w3, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ sub x1, x1, x2
+ add w3, w3, #3
+ bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
+ add w10, w3, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ sub x1, x1, x2
+ add w3, w3, #3
+ bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
+ add w10, w3, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ sub x1, x1, x2
+ add w3, w3, #3
+ bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_epel_hv24_8_end_neon
+endfunc
+
function ff_hevc_put_hevc_epel_hv32_8_neon_i8mm, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
@@ -2453,24 +2459,8 @@ function ff_hevc_put_hevc_epel_hv64_8_neon_i8mm, export=1
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
- add w10, w4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_hv4_8_end_neon
-endfunc
+DISABLE_I8MM
+#endif
function hevc_put_hevc_epel_uni_hv4_8_end_neon
load_epel_filterh x6, x5
@@ -2490,25 +2480,6 @@ function hevc_put_hevc_epel_uni_hv4_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
- add w10, w4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_hv6_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_hv6_8_end_neon
load_epel_filterh x6, x5
sub x1, x1, #4
@@ -2530,25 +2501,6 @@ function hevc_put_hevc_epel_uni_hv6_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
- add w10, w4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_hv8_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_hv8_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
@@ -2568,25 +2520,6 @@ function hevc_put_hevc_epel_uni_hv8_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
- add w10, w4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_hv12_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_hv12_8_end_neon
load_epel_filterh x6, x5
sub x1, x1, #8
@@ -2610,25 +2543,6 @@ function hevc_put_hevc_epel_uni_hv12_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
- add w10, w4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_hv16_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_hv16_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
@@ -2651,25 +2565,6 @@ function hevc_put_hevc_epel_uni_hv16_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
- add w10, w4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_hv24_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_hv24_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
@@ -2695,6 +2590,123 @@ function hevc_put_hevc_epel_uni_hv24_8_end_neon
2: ret
endfunc
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
+ add w10, w4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
+ add w10, w4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
+ add w10, w4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
+ add w10, w4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
+ add w10, w4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
+ add w10, w4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_hv24_8_end_neon
+endfunc
+
function ff_hevc_put_hevc_epel_uni_hv32_8_neon_i8mm, export=1
stp x5, x6, [sp, #-64]!
stp x3, x4, [sp, #16]
@@ -3098,6 +3110,8 @@ function ff_hevc_put_hevc_epel_uni_w_h64_8_neon_i8mm, export=1
b.hi 1b
ret
endfunc
+DISABLE_I8MM
+#endif
.macro epel_uni_w_hv_start
mov x15, x5 //denom
@@ -3202,28 +3216,6 @@ endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
- epel_uni_w_hv_start
- sxtw x4, w4
-
- add x10, x4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add x3, x4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_w_hv4_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_w_hv4_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
@@ -3273,28 +3265,6 @@ function hevc_put_hevc_epel_uni_w_hv4_8_end_neon
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
- epel_uni_w_hv_start
- sxtw x4, w4
-
- add x10, x4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add x3, x4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_w_hv6_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_w_hv6_8_end_neon
load_epel_filterh x6, x5
sub x1, x1, #4
@@ -3349,28 +3319,6 @@ function hevc_put_hevc_epel_uni_w_hv6_8_end_neon
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
- epel_uni_w_hv_start
- sxtw x4, w4
-
- add x10, x4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add x3, x4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_w_hv8_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_w_hv8_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
@@ -3420,28 +3368,6 @@ function hevc_put_hevc_epel_uni_w_hv8_8_end_neon
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
- epel_uni_w_hv_start
- sxtw x4, w4
-
- add x10, x4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add x3, x4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_w_hv12_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_w_hv12_8_end_neon
load_epel_filterh x6, x5
sub x1, x1, #8
@@ -3504,28 +3430,6 @@ function hevc_put_hevc_epel_uni_w_hv12_8_end_neon
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
- epel_uni_w_hv_start
- sxtw x4, w4
-
- add x10, x4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add x3, x4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_w_hv16_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_w_hv16_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
@@ -3587,28 +3491,6 @@ function hevc_put_hevc_epel_uni_w_hv16_8_end_neon
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
- epel_uni_w_hv_start
- sxtw x4, w4
-
- add x10, x4, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add x3, x4, #3
- mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_epel_uni_w_hv24_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_uni_w_hv24_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
@@ -3686,6 +3568,141 @@ function hevc_put_hevc_epel_uni_w_hv24_8_end_neon
ret
endfunc
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
+ epel_uni_w_hv_start
+ sxtw x4, w4
+
+ add x10, x4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add x3, x4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
+ epel_uni_w_hv_start
+ sxtw x4, w4
+
+ add x10, x4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add x3, x4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
+ epel_uni_w_hv_start
+ sxtw x4, w4
+
+ add x10, x4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add x3, x4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
+ epel_uni_w_hv_start
+ sxtw x4, w4
+
+ add x10, x4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add x3, x4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
+ epel_uni_w_hv_start
+ sxtw x4, w4
+
+ add x10, x4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add x3, x4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
+ epel_uni_w_hv_start
+ sxtw x4, w4
+
+ add x10, x4, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add x3, x4, #3
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_epel_uni_w_hv24_8_end_neon
+endfunc
+
function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
ldp x15, x16, [sp]
mov x17, #16
@@ -3769,26 +3786,9 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_neon_i8mm, export=1
ret
endfunc
+DISABLE_I8MM
+#endif
-function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
- add w10, w5, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w5, #3
- mov x4, x6
- mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_epel_bi_hv4_8_end_neon
-endfunc
function hevc_put_hevc_epel_bi_hv4_8_end_neon
load_epel_filterh x7, x6
@@ -3810,26 +3810,6 @@ function hevc_put_hevc_epel_bi_hv4_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
- add w10, w5, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w5, #3
- mov x4, x6
- mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_epel_bi_hv6_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_bi_hv6_8_end_neon
load_epel_filterh x7, x6
sub x1, x1, #4
@@ -3853,26 +3833,6 @@ function hevc_put_hevc_epel_bi_hv6_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
- add w10, w5, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w5, #3
- mov x4, x6
- mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_epel_bi_hv8_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_bi_hv8_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
@@ -3894,26 +3854,6 @@ function hevc_put_hevc_epel_bi_hv8_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
- add w10, w5, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w5, #3
- mov x4, x6
- mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_epel_bi_hv12_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_bi_hv12_8_end_neon
load_epel_filterh x7, x6
sub x1, x1, #8
@@ -3940,26 +3880,6 @@ function hevc_put_hevc_epel_bi_hv12_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
- add w10, w5, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w5, #3
- mov x4, x6
- mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_epel_bi_hv16_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_bi_hv16_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
@@ -3985,26 +3905,6 @@ function hevc_put_hevc_epel_bi_hv16_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
- add w10, w5, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w5, #3
- mov x4, x6
- mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_epel_bi_hv24_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_bi_hv24_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
@@ -4034,27 +3934,6 @@ function hevc_put_hevc_epel_bi_hv24_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
- str d8, [sp, #-16]!
- add w10, w5, #3
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3
- mov x2, x3
- add w3, w5, #3
- mov x4, x6
- mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h32_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_epel_bi_hv32_8_end_neon
-endfunc
-
function hevc_put_hevc_epel_bi_hv32_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
@@ -4089,6 +3968,150 @@ function hevc_put_hevc_epel_bi_hv32_8_end_neon
ret
endfunc
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
+ add w10, w5, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w5, #3
+ mov x4, x6
+ mov x5, x7
+ bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
+ add w10, w5, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w5, #3
+ mov x4, x6
+ mov x5, x7
+ bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
+ add w10, w5, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w5, #3
+ mov x4, x6
+ mov x5, x7
+ bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
+ add w10, w5, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w5, #3
+ mov x4, x6
+ mov x5, x7
+ bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
+ add w10, w5, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w5, #3
+ mov x4, x6
+ mov x5, x7
+ bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
+ add w10, w5, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w5, #3
+ mov x4, x6
+ mov x5, x7
+ bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv24_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
+ str d8, [sp, #-16]!
+ add w10, w5, #3
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3
+ mov x2, x3
+ add w3, w5, #3
+ mov x4, x6
+ mov x5, x7
+ bl X(ff_hevc_put_hevc_epel_h32_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_epel_bi_hv32_8_end_neon
+endfunc
+
function ff_hevc_put_hevc_epel_bi_hv48_8_neon_i8mm, export=1
stp x6, x7, [sp, #-80]!
stp x4, x5, [sp, #16]
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 10/21] aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (8 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 09/21] aarch64: hevc: Reorder epel_hv functions to prepare for templating Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 11/21] aarch64: hevc: Produce epel_uni_hv functions for both " Martin Storsjö
` (11 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
AWS Graviton 3:
put_hevc_epel_hv4_8_c: 163.7
put_hevc_epel_hv4_8_neon: 52.5
put_hevc_epel_hv4_8_i8mm: 49.5
put_hevc_epel_hv6_8_c: 292.2
put_hevc_epel_hv6_8_neon: 97.7
put_hevc_epel_hv6_8_i8mm: 101.2
put_hevc_epel_hv8_8_c: 471.0
put_hevc_epel_hv8_8_neon: 106.7
put_hevc_epel_hv8_8_i8mm: 102.5
put_hevc_epel_hv12_8_c: 1030.2
put_hevc_epel_hv12_8_neon: 240.5
put_hevc_epel_hv12_8_i8mm: 215.0
put_hevc_epel_hv16_8_c: 1711.5
put_hevc_epel_hv16_8_neon: 340.2
put_hevc_epel_hv16_8_i8mm: 319.2
put_hevc_epel_hv24_8_c: 3670.0
put_hevc_epel_hv24_8_neon: 702.0
put_hevc_epel_hv24_8_i8mm: 666.5
put_hevc_epel_hv32_8_c: 6785.5
put_hevc_epel_hv32_8_neon: 1247.0
put_hevc_epel_hv32_8_i8mm: 1169.0
put_hevc_epel_hv48_8_c: 14689.7
put_hevc_epel_hv48_8_neon: 2665.2
put_hevc_epel_hv48_8_i8mm: 2740.0
put_hevc_epel_hv64_8_c: 25899.2
put_hevc_epel_hv64_8_neon: 4801.2
put_hevc_epel_hv64_8_i8mm: 4487.7
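To read the numbers above as ratios, here is a small Python sketch, not part of the patch. The timings are copied from the list above; lower is better, and the unit is whatever the benchmark tool reported, which the message does not state. The names `TIMINGS` and `speedup` are illustrative.

```python
# (C, NEON, I8MM) timings per block width, copied from the Graviton 3 list.
TIMINGS = {
    4: (163.7, 52.5, 49.5),
    6: (292.2, 97.7, 101.2),
    8: (471.0, 106.7, 102.5),
    12: (1030.2, 240.5, 215.0),
    16: (1711.5, 340.2, 319.2),
    24: (3670.0, 702.0, 666.5),
    32: (6785.5, 1247.0, 1169.0),
    48: (14689.7, 2665.2, 2740.0),
    64: (25899.2, 4801.2, 4487.7),
}


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


if __name__ == "__main__":
    for width, (c, neon, i8mm) in TIMINGS.items():
        print(
            f"hv{width:<2}  neon {speedup(c, neon):5.2f}x  "
            f"i8mm {speedup(c, i8mm):5.2f}x  "
            f"i8mm vs neon {speedup(neon, i8mm):4.2f}x"
        )
```

In these figures the plain NEON version is slightly ahead of i8mm for hv6 and hv48; in the other rows i8mm is ahead by roughly 4 to 12 percent.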
---
libavcodec/aarch64/hevcdsp_epel_neon.S | 58 +++++++++++++----------
libavcodec/aarch64/hevcdsp_init_aarch64.c | 6 +++
2 files changed, 38 insertions(+), 26 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 2088630da1..024464723b 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2298,10 +2298,8 @@ function hevc_put_hevc_epel_hv24_8_end_neon
2: ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
+.macro epel_hv suffix
+function ff_hevc_put_hevc_epel_hv4_8_\suffix, export=1
add w10, w3, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2310,13 +2308,13 @@ function ff_hevc_put_hevc_epel_hv4_8_neon_i8mm, export=1
add x0, sp, #32
sub x1, x1, x2
add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h4_8_\suffix)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
b hevc_put_hevc_epel_hv4_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv6_8_\suffix, export=1
add w10, w3, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2325,13 +2323,13 @@ function ff_hevc_put_hevc_epel_hv6_8_neon_i8mm, export=1
add x0, sp, #32
sub x1, x1, x2
add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h6_8_\suffix)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
b hevc_put_hevc_epel_hv6_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv8_8_\suffix, export=1
add w10, w3, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2340,13 +2338,13 @@ function ff_hevc_put_hevc_epel_hv8_8_neon_i8mm, export=1
add x0, sp, #32
sub x1, x1, x2
add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h8_8_\suffix)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
b hevc_put_hevc_epel_hv8_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv12_8_\suffix, export=1
add w10, w3, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2355,13 +2353,13 @@ function ff_hevc_put_hevc_epel_hv12_8_neon_i8mm, export=1
add x0, sp, #32
sub x1, x1, x2
add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h12_8_\suffix)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
b hevc_put_hevc_epel_hv12_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv16_8_\suffix, export=1
add w10, w3, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2370,13 +2368,13 @@ function ff_hevc_put_hevc_epel_hv16_8_neon_i8mm, export=1
add x0, sp, #32
sub x1, x1, x2
add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h16_8_\suffix)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
b hevc_put_hevc_epel_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv24_8_\suffix, export=1
add w10, w3, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2385,79 +2383,87 @@ function ff_hevc_put_hevc_epel_hv24_8_neon_i8mm, export=1
add x0, sp, #32
sub x1, x1, x2
add w3, w3, #3
- bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h24_8_\suffix)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
b hevc_put_hevc_epel_hv24_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_hv32_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv32_8_\suffix, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
stp x0, x1, [sp, #32]
str x30, [sp, #48]
mov x6, #16
- bl X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
ldp x0, x1, [sp, #32]
ldp x2, x3, [sp, #16]
ldp x4, x5, [sp], #48
add x0, x0, #32
add x1, x1, #16
mov x6, #16
- bl X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
-function ff_hevc_put_hevc_epel_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv48_8_\suffix, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
stp x0, x1, [sp, #32]
str x30, [sp, #48]
mov x6, #24
- bl X(ff_hevc_put_hevc_epel_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_hv24_8_\suffix)
ldp x0, x1, [sp, #32]
ldp x2, x3, [sp, #16]
ldp x4, x5, [sp], #48
add x0, x0, #48
add x1, x1, #24
mov x6, #24
- bl X(ff_hevc_put_hevc_epel_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_hv24_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
-function ff_hevc_put_hevc_epel_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_hv64_8_\suffix, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
stp x0, x1, [sp, #32]
str x30, [sp, #48]
mov x6, #16
- bl X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
ldp x4, x5, [sp]
ldp x2, x3, [sp, #16]
ldp x0, x1, [sp, #32]
add x0, x0, #32
add x1, x1, #16
mov x6, #16
- bl X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
ldp x4, x5, [sp]
ldp x2, x3, [sp, #16]
ldp x0, x1, [sp, #32]
add x0, x0, #64
add x1, x1, #32
mov x6, #16
- bl X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
ldp x0, x1, [sp, #32]
ldp x2, x3, [sp, #16]
ldp x4, x5, [sp], #48
add x0, x0, #96
add x1, x1, #48
mov x6, #16
- bl X(ff_hevc_put_hevc_epel_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_hv16_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
+.endm
+
+epel_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+epel_hv neon_i8mm
DISABLE_I8MM
#endif
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index be24737c9c..87e321da71 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -227,6 +227,10 @@ NEON8_FNPROTO(epel_h, (int16_t *dst,
const uint8_t *_src, ptrdiff_t _srcstride,
int height, intptr_t mx, intptr_t my, int width),);
+NEON8_FNPROTO(epel_hv, (int16_t *dst,
+ const uint8_t *src, ptrdiff_t srcstride,
+ int height, intptr_t mx, intptr_t my, int width), );
+
NEON8_FNPROTO(epel_h, (int16_t *dst,
const uint8_t *_src, ptrdiff_t _srcstride,
int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -407,6 +411,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel, 0, 1, epel_h,);
NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h,);
+ NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv,);
+
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
--
2.39.3 (Apple Git-146)
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 26+ messages in thread
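Editor's note on the init change in the patch above: the plain NEON epel_hv pointers are assigned unconditionally, and the `have_i8mm(cpu_flags)` branch then overwrites them when the extension is present. The following is only a minimal sketch of that "baseline first, override second" pattern. The names `dsp_ctx`, `dsp_init`, `hv_c`, `hv_neon` and `hv_i8mm` are hypothetical and are not FFmpeg APIs; the real code uses the NEON8_FNASSIGN macros on HEVCDSPContext.

```c
#include <stdbool.h>

/* Hypothetical tags so a caller can see which variant was selected. */
enum impl { IMPL_C, IMPL_NEON, IMPL_I8MM };

typedef enum impl (*hv_fn)(void);

/* Stand-ins for the C, plain NEON and I8MM implementations. */
static enum impl hv_c(void)    { return IMPL_C; }
static enum impl hv_neon(void) { return IMPL_NEON; }
static enum impl hv_i8mm(void) { return IMPL_I8MM; }

/* Hypothetical, much reduced counterpart of a DSP context. */
struct dsp_ctx {
    hv_fn epel_hv;
};

/* Assign the most generic version first, then let each more
 * specific CPU feature overwrite the pointer if it is available. */
static void dsp_init(struct dsp_ctx *c, bool have_neon, bool have_i8mm)
{
    c->epel_hv = hv_c;
    if (!have_neon)
        return;
    c->epel_hv = hv_neon;      /* baseline for every NEON-capable CPU */
    if (have_i8mm)
        c->epel_hv = hv_i8mm;  /* override only when I8MM is present */
}
```

With this ordering, a CPU without I8MM keeps the NEON pointer instead of falling back to C, which is the gap the patch closes.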
* [FFmpeg-devel] [PATCH 11/21] aarch64: hevc: Produce epel_uni_hv functions for both neon and i8mm
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (9 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 10/21] aarch64: hevc: Produce epel_hv functions for both plain neon and i8mm Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 12/21] aarch64: hevc: Produce epel_uni_w_hv " Martin Storsjö
` (10 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
AWS Graviton 3:
put_hevc_epel_uni_hv4_8_c: 163.5
put_hevc_epel_uni_hv4_8_neon: 59.7
put_hevc_epel_uni_hv4_8_i8mm: 57.5
put_hevc_epel_uni_hv6_8_c: 344.7
put_hevc_epel_uni_hv6_8_neon: 105.0
put_hevc_epel_uni_hv6_8_i8mm: 102.7
put_hevc_epel_uni_hv8_8_c: 552.2
put_hevc_epel_uni_hv8_8_neon: 111.2
put_hevc_epel_uni_hv8_8_i8mm: 104.0
put_hevc_epel_uni_hv12_8_c: 1195.0
put_hevc_epel_uni_hv12_8_neon: 248.7
put_hevc_epel_uni_hv12_8_i8mm: 229.5
put_hevc_epel_uni_hv16_8_c: 1910.2
put_hevc_epel_uni_hv16_8_neon: 339.5
put_hevc_epel_uni_hv16_8_i8mm: 323.2
put_hevc_epel_uni_hv24_8_c: 4048.2
put_hevc_epel_uni_hv24_8_neon: 737.7
put_hevc_epel_uni_hv24_8_i8mm: 713.7
put_hevc_epel_uni_hv32_8_c: 6865.7
put_hevc_epel_uni_hv32_8_neon: 1285.0
put_hevc_epel_uni_hv32_8_i8mm: 1206.0
put_hevc_epel_uni_hv48_8_c: 15830.5
put_hevc_epel_uni_hv48_8_neon: 2844.7
put_hevc_epel_uni_hv48_8_i8mm: 2914.0
put_hevc_epel_uni_hv64_8_c: 27912.7
put_hevc_epel_uni_hv64_8_neon: 4970.5
put_hevc_epel_uni_hv64_8_i8mm: 4653.7
---
libavcodec/aarch64/hevcdsp_epel_neon.S | 67 +++++++++++------------
libavcodec/aarch64/hevcdsp_init_aarch64.c | 5 ++
2 files changed, 38 insertions(+), 34 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 024464723b..876db9d449 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -2460,14 +2460,6 @@ endfunc
epel_hv neon
-#if HAVE_I8MM
-ENABLE_I8MM
-
-epel_hv neon_i8mm
-
-DISABLE_I8MM
-#endif
-
function hevc_put_hevc_epel_uni_hv4_8_end_neon
load_epel_filterh x6, x5
mov x10, #(MAX_PB_SIZE * 2)
@@ -2596,10 +2588,8 @@ function hevc_put_hevc_epel_uni_hv24_8_end_neon
2: ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
+.macro epel_uni_hv suffix
+function ff_hevc_put_hevc_epel_uni_hv4_8_\suffix, export=1
add w10, w4, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2611,14 +2601,14 @@ function ff_hevc_put_hevc_epel_uni_hv4_8_neon_i8mm, export=1
mov x2, x3
add w3, w4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h4_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_hv4_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv6_8_\suffix, export=1
add w10, w4, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2630,14 +2620,14 @@ function ff_hevc_put_hevc_epel_uni_hv6_8_neon_i8mm, export=1
mov x2, x3
add w3, w4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h6_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_hv6_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv8_8_\suffix, export=1
add w10, w4, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2649,14 +2639,14 @@ function ff_hevc_put_hevc_epel_uni_hv8_8_neon_i8mm, export=1
mov x2, x3
add w3, w4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h8_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_hv8_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv12_8_\suffix, export=1
add w10, w4, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2668,14 +2658,14 @@ function ff_hevc_put_hevc_epel_uni_hv12_8_neon_i8mm, export=1
mov x2, x3
add w3, w4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h12_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_hv12_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv16_8_\suffix, export=1
add w10, w4, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2687,14 +2677,14 @@ function ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm, export=1
mov x2, x3
add w3, w4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h16_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv24_8_\suffix, export=1
add w10, w4, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -2706,20 +2696,20 @@ function ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm, export=1
mov x2, x3
add w3, w4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h24_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_hv24_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_hv32_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv32_8_\suffix, export=1
stp x5, x6, [sp, #-64]!
stp x3, x4, [sp, #16]
stp x1, x2, [sp, #32]
stp x0, x30, [sp, #48]
mov x7, #16
- bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
ldp x5, x6, [sp]
ldp x3, x4, [sp, #16]
ldp x1, x2, [sp, #32]
@@ -2727,19 +2717,19 @@ function ff_hevc_put_hevc_epel_uni_hv32_8_neon_i8mm, export=1
add x0, x0, #16
add x2, x2, #16
mov x7, #16
- bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
ldr x30, [sp, #56]
add sp, sp, #64
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv48_8_\suffix, export=1
stp x5, x6, [sp, #-64]!
stp x3, x4, [sp, #16]
stp x1, x2, [sp, #32]
stp x0, x30, [sp, #48]
mov x7, #24
- bl X(ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_hv24_8_\suffix)
ldp x5, x6, [sp]
ldp x3, x4, [sp, #16]
ldp x1, x2, [sp, #32]
@@ -2747,19 +2737,19 @@ function ff_hevc_put_hevc_epel_uni_hv48_8_neon_i8mm, export=1
add x0, x0, #24
add x2, x2, #24
mov x7, #24
- bl X(ff_hevc_put_hevc_epel_uni_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_hv24_8_\suffix)
ldr x30, [sp, #56]
add sp, sp, #64
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_hv64_8_\suffix, export=1
stp x5, x6, [sp, #-64]!
stp x3, x4, [sp, #16]
stp x1, x2, [sp, #32]
stp x0, x30, [sp, #48]
mov x7, #16
- bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
ldp x5, x6, [sp]
ldp x3, x4, [sp, #16]
ldp x1, x2, [sp, #32]
@@ -2767,7 +2757,7 @@ function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
add x0, x0, #16
add x2, x2, #16
mov x7, #16
- bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
ldp x5, x6, [sp]
ldp x3, x4, [sp, #16]
ldp x1, x2, [sp, #32]
@@ -2775,7 +2765,7 @@ function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
add x0, x0, #32
add x2, x2, #32
mov x7, #16
- bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
ldp x5, x6, [sp]
ldp x3, x4, [sp, #16]
ldp x1, x2, [sp, #32]
@@ -2783,12 +2773,21 @@ function ff_hevc_put_hevc_epel_uni_hv64_8_neon_i8mm, export=1
add x0, x0, #48
add x2, x2, #48
mov x7, #16
- bl X(ff_hevc_put_hevc_epel_uni_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_hv16_8_\suffix)
ldr x30, [sp, #56]
add sp, sp, #64
ret
endfunc
+.endm
+
+epel_uni_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+epel_hv neon_i8mm
+epel_uni_hv neon_i8mm
function ff_hevc_put_hevc_epel_uni_w_h4_8_neon_i8mm, export=1
EPEL_UNI_W_H_HEADER
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 87e321da71..447ae80bfb 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -209,6 +209,10 @@ NEON8_FNPROTO(epel_uni_v, (uint8_t *dst, ptrdiff_t dststride,
const uint8_t *src, ptrdiff_t srcstride,
int height, intptr_t mx, intptr_t my, int width),);
+NEON8_FNPROTO(epel_uni_hv, (uint8_t *dst, ptrdiff_t _dststride,
+ const uint8_t *src, ptrdiff_t srcstride,
+ int height, intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO(epel_uni_hv, (uint8_t *dst, ptrdiff_t _dststride,
const uint8_t *src, ptrdiff_t srcstride,
int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -412,6 +416,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN_SHARED_32(c->put_hevc_epel_uni_w, 0, 1, epel_uni_w_h,);
NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv,);
+ NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv,);
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
--
2.39.3 (Apple Git-146)
^ permalink raw reply [flat|nested] 26+ messages in thread
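Editor's note on the stack allocation seen in the hv entry points above (`add w10, <height>, #3` followed by `lsl x10, x10, #7` and `sub sp, sp, x10 // tmp_array`): the arithmetic can be written out in C as a rough sketch. It assumes MAX_PB_SIZE is 64, as in FFmpeg's HEVC DSP header to my knowledge, and that the 3 extra rows are the ones the 4-tap vertical filter needs. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* assumed value; one tmp row holds 64 int16_t */
#define EPEL_EXTRA   3  /* assumed: extra rows for the 4-tap vertical pass */

/* Size of the temporary array, written the way the C code would
 * describe it: (height + 3) rows of MAX_PB_SIZE 16-bit samples. */
static size_t epel_hv_tmp_bytes(int height)
{
    size_t rows = (size_t)height + EPEL_EXTRA;
    return rows * MAX_PB_SIZE * sizeof(int16_t);
}

/* The same size, written the way the assembly computes it:
 * add #3, then shift left by 7 (multiply by 128). */
static size_t epel_hv_tmp_bytes_asm(int height)
{
    return ((size_t)height + 3) << 7;
}
```

Both forms agree because 128 bytes is exactly MAX_PB_SIZE * sizeof(int16_t), which also matches the `MAX_PB_SIZE * 2` row stride loaded in the `_end_neon` helpers.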
* [FFmpeg-devel] [PATCH 12/21] aarch64: hevc: Produce epel_uni_w_hv functions for both neon and i8mm
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (10 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 11/21] aarch64: hevc: Produce epel_uni_hv functions for both " Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 13/21] aarch64: hevc: Produce epel_bi_hv " Martin Storsjö
` (9 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
AWS Graviton 3:
put_hevc_epel_uni_w_hv4_8_c: 191.2
put_hevc_epel_uni_w_hv4_8_neon: 87.7
put_hevc_epel_uni_w_hv4_8_i8mm: 83.2
put_hevc_epel_uni_w_hv6_8_c: 349.5
put_hevc_epel_uni_w_hv6_8_neon: 153.0
put_hevc_epel_uni_w_hv6_8_i8mm: 148.5
put_hevc_epel_uni_w_hv8_8_c: 581.2
put_hevc_epel_uni_w_hv8_8_neon: 166.7
put_hevc_epel_uni_w_hv8_8_i8mm: 163.5
put_hevc_epel_uni_w_hv12_8_c: 1230.0
put_hevc_epel_uni_w_hv12_8_neon: 387.7
put_hevc_epel_uni_w_hv12_8_i8mm: 370.2
put_hevc_epel_uni_w_hv16_8_c: 2003.2
put_hevc_epel_uni_w_hv16_8_neon: 501.5
put_hevc_epel_uni_w_hv16_8_i8mm: 490.2
put_hevc_epel_uni_w_hv24_8_c: 4448.7
put_hevc_epel_uni_w_hv24_8_neon: 1092.2
put_hevc_epel_uni_w_hv24_8_i8mm: 1069.7
put_hevc_epel_uni_w_hv32_8_c: 7817.2
put_hevc_epel_uni_w_hv32_8_neon: 1916.2
put_hevc_epel_uni_w_hv32_8_i8mm: 1829.5
put_hevc_epel_uni_w_hv48_8_c: 16728.2
put_hevc_epel_uni_w_hv48_8_neon: 4263.7
put_hevc_epel_uni_w_hv48_8_i8mm: 4342.7
put_hevc_epel_uni_w_hv64_8_c: 29563.2
put_hevc_epel_uni_w_hv64_8_neon: 7474.2
put_hevc_epel_uni_w_hv64_8_i8mm: 7128.5
---
libavcodec/aarch64/hevcdsp_epel_neon.S | 55 ++++++++++++-----------
libavcodec/aarch64/hevcdsp_init_aarch64.c | 6 +++
2 files changed, 36 insertions(+), 25 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index 876db9d449..d0c6205e1c 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -3573,10 +3573,8 @@ function hevc_put_hevc_epel_uni_w_hv24_8_end_neon
ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
+.macro epel_uni_w_hv suffix
+function ff_hevc_put_hevc_epel_uni_w_hv4_8_\suffix, export=1
epel_uni_w_hv_start
sxtw x4, w4
@@ -3591,14 +3589,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
mov x2, x3
add x3, x4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h4_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_w_hv4_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv6_8_\suffix, export=1
epel_uni_w_hv_start
sxtw x4, w4
@@ -3613,14 +3611,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv6_8_neon_i8mm, export=1
mov x2, x3
add x3, x4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h6_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_w_hv6_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv8_8_\suffix, export=1
epel_uni_w_hv_start
sxtw x4, w4
@@ -3635,14 +3633,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv8_8_neon_i8mm, export=1
mov x2, x3
add x3, x4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h8_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_w_hv8_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv12_8_\suffix, export=1
epel_uni_w_hv_start
sxtw x4, w4
@@ -3657,14 +3655,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv12_8_neon_i8mm, export=1
mov x2, x3
add x3, x4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h12_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_w_hv12_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv16_8_\suffix, export=1
epel_uni_w_hv_start
sxtw x4, w4
@@ -3679,14 +3677,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm, export=1
mov x2, x3
add x3, x4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h16_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_w_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv24_8_\suffix, export=1
epel_uni_w_hv_start
sxtw x4, w4
@@ -3701,14 +3699,14 @@ function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
mov x2, x3
add x3, x4, #3
mov x4, x5
- bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h24_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
b hevc_put_hevc_epel_uni_w_hv24_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv32_8_\suffix, export=1
ldp x15, x16, [sp]
mov x17, #16
stp x15, x16, [sp, #-96]!
@@ -3718,7 +3716,7 @@ function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
stp x5, x6, [sp, #64]
stp x17, x7, [sp, #80]
- bl X(ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_w_hv16_8_\suffix)
ldp x0, x30, [sp, #16]
ldp x1, x2, [sp, #32]
ldp x3, x4, [sp, #48]
@@ -3730,13 +3728,13 @@ function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
mov x17, #16
stp x15, x16, [sp, #-32]!
stp x17, x30, [sp, #16]
- bl X(ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_w_hv16_8_\suffix)
ldp x17, x30, [sp, #16]
ldp x15, x16, [sp], #32
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv48_8_\suffix, export=1
ldp x15, x16, [sp]
mov x17, #24
stp x15, x16, [sp, #-96]!
@@ -3745,7 +3743,7 @@ function ff_hevc_put_hevc_epel_uni_w_hv48_8_neon_i8mm, export=1
stp x3, x4, [sp, #48]
stp x5, x6, [sp, #64]
stp x17, x7, [sp, #80]
- bl X(ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_w_hv24_8_\suffix)
ldp x0, x30, [sp, #16]
ldp x1, x2, [sp, #32]
ldp x3, x4, [sp, #48]
@@ -3757,13 +3755,13 @@ function ff_hevc_put_hevc_epel_uni_w_hv48_8_neon_i8mm, export=1
mov x17, #24
stp x15, x16, [sp, #-32]!
stp x17, x30, [sp, #16]
- bl X(ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_w_hv24_8_\suffix)
ldp x17, x30, [sp, #16]
ldp x15, x16, [sp], #32
ret
endfunc
-function ff_hevc_put_hevc_epel_uni_w_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_uni_w_hv64_8_\suffix, export=1
ldp x15, x16, [sp]
mov x17, #32
stp x15, x16, [sp, #-96]!
@@ -3773,7 +3771,7 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_neon_i8mm, export=1
stp x5, x6, [sp, #64]
stp x17, x7, [sp, #80]
- bl X(ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_w_hv32_8_\suffix)
ldp x0, x30, [sp, #16]
ldp x1, x2, [sp, #32]
ldp x3, x4, [sp, #48]
@@ -3785,16 +3783,23 @@ function ff_hevc_put_hevc_epel_uni_w_hv64_8_neon_i8mm, export=1
mov x17, #32
stp x15, x16, [sp, #-32]!
stp x17, x30, [sp, #16]
- bl X(ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_uni_w_hv32_8_\suffix)
ldp x17, x30, [sp, #16]
ldp x15, x16, [sp], #32
ret
endfunc
+.endm
+
+epel_uni_w_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+epel_uni_w_hv neon_i8mm
DISABLE_I8MM
#endif
-
function hevc_put_hevc_epel_bi_hv4_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 447ae80bfb..948103aa09 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -278,6 +278,11 @@ NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst, ptrdiff_t _dststride,
int height, int denom, int wx, int ox,
intptr_t mx, intptr_t my, int width), _i8mm);
+NEON8_FNPROTO(epel_uni_w_hv, (uint8_t *_dst, ptrdiff_t _dststride,
+ const uint8_t *_src, ptrdiff_t _srcstride,
+ int height, int denom, int wx, int ox,
+ intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO(epel_uni_w_hv, (uint8_t *_dst, ptrdiff_t _dststride,
const uint8_t *_src, ptrdiff_t _srcstride,
int height, int denom, int wx, int ox,
@@ -417,6 +422,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv,);
NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv,);
+ NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv,);
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
--
2.39.3 (Apple Git-146)
^ permalink raw reply [flat|nested] 26+ messages in thread
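Editor's note on the templating used in the patches above: wrapping the functions in `.macro epel_uni_w_hv suffix` and instantiating the macro once with `neon` and once with `neon_i8mm` yields two sets of entry points that differ only in which horizontal helper they call. A loose C analogue, using preprocessor token pasting, is sketched below. All names are hypothetical and the arithmetic is a placeholder; it only illustrates the name substitution, not the actual filter.

```c
/* Hypothetical horizontal helpers, one per instruction-set flavour.
 * In the real code these differ in implementation, not in result. */
static int epel_h_neon(int x)      { return x + 1; }
static int epel_h_neon_i8mm(int x) { return x + 1; }

/* Template: stamps out epel_hv_<suffix>, which calls the horizontal
 * helper carrying the same suffix, much like \suffix in the
 * assembler macro. */
#define EPEL_HV(suffix)                                        \
    static int epel_hv_##suffix(int x)                         \
    {                                                          \
        /* first pass: horizontal helper of the same flavour */ \
        int h = epel_h_##suffix(x);                            \
        /* second pass: shared code, independent of suffix */  \
        return 2 * h;                                          \
    }

EPEL_HV(neon)       /* defines epel_hv_neon() */
EPEL_HV(neon_i8mm)  /* defines epel_hv_neon_i8mm() */
```

The shared second pass corresponds to the `_end_neon` functions, which stay outside the macro because they do not depend on the suffix.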
* [FFmpeg-devel] [PATCH 13/21] aarch64: hevc: Produce epel_bi_hv functions for both neon and i8mm
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (11 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 12/21] aarch64: hevc: Produce epel_uni_w_hv " Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 14/21] aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8 Martin Storsjö
` (8 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
Besides templating, this makes one functional change:
ff_hevc_put_hevc_epel_bi_hv32_8 now sets the w6 register (the width
argument), which ff_hevc_put_hevc_epel_h32_8_neon requires.
AWS Graviton 3:
put_hevc_epel_bi_hv4_8_c: 176.5
put_hevc_epel_bi_hv4_8_neon: 62.0
put_hevc_epel_bi_hv4_8_i8mm: 58.0
put_hevc_epel_bi_hv6_8_c: 343.7
put_hevc_epel_bi_hv6_8_neon: 109.7
put_hevc_epel_bi_hv6_8_i8mm: 105.7
put_hevc_epel_bi_hv8_8_c: 536.0
put_hevc_epel_bi_hv8_8_neon: 112.7
put_hevc_epel_bi_hv8_8_i8mm: 111.7
put_hevc_epel_bi_hv12_8_c: 1107.7
put_hevc_epel_bi_hv12_8_neon: 254.7
put_hevc_epel_bi_hv12_8_i8mm: 239.0
put_hevc_epel_bi_hv16_8_c: 1927.7
put_hevc_epel_bi_hv16_8_neon: 356.2
put_hevc_epel_bi_hv16_8_i8mm: 334.2
put_hevc_epel_bi_hv24_8_c: 4195.2
put_hevc_epel_bi_hv24_8_neon: 736.7
put_hevc_epel_bi_hv24_8_i8mm: 715.5
put_hevc_epel_bi_hv32_8_c: 7280.5
put_hevc_epel_bi_hv32_8_neon: 1287.7
put_hevc_epel_bi_hv32_8_i8mm: 1162.2
put_hevc_epel_bi_hv48_8_c: 16857.7
put_hevc_epel_bi_hv48_8_neon: 2836.2
put_hevc_epel_bi_hv48_8_i8mm: 2908.5
put_hevc_epel_bi_hv64_8_c: 29248.2
put_hevc_epel_bi_hv64_8_neon: 5051.7
put_hevc_epel_bi_hv64_8_i8mm: 4491.5
---
libavcodec/aarch64/hevcdsp_epel_neon.S | 62 +++++++++++------------
libavcodec/aarch64/hevcdsp_init_aarch64.c | 5 ++
2 files changed, 36 insertions(+), 31 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
index d0c6205e1c..cb17758a72 100644
--- a/libavcodec/aarch64/hevcdsp_epel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
@@ -3792,14 +3792,6 @@ endfunc
epel_uni_w_hv neon
-#if HAVE_I8MM
-ENABLE_I8MM
-
-epel_uni_w_hv neon_i8mm
-
-DISABLE_I8MM
-#endif
-
function hevc_put_hevc_epel_bi_hv4_8_end_neon
load_epel_filterh x7, x6
mov x10, #(MAX_PB_SIZE * 2)
@@ -3978,10 +3970,8 @@ function hevc_put_hevc_epel_bi_hv32_8_end_neon
ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
+.macro epel_bi_hv suffix
+function ff_hevc_put_hevc_epel_bi_hv4_8_\suffix, export=1
add w10, w5, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -3994,14 +3984,14 @@ function ff_hevc_put_hevc_epel_bi_hv4_8_neon_i8mm, export=1
add w3, w5, #3
mov x4, x6
mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h4_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
b hevc_put_hevc_epel_bi_hv4_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv6_8_\suffix, export=1
add w10, w5, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -4014,14 +4004,14 @@ function ff_hevc_put_hevc_epel_bi_hv6_8_neon_i8mm, export=1
add w3, w5, #3
mov x4, x6
mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h6_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h6_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
b hevc_put_hevc_epel_bi_hv6_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv8_8_\suffix, export=1
add w10, w5, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -4034,14 +4024,14 @@ function ff_hevc_put_hevc_epel_bi_hv8_8_neon_i8mm, export=1
add w3, w5, #3
mov x4, x6
mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h8_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h8_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
b hevc_put_hevc_epel_bi_hv8_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv12_8_\suffix, export=1
add w10, w5, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -4054,14 +4044,14 @@ function ff_hevc_put_hevc_epel_bi_hv12_8_neon_i8mm, export=1
add w3, w5, #3
mov x4, x6
mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h12_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h12_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
b hevc_put_hevc_epel_bi_hv12_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv16_8_\suffix, export=1
add w10, w5, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -4074,14 +4064,14 @@ function ff_hevc_put_hevc_epel_bi_hv16_8_neon_i8mm, export=1
add w3, w5, #3
mov x4, x6
mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h16_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
b hevc_put_hevc_epel_bi_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv24_8_\suffix, export=1
add w10, w5, #3
lsl x10, x10, #7
sub sp, sp, x10 // tmp_array
@@ -4094,14 +4084,14 @@ function ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm, export=1
add w3, w5, #3
mov x4, x6
mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_h24_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
b hevc_put_hevc_epel_bi_hv24_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv32_8_\suffix, export=1
str d8, [sp, #-16]!
add w10, w5, #3
lsl x10, x10, #7
@@ -4115,20 +4105,21 @@ function ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm, export=1
add w3, w5, #3
mov x4, x6
mov x5, x7
- bl X(ff_hevc_put_hevc_epel_h32_8_neon_i8mm)
+ mov w6, #32
+ bl X(ff_hevc_put_hevc_epel_h32_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
b hevc_put_hevc_epel_bi_hv32_8_end_neon
endfunc
-function ff_hevc_put_hevc_epel_bi_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv48_8_\suffix, export=1
stp x6, x7, [sp, #-80]!
stp x4, x5, [sp, #16]
stp x2, x3, [sp, #32]
stp x0, x1, [sp, #48]
str x30, [sp, #64]
- bl X(ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_bi_hv24_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x2, x3, [sp, #32]
ldp x0, x1, [sp, #48]
@@ -4136,18 +4127,18 @@ function ff_hevc_put_hevc_epel_bi_hv48_8_neon_i8mm, export=1
add x0, x0, #24
add x2, x2, #24
add x4, x4, #48
- bl X(ff_hevc_put_hevc_epel_bi_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_bi_hv24_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
-function ff_hevc_put_hevc_epel_bi_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_epel_bi_hv64_8_\suffix, export=1
stp x6, x7, [sp, #-80]!
stp x4, x5, [sp, #16]
stp x2, x3, [sp, #32]
stp x0, x1, [sp, #48]
str x30, [sp, #64]
- bl X(ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_bi_hv32_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x2, x3, [sp, #32]
ldp x0, x1, [sp, #48]
@@ -4155,10 +4146,19 @@ function ff_hevc_put_hevc_epel_bi_hv64_8_neon_i8mm, export=1
add x0, x0, #32
add x2, x2, #32
add x4, x4, #64
- bl X(ff_hevc_put_hevc_epel_bi_hv32_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_epel_bi_hv32_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
+.endm
+
+epel_bi_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+epel_uni_w_hv neon_i8mm
+epel_bi_hv neon_i8mm
DISABLE_I8MM
#endif
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 948103aa09..6110a360d8 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -188,6 +188,10 @@ NEON8_FNPROTO(epel_bi_v, (uint8_t *dst, ptrdiff_t dststride,
const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
int height, intptr_t mx, intptr_t my, int width),);
+NEON8_FNPROTO(epel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
+ const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
+ int height, intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO(epel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -423,6 +427,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv,);
NEON8_FNASSIGN(c->put_hevc_epel_uni, 1, 1, epel_uni_hv,);
NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv,);
+ NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv,);
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
--
2.39.3 (Apple Git-146)
^ permalink raw reply [flat|nested] 26+ messages in thread
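Editor's note on the `mov w6, #32` added in the patch above: the commit message says the plain NEON h32 function requires the width in w6. One plausible reason, and this is an assumption based on the NEON8_FNASSIGN_SHARED_32 assignment rather than something the message states, is that the NEON function is shared across several widths and loops horizontally over `width`. The sketch below shows why a caller of such a shared routine has to pass the width explicitly. `h_shared` and `hv32_first_pass` are hypothetical names, and the `<< 6` is a placeholder for the real filter.

```c
#include <stdint.h>

#define MAX_PB_SIZE 64  /* assumed row stride, in samples */

/* Hypothetical shared horizontal pass: processes `width` columns per
 * row, 16 at a time, so it needs the width from its caller. */
static void h_shared(int16_t *dst, const uint8_t *src,
                     int height, int width)
{
    for (int y = 0; y < height; y++) {
        for (int x0 = 0; x0 < width; x0 += 16) {
            for (int x = x0; x < x0 + 16; x++) {
                /* placeholder for the real 4-tap filter */
                dst[y * MAX_PB_SIZE + x] =
                    (int16_t)(src[y * MAX_PB_SIZE + x] << 6);
            }
        }
    }
}

/* Hypothetical 32-wide wrapper: the fixed width plays the role of
 * `mov w6, #32`; height + 3 rows feed the later vertical pass. */
static void hv32_first_pass(int16_t *tmp, const uint8_t *src, int height)
{
    h_shared(tmp, src, height + 3, 32);
}
```

If the wrapper left the width unset, the shared routine would loop over whatever value happened to be in that argument, which is the kind of problem the added instruction avoids.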
* [FFmpeg-devel] [PATCH 14/21] aarch64: hevc: Implement a neon version of hevc_qpel_uni_w_h*_8
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
AWS Graviton 3:
put_hevc_qpel_uni_w_h4_8_c: 159.0
put_hevc_qpel_uni_w_h4_8_neon: 64.2
put_hevc_qpel_uni_w_h4_8_i8mm: 40.0
put_hevc_qpel_uni_w_h6_8_c: 344.7
put_hevc_qpel_uni_w_h6_8_neon: 114.5
put_hevc_qpel_uni_w_h6_8_i8mm: 82.0
put_hevc_qpel_uni_w_h8_8_c: 596.2
put_hevc_qpel_uni_w_h8_8_neon: 132.2
put_hevc_qpel_uni_w_h8_8_i8mm: 106.0
put_hevc_qpel_uni_w_h12_8_c: 1325.0
put_hevc_qpel_uni_w_h12_8_neon: 299.0
put_hevc_qpel_uni_w_h12_8_i8mm: 211.5
put_hevc_qpel_uni_w_h16_8_c: 2300.0
put_hevc_qpel_uni_w_h16_8_neon: 422.0
put_hevc_qpel_uni_w_h16_8_i8mm: 286.2
put_hevc_qpel_uni_w_h24_8_c: 5059.0
put_hevc_qpel_uni_w_h24_8_neon: 912.2
put_hevc_qpel_uni_w_h24_8_i8mm: 664.2
put_hevc_qpel_uni_w_h32_8_c: 9198.2
put_hevc_qpel_uni_w_h32_8_neon: 1638.2
put_hevc_qpel_uni_w_h32_8_i8mm: 1033.7
put_hevc_qpel_uni_w_h48_8_c: 20754.7
put_hevc_qpel_uni_w_h48_8_neon: 3633.7
put_hevc_qpel_uni_w_h48_8_i8mm: 2300.7
put_hevc_qpel_uni_w_h64_8_c: 36854.7
put_hevc_qpel_uni_w_h64_8_neon: 6435.7
put_hevc_qpel_uni_w_h64_8_i8mm: 4039.2
---
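For readers who do not want to trace the assembly, below is a rough
scalar C model of what the new NEON functions compute per pixel. It is
a sketch, not the project's C reference: the names qpel_taps, clip_u8
and qpel_uni_w_h_ref are made up for illustration, the tap values are
the standard HEVC luma interpolation coefficients, and the rounding
shift of (denom + 6) is read off the "mov w10, #-6; sub w10, w10, w5"
plus sqrshl sequence in QPEL_UNI_W_H_HEADER. It assumes that >> on a
negative int is an arithmetic shift, which holds on the usual
compilers.

```c
#include <stddef.h>
#include <stdint.h>

/* Standard HEVC 8-tap luma coefficients, indexed by the fractional
 * position mx (1..3). Row 0 is unused padding in this sketch. */
static const int8_t qpel_taps[4][8] = {
    {  0, 0,   0,  0,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

/* Saturate to the 8-bit pixel range, like the final sqxtun. */
static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Hypothetical scalar model of put_hevc_qpel_uni_w_h for 8-bit video. */
static void qpel_uni_w_h_ref(uint8_t *dst, ptrdiff_t dststride,
                             const uint8_t *src, ptrdiff_t srcstride,
                             int height, int denom, int wx, int ox,
                             int mx, int width)
{
    const int shift = denom + 6;        /* rounding right shift amount */
    const int round = 1 << (shift - 1);

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            /* The filter window starts 3 pixels left of x
             * (the assembly does "sub x2, x2, #3"). */
            for (int k = 0; k < 8; k++)
                sum += qpel_taps[mx][k] * src[x + k - 3];
            /* Weight, round, shift, then add the offset and clip. */
            dst[x] = clip_u8(((sum * wx + round) >> shift) + ox);
        }
        dst += dststride;
        src += srcstride;
    }
}
```

The 16-bit mul/mla accumulation in the NEON code is safe for 8-bit
input because each tap set sums to 64 and the largest positive partial
sum is 88 * 255 = 22440, which fits in a signed 16-bit lane.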
libavcodec/aarch64/hevcdsp_init_aarch64.c | 7 +
libavcodec/aarch64/hevcdsp_qpel_neon.S | 405 +++++++++++++++++++++-
2 files changed, 410 insertions(+), 2 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 6110a360d8..ea0d26c019 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -277,6 +277,11 @@ NEON8_FNPROTO(qpel_uni_hv, (uint8_t *dst, ptrdiff_t dststride,
const uint8_t *src, ptrdiff_t srcstride,
int height, intptr_t mx, intptr_t my, int width), _i8mm);
+NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst, ptrdiff_t _dststride,
+ const uint8_t *_src, ptrdiff_t _srcstride,
+ int height, int denom, int wx, int ox,
+ intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst, ptrdiff_t _dststride,
const uint8_t *_src, ptrdiff_t _srcstride,
int height, int denom, int wx, int ox,
@@ -429,6 +434,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 1, epel_uni_w_hv,);
NEON8_FNASSIGN(c->put_hevc_epel_bi, 1, 1, epel_bi_hv,);
+ NEON8_FNASSIGN_SHARED_32(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h,);
+
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 062b7d4d0f..fba063186c 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2456,8 +2456,10 @@ function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, export=1
ldp x7, x30, [sp], #48
b .Lqpel_uni_hv16_loop
endfunc
+DISABLE_I8MM
+#endif
-.macro QPEL_UNI_W_H_HEADER
+.macro QPEL_UNI_W_H_HEADER elems=4s
ldr x12, [sp]
sub x2, x2, #3
movrel x9, qpel_filters
@@ -2465,11 +2467,410 @@ endfunc
ld1r {v28.2d}, [x9]
mov w10, #-6
sub w10, w10, w5
- dup v30.4s, w6 // wx
+ dup v30.\elems, w6 // wx
dup v31.4s, w10 // shift
dup v29.4s, w7 // ox
.endm
+function ff_hevc_put_hevc_qpel_uni_w_h4_8_neon, export=1
+ QPEL_UNI_W_H_HEADER 4h
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v1.8b, v2.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ ext v3.16b, v1.16b, v2.16b, #2
+ ext v4.16b, v1.16b, v2.16b, #4
+ ext v5.16b, v1.16b, v2.16b, #6
+ ext v6.16b, v1.16b, v2.16b, #8
+ ext v7.16b, v1.16b, v2.16b, #10
+ ext v16.16b, v1.16b, v2.16b, #12
+ ext v17.16b, v1.16b, v2.16b, #14
+ mul v18.4h, v1.4h, v0.h[0]
+ mla v18.4h, v3.4h, v0.h[1]
+ mla v18.4h, v4.4h, v0.h[2]
+ mla v18.4h, v5.4h, v0.h[3]
+ mla v18.4h, v6.4h, v0.h[4]
+ mla v18.4h, v7.4h, v0.h[5]
+ mla v18.4h, v16.4h, v0.h[6]
+ mla v18.4h, v17.4h, v0.h[7]
+ smull v16.4s, v18.4h, v30.4h
+ sqrshl v16.4s, v16.4s, v31.4s
+ sqadd v16.4s, v16.4s, v29.4s
+ sqxtn v16.4h, v16.4s
+ sqxtun v16.8b, v16.8h
+ str s16, [x0]
+ add x0, x0, x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h6_8_neon, export=1
+ QPEL_UNI_W_H_HEADER 8h
+ sub x1, x1, #4
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v1.8b, v2.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ ext v3.16b, v1.16b, v2.16b, #2
+ ext v4.16b, v1.16b, v2.16b, #4
+ ext v5.16b, v1.16b, v2.16b, #6
+ ext v6.16b, v1.16b, v2.16b, #8
+ ext v7.16b, v1.16b, v2.16b, #10
+ ext v16.16b, v1.16b, v2.16b, #12
+ ext v17.16b, v1.16b, v2.16b, #14
+ mul v18.8h, v1.8h, v0.h[0]
+ mla v18.8h, v3.8h, v0.h[1]
+ mla v18.8h, v4.8h, v0.h[2]
+ mla v18.8h, v5.8h, v0.h[3]
+ mla v18.8h, v6.8h, v0.h[4]
+ mla v18.8h, v7.8h, v0.h[5]
+ mla v18.8h, v16.8h, v0.h[6]
+ mla v18.8h, v17.8h, v0.h[7]
+ smull v16.4s, v18.4h, v30.4h
+ smull2 v17.4s, v18.8h, v30.8h
+ sqrshl v16.4s, v16.4s, v31.4s
+ sqrshl v17.4s, v17.4s, v31.4s
+ sqadd v16.4s, v16.4s, v29.4s
+ sqadd v17.4s, v17.4s, v29.4s
+ sqxtn v16.4h, v16.4s
+ sqxtn2 v16.8h, v17.4s
+ sqxtun v16.8b, v16.8h
+ str s16, [x0], #4
+ st1 {v16.h}[2], [x0], x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h8_8_neon, export=1
+ QPEL_UNI_W_H_HEADER 8h
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v1.8b, v2.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ ext v3.16b, v1.16b, v2.16b, #2
+ ext v4.16b, v1.16b, v2.16b, #4
+ ext v5.16b, v1.16b, v2.16b, #6
+ ext v6.16b, v1.16b, v2.16b, #8
+ ext v7.16b, v1.16b, v2.16b, #10
+ ext v16.16b, v1.16b, v2.16b, #12
+ ext v17.16b, v1.16b, v2.16b, #14
+ mul v18.8h, v1.8h, v0.h[0]
+ mla v18.8h, v3.8h, v0.h[1]
+ mla v18.8h, v4.8h, v0.h[2]
+ mla v18.8h, v5.8h, v0.h[3]
+ mla v18.8h, v6.8h, v0.h[4]
+ mla v18.8h, v7.8h, v0.h[5]
+ mla v18.8h, v16.8h, v0.h[6]
+ mla v18.8h, v17.8h, v0.h[7]
+ smull v16.4s, v18.4h, v30.4h
+ smull2 v17.4s, v18.8h, v30.8h
+ sqrshl v16.4s, v16.4s, v31.4s
+ sqrshl v17.4s, v17.4s, v31.4s
+ sqadd v16.4s, v16.4s, v29.4s
+ sqadd v17.4s, v17.4s, v29.4s
+ sqxtn v16.4h, v16.4s
+ sqxtn2 v16.8h, v17.4s
+ sqxtun v16.8b, v16.8h
+ st1 {v16.8b}, [x0], x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h12_8_neon, export=1
+ QPEL_UNI_W_H_HEADER 8h
+ add x13, x0, #8
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v1.8b, v2.8b, v3.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ ext v4.16b, v1.16b, v2.16b, #2
+ ext v5.16b, v1.16b, v2.16b, #4
+ ext v6.16b, v1.16b, v2.16b, #6
+ ext v7.16b, v1.16b, v2.16b, #8
+ ext v16.16b, v1.16b, v2.16b, #10
+ ext v17.16b, v1.16b, v2.16b, #12
+ ext v18.16b, v1.16b, v2.16b, #14
+ mul v19.8h, v1.8h, v0.h[0]
+ mla v19.8h, v4.8h, v0.h[1]
+ mla v19.8h, v5.8h, v0.h[2]
+ mla v19.8h, v6.8h, v0.h[3]
+ mla v19.8h, v7.8h, v0.h[4]
+ mla v19.8h, v16.8h, v0.h[5]
+ mla v19.8h, v17.8h, v0.h[6]
+ mla v19.8h, v18.8h, v0.h[7]
+ ext v4.16b, v2.16b, v3.16b, #2
+ ext v5.16b, v2.16b, v3.16b, #4
+ ext v6.16b, v2.16b, v3.16b, #6
+ ext v7.16b, v2.16b, v3.16b, #8
+ ext v16.16b, v2.16b, v3.16b, #10
+ ext v17.16b, v2.16b, v3.16b, #12
+ ext v18.16b, v2.16b, v3.16b, #14
+ mul v20.4h, v2.4h, v0.h[0]
+ mla v20.4h, v4.4h, v0.h[1]
+ mla v20.4h, v5.4h, v0.h[2]
+ mla v20.4h, v6.4h, v0.h[3]
+ mla v20.4h, v7.4h, v0.h[4]
+ mla v20.4h, v16.4h, v0.h[5]
+ mla v20.4h, v17.4h, v0.h[6]
+ mla v20.4h, v18.4h, v0.h[7]
+ smull v16.4s, v19.4h, v30.4h
+ smull2 v17.4s, v19.8h, v30.8h
+ smull v18.4s, v20.4h, v30.4h
+ sqrshl v16.4s, v16.4s, v31.4s
+ sqrshl v17.4s, v17.4s, v31.4s
+ sqrshl v18.4s, v18.4s, v31.4s
+ sqadd v16.4s, v16.4s, v29.4s
+ sqadd v17.4s, v17.4s, v29.4s
+ sqadd v18.4s, v18.4s, v29.4s
+ sqxtn v16.4h, v16.4s
+ sqxtn2 v16.8h, v17.4s
+ sqxtn v17.4h, v18.4s
+ sqxtun v16.8b, v16.8h
+ sqxtun v17.8b, v17.8h
+ st1 {v16.8b}, [x0], x1
+ st1 {v17.s}[0], [x13], x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h16_8_neon, export=1
+ QPEL_UNI_W_H_HEADER 8h
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v1.8b, v2.8b, v3.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ ext v4.16b, v1.16b, v2.16b, #2
+ ext v5.16b, v1.16b, v2.16b, #4
+ ext v6.16b, v1.16b, v2.16b, #6
+ ext v7.16b, v1.16b, v2.16b, #8
+ ext v16.16b, v1.16b, v2.16b, #10
+ ext v17.16b, v1.16b, v2.16b, #12
+ ext v18.16b, v1.16b, v2.16b, #14
+ mul v19.8h, v1.8h, v0.h[0]
+ mla v19.8h, v4.8h, v0.h[1]
+ mla v19.8h, v5.8h, v0.h[2]
+ mla v19.8h, v6.8h, v0.h[3]
+ mla v19.8h, v7.8h, v0.h[4]
+ mla v19.8h, v16.8h, v0.h[5]
+ mla v19.8h, v17.8h, v0.h[6]
+ mla v19.8h, v18.8h, v0.h[7]
+ ext v4.16b, v2.16b, v3.16b, #2
+ ext v5.16b, v2.16b, v3.16b, #4
+ ext v6.16b, v2.16b, v3.16b, #6
+ ext v7.16b, v2.16b, v3.16b, #8
+ ext v16.16b, v2.16b, v3.16b, #10
+ ext v17.16b, v2.16b, v3.16b, #12
+ ext v18.16b, v2.16b, v3.16b, #14
+ mul v20.8h, v2.8h, v0.h[0]
+ mla v20.8h, v4.8h, v0.h[1]
+ mla v20.8h, v5.8h, v0.h[2]
+ mla v20.8h, v6.8h, v0.h[3]
+ mla v20.8h, v7.8h, v0.h[4]
+ mla v20.8h, v16.8h, v0.h[5]
+ mla v20.8h, v17.8h, v0.h[6]
+ mla v20.8h, v18.8h, v0.h[7]
+ smull v16.4s, v19.4h, v30.4h
+ smull2 v17.4s, v19.8h, v30.8h
+ smull v18.4s, v20.4h, v30.4h
+ smull2 v19.4s, v20.8h, v30.8h
+ sqrshl v16.4s, v16.4s, v31.4s
+ sqrshl v17.4s, v17.4s, v31.4s
+ sqrshl v18.4s, v18.4s, v31.4s
+ sqrshl v19.4s, v19.4s, v31.4s
+ sqadd v16.4s, v16.4s, v29.4s
+ sqadd v17.4s, v17.4s, v29.4s
+ sqadd v18.4s, v18.4s, v29.4s
+ sqadd v19.4s, v19.4s, v29.4s
+ sqxtn v16.4h, v16.4s
+ sqxtn2 v16.8h, v17.4s
+ sqxtn v17.4h, v18.4s
+ sqxtn2 v17.8h, v19.4s
+ sqxtun v16.8b, v16.8h
+ sqxtun v17.8b, v17.8h
+ st1 {v16.8b, v17.8b}, [x0], x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h24_8_neon, export=1
+ QPEL_UNI_W_H_HEADER 8h
+ sxtl v0.8h, v28.8b
+1:
+ ld1 {v1.8b, v2.8b, v3.8b, v4.8b}, [x2], x3
+ subs w4, w4, #1
+ uxtl v1.8h, v1.8b
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ uxtl v4.8h, v4.8b
+ ext v5.16b, v1.16b, v2.16b, #2
+ ext v6.16b, v1.16b, v2.16b, #4
+ ext v7.16b, v1.16b, v2.16b, #6
+ ext v16.16b, v1.16b, v2.16b, #8
+ ext v17.16b, v1.16b, v2.16b, #10
+ ext v18.16b, v1.16b, v2.16b, #12
+ ext v19.16b, v1.16b, v2.16b, #14
+ mul v20.8h, v1.8h, v0.h[0]
+ mla v20.8h, v5.8h, v0.h[1]
+ mla v20.8h, v6.8h, v0.h[2]
+ mla v20.8h, v7.8h, v0.h[3]
+ mla v20.8h, v16.8h, v0.h[4]
+ mla v20.8h, v17.8h, v0.h[5]
+ mla v20.8h, v18.8h, v0.h[6]
+ mla v20.8h, v19.8h, v0.h[7]
+ ext v5.16b, v2.16b, v3.16b, #2
+ ext v6.16b, v2.16b, v3.16b, #4
+ ext v7.16b, v2.16b, v3.16b, #6
+ ext v16.16b, v2.16b, v3.16b, #8
+ ext v17.16b, v2.16b, v3.16b, #10
+ ext v18.16b, v2.16b, v3.16b, #12
+ ext v19.16b, v2.16b, v3.16b, #14
+ mul v21.8h, v2.8h, v0.h[0]
+ mla v21.8h, v5.8h, v0.h[1]
+ mla v21.8h, v6.8h, v0.h[2]
+ mla v21.8h, v7.8h, v0.h[3]
+ mla v21.8h, v16.8h, v0.h[4]
+ mla v21.8h, v17.8h, v0.h[5]
+ mla v21.8h, v18.8h, v0.h[6]
+ mla v21.8h, v19.8h, v0.h[7]
+ ext v5.16b, v3.16b, v4.16b, #2
+ ext v6.16b, v3.16b, v4.16b, #4
+ ext v7.16b, v3.16b, v4.16b, #6
+ ext v16.16b, v3.16b, v4.16b, #8
+ ext v17.16b, v3.16b, v4.16b, #10
+ ext v18.16b, v3.16b, v4.16b, #12
+ ext v19.16b, v3.16b, v4.16b, #14
+ mul v22.8h, v3.8h, v0.h[0]
+ mla v22.8h, v5.8h, v0.h[1]
+ mla v22.8h, v6.8h, v0.h[2]
+ mla v22.8h, v7.8h, v0.h[3]
+ mla v22.8h, v16.8h, v0.h[4]
+ mla v22.8h, v17.8h, v0.h[5]
+ mla v22.8h, v18.8h, v0.h[6]
+ mla v22.8h, v19.8h, v0.h[7]
+ smull v16.4s, v20.4h, v30.4h
+ smull2 v17.4s, v20.8h, v30.8h
+ smull v18.4s, v21.4h, v30.4h
+ smull2 v19.4s, v21.8h, v30.8h
+ smull v20.4s, v22.4h, v30.4h
+ smull2 v21.4s, v22.8h, v30.8h
+ sqrshl v16.4s, v16.4s, v31.4s
+ sqrshl v17.4s, v17.4s, v31.4s
+ sqrshl v18.4s, v18.4s, v31.4s
+ sqrshl v19.4s, v19.4s, v31.4s
+ sqrshl v20.4s, v20.4s, v31.4s
+ sqrshl v21.4s, v21.4s, v31.4s
+ sqadd v16.4s, v16.4s, v29.4s
+ sqadd v17.4s, v17.4s, v29.4s
+ sqadd v18.4s, v18.4s, v29.4s
+ sqadd v19.4s, v19.4s, v29.4s
+ sqadd v20.4s, v20.4s, v29.4s
+ sqadd v21.4s, v21.4s, v29.4s
+ sqxtn v16.4h, v16.4s
+ sqxtn2 v16.8h, v17.4s
+ sqxtn v17.4h, v18.4s
+ sqxtn2 v17.8h, v19.4s
+ sqxtn v18.4h, v20.4s
+ sqxtn2 v18.8h, v21.4s
+ sqxtun v16.8b, v16.8h
+ sqxtun v17.8b, v17.8h
+ sqxtun v18.8b, v18.8h
+ st1 {v16.8b, v17.8b, v18.8b}, [x0], x1
+ b.hi 1b
+ ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_h32_8_neon, export=1
+ QPEL_UNI_W_H_HEADER 8h
+ ldr w10, [sp, #16] // width
+ ld1 {v1.8b}, [x2], #8
+ sub x3, x3, w10, uxtw // decrement src stride
+ mov w11, w10 // original width
+ sub x3, x3, #8 // decrement src stride
+ sub x1, x1, w10, uxtw // decrement dst stride
+ sxtl v0.8h, v28.8b
+ uxtl v1.8h, v1.8b
+1:
+ ld1 {v2.8b, v3.8b}, [x2], #16
+ subs w10, w10, #16 // width
+ uxtl v2.8h, v2.8b
+ uxtl v3.8h, v3.8b
+ ext v4.16b, v1.16b, v2.16b, #2
+ ext v5.16b, v1.16b, v2.16b, #4
+ ext v6.16b, v1.16b, v2.16b, #6
+ ext v7.16b, v1.16b, v2.16b, #8
+ ext v16.16b, v1.16b, v2.16b, #10
+ ext v17.16b, v1.16b, v2.16b, #12
+ ext v18.16b, v1.16b, v2.16b, #14
+ mul v19.8h, v1.8h, v0.h[0]
+ mla v19.8h, v4.8h, v0.h[1]
+ mla v19.8h, v5.8h, v0.h[2]
+ mla v19.8h, v6.8h, v0.h[3]
+ mla v19.8h, v7.8h, v0.h[4]
+ mla v19.8h, v16.8h, v0.h[5]
+ mla v19.8h, v17.8h, v0.h[6]
+ mla v19.8h, v18.8h, v0.h[7]
+ ext v4.16b, v2.16b, v3.16b, #2
+ ext v5.16b, v2.16b, v3.16b, #4
+ ext v6.16b, v2.16b, v3.16b, #6
+ ext v7.16b, v2.16b, v3.16b, #8
+ ext v16.16b, v2.16b, v3.16b, #10
+ ext v17.16b, v2.16b, v3.16b, #12
+ ext v18.16b, v2.16b, v3.16b, #14
+ mul v20.8h, v2.8h, v0.h[0]
+ mla v20.8h, v4.8h, v0.h[1]
+ mla v20.8h, v5.8h, v0.h[2]
+ mla v20.8h, v6.8h, v0.h[3]
+ mla v20.8h, v7.8h, v0.h[4]
+ mla v20.8h, v16.8h, v0.h[5]
+ mla v20.8h, v17.8h, v0.h[6]
+ mla v20.8h, v18.8h, v0.h[7]
+ smull v16.4s, v19.4h, v30.4h
+ smull2 v17.4s, v19.8h, v30.8h
+ smull v18.4s, v20.4h, v30.4h
+ smull2 v19.4s, v20.8h, v30.8h
+ sqrshl v16.4s, v16.4s, v31.4s
+ sqrshl v17.4s, v17.4s, v31.4s
+ sqrshl v18.4s, v18.4s, v31.4s
+ sqrshl v19.4s, v19.4s, v31.4s
+ sqadd v16.4s, v16.4s, v29.4s
+ sqadd v17.4s, v17.4s, v29.4s
+ sqadd v18.4s, v18.4s, v29.4s
+ sqadd v19.4s, v19.4s, v29.4s
+ sqxtn v16.4h, v16.4s
+ sqxtn2 v16.8h, v17.4s
+ sqxtn v17.4h, v18.4s
+ sqxtn2 v17.8h, v19.4s
+ sqxtun v16.8b, v16.8h
+ sqxtun v17.8b, v17.8h
+ st1 {v16.8b, v17.8b}, [x0], #16
+ mov v1.16b, v3.16b
+ b.gt 1b
+ subs w4, w4, #1 // height
+ add x2, x2, x3
+ b.le 9f
+ ld1 {v1.8b}, [x2], #8
+ mov w10, w11
+ add x0, x0, x1
+ uxtl v1.8h, v1.8b
+ b 1b
+9:
+ ret
+endfunc
+
+#if HAVE_I8MM
+ENABLE_I8MM
function ff_hevc_put_hevc_qpel_uni_w_h4_8_neon_i8mm, export=1
QPEL_UNI_W_H_HEADER
1:
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 15/21] aarch64: hevc: Split the qpel_*_hv functions into two parts
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
---
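As background for the split, here is a rough scalar C model of the two
halves of a qpel_uni_hv function: a horizontal pass that writes
(height + 7) rows of 16-bit intermediates with a fixed MAX_PB_SIZE
stride, and an "end" part that filters that buffer vertically. This is
only a sketch under assumptions: the function names are invented, the
rounding constants follow my reading of the 8-bit C reference, and the
stated benefit (letting the vertical half be reused by both plain NEON
and I8MM front ends) is inferred from the later patches in this series.
It also assumes >> on a negative int is an arithmetic shift.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* row stride of the temporary buffer, in elements */

/* Standard HEVC 8-tap luma coefficients; row 0 is unused padding. */
static const int8_t qpel_taps[4][8] = {
    {  0, 0,   0,  0,  0,   0, 0,  0 },
    { -1, 4, -10, 58, 17,  -5, 1,  0 },
    { -1, 4, -11, 40, 40, -11, 4, -1 },
    {  0, 1,  -5, 17, 58, -10, 4, -1 },
};

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* First half (hypothetical name): horizontal filter into tmp. */
static void qpel_h_pass(int16_t *tmp, const uint8_t *src,
                        ptrdiff_t srcstride, int rows, int mx, int width)
{
    for (int y = 0; y < rows; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 8; k++)
                sum += qpel_taps[mx][k] * src[x + k - 3];
            tmp[x] = (int16_t)sum;
        }
        tmp += MAX_PB_SIZE;
        src += srcstride;
    }
}

/* Second half (hypothetical name): vertical filter over tmp,
 * mirroring the role of the *_end_neon functions. */
static void qpel_uni_hv_end(uint8_t *dst, ptrdiff_t dststride,
                            const int16_t *tmp, int height, int my,
                            int width)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int sum = 0;
            for (int k = 0; k < 8; k++)
                sum += qpel_taps[my][k] * tmp[(y + k) * MAX_PB_SIZE + x];
            dst[x] = clip_u8(((sum >> 6) + 32) >> 6);
        }
        dst += dststride;
    }
}

/* Front end: set up the buffer, run the h pass, hand over to the end. */
static void qpel_uni_hv_ref(uint8_t *dst, ptrdiff_t dststride,
                            const uint8_t *src, ptrdiff_t srcstride,
                            int height, int mx, int my, int width)
{
    int16_t tmp[(MAX_PB_SIZE + 7) * MAX_PB_SIZE];

    /* Start 3 rows above the block and filter 7 extra rows. */
    qpel_h_pass(tmp, src - 3 * srcstride, srcstride, height + 7, mx, width);
    qpel_uni_hv_end(dst, dststride, tmp, height, my, width);
}
```

In the assembly the hand-over is a plain tail branch ("b ..._end_neon")
rather than a call, so the end part returns directly to the original
caller.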
libavcodec/aarch64/hevcdsp_qpel_neon.S | 94 +++++++++++++++++++++++---
1 file changed, 86 insertions(+), 8 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index fba063186c..c04e8dbea8 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2166,6 +2166,10 @@ function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_qpel_uni_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv4_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
ldr d16, [sp]
@@ -2208,6 +2212,10 @@ function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_qpel_uni_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv6_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
sub x1, x1, #4
@@ -2253,6 +2261,10 @@ function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldr x30, [sp], #48
+ b hevc_put_hevc_qpel_uni_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv8_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
ldr q16, [sp]
@@ -2296,6 +2308,10 @@ function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_qpel_uni_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv12_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
sub x1, x1, #8
@@ -2339,7 +2355,10 @@ function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
-.Lqpel_uni_hv16_loop:
+ b hevc_put_hevc_qpel_uni_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_hv16_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
sub w12, w9, w7, lsl #1
@@ -2414,7 +2433,7 @@ function ff_hevc_put_hevc_qpel_uni_hv32_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
- b .Lqpel_uni_hv16_loop
+ b hevc_put_hevc_qpel_uni_hv16_8_end_neon
endfunc
function ff_hevc_put_hevc_qpel_uni_hv48_8_neon_i8mm, export=1
@@ -2434,7 +2453,7 @@ function ff_hevc_put_hevc_qpel_uni_hv48_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
- b .Lqpel_uni_hv16_loop
+ b hevc_put_hevc_qpel_uni_hv16_8_end_neon
endfunc
function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, export=1
@@ -2454,7 +2473,7 @@ function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, export=1
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
- b .Lqpel_uni_hv16_loop
+ b hevc_put_hevc_qpel_uni_hv16_8_end_neon
endfunc
DISABLE_I8MM
#endif
@@ -3776,6 +3795,10 @@ function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv4_8_end_neon
load_qpel_filterh x5, x4
ldr d16, [sp]
ldr d17, [sp, x7]
@@ -3813,6 +3836,10 @@ function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv6_8_end_neon
mov x8, #120
load_qpel_filterh x5, x4
ldr q16, [sp]
@@ -3852,6 +3879,10 @@ function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv8_8_end_neon
mov x7, #128
load_qpel_filterh x5, x4
ldr q16, [sp]
@@ -3890,6 +3921,10 @@ function ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv12_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv12_8_end_neon
mov x7, #128
load_qpel_filterh x5, x4
mov x8, #112
@@ -3927,6 +3962,10 @@ function ff_hevc_put_hevc_qpel_hv16_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv16_8_end_neon
mov x7, #128
load_qpel_filterh x5, x4
ld1 {v16.8h, v17.8h}, [sp], x7
@@ -3979,6 +4018,10 @@ function ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm, export=1
bl X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
ldp x0, x3, [sp, #16]
ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv32_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_hv32_8_end_neon
mov x7, #128
load_qpel_filterh x5, x4
0: mov x8, sp // src
@@ -4127,6 +4170,10 @@ endfunc
function ff_hevc_put_hevc_qpel_uni_w_hv4_8_neon_i8mm, export=1
QPEL_UNI_W_HV_HEADER 4
+ b hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
ldr d16, [sp]
ldr d17, [sp, x10]
add sp, sp, x10, lsl #1
@@ -4217,6 +4264,10 @@ endfunc
function ff_hevc_put_hevc_qpel_uni_w_hv8_8_neon_i8mm, export=1
QPEL_UNI_W_HV_HEADER 8
+ b hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
ldr q16, [sp]
ldr q17, [sp, x10]
add sp, sp, x10, lsl #1
@@ -4327,6 +4378,10 @@ endfunc
function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
QPEL_UNI_W_HV_HEADER 16
+ b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
ldp q16, q1, [sp]
add sp, sp, x10
ldp q17, q2, [sp]
@@ -4430,6 +4485,10 @@ endfunc
function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
QPEL_UNI_W_HV_HEADER 32
+ b hevc_put_hevc_qpel_uni_w_hv32_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv32_8_end_neon
mov x11, sp
mov w12, w22
mov x13, x20
@@ -4543,6 +4602,10 @@ endfunc
function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
QPEL_UNI_W_HV_HEADER 64
+ b hevc_put_hevc_qpel_uni_w_hv64_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_uni_w_hv64_8_end_neon
mov x11, sp
mov w12, w22
mov x13, x20
@@ -4671,6 +4734,10 @@ function ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_qpel_bi_hv4_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_bi_hv4_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x7, x6
ld1 {v16.4h}, [sp], x9
@@ -4712,6 +4779,10 @@ function ff_hevc_put_hevc_qpel_bi_hv6_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_qpel_bi_hv6_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_bi_hv6_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x7, x6
sub x1, x1, #4
@@ -4758,6 +4829,10 @@ function ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm, export=1
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
+ b hevc_put_hevc_qpel_bi_hv8_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_bi_hv8_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x7, x6
ld1 {v16.8h}, [sp], x9
@@ -4822,7 +4897,10 @@ function ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm, export=1
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
mov x6, #16 // width
-.Lqpel_bi_hv16_loop:
+ b hevc_put_hevc_qpel_bi_hv16_8_end_neon
+endfunc
+
+function hevc_put_hevc_qpel_bi_hv16_8_end_neon
load_qpel_filterh x7, x8
mov x9, #(MAX_PB_SIZE * 2)
mov x10, x6
@@ -4908,7 +4986,7 @@ function ff_hevc_put_hevc_qpel_bi_hv32_8_neon_i8mm, export=1
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
mov x6, #32 // width
- b .Lqpel_bi_hv16_loop
+ b hevc_put_hevc_qpel_bi_hv16_8_end_neon
endfunc
function ff_hevc_put_hevc_qpel_bi_hv48_8_neon_i8mm, export=1
@@ -4929,7 +5007,7 @@ function ff_hevc_put_hevc_qpel_bi_hv48_8_neon_i8mm, export=1
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
mov x6, #48 // width
- b .Lqpel_bi_hv16_loop
+ b hevc_put_hevc_qpel_bi_hv16_8_end_neon
endfunc
function ff_hevc_put_hevc_qpel_bi_hv64_8_neon_i8mm, export=1
@@ -4950,7 +5028,7 @@ function ff_hevc_put_hevc_qpel_bi_hv64_8_neon_i8mm, export=1
ldp x0, x1, [sp, #32]
ldp x7, x30, [sp], #48
mov x6, #64 // width
- b .Lqpel_bi_hv16_loop
+ b hevc_put_hevc_qpel_bi_hv16_8_end_neon
endfunc
DISABLE_I8MM
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 16/21] aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
The hv32 and hv64 functions were identical; both loop over the
block and process 16 pixels at a time.
The hv16 function was nearly identical, differing only in lacking
the outer loop (and in iterating with sp instead of a separate
register).
Given the size of these functions, the extra cost of the outer
loop is negligible, so use the same function for hv16 as well.
This removes over 200 lines of duplicated assembly, and over 4 KB
of binary size.
---
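The shared loop structure described above can be pictured with the
following C sketch. It is illustrative only: strip16 is a stand-in for
the real 16-column vertical filter (here it just clips the 16-bit
intermediates), and filter_in_strips is an invented name. The point is
that a do/while over 16-pixel strips, like the "subs w27, w27, #16 ...
b.hi 3b" loop in the assembly, degenerates to a single iteration when
the width is 16, so one function can serve hv16, hv32 and hv64.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* row stride of the temporary buffer, in elements */

static uint8_t clip_u8(int v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* Stand-in for the per-strip kernel: handles exactly 16 columns. */
static void strip16(uint8_t *dst, ptrdiff_t dststride,
                    const int16_t *tmp, int height)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < 16; x++)
            dst[x] = clip_u8(tmp[x]);
        dst += dststride;
        tmp += MAX_PB_SIZE;
    }
}

/* Hypothetical driver: walk the block in 16-pixel-wide strips.
 * Returns the number of strips processed. */
static int filter_in_strips(uint8_t *dst, ptrdiff_t dststride,
                            const int16_t *tmp, int width, int height)
{
    int strips = 0;

    do {
        strip16(dst, dststride, tmp, height);
        dst   += 16;  /* next 16 output pixels */
        tmp   += 16;  /* next 16 intermediates (32 bytes) */
        width -= 16;
        strips++;
    } while (width > 0);

    return strips;
}
```

For width 16 the only extra work compared to a dedicated function is
the loop bookkeeping after the single strip, which is what the commit
message calls negligible.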
libavcodec/aarch64/hevcdsp_qpel_neon.S | 220 +------------------------
1 file changed, 3 insertions(+), 217 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index c04e8dbea8..06832603d9 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -4381,231 +4381,17 @@ function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
endfunc
-function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
- ldp q16, q1, [sp]
- add sp, sp, x10
- ldp q17, q2, [sp]
- add sp, sp, x10
- ldp q18, q3, [sp]
- add sp, sp, x10
- ldp q19, q4, [sp]
- add sp, sp, x10
- ldp q20, q5, [sp]
- add sp, sp, x10
- ldp q21, q6, [sp]
- add sp, sp, x10
- ldp q22, q7, [sp]
- add sp, sp, x10
-1:
- ldp q23, q31, [sp]
- add sp, sp, x10
- QPEL_FILTER_H v24, v16, v17, v18, v19, v20, v21, v22, v23
- QPEL_FILTER_H2 v25, v16, v17, v18, v19, v20, v21, v22, v23
- QPEL_FILTER_H v26, v1, v2, v3, v4, v5, v6, v7, v31
- QPEL_FILTER_H2 v27, v1, v2, v3, v4, v5, v6, v7, v31
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q16, q1, [sp]
- add sp, sp, x10
- QPEL_FILTER_H v24, v17, v18, v19, v20, v21, v22, v23, v16
- QPEL_FILTER_H2 v25, v17, v18, v19, v20, v21, v22, v23, v16
- QPEL_FILTER_H v26, v2, v3, v4, v5, v6, v7, v31, v1
- QPEL_FILTER_H2 v27, v2, v3, v4, v5, v6, v7, v31, v1
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q17, q2, [sp]
- add sp, sp, x10
- QPEL_FILTER_H v24, v18, v19, v20, v21, v22, v23, v16, v17
- QPEL_FILTER_H2 v25, v18, v19, v20, v21, v22, v23, v16, v17
- QPEL_FILTER_H v26, v3, v4, v5, v6, v7, v31, v1, v2
- QPEL_FILTER_H2 v27, v3, v4, v5, v6, v7, v31, v1, v2
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q18, q3, [sp]
- add sp, sp, x10
- QPEL_FILTER_H v24, v19, v20, v21, v22, v23, v16, v17, v18
- QPEL_FILTER_H2 v25, v19, v20, v21, v22, v23, v16, v17, v18
- QPEL_FILTER_H v26, v4, v5, v6, v7, v31, v1, v2, v3
- QPEL_FILTER_H2 v27, v4, v5, v6, v7, v31, v1, v2, v3
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q19, q4, [sp]
- add sp, sp, x10
- QPEL_FILTER_H v24, v20, v21, v22, v23, v16, v17, v18, v19
- QPEL_FILTER_H2 v25, v20, v21, v22, v23, v16, v17, v18, v19
- QPEL_FILTER_H v26, v5, v6, v7, v31, v1, v2, v3, v4
- QPEL_FILTER_H2 v27, v5, v6, v7, v31, v1, v2, v3, v4
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q20, q5, [sp]
- add sp, sp, x10
- QPEL_FILTER_H v24, v21, v22, v23, v16, v17, v18, v19, v20
- QPEL_FILTER_H2 v25, v21, v22, v23, v16, v17, v18, v19, v20
- QPEL_FILTER_H v26, v6, v7, v31, v1, v2, v3, v4, v5
- QPEL_FILTER_H2 v27, v6, v7, v31, v1, v2, v3, v4, v5
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q21, q6, [sp]
- add sp, sp, x10
- QPEL_FILTER_H v24, v22, v23, v16, v17, v18, v19, v20, v21
- QPEL_FILTER_H2 v25, v22, v23, v16, v17, v18, v19, v20, v21
- QPEL_FILTER_H v26, v7, v31, v1, v2, v3, v4, v5, v6
- QPEL_FILTER_H2 v27, v7, v31, v1, v2, v3, v4, v5, v6
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q22, q7, [sp]
- add sp, sp, x10
- QPEL_FILTER_H v24, v23, v16, v17, v18, v19, v20, v21, v22
- QPEL_FILTER_H2 v25, v23, v16, v17, v18, v19, v20, v21, v22
- QPEL_FILTER_H v26, v31, v1, v2, v3, v4, v5, v6, v7
- QPEL_FILTER_H2 v27, v31, v1, v2, v3, v4, v5, v6, v7
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.hi 1b
-
-2:
- QPEL_UNI_W_HV_END
- ret
-endfunc
-
-
function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
QPEL_UNI_W_HV_HEADER 32
- b hevc_put_hevc_qpel_uni_w_hv32_8_end_neon
-endfunc
-
-function hevc_put_hevc_qpel_uni_w_hv32_8_end_neon
- mov x11, sp
- mov w12, w22
- mov x13, x20
- mov x14, sp
-3:
- ldp q16, q1, [x11]
- add x11, x11, x10
- ldp q17, q2, [x11]
- add x11, x11, x10
- ldp q18, q3, [x11]
- add x11, x11, x10
- ldp q19, q4, [x11]
- add x11, x11, x10
- ldp q20, q5, [x11]
- add x11, x11, x10
- ldp q21, q6, [x11]
- add x11, x11, x10
- ldp q22, q7, [x11]
- add x11, x11, x10
-1:
- ldp q23, q31, [x11]
- add x11, x11, x10
- QPEL_FILTER_H v24, v16, v17, v18, v19, v20, v21, v22, v23
- QPEL_FILTER_H2 v25, v16, v17, v18, v19, v20, v21, v22, v23
- QPEL_FILTER_H v26, v1, v2, v3, v4, v5, v6, v7, v31
- QPEL_FILTER_H2 v27, v1, v2, v3, v4, v5, v6, v7, v31
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q16, q1, [x11]
- add x11, x11, x10
- QPEL_FILTER_H v24, v17, v18, v19, v20, v21, v22, v23, v16
- QPEL_FILTER_H2 v25, v17, v18, v19, v20, v21, v22, v23, v16
- QPEL_FILTER_H v26, v2, v3, v4, v5, v6, v7, v31, v1
- QPEL_FILTER_H2 v27, v2, v3, v4, v5, v6, v7, v31, v1
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q17, q2, [x11]
- add x11, x11, x10
- QPEL_FILTER_H v24, v18, v19, v20, v21, v22, v23, v16, v17
- QPEL_FILTER_H2 v25, v18, v19, v20, v21, v22, v23, v16, v17
- QPEL_FILTER_H v26, v3, v4, v5, v6, v7, v31, v1, v2
- QPEL_FILTER_H2 v27, v3, v4, v5, v6, v7, v31, v1, v2
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q18, q3, [x11]
- add x11, x11, x10
- QPEL_FILTER_H v24, v19, v20, v21, v22, v23, v16, v17, v18
- QPEL_FILTER_H2 v25, v19, v20, v21, v22, v23, v16, v17, v18
- QPEL_FILTER_H v26, v4, v5, v6, v7, v31, v1, v2, v3
- QPEL_FILTER_H2 v27, v4, v5, v6, v7, v31, v1, v2, v3
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q19, q4, [x11]
- add x11, x11, x10
- QPEL_FILTER_H v24, v20, v21, v22, v23, v16, v17, v18, v19
- QPEL_FILTER_H2 v25, v20, v21, v22, v23, v16, v17, v18, v19
- QPEL_FILTER_H v26, v5, v6, v7, v31, v1, v2, v3, v4
- QPEL_FILTER_H2 v27, v5, v6, v7, v31, v1, v2, v3, v4
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q20, q5, [x11]
- add x11, x11, x10
- QPEL_FILTER_H v24, v21, v22, v23, v16, v17, v18, v19, v20
- QPEL_FILTER_H2 v25, v21, v22, v23, v16, v17, v18, v19, v20
- QPEL_FILTER_H v26, v6, v7, v31, v1, v2, v3, v4, v5
- QPEL_FILTER_H2 v27, v6, v7, v31, v1, v2, v3, v4, v5
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q21, q6, [x11]
- add x11, x11, x10
- QPEL_FILTER_H v24, v22, v23, v16, v17, v18, v19, v20, v21
- QPEL_FILTER_H2 v25, v22, v23, v16, v17, v18, v19, v20, v21
- QPEL_FILTER_H v26, v7, v31, v1, v2, v3, v4, v5, v6
- QPEL_FILTER_H2 v27, v7, v31, v1, v2, v3, v4, v5, v6
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.eq 2f
-
- ldp q22, q7, [x11]
- add x11, x11, x10
- QPEL_FILTER_H v24, v23, v16, v17, v18, v19, v20, v21, v22
- QPEL_FILTER_H2 v25, v23, v16, v17, v18, v19, v20, v21, v22
- QPEL_FILTER_H v26, v31, v1, v2, v3, v4, v5, v6, v7
- QPEL_FILTER_H2 v27, v31, v1, v2, v3, v4, v5, v6, v7
- QPEL_UNI_W_HV_16
- subs w22, w22, #1
- b.hi 1b
-2:
- subs w27, w27, #16
- add x11, x14, #32
- add x20, x13, #16
- mov w22, w12
- mov x14, x11
- mov x13, x20
- b.hi 3b
- QPEL_UNI_W_HV_END
- ret
+ b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
endfunc
function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
QPEL_UNI_W_HV_HEADER 64
- b hevc_put_hevc_qpel_uni_w_hv64_8_end_neon
+ b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
endfunc
-function hevc_put_hevc_qpel_uni_w_hv64_8_end_neon
+function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
mov x11, sp
mov w12, w22
mov x13, x20
--
2.39.3 (Apple Git-146)
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 26+ messages in thread
* [FFmpeg-devel] [PATCH 17/21] aarch64: hevc: Reorder qpel_hv functions to prepare for templating
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (15 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 16/21] aarch64: hevc: Deduplicate the hevc_put_hevc_qpel_uni_w_hv*_8_end_neon functions Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 18/21] aarch64: hevc: Produce plain neon versions of qpel_hv Martin Storsjö
` (4 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
---
libavcodec/aarch64/hevcdsp_qpel_neon.S | 695 +++++++++++++------------
1 file changed, 355 insertions(+), 340 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 06832603d9..ad568e415b 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2146,29 +2146,6 @@ function ff_hevc_put_hevc_qpel_uni_w_v64_8_neon, export=1
ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
- add w10, w4, #7
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- sub x1, x2, x3, lsl #1
- sub x1, x1, x3
- add x0, sp, #48
- mov x2, x3
- add x3, x4, #7
- mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_qpel_uni_hv4_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_uni_hv4_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
@@ -2195,26 +2172,6 @@ function hevc_put_hevc_qpel_uni_hv4_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
- add w10, w4, #7
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- sub x1, x2, x3, lsl #1
- sub x1, x1, x3
- add x0, sp, #48
- mov x2, x3
- add w3, w4, #7
- mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_qpel_uni_hv6_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_uni_hv6_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
@@ -2244,26 +2201,6 @@ function hevc_put_hevc_qpel_uni_hv6_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
- add w10, w4, #7
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- sub x1, x2, x3, lsl #1
- sub x1, x1, x3
- add x0, sp, #48
- mov x2, x3
- add w3, w4, #7
- mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
- b hevc_put_hevc_qpel_uni_hv8_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_uni_hv8_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
@@ -2291,26 +2228,6 @@ function hevc_put_hevc_qpel_uni_hv8_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, export=1
- add w10, w4, #7
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- sub x1, x2, x3, lsl #1
- sub x1, x1, x3
- mov x2, x3
- add x0, sp, #48
- add w3, w4, #7
- mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_qpel_uni_hv12_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_uni_hv12_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
@@ -2338,26 +2255,6 @@ function hevc_put_hevc_qpel_uni_hv12_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, export=1
- add w10, w4, #7
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x6, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3, lsl #1
- sub x1, x1, x3
- mov x2, x3
- add w3, w4, #7
- mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
- ldp x4, x6, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_qpel_uni_hv16_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_uni_hv16_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x6, x5
@@ -2396,6 +2293,109 @@ function hevc_put_hevc_qpel_uni_hv16_8_end_neon
ret
endfunc
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
+ add w10, w4, #7
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ sub x1, x2, x3, lsl #1
+ sub x1, x1, x3
+ add x0, sp, #48
+ mov x2, x3
+ add x3, x4, #7
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_qpel_uni_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
+ add w10, w4, #7
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ sub x1, x2, x3, lsl #1
+ sub x1, x1, x3
+ add x0, sp, #48
+ mov x2, x3
+ add w3, w4, #7
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_qpel_uni_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
+ add w10, w4, #7
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ str x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ sub x1, x2, x3, lsl #1
+ sub x1, x1, x3
+ add x0, sp, #48
+ mov x2, x3
+ add w3, w4, #7
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldr x30, [sp], #48
+ b hevc_put_hevc_qpel_uni_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, export=1
+ add w10, w4, #7
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ sub x1, x2, x3, lsl #1
+ sub x1, x1, x3
+ mov x2, x3
+ add x0, sp, #48
+ add w3, w4, #7
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_qpel_uni_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, export=1
+ add w10, w4, #7
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x6, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3, lsl #1
+ sub x1, x1, x3
+ mov x2, x3
+ add w3, w4, #7
+ mov x4, x5
+ bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+ ldp x4, x6, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_qpel_uni_hv16_8_end_neon
+endfunc
+
function ff_hevc_put_hevc_qpel_uni_hv24_8_neon_i8mm, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
@@ -3779,25 +3779,10 @@ function ff_hevc_put_hevc_qpel_h64_8_neon_i8mm, export=1
b.ne 1b
ret
endfunc
+DISABLE_I8MM
+#endif
-function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
- add w10, w3, #7
- mov x7, #128
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- sub x1, x1, x2, lsl #1
- add x3, x3, #7
- sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_qpel_hv4_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_hv4_8_end_neon
load_qpel_filterh x5, x4
ldr d16, [sp]
@@ -3822,23 +3807,6 @@ function hevc_put_hevc_qpel_hv4_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
- add w10, w3, #7
- mov x7, #128
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- sub x1, x1, x2, lsl #1
- add x3, x3, #7
- sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_qpel_hv6_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_hv6_8_end_neon
mov x8, #120
load_qpel_filterh x5, x4
@@ -3866,22 +3834,6 @@ function hevc_put_hevc_qpel_hv6_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
- add w10, w3, #7
- lsl x10, x10, #7
- sub x1, x1, x2, lsl #1
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- add x3, x3, #7
- sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_qpel_hv8_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_hv8_8_end_neon
mov x7, #128
load_qpel_filterh x5, x4
@@ -3908,22 +3860,6 @@ function hevc_put_hevc_qpel_hv8_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm, export=1
- add w10, w3, #7
- lsl x10, x10, #7
- sub x1, x1, x2, lsl #1
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
- add x3, x3, #7
- sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_qpel_hv12_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_hv12_8_end_neon
mov x7, #128
load_qpel_filterh x5, x4
@@ -3949,22 +3885,6 @@ function hevc_put_hevc_qpel_hv12_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_hv16_8_neon_i8mm, export=1
- add w10, w3, #7
- lsl x10, x10, #7
- sub x1, x1, x2, lsl #1
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x3, x3, #7
- add x0, sp, #32
- sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_qpel_hv16_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_hv16_8_end_neon
mov x7, #128
load_qpel_filterh x5, x4
@@ -3989,38 +3909,6 @@ function hevc_put_hevc_qpel_hv16_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm, export=1
- stp x4, x5, [sp, #-64]!
- stp x2, x3, [sp, #16]
- stp x0, x1, [sp, #32]
- str x30, [sp, #48]
- bl X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
- ldp x0, x1, [sp, #32]
- ldp x2, x3, [sp, #16]
- ldp x4, x5, [sp], #48
- add x1, x1, #12
- add x0, x0, #24
- bl X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
- ldr x30, [sp], #16
- ret
-endfunc
-
-function ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm, export=1
- add w10, w3, #7
- sub x1, x1, x2, lsl #1
- lsl x10, x10, #7
- sub x1, x1, x2
- sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x3, x3, #7
- add x0, sp, #32
- bl X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
- b hevc_put_hevc_qpel_hv32_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_hv32_8_end_neon
mov x7, #128
load_qpel_filterh x5, x4
@@ -4056,6 +3944,122 @@ function hevc_put_hevc_qpel_hv32_8_end_neon
ret
endfunc
+#if HAVE_I8MM
+ENABLE_I8MM
+function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
+ add w10, w3, #7
+ mov x7, #128
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ sub x1, x1, x2, lsl #1
+ add x3, x3, #7
+ sub x1, x1, x2
+ bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
+ add w10, w3, #7
+ mov x7, #128
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ sub x1, x1, x2, lsl #1
+ add x3, x3, #7
+ sub x1, x1, x2
+ bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
+ add w10, w3, #7
+ lsl x10, x10, #7
+ sub x1, x1, x2, lsl #1
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ add x3, x3, #7
+ sub x1, x1, x2
+ bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm, export=1
+ add w10, w3, #7
+ lsl x10, x10, #7
+ sub x1, x1, x2, lsl #1
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x0, sp, #32
+ add x3, x3, #7
+ sub x1, x1, x2
+ bl X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv12_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv16_8_neon_i8mm, export=1
+ add w10, w3, #7
+ lsl x10, x10, #7
+ sub x1, x1, x2, lsl #1
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x3, x3, #7
+ add x0, sp, #32
+ sub x1, x1, x2
+ bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm, export=1
+ stp x4, x5, [sp, #-64]!
+ stp x2, x3, [sp, #16]
+ stp x0, x1, [sp, #32]
+ str x30, [sp, #48]
+ bl X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
+ ldp x0, x1, [sp, #32]
+ ldp x2, x3, [sp, #16]
+ ldp x4, x5, [sp], #48
+ add x1, x1, #12
+ add x0, x0, #24
+ bl X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
+ ldr x30, [sp], #16
+ ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm, export=1
+ add w10, w3, #7
+ sub x1, x1, x2, lsl #1
+ lsl x10, x10, #7
+ sub x1, x1, x2
+ sub sp, sp, x10 // tmp_array
+ stp x5, x30, [sp, #-32]!
+ stp x0, x3, [sp, #16]
+ add x3, x3, #7
+ add x0, sp, #32
+ bl X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #32
+ b hevc_put_hevc_qpel_hv32_8_end_neon
+endfunc
+
function ff_hevc_put_hevc_qpel_hv48_8_neon_i8mm, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
@@ -4089,6 +4093,8 @@ function ff_hevc_put_hevc_qpel_hv64_8_neon_i8mm, export=1
ldr x30, [sp], #16
ret
endfunc
+DISABLE_I8MM
+#endif
.macro QPEL_UNI_W_HV_HEADER width
ldp x14, x15, [sp] // mx, my
@@ -4168,11 +4174,6 @@ endfunc
smlal2 \dst\().4s, \src7\().8h, v0.h[7]
.endm
-function ff_hevc_put_hevc_qpel_uni_w_hv4_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 4
- b hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
ldr d16, [sp]
ldr d17, [sp, x10]
@@ -4262,11 +4263,6 @@ endfunc
st1 {v24.d}[0], [x20], x21
.endm
-function ff_hevc_put_hevc_qpel_uni_w_hv8_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 8
- b hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
ldr q16, [sp]
ldr q17, [sp, x10]
@@ -4376,21 +4372,6 @@ endfunc
st1 {v24.16b}, [x20], x21
.endm
-function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 16
- b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
-endfunc
-
-function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 32
- b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
-endfunc
-
-function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 64
- b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
mov x11, sp
mov w12, w22
@@ -4503,26 +4484,37 @@ function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
ret
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm, export=1
- add w10, w5, #7
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- sub x1, x2, x3, lsl #1
- sub x1, x1, x3
- add x0, sp, #48
- mov x2, x3
- add w3, w5, #7
- mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_qpel_bi_hv4_8_end_neon
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_qpel_uni_w_hv4_8_neon_i8mm, export=1
+ QPEL_UNI_W_HV_HEADER 4
+ b hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_hv8_8_neon_i8mm, export=1
+ QPEL_UNI_W_HV_HEADER 8
+ b hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
+ QPEL_UNI_W_HV_HEADER 16
+ b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
+ QPEL_UNI_W_HV_HEADER 32
+ b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
endfunc
+function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
+ QPEL_UNI_W_HV_HEADER 64
+ b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
+endfunc
+
+DISABLE_I8MM
+#endif
+
function hevc_put_hevc_qpel_bi_hv4_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x7, x6
@@ -4548,26 +4540,6 @@ function hevc_put_hevc_qpel_bi_hv4_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv6_8_neon_i8mm, export=1
- add w10, w5, #7
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- sub x1, x2, x3, lsl #1
- sub x1, x1, x3
- add x0, sp, #48
- mov x2, x3
- add x3, x5, #7
- mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_qpel_bi_hv6_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_bi_hv6_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x7, x6
@@ -4598,26 +4570,6 @@ function hevc_put_hevc_qpel_bi_hv6_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm, export=1
- add w10, w5, #7
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- sub x1, x2, x3, lsl #1
- sub x1, x1, x3
- add x0, sp, #48
- mov x2, x3
- add x3, x5, #7
- mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- b hevc_put_hevc_qpel_bi_hv8_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_bi_hv8_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x7, x6
@@ -4646,46 +4598,6 @@ function hevc_put_hevc_qpel_bi_hv8_8_end_neon
2: ret
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv12_8_neon_i8mm, export=1
- stp x6, x7, [sp, #-80]!
- stp x4, x5, [sp, #16]
- stp x2, x3, [sp, #32]
- stp x0, x1, [sp, #48]
- str x30, [sp, #64]
- bl X(ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x2, x3, [sp, #32]
- ldp x0, x1, [sp, #48]
- ldp x6, x7, [sp], #64
- add x4, x4, #16
- add x2, x2, #8
- add x0, x0, #8
- bl X(ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm)
- ldr x30, [sp], #16
- ret
-endfunc
-
-function ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm, export=1
- add w10, w5, #7
- lsl x10, x10, #7
- sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
- stp x4, x5, [sp, #16]
- stp x0, x1, [sp, #32]
- add x0, sp, #48
- sub x1, x2, x3, lsl #1
- sub x1, x1, x3
- mov x2, x3
- add w3, w5, #7
- mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
- ldp x4, x5, [sp, #16]
- ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
- mov x6, #16 // width
- b hevc_put_hevc_qpel_bi_hv16_8_end_neon
-endfunc
-
function hevc_put_hevc_qpel_bi_hv16_8_end_neon
load_qpel_filterh x7, x8
mov x9, #(MAX_PB_SIZE * 2)
@@ -4735,6 +4647,109 @@ function hevc_put_hevc_qpel_bi_hv16_8_end_neon
ret
endfunc
+#if HAVE_I8MM
+ENABLE_I8MM
+
+function ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm, export=1
+ add w10, w5, #7
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ sub x1, x2, x3, lsl #1
+ sub x1, x1, x3
+ add x0, sp, #48
+ mov x2, x3
+ add w3, w5, #7
+ mov x4, x6
+ bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_qpel_bi_hv4_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_bi_hv6_8_neon_i8mm, export=1
+ add w10, w5, #7
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ sub x1, x2, x3, lsl #1
+ sub x1, x1, x3
+ add x0, sp, #48
+ mov x2, x3
+ add x3, x5, #7
+ mov x4, x6
+ bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_qpel_bi_hv6_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm, export=1
+ add w10, w5, #7
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ sub x1, x2, x3, lsl #1
+ sub x1, x1, x3
+ add x0, sp, #48
+ mov x2, x3
+ add x3, x5, #7
+ mov x4, x6
+ bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ b hevc_put_hevc_qpel_bi_hv8_8_end_neon
+endfunc
+
+function ff_hevc_put_hevc_qpel_bi_hv12_8_neon_i8mm, export=1
+ stp x6, x7, [sp, #-80]!
+ stp x4, x5, [sp, #16]
+ stp x2, x3, [sp, #32]
+ stp x0, x1, [sp, #48]
+ str x30, [sp, #64]
+ bl X(ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x2, x3, [sp, #32]
+ ldp x0, x1, [sp, #48]
+ ldp x6, x7, [sp], #64
+ add x4, x4, #16
+ add x2, x2, #8
+ add x0, x0, #8
+ bl X(ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm)
+ ldr x30, [sp], #16
+ ret
+endfunc
+
+function ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm, export=1
+ add w10, w5, #7
+ lsl x10, x10, #7
+ sub sp, sp, x10 // tmp_array
+ stp x7, x30, [sp, #-48]!
+ stp x4, x5, [sp, #16]
+ stp x0, x1, [sp, #32]
+ add x0, sp, #48
+ sub x1, x2, x3, lsl #1
+ sub x1, x1, x3
+ mov x2, x3
+ add w3, w5, #7
+ mov x4, x6
+ bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+ ldp x4, x5, [sp, #16]
+ ldp x0, x1, [sp, #32]
+ ldp x7, x30, [sp], #48
+ mov x6, #16 // width
+ b hevc_put_hevc_qpel_bi_hv16_8_end_neon
+endfunc
+
function ff_hevc_put_hevc_qpel_bi_hv24_8_neon_i8mm, export=1
stp x6, x7, [sp, #-80]!
stp x4, x5, [sp, #16]
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 18/21] aarch64: hevc: Produce plain neon versions of qpel_hv
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (16 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 17/21] aarch64: hevc: Reorder qpel_hv functions to prepare for templating Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 19/21] aarch64: hevc: Produce plain neon versions of qpel_uni_hv Martin Storsjö
` (3 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.
With storage for h+8 rows allocated, incrementing the stack
pointer no longer ends up at the right spot at the end. Instead,
keep the intended final stack pointer value in register x14, save
it on the stack across the call to the horizontal function, and
restore sp from it in the end functions.
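To make the arithmetic above concrete, here is a minimal C sketch (not part of the patch). It assumes 128-byte rows in tmp_array, as implied by the "lsl x10, x10, #7" in the assembly, and that the vertical pass advances sp past the h+7 rows it reads. All function names are hypothetical and only model the bookkeeping, not the filtering itself.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes per row of the intermediate buffer: 64 int16_t samples,
 * matching the "lsl x10, x10, #7" in the assembly (assumption). */
#define TMP_ROW_BYTES 128

/* Rows the 8-tap vertical filter reads to produce `height` output rows. */
static size_t rows_needed(int height)
{
    assert(height > 0);
    return (size_t)height + 7;
}

/* Rows written by a horizontal pass that handles `rows_per_iter` rows per
 * loop iteration (hypothetical model: 2 for the plain NEON qpel_h). */
static size_t rows_written(int height, int rows_per_iter)
{
    size_t step = (size_t)rows_per_iter;
    size_t n = rows_needed(height);
    return (n + step - 1) / step * step;   /* round up to a multiple of step */
}

/* Size of tmp_array as allocated by the patched code: h+8 rows. */
static size_t tmp_array_bytes(int height)
{
    return ((size_t)height + 8) * TMP_ROW_BYTES;
}

/* Modelled stack pointer after the vertical pass has stepped through the
 * h+7 rows it reads, starting from the base of tmp_array. */
static size_t sp_after_vertical_pass(size_t sp_entry, int height)
{
    size_t tmp_base = sp_entry - tmp_array_bytes(height);
    return tmp_base + rows_needed(height) * TMP_ROW_BYTES;
}
```

For h = 4, this model has the horizontal pass writing 12 rows rather than 11, and the stack pointer ends up 128 bytes below its entry value after the vertical pass, which is why the end functions restore sp from x14 instead of relying on the increments.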
AWS Graviton 3:
put_hevc_qpel_hv4_8_c: 386.0
put_hevc_qpel_hv4_8_neon: 125.7
put_hevc_qpel_hv4_8_i8mm: 83.2
put_hevc_qpel_hv6_8_c: 749.0
put_hevc_qpel_hv6_8_neon: 207.0
put_hevc_qpel_hv6_8_i8mm: 166.0
put_hevc_qpel_hv8_8_c: 1305.2
put_hevc_qpel_hv8_8_neon: 216.5
put_hevc_qpel_hv8_8_i8mm: 213.0
put_hevc_qpel_hv12_8_c: 2570.5
put_hevc_qpel_hv12_8_neon: 480.0
put_hevc_qpel_hv12_8_i8mm: 398.2
put_hevc_qpel_hv16_8_c: 4158.7
put_hevc_qpel_hv16_8_neon: 659.7
put_hevc_qpel_hv16_8_i8mm: 593.5
put_hevc_qpel_hv24_8_c: 8626.7
put_hevc_qpel_hv24_8_neon: 1653.5
put_hevc_qpel_hv24_8_i8mm: 1398.7
put_hevc_qpel_hv32_8_c: 14646.0
put_hevc_qpel_hv32_8_neon: 2566.2
put_hevc_qpel_hv32_8_i8mm: 2287.5
put_hevc_qpel_hv48_8_c: 31072.5
put_hevc_qpel_hv48_8_neon: 6228.5
put_hevc_qpel_hv48_8_i8mm: 5291.0
put_hevc_qpel_hv64_8_c: 53847.2
put_hevc_qpel_hv64_8_neon: 9856.7
put_hevc_qpel_hv64_8_i8mm: 8831.0
---
libavcodec/aarch64/hevcdsp_init_aarch64.c | 6 +
libavcodec/aarch64/hevcdsp_qpel_neon.S | 166 +++++++++++++---------
2 files changed, 104 insertions(+), 68 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index ea0d26c019..105c26017b 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -265,6 +265,10 @@ NEON8_FNPROTO(qpel_v, (int16_t *dst,
const uint8_t *src, ptrdiff_t srcstride,
int height, intptr_t mx, intptr_t my, int width),);
+NEON8_FNPROTO(qpel_hv, (int16_t *dst,
+ const uint8_t *src, ptrdiff_t srcstride,
+ int height, intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO(qpel_hv, (int16_t *dst,
const uint8_t *src, ptrdiff_t srcstride,
int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -436,6 +440,8 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN_SHARED_32(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h,);
+ NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
+
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
NEON8_FNASSIGN(c->put_hevc_epel, 1, 1, epel_hv, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index ad568e415b..7bffb991a7 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -3804,7 +3804,8 @@ function hevc_put_hevc_qpel_hv4_8_end_neon
.endm
1: calc_all
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_hv6_8_end_neon
@@ -3831,7 +3832,8 @@ function hevc_put_hevc_qpel_hv6_8_end_neon
.endm
1: calc_all
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_hv8_8_end_neon
@@ -3857,7 +3859,8 @@ function hevc_put_hevc_qpel_hv8_8_end_neon
.endm
1: calc_all
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_hv12_8_end_neon
@@ -3882,7 +3885,8 @@ function hevc_put_hevc_qpel_hv12_8_end_neon
.endm
1: calc_all2
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_hv16_8_end_neon
@@ -3906,7 +3910,8 @@ function hevc_put_hevc_qpel_hv16_8_end_neon
.endm
1: calc_all2
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_hv32_8_end_neon
@@ -3937,162 +3942,187 @@ function hevc_put_hevc_qpel_hv32_8_end_neon
add sp, sp, #32
subs w6, w6, #16
b.hi 0b
- add w10, w3, #6
- add sp, sp, #64 // discard rest of first line
- lsl x10, x10, #7
- add sp, sp, x10 // tmp_array without first line
+ mov sp, x14
ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
-function ff_hevc_put_hevc_qpel_hv4_8_neon_i8mm, export=1
- add w10, w3, #7
+.macro qpel_hv suffix
+function ff_hevc_put_hevc_qpel_hv4_8_\suffix, export=1
+ add w10, w3, #8
mov x7, #128
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
+ stp x5, x30, [sp, #-48]!
+ stp x0, x3, [sp, #16]
+ str x14, [sp, #32]
+ add x0, sp, #48
sub x1, x1, x2, lsl #1
add x3, x3, #7
sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
+ bl X(ff_hevc_put_hevc_qpel_h4_8_\suffix)
+ ldr x14, [sp, #32]
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #48
b hevc_put_hevc_qpel_hv4_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_hv6_8_neon_i8mm, export=1
- add w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv6_8_\suffix, export=1
+ add w10, w3, #8
mov x7, #128
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
+ stp x5, x30, [sp, #-48]!
+ stp x0, x3, [sp, #16]
+ str x14, [sp, #32]
+ add x0, sp, #48
sub x1, x1, x2, lsl #1
add x3, x3, #7
sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
+ bl X(ff_hevc_put_hevc_qpel_h6_8_\suffix)
+ ldr x14, [sp, #32]
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #48
b hevc_put_hevc_qpel_hv6_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_hv8_8_neon_i8mm, export=1
- add w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv8_8_\suffix, export=1
+ add w10, w3, #8
lsl x10, x10, #7
sub x1, x1, x2, lsl #1
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
+ stp x5, x30, [sp, #-48]!
+ stp x0, x3, [sp, #16]
+ str x14, [sp, #32]
+ add x0, sp, #48
add x3, x3, #7
sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
+ bl X(ff_hevc_put_hevc_qpel_h8_8_\suffix)
+ ldr x14, [sp, #32]
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #48
b hevc_put_hevc_qpel_hv8_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm, export=1
- add w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv12_8_\suffix, export=1
+ add w10, w3, #8
lsl x10, x10, #7
sub x1, x1, x2, lsl #1
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
- add x0, sp, #32
+ stp x5, x30, [sp, #-48]!
+ stp x0, x3, [sp, #16]
+ str x14, [sp, #32]
+ add x0, sp, #48
add x3, x3, #7
sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
+ mov w6, #12
+ bl X(ff_hevc_put_hevc_qpel_h12_8_\suffix)
+ ldr x14, [sp, #32]
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #48
b hevc_put_hevc_qpel_hv12_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_hv16_8_neon_i8mm, export=1
- add w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv16_8_\suffix, export=1
+ add w10, w3, #8
lsl x10, x10, #7
sub x1, x1, x2, lsl #1
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
+ stp x5, x30, [sp, #-48]!
+ stp x0, x3, [sp, #16]
+ str x14, [sp, #32]
add x3, x3, #7
- add x0, sp, #32
+ add x0, sp, #48
sub x1, x1, x2
- bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
+ bl X(ff_hevc_put_hevc_qpel_h16_8_\suffix)
+ ldr x14, [sp, #32]
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #48
b hevc_put_hevc_qpel_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_hv24_8_\suffix, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
stp x0, x1, [sp, #32]
str x30, [sp, #48]
- bl X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_hv12_8_\suffix)
ldp x0, x1, [sp, #32]
ldp x2, x3, [sp, #16]
ldp x4, x5, [sp], #48
add x1, x1, #12
add x0, x0, #24
- bl X(ff_hevc_put_hevc_qpel_hv12_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_hv12_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
-function ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm, export=1
- add w10, w3, #7
+function ff_hevc_put_hevc_qpel_hv32_8_\suffix, export=1
+ add w10, w3, #8
sub x1, x1, x2, lsl #1
lsl x10, x10, #7
sub x1, x1, x2
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x5, x30, [sp, #-32]!
- stp x0, x3, [sp, #16]
+ stp x5, x30, [sp, #-48]!
+ stp x0, x3, [sp, #16]
+ str x14, [sp, #32]
add x3, x3, #7
- add x0, sp, #32
- bl X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
- ldp x0, x3, [sp, #16]
- ldp x5, x30, [sp], #32
+ add x0, sp, #48
+ mov w6, #32
+ bl X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+ ldr x14, [sp, #32]
+ ldp x0, x3, [sp, #16]
+ ldp x5, x30, [sp], #48
b hevc_put_hevc_qpel_hv32_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_hv48_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_hv48_8_\suffix, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
stp x0, x1, [sp, #32]
str x30, [sp, #48]
- bl X(ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_hv24_8_\suffix)
ldp x0, x1, [sp, #32]
ldp x2, x3, [sp, #16]
ldp x4, x5, [sp], #48
add x1, x1, #24
add x0, x0, #48
- bl X(ff_hevc_put_hevc_qpel_hv24_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_hv24_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
-function ff_hevc_put_hevc_qpel_hv64_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_hv64_8_\suffix, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
stp x0, x1, [sp, #32]
str x30, [sp, #48]
mov x6, #32
- bl X(ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_hv32_8_\suffix)
ldp x0, x1, [sp, #32]
ldp x2, x3, [sp, #16]
ldp x4, x5, [sp], #48
add x1, x1, #32
add x0, x0, #64
mov x6, #32
- bl X(ff_hevc_put_hevc_qpel_hv32_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_hv32_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
+.endm
+
+qpel_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+qpel_hv neon_i8mm
+
DISABLE_I8MM
#endif
--
2.39.3 (Apple Git-146)
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
* [FFmpeg-devel] [PATCH 19/21] aarch64: hevc: Produce plain neon versions of qpel_uni_hv
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (17 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 18/21] aarch64: hevc: Produce plain neon versions of qpel_hv Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 20/21] aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv Martin Storsjö
` (2 subsequent siblings)
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.
With storage for h+8 rows, the stack pointer no longer ends up
at the right spot if it is only incremented past the rows that
are consumed. Keep the intended final stack pointer value in
register x14, which we save on the stack across the call.
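The sizing argument above can be sketched in C. This is a rough illustration, not code from the patch: the helper names are hypothetical, and MAX_PB_SIZE = 64 with 16-bit intermediates is an assumption (it matches the 128-byte row stride implied by `lsl #7` in the assembly).

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* assumed; gives the 128-byte row stride (lsl #7) */

/* Bytes per row of the 16-bit intermediate buffer. */
static size_t tmp_row_bytes(void)
{
    return MAX_PB_SIZE * sizeof(int16_t);
}

/* Rows actually written by a horizontal pass that is asked for `rows`
 * rows but always produces them `step` at a time (hypothetical helper). */
static int rows_written(int rows, int step)
{
    return (rows + step - 1) / step * step;
}

/* Bytes to reserve for a block of height h: the vertical 8-tap filter
 * needs h + 7 input rows, rounded up to what the pass really writes. */
static size_t tmp_array_bytes(int h, int step)
{
    return (size_t)rows_written(h + 7, step) * tmp_row_bytes();
}

/* Bytes by which an unwind of exactly h + 7 rows falls short of the
 * reserved size. */
static size_t unwind_shortfall(int h, int step)
{
    return tmp_array_bytes(h, step) - (size_t)(h + 7) * tmp_row_bytes();
}
```

For an even height such as h = 8, a pass that writes two rows at a time produces 16 rows (h+8), so an unwind that advances sp by only h+7 rows would stop 128 bytes short. Restoring sp from x14 avoids depending on that arithmetic.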
AWS Graviton 3:
put_hevc_qpel_uni_hv4_8_c: 384.2
put_hevc_qpel_uni_hv4_8_neon: 127.5
put_hevc_qpel_uni_hv4_8_i8mm: 85.5
put_hevc_qpel_uni_hv6_8_c: 705.5
put_hevc_qpel_uni_hv6_8_neon: 224.5
put_hevc_qpel_uni_hv6_8_i8mm: 176.2
put_hevc_qpel_uni_hv8_8_c: 1136.5
put_hevc_qpel_uni_hv8_8_neon: 216.5
put_hevc_qpel_uni_hv8_8_i8mm: 214.0
put_hevc_qpel_uni_hv12_8_c: 2259.5
put_hevc_qpel_uni_hv12_8_neon: 498.5
put_hevc_qpel_uni_hv12_8_i8mm: 410.7
put_hevc_qpel_uni_hv16_8_c: 3824.7
put_hevc_qpel_uni_hv16_8_neon: 670.0
put_hevc_qpel_uni_hv16_8_i8mm: 603.7
put_hevc_qpel_uni_hv24_8_c: 8113.5
put_hevc_qpel_uni_hv24_8_neon: 1474.7
put_hevc_qpel_uni_hv24_8_i8mm: 1351.5
put_hevc_qpel_uni_hv32_8_c: 14744.5
put_hevc_qpel_uni_hv32_8_neon: 2599.7
put_hevc_qpel_uni_hv32_8_i8mm: 2266.0
put_hevc_qpel_uni_hv48_8_c: 32800.0
put_hevc_qpel_uni_hv48_8_neon: 5650.0
put_hevc_qpel_uni_hv48_8_i8mm: 5011.7
put_hevc_qpel_uni_hv64_8_c: 57856.2
put_hevc_qpel_uni_hv64_8_neon: 9863.5
put_hevc_qpel_uni_hv64_8_i8mm: 8767.7
---
libavcodec/aarch64/hevcdsp_init_aarch64.c | 5 +
libavcodec/aarch64/hevcdsp_qpel_neon.S | 156 ++++++++++++++--------
2 files changed, 102 insertions(+), 59 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 105c26017b..0531db027b 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -277,6 +277,10 @@ NEON8_FNPROTO(qpel_uni_v, (uint8_t *dst, ptrdiff_t dststride,
const uint8_t *src, ptrdiff_t srcstride,
int height, intptr_t mx, intptr_t my, int width),);
+NEON8_FNPROTO(qpel_uni_hv, (uint8_t *dst, ptrdiff_t dststride,
+ const uint8_t *src, ptrdiff_t srcstride,
+ int height, intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO(qpel_uni_hv, (uint8_t *dst, ptrdiff_t dststride,
const uint8_t *src, ptrdiff_t srcstride,
int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -441,6 +445,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN_SHARED_32(c->put_hevc_qpel_uni_w, 0, 1, qpel_uni_w_h,);
NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
+ NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv,);
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index 7bffb991a7..f285ab7461 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -2169,7 +2169,8 @@ function hevc_put_hevc_qpel_uni_hv4_8_end_neon
.endm
1: calc_all
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_uni_hv6_8_end_neon
@@ -2198,7 +2199,8 @@ function hevc_put_hevc_qpel_uni_hv6_8_end_neon
.endm
1: calc_all
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_uni_hv8_8_end_neon
@@ -2225,7 +2227,8 @@ function hevc_put_hevc_qpel_uni_hv8_8_end_neon
.endm
1: calc_all
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_uni_hv12_8_end_neon
@@ -2252,7 +2255,8 @@ function hevc_put_hevc_qpel_uni_hv12_8_end_neon
.endm
1: calc_all2
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_uni_hv16_8_end_neon
@@ -2286,21 +2290,17 @@ function hevc_put_hevc_qpel_uni_hv16_8_end_neon
add sp, sp, #32
subs w7, w7, #16
b.ne 0b
- add w10, w4, #6
- add sp, sp, x12 // discard rest of first line
- lsl x10, x10, #7
- add sp, sp, x10 // tmp_array without first line
+ mov sp, x14
ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
- add w10, w4, #7
+.macro qpel_uni_hv suffix
+function ff_hevc_put_hevc_qpel_uni_hv4_8_\suffix, export=1
+ add w10, w4, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
+ stp x30, x14,[sp, #-48]!
stp x4, x6, [sp, #16]
stp x0, x1, [sp, #32]
sub x1, x2, x3, lsl #1
@@ -2309,18 +2309,19 @@ function ff_hevc_put_hevc_qpel_uni_hv4_8_neon_i8mm, export=1
mov x2, x3
add x3, x4, #7
mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_h4_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
+ ldp x30, x14, [sp], #48
b hevc_put_hevc_qpel_uni_hv4_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
- add w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv6_8_\suffix, export=1
+ add w10, w4, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
+ stp x30, x14,[sp, #-48]!
stp x4, x6, [sp, #16]
stp x0, x1, [sp, #32]
sub x1, x2, x3, lsl #1
@@ -2329,18 +2330,19 @@ function ff_hevc_put_hevc_qpel_uni_hv6_8_neon_i8mm, export=1
mov x2, x3
add w3, w4, #7
mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_h6_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
+ ldp x30, x14, [sp], #48
b hevc_put_hevc_qpel_uni_hv6_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
- add w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv8_8_\suffix, export=1
+ add w10, w4, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- str x30, [sp, #-48]!
+ stp x30, x14,[sp, #-48]!
stp x4, x6, [sp, #16]
stp x0, x1, [sp, #32]
sub x1, x2, x3, lsl #1
@@ -2349,60 +2351,67 @@ function ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm, export=1
mov x2, x3
add w3, w4, #7
mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_h8_8_\suffix)
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
- ldr x30, [sp], #48
+ ldp x30, x14, [sp], #48
b hevc_put_hevc_qpel_uni_hv8_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv12_8_neon_i8mm, export=1
- add w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv12_8_\suffix, export=1
+ add w10, w4, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x6, [sp, #16]
stp x0, x1, [sp, #32]
+ str x14, [sp, #48]
sub x1, x2, x3, lsl #1
sub x1, x1, x3
mov x2, x3
- add x0, sp, #48
+ add x0, sp, #64
add w3, w4, #7
mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h12_8_neon_i8mm)
+ mov w6, #12
+ bl X(ff_hevc_put_hevc_qpel_h12_8_\suffix)
+ ldr x14, [sp, #48]
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldp x7, x30, [sp], #64
b hevc_put_hevc_qpel_uni_hv12_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm, export=1
- add w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv16_8_\suffix, export=1
+ add w10, w4, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x6, [sp, #16]
stp x0, x1, [sp, #32]
- add x0, sp, #48
+ str x14, [sp, #48]
+ add x0, sp, #64
sub x1, x2, x3, lsl #1
sub x1, x1, x3
mov x2, x3
add w3, w4, #7
mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_h16_8_\suffix)
+ ldr x14, [sp, #48]
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldp x7, x30, [sp], #64
b hevc_put_hevc_qpel_uni_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_uni_hv24_8_\suffix, export=1
stp x4, x5, [sp, #-64]!
stp x2, x3, [sp, #16]
stp x0, x1, [sp, #32]
stp x6, x30, [sp, #48]
mov x7, #16
- bl X(ff_hevc_put_hevc_qpel_uni_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_uni_hv16_8_\suffix)
ldp x2, x3, [sp, #16]
add x2, x2, #16
ldp x0, x1, [sp, #32]
@@ -2410,71 +2419,100 @@ function ff_hevc_put_hevc_qpel_uni_hv24_8_neon_i8mm, export=1
mov x7, #8
add x0, x0, #16
ldr x6, [sp]
- bl X(ff_hevc_put_hevc_qpel_uni_hv8_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_uni_hv8_8_\suffix)
ldr x30, [sp, #8]
add sp, sp, #16
ret
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv32_8_neon_i8mm, export=1
- add w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv32_8_\suffix, export=1
+ add w10, w4, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x6, [sp, #16]
stp x0, x1, [sp, #32]
+ str x14, [sp, #48]
sub x1, x2, x3, lsl #1
- add x0, sp, #48
+ add x0, sp, #64
sub x1, x1, x3
mov x2, x3
add w3, w4, #7
mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
+ mov w6, #32
+ bl X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+ ldr x14, [sp, #48]
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldp x7, x30, [sp], #64
b hevc_put_hevc_qpel_uni_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv48_8_neon_i8mm, export=1
- add w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv48_8_\suffix, export=1
+ add w10, w4, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x6, [sp, #16]
stp x0, x1, [sp, #32]
+ str x14, [sp, #48]
sub x1, x2, x3, lsl #1
sub x1, x1, x3
mov x2, x3
- add x0, sp, #48
+ add x0, sp, #64
add w3, w4, #7
mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h48_8_neon_i8mm)
+.ifc \suffix, neon
+ mov w6, #48
+ bl X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+.else
+ bl X(ff_hevc_put_hevc_qpel_h48_8_\suffix)
+.endif
+ ldr x14, [sp, #48]
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldp x7, x30, [sp], #64
b hevc_put_hevc_qpel_uni_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_hv64_8_neon_i8mm, export=1
- add w10, w4, #7
+function ff_hevc_put_hevc_qpel_uni_hv64_8_\suffix, export=1
+ add w10, w4, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x6, [sp, #16]
stp x0, x1, [sp, #32]
- add x0, sp, #48
+ str x14, [sp, #48]
+ add x0, sp, #64
sub x1, x2, x3, lsl #1
mov x2, x3
sub x1, x1, x3
add w3, w4, #7
mov x4, x5
- bl X(ff_hevc_put_hevc_qpel_h64_8_neon_i8mm)
+.ifc \suffix, neon
+ mov w6, #64
+ bl X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+.else
+ bl X(ff_hevc_put_hevc_qpel_h64_8_\suffix)
+.endif
+ ldr x14, [sp, #48]
ldp x4, x6, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldp x7, x30, [sp], #64
b hevc_put_hevc_qpel_uni_hv16_8_end_neon
endfunc
+.endm
+
+qpel_uni_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+qpel_uni_hv neon_i8mm
+
DISABLE_I8MM
#endif
--
2.39.3 (Apple Git-146)
* [FFmpeg-devel] [PATCH 20/21] aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (18 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 19/21] aarch64: hevc: Produce plain neon versions of qpel_uni_hv Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 21/21] aarch64: hevc: Produce plain neon versions of qpel_bi_hv Martin Storsjö
2024-03-25 21:15 ` [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.
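For these functions the temporary buffer has a fixed size rather than one derived from the height, so the change shows up as a new constant in QPEL_UNI_W_HV_HEADER. A small C check of that arithmetic follows; MAX_PB_SIZE is the macro name used in the patch, but its value of 64 and the helper name are assumptions made here.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_PB_SIZE 64  /* assumed value */

/* Size in bytes of a buffer of 16-bit samples with MAX_PB_SIZE samples
 * per row and MAX_PB_SIZE + extra_rows rows (hypothetical helper). */
static size_t hv_tmp_bytes(int extra_rows)
{
    return (size_t)MAX_PB_SIZE * (MAX_PB_SIZE + extra_rows) * sizeof(int16_t);
}
```

With 7 extra rows this gives 9088, the literal that the patch removes. With 8 extra rows it gives 9216, which is what `MAX_PB_SIZE*(MAX_PB_SIZE+8)*2` evaluates to under the same assumption.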
AWS Graviton 3:
put_hevc_qpel_uni_w_hv4_8_c: 422.2
put_hevc_qpel_uni_w_hv4_8_neon: 140.7
put_hevc_qpel_uni_w_hv4_8_i8mm: 100.7
put_hevc_qpel_uni_w_hv8_8_c: 1208.0
put_hevc_qpel_uni_w_hv8_8_neon: 268.2
put_hevc_qpel_uni_w_hv8_8_i8mm: 261.5
put_hevc_qpel_uni_w_hv16_8_c: 4297.2
put_hevc_qpel_uni_w_hv16_8_neon: 802.2
put_hevc_qpel_uni_w_hv16_8_i8mm: 731.2
put_hevc_qpel_uni_w_hv32_8_c: 15518.5
put_hevc_qpel_uni_w_hv32_8_neon: 3085.2
put_hevc_qpel_uni_w_hv32_8_i8mm: 2783.2
put_hevc_qpel_uni_w_hv64_8_c: 57254.5
put_hevc_qpel_uni_w_hv64_8_neon: 11787.5
put_hevc_qpel_uni_w_hv64_8_i8mm: 10659.0
---
libavcodec/aarch64/hevcdsp_init_aarch64.c | 6 +++
libavcodec/aarch64/hevcdsp_qpel_neon.S | 47 +++++++++++++++--------
2 files changed, 37 insertions(+), 16 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index 0531db027b..e9ee901322 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -305,6 +305,11 @@ NEON8_FNPROTO(epel_uni_w_hv, (uint8_t *_dst, ptrdiff_t _dststride,
int height, int denom, int wx, int ox,
intptr_t mx, intptr_t my, int width), _i8mm);
+NEON8_FNPROTO_PARTIAL_5(qpel_uni_w_hv, (uint8_t *_dst, ptrdiff_t _dststride,
+ const uint8_t *_src, ptrdiff_t _srcstride,
+ int height, int denom, int wx, int ox,
+ intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO_PARTIAL_5(qpel_uni_w_hv, (uint8_t *_dst, ptrdiff_t _dststride,
const uint8_t *_src, ptrdiff_t _srcstride,
int height, int denom, int wx, int ox,
@@ -446,6 +451,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv,);
+ NEON8_FNASSIGN_PARTIAL_5(c->put_hevc_qpel_uni_w, 1, 1, qpel_uni_w_hv,);
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index f285ab7461..df7032b692 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -4164,7 +4164,7 @@ qpel_hv neon_i8mm
DISABLE_I8MM
#endif
-.macro QPEL_UNI_W_HV_HEADER width
+.macro QPEL_UNI_W_HV_HEADER width, suffix
ldp x14, x15, [sp] // mx, my
ldr w13, [sp, #16] // width
stp x19, x30, [sp, #-80]!
@@ -4173,7 +4173,7 @@ DISABLE_I8MM
stp x24, x25, [sp, #48]
stp x26, x27, [sp, #64]
mov x19, sp
- mov x11, #9088
+ mov x11, #(MAX_PB_SIZE*(MAX_PB_SIZE+8)*2)
sub sp, sp, x11
mov x20, x0
mov x21, x1
@@ -4190,7 +4190,16 @@ DISABLE_I8MM
mov w26, #-6
sub w26, w26, w5 // -shift
mov w27, w13 // width
- bl X(ff_hevc_put_hevc_qpel_h\width\()_8_neon_i8mm)
+.ifc \suffix, neon
+.if \width >= 32
+ mov w6, #\width
+ bl X(ff_hevc_put_hevc_qpel_h32_8_neon)
+.else
+ bl X(ff_hevc_put_hevc_qpel_h\width\()_8_\suffix)
+.endif
+.else
+ bl X(ff_hevc_put_hevc_qpel_h\width\()_8_\suffix)
+.endif
movrel x9, qpel_filters
add x9, x9, x23, lsl #3
ld1 {v0.8b}, [x9]
@@ -4552,33 +4561,39 @@ function hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_uni_w_hv4_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 4
+.macro qpel_uni_w_hv suffix
+function ff_hevc_put_hevc_qpel_uni_w_hv4_8_\suffix, export=1
+ QPEL_UNI_W_HV_HEADER 4, \suffix
b hevc_put_hevc_qpel_uni_w_hv4_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_w_hv8_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 8
+function ff_hevc_put_hevc_qpel_uni_w_hv8_8_\suffix, export=1
+ QPEL_UNI_W_HV_HEADER 8, \suffix
b hevc_put_hevc_qpel_uni_w_hv8_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_w_hv16_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 16
+function ff_hevc_put_hevc_qpel_uni_w_hv16_8_\suffix, export=1
+ QPEL_UNI_W_HV_HEADER 16, \suffix
b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_w_hv32_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 32
+function ff_hevc_put_hevc_qpel_uni_w_hv32_8_\suffix, export=1
+ QPEL_UNI_W_HV_HEADER 32, \suffix
b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_uni_w_hv64_8_neon_i8mm, export=1
- QPEL_UNI_W_HV_HEADER 64
+function ff_hevc_put_hevc_qpel_uni_w_hv64_8_\suffix, export=1
+ QPEL_UNI_W_HV_HEADER 64, \suffix
b hevc_put_hevc_qpel_uni_w_hv16_8_end_neon
endfunc
+.endm
+
+qpel_uni_w_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+qpel_uni_w_hv neon_i8mm
DISABLE_I8MM
#endif
--
2.39.3 (Apple Git-146)
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 26+ messages in thread
* [FFmpeg-devel] [PATCH 21/21] aarch64: hevc: Produce plain neon versions of qpel_bi_hv
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (19 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 20/21] aarch64: hevc: Produce plain neon versions of qpel_uni_w_hv Martin Storsjö
@ 2024-03-25 15:02 ` Martin Storsjö
2024-03-25 21:15 ` [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
21 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 15:02 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
As the plain neon qpel_h functions process two rows at a time,
we need to allocate storage for h+8 rows instead of h+7.
With storage for h+8 rows, the stack pointer no longer ends up
at the right spot if it is only incremented past the rows that
are consumed. Keep the intended final stack pointer value in
register x14, which we save on the stack across the call.
AWS Graviton 3:
put_hevc_qpel_bi_hv4_8_c: 385.7
put_hevc_qpel_bi_hv4_8_neon: 131.0
put_hevc_qpel_bi_hv4_8_i8mm: 92.2
put_hevc_qpel_bi_hv6_8_c: 701.0
put_hevc_qpel_bi_hv6_8_neon: 239.5
put_hevc_qpel_bi_hv6_8_i8mm: 191.0
put_hevc_qpel_bi_hv8_8_c: 1162.0
put_hevc_qpel_bi_hv8_8_neon: 228.0
put_hevc_qpel_bi_hv8_8_i8mm: 225.2
put_hevc_qpel_bi_hv12_8_c: 2305.0
put_hevc_qpel_bi_hv12_8_neon: 558.0
put_hevc_qpel_bi_hv12_8_i8mm: 483.2
put_hevc_qpel_bi_hv16_8_c: 3965.2
put_hevc_qpel_bi_hv16_8_neon: 732.7
put_hevc_qpel_bi_hv16_8_i8mm: 656.5
put_hevc_qpel_bi_hv24_8_c: 8709.7
put_hevc_qpel_bi_hv24_8_neon: 1555.2
put_hevc_qpel_bi_hv24_8_i8mm: 1448.7
put_hevc_qpel_bi_hv32_8_c: 14818.0
put_hevc_qpel_bi_hv32_8_neon: 2763.7
put_hevc_qpel_bi_hv32_8_i8mm: 2468.0
put_hevc_qpel_bi_hv48_8_c: 32855.5
put_hevc_qpel_bi_hv48_8_neon: 6107.2
put_hevc_qpel_bi_hv48_8_i8mm: 5452.7
put_hevc_qpel_bi_hv64_8_c: 57591.5
put_hevc_qpel_bi_hv64_8_neon: 10660.2
put_hevc_qpel_bi_hv64_8_i8mm: 9580.0
---
libavcodec/aarch64/hevcdsp_init_aarch64.c | 5 +
libavcodec/aarch64/hevcdsp_qpel_neon.S | 164 +++++++++++++---------
2 files changed, 103 insertions(+), 66 deletions(-)
diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
index e9ee901322..e24dd0cbda 100644
--- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
+++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
@@ -319,6 +319,10 @@ NEON8_FNPROTO(qpel_bi_v, (uint8_t *dst, ptrdiff_t dststride,
const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
int height, intptr_t mx, intptr_t my, int width),);
+NEON8_FNPROTO(qpel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
+ const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
+ int height, intptr_t mx, intptr_t my, int width),);
+
NEON8_FNPROTO(qpel_bi_hv, (uint8_t *dst, ptrdiff_t dststride,
const uint8_t *src, ptrdiff_t srcstride, const int16_t *src2,
int height, intptr_t mx, intptr_t my, int width), _i8mm);
@@ -452,6 +456,7 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
NEON8_FNASSIGN(c->put_hevc_qpel, 1, 1, qpel_hv,);
NEON8_FNASSIGN(c->put_hevc_qpel_uni, 1, 1, qpel_uni_hv,);
NEON8_FNASSIGN_PARTIAL_5(c->put_hevc_qpel_uni_w, 1, 1, qpel_uni_w_hv,);
+ NEON8_FNASSIGN(c->put_hevc_qpel_bi, 1, 1, qpel_bi_hv,);
if (have_i8mm(cpu_flags)) {
NEON8_FNASSIGN(c->put_hevc_epel, 0, 1, epel_h, _i8mm);
diff --git a/libavcodec/aarch64/hevcdsp_qpel_neon.S b/libavcodec/aarch64/hevcdsp_qpel_neon.S
index df7032b692..8ddaa32b70 100644
--- a/libavcodec/aarch64/hevcdsp_qpel_neon.S
+++ b/libavcodec/aarch64/hevcdsp_qpel_neon.S
@@ -4590,14 +4590,6 @@ endfunc
qpel_uni_w_hv neon
-#if HAVE_I8MM
-ENABLE_I8MM
-
-qpel_uni_w_hv neon_i8mm
-
-DISABLE_I8MM
-#endif
-
function hevc_put_hevc_qpel_bi_hv4_8_end_neon
mov x9, #(MAX_PB_SIZE * 2)
load_qpel_filterh x7, x6
@@ -4620,7 +4612,8 @@ function hevc_put_hevc_qpel_bi_hv4_8_end_neon
.endm
1: calc_all
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_bi_hv6_8_end_neon
@@ -4650,7 +4643,8 @@ function hevc_put_hevc_qpel_bi_hv6_8_end_neon
.endm
1: calc_all
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_bi_hv8_8_end_neon
@@ -4678,7 +4672,8 @@ function hevc_put_hevc_qpel_bi_hv8_8_end_neon
.endm
1: calc_all
.purgem calc
-2: ret
+2: mov sp, x14
+ ret
endfunc
function hevc_put_hevc_qpel_bi_hv16_8_end_neon
@@ -4723,83 +4718,87 @@ function hevc_put_hevc_qpel_bi_hv16_8_end_neon
subs x10, x10, #16
add x4, x4, #32
b.ne 0b
- add w10, w5, #7
- lsl x10, x10, #7
- sub x10, x10, x6, lsl #1 // part of first line
- add sp, sp, x10 // tmp_array without first line
+ mov sp, x14
ret
endfunc
-#if HAVE_I8MM
-ENABLE_I8MM
-
-function ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm, export=1
- add w10, w5, #7
+.macro qpel_bi_hv suffix
+function ff_hevc_put_hevc_qpel_bi_hv4_8_\suffix, export=1
+ add w10, w5, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x5, [sp, #16]
stp x0, x1, [sp, #32]
+ str x14, [sp, #48]
sub x1, x2, x3, lsl #1
sub x1, x1, x3
- add x0, sp, #48
+ add x0, sp, #64
mov x2, x3
add w3, w5, #7
mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h4_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_h4_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldr x14, [sp, #48]
+ ldp x7, x30, [sp], #64
b hevc_put_hevc_qpel_bi_hv4_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv6_8_neon_i8mm, export=1
- add w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv6_8_\suffix, export=1
+ add w10, w5, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x5, [sp, #16]
stp x0, x1, [sp, #32]
+ str x14, [sp, #48]
sub x1, x2, x3, lsl #1
sub x1, x1, x3
- add x0, sp, #48
+ add x0, sp, #64
mov x2, x3
add x3, x5, #7
mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h6_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_h6_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldr x14, [sp, #48]
+ ldp x7, x30, [sp], #64
b hevc_put_hevc_qpel_bi_hv6_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm, export=1
- add w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv8_8_\suffix, export=1
+ add w10, w5, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x5, [sp, #16]
stp x0, x1, [sp, #32]
+ str x14, [sp, #48]
sub x1, x2, x3, lsl #1
sub x1, x1, x3
- add x0, sp, #48
+ add x0, sp, #64
mov x2, x3
add x3, x5, #7
mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h8_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_h8_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldr x14, [sp, #48]
+ ldp x7, x30, [sp], #64
b hevc_put_hevc_qpel_bi_hv8_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv12_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_bi_hv12_8_\suffix, export=1
stp x6, x7, [sp, #-80]!
stp x4, x5, [sp, #16]
stp x2, x3, [sp, #32]
stp x0, x1, [sp, #48]
str x30, [sp, #64]
- bl X(ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_bi_hv8_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x2, x3, [sp, #32]
ldp x0, x1, [sp, #48]
@@ -4807,39 +4806,42 @@ function ff_hevc_put_hevc_qpel_bi_hv12_8_neon_i8mm, export=1
add x4, x4, #16
add x2, x2, #8
add x0, x0, #8
- bl X(ff_hevc_put_hevc_qpel_bi_hv4_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_bi_hv4_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm, export=1
- add w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv16_8_\suffix, export=1
+ add w10, w5, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x5, [sp, #16]
stp x0, x1, [sp, #32]
- add x0, sp, #48
+ str x14, [sp, #48]
+ add x0, sp, #64
sub x1, x2, x3, lsl #1
sub x1, x1, x3
mov x2, x3
add w3, w5, #7
mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_h16_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldr x14, [sp, #48]
+ ldp x7, x30, [sp], #64
mov x6, #16 // width
b hevc_put_hevc_qpel_bi_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv24_8_neon_i8mm, export=1
+function ff_hevc_put_hevc_qpel_bi_hv24_8_\suffix, export=1
stp x6, x7, [sp, #-80]!
stp x4, x5, [sp, #16]
stp x2, x3, [sp, #32]
stp x0, x1, [sp, #48]
str x30, [sp, #64]
- bl X(ff_hevc_put_hevc_qpel_bi_hv16_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_bi_hv16_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x2, x3, [sp, #32]
ldp x0, x1, [sp, #48]
@@ -4847,73 +4849,103 @@ function ff_hevc_put_hevc_qpel_bi_hv24_8_neon_i8mm, export=1
add x4, x4, #32
add x2, x2, #16
add x0, x0, #16
- bl X(ff_hevc_put_hevc_qpel_bi_hv8_8_neon_i8mm)
+ bl X(ff_hevc_put_hevc_qpel_bi_hv8_8_\suffix)
ldr x30, [sp], #16
ret
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv32_8_neon_i8mm, export=1
- add w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv32_8_\suffix, export=1
+ add w10, w5, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x5, [sp, #16]
stp x0, x1, [sp, #32]
- add x0, sp, #48
+ str x14, [sp, #48]
+ add x0, sp, #64
sub x1, x2, x3, lsl #1
mov x2, x3
sub x1, x1, x3
add w3, w5, #7
mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h32_8_neon_i8mm)
+ mov w6, #32
+ bl X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldr x14, [sp, #48]
+ ldp x7, x30, [sp], #64
mov x6, #32 // width
b hevc_put_hevc_qpel_bi_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv48_8_neon_i8mm, export=1
- add w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv48_8_\suffix, export=1
+ add w10, w5, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x5, [sp, #16]
stp x0, x1, [sp, #32]
- add x0, sp, #48
+ str x14, [sp, #48]
+ add x0, sp, #64
sub x1, x2, x3, lsl #1
mov x2, x3
sub x1, x1, x3
add w3, w5, #7
mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h48_8_neon_i8mm)
+.ifc \suffix, neon
+ mov w6, #48
+ bl X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+.else
+ bl X(ff_hevc_put_hevc_qpel_h48_8_\suffix)
+.endif
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldr x14, [sp, #48]
+ ldp x7, x30, [sp], #64
mov x6, #48 // width
b hevc_put_hevc_qpel_bi_hv16_8_end_neon
endfunc
-function ff_hevc_put_hevc_qpel_bi_hv64_8_neon_i8mm, export=1
- add w10, w5, #7
+function ff_hevc_put_hevc_qpel_bi_hv64_8_\suffix, export=1
+ add w10, w5, #8
lsl x10, x10, #7
+ mov x14, sp
sub sp, sp, x10 // tmp_array
- stp x7, x30, [sp, #-48]!
+ stp x7, x30, [sp, #-64]!
stp x4, x5, [sp, #16]
stp x0, x1, [sp, #32]
- add x0, sp, #48
+ str x14, [sp, #48]
+ add x0, sp, #64
sub x1, x2, x3, lsl #1
mov x2, x3
sub x1, x1, x3
add w3, w5, #7
mov x4, x6
- bl X(ff_hevc_put_hevc_qpel_h64_8_neon_i8mm)
+.ifc \suffix, neon
+ mov w6, #64
+ bl X(ff_hevc_put_hevc_qpel_h32_8_\suffix)
+.else
+ bl X(ff_hevc_put_hevc_qpel_h64_8_\suffix)
+.endif
ldp x4, x5, [sp, #16]
ldp x0, x1, [sp, #32]
- ldp x7, x30, [sp], #48
+ ldr x14, [sp, #48]
+ ldp x7, x30, [sp], #64
mov x6, #64 // width
b hevc_put_hevc_qpel_bi_hv16_8_end_neon
endfunc
+.endm
+
+qpel_bi_hv neon
+
+#if HAVE_I8MM
+ENABLE_I8MM
+
+qpel_uni_w_hv neon_i8mm
+
+qpel_bi_hv neon_i8mm
DISABLE_I8MM
#endif // HAVE_I8MM
--
2.39.3 (Apple Git-146)
* Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
2024-03-25 15:02 [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
` (20 preceding siblings ...)
2024-03-25 15:02 ` [FFmpeg-devel] [PATCH 21/21] aarch64: hevc: Produce plain neon versions of qpel_bi_hv Martin Storsjö
@ 2024-03-25 21:15 ` Martin Storsjö
2024-03-25 21:56 ` J. Dekker
21 siblings, 1 reply; 26+ messages in thread
From: Martin Storsjö @ 2024-03-25 21:15 UTC (permalink / raw)
To: ffmpeg-devel; +Cc: Logan Lyu, J . Dekker
On Mon, 25 Mar 2024, Martin Storsjö wrote:
> Since some time, we have pretty complete AArch64 NEON coverage
> for the hevc decoder.
>
> However, some of these functions require the I8MM instruction set
> extension, and many of them (but not all) lack a plain NEON
> version.
>
> This patchset fills in a regular NEON version of all functions
> where we have an I8MM function.
>
> For context; the I8MM instruction set extension is a mandatory
> part of armv8.6-a. E.g. Apple M2, AWS Graviton 3 have it,
> but Apple M1 and Ampere Altra don't.
>
> This patchset takes decoding of a 1080p HEVC clip from 402
> fps to 649 fps on an Apple M1.
>
> Patch #2 also fixes a subtle bug in the existing implementation;
> two functions relied on the contents on the stack, below the
> stack pointer, being untouched within a function. If a signal
> gets delivered, those parts of the stack could be clobbered.
I know this is rather short notice for a patchset of this size, but would
people be OK with merging it before the impending 7.0 branch (which will
be made within the next 24h)?
The patches pass all my tricky build configurations, they give a
significant speedup on many common CPUs, and patch #2 fixes a real bug
in the existing implementations. (A bug fix patch can of course be
backported after the branch too, but performance optimizations generally
aren't relevant for backporting.)
// Martin
* Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
2024-03-25 21:15 ` [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions Martin Storsjö
@ 2024-03-25 21:56 ` J. Dekker
2024-03-26 6:01 ` Jean-Baptiste Kempf
0 siblings, 1 reply; 26+ messages in thread
From: J. Dekker @ 2024-03-25 21:56 UTC (permalink / raw)
To: Martin Storsjö; +Cc: Logan Lyu, J . Dekker, ffmpeg-devel
> On Mon, 25 Mar 2024, Martin Storsjö wrote:
>
> I know this is short notice for a patchset of this size, but would people be OK with merging it before the impending 7.0 branch (which will be made within the next 24h)?
>
> The patches pass all my tricky build configurations, they give a substantial speedup on many common CPUs without I8MM (402 to 649 fps for a 1080p HEVC clip on an Apple M1), and patch #2 fixes a real bug in the existing implementations. (A bug fix patch can of course be backported after the branch too, but performance optimizations generally aren't candidates for backporting.)
>
> // Martin
Yes, please. I will push it tomorrow morning if you haven’t already.
--
jd
* Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
2024-03-25 21:56 ` J. Dekker
@ 2024-03-26 6:01 ` Jean-Baptiste Kempf
2024-03-26 7:09 ` Martin Storsjö
0 siblings, 1 reply; 26+ messages in thread
From: Jean-Baptiste Kempf @ 2024-03-26 6:01 UTC (permalink / raw)
To: J. Dekker, ffmpeg-devel, Martin Storsjö; +Cc: myais
On Mon, 25 Mar 2024, at 22:56, J. Dekker wrote:
> Yes, please. I will push it tomorrow morning if you haven’t already.
+1
--
Jean-Baptiste Kempf - President
+33 672 704 734
https://jbkempf.com/
* Re: [FFmpeg-devel] [PATCH 00/21] aarch64: hevc: Add missing hevc_pel NEON functions
2024-03-26 6:01 ` Jean-Baptiste Kempf
@ 2024-03-26 7:09 ` Martin Storsjö
0 siblings, 0 replies; 26+ messages in thread
From: Martin Storsjö @ 2024-03-26 7:09 UTC (permalink / raw)
To: Jean-Baptiste Kempf; +Cc: myais, J. Dekker, ffmpeg-devel
On Tue, 26 Mar 2024, Jean-Baptiste Kempf wrote:
> +1
Thanks, I pushed this set now.
// Martin