* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_extract_exponents
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-28 21:21 UTC
To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz
Before and after:

A78
ac3_extract_exponents_n512_neon: 503.2 ( 3.36x)
ac3_extract_exponents_n3072_neon: 2986.2 ( 3.35x)

ac3_extract_exponents_n512_neon: 211.2 ( 8.02x)
ac3_extract_exponents_n3072_neon: 1251.5 ( 8.00x)

A72
ac3_extract_exponents_n512_neon: 964.7 ( 2.39x)
ac3_extract_exponents_n3072_neon: 5434.5 ( 2.47x)

ac3_extract_exponents_n512_neon: 465.6 ( 4.87x)
ac3_extract_exponents_n3072_neon: 2696.3 ( 4.97x)
---
This version handles 16 ints in one go and consolidates separate
extractions and writes into one. I assume the length of the input is a
multiple of 16 (there are no constraints defined in the template file),
but the tests are passing.
Krzysztof
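
For reference, the per-element arithmetic of the new loop can be sketched in C. This is an illustrative sketch derived from the diff in this message, not FFmpeg's C reference implementation. The names clz32 and extract_exponents_sketch are hypothetical, and the assert spells out the assumption stated above, namely that the length is a positive multiple of 16.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 32-bit value; returns 32 for 0, like NEON clz. */
static unsigned clz32(uint32_t v)
{
    unsigned n = 0;

    if (v == 0)
        return 32;
    while (!(v & 0x80000000u)) {
        v <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical scalar model of the patched loop: abs, clz, narrow, subtract 8. */
static void extract_exponents_sketch(uint8_t *exp, const int32_t *coef,
                                     int nb_coefs)
{
    /* Assumption taken from the patch notes: 16 ints are consumed per pass. */
    assert(nb_coefs > 0 && nb_coefs % 16 == 0);

    for (int i = 0; i < nb_coefs; i += 16) {
        for (int j = 0; j < 16; j++) {
            int32_t  c   = coef[i + j];
            /* Magnitude without signed overflow on INT32_MIN. */
            uint32_t mag = c < 0 ? 0u - (uint32_t)c : (uint32_t)c;
            /* Narrow to a byte first (the uzp1 steps), then subtract 8. */
            uint8_t  lz  = (uint8_t)clz32(mag);

            exp[i + j] = (uint8_t)(lz - 8);
        }
    }
}
```

With this model a zero coefficient yields 24 and a coefficient with bit 23 set yields 0. Calling it with a length that is not a multiple of 16 trips the assert, whereas the assembly loop would simply process a full 16-element block past the end.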
libavcodec/aarch64/ac3dsp_neon.S | 25 ++++++++++++++++++-------
1 file changed, 18 insertions(+), 7 deletions(-)
diff --git a/libavcodec/aarch64/ac3dsp_neon.S b/libavcodec/aarch64/ac3dsp_neon.S
index 7e97cc39f7..b2ac9edc55 100644
--- a/libavcodec/aarch64/ac3dsp_neon.S
+++ b/libavcodec/aarch64/ac3dsp_neon.S
@@ -38,15 +38,26 @@ function ff_ac3_exponent_min_neon, export=1
endfunc
function ff_ac3_extract_exponents_neon, export=1
- movi v1.4s, #8
-1: ld1 {v0.4s}, [x1], #16
+ movi v4.16b, #8
+1: ld1 {v0.4s - v3.4s}, [x1], #64
+ subs w2, w2, #16
abs v0.4s, v0.4s
+ abs v1.4s, v1.4s
+ abs v2.4s, v2.4s
+ abs v3.4s, v3.4s
+
clz v0.4s, v0.4s
- sub v0.4s, v0.4s, v1.4s
- xtn v0.4h, v0.4s
- xtn v0.8b, v0.8h
- st1 {v0.s}[0], [x0], #4
- subs w2, w2, #4
+ clz v1.4s, v1.4s
+ clz v2.4s, v2.4s
+ clz v3.4s, v3.4s
+
+ uzp1 v0.8h, v0.8h, v1.8h
+ uzp1 v2.8h, v2.8h, v3.8h
+ uzp1 v0.16b, v0.16b, v2.16b
+ sub v0.16b, v0.16b, v4.16b
+
+ st1 {v0.16b}, [x0], #16
+
b.gt 1b
ret
endfunc
--
2.47.2
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-28 21:21 UTC
To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz
Before and after:

A78
ac3_sum_square_bufferfly_int32_neon: 484.8 ( 2.00x)
ac3_sum_square_bufferfly_int32_neon: 468.2 ( 2.08x)

A72
ac3_sum_square_bufferfly_int32_neon: 793.6 ( 1.26x)
ac3_sum_square_bufferfly_int32_neon: 527.3 ( 1.92x)
---
Instead of calculating a^2, b^2, (a+b)^2 and (a-b)^2, calculate only
a^2, b^2 and 2*a*b in each iteration and derive the latter parts from
these three at the end.
Krzysztof
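
The identity used here is (a+b)^2 = a^2 + b^2 + 2ab and (a-b)^2 = a^2 + b^2 - 2ab, applied to the accumulated sums rather than per element. A minimal C sketch of that idea follows. The function name sum_square_butterfly_sketch is hypothetical, the output order (a^2, b^2, (a+b)^2, (a-b)^2) is read off the final st1 in the diff, and the inputs are assumed small enough that the 64-bit sums do not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model: accumulate a^2, b^2 and 2ab, combine at the end. */
static void sum_square_butterfly_sketch(int64_t sum[4], const int32_t *a,
                                        const int32_t *b, int len)
{
    int64_t aa = 0, bb = 0, ab2 = 0;

    /* Assumption: the assembly loop consumes two elements per iteration. */
    assert(len > 0 && len % 2 == 0);

    for (int i = 0; i < len; i++) {
        aa  += (int64_t)a[i] * a[i];        /* sum of a^2 (smlal)   */
        bb  += (int64_t)b[i] * b[i];        /* sum of b^2 (smlal)   */
        ab2 += 2 * ((int64_t)a[i] * b[i]);  /* sum of 2ab (sqdmlal) */
    }

    sum[0] = aa;
    sum[1] = bb;
    sum[2] = aa + bb + ab2;                 /* sum of (a+b)^2 */
    sum[3] = aa + bb - ab2;                 /* sum of (a-b)^2 */
}
```

The loop body drops from four multiply-accumulates plus an add and a sub to three multiply-accumulates, and the remaining adds and subs run once after the loop.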
libavcodec/aarch64/ac3dsp_neon.S | 15 +++++++--------
1 file changed, 7 insertions(+), 8 deletions(-)
diff --git a/libavcodec/aarch64/ac3dsp_neon.S b/libavcodec/aarch64/ac3dsp_neon.S
index b2ac9edc55..2128eab528 100644
--- a/libavcodec/aarch64/ac3dsp_neon.S
+++ b/libavcodec/aarch64/ac3dsp_neon.S
@@ -80,21 +80,20 @@ function ff_ac3_sum_square_butterfly_int32_neon, export=1
movi v0.2d, #0
movi v1.2d, #0
movi v2.2d, #0
- movi v3.2d, #0
1: ld1 {v4.2s}, [x1], #8
ld1 {v5.2s}, [x2], #8
- add v6.2s, v4.2s, v5.2s
- sub v7.2s, v4.2s, v5.2s
- smlal v0.2d, v4.2s, v4.2s
- smlal v1.2d, v5.2s, v5.2s
- smlal v2.2d, v6.2s, v6.2s
- smlal v3.2d, v7.2s, v7.2s
subs w3, w3, #2
+ smlal v0.2d, v4.2s, v4.2s // sum of a^2
+ smlal v1.2d, v5.2s, v5.2s // sum of b^2
+ sqdmlal v2.2d, v4.2s, v5.2s // sum of 2ab
b.gt 1b
addp d0, v0.2d
addp d1, v1.2d
addp d2, v2.2d
- addp d3, v3.2d
+ sub d3, d0, d2 // a^2 - 2ab
+ add d2, d0, d2 // a^2 + 2ab
+ add d3, d3, d1 // a^2 + b^2 - 2ab = sum of (a-b)^2
+ add d2, d2, d1 // a^2 + b^2 + 2ab = sum of (a+b)^2
st1 {v0.1d-v3.1d}, [x0]
ret
endfunc
--
2.47.2
* Re: [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon
From: Martin Storsjö @ 2025-03-01 23:07 UTC
To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz
On Fri, 28 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:
> Before and after:
>
> A78
> ac3_sum_square_bufferfly_int32_neon: 484.8 ( 2.00x)
> ac3_sum_square_bufferfly_int32_neon: 468.2 ( 2.08x)
>
> A72
> ac3_sum_square_bufferfly_int32_neon: 793.6 ( 1.26x)
> ac3_sum_square_bufferfly_int32_neon: 527.3 ( 1.92x)
> ---
> Instead of calculating a^2, b^2, (a+b)^2 and (a-b)^2, calculate only
> a^2, b^2 and 2*a*b in each iteration and derive the latter parts from
> these three at the end.
This patch looks good to me, thanks!
// Martin
* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_extract_exponents
From: Martin Storsjö @ 2025-03-01 22:59 UTC
To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz
On Fri, 28 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:
> Before and after:
>
> A78
> ac3_extract_exponents_n512_neon: 503.2 ( 3.36x)
> ac3_extract_exponents_n3072_neon: 2986.2 ( 3.35x)
>
> ac3_extract_exponents_n512_neon: 211.2 ( 8.02x)
> ac3_extract_exponents_n3072_neon: 1251.5 ( 8.00x)
>
> A72
> ac3_extract_exponents_n512_neon: 964.7 ( 2.39x)
> ac3_extract_exponents_n3072_neon: 5434.5 ( 2.47x)
>
> ac3_extract_exponents_n512_neon: 465.6 ( 4.87x)
> ac3_extract_exponents_n3072_neon: 2696.3 ( 4.97x)
> ---
> This version handles 16 ints in one go and consolidates separate
> extractions and writes into one. I assume the length of the input is a
> multiple of 16 (there are no constraints defined in the template file),
> but the tests are passing.
I don't know whether this assumption is OK (it may be good to check
whether other assembly implementations, e.g. on x86, rely on it too).
Codewise, the patch looks good, thanks!
This description of the patch, what it does and the assumptions it makes,
is probably nice to keep in the final commit as well, so it could be
included above "---" too.
// Martin