Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
* [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_extract_exponents
@ 2025-02-28 21:21 Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-28 21:21 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon Krzysztof Pyrkosz via ffmpeg-devel
  2025-03-01 22:59 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_extract_exponents Martin Storsjö
  0 siblings, 2 replies; 4+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-28 21:21 UTC
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

Before and after:

A78, before:
ac3_extract_exponents_n512_neon:                       503.2 ( 3.36x)
ac3_extract_exponents_n3072_neon:                     2986.2 ( 3.35x)

A78, after:
ac3_extract_exponents_n512_neon:                       211.2 ( 8.02x)
ac3_extract_exponents_n3072_neon:                     1251.5 ( 8.00x)

A72, before:
ac3_extract_exponents_n512_neon:                       964.7 ( 2.39x)
ac3_extract_exponents_n3072_neon:                     5434.5 ( 2.47x)

A72, after:
ac3_extract_exponents_n512_neon:                       465.6 ( 4.87x)
ac3_extract_exponents_n3072_neon:                     2696.3 ( 4.97x)
---
This version handles 16 ints per iteration and consolidates the separate
extractions and writes into a single store. It assumes that the length of
the input is a multiple of 16; the template file defines no such
constraint, but the tests pass.
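
For readers who want to check the arithmetic, here is a minimal scalar
sketch in C of what the loop computes per element, namely
clz(|coef|) - 8 narrowed to a byte. The names clz32 and
extract_exponents_model are made up for this illustration and are not
part of FFmpeg; the assert spells out the multiple-of-16 assumption
mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* Count leading zeros of a 32-bit value; returns 32 for zero,
 * which is what the NEON clz instruction yields for a zero lane. */
static int clz32(uint32_t v)
{
    int n = 0;
    if (v == 0)
        return 32;
    while (!(v & 0x80000000u)) {
        v <<= 1;
        n++;
    }
    return n;
}

/* Hypothetical scalar model (not FFmpeg code): 16 coefficients per
 * outer iteration, exponent = clz(|coef|) - 8 stored as one byte. */
static void extract_exponents_model(uint8_t *exp, const int32_t *coef,
                                    int nb_coefs)
{
    /* Assumption made by the patch: the length is a multiple of 16. */
    assert(nb_coefs > 0 && nb_coefs % 16 == 0);

    for (int i = 0; i < nb_coefs; i += 16) {
        for (int j = 0; j < 16; j++) {
            int32_t c = coef[i + j];
            /* Unsigned negation avoids overflow for INT32_MIN. */
            uint32_t mag = c < 0 ? 0u - (uint32_t)c : (uint32_t)c;
            exp[i + j] = (uint8_t)(clz32(mag) - 8);
        }
    }
}
```

With 24-bit coefficients clz is at least 8, so the byte-wide subtraction
cannot wrap, and a zero coefficient maps to 32 - 8 = 24.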

Krzysztof

 libavcodec/aarch64/ac3dsp_neon.S | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)

diff --git a/libavcodec/aarch64/ac3dsp_neon.S b/libavcodec/aarch64/ac3dsp_neon.S
index 7e97cc39f7..b2ac9edc55 100644
--- a/libavcodec/aarch64/ac3dsp_neon.S
+++ b/libavcodec/aarch64/ac3dsp_neon.S
@@ -38,15 +38,26 @@ function ff_ac3_exponent_min_neon, export=1
 endfunc
 
 function ff_ac3_extract_exponents_neon, export=1
-        movi            v1.4s, #8
-1:      ld1             {v0.4s}, [x1], #16
+        movi            v4.16b, #8
+1:      ld1             {v0.4s - v3.4s}, [x1], #64
+        subs            w2, w2, #16
         abs             v0.4s, v0.4s
+        abs             v1.4s, v1.4s
+        abs             v2.4s, v2.4s
+        abs             v3.4s, v3.4s
+
         clz             v0.4s, v0.4s
-        sub             v0.4s, v0.4s, v1.4s
-        xtn             v0.4h, v0.4s
-        xtn             v0.8b, v0.8h
-        st1             {v0.s}[0], [x0], #4
-        subs            w2, w2, #4
+        clz             v1.4s, v1.4s
+        clz             v2.4s, v2.4s
+        clz             v3.4s, v3.4s
+
+        uzp1            v0.8h, v0.8h, v1.8h
+        uzp1            v2.8h, v2.8h, v3.8h
+        uzp1            v0.16b, v0.16b, v2.16b
+        sub             v0.16b, v0.16b, v4.16b
+
+        st1             {v0.16b}, [x0], #16
+
         b.gt            1b
         ret
 endfunc
-- 
2.47.2
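
The three uzp1 instructions in the patch replace the two xtn steps. A
rough C model of why that works, assuming the usual lane layout where
16-bit lane 2*i is the low half of 32-bit lane i: keeping only the
even-numbered lanes twice leaves the low byte of every 32-bit value.
All helper names below are hypothetical and exist only for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Model of "uzp1 Vd.8h, Vn.8h, Vm.8h": the even-numbered 16-bit lanes
 * of n fill the low half of d, those of m the high half. */
static void uzp1_8h(uint16_t d[8], const uint16_t n[8], const uint16_t m[8])
{
    uint16_t t[8];
    for (int i = 0; i < 4; i++) {
        t[i]     = n[2 * i];
        t[4 + i] = m[2 * i];
    }
    memcpy(d, t, sizeof(t)); /* d may alias n, as in the patch */
}

/* Model of "uzp1 Vd.16b, Vn.16b, Vm.16b". */
static void uzp1_16b(uint8_t d[16], const uint8_t n[16], const uint8_t m[16])
{
    uint8_t t[16];
    for (int i = 0; i < 8; i++) {
        t[i]     = n[2 * i];
        t[8 + i] = m[2 * i];
    }
    memcpy(d, t, sizeof(t));
}

/* View four 32-bit lanes as eight 16-bit lanes of the same register. */
static void view_4s_as_8h(uint16_t h[8], const uint32_t s[4])
{
    for (int i = 0; i < 4; i++) {
        h[2 * i]     = (uint16_t)(s[i] & 0xFFFF);
        h[2 * i + 1] = (uint16_t)(s[i] >> 16);
    }
}

/* View eight 16-bit lanes as sixteen 8-bit lanes of the same register. */
static void view_8h_as_16b(uint8_t b[16], const uint16_t h[8])
{
    for (int i = 0; i < 8; i++) {
        b[2 * i]     = (uint8_t)(h[i] & 0xFF);
        b[2 * i + 1] = (uint8_t)(h[i] >> 8);
    }
}

/* Narrow sixteen 32-bit values to bytes in the same order of
 * operations as the patched loop (v0..v3 -> v0.16b). */
static void narrow16_model(uint8_t out[16], const uint32_t in[16])
{
    uint16_t h0[8], h1[8], h2[8], h3[8];
    uint8_t b0[16], b2[16];

    view_4s_as_8h(h0, in);
    view_4s_as_8h(h1, in + 4);
    view_4s_as_8h(h2, in + 8);
    view_4s_as_8h(h3, in + 12);

    uzp1_8h(h0, h0, h1);   /* uzp1 v0.8h, v0.8h, v1.8h */
    uzp1_8h(h2, h2, h3);   /* uzp1 v2.8h, v2.8h, v3.8h */

    view_8h_as_16b(b0, h0);
    view_8h_as_16b(b2, h2);
    uzp1_16b(out, b0, b2); /* uzp1 v0.16b, v0.16b, v2.16b */
}
```

Because the clz results are at most 32, the low byte carries the whole
value, so subtracting 8 after narrowing gives the same result as
subtracting before it.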

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

* [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon
  2025-02-28 21:21 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_extract_exponents Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-02-28 21:21 ` Krzysztof Pyrkosz via ffmpeg-devel
  2025-03-01 23:07   ` Martin Storsjö
  2025-03-01 22:59 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_extract_exponents Martin Storsjö
  1 sibling, 1 reply; 4+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-02-28 21:21 UTC
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

Before and after:

A78
ac3_sum_square_bufferfly_int32_neon:                   484.8 ( 2.00x)  (before)
ac3_sum_square_bufferfly_int32_neon:                   468.2 ( 2.08x)  (after)

A72
ac3_sum_square_bufferfly_int32_neon:                   793.6 ( 1.26x)  (before)
ac3_sum_square_bufferfly_int32_neon:                   527.3 ( 1.92x)  (after)
---
Instead of calculating a^2, b^2, (a+b)^2 and (a-b)^2, calculate only
a^2, b^2 and 2*a*b in each iteration and derive the latter two sums from
these three after the loop.
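
The identity behind this is (a+b)^2 = a^2 + b^2 + 2ab and
(a-b)^2 = a^2 + b^2 - 2ab, applied to the sums. A small C sketch
comparing the two approaches follows; the function names are
illustrative only, and it assumes inputs small enough (for example
24-bit coefficients) that the 64-bit accumulators do not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Direct version: four accumulators updated in every iteration. */
static void butterfly_direct(int64_t sum[4], const int32_t *a,
                             const int32_t *b, int len)
{
    sum[0] = sum[1] = sum[2] = sum[3] = 0;
    for (int i = 0; i < len; i++) {
        int64_t l = a[i], r = b[i];
        sum[0] += l * l;
        sum[1] += r * r;
        sum[2] += (l + r) * (l + r);
        sum[3] += (l - r) * (l - r);
    }
}

/* Derived version: three accumulators, the last two sums are
 * reconstructed once after the loop. */
static void butterfly_derived(int64_t sum[4], const int32_t *a,
                              const int32_t *b, int len)
{
    int64_t aa = 0, bb = 0, ab2 = 0;
    for (int i = 0; i < len; i++) {
        int64_t l = a[i], r = b[i];
        aa  += l * l;     /* sum of a^2 */
        bb  += r * r;     /* sum of b^2 */
        ab2 += 2 * l * r; /* sum of 2ab */
    }
    sum[0] = aa;
    sum[1] = bb;
    sum[2] = aa + bb + ab2; /* sum of (a+b)^2 */
    sum[3] = aa + bb - ab2; /* sum of (a-b)^2 */
}
```

Both functions should agree on all four outputs for any input within
the stated range; the derived form just does one multiply-accumulate
less per iteration.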

Krzysztof

 libavcodec/aarch64/ac3dsp_neon.S | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/libavcodec/aarch64/ac3dsp_neon.S b/libavcodec/aarch64/ac3dsp_neon.S
index b2ac9edc55..2128eab528 100644
--- a/libavcodec/aarch64/ac3dsp_neon.S
+++ b/libavcodec/aarch64/ac3dsp_neon.S
@@ -80,21 +80,20 @@ function ff_ac3_sum_square_butterfly_int32_neon, export=1
         movi            v0.2d, #0
         movi            v1.2d, #0
         movi            v2.2d, #0
-        movi            v3.2d, #0
 1:      ld1             {v4.2s}, [x1], #8
         ld1             {v5.2s}, [x2], #8
-        add             v6.2s, v4.2s, v5.2s
-        sub             v7.2s, v4.2s, v5.2s
-        smlal           v0.2d, v4.2s, v4.2s
-        smlal           v1.2d, v5.2s, v5.2s
-        smlal           v2.2d, v6.2s, v6.2s
-        smlal           v3.2d, v7.2s, v7.2s
         subs            w3, w3, #2
+        smlal           v0.2d, v4.2s, v4.2s // sum of a^2
+        smlal           v1.2d, v5.2s, v5.2s // sum of b^2
+        sqdmlal         v2.2d, v4.2s, v5.2s // sum of 2ab
         b.gt            1b
         addp            d0, v0.2d
         addp            d1, v1.2d
         addp            d2, v2.2d
-        addp            d3, v3.2d
+        sub             d3, d0, d2 // a^2 - 2ab
+        add             d2, d0, d2 // a^2 + 2ab
+        add             d3, d3, d1 // a^2 + b^2 - 2ab
+        add             d2, d2, d1 // a^2 + b^2 + 2ab
         st1             {v0.1d-v3.1d}, [x0]
         ret
 endfunc
-- 
2.47.2

* Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_extract_exponents
  2025-02-28 21:21 [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_extract_exponents Krzysztof Pyrkosz via ffmpeg-devel
  2025-02-28 21:21 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-03-01 22:59 ` Martin Storsjö
  1 sibling, 0 replies; 4+ messages in thread
From: Martin Storsjö @ 2025-03-01 22:59 UTC
  To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz

On Fri, 28 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> Before and after:
>
> A78
> ac3_extract_exponents_n512_neon:                       503.2 ( 3.36x)
> ac3_extract_exponents_n3072_neon:                     2986.2 ( 3.35x)
>
> ac3_extract_exponents_n512_neon:                       211.2 ( 8.02x)
> ac3_extract_exponents_n3072_neon:                     1251.5 ( 8.00x)
>
> A72
> ac3_extract_exponents_n512_neon:                       964.7 ( 2.39x)
> ac3_extract_exponents_n3072_neon:                     5434.5 ( 2.47x)
>
> ac3_extract_exponents_n512_neon:                       465.6 ( 4.87x)
> ac3_extract_exponents_n3072_neon:                     2696.3 ( 4.97x)
> ---
> This version handles 16 ints in one go and consolidates separate
> extractions and writes into one. I assume the length of the input is a
> multiple of 16 (there are no constraints defined in the template file),
> but the tests are passing.

I have no clue whether this is ok or not (it may be good to check whether
other assembly implementations, e.g. on x86, make the same assumption).
Codewise, the patch looks good, thanks!

This description of the patch, what it does and the assumptions it makes, 
is probably nice to keep in the final commit as well, so it could be 
included above "---" too.

// Martin

* Re: [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon
  2025-02-28 21:21 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/ac3dsp_neon.S: Optimize ac3_sum_square_butterfly_int32_neon Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-03-01 23:07   ` Martin Storsjö
  0 siblings, 0 replies; 4+ messages in thread
From: Martin Storsjö @ 2025-03-01 23:07 UTC
  To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz

On Fri, 28 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> Before and after:
>
> A78
> ac3_sum_square_bufferfly_int32_neon:                   484.8 ( 2.00x)
> ac3_sum_square_bufferfly_int32_neon:                   468.2 ( 2.08x)
>
> A72
> ac3_sum_square_bufferfly_int32_neon:                   793.6 ( 1.26x)
> ac3_sum_square_bufferfly_int32_neon:                   527.3 ( 1.92x)
> ---
> Instead of calculating a^2, b^2, (a+b)^2 and (a-b)^2, calculate only
> a^2, b^2 and 2*a*b in each iteration and derive the latter parts from
> these three at the end.

This patch looks good to me, thanks!

// Martin
