Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
 help / color / mirror / Atom feed
* [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_butterflies_float_neon
@ 2025-01-19 18:07 Krzysztof Pyrkosz via ffmpeg-devel
  2025-01-19 21:28 ` Martin Storsjö
  0 siblings, 1 reply; 2+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-01-19 18:07 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

Modified the main loop to handle 8 floats in one go. A special case of
length not being multiple of 8 is handled at the beginning. The speed
increased from 3.90 to 4.50.

Krzysztof

---
 libavutil/aarch64/float_dsp_neon.S | 30 ++++++++++++++++++++++--------
 1 file changed, 22 insertions(+), 8 deletions(-)

diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
index 35e2715b87..f99cc2e6b2 100644
--- a/libavutil/aarch64/float_dsp_neon.S
+++ b/libavutil/aarch64/float_dsp_neon.S
@@ -178,14 +178,28 @@ function ff_vector_fmul_reverse_neon, export=1
 endfunc
 
 function ff_butterflies_float_neon, export=1
-1:      ld1             {v0.4s}, [x0]
-        ld1             {v1.4s}, [x1]
-        subs            w2,  w2,  #4
-        fsub            v2.4s,   v0.4s,  v1.4s
-        fadd            v3.4s,   v0.4s,  v1.4s
-        st1             {v2.4s}, [x1],   #16
-        st1             {v3.4s}, [x0],   #16
-        b.gt            1b
+        tst            w2, #4
+        b.eq           bf8
+        ldr            q0, [x0]
+        ldr            q1, [x1]
+        fadd           v2.4s, v0.4s, v1.4s
+        fsub           v3.4s, v0.4s, v1.4s
+        str            q2, [x0], #16
+        str            q3, [x1], #16
+        sub            w2, w2, #4
+        cbnz           w2, bf8
+        ret
+bf8:
+        ldp            q0, q1, [x0]
+        ldp            q4, q5, [x1]
+        fadd           v2.4s, v0.4s, v4.4s
+        fadd           v3.4s, v1.4s, v5.4s
+        stp            q2, q3, [x0], #32
+        fsub           v2.4s, v0.4s, v4.4s
+        fsub           v3.4s, v1.4s, v5.4s
+        stp            q2, q3, [x1], #32
+        sub            w2, w2, #8
+        cbnz           w2, bf8
         ret
 endfunc
 
-- 
2.45.2

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 2+ messages in thread

* Re: [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_butterflies_float_neon
  2025-01-19 18:07 [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_butterflies_float_neon Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-01-19 21:28 ` Martin Storsjö
  0 siblings, 0 replies; 2+ messages in thread
From: Martin Storsjö @ 2025-01-19 21:28 UTC (permalink / raw)
  To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz

On Sun, 19 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> Modified the main loop to handle 8 floats in one go. A special case of
> length not being multiple of 8 is handled at the beginning. The speed
> increased from 3.90 to 4.50.

The same request about benchmark numbers and reference to where they were 
measured.


>
> Krzysztof
>
> ---
> libavutil/aarch64/float_dsp_neon.S | 30 ++++++++++++++++++++++--------
> 1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
> index 35e2715b87..f99cc2e6b2 100644
> --- a/libavutil/aarch64/float_dsp_neon.S
> +++ b/libavutil/aarch64/float_dsp_neon.S
> @@ -178,14 +178,28 @@ function ff_vector_fmul_reverse_neon, export=1
> endfunc
>
> function ff_butterflies_float_neon, export=1
> -1:      ld1             {v0.4s}, [x0]
> -        ld1             {v1.4s}, [x1]
> -        subs            w2,  w2,  #4
> -        fsub            v2.4s,   v0.4s,  v1.4s
> -        fadd            v3.4s,   v0.4s,  v1.4s
> -        st1             {v2.4s}, [x1],   #16
> -        st1             {v3.4s}, [x0],   #16
> -        b.gt            1b
> +        tst            w2, #4

Note how you're altering the indentation of the whole function here - 
don't do that, keep things aligned as it was. (The patch for 
ff_vector_fmul_add_neon also had that issue.)

> +        b.eq           bf8
> +        ldr            q0, [x0]
> +        ldr            q1, [x1]
> +        fadd           v2.4s, v0.4s, v1.4s
> +        fsub           v3.4s, v0.4s, v1.4s
> +        str            q2, [x0], #16
> +        str            q3, [x1], #16
> +        sub            w2, w2, #4
> +        cbnz           w2, bf8
> +        ret
> +bf8:

This will emit an actual symbol to the symbol table. That's not an issue 
per se, but the output of some tools (disassemblers etc) can be nicer if 
we avoid them. If we really do want a textual label, there are also ways 
of making them into local labels that don't end up in the symbol table; 
prefixing it with .L will do that, for ELF. We have a couple cases of 
that in our assembly, but there's not a lot of it. However, for MachO, it 
has to be prefixed by a plain "L", not ".L". So if we want to start doing 
this, we could add this macro to asm.S: 
https://code.videolan.org/videolan/dav1d/-/blob/master/src/arm/asm.S?ref_type=heads#L352-356

But for this case, a plain numeric 8: could probably be just as good.

> +        ldp            q0, q1, [x0]
> +        ldp            q4, q5, [x1]
> +        fadd           v2.4s, v0.4s, v4.4s
> +        fadd           v3.4s, v1.4s, v5.4s
> +        stp            q2, q3, [x0], #32

Why are you doing the store of q3 directly after the calculation of it? 
This will cause stalls on in-order cores.

> +        fsub           v2.4s, v0.4s, v4.4s
> +        fsub           v3.4s, v1.4s, v5.4s
> +        stp            q2, q3, [x1], #32
> +        sub            w2, w2, #8
> +        cbnz           w2, bf8

Likewise, move the sub between the fsub and cbnz, to improve things for 
in-order cores.

Overall, this change does seem to be helpful (except for A73, 
surprisingly), but it can be made better with the suggestions I made 
above.

Before:              Cortex A53    A72     A73    A78
butterflies_float_neon:   782.0  336.8   405.8  163.0
After this patch:
butterflies_float_neon:   718.5  323.8   525.2  147.8
With extra modifications:
butterflies_float_neon:   660.5  328.8   472.8  147.0

That is with the following changes on top:

+++ b/libavutil/aarch64/float_dsp_neon.S
@@ -195,12 +195,12 @@ bf8:
          ldp            q4, q5, [x1]
          fadd           v2.4s, v0.4s, v4.4s
          fadd           v3.4s, v1.4s, v5.4s
+        fsub           v6.4s, v0.4s, v4.4s
          stp            q2, q3, [x0], #32
-        fsub           v2.4s, v0.4s, v4.4s
-        fsub           v3.4s, v1.4s, v5.4s
-        stp            q2, q3, [x1], #32
-        sub            w2, w2, #8
-        cbnz           w2, bf8
+        fsub           v7.4s, v1.4s, v5.4s
+        subs           w2, w2, #8
+        stp            q6, q7, [x1], #32
+        b.gt           bf8
          ret

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 2+ messages in thread

end of thread, other threads:[~2025-01-19 21:28 UTC | newest]

Thread overview: 2+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-19 18:07 [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_butterflies_float_neon Krzysztof Pyrkosz via ffmpeg-devel
2025-01-19 21:28 ` Martin Storsjö

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git