[FFmpeg-devel] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
 help / color / mirror / Atom feed

* [FFmpeg-devel] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon
@ 2025-01-19 15:04 Krzysztof Pyrkosz via ffmpeg-devel
  2025-01-19 15:04 ` [FFmpeg-devel] [PATCH] " Krzysztof Pyrkosz via ffmpeg-devel
  0 siblings, 1 reply; 4+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-01-19 15:04 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

Removed two redundant pointer arithmetic operations and split load
section into two smaller ones.
Speedup compared to C increased from 4.50 to 5.80.


_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 4+ messages in thread

* [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon
  2025-01-19 15:04 [FFmpeg-devel] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-01-19 15:04 ` Krzysztof Pyrkosz via ffmpeg-devel
  2025-01-19 16:04   ` Lynne
  2025-01-19 20:41   ` Martin Storsjö
  0 siblings, 2 replies; 4+ messages in thread
From: Krzysztof Pyrkosz via ffmpeg-devel @ 2025-01-19 15:04 UTC (permalink / raw)
  To: ffmpeg-devel; +Cc: Krzysztof Pyrkosz

---
 libavutil/aarch64/float_dsp_neon.S | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
index 35e2715b87..b21f34c084 100644
--- a/libavutil/aarch64/float_dsp_neon.S
+++ b/libavutil/aarch64/float_dsp_neon.S
@@ -40,18 +40,17 @@ function ff_vector_fmul_neon, export=1
 endfunc
 
 function ff_vector_fmac_scalar_neon, export=1
-        mov             x3,  #-32
 1:      subs            w2,  w2,  #16
-        ld1             {v16.4s, v17.4s}, [x0], #32
-        ld1             {v18.4s, v19.4s}, [x0], x3
-        ld1             {v4.4s,  v5.4s},  [x1], #32
-        ld1             {v6.4s,  v7.4s},  [x1], #32
+        ldp             q16, q17, [x0]
+        ldp             q4, q5, [x1], #32
         fmla            v16.4s, v4.4s,  v0.s[0]
         fmla            v17.4s, v5.4s,  v0.s[0]
+        stp             q16, q17, [x0], #32
+        ldp             q18, q19, [x0]
+        ldp             q6, q7, [x1], #32
         fmla            v18.4s, v6.4s,  v0.s[0]
         fmla            v19.4s, v7.4s,  v0.s[0]
-        st1             {v16.4s, v17.4s}, [x0], #32
-        st1             {v18.4s, v19.4s}, [x0], #32
+        stp             q18, q19, [x0], #32
         b.ne            1b
         ret
 endfunc
-- 
2.45.2

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon
  2025-01-19 15:04 ` [FFmpeg-devel] [PATCH] " Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-01-19 16:04   ` Lynne
  2025-01-19 20:41   ` Martin Storsjö
  1 sibling, 0 replies; 4+ messages in thread
From: Lynne @ 2025-01-19 16:04 UTC (permalink / raw)
  To: ffmpeg-devel


[-- Attachment #1.1.1.1: Type: text/plain, Size: 1656 bytes --]

On 20/01/2025 00:04, Krzysztof Pyrkosz via ffmpeg-devel wrote:
> ---
>   libavutil/aarch64/float_dsp_neon.S | 13 ++++++-------
>   1 file changed, 6 insertions(+), 7 deletions(-)
> 
> diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
> index 35e2715b87..b21f34c084 100644
> --- a/libavutil/aarch64/float_dsp_neon.S
> +++ b/libavutil/aarch64/float_dsp_neon.S
> @@ -40,18 +40,17 @@ function ff_vector_fmul_neon, export=1
>   endfunc
>   
>   function ff_vector_fmac_scalar_neon, export=1
> -        mov             x3,  #-32
>   1:      subs            w2,  w2,  #16
> -        ld1             {v16.4s, v17.4s}, [x0], #32
> -        ld1             {v18.4s, v19.4s}, [x0], x3
> -        ld1             {v4.4s,  v5.4s},  [x1], #32
> -        ld1             {v6.4s,  v7.4s},  [x1], #32
> +        ldp             q16, q17, [x0]
> +        ldp             q4, q5, [x1], #32
>           fmla            v16.4s, v4.4s,  v0.s[0]
>           fmla            v17.4s, v5.4s,  v0.s[0]
> +        stp             q16, q17, [x0], #32
> +        ldp             q18, q19, [x0]
> +        ldp             q6, q7, [x1], #32
>           fmla            v18.4s, v6.4s,  v0.s[0]
>           fmla            v19.4s, v7.4s,  v0.s[0]
> -        st1             {v16.4s, v17.4s}, [x0], #32
> -        st1             {v18.4s, v19.4s}, [x0], #32
> +        stp             q18, q19, [x0], #32
>           b.ne            1b
>           ret
>   endfunc

LGTM.
ldp/stp are underrated instructions, they let you ignore the consecutive 
requirement of ld1/st1 when moving 2 vectors at a time.
Will push soon.

[-- Attachment #1.1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 637 bytes --]

[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]

[-- Attachment #2: Type: text/plain, Size: 251 bytes --]

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon
  2025-01-19 15:04 ` [FFmpeg-devel] [PATCH] " Krzysztof Pyrkosz via ffmpeg-devel
  2025-01-19 16:04   ` Lynne
@ 2025-01-19 20:41   ` Martin Storsjö
  1 sibling, 0 replies; 4+ messages in thread
From: Martin Storsjö @ 2025-01-19 20:41 UTC (permalink / raw)
  To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz

On Sun, 19 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---
> libavutil/aarch64/float_dsp_neon.S | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
> index 35e2715b87..b21f34c084 100644
> --- a/libavutil/aarch64/float_dsp_neon.S
> +++ b/libavutil/aarch64/float_dsp_neon.S
> @@ -40,18 +40,17 @@ function ff_vector_fmul_neon, export=1
> endfunc
>
> function ff_vector_fmac_scalar_neon, export=1
> -        mov             x3,  #-32
> 1:      subs            w2,  w2,  #16
> -        ld1             {v16.4s, v17.4s}, [x0], #32
> -        ld1             {v18.4s, v19.4s}, [x0], x3
> -        ld1             {v4.4s,  v5.4s},  [x1], #32
> -        ld1             {v6.4s,  v7.4s},  [x1], #32
> +        ldp             q16, q17, [x0]
> +        ldp             q4, q5, [x1], #32
>         fmla            v16.4s, v4.4s,  v0.s[0]
>         fmla            v17.4s, v5.4s,  v0.s[0]
> +        stp             q16, q17, [x0], #32
> +        ldp             q18, q19, [x0]
> +        ldp             q6, q7, [x1], #32
>         fmla            v18.4s, v6.4s,  v0.s[0]
>         fmla            v19.4s, v7.4s,  v0.s[0]
> -        st1             {v16.4s, v17.4s}, [x0], #32
> -        st1             {v18.4s, v19.4s}, [x0], #32
> +        stp             q18, q19, [x0], #32
>         b.ne            1b
>         ret
> endfunc
> --

The change looks ok to me, but for changes like this, please do quote 
actual checkasm benchmark numbers, with the actual numbers before/after 
(not just the change in speedup factor), and include such numbers in the 
commit message for reference. And include info about what core you have 
benchmarked it on.

The reason for that is that various tunings of assembly like this can have 
a lot of effect on one core, and little/no effect on others, or even 
contradictory effects on others. So in order not to go back and forth (if 
someone else comes along next week and does a different tuning for a 
different core), please include info about the actual numbers and what you 
have benchmarked on.

I benchmarked this change, on a handful of different cores, and got the 
following numbers:

Before:               Cortex A53    A72     A73    A78
vector_fmac_scalar_neon:   606.0  213.8   321.8  115.5
After:
vector_fmac_scalar_neon:   608.0  198.2   303.2   90.8

So it shows no change (but no regression either) on A53, and a 5-27% 
speedup on in-order cores. So it looks like a reasonable tradeoff.

However, to show the contrast, if one were to favour optimizations for the 
A53, one could do this simple change:

--- a/libavutil/aarch64/float_dsp_neon.S
+++ b/libavutil/aarch64/float_dsp_neon.S
@@ -43,8 +43,8 @@ function ff_vector_fmac_scalar_neon, export=1
          mov             x3,  #-32
  1:      subs            w2,  w2,  #16
          ld1             {v16.4s, v17.4s}, [x0], #32
-        ld1             {v18.4s, v19.4s}, [x0], x3
          ld1             {v4.4s,  v5.4s},  [x1], #32
+        ld1             {v18.4s, v19.4s}, [x0], x3
          ld1             {v6.4s,  v7.4s},  [x1], #32
          fmla            v16.4s, v4.4s,  v0.s[0]
          fmla            v17.4s, v5.4s,  v0.s[0]

This gives the following numbers:

Before:               Cortex A53    A72     A73    A78
vector_fmac_scalar_neon:   606.0  213.8   321.8  115.5
After:
vector_fmac_scalar_neon:   572.0  213.8   308.0  115.2

I.e. such a change would be almost as good for the A73 and even better for 
A53, while having no effect on A72 and A78.

In this case, the change probably is fine and probably a good compromise. 
But I wanted to show this example to highlight the fact that tunings of 
functions can be very subjective.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2025-01-19 20:41 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-19 15:04 [FFmpeg-devel] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon Krzysztof Pyrkosz via ffmpeg-devel
2025-01-19 15:04 ` [FFmpeg-devel] [PATCH] " Krzysztof Pyrkosz via ffmpeg-devel
2025-01-19 16:04   ` Lynne
2025-01-19 20:41   ` Martin Storsjö

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git