* Re: [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon
2025-01-19 15:04 ` [FFmpeg-devel] [PATCH] " Krzysztof Pyrkosz via ffmpeg-devel
@ 2025-01-19 16:04 ` Lynne
2025-01-19 20:41 ` Martin Storsjö
1 sibling, 0 replies; 4+ messages in thread
From: Lynne @ 2025-01-19 16:04 UTC (permalink / raw)
To: ffmpeg-devel
[-- Attachment #1.1.1.1: Type: text/plain, Size: 1656 bytes --]
On 20/01/2025 00:04, Krzysztof Pyrkosz via ffmpeg-devel wrote:
> ---
> libavutil/aarch64/float_dsp_neon.S | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
> index 35e2715b87..b21f34c084 100644
> --- a/libavutil/aarch64/float_dsp_neon.S
> +++ b/libavutil/aarch64/float_dsp_neon.S
> @@ -40,18 +40,17 @@ function ff_vector_fmul_neon, export=1
> endfunc
>
> function ff_vector_fmac_scalar_neon, export=1
> - mov x3, #-32
> 1: subs w2, w2, #16
> - ld1 {v16.4s, v17.4s}, [x0], #32
> - ld1 {v18.4s, v19.4s}, [x0], x3
> - ld1 {v4.4s, v5.4s}, [x1], #32
> - ld1 {v6.4s, v7.4s}, [x1], #32
> + ldp q16, q17, [x0]
> + ldp q4, q5, [x1], #32
> fmla v16.4s, v4.4s, v0.s[0]
> fmla v17.4s, v5.4s, v0.s[0]
> + stp q16, q17, [x0], #32
> + ldp q18, q19, [x0]
> + ldp q6, q7, [x1], #32
> fmla v18.4s, v6.4s, v0.s[0]
> fmla v19.4s, v7.4s, v0.s[0]
> - st1 {v16.4s, v17.4s}, [x0], #32
> - st1 {v18.4s, v19.4s}, [x0], #32
> + stp q18, q19, [x0], #32
> b.ne 1b
> ret
> endfunc
LGTM.
ldp/stp are underrated instructions, they let you ignore the consecutive
requirement of ld1/st1 when moving 2 vectors at a time.
Will push soon.
[-- Attachment #1.1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 637 bytes --]
[-- Attachment #1.2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 236 bytes --]
[-- Attachment #2: Type: text/plain, Size: 251 bytes --]
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 4+ messages in thread
* Re: [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon
2025-01-19 15:04 ` [FFmpeg-devel] [PATCH] " Krzysztof Pyrkosz via ffmpeg-devel
2025-01-19 16:04 ` Lynne
@ 2025-01-19 20:41 ` Martin Storsjö
1 sibling, 0 replies; 4+ messages in thread
From: Martin Storsjö @ 2025-01-19 20:41 UTC (permalink / raw)
To: Krzysztof Pyrkosz via ffmpeg-devel; +Cc: Krzysztof Pyrkosz
On Sun, 19 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:
> ---
> libavutil/aarch64/float_dsp_neon.S | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
> index 35e2715b87..b21f34c084 100644
> --- a/libavutil/aarch64/float_dsp_neon.S
> +++ b/libavutil/aarch64/float_dsp_neon.S
> @@ -40,18 +40,17 @@ function ff_vector_fmul_neon, export=1
> endfunc
>
> function ff_vector_fmac_scalar_neon, export=1
> - mov x3, #-32
> 1: subs w2, w2, #16
> - ld1 {v16.4s, v17.4s}, [x0], #32
> - ld1 {v18.4s, v19.4s}, [x0], x3
> - ld1 {v4.4s, v5.4s}, [x1], #32
> - ld1 {v6.4s, v7.4s}, [x1], #32
> + ldp q16, q17, [x0]
> + ldp q4, q5, [x1], #32
> fmla v16.4s, v4.4s, v0.s[0]
> fmla v17.4s, v5.4s, v0.s[0]
> + stp q16, q17, [x0], #32
> + ldp q18, q19, [x0]
> + ldp q6, q7, [x1], #32
> fmla v18.4s, v6.4s, v0.s[0]
> fmla v19.4s, v7.4s, v0.s[0]
> - st1 {v16.4s, v17.4s}, [x0], #32
> - st1 {v18.4s, v19.4s}, [x0], #32
> + stp q18, q19, [x0], #32
> b.ne 1b
> ret
> endfunc
> --
The change looks ok to me, but for changes like this, please do quote
actual checkasm benchmark numbers, with the actual numbers before/after
(not just the change in speedup factor), and include such numbers in the
commit message for reference. And include info about what core you have
benchmarked it on.
The reason for that is that various tunings of assembly like this can have
a lot of effect on one core, and little/no effect on others, or even
contradictory effects on others. So in order not to go back and forth (if
someone else comes along next week and does a different tuning for a
different core), please include info about the actual numbers and what you
have benchmarked on.
I benchmarked this change, on a handful of different cores, and got the
following numbers:
Before: Cortex A53 A72 A73 A78
vector_fmac_scalar_neon: 606.0 213.8 321.8 115.5
After:
vector_fmac_scalar_neon: 608.0 198.2 303.2 90.8
So it shows no change (but no regression either) on A53, and a 5-27%
speedup on in-order cores. So it looks like a reasonable tradeoff.
However, to show the contrast, if one were to favour optimizations for the
A53, one could do this simple change:
--- a/libavutil/aarch64/float_dsp_neon.S
+++ b/libavutil/aarch64/float_dsp_neon.S
@@ -43,8 +43,8 @@ function ff_vector_fmac_scalar_neon, export=1
mov x3, #-32
1: subs w2, w2, #16
ld1 {v16.4s, v17.4s}, [x0], #32
- ld1 {v18.4s, v19.4s}, [x0], x3
ld1 {v4.4s, v5.4s}, [x1], #32
+ ld1 {v18.4s, v19.4s}, [x0], x3
ld1 {v6.4s, v7.4s}, [x1], #32
fmla v16.4s, v4.4s, v0.s[0]
fmla v17.4s, v5.4s, v0.s[0]
This gives the following numbers:
Before: Cortex A53 A72 A73 A78
vector_fmac_scalar_neon: 606.0 213.8 321.8 115.5
After:
vector_fmac_scalar_neon: 572.0 213.8 308.0 115.2
I.e. such a change would be almost as good for the A73 and even better for
A53, while having no effect on A72 and A78.
In this case, the change probably is fine and probably a good compromise.
But I wanted to show this example to highlight the fact that tunings of
functions can be very subjective.
// Martin
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
^ permalink raw reply [flat|nested] 4+ messages in thread