From: "Martin Storsjö" <martin@martin.st> To: Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org> Cc: Krzysztof Pyrkosz <ffmpeg@szaka.eu> Subject: Re: [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_butterflies_float_neon Date: Sun, 19 Jan 2025 23:28:30 +0200 (EET) Message-ID: <e27d084-9d5f-264b-e137-153e84325356@martin.st> (raw) In-Reply-To: <20250119180706.3075-2-ffmpeg@szaka.eu> On Sun, 19 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote: > Modified the main loop to handle 8 floats in one go. A special case of > length not being multiple of 8 is handled at the beginning. The speed > increased from 3.90 to 4.50. The same request about benchmark numbers and reference to where they were measured. > > Krzysztof > > --- > libavutil/aarch64/float_dsp_neon.S | 30 ++++++++++++++++++++++-------- > 1 file changed, 22 insertions(+), 8 deletions(-) > > diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S > index 35e2715b87..f99cc2e6b2 100644 > --- a/libavutil/aarch64/float_dsp_neon.S > +++ b/libavutil/aarch64/float_dsp_neon.S > @@ -178,14 +178,28 @@ function ff_vector_fmul_reverse_neon, export=1 > endfunc > > function ff_butterflies_float_neon, export=1 > -1: ld1 {v0.4s}, [x0] > - ld1 {v1.4s}, [x1] > - subs w2, w2, #4 > - fsub v2.4s, v0.4s, v1.4s > - fadd v3.4s, v0.4s, v1.4s > - st1 {v2.4s}, [x1], #16 > - st1 {v3.4s}, [x0], #16 > - b.gt 1b > + tst w2, #4 Note how you're altering the indentation of the whole function here - don't do that, keep things aligned as it was. (The patch for ff_vector_fmul_add_neon also had that issue.) > + b.eq bf8 > + ldr q0, [x0] > + ldr q1, [x1] > + fadd v2.4s, v0.4s, v1.4s > + fsub v3.4s, v0.4s, v1.4s > + str q2, [x0], #16 > + str q3, [x1], #16 > + sub w2, w2, #4 > + cbnz w2, bf8 > + ret > +bf8: This will emit an actual symbol to the symbol table. That's not an issue per se, but the output of some tools (disassemblers etc) can be nicer if we avoid them. If we really do want a textual label, there are also ways of making them into local labels that don't end up in the symbol table; prefixing it with .L will do that, for ELF. We have a couple cases of that in our assembly, but there's not a lot of it. However, for MachO, it has to be prefixed by a plain "L", not ".L". So if we want to start doing this, we could add this macro to asm.S: https://code.videolan.org/videolan/dav1d/-/blob/master/src/arm/asm.S?ref_type=heads#L352-356 But for this case, a plain numeric 8: could probably be just as good. > + ldp q0, q1, [x0] > + ldp q4, q5, [x1] > + fadd v2.4s, v0.4s, v4.4s > + fadd v3.4s, v1.4s, v5.4s > + stp q2, q3, [x0], #32 Why are you doing the store of q3 directly after the calculation of it? This will cause stalls on in-order cores. > + fsub v2.4s, v0.4s, v4.4s > + fsub v3.4s, v1.4s, v5.4s > + stp q2, q3, [x1], #32 > + sub w2, w2, #8 > + cbnz w2, bf8 Likewise, move the sub between the fsub and cbnz, to improve things for in-order cores. Overall, this change does seem to be helpful (except for A73, surprisingly), but it can be made better with the suggestions I made above. Before: Cortex A53 A72 A73 A78 butterflies_float_neon: 782.0 336.8 405.8 163.0 After this patch: butterflies_float_neon: 718.5 323.8 525.2 147.8 With extra modifications: butterflies_float_neon: 660.5 328.8 472.8 147.0 That is with the following changes on top: +++ b/libavutil/aarch64/float_dsp_neon.S @@ -195,12 +195,12 @@ bf8: ldp q4, q5, [x1] fadd v2.4s, v0.4s, v4.4s fadd v3.4s, v1.4s, v5.4s + fsub v6.4s, v0.4s, v4.4s stp q2, q3, [x0], #32 - fsub v2.4s, v0.4s, v4.4s - fsub v3.4s, v1.4s, v5.4s - stp q2, q3, [x1], #32 - sub w2, w2, #8 - cbnz w2, bf8 + fsub v7.4s, v1.4s, v5.4s + subs w2, w2, #8 + stp q6, q7, [x1], #32 + b.gt bf8 ret // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
prev parent reply other threads:[~2025-01-19 21:28 UTC|newest] Thread overview: 2+ messages / expand[flat|nested] mbox.gz Atom feed top 2025-01-19 18:07 Krzysztof Pyrkosz via ffmpeg-devel 2025-01-19 21:28 ` Martin Storsjö [this message]
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=e27d084-9d5f-264b-e137-153e84325356@martin.st \ --to=martin@martin.st \ --cc=ffmpeg-devel@ffmpeg.org \ --cc=ffmpeg@szaka.eu \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: link
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel This inbox may be cloned and mirrored by anyone: git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git # If you have public-inbox 1.1+ installed, you may # initialize and index your mirror using the following commands: public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \ ffmpegdev@gitmailbox.com public-inbox-index ffmpegdev Example config snippet for mirrors. AGPL code for this site: git clone https://public-inbox.org/public-inbox.git