From: "Martin Storsjö" <martin@martin.st>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH] lavu/tx: implement aarch64 NEON SIMD
Date: Thu, 25 Aug 2022 13:51:57 +0300 (EEST)
Message-ID: <8854c096-5855-d110-29a0-dace5d887bf0@martin.st>
In-Reply-To: <N9PbCTK--3-2@lynne.ee>
On Sun, 14 Aug 2022, Lynne wrote:
> The fastest fast Fourier transform in not just the west, but the world,
> now for the most popular toy ISA.
>
> On a high level, it follows the design of the AVX2 version closely,
> with the exception that the input is slightly less permuted as we don't have
> to do lane switching with the input on double 4pt and 8pt.
>
> On a low level, the lack of subadd/addsub instructions REALLY penalizes
> any attempt at writing an FFT. That single register matters a _lot_,
> and reloading it simply takes unacceptably long.
> In x86 land, vendors would've noticed developers need this.
> In ARM land, you get a badly designed complex multiplication instruction
> we cannot use, that's not present on 95% of devices. Because only
> compilers matter, right?
>
> There's still room for improvement. I think using stp
> instead of st1 may help in a few places, some reordering
> may help performance in the recombination macro,
> and there are other TODOs I've left marked in the code.
> There are also a few places where the limited range on
> immediates in adds may be worked around.
>
> All timings below are in cycles:
> A53:
> Length | C       | New (lavu) | Old (lavc) | FFTW
> ------ | ------- | ---------- | ---------- | -------
>      4 |     842 |        420 |       1210 |    1460
>      8 |    1538 |       1020 |       1850 |    2520
>     16 |    3717 |       1900 |       3700 |    3990
>     32 |    9156 |       4070 |       8289 |    8860
>     64 |   21160 |       9931 |      18600 |   19625
>    128 |   49180 |      23278 |      41922 |   41922
>    256 |  112073 |      53876 |      93202 |  101092
>    512 |  252864 |     122884 |     205897 |  207868
>   1024 |  560512 |     278322 |     458071 |  453053
>   2048 | 1295402 |     775835 |    1038205 | 1020265
>   4096 | 3281263 |    2021221 |    2409718 | 2577554
>   8192 | 8577845 |    4780526 |    5673041 | 6802722
>
> Apple M1
> New - Total for len 512 reps 2097152 = 1.459141 s
> Old - Total for len 512 reps 2097152 = 2.251344 s
> FFTW - Total for len 512 reps 2097152 = 1.868429 s
>
> New - Total for len 1024 reps 4194304 = 6.490080 s
> Old - Total for len 1024 reps 4194304 = 9.604949 s
> FFTW - Total for len 1024 reps 4194304 = 7.889281 s
>
> New - Total for len 16384 reps 262144 = 10.374001 s
> Old - Total for len 16384 reps 262144 = 15.266713 s
> FFTW - Total for len 16384 reps 262144 = 12.341745 s
>
> New - Total for len 65536 reps 8192 = 1.769812 s
> Old - Total for len 65536 reps 8192 = 4.209413 s
> FFTW - Total for len 65536 reps 8192 = 3.012365 s
>
> New - Total for len 131072 reps 4096 = 1.942836 s
> Old - Segfaults
> FFTW - Total for len 131072 reps 4096 = 3.713713 s
>
> Patch attached.
I've had a look at this now.
I don't have much to add/comment about the core implementation itself and
the performance of it (I didn't try to read it and follow it from that
perspective).
Regarding non-functional aspects, the patch needs a couple of fixes to
build with other assemblers (binutils and MS armasm64.exe). I've also made
a couple of minor fixes: instead of using a series of mov+add+add for
loading a large constant, use the ldr= pseudo-instruction, which is made
exactly for loading odd constants, and avoid unnecessary \() operators
after macro arguments.
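
To illustrate the immediate-range limitation behind the mov+add+add
sequences, here is a minimal C sketch. It assumes only the standard
AArch64 ADD (immediate) encoding: a 12-bit unsigned value, optionally
shifted left by 12 bits. The helper names fits_add_imm and split_add_imm
are hypothetical and are not part of the patch or of any FFmpeg API.

```c
#include <assert.h>
#include <stdint.h>

/* AArch64 ADD (immediate) takes a 12-bit unsigned immediate, optionally
 * shifted left by 12 bits. Returns 1 if v fits in a single such ADD. */
static int fits_add_imm(uint64_t v)
{
    return (v & ~UINT64_C(0xfff)) == 0 || (v & ~UINT64_C(0xfff000)) == 0;
}

/* Hypothetical helper: split v (which must be below 2^24) into the
 * unshifted chunk and the chunk that would be applied with "lsl #12".
 * Returns the number of ADD instructions needed (0, 1 or 2). */
static int split_add_imm(uint64_t v, uint32_t *lo12, uint32_t *hi12)
{
    assert(v < (UINT64_C(1) << 24));
    *lo12 = (uint32_t)(v & 0xfff);
    *hi12 = (uint32_t)((v >> 12) & 0xfff);
    return (*lo12 != 0) + (*hi12 != 0);
}
```

For instance, 0x123456 does not fit a single ADD under this encoding and
splits into 0x456 plus 0x123 shifted left by 12. For constants that need
several such steps, a literal load like ldr= avoids the instruction chain.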
See https://github.com/mstorsjo/ffmpeg/commits/aarch64-fft for my
incremental fixes on top; at least the first three are needed to fix
assembling with the other tools, but all of them up to the WIP commit (for
removing prefetching) are probably worthwhile to include; feel free to
squash these into your patch.
Coding style wise, it looks mostly reasonable; some things use a slightly
nonstandard style (spaces within {} for loads/stores, and some operand
columns are right-adjusted instead of left-adjusted), but it's probably
acceptable as such.
// Martin