The fastest fast Fourier transform in not just the west, but the world, now for the most popular toy ISA.

At a high level, it follows the design of the AVX2 version closely, with the exception that the input is slightly less permuted, as we don't have to do lane switching with the input on double 4pt and 8pt.

At a low level, the lack of subadd/addsub instructions REALLY penalizes any attempt at writing an FFT. That single register matters a _lot_, and reloading it simply takes unacceptably long. In x86 land, vendors would've noticed developers need this. In ARM land, you get a badly designed complex multiplication instruction that we cannot use and that's not present on 95% of devices. Because only compilers matter, right?

There's still room for improvement. I think using stp instead of st1 may help in a few places, some reordering may help performance in the recombination macro, and there are other TODOs I've left marked in the code. There are also a few places where the limited range of immediates in adds may be worked around.
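For readers unfamiliar with the addsub complaint above, here is a minimal C sketch of what such an instruction computes and one common way to emulate it when it is missing. It is not taken from the patch: the helper names `addsub4` and `addsub4_masked` are hypothetical, the lane convention follows x86 SSE3 ADDSUBPS (subtract in even lanes, add in odd lanes), and the sign-mask trick is only an assumption about the kind of workaround that ties up a register.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical reference: x86-style addsub on 4 floats.
 * Even lanes compute a - b, odd lanes compute a + b. */
static void addsub4(float out[4], const float a[4], const float b[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (i & 1) ? a[i] + b[i] : a[i] - b[i];
}

/* Hypothetical emulation without addsub: flip the sign bit of b in the
 * even lanes using a constant mask, then do a plain add. In SIMD code
 * the mask has to live in a vector register (or be reloaded each time). */
static void addsub4_masked(float out[4], const float a[4], const float b[4])
{
    static const uint32_t sign_mask[4] = { 0x80000000u, 0, 0x80000000u, 0 };

    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        float flipped;

        memcpy(&bits, &b[i], sizeof(bits));       /* reinterpret float as bits */
        bits ^= sign_mask[i];                     /* negate even lanes only */
        memcpy(&flipped, &bits, sizeof(flipped));
        out[i] = a[i] + flipped;
    }
}
```

Both helpers produce the same result for ordinary inputs. The emulated form needs one extra XOR per use plus a constant that has to stay resident, which is the sort of cost a native addsub would avoid.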
All timings below are in cycles:

A53:
Length | C           | New (lavu)  | Old (lavc)  | FFTW
------ |-------------|-------------|-------------|-----
4      |         842 | 420         | 1210        | 1460
8      |        1538 | 1020        | 1850        | 2520
16     |        3717 | 1900        | 3700        | 3990
32     |        9156 | 4070        | 8289        | 8860
64     |       21160 | 9931        | 18600       | 19625
128    |       49180 | 23278       | 41922       | 41922
256    |      112073 | 53876       | 93202       | 101092
512    |      252864 | 122884      | 205897      | 207868
1024   |      560512 | 278322      | 458071      | 453053
2048   |     1295402 | 775835      | 1038205     | 1020265
4096   |     3281263 | 2021221     | 2409718     | 2577554
8192   |     8577845 | 4780526     | 5673041     | 6802722

Apple M1:
New  - Total for len 512 reps 2097152 = 1.459141 s
Old  - Total for len 512 reps 2097152 = 2.251344 s
FFTW - Total for len 512 reps 2097152 = 1.868429 s

New  - Total for len 1024 reps 4194304 = 6.490080 s
Old  - Total for len 1024 reps 4194304 = 9.604949 s
FFTW - Total for len 1024 reps 4194304 = 7.889281 s

New  - Total for len 16384 reps 262144 = 10.374001 s
Old  - Total for len 16384 reps 262144 = 15.266713 s
FFTW - Total for len 16384 reps 262144 = 12.341745 s

New  - Total for len 65536 reps 8192 = 1.769812 s
Old  - Total for len 65536 reps 8192 = 4.209413 s
FFTW - Total for len 65536 reps 8192 = 3.012365 s

New  - Total for len 131072 reps 4096 = 1.942836 s
Old  - Segfaults
FFTW - Total for len 131072 reps 4096 = 3.713713 s

Patch attached.