Sep 2, 2022, 07:49 by dev@lynne.ee:

> Version 2 notes: halved the number of loads and iterations in the
> pre-transform loop by exploiting the symmetry.
>
> This commit implements an iMDCT in pure assembly.
>
> This is capable of processing any mod-8 transform, rather than just
> powers of two, but since power-of-two is all we have assembly for
> currently, that's what's supported.
> It would really help if we could somehow use the C code to decide
> which function to jump into, but exposing function labels from assembly
> into C is anything but easy.
> The post-transform loop could probably be improved.
>
> This was somewhat annoying to write, as we must support arbitrary
> strides during runtime. There's a fast branch for stride == 4 bytes
> and a slower one which uses vgatherdps.
>
> Zen 3 benchmarks for stride == 4 for old (av_imdct_half) vs new (av_tx):
>
> 128pt:
>    2815 decicycles in         av_tx (imdct),16776766 runs,    450 skips
>    3097 decicycles in         av_imdct_half,16776745 runs,    471 skips
>
> 256pt:
>    4931 decicycles in         av_tx (imdct), 4193127 runs,   1177 skips
>    5401 decicycles in         av_imdct_half, 2097058 runs,     94 skips
>
> 512pt:
>    9764 decicycles in         av_tx (imdct), 4193929 runs,    375 skips
>   10690 decicycles in         av_imdct_half, 2096948 runs,    204 skips
>
> 1024pt:
>   20113 decicycles in         av_tx (imdct), 4194202 runs,    102 skips
>   21258 decicycles in         av_imdct_half, 2097147 runs,      5 skips
>
> Patch attached.
>

Forgot to git add some minor reordering/fma changes.
W/e.
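
For anyone reading along, the stride handling described above boils down
to roughly the following in C terms. This is only a sketch and not the
actual assembly: load_strided is a made-up name, and the scalar loop
merely stands in for what the vgatherdps branch does with vectors.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): read n floats from src,
 * where consecutive elements are stride_bytes apart, into contiguous dst. */
static void load_strided(float *dst, const void *src, size_t n,
                         ptrdiff_t stride_bytes)
{
    const uint8_t *p = src;

    /* Fast branch: stride == 4 bytes, so the input is already contiguous
     * and can be read with plain loads. */
    if (stride_bytes == (ptrdiff_t)sizeof(float)) {
        memcpy(dst, p, n * sizeof(float));
        return;
    }

    /* Slow branch: arbitrary runtime stride, one element at a time.
     * This is the scalar equivalent of a gather. */
    for (size_t i = 0; i < n; i++)
        memcpy(&dst[i], p + (ptrdiff_t)i * stride_bytes, sizeof(float));
}
```

Both branches produce the same contiguous buffer; the only difference is
how the input is fetched, which is why the stride == 4 case can skip the
gather entirely.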