Sep 2, 2022, 07:49 by dev@lynne.ee:

> Version 2 notes: halved the amount of loads and loops for the
> pre-transform loop by exploiting the symmetry.
>
> This commit implements an iMDCT in pure assembly.
>
> This is capable of processing any mod-8 transforms, rather than just
> power of two, but since power of two is all we have assembly for
> currently, that's what's supported.
> It would really benefit if we could somehow use the C code to decide
> which function to jump into, but exposing function labels from assembly
> into C is anything but easy.
> The post-transform loop could probably be improved.
>
> This was somewhat annoying to write, as we must support arbitrary
> strides during runtime. There's a fast branch for stride == 4 bytes
> and a slower one which uses vgatherdps.
>
> Zen 3 benchmarks for stride == 4 for old (av_imdct_half) vs new (av_tx):
>
> 128pt:
>    2815 decicycles in         av_tx (imdct),16776766 runs,    450 skips
>    3097 decicycles in         av_imdct_half,16776745 runs,    471 skips
>
> 256pt:
>    4931 decicycles in         av_tx (imdct), 4193127 runs,   1177 skips
>    5401 decicycles in         av_imdct_half, 2097058 runs,     94 skips
>
> 512pt:
>    9764 decicycles in         av_tx (imdct), 4193929 runs,    375 skips
>   10690 decicycles in         av_imdct_half, 2096948 runs,    204 skips
>
> 1024pt:
>   20113 decicycles in         av_tx (imdct), 4194202 runs,    102 skips
>   21258 decicycles in         av_imdct_half, 2097147 runs,      5 skips
>
> Patch attached.
>

Forgot to git add some minor reordering/fma changes. W/e.
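
For anyone reading along, the stride handling described in the quoted notes (a fast branch for stride == 4 bytes, a slower gather branch for everything else) can be sketched in C roughly as below. This is only an illustration of the idea, not the code from the patch: `load_strided` is a hypothetical helper name, and the per-element loop merely stands in for what vgatherdps does in the assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of FFmpeg): copy `n` floats from `src`,
 * where consecutive inputs are `stride` bytes apart, into the packed
 * buffer `dst`. */
void load_strided(float *dst, const void *src, size_t n, ptrdiff_t stride)
{
    const uint8_t *p = src;

    if (stride == (ptrdiff_t)sizeof(float)) {
        /* Fast branch: inputs are contiguous, so a single block copy is
         * enough (plain vector loads in an assembly version). */
        memcpy(dst, p, n * sizeof(float));
        return;
    }

    /* Slow branch: fetch each element from its own byte offset. This is
     * a scalar stand-in for a hardware gather such as vgatherdps. */
    for (size_t i = 0; i < n; i++)
        memcpy(&dst[i], p + (ptrdiff_t)i * stride, sizeof(float));
}
```

The stride is taken in bytes to match the "stride == 4 bytes" wording above. Checking it once up front keeps the common contiguous case free of any per-element address arithmetic.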