This commit implements an iMDCT in pure assembly.

This is capable of processing any mod-8 transform, rather than just
power-of-two ones, but since power of two is all we have assembly for
currently, that's what's supported.
It would really benefit if we could somehow use the C code to decide
which function to jump into, but exposing function labels from assembly
into C is anything but easy.
The post-transform loop could probably be improved.

This was somewhat annoying to write, as we must support arbitrary
strides at runtime. There's a fast branch for stride == 4 bytes
and a slower one which uses vgatherdps.

Benchmarks for stride == 4 for old (av_imdct_half) vs new (av_tx):

128pt:
   2791 decicycles in         av_tx (imdct),16775675 runs,   1541 skips
   3024 decicycles in         av_imdct_half,16776779 runs,    437 skips

256pt:
   5055 decicycles in         av_tx (imdct), 2096602 runs,    550 skips
   5324 decicycles in         av_imdct_half, 2097046 runs,    106 skips

512pt:
   9922 decicycles in         av_tx (imdct), 2096983 runs,    169 skips
  10390 decicycles in         av_imdct_half, 2097002 runs,    150 skips

1024pt:
  20482 decicycles in         av_tx (imdct), 2097089 runs,     63 skips
  20662 decicycles in         av_imdct_half, 2097115 runs,     37 skips

Patch attached.
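
For readers unfamiliar with the two stride branches mentioned above, here is a
rough C sketch of the idea. It is not the code from the patch: load_strided()
is a hypothetical helper invented for illustration, and the slow branch is a
scalar loop standing in for what vgatherdps does in the assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not part of the patch): read n floats from src,
 * where consecutive input elements are stride_bytes apart, into the
 * contiguous buffer dst. */
static void load_strided(float *dst, const float *src,
                         size_t n, ptrdiff_t stride_bytes)
{
    const uint8_t *p = (const uint8_t *)src;

    if (stride_bytes == (ptrdiff_t)sizeof(float)) {
        /* Fast branch: stride == 4 bytes means the input is contiguous,
         * so plain (vector) loads are enough. */
        memcpy(dst, src, n * sizeof(float));
        return;
    }

    /* Slow branch: fetch each element from its own byte offset.
     * This is the scalar equivalent of a gather load. */
    for (size_t i = 0; i < n; i++)
        memcpy(&dst[i], p + (ptrdiff_t)i * stride_bytes, sizeof(float));
}
```

Calling load_strided(dst, src, 4, 8) on an array of floats would pick every
second element, while a stride of 4 takes the contiguous path.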