Hi,

Please see attached an attempt to optimise the 8-bit input to v210enc to reduce the number of shuffles. This comes at the cost of having to extract the middle element, perform a DWORD shift on it, and then reinsert it.

I have added a few comments but any other ideas are welcome.

Crude benchmarks on Intel(R) Xeon(R) D-2123IT:

Before:
v210_planar_pack_8_ssse3: 316.5
v210_planar_pack_8_avx: 319.0
v210_planar_pack_8_avx2: 223.0

After:
v210_planar_pack_8_ssse3: 321.0
v210_planar_pack_8_avx: 326.0
v210_planar_pack_8_avx2: 217.0
v210_planar_pack_8_avx512: 211.0

Regards,
Kieran Kunhya