On 1/29/2025 10:03 AM, Shreesh Adiga wrote:
> Hi Andreas,
>
> I am not sure if that is needed. I can add the data observed on my machine
> (AMD 7950x Zen 4),
> I think this will vary from machine to machine. It is expected to be around 2x
> compared to AVX2 and there is no core change apart from processing the
> scalar loop with masked instructions.
>
> The data doesn't entirely look consistent as per my expectations.
> All the shuffle variants are equivalent in the work they do, yet the speedups
> are not consistent as per the report.
>
> shuffle_bytes_0321_c:          56.5 ( 1.00x)
> shuffle_bytes_0321_ssse3:      15.2 ( 3.70x)
> shuffle_bytes_0321_avx2:       10.2 ( 5.51x)
> shuffle_bytes_0321_avx512icl:   9.2 ( 6.11x)
> shuffle_bytes_1230_c:          84.5 ( 1.00x)
> shuffle_bytes_1230_ssse3:      14.2 ( 5.93x)
> shuffle_bytes_1230_avx2:       15.2 ( 5.54x)
> shuffle_bytes_1230_avx512icl:  11.2 ( 7.51x)
> shuffle_bytes_2103_c:          48.5 ( 1.00x)
> shuffle_bytes_2103_ssse3:      21.2 ( 2.28x)
> shuffle_bytes_2103_avx2:       13.8 ( 3.53x)
> shuffle_bytes_2103_avx512icl:   9.2 ( 5.24x)
> shuffle_bytes_3012_c:          84.5 ( 1.00x)
> shuffle_bytes_3012_ssse3:      14.2 ( 5.93x)
> shuffle_bytes_3012_avx2:       16.2 ( 5.20x)
> shuffle_bytes_3012_avx512icl:  10.2 ( 8.24x)
> shuffle_bytes_3210_c:          89.2 ( 1.00x)
> shuffle_bytes_3210_ssse3:      24.2 ( 3.68x)
> shuffle_bytes_3210_avx2:       16.2 ( 5.49x)
> shuffle_bytes_3210_avx512icl:   9.2 ( 9.65x)
>
> I can add the details to commit message if you can confirm if it is needed.
>
> Thanks,
> Shreesh

Added the benchmarks and pushed the patch. Thanks.