On Mon, Feb 10, 2025 at 03:15:35PM +0200, Martin Storsjö wrote:
> > Just as I'm about to send this patch, I'm thinking if non-interleaved
> > read followed by 4 invocations of TBL wouldn't be more performant. One
> > call to generate a contiguous vector of u, second for v and two for y.
> > I'm curious to find out.
>
> My guess is that it may be more performant on more modern cores, but
> probably not on older ones.

That's the case. It's 15% faster on A78 and twice as slow on A72.

> > + sxtw x7, w7
> > + ldrsw x8, [sp]
> > + ubfx x10, x4, #1, #31
>
> The ubfx instruction is kinda esoteric; I presume what you're doing here is
> essentially the same as "lsr #1"? That'd be much more idiomatic and
> readable.

That's correct. What put me off was that register 4 is passed as an int
(w4), and I expected register 10 to be 64 bits wide with the high bits set
to 0. "lsr w10, w4, #1" already does that, since a write to a W register
zeroes the upper half of the corresponding X register.

I modified the code to handle {uyvy,yuyv}toyuv{420,422} using macros, since
these 4 functions share common routines. The code lost some readability,
though.

Krzysztof
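
For reference, the ubfx/lsr equivalence discussed above can be sketched in C. This only models the documented semantics of the two instructions; it is not code from the patch, and the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of "ubfx x10, x4, #1, #31": take 31 bits starting at bit 1 of
 * the 64-bit source and zero-extend them into the destination. */
static uint64_t model_ubfx_1_31(uint64_t x4)
{
    return (x4 >> 1) & ((UINT64_C(1) << 31) - 1);
}

/* Model of "lsr w10, w4, #1": shift the low 32 bits right by one.
 * Writing a W register clears bits 63:32 of the X register. */
static uint64_t model_lsr_w_1(uint64_t x4)
{
    uint32_t w4 = (uint32_t)x4;   /* only the low half is read */
    return (uint64_t)(w4 >> 1);   /* upper half of the result is zero */
}

/* Compare both models, including inputs with garbage in the upper
 * 32 bits, as can happen when an int is passed in w4. */
static void check_equivalence(void)
{
    const uint64_t samples[] = {
        0, 1, 2, 641, UINT64_C(0x7fffffff), UINT64_C(0xffffffff),
        UINT64_C(0xdeadbeef00000281), UINT64_MAX,
    };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        assert(model_ubfx_1_31(samples[i]) == model_lsr_w_1(samples[i]));
}
```

Both models ignore whatever is in the upper 32 bits of the source and always leave the upper 32 bits of the result clear, which is why the plain lsr form is enough here.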