On Mon, Feb 10, 2025 at 03:15:35PM +0200, Martin Storsjö wrote:
> > Just as I'm about to send this patch, I'm thinking if non-interleaved
> > read followed by 4 invocations of TBL wouldn't be more performant. One
> > call to generate a contiguous vector of u, second for v and two for y.
> > I'm curious to find out.
> 
> My guess is that it may be more performant on more modern cores, but
> probably not on older ones.

That's the case. The TBL variant is 15% faster on A78 and twice as slow
on A72.
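
For reference, the TBL idea can be modelled in scalar C roughly as
below. This is only a sketch of the lookup pattern, not the code I
benchmarked: tbl4_model and uyvy64_deinterleave_model are made-up names,
and it assumes 64 bytes of UYVY (32 pixels) are loaded without
de-interleaving and then gathered with four table lookups.

```c
#include <stdint.h>

/* Scalar model of a TBL with a 4-register (64-byte) table: each output
 * byte is table[idx], or 0 when the index is out of range. */
static void tbl4_model(uint8_t out[16], const uint8_t table[64],
                       const uint8_t idx[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = idx[i] < 64 ? table[idx[i]] : 0;
}

/* Hypothetical helper: split 64 bytes of UYVY into 32 Y, 16 U and 16 V
 * bytes using four lookups (one for U, one for V, two for Y). */
static void uyvy64_deinterleave_model(const uint8_t src[64], uint8_t y[32],
                                      uint8_t u[16], uint8_t v[16])
{
    uint8_t iu[16], iv[16], iy0[16], iy1[16];

    /* In assembly these index vectors would be constants. */
    for (int i = 0; i < 16; i++) {
        iu[i]  = 4 * i;       /* U is byte 0 of each 4-byte group */
        iv[i]  = 4 * i + 2;   /* V is byte 2 of each 4-byte group */
        iy0[i] = 2 * i + 1;   /* Y is every odd byte, first half   */
        iy1[i] = 2 * i + 33;  /* Y is every odd byte, second half  */
    }

    tbl4_model(u,      src, iu);
    tbl4_model(v,      src, iv);
    tbl4_model(y,      src, iy0);
    tbl4_model(y + 16, src, iy1);
}
```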

> 
> > +        sxtw             x7, w7
> > +        ldrsw            x8, [sp]
> > +        ubfx             x10, x4, #1, #31
> 
> The ubfx instruction is kinda esoteric; I presume what you're doing here is
> essentially the same as "lsr #1"? That'd be much more idiomatic and
> readable.

That's correct. What put me off was that the argument is passed as an
int (w4), and I wanted x10 to hold a 64-bit value with the high bits set
to 0. lsr w10, w4, #1 already does that, since a write to a W register
zeroes the upper 32 bits of the corresponding X register.
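
To spell out the equivalence, here is a small C model of the two
instructions. The function names are made up for illustration and are
not part of the patch.

```c
#include <stdint.h>

/* Model of "ubfx x10, x4, #1, #31": extract 31 bits of x4 starting at
 * bit 1 and zero-extend the result to 64 bits. */
static uint64_t ubfx_1_31_model(uint64_t x4)
{
    return (x4 >> 1) & 0x7fffffffu;
}

/* Model of "lsr w10, w4, #1": only the low 32 bits of the source are
 * read, and writing w10 clears bits 63:32 of x10. */
static uint64_t lsr_w_1_model(uint64_t x4)
{
    uint32_t w4 = (uint32_t)x4;
    return (uint64_t)(w4 >> 1);
}
```

Both return the same value for any x4, including when the upper 32 bits
of the incoming register hold garbage.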

I modified the code to handle {uyvy,yuyv}toyuv{420,422} using macros,
since these 4 functions share common routines. The code lost some
readability, though.
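
As a rough C analogue of the macro approach (a sketch only, not the
assembly from the patch): one macro body is parameterized by the byte
offsets of Y0, U, Y1 and V within each 4-byte group, and is expanded
once per packed format. The sketch_* names and the argument list are
assumptions for illustration, and the 420 variants are left out.

```c
#include <stdint.h>

/* Hypothetical generator: Y0, U, Y1, V are the byte offsets of each
 * component inside one 4-byte group (two pixels). */
#define DEFINE_TOYUV422(name, Y0, U, Y1, V)                            \
static void name(uint8_t *ydst, uint8_t *udst, uint8_t *vdst,          \
                 const uint8_t *src, int width, int height,            \
                 int lum_stride, int chrom_stride, int src_stride)     \
{                                                                      \
    int chrom_width = width >> 1; /* one U/V pair per two pixels */    \
    for (int row = 0; row < height; row++) {                           \
        for (int i = 0; i < chrom_width; i++) {                        \
            ydst[2 * i]     = src[4 * i + (Y0)];                       \
            ydst[2 * i + 1] = src[4 * i + (Y1)];                       \
            udst[i]         = src[4 * i + (U)];                        \
            vdst[i]         = src[4 * i + (V)];                        \
        }                                                              \
        ydst += lum_stride;                                            \
        udst += chrom_stride;                                          \
        vdst += chrom_stride;                                          \
        src  += src_stride;                                            \
    }                                                                  \
}

/* UYVY stores U Y0 V Y1, YUYV stores Y0 U Y1 V. */
DEFINE_TOYUV422(sketch_uyvytoyuv422, 1, 0, 3, 2)
DEFINE_TOYUV422(sketch_yuyvtoyuv422, 0, 1, 2, 3)
```

The trade-off is the same as in the assembly: less duplication, but the
reader has to expand the macro mentally to see what each function does.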

Krzysztof