On Mon, 22 Aug 2022, Hubert Mazur wrote:

> Provide optimized implementation of vsad16 function for arm64.
>
> Performance comparison tests are shown below.
> - vsad_0_c: 285.0
> - vsad_0_neon: 42.5
>
> Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur
> ---
>  libavcodec/aarch64/me_cmp_init_aarch64.c |  5 ++
>  libavcodec/aarch64/me_cmp_neon.S         | 75 ++++++++++++++++++++++++
>  2 files changed, 80 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index 4198985c6c..d4c0099854 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -584,3 +584,78 @@ function sse4_neon, export=1
>
>          ret
> endfunc
> +
> +function vsad16_neon, export=1
> +        // x0           unused
> +        // x1           uint8_t *pix1
> +        // x2           uint8_t *pix2
> +        // x3           ptrdiff_t stride
> +        // w4           int h
> +
> +        sub             w4, w4, #1                      // we need to make h-1 iterations
> +        movi            v16.8h, #0
> +
> +        cmp             w4, #3                          // check if we can make 3 iterations at once
> +        add             x5, x1, x3                      // pix1 + stride
> +        add             x6, x2, x3                      // pix2 + stride
> +        b.le            2f
> +
> +1:
> +        // abs(pix1[0] - pix2[0] - pix1[0 + stride] + pix2[0 + stride])
> +        // abs(x) = (x < 0 ? (-x) : (x))

This comment (the second line) about abs() doesn't really add anything
of value here, does it?

> +        ld1             {v0.16b}, [x1], x3              // Load pix1[0], first iteration
> +        ld1             {v1.16b}, [x2], x3              // Load pix2[0], first iteration
> +        ld1             {v2.16b}, [x5], x3              // Load pix1[0 + stride], first iteration
> +        usubl           v31.8h, v0.8b, v1.8b            // Signed difference pix1[0] - pix2[0], first iteration
> +        ld1             {v3.16b}, [x6], x3              // Load pix2[0 + stride], first iteration
> +        usubl2          v30.8h, v0.16b, v1.16b          // Signed difference pix1[0] - pix2[0], first iteration
> +        usubl           v29.8h, v2.8b, v3.8b            // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
> +        ld1             {v4.16b}, [x1], x3              // Load pix1[0], second iteration
> +        usubl2          v28.8h, v2.16b, v3.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], first iteration
> +        ld1             {v5.16b}, [x2], x3              // Load pix2[0], second iteration
> +        saba            v16.8h, v31.8h, v29.8h          // Signed absolute difference and accumulate the result, first iteration
> +        ld1             {v6.16b}, [x5], x3              // Load pix1[0 + stride], second iteration
> +        saba            v16.8h, v30.8h, v28.8h          // Signed absolute difference and accumulate the result, first iteration
> +        usubl           v27.8h, v4.8b, v5.8b            // Signed difference pix1[0] - pix2[0], second iteration
> +        ld1             {v7.16b}, [x6], x3              // Load pix2[0 + stride], second iteration
> +        usubl2          v26.8h, v4.16b, v5.16b          // Signed difference pix1[0] - pix2[0], second iteration
> +        usubl           v25.8h, v6.8b, v7.8b            // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
> +        ld1             {v17.16b}, [x1], x3             // Load pix1[0], third iteration
> +        usubl2          v24.8h, v6.16b, v7.16b          // Signed difference pix1[0 + stride] - pix2[0 + stride], second iteration
> +        ld1             {v18.16b}, [x2], x3             // Load pix2[0], third iteration
> +        saba            v16.8h, v27.8h, v25.8h          // Signed absolute difference and accumulate the result, second iteration
> +        ld1             {v19.16b}, [x5], x3             // Load pix1[0 + stride], third iteration
> +        saba            v16.8h, v26.8h, v24.8h          // Signed absolute difference and accumulate the result, second iteration
> +        usubl           v23.8h, v17.8b, v18.8b          // Signed difference pix1[0] - pix2[0], third iteration
> +        ld1             {v20.16b}, [x6], x3             // Load pix2[0 + stride], third iteration
> +        usubl2          v22.8h, v17.16b, v18.16b        // Signed difference pix1[0] - pix2[0], third iteration
> +        usubl           v21.8h, v19.8b, v20.8b          // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
> +        sub             w4, w4, #3                      // h -= 3
> +        saba            v16.8h, v23.8h, v21.8h          // Signed absolute difference and accumulate the result, third iteration
> +        usubl2          v31.8h, v19.16b, v20.16b        // Signed difference pix1[0 + stride] - pix2[0 + stride], third iteration
> +        cmp             w4, #3
> +        saba            v16.8h, v22.8h, v31.8h          // Signed absolute difference and accumulate the result, third iteration

This unrolled implementation isn't really interleaved much at all. In
such a case, there's not much benefit from unrolling other than saving
a couple of decrement/branch instructions.

I experimented with your code here. Originally, checkasm gave me these
benchmark numbers:

                   Cortex A53     A72     A73
vsad_0_neon:            162.7    74.5    67.7

When I switched to only running the non-unrolled version below, I got
these numbers:

vsad_0_neon:            165.2    78.5    76.0

I.e. hardly any difference at all. In such a case, we're just wasting
a lot of code size (and code maintainability!) on a big unrolled loop
without much benefit. But we can do better.

> +
> +        b.ge            1b
> +        cbz             w4, 3f
> +2:
> +
> +        ld1             {v0.16b}, [x1], x3
> +        ld1             {v1.16b}, [x2], x3
> +        ld1             {v2.16b}, [x5], x3
> +        usubl           v30.8h, v0.8b, v1.8b
> +        ld1             {v3.16b}, [x6], x3
> +        usubl2          v29.8h, v0.16b, v1.16b
> +        usubl           v28.8h, v2.8b, v3.8b
> +        usubl2          v27.8h, v2.16b, v3.16b
> +        saba            v16.8h, v30.8h, v28.8h
> +        subs            w4, w4, #1
> +        saba            v16.8h, v29.8h, v27.8h

Now let's look at this simple non-unrolled implementation first. We're
doing a lot of redundant work here (and in the unrolled implementation
too). When we've loaded v2/v3 in the first round of this loop and
calculated v28/v27 from them, and we then proceed to the next
iteration, v0/v1 will be identical to v2/v3, and v30/v29 will be
identical to v28/v27 from the previous iteration.

So if we load one row each from pix1/pix2 and calculate their
difference once at the start of the function, and then in each
iteration just load one new row from each, subtract them and
accumulate the absolute difference against the previous row's
difference, we're done.

Originally, with your non-unrolled implementation, I got these
benchmark numbers:

vsad_0_neon:            165.2    78.5    76.0

After this simplification, I got these numbers:

vsad_0_neon:            131.2    68.5    70.0

In this case, my code looked like this:

        ld1             {v0.16b}, [x1], x3
        ld1             {v1.16b}, [x2], x3
        usubl           v31.8h, v0.8b, v1.8b
        usubl2          v30.8h, v0.16b, v1.16b
2:
        ld1             {v0.16b}, [x1], x3
        ld1             {v1.16b}, [x2], x3
        subs            w4, w4, #1
        usubl           v29.8h, v0.8b, v1.8b
        usubl2          v28.8h, v0.16b, v1.16b
        saba            v16.8h, v31.8h, v29.8h
        mov             v31.16b, v29.16b
        saba            v16.8h, v30.8h, v28.8h
        mov             v30.16b, v28.16b
        b.ne            2b

Isn't that much simpler?

Now after that, I tried doing the same modification to the unrolled
version. FWIW, for an algorithm this simple, where there are more than
enough registers, I first wrote the unrolled implementation out
naively, like this:

        ld1    // first iter
        ld1    // first iter
        ld1    // second iter
        ld1    // second iter
        ld1    // third iter
        ld1    // third iter
        usubl  // first
        usubl2 // first
        usubl  // second
        usubl2 // second
        ...
        saba   // first
        saba   // first
        ...

After that, I reordered the instructions so that the first usubl's
come a couple of instructions after the corresponding loads, and then
moved the following saba's closer to the usubl's they depend on.
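I won't paste the whole attached patch inline, but roughly sketched
(the register allocation and exact scheduling here are illustrative;
the attachment is the authoritative version), the main unrolled loop
ends up along these lines, with v31/v30 carrying the previous row's
difference across iterations:

1:
        // Sketch only - see the attached patch for the real version.
        ld1             {v0.16b}, [x1], x3              // Load pix1, first iteration
        ld1             {v1.16b}, [x2], x3              // Load pix2, first iteration
        ld1             {v2.16b}, [x1], x3              // Load pix1, second iteration
        ld1             {v3.16b}, [x2], x3              // Load pix2, second iteration
        usubl           v29.8h, v0.8b, v1.8b            // Row difference, first iteration
        ld1             {v4.16b}, [x1], x3              // Load pix1, third iteration
        usubl2          v28.8h, v0.16b, v1.16b
        ld1             {v5.16b}, [x2], x3              // Load pix2, third iteration
        usubl           v27.8h, v2.8b, v3.8b            // Row difference, second iteration
        saba            v16.8h, v31.8h, v29.8h          // Accumulate abs(previous diff - current diff)
        usubl2          v26.8h, v2.16b, v3.16b
        saba            v16.8h, v30.8h, v28.8h
        usubl           v25.8h, v4.8b, v5.8b            // Row difference, third iteration
        sub             w4, w4, #3                      // h -= 3
        usubl2          v24.8h, v4.16b, v5.16b
        saba            v16.8h, v29.8h, v27.8h          // Second iteration accumulates against the first
        cmp             w4, #3
        saba            v16.8h, v28.8h, v26.8h
        mov             v31.16b, v25.16b                // Third row's diff becomes the next round's "previous"
        saba            v16.8h, v27.8h, v25.8h
        mov             v30.16b, v24.16b
        saba            v16.8h, v26.8h, v24.8h
        b.ge            1b

Note that with the row reuse, the x5/x6 pointers and the duplicated
row loads disappear entirely; the leftover rows (when h - 1 isn't a
multiple of 3) go through the simple loop shown above, and the final
horizontal reduction of v16 (not quoted here) stays as it was.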
With that, I'm getting these checkasm numbers:

                   Cortex A53     A72     A73
vsad_0_neon:            108.7    61.2    58.0

As context, this patch originally gave these numbers:

vsad_0_neon:            162.7    74.5    67.7

I.e. a 1.17x speedup on the A73, a 1.21x speedup on the A72 and a 1.5x
speedup on the A53.

You can see my version in the attached patch; apply it on top of
yours.

// Martin