From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTP id 20CED42ECF for ; Mon, 11 Jul 2022 21:21:31 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id DFD9F68B74C; Tue, 12 Jul 2022 00:21:28 +0300 (EEST) Received: from mail8.parnet.fi (mail8.parnet.fi [77.234.108.134]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 2C5F368B411 for ; Tue, 12 Jul 2022 00:21:22 +0300 (EEST) Received: from mail9.parnet.fi (mail9.parnet.fi [77.234.108.21]) by mail8.parnet.fi with ESMTP id 26BLLIPc005711-26BLLIPd005711; Tue, 12 Jul 2022 00:21:18 +0300 Received: from foo.martin.st (host-97-187.parnet.fi [77.234.97.187]) by mail9.parnet.fi (Postfix) with ESMTPS id 55D62A150B; Tue, 12 Jul 2022 00:21:18 +0300 (EEST) Date: Tue, 12 Jul 2022 00:21:18 +0300 (EEST) From: =?ISO-8859-15?Q?Martin_Storsj=F6?= To: Hubert Mazur In-Reply-To: <20220629082440.119841-3-hum@semihalf.com> Message-ID: References: <20220629082440.119841-1-hum@semihalf.com> <20220629082440.119841-3-hum@semihalf.com> MIME-Version: 1.0 X-FE-Policy-ID: 3:14:2:SYSTEM Subject: Re: [FFmpeg-devel] [PATCH 2/2] lavc/aarch64: Add pix_abs16_x2 neon implementation X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: mw@semihalf.com, gjb@semihalf.com, jswinney@amazon.com, spop@amazon.com, ffmpeg-devel@ffmpeg.org Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: On Wed, 29 Jun 2022, Hubert Mazur wrote: > Provide neon implementation for pix_abs16_x2 function. > > Performance tests of implementation are below. > - pix_abs_0_1_c: 291.9 > - pix_abs_0_1_neon: 73.7 > > Benchmarks and tests run with checkasm tool on AWS Graviton 3. > > Signed-off-by: Hubert Mazur > --- > libavcodec/aarch64/me_cmp_init_aarch64.c | 3 + > libavcodec/aarch64/me_cmp_neon.S | 134 +++++++++++++++++++++++ > 2 files changed, 137 insertions(+) > > diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S > index a7937bd8be..c2fd94f4b3 100644 > --- a/libavcodec/aarch64/me_cmp_neon.S > +++ b/libavcodec/aarch64/me_cmp_neon.S > @@ -203,3 +203,137 @@ function ff_pix_abs16_xy2_neon, export=1 > fmov w0, s0 // copy result to general purpose register > ret > endfunc > + > +function ff_pix_abs16_x2_neon, export=1 > + // x0 unused > + // x1 uint8_t *pix1 > + // x2 uint8_t *pix2 > + // x3 ptrdiff_t stride > + // x4 int h As this is 'int', it would be w4, not x4 > + > + // preserve value of v8-v12 registers > + stp d10, d11, [sp, #-0x10]! > + stp d8, d9, [sp, #-0x10]! > + Yes, if possible, avoid using v8-v15. Also if you still do need to back up registers, don't update sp in each write; something like this is preferred: stp d8, d9, [sp, #-0x20]! stp d10, d11, [sp, #0x10] > + // initialize buffers > + movi d18, #0 > + movi v20.8h, #1 > + add x5, x2, #1 // pix2 + 1 > + cmp w4, #4 > + b.lt 2f Do the cmp earlier, e.g. before the first movi, to avoid having b.lt needing to wait for the result of the cmp. > + > +// make 4 iterations at once > +1: > + // v0 - pix1 > + // v1 - pix2 > + // v2 - pix2 + 1 > + ld1 {v0.16b}, [x1], x3 > + ld1 {v1.16b}, [x2], x3 > + ld1 {v2.16b}, [x5], x3 > + > + ld1 {v3.16b}, [x1], x3 > + ld1 {v4.16b}, [x2], x3 > + ld1 {v5.16b}, [x5], x3 > + > + ld1 {v6.16b}, [x1], x3 > + ld1 {v7.16b}, [x2], x3 > + ld1 {v8.16b}, [x5], x3 > + > + ld1 {v9.16b}, [x1], x3 > + ld1 {v10.16b}, [x2], x3 > + ld1 {v11.16b}, [x5], x3 I guess this goes for the existing ff_pix_abs16_xy2_neon too, but I think it could be more efficient to start doing e.g. the first few steps of the first iteration after loading the data for the second iteration. > + > + // abs(pix1[0] - avg2(pix2[0], pix2[1])) > + // avg2(a,b) = (((a) + (b) + 1) >> 1) > + // abs(x) = (x < 0 ? -x : x) > + > + // pix2[0] + pix2[1] > + uaddl v30.8h, v1.8b, v2.8b > + uaddl2 v29.8h, v1.16b, v2.16b > + // add one to each element > + add v30.8h, v30.8h, v20.8h > + add v29.8h, v29.8h, v20.8h > + // divide by 2, narrow width and store in v30 > + uqshrn v30.8b, v30.8h, #1 > + uqshrn2 v30.16b, v29.8h, #1 Instead of add+uqshrn, you can do uqrshrn, where the 'r' stands for rounding, which implicitly adds the 1 before right shifting. But for this particular case, there's an even simpler alternative; you can do rhadd, which does rounding halving add, which avoids the whole widening/narrowing here. Thus these 6 instructions could just be "rhadd v30.16b, v1.16b, v2.16b". > + > + // abs(pix1[0] - avg2(pix2[0], pix2[1])) > + uabd v16.16b, v0.16b, v30.16b > + uaddlv h16, v16.16b In general, avoid doing the horizontal adds (uaddlv here) too early. Here, I think it would be better to just accumulate things in a regular vector (e.g. "uaddw v18.8h, v18.8h, v16.8b", "uaddw2 v19.8h, v19.8h, v16.16b"), then finally add v18.8h and v19.8h into each other in the end, and just do one single addv. Also then you can fuse the accumulation into the absolute operation, so you should be able to make do with just uabal + uabal2. The finally when you have the calculation for each iteration simplified as suggested, it becomes a tight sequence of 3 instructions where each of them relies on the result of the previous one. Then it's better to interleave the instructions from the 4 parallel iterations, e.g. like this: load (1st iteration) load (2nd iteration) rhadd (1st iteration) load (3rd iteration) rhadd (2nd iteration) uabal (1st iteration) uabal2 (1st iteration) load (4th iteration) rhadd (3rd iteration) uabal (2nd iteration) uabal2 (2nd iteration) rhadd (4th iteration) uabal (3rd iteration) uabal2 (3rd iteration) uabal (4th iteration) uabal2 (4th iteration) That way, you have a quite ideal distance between all instructions and the preceding/following instructions that depend on its output. // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".