From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTP id A788443914 for ; Wed, 3 Aug 2022 13:22:28 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id C741968B7CA; Wed, 3 Aug 2022 16:22:26 +0300 (EEST) Received: from mail8.parnet.fi (mail8.parnet.fi [77.234.108.134]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id C35AF68B701 for ; Wed, 3 Aug 2022 16:22:19 +0300 (EEST) Received: from mail9.parnet.fi (mail9.parnet.fi [77.234.108.21]) by mail8.parnet.fi with ESMTP id 273DMDCt018530-273DMDCu018530; Wed, 3 Aug 2022 16:22:13 +0300 Received: from foo.martin.st (host-97-187.parnet.fi [77.234.97.187]) by mail9.parnet.fi (Postfix) with ESMTPS id 8E55CA1468; Wed, 3 Aug 2022 16:22:13 +0300 (EEST) Date: Wed, 3 Aug 2022 16:22:12 +0300 (EEST) From: =?ISO-8859-15?Q?Martin_Storsj=F6?= To: Hubert Mazur In-Reply-To: <20220725111208.43542-1-hum@semihalf.com> Message-ID: References: <20220715080228.686736-2-hum@semihalf.com> <20220725111208.43542-1-hum@semihalf.com> MIME-Version: 1.0 X-FE-Policy-ID: 3:14:2:SYSTEM Subject: Re: [FFmpeg-devel] [PATCH] lavc/aarch64: Add neon implementation for sse16 X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, ffmpeg-devel@ffmpeg.org, mw@semihalf.com, spop@amazon.com Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: On Mon, 25 Jul 2022, Hubert Mazur wrote: > Provide neon implementation for sse16 function. > > Performance comparison tests are shown below. > - sse_0_c: 273.0 > - sse_0_neon: 48.2 > > Benchmarks and tests run with checkasm tool on AWS Graviton 3. > > Signed-off-by: Hubert Mazur > --- > libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++ > libavcodec/aarch64/me_cmp_neon.S | 82 ++++++++++++++++++++++++ > 2 files changed, 86 insertions(+) > > diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c > index 136b008eb7..3ff5767bd0 100644 > --- a/libavcodec/aarch64/me_cmp_init_aarch64.c > +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c > @@ -30,6 +30,9 @@ int ff_pix_abs16_xy2_neon(MpegEncContext *s, uint8_t *blk1, uint8_t *blk2, > int ff_pix_abs16_x2_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2, > ptrdiff_t stride, int h); > > +int sse16_neon(MpegEncContext *v, uint8_t *pix1, uint8_t *pix2, > + ptrdiff_t stride, int h); The signature of these functions has been changed now (right after these patches were submitted); the pix1/pix2 parameters are now const. Also, nitpick; please align the following line ("ptrdiff_t stride, ...") correctly with the parenthese on the line above. > + > av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) > { > int cpu_flags = av_get_cpu_flags(); > @@ -40,5 +43,6 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) > c->pix_abs[0][3] = ff_pix_abs16_xy2_neon; > > c->sad[0] = ff_pix_abs16_neon; > + c->sse[0] = sse16_neon; > } > } > diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S > index cda7ce0408..98c912b608 100644 > --- a/libavcodec/aarch64/me_cmp_neon.S > +++ b/libavcodec/aarch64/me_cmp_neon.S > @@ -270,3 +270,85 @@ function ff_pix_abs16_x2_neon, export=1 > > ret > endfunc > + > +function sse16_neon, export=1 > + // x0 - unused > + // x1 - pix1 > + // x2 - pix2 > + // x3 - stride > + // w4 - h > + > + cmp w4, #4 > + movi d18, #0 > + b.lt 2f > + > +// Make 4 iterations at once > +1: > + > + // res = abs(pix1[0] - pix2[0]) > + // res * res > + > + ld1 {v0.16b}, [x1], x3 // Load pix1 vector for first iteration > + ld1 {v1.16b}, [x2], x3 // Load pix2 vector for first iteration > + uabd v30.16b, v0.16b, v1.16b // Absolute difference, first iteration Try to improve the interleaving of this function; I did a quick test on Cortex A53, A72 and A73, and got these numbers: Before: sse_0_neon: 147.7 64.5 64.7 After: sse_0_neon: 133.7 60.7 59.2 Overall, try to avoid having consecutive instructions operating on the same iteration (except for when doing the same operation on different halves of the same iteration), i.e. not "absolute difference third iteration; multiply lower half third iteration, multiply upper half third iteration, pairwise add third iteration", but bundle it up so you have e.g. "absolute difference third iteration; pairwise add first iteration; multiply {upper,lower} half third iteration; pairwise add second iteration; pairwise add third iteration", or something like that. Then secondly, in general, don't serialize the summation down to a single element in each iteration! You can keep the accumulated sum as a vX.4s vector (or maybe even better, two .4s vectors!) throughout the whole algorithm, and then only add them up horizontally (with an uaddv) at the end. For adding vectors, I would instinctively prefer doing "uaddl v0.4s, v2.4h, v3.4h; uaddl2 v1.4s, v2.8h, v3.8h" instead of "uaddlp v0.4s, v1.4h; uadalp v0.4s, v1.8h" etc. I didn't try out this modification, but please do, I'm pretty sure it will be a fair bit faster, and if not, at least more idiomatic SIMD. I didn't check the other patches yet, but if the other sse* functions are implemented similarly, I would expect the same feedback to apply to them too. Let's iterate on the sse16 patch first now at least and get that one great, and then update sse4/sse8 similarly once we have that one settled. I'll try to have a look at the other patches in the set later today/tomorrow. // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".