From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTP id 320304122B for ; Fri, 16 Sep 2022 21:09:15 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 0A29168BBFE; Sat, 17 Sep 2022 00:09:13 +0300 (EEST) Received: from mail8.parnet.fi (mail8.parnet.fi [77.234.108.134]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 30D4468BB6F for ; Sat, 17 Sep 2022 00:09:06 +0300 (EEST) Received: from mail9.parnet.fi (mail9.parnet.fi [77.234.108.21]) by mail8.parnet.fi with ESMTP id 28GL8wYw028462-28GL8wYx028462; Sat, 17 Sep 2022 00:08:58 +0300 Received: from foo.martin.st (host-97-187.parnet.fi [77.234.97.187]) by mail9.parnet.fi (Postfix) with ESMTPS id 08EECA1467; Sat, 17 Sep 2022 00:08:57 +0300 (EEST) Date: Sat, 17 Sep 2022 00:08:55 +0300 (EEST) From: =?ISO-8859-15?Q?Martin_Storsj=F6?= To: Hubert Mazur In-Reply-To: <20220913115824.60792-2-hum@semihalf.com> Message-ID: <6f4672f1-54e8-f6be-bd66-22111a7011a6@martin.st> References: <20220913115824.60792-1-hum@semihalf.com> <20220913115824.60792-2-hum@semihalf.com> MIME-Version: 1.0 X-FE-Policy-ID: 3:14:2:SYSTEM Subject: Re: [FFmpeg-devel] [PATCH 1/3] lavc/aarch64: Add neon implementation for pix_median_abs16 X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com, ffmpeg-devel@ffmpeg.org, mw@semihalf.com, spop@amazon.com Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: On Tue, 13 Sep 2022, Hubert Mazur wrote: > Provide optimized implementation for pix_median_abs16 function. > > Performance comparison tests are shown below. > - median_sad_0_c: 722.0 > - median_sad_0_neon: 144.7 > > Benchmarks and tests run with checkasm tool on AWS Graviton 3. > > Signed-off-by: Hubert Mazur > --- > libavcodec/aarch64/me_cmp_init_aarch64.c | 4 ++ > libavcodec/aarch64/me_cmp_neon.S | 81 ++++++++++++++++++++++++ > libavcodec/me_cmp.c | 5 +- > 3 files changed, 88 insertions(+), 2 deletions(-) > > diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c > index ade3e9a4c1..fb51a833be 100644 > --- a/libavcodec/aarch64/me_cmp_init_aarch64.c > +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c > @@ -53,6 +53,8 @@ int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2, > ptrdiff_t stride, int h); > int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2, > ptrdiff_t stride, int h); > +int pix_median_abs16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2, > + ptrdiff_t stride, int h); > > av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) > { > @@ -78,6 +80,8 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) > c->vsse[4] = vsse_intra16_neon; > > c->nsse[0] = nsse16_neon_wrapper; > + > + c->median_sad[0] = pix_median_abs16_neon; > } > } > > diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S > index f8998749a5..a4a4344f42 100644 > --- a/libavcodec/aarch64/me_cmp_neon.S > +++ b/libavcodec/aarch64/me_cmp_neon.S > @@ -969,3 +969,84 @@ function nsse16_neon, export=1 > > ret > endfunc > + > +function pix_median_abs16_neon, export=1 > + // x0 unused > + // x1 uint8_t *pix1 > + // x2 uint8_t *pix2 > + // x3 ptrdiff_t stride > + // w4 int h > + > + ld1 {v2.16b}, [x1], x3 > + ld1 {v3.16b}, [x2], x3 > + movi v31.8h, #0 > + movi v16.8h, #0 > + ext v0.16b, v2.16b, v2.16b, #1 > + ext v1.16b, v3.16b, v3.16b, #1 > + usubl v28.8h, v2.8b, v3.8b > + usubl2 v27.8h, v2.16b, v3.16b > + usubl v26.8h, v0.8b, v1.8b > + usubl2 v25.8h, v0.16b, v1.16b > + sub w4, w4, #1 // we need to make h-1 iterations > + saba v31.8h, v26.8h, v28.8h > + saba v16.8h, v25.8h, v27.8h > + mov h18, v28.h[0] > + cmp w4, #1 > + sqabs h18, h18 > + > + b.lt 2f > +1: > + > + ld1 {v6.16b}, [x1], x3 // pix1 vector for V(j-1) > + ld1 {v7.16b}, [x2], x3 // pix2 vector for V(j-1) > + subs w4, w4, #1 > + mov v2.16b, v6.16b > + mov v3.16b, v7.16b These two mov instructions seem unnecessary? > + ext v4.16b, v6.16b, v6.16b, #1 // pix1 vector for V(j) > + ext v5.16b, v7.16b, v7.16b, #1 // pix2 vector for V(j) > + > + // protected registers: v30, v29, v28, v27, v26, v25, v24, v23 > + // scratch registers: v22, v21, v20, v19, v17 > + > + // To find median of three values, calculate sum of them > + // and subtract max and min value from it. > + usubl v30.8h, v6.8b, v7.8b // V(j-1) > + usubl2 v29.8h, v6.16b, v7.16b // V(j-1) > + usubl v24.8h, v4.8b, v5.8b // V(j) > + usubl2 v23.8h, v4.16b, v5.16b // V(j) > + mov v0.16b, v4.16b > + mov v1.16b, v5.16b These two movs are unused, too, right? > + sabd v20.8h, v30.8h, v28.8h > + mov h17, v20.h[0] > + add d18, d18, d17 These are quite suboptimally scheduled here. However, we shouldn't need them. In general, try to avoid these single-element calculations if not strictly necessary. You can just keep using both the input (here, v20) and the accumulator (v18) as a .4h vector, where you only care about the first element. Then at the very end you can extract the individual first element from it, instead of doing it every round in the loop. Then you can potentially change sabd into saba too, unless the non-accumulated result is needed too. > + add v22.8h, v26.8h, v30.8h > + smin v20.8h, v26.8h, v30.8h > + add v21.8h, v25.8h, v29.8h > + smax v19.8h, v26.8h, v30.8h > + sub v22.8h, v22.8h, v28.8h > + sub v21.8h, v21.8h, v27.8h > + smin v17.8h, v19.8h, v22.8h > + smin v22.8h, v25.8h, v29.8h > + mov v28.16b, v30.16b > + smax v20.8h, v20.8h, v17.8h // median values lower half > + smax v19.8h, v25.8h, v29.8h > + saba v31.8h, v24.8h, v20.8h > + mov v27.16b, v29.16b > + smin v19.8h, v19.8h, v21.8h > + mov v26.16b, v24.16b > + smax v17.8h, v22.8h, v19.8h // median values upper half > + mov v25.16b, v23.16b > + saba v16.8h, v23.8h, v17.8h > + > + b.ne 1b > + > +2: > + ins v16.h[7], wzr > + add v31.8h, v31.8h, v16.8h > + uaddlv s17, v31.8h > + add d18, d18, d17 > + fmov w0, s18 > + > + ret > + > +endfunc > diff --git a/libavcodec/me_cmp.c b/libavcodec/me_cmp.c > index 4242fbc6e4..230e7ea54a 100644 > --- a/libavcodec/me_cmp.c > +++ b/libavcodec/me_cmp.c > @@ -1048,6 +1048,9 @@ av_cold void ff_me_cmp_init(MECmpContext *c, AVCodecContext *avctx) > ff_dsputil_init_dwt(c); > #endif > > +c->median_sad[0] = pix_median_abs16_c; > +c->median_sad[1] = pix_median_abs8_c; > + These are incorrectly indented. Other than that, this seems reasonable I think. // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".