From: "Martin Storsjö" <martin@martin.st>
To: Hubert Mazur <hum@semihalf.com>
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com,
    ffmpeg-devel@ffmpeg.org, mw@semihalf.com, spop@amazon.com
Subject: Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
Date: Wed, 7 Sep 2022 12:06:49 +0300 (EEST)
Message-ID: <6b139c49-11b4-29f4-26e8-f66d3fc2d12d@martin.st>
In-Reply-To: <20220906102722.53266-6-hum@semihalf.com>

[-- Attachment #1: Type: text/plain, Size: 4851 bytes --]

On Tue, 6 Sep 2022, Hubert Mazur wrote:

> Add vectorized implementation of nsse16 function.
>
> Performance comparison tests are shown below.
> - nsse_0_c: 707.0
> - nsse_0_neon: 120.0
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur <hum@semihalf.com>
> ---
>  libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
>  libavcodec/aarch64/me_cmp_neon.S         | 124 +++++++++++++++++++++++
>  2 files changed, 139 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 8c295d5457..ea7f295373 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
>                        ptrdiff_t stride, int h);
>  int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
>                        ptrdiff_t stride, int h);
> +int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
> +                ptrdiff_t stride, int h);
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> +                        ptrdiff_t stride, int h);
>
>  av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
>  {
> @@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
>
>          c->vsse[0] = vsse16_neon;
>          c->vsse[4] = vsse_intra16_neon;
> +
> +        c->nsse[0] = nsse16_neon_wrapper;
>      }
>  }
> +
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> +                        ptrdiff_t stride, int h)
> +{
> +        if (c)
> +            return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
> +        else
> +            return nsse16_neon(8, s1, s2, stride, h);
> +}

The body of this function is still indented 4 spaces too much.

> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index cf2b8da425..bd21122a21 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -847,3 +847,127 @@ function vsse_intra16_neon, export=1
>
>          ret
>  endfunc
> +
> +function nsse16_neon, export=1
> +        // x0           multiplier
> +        // x1           uint8_t *pix1
> +        // x2           uint8_t *pix2
> +        // x3           ptrdiff_t stride
> +        // w4           int h
> +
> +        str             x0, [sp, #-0x40]!
> +        stp             x1, x2, [sp, #0x10]
> +        stp             x3, x4, [sp, #0x20]
> +        str             x30, [sp, #0x30]
> +        bl              X(sse16_neon)
> +        ldr             x30, [sp, #0x30]
> +        mov             w9, w0                          // here we store score1
> +        ldr             x5, [sp]
> +        ldp             x1, x2, [sp, #0x10]
> +        ldp             x3, x4, [sp, #0x20]
> +        add             sp, sp, #0x40
> +
> +        movi            v16.8h, #0
> +        movi            v17.8h, #0
> +        movi            v18.8h, #0
> +        movi            v19.8h, #0
> +
> +        ld1             {v0.16b}, [x1], x3
> +        subs            w4, w4, #1                      // we need to make h-1 iterations
> +        ext             v1.16b, v0.16b, v0.16b, #1      // x1 + 1
> +        ld1             {v2.16b}, [x2], x3
> +        cmp             w4, #2
> +        ext             v3.16b, v2.16b, v2.16b, #1      // x2 + 1

The scheduling here (and in both loops below) is a bit non-ideal; we use
v0 very early after loading it, while we could start the load of v2 before
that.

> +
> +        b.lt            2f
> +
> +// make 2 iterations at once
> +1:
> +        ld1             {v4.16b}, [x1], x3
> +        ld1             {v20.16b}, [x1], x3

The load of v20 here would stall a bit, since the first load updates x1.
Instead we can launch the load of v6 in between these two.

> +        ext             v5.16b, v4.16b, v4.16b, #1      // x1 + stride + 1
> +        ext             v21.16b, v20.16b, v20.16b, #1

We can move the use of v20 a bit later here, in between the two other
exts.
These changes (plus a similar minor one to the non-unrolled version at the
bottom) produce the following benchmark change for me:

Before:       Cortex A53     A72     A73
nsse_0_neon:       401.0   198.0   194.5
After:
nsse_0_neon:       377.0   198.7   196.5

(The differences on A72 and A73 are within the measurement noise; those
numbers vary more than that from one run to another.)

You can squash in the attached patch for simplicity. Also, remember to fix
the indentation of the wrapper function.

With those changes, plus updated benchmark numbers in the commit messages,
I think this patchset should be good to go.

// Martin

[-- Attachment #2: Type: text/x-diff; name=0001-squash-nsse16-Tune-scheduling.patch, Size: 2496 bytes --]

From 081aff967d4fdc3d475c777033223625db3bb532 Mon Sep 17 00:00:00 2001
From: Martin Storsjö <martin@martin.st>
Date: Wed, 7 Sep 2022 11:50:29 +0300
Subject: [PATCH] squash: nsse16: Tune scheduling

Before:       Cortex A53     A72     A73
nsse_0_neon:       401.0   198.0   194.5
After:
nsse_0_neon:       377.0   198.7   196.5

(The differences on A72 and A73 are within the measurement noise; those
numbers vary more than that from one run to another.)
---
 libavcodec/aarch64/me_cmp_neon.S | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
index bd21122a21..2a2af7a788 100644
--- a/libavcodec/aarch64/me_cmp_neon.S
+++ b/libavcodec/aarch64/me_cmp_neon.S
@@ -874,8 +874,8 @@ function nsse16_neon, export=1
         ld1             {v0.16b}, [x1], x3
         subs            w4, w4, #1                      // we need to make h-1 iterations
-        ext             v1.16b, v0.16b, v0.16b, #1      // x1 + 1
         ld1             {v2.16b}, [x2], x3
+        ext             v1.16b, v0.16b, v0.16b, #1      // x1 + 1
         cmp             w4, #2
         ext             v3.16b, v2.16b, v2.16b, #1      // x2 + 1
@@ -884,12 +884,12 @@ function nsse16_neon, export=1
 // make 2 iterations at once
 1:
         ld1             {v4.16b}, [x1], x3
+        ld1             {v6.16b}, [x2], x3
         ld1             {v20.16b}, [x1], x3
         ext             v5.16b, v4.16b, v4.16b, #1      // x1 + stride + 1
-        ext             v21.16b, v20.16b, v20.16b, #1
-        ld1             {v6.16b}, [x2], x3
         ld1             {v22.16b}, [x2], x3
         ext             v7.16b, v6.16b, v6.16b, #1      // x2 + stride + 1
+        ext             v21.16b, v20.16b, v20.16b, #1
         ext             v23.16b, v22.16b, v22.16b, #1
         usubl           v31.8h, v0.8b, v4.8b
@@ -933,8 +933,8 @@ function nsse16_neon, export=1
 2:
         ld1             {v4.16b}, [x1], x3
         subs            w4, w4, #1
-        ext             v5.16b, v4.16b, v4.16b, #1      // x1 + stride + 1
         ld1             {v6.16b}, [x2], x3
+        ext             v5.16b, v4.16b, v4.16b, #1      // x1 + stride + 1
         usubl           v31.8h, v0.8b, v4.8b
         ext             v7.16b, v6.16b, v6.16b, #1      // x2 + stride + 1
-- 
2.25.1
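
For readers following the register comments in the quoted assembly
(score1, x1 + 1, x1 + stride + 1), here is a scalar C sketch of what the
routine appears to compute. It assumes the usual NSSE definition: a sum of
squared errors plus a weighted absolute difference in local 2x2 gradients.
The names nsse16_ref and nsse16_ref_wrapper and the two stub structs are
hypothetical and exist only for illustration; only the argument order, the
nsse_weight lookup and the fallback weight of 8 are taken from the patch.
The wrapper body uses the 4-space indentation asked for in the review.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the FFmpeg context types; only the field
 * that the wrapper reads is modelled here. */
typedef struct AVCodecContext { int nsse_weight; } AVCodecContext;
typedef struct MpegEncContext { AVCodecContext *avctx; } MpegEncContext;

/* Scalar sketch with the same argument order as nsse16_neon in the patch.
 * Assumption: this mirrors the scalar reference semantics. */
static int nsse16_ref(int multiplier, const uint8_t *s1, const uint8_t *s2,
                      ptrdiff_t stride, int h)
{
    int score1 = 0; /* sum of squared differences (the sse16 part) */
    int score2 = 0; /* difference between the 2x2 gradients of s1 and s2 */

    assert(s1 && s2);

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < 16; x++) {
            int d = s1[x] - s2[x];
            score1 += d * d;
        }
        /* The gradient term reads the next row, hence h - 1 iterations. */
        if (y + 1 < h) {
            for (int x = 0; x < 15; x++) {
                score2 += abs(s1[x] - s1[x + stride] -
                              s1[x + 1] + s1[x + stride + 1]) -
                          abs(s2[x] - s2[x + stride] -
                              s2[x + 1] + s2[x + stride + 1]);
            }
        }
        s1 += stride;
        s2 += stride;
    }

    return score1 + abs(score2) * multiplier;
}

/* Wrapper shaped like nsse16_neon_wrapper, with a 4-space indented body. */
static int nsse16_ref_wrapper(MpegEncContext *c, const uint8_t *s1,
                              const uint8_t *s2, ptrdiff_t stride, int h)
{
    if (c)
        return nsse16_ref(c->avctx->nsse_weight, s1, s2, stride, h);
    else
        return nsse16_ref(8, s1, s2, stride, h);
}
```

A sketch like this could serve as a mental model when checking which loads
feed which ext instructions; it is not meant as a replacement for the
checkasm comparison against the real C implementation.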