From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
    by master.gitmailbox.com (Postfix) with ESMTP id 4FAFC40D92
    for ; Wed, 7 Sep 2022 09:07:03 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
    by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id B82DF68BB6A;
    Wed, 7 Sep 2022 12:07:00 +0300 (EEST)
Received: from mail8.parnet.fi (mail8.parnet.fi [77.234.108.134])
    by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E976C68BA38
    for ; Wed, 7 Sep 2022 12:06:53 +0300 (EEST)
Received: from mail9.parnet.fi (mail9.parnet.fi [77.234.108.21])
    by mail8.parnet.fi with ESMTP id 28796nuC023858-28796nuD023858;
    Wed, 7 Sep 2022 12:06:49 +0300
Received: from foo.martin.st (host-97-187.parnet.fi [77.234.97.187])
    by mail9.parnet.fi (Postfix) with ESMTPS id 7D2E8A1407;
    Wed, 7 Sep 2022 12:06:49 +0300 (EEST)
Date: Wed, 7 Sep 2022 12:06:49 +0300 (EEST)
From: =?ISO-8859-15?Q?Martin_Storsj=F6?=
To: Hubert Mazur
In-Reply-To: <20220906102722.53266-6-hum@semihalf.com>
Message-ID: <6b139c49-11b4-29f4-26e8-f66d3fc2d12d@martin.st>
References: <20220906102722.53266-1-hum@semihalf.com>
    <20220906102722.53266-6-hum@semihalf.com>
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="8323329-323898564-1662541609=:1659"
X-FE-Attachment-Name: 0001-squash-nsse16-Tune-scheduling.patch
X-FE-Policy-ID: 3:14:2:SYSTEM
Subject: Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: Provide neon implementation of nsse16
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: FFmpeg development discussions and patches
Cc: gjb@semihalf.com, upstream@semihalf.com, jswinney@amazon.com,
    ffmpeg-devel@ffmpeg.org, mw@semihalf.com, spop@amazon.com
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"
Archived-At:
List-Archive:
List-Post:

This message is in MIME format.
The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools.

--8323329-323898564-1662541609=:1659
Content-Type: text/plain; format=flowed; charset=US-ASCII

On Tue, 6 Sep 2022, Hubert Mazur wrote:

> Add vectorized implementation of nsse16 function.
>
> Performance comparison tests are shown below.
> - nsse_0_c: 707.0
> - nsse_0_neon: 120.0
>
> Benchmarks and tests run with checkasm tool on AWS Graviton 3.
>
> Signed-off-by: Hubert Mazur
> ---
>  libavcodec/aarch64/me_cmp_init_aarch64.c |  15 +++
>  libavcodec/aarch64/me_cmp_neon.S         | 124 +++++++++++++++++++++++
>  2 files changed, 139 insertions(+)
>
> diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c
> index 8c295d5457..ea7f295373 100644
> --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> @@ -49,6 +49,10 @@ int vsse16_neon(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
>                  ptrdiff_t stride, int h);
>  int vsse_intra16_neon(MpegEncContext *c, const uint8_t *s, const uint8_t *dummy,
>                        ptrdiff_t stride, int h);
> +int nsse16_neon(int multiplier, const uint8_t *s, const uint8_t *s2,
> +                ptrdiff_t stride, int h);
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> +                        ptrdiff_t stride, int h);
>
>  av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
>  {
> @@ -72,5 +76,16 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx)
>
>          c->vsse[0] = vsse16_neon;
>          c->vsse[4] = vsse_intra16_neon;
> +
> +        c->nsse[0] = nsse16_neon_wrapper;
>      }
>  }
> +
> +int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
> +                        ptrdiff_t stride, int h)
> +{
> +        if (c)
> +            return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
> +        else
> +            return nsse16_neon(8, s1, s2, stride, h);
> +}

The body of this function is still indented 4 spaces too much.

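To illustrate the indentation I mean, here is a minimal, self-contained sketch of the wrapper with its body at 4 spaces. The two structs and the nsse16_neon stub are hypothetical stand-ins (not the real FFmpeg definitions or the assembly routine), there only so the snippet compiles on its own; only the layout of nsse16_neon_wrapper is the point.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the FFmpeg types, reduced to the one
 * field the wrapper reads. */
typedef struct AVCodecContext { int nsse_weight; } AVCodecContext;
typedef struct MpegEncContext { AVCodecContext *avctx; } MpegEncContext;

/* Stub in place of the assembly function so the sketch links by itself;
 * it simply returns the multiplier it was handed. */
int nsse16_neon(int multiplier, const uint8_t *s1, const uint8_t *s2,
                ptrdiff_t stride, int h)
{
    (void)s1; (void)s2; (void)stride; (void)h;
    return multiplier;
}

/* Function body indented by 4 spaces, nested statements by 8. */
int nsse16_neon_wrapper(MpegEncContext *c, const uint8_t *s1, const uint8_t *s2,
                        ptrdiff_t stride, int h)
{
    if (c)
        return nsse16_neon(c->avctx->nsse_weight, s1, s2, stride, h);
    else
        return nsse16_neon(8, s1, s2, stride, h);
}
```

The logic is unchanged from the patch: the weight comes from the context when one is passed, and falls back to 8 otherwise.
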
> diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
> index cf2b8da425..bd21122a21 100644
> --- a/libavcodec/aarch64/me_cmp_neon.S
> +++ b/libavcodec/aarch64/me_cmp_neon.S
> @@ -847,3 +847,127 @@ function vsse_intra16_neon, export=1
>
>          ret
>  endfunc
> +
> +function nsse16_neon, export=1
> +        // x0 multiplier
> +        // x1 uint8_t *pix1
> +        // x2 uint8_t *pix2
> +        // x3 ptrdiff_t stride
> +        // w4 int h
> +
> +        str             x0, [sp, #-0x40]!
> +        stp             x1, x2, [sp, #0x10]
> +        stp             x3, x4, [sp, #0x20]
> +        str             x30, [sp, #0x30]
> +        bl              X(sse16_neon)
> +        ldr             x30, [sp, #0x30]
> +        mov             w9, w0                                  // here we store score1
> +        ldr             x5, [sp]
> +        ldp             x1, x2, [sp, #0x10]
> +        ldp             x3, x4, [sp, #0x20]
> +        add             sp, sp, #0x40
> +
> +        movi            v16.8h, #0
> +        movi            v17.8h, #0
> +        movi            v18.8h, #0
> +        movi            v19.8h, #0
> +
> +        ld1             {v0.16b}, [x1], x3
> +        subs            w4, w4, #1                              // we need to make h-1 iterations
> +        ext             v1.16b, v0.16b, v0.16b, #1              // x1 + 1
> +        ld1             {v2.16b}, [x2], x3
> +        cmp             w4, #2
> +        ext             v3.16b, v2.16b, v2.16b, #1              // x2 + 1

The scheduling here (and in both loops below) is a bit non-ideal; we use
v0 very early after loading it, while we could start the load of v2 before
that.

> +
> +        b.lt            2f
> +
> +// make 2 iterations at once
> +1:
> +        ld1             {v4.16b}, [x1], x3
> +        ld1             {v20.16b}, [x1], x3

The load of v20 here would stall a bit, since the first load updates x1.
Instead, we can launch the load of v6 in between these two.

> +        ext             v5.16b, v4.16b, v4.16b, #1              // x1 + stride + 1
> +        ext             v21.16b, v20.16b, v20.16b, #1

We can move the use of v20 a bit later here, in between the two other
exts.

These changes (plus a similar minor one to the non-unrolled version at the
bottom) produce the following benchmark change for me:

Before:       Cortex A53     A72     A73
nsse_0_neon:       401.0   198.0   194.5
After:
nsse_0_neon:       377.0   198.7   196.5

(The differences on A72 and A73 are within the measurement noise; those
numbers vary more than that from one run to another.)

You can squash in the attached patch for simplicity.
Also, remember to fix the indentation of the wrapper function. With those changes, plus updated benchmark numbers in the commit messages, I think this patchset should be good to go. // Martin --8323329-323898564-1662541609=:1659 Content-Type: text/x-diff; name=0001-squash-nsse16-Tune-scheduling.patch Content-Transfer-Encoding: BASE64 Content-ID: <258a45fb-7bdb-a8d-544c-23cca07337b3@martin.st> Content-Description: Content-Disposition: attachment; filename=0001-squash-nsse16-Tune-scheduling.patch RnJvbSAwODFhZmY5NjdkNGZkYzNkNDc1Yzc3NzAzMzIyMzYyNWRiM2JiNTMy IE1vbiBTZXAgMTcgMDA6MDA6MDAgMjAwMQ0KRnJvbTogPT9VVEYtOD9xP01h cnRpbj0yMFN0b3Jzaj1DMz1CNj89IDxtYXJ0aW5AbWFydGluLnN0Pg0KRGF0 ZTogV2VkLCA3IFNlcCAyMDIyIDExOjUwOjI5ICswMzAwDQpTdWJqZWN0OiBb UEFUQ0hdIHNxdWFzaDogbnNzZTE2OiBUdW5lIHNjaGVkdWxpbmcNCg0KQmVm b3JlOiAgIENvcnRleCBBNTMgICAgIEE3MiAgICAgQTczDQpuc3NlXzBfbmVv bjogICA0MDEuMCAgIDE5OC4wICAgMTk0LjUNCkFmdGVyOg0KbnNzZV8wX25l b246ICAgMzc3LjAgICAxOTguNyAgIDE5Ni41DQoNCihUaGUgZGlmZmVyZW5j ZXMgb24gQTcyIGFuZCBBNzMgYXJlIHdpdGhpbiB0aGUgbWVhc3VyZW1lbnQg bm9pc2UsDQp0aG9zZSBudW1iZXJzIHZhcnkgbW9yZSB0aGFuIHRoYXQgZnJv bSBvbmUgcnVuIHRvIGFub3RoZXIuKQ0KLS0tDQogbGliYXZjb2RlYy9hYXJj aDY0L21lX2NtcF9uZW9uLlMgfCA4ICsrKystLS0tDQogMSBmaWxlIGNoYW5n ZWQsIDQgaW5zZXJ0aW9ucygrKSwgNCBkZWxldGlvbnMoLSkNCg0KZGlmZiAt LWdpdCBhL2xpYmF2Y29kZWMvYWFyY2g2NC9tZV9jbXBfbmVvbi5TIGIvbGli YXZjb2RlYy9hYXJjaDY0L21lX2NtcF9uZW9uLlMNCmluZGV4IGJkMjExMjJh MjEuLjJhMmFmN2E3ODggMTAwNjQ0DQotLS0gYS9saWJhdmNvZGVjL2FhcmNo NjQvbWVfY21wX25lb24uUw0KKysrIGIvbGliYXZjb2RlYy9hYXJjaDY0L21l X2NtcF9uZW9uLlMNCkBAIC04NzQsOCArODc0LDggQEAgZnVuY3Rpb24gbnNz ZTE2X25lb24sIGV4cG9ydD0xDQogDQogICAgICAgICBsZDEgICAgICAgICAg ICAge3YwLjE2Yn0sIFt4MV0sIHgzDQogICAgICAgICBzdWJzICAgICAgICAg ICAgdzQsIHc0LCAjMSAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIC8v IHdlIG5lZWQgdG8gbWFrZSBoLTEgaXRlcmF0aW9ucw0KLSAgICAgICAgZXh0 ICAgICAgICAgICAgIHYxLjE2YiwgdjAuMTZiLCB2MC4xNmIsICMxICAgICAg ICAgICAgICAvLyB4MSArIDENCiAgICAgICAgIGxkMSAgICAgICAgICAgICB7 
djIuMTZifSwgW3gyXSwgeDMNCisgICAgICAgIGV4dCAgICAgICAgICAgICB2 MS4xNmIsIHYwLjE2YiwgdjAuMTZiLCAjMSAgICAgICAgICAgICAgLy8geDEg KyAxDQogICAgICAgICBjbXAgICAgICAgICAgICAgdzQsICMyDQogICAgICAg ICBleHQgICAgICAgICAgICAgdjMuMTZiLCB2Mi4xNmIsIHYyLjE2YiwgIzEg ICAgICAgICAgICAgIC8vIHgyICsgMQ0KIA0KQEAgLTg4NCwxMiArODg0LDEy IEBAIGZ1bmN0aW9uIG5zc2UxNl9uZW9uLCBleHBvcnQ9MQ0KIC8vIG1ha2Ug MiBpdGVyYXRpb25zIGF0IG9uY2UNCiAxOg0KICAgICAgICAgbGQxICAgICAg ICAgICAgIHt2NC4xNmJ9LCBbeDFdLCB4Mw0KKyAgICAgICAgbGQxICAgICAg ICAgICAgIHt2Ni4xNmJ9LCBbeDJdLCB4Mw0KICAgICAgICAgbGQxICAgICAg ICAgICAgIHt2MjAuMTZifSwgW3gxXSwgeDMNCiAgICAgICAgIGV4dCAgICAg ICAgICAgICB2NS4xNmIsIHY0LjE2YiwgdjQuMTZiLCAjMSAgICAgICAgICAg ICAgLy8geDEgKyBzdHJpZGUgKyAxDQotICAgICAgICBleHQgICAgICAgICAg ICAgdjIxLjE2YiwgdjIwLjE2YiwgdjIwLjE2YiwgIzENCi0gICAgICAgIGxk MSAgICAgICAgICAgICB7djYuMTZifSwgW3gyXSwgeDMNCiAgICAgICAgIGxk MSAgICAgICAgICAgICB7djIyLjE2Yn0sIFt4Ml0sIHgzDQogICAgICAgICBl eHQgICAgICAgICAgICAgdjcuMTZiLCB2Ni4xNmIsIHY2LjE2YiwgIzEgICAg ICAgICAgICAgIC8vIHgyICsgc3RyaWRlICsgMQ0KKyAgICAgICAgZXh0ICAg ICAgICAgICAgIHYyMS4xNmIsIHYyMC4xNmIsIHYyMC4xNmIsICMxDQogICAg ICAgICBleHQgICAgICAgICAgICAgdjIzLjE2YiwgdjIyLjE2YiwgdjIyLjE2 YiwgIzENCiANCiAgICAgICAgIHVzdWJsICAgICAgICAgICB2MzEuOGgsIHYw LjhiLCB2NC44Yg0KQEAgLTkzMyw4ICs5MzMsOCBAQCBmdW5jdGlvbiBuc3Nl MTZfbmVvbiwgZXhwb3J0PTENCiAyOg0KICAgICAgICAgbGQxICAgICAgICAg ICAgIHt2NC4xNmJ9LCBbeDFdLCB4Mw0KICAgICAgICAgc3VicyAgICAgICAg ICAgIHc0LCB3NCwgIzENCi0gICAgICAgIGV4dCAgICAgICAgICAgICB2NS4x NmIsIHY0LjE2YiwgdjQuMTZiLCAjMSAgICAgICAgICAgICAgLy8geDEgKyBz dHJpZGUgKyAxDQogICAgICAgICBsZDEgICAgICAgICAgICAge3Y2LjE2Yn0s IFt4Ml0sIHgzDQorICAgICAgICBleHQgICAgICAgICAgICAgdjUuMTZiLCB2 NC4xNmIsIHY0LjE2YiwgIzEgICAgICAgICAgICAgIC8vIHgxICsgc3RyaWRl ICsgMQ0KICAgICAgICAgdXN1YmwgICAgICAgICAgIHYzMS44aCwgdjAuOGIs IHY0LjhiDQogICAgICAgICBleHQgICAgICAgICAgICAgdjcuMTZiLCB2Ni4x NmIsIHY2LjE2YiwgIzEgICAgICAgICAgICAgIC8vIHgyICsgc3RyaWRlICsg MQ0KIA0KLS0gDQoyLjI1LjENCg0K --8323329-323898564-1662541609=:1659 Content-Type: text/plain; 
charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

--8323329-323898564-1662541609=:1659--