From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTP id 0CFC344937 for ; Wed, 28 Sep 2022 09:07:14 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id A9CD968BB6B; Wed, 28 Sep 2022 12:07:11 +0300 (EEST) Received: from mail8.parnet.fi (mail8.parnet.fi [77.234.108.134]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id E954A68B903 for ; Wed, 28 Sep 2022 12:07:04 +0300 (EEST) Received: from mail9.parnet.fi (mail9.parnet.fi [77.234.108.21]) by mail8.parnet.fi with ESMTP id 28S96vNb002772-28S96vNc002772; Wed, 28 Sep 2022 12:06:57 +0300 Received: from foo.martin.st (host-97-187.parnet.fi [77.234.97.187]) by mail9.parnet.fi (Postfix) with ESMTPS id BDB29A1437; Wed, 28 Sep 2022 12:06:57 +0300 (EEST) Date: Wed, 28 Sep 2022 12:06:57 +0300 (EEST) From: =?ISO-8859-15?Q?Martin_Storsj=F6?= To: Grzegorz Bernacki In-Reply-To: <20220926091504.3459110-1-gjb@semihalf.com> Message-ID: <376a4a83-724d-263d-fe5d-aaf2944fc157@martin.st> References: <20220926091504.3459110-1-gjb@semihalf.com> MIME-Version: 1.0 Content-Type: multipart/mixed; boundary="8323329-1841389946-1664356017=:1659" X-FE-Attachment-Name: 0001-aarch64-me_cmp-Improve-scheduling-in-ff_pix_abs8_y2_.patch, 0002-aarch64-me_cmp-Fix-up-the-prologue-of-ff_pix_abs8_xy.patch X-FE-Policy-ID: 3:14:2:SYSTEM Subject: Re: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: Add neon implementation for pix_abs8 functions. X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: upstream@semihalf.com, jswinney@amazon.com, hum@semihalf.com, ffmpeg-devel@ffmpeg.org, mw@semihalf.com, spop@amazon.com Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: This message is in MIME format. The first part should be readable text, while the remaining parts are likely unreadable without MIME-aware tools. --8323329-1841389946-1664356017=:1659 Content-Type: text/plain; format=flowed; charset=US-ASCII On Mon, 26 Sep 2022, Grzegorz Bernacki wrote: > Provide optimized implementation of pix_abs8 function for arm64. > > Performance comparison tests are shown below: > pix_abs_1_1_c: 162.5 > pix_abs_1_1_neon: 27.0 > pix_abs_1_2_c: 174.0 > pix_abs_1_2_neon: 23.5 > pix_abs_1_3_c: 203.2 > pix_abs_1_3_neon: 34.7 > > Benchmarks and tests are run with checkasm tool on AWS Graviton 3. > > Signed-off-by: Grzegorz Bernacki > --- > libavcodec/aarch64/me_cmp_init_aarch64.c | 9 ++ > libavcodec/aarch64/me_cmp_neon.S | 193 +++++++++++++++++++++++ > 2 files changed, 202 insertions(+) I don't see any changes compared to the version you sent last week? If reposting a patchset, please do mention what has changed - or ping the old one. (I had it on my radar to review, but reviews of larger amounts of code doesn't happen immediately, unfortunately.) Overall, you seem to have mixed in tabs among regular spaces. Please do fix that. (I would have kinda expected us to have a fate test that checks this, but apparently we don't.) > diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c b/libavcodec/aarch64/me_cmp_init_aarch64.c > index e143f0816e..3459403ee5 100644 > --- a/libavcodec/aarch64/me_cmp_init_aarch64.c > +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c > @@ -59,6 +59,12 @@ int pix_median_abs16_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t > ptrdiff_t stride, int h); > int pix_median_abs8_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2, > ptrdiff_t stride, int h); > +int ff_pix_abs8_x2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2, > + ptrdiff_t stride, int h); > +int ff_pix_abs8_y2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2, > + ptrdiff_t stride, int h); > +int ff_pix_abs8_xy2_neon(MpegEncContext *v, const uint8_t *pix1, const uint8_t *pix2, > + ptrdiff_t stride, int h); > > av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) > { > @@ -70,6 +76,9 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext *avctx) > c->pix_abs[0][2] = ff_pix_abs16_y2_neon; > c->pix_abs[0][3] = ff_pix_abs16_xy2_neon; > c->pix_abs[1][0] = ff_pix_abs8_neon; > + c->pix_abs[1][1] = ff_pix_abs8_x2_neon; > + c->pix_abs[1][2] = ff_pix_abs8_y2_neon; > + c->pix_abs[1][3] = ff_pix_abs8_xy2_neon; In most mediums, the diff here shows that there's something off - tabs. > > c->sad[0] = ff_pix_abs16_neon; > c->sad[1] = ff_pix_abs8_neon; > diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S > index 11af4849f9..e03c0c26cd 100644 > --- a/libavcodec/aarch64/me_cmp_neon.S > +++ b/libavcodec/aarch64/me_cmp_neon.S > @@ -119,6 +119,199 @@ function ff_pix_abs8_neon, export=1 > ret > endfunc > > +function ff_pix_abs8_x2_neon, export=1 > + // x0 unused > + // x1 uint8_t *pix1 > + // x2 uint8_t *pix2 > + // x3 ptrdiff_t stride > + // w4 int h > + > + cmp w4, #4 > + movi v26.8h, #0 > + add x5, x2, #1 // pix2 + 1 > + b.lt 2f > + > +// make 4 iterations at once > +1: > + ld1 {v1.8b}, [x2], x3 > + ld1 {v2.8b}, [x5], x3 > + ld1 {v0.8b}, [x1], x3 > + ld1 {v4.8b}, [x2], x3 > + urhadd v30.8b, v1.8b, v2.8b > + ld1 {v5.8b}, [x5], x3 > + uabal v26.8h, v0.8b, v30.8b > + ld1 {v6.8b}, [x1], x3 > + urhadd v29.8b, v4.8b, v5.8b > + ld1 {v7.8b}, [x2], x3 > + ld1 {v20.8b}, [x5], x3 > + uabal v26.8h, v6.8b, v29.8b > + ld1 {v21.8b}, [x1], x3 > + urhadd v28.8b, v7.8b, v20.8b > + ld1 {v22.8b}, [x2], x3 > + ld1 {v23.8b}, [x5], x3 > + uabal v26.8h, v21.8b, v28.8b > + sub w4, w4, #4 > + ld1 {v24.8b}, [x1], x3 > + urhadd v27.8b, v22.8b, v23.8b > + cmp w4, #4 > + uabal v26.8h, v24.8b, v27.8b > + > + b.ge 1b > + cbz w4, 3f > + > +// iterate by one > +2: > + ld1 {v1.8b}, [x2], x3 > + ld1 {v2.8b}, [x5], x3 > + ld1 {v0.8b}, [x1], x3 > + urhadd v30.8b, v1.8b, v2.8b > + subs w4, w4, #1 > + uabal v26.8h, v0.8b, v30.8b > + > + b.ne 2b > +3: > + uaddlv s20, v26.8h > + fmov w0, s20 > + > + ret > + > +endfunc > + > +function ff_pix_abs8_y2_neon, export=1 > + // x0 unused > + // x1 uint8_t *pix1 > + // x2 uint8_t *pix2 > + // x3 ptrdiff_t stride > + // w4 int h > + > + cmp w4, #4 > + movi v26.8h, #0 > + ld1 {v1.8b}, [x2], x3 > + b.lt 2f > + > +// make 4 iterations at once > +1: > + ld1 {v2.8b}, [x2], x3 > + ld1 {v0.8b}, [x1], x3 > + ld1 {v6.8b}, [x1], x3 > + urhadd v30.8b, v1.8b, v2.8b Good thing that you wrote this better than the current ff_pix_abs16_y2_neon, reusing the data from the previous round. But the scheduling could be much improved here (why load v6 directly after v0 - x1 will delay it due to being updated in the previous instruction, and v6 won't be needed for a long time here yet). By fixing the scheduling of this function (and getting rid of the unnecessary mov instruction at the end of the unrolled loop), I got it down from 74 cycles to 62 cycles on a Cortex A53. I'm attaching the fixup I did, so you can apply it on top instead of having to describe it verbally. > + ld1 {v5.8b}, [x2], x3 > + ld1 {v21.8b}, [x1], x3 > + uabal v26.8h, v0.8b, v30.8b > + urhadd v29.8b, v2.8b, v5.8b > + ld1 {v20.8b}, [x2], x3 > + ld1 {v24.8b}, [x1], x3 > + uabal v26.8h, v6.8b, v29.8b > + urhadd v28.8b, v5.8b, v20.8b > + uabal v26.8h, v21.8b, v28.8b > + ld1 {v23.8b}, [x2], x3 > + mov v1.8b, v23.8b > + sub w4, w4, #4 > + urhadd v27.8b, v20.8b, v23.8b > + cmp w4, #4 > + uabal v26.8h, v24.8b, v27.8b > + > + b.ge 1b > + cbz w4, 3f > + > +// iterate by one > +2: > + ld1 {v0.8b}, [x1], x3 > + ld1 {v2.8b}, [x2], x3 > + urhadd v30.8b, v1.8b, v2.8b > + subs w4, w4, #1 > + uabal v26.8h, v0.8b, v30.8b > + mov v1.8b, v2.8b > + > + b.ne 2b > +3: > + uaddlv s20, v26.8h > + fmov w0, s20 > + > + ret > + > +endfunc > + > +function ff_pix_abs8_xy2_neon, export=1 > + // x0 unused > + // x1 uint8_t *pix1 > + // x2 uint8_t *pix2 > + // x3 ptrdiff_t stride > + // w4 int h > + > + movi v31.8h, #0 > + add x0, x2, 1 // pix2 + 1 > + > + add x5, x2, x3 // pix2 + stride = pix3 > + cmp w4, #4 > + add x6, x5, 1 // pix3 + stride + 1 > + > + b.lt 2f > + > + ld1 {v0.8b}, [x2], x3 > + ld1 {v1.8b}, [x0], x3 > + uaddl v2.8h, v0.8b, v1.8b If we'd start out with h<4, we'd jump to the 2f label above, missing the setup of v0-v2 here. Other than that, the patch looks reasonable to me. // Martin --8323329-1841389946-1664356017=:1659 Content-Type: text/x-diff; name=0001-aarch64-me_cmp-Improve-scheduling-in-ff_pix_abs8_y2_.patch Content-Transfer-Encoding: BASE64 Content-ID: <4b977ff3-80a2-f346-2bac-12d327493e71@martin.st> Content-Description: Content-Disposition: attachment; filename=0001-aarch64-me_cmp-Improve-scheduling-in-ff_pix_abs8_y2_.patch RnJvbSA3OTA5NDMyYjNjZmZkY2IzMWMwOTUzMjc5ZDg2MmY4MjBhNzM4ZWM0 IE1vbiBTZXAgMTcgMDA6MDA6MDAgMjAwMQ0KRnJvbTogPT9VVEYtOD9xP01h cnRpbj0yMFN0b3Jzaj1DMz1CNj89IDxtYXJ0aW5AbWFydGluLnN0Pg0KRGF0 ZTogV2VkLCAyOCBTZXAgMjAyMiAxMToyOTo1NyArMDMwMA0KU3ViamVjdDog W1BBVENIIDEvMl0gYWFyY2g2NDogbWVfY21wOiBJbXByb3ZlIHNjaGVkdWxp bmcgaW4NCiBmZl9waXhfYWJzOF95Ml9uZW9uDQpNSU1FLVZlcnNpb246IDEu MA0KQ29udGVudC1UeXBlOiB0ZXh0L3BsYWluOyBjaGFyc2V0PVVURi04DQpD b250ZW50LVRyYW5zZmVyLUVuY29kaW5nOiA4Yml0DQoNCkJlZm9yZTogICAg ICAgQ29ydGV4IEE1MyAgICBBNzIgICAgQTczDQpwaXhfYWJzXzFfMl9uZW9u OiAgIDczLjcgICAzMS4wICAgMjUuNw0KQWZ0ZXI6DQpwaXhfYWJzXzFfMl9u ZW9uOiAgIDYxLjcgICAzMC4yICAgMjQuNw0KDQpTaWduZWQtb2ZmLWJ5OiBN YXJ0aW4gU3RvcnNqw7YgPG1hcnRpbkBtYXJ0aW4uc3Q+DQotLS0NCiBsaWJh dmNvZGVjL2FhcmNoNjQvbWVfY21wX25lb24uUyB8IDEzICsrKysrKy0tLS0t LS0NCiAxIGZpbGUgY2hhbmdlZCwgNiBpbnNlcnRpb25zKCspLCA3IGRlbGV0 aW9ucygtKQ0KDQpkaWZmIC0tZ2l0IGEvbGliYXZjb2RlYy9hYXJjaDY0L21l X2NtcF9uZW9uLlMgYi9saWJhdmNvZGVjL2FhcmNoNjQvbWVfY21wX25lb24u Uw0KaW5kZXggYTkwMGFlMjE5MC4uN2YyYjdjNzU0ZiAxMDA2NDQNCi0tLSBh L2xpYmF2Y29kZWMvYWFyY2g2NC9tZV9jbXBfbmVvbi5TDQorKysgYi9saWJh dmNvZGVjL2FhcmNoNjQvbWVfY21wX25lb24uUw0KQEAgLTE5MywyMSArMTkz LDIwIEBAIGZ1bmN0aW9uIGZmX3BpeF9hYnM4X3kyX25lb24sIGV4cG9ydD0x DQogMToNCiAgICAgICAgIGxkMSAgICAgICAgICAgICB7djIuOGJ9LCBbeDJd LCB4Mw0KICAgICAgICAgbGQxICAgICAgICAgICAgIHt2MC44Yn0sIFt4MV0s IHgzDQotICAgICAgICBsZDEgICAgICAgICAgICAge3Y2LjhifSwgW3gxXSwg eDMNCiAJdXJoYWRkICAgICAgICAgIHYzMC44YiwgdjEuOGIsIHYyLjhiDQog ICAgICAgICBsZDEgICAgICAgICAgICAge3Y1LjhifSwgW3gyXSwgeDMNCi0g ICAgICAgIGxkMSAgICAgICAgICAgICB7djIxLjhifSwgW3gxXSwgeDMNCisg ICAgICAgIGxkMSAgICAgICAgICAgICB7djYuOGJ9LCBbeDFdLCB4Mw0KICAg ICAgICAgdWFiYWwgICAgICAgICAgIHYyNi44aCwgdjAuOGIsIHYzMC44Yg0K IAl1cmhhZGQgICAgICAgICAgdjI5LjhiLCB2Mi44YiwgdjUuOGINCiAgICAg ICAgIGxkMSAgICAgICAgICAgICB7djIwLjhifSwgW3gyXSwgeDMNCi0gICAg ICAgIGxkMSAgICAgICAgICAgICB7djI0LjhifSwgW3gxXSwgeDMNCisgICAg ICAgIGxkMSAgICAgICAgICAgICB7djIxLjhifSwgW3gxXSwgeDMNCiAgICAg ICAgIHVhYmFsICAgICAgICAgICB2MjYuOGgsIHY2LjhiLCB2MjkuOGINCiAJ dXJoYWRkICAgICAgICAgIHYyOC44YiwgdjUuOGIsIHYyMC44Yg0KLSAgICAg ICAgdWFiYWwgICAgICAgICAgIHYyNi44aCwgdjIxLjhiLCB2MjguOGINCi0g ICAgICAgIGxkMSAgICAgICAgICAgICB7djIzLjhifSwgW3gyXSwgeDMNCi0g ICAgICAgIG1vdiAgICAgICAgICAgICB2MS44YiwgdjIzLjhiDQorICAgICAg ICBsZDEgICAgICAgICAgICAge3YxLjhifSwgIFt4Ml0sIHgzDQorICAgICAg ICBsZDEgICAgICAgICAgICAge3YyNC44Yn0sIFt4MV0sIHgzDQorCXVyaGFk ZCAgICAgICAgICB2MjcuOGIsIHYyMC44YiwgdjEuOGINCiAgICAgICAgIHN1 YiAgICAgICAgICAgICB3NCwgdzQsICM0DQotCXVyaGFkZCAgICAgICAgICB2 MjcuOGIsIHYyMC44YiwgdjIzLjhiDQorICAgICAgICB1YWJhbCAgICAgICAg ICAgdjI2LjhoLCB2MjEuOGIsIHYyOC44Yg0KICAgICAgICAgY21wICAgICAg ICAgICAgIHc0LCAjNA0KICAgICAgICAgdWFiYWwgICAgICAgICAgIHYyNi44 aCwgdjI0LjhiLCB2MjcuOGINCiANCi0tIA0KMi4yNS4xDQoNCg== --8323329-1841389946-1664356017=:1659 Content-Type: text/x-diff; name=0002-aarch64-me_cmp-Fix-up-the-prologue-of-ff_pix_abs8_xy.patch Content-Transfer-Encoding: BASE64 Content-ID: Content-Description: Content-Disposition: attachment; filename=0002-aarch64-me_cmp-Fix-up-the-prologue-of-ff_pix_abs8_xy.patch RnJvbSBkZmM5MTQ1M2FhMzYwOGI2YjZiN2MwZTgzMjI5MjY2ODE1MGZkMGQ1 IE1vbiBTZXAgMTcgMDA6MDA6MDAgMjAwMQ0KRnJvbTogPT9VVEYtOD9xP01h cnRpbj0yMFN0b3Jzaj1DMz1CNj89IDxtYXJ0aW5AbWFydGluLnN0Pg0KRGF0 ZTogV2VkLCAyOCBTZXAgMjAyMiAxMTo0MzowMiArMDMwMA0KU3ViamVjdDog W1BBVENIIDIvMl0gYWFyY2g2NDogbWVfY21wOiBGaXggdXAgdGhlIHByb2xv Z3VlIG9mDQogZmZfcGl4X2FiczhfeHkyX25lb24NCg0KVGhpcyBpbml0aWFs aXplcyB0aGluZ3MgcHJvcGVybHkgaWYgdGhpcyB3ZXJlIHRvIGJlIGNhbGxl ZCB3aXRoDQpoIDwgNC4NCi0tLQ0KIGxpYmF2Y29kZWMvYWFyY2g2NC9tZV9j bXBfbmVvbi5TIHwgNCArKy0tDQogMSBmaWxlIGNoYW5nZWQsIDIgaW5zZXJ0 aW9ucygrKSwgMiBkZWxldGlvbnMoLSkNCg0KZGlmZiAtLWdpdCBhL2xpYmF2 Y29kZWMvYWFyY2g2NC9tZV9jbXBfbmVvbi5TIGIvbGliYXZjb2RlYy9hYXJj aDY0L21lX2NtcF9uZW9uLlMNCmluZGV4IDdmMmI3Yzc1NGYuLjQwMzc5NTM0 ODggMTAwNjQ0DQotLS0gYS9saWJhdmNvZGVjL2FhcmNoNjQvbWVfY21wX25l b24uUw0KKysrIGIvbGliYXZjb2RlYy9hYXJjaDY0L21lX2NtcF9uZW9uLlMN CkBAIC0yNDUsMTIgKzI0NSwxMiBAQCBmdW5jdGlvbiBmZl9waXhfYWJzOF94 eTJfbmVvbiwgZXhwb3J0PTENCiAgICAgICAgIGNtcCAgICAgICAgICAgICB3 NCwgIzQNCiAgICAgICAgIGFkZCAgICAgICAgICAgICB4NiwgeDUsIDEgLy8g cGl4MyArIHN0cmlkZSArIDENCiANCi0gICAgICAgIGIubHQgICAgICAgICAg ICAyZg0KLQ0KICAgICAgICAgbGQxICAgICAgICAgICAgIHt2MC44Yn0sIFt4 Ml0sIHgzDQogICAgICAgICBsZDEgICAgICAgICAgICAge3YxLjhifSwgW3gw XSwgeDMNCiAgICAgICAgIHVhZGRsICAgICAgICAgICB2Mi44aCwgdjAuOGIs IHYxLjhiDQogDQorICAgICAgICBiLmx0ICAgICAgICAgICAgMmYNCisNCiAv LyBtYWtlIDQgaXRlcmF0aW9ucyBhdCBvbmNlDQogMToNCiAgICAgICAgIGxk MSAgICAgICAgICAgICB7djQuOGJ9LCBbeDVdLCB4Mw0KLS0gDQoyLjI1LjEN Cg0K --8323329-1841389946-1664356017=:1659 Content-Type: text/plain; charset="us-ascii" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit Content-Disposition: inline _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe". --8323329-1841389946-1664356017=:1659--