From: Grzegorz Bernacki <gjb@semihalf.com>
To: "Martin Storsjö" <martin@martin.st>
Cc: upstream@semihalf.com, jswinney@amazon.com, hum@semihalf.com,
ffmpeg-devel@ffmpeg.org, mw@semihalf.com, spop@amazon.com
Subject: Re: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: Add neon implementation for pix_abs8 functions.
Date: Wed, 28 Sep 2022 14:22:20 +0200
Message-ID: <CAA2Cew5bXGsEO-ykYKcuaUeOP+gDOS9b4ovLnfxEQmoKtEVTfg@mail.gmail.com> (raw)
In-Reply-To: <376a4a83-724d-263d-fe5d-aaf2944fc157@martin.st>
Hi Martin,
I resent the patchset, because the first try did not reach ffmpeg-devel
maillist. I apologize, I should have mentioned about that in cover letter.
Thanks a lot for your review, I will apply the changes and send v2 soon.
thanks,
grzegorz
śr., 28 wrz 2022 o 11:07 Martin Storsjö <martin@martin.st> napisał(a):
> On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
>
> > Provide optimized implementation of pix_abs8 function for arm64.
> >
> > Performance comparison tests are shown below:
> > pix_abs_1_1_c: 162.5
> > pix_abs_1_1_neon: 27.0
> > pix_abs_1_2_c: 174.0
> > pix_abs_1_2_neon: 23.5
> > pix_abs_1_3_c: 203.2
> > pix_abs_1_3_neon: 34.7
> >
> > Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
> >
> > Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>
> > ---
> > libavcodec/aarch64/me_cmp_init_aarch64.c | 9 ++
> > libavcodec/aarch64/me_cmp_neon.S | 193 +++++++++++++++++++++++
> > 2 files changed, 202 insertions(+)
>
> I don't see any changes compared to the version you sent last week? If
> reposting a patchset, please do mention what has changed - or ping the old
> one. (I had it on my radar to review, but reviews of larger amounts of
> code doesn't happen immediately, unfortunately.)
>
> Overall, you seem to have mixed in tabs among regular spaces. Please do
> fix that. (I would have kinda expected us to have a fate test that checks
> this, but apparently we don't.)
>
> > diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c
> b/libavcodec/aarch64/me_cmp_init_aarch64.c
> > index e143f0816e..3459403ee5 100644
> > --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> > +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> > @@ -59,6 +59,12 @@ int pix_median_abs16_neon(MpegEncContext *v, const
> uint8_t *pix1, const uint8_t
> > ptrdiff_t stride, int h);
> > int pix_median_abs8_neon(MpegEncContext *v, const uint8_t *pix1, const
> uint8_t *pix2,
> > ptrdiff_t stride, int h);
> > +int ff_pix_abs8_x2_neon(MpegEncContext *v, const uint8_t *pix1, const
> uint8_t *pix2,
> > + ptrdiff_t stride, int h);
> > +int ff_pix_abs8_y2_neon(MpegEncContext *v, const uint8_t *pix1, const
> uint8_t *pix2,
> > + ptrdiff_t stride, int h);
> > +int ff_pix_abs8_xy2_neon(MpegEncContext *v, const uint8_t *pix1, const
> uint8_t *pix2,
> > + ptrdiff_t stride, int h);
> >
> > av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext
> *avctx)
> > {
> > @@ -70,6 +76,9 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c,
> AVCodecContext *avctx)
> > c->pix_abs[0][2] = ff_pix_abs16_y2_neon;
> > c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
> > c->pix_abs[1][0] = ff_pix_abs8_neon;
> > + c->pix_abs[1][1] = ff_pix_abs8_x2_neon;
> > + c->pix_abs[1][2] = ff_pix_abs8_y2_neon;
> > + c->pix_abs[1][3] = ff_pix_abs8_xy2_neon;
>
> In most mediums, the diff here shows that there's something off - tabs.
>
> >
> > c->sad[0] = ff_pix_abs16_neon;
> > c->sad[1] = ff_pix_abs8_neon;
> > diff --git a/libavcodec/aarch64/me_cmp_neon.S
> b/libavcodec/aarch64/me_cmp_neon.S
> > index 11af4849f9..e03c0c26cd 100644
> > --- a/libavcodec/aarch64/me_cmp_neon.S
> > +++ b/libavcodec/aarch64/me_cmp_neon.S
> > @@ -119,6 +119,199 @@ function ff_pix_abs8_neon, export=1
> > ret
> > endfunc
> >
> > +function ff_pix_abs8_x2_neon, export=1
> > + // x0 unused
> > + // x1 uint8_t *pix1
> > + // x2 uint8_t *pix2
> > + // x3 ptrdiff_t stride
> > + // w4 int h
> > +
> > + cmp w4, #4
> > + movi v26.8h, #0
> > + add x5, x2, #1 // pix2 + 1
> > + b.lt 2f
> > +
> > +// make 4 iterations at once
> > +1:
> > + ld1 {v1.8b}, [x2], x3
> > + ld1 {v2.8b}, [x5], x3
> > + ld1 {v0.8b}, [x1], x3
> > + ld1 {v4.8b}, [x2], x3
> > + urhadd v30.8b, v1.8b, v2.8b
> > + ld1 {v5.8b}, [x5], x3
> > + uabal v26.8h, v0.8b, v30.8b
> > + ld1 {v6.8b}, [x1], x3
> > + urhadd v29.8b, v4.8b, v5.8b
> > + ld1 {v7.8b}, [x2], x3
> > + ld1 {v20.8b}, [x5], x3
> > + uabal v26.8h, v6.8b, v29.8b
> > + ld1 {v21.8b}, [x1], x3
> > + urhadd v28.8b, v7.8b, v20.8b
> > + ld1 {v22.8b}, [x2], x3
> > + ld1 {v23.8b}, [x5], x3
> > + uabal v26.8h, v21.8b, v28.8b
> > + sub w4, w4, #4
> > + ld1 {v24.8b}, [x1], x3
> > + urhadd v27.8b, v22.8b, v23.8b
> > + cmp w4, #4
> > + uabal v26.8h, v24.8b, v27.8b
> > +
> > + b.ge 1b
> > + cbz w4, 3f
> > +
> > +// iterate by one
> > +2:
> > + ld1 {v1.8b}, [x2], x3
> > + ld1 {v2.8b}, [x5], x3
> > + ld1 {v0.8b}, [x1], x3
> > + urhadd v30.8b, v1.8b, v2.8b
> > + subs w4, w4, #1
> > + uabal v26.8h, v0.8b, v30.8b
> > +
> > + b.ne 2b
> > +3:
> > + uaddlv s20, v26.8h
> > + fmov w0, s20
> > +
> > + ret
> > +
> > +endfunc
> > +
> > +function ff_pix_abs8_y2_neon, export=1
> > + // x0 unused
> > + // x1 uint8_t *pix1
> > + // x2 uint8_t *pix2
> > + // x3 ptrdiff_t stride
> > + // w4 int h
> > +
> > + cmp w4, #4
> > + movi v26.8h, #0
> > + ld1 {v1.8b}, [x2], x3
> > + b.lt 2f
> > +
> > +// make 4 iterations at once
> > +1:
> > + ld1 {v2.8b}, [x2], x3
> > + ld1 {v0.8b}, [x1], x3
> > + ld1 {v6.8b}, [x1], x3
> > + urhadd v30.8b, v1.8b, v2.8b
>
> Good thing that you wrote this better than the current
> ff_pix_abs16_y2_neon, reusing the data from the previous round. But the
> scheduling could be much improved here (why load v6 directly after v0 - x1
> will delay it due to being updated in the previous instruction, and v6
> won't be needed for a long time here yet).
>
> By fixing the scheduling of this function (and getting rid of the
> unnecessary mov instruction at the end of the unrolled loop), I got it
> down from 74 cycles to 62 cycles on a Cortex A53. I'm attaching the fixup
> I did, so you can apply it on top instead of having to describe it
> verbally.
>
> > + ld1 {v5.8b}, [x2], x3
> > + ld1 {v21.8b}, [x1], x3
> > + uabal v26.8h, v0.8b, v30.8b
> > + urhadd v29.8b, v2.8b, v5.8b
> > + ld1 {v20.8b}, [x2], x3
> > + ld1 {v24.8b}, [x1], x3
> > + uabal v26.8h, v6.8b, v29.8b
> > + urhadd v28.8b, v5.8b, v20.8b
> > + uabal v26.8h, v21.8b, v28.8b
> > + ld1 {v23.8b}, [x2], x3
> > + mov v1.8b, v23.8b
> > + sub w4, w4, #4
> > + urhadd v27.8b, v20.8b, v23.8b
> > + cmp w4, #4
> > + uabal v26.8h, v24.8b, v27.8b
> > +
> > + b.ge 1b
> > + cbz w4, 3f
> > +
> > +// iterate by one
> > +2:
> > + ld1 {v0.8b}, [x1], x3
> > + ld1 {v2.8b}, [x2], x3
> > + urhadd v30.8b, v1.8b, v2.8b
> > + subs w4, w4, #1
> > + uabal v26.8h, v0.8b, v30.8b
> > + mov v1.8b, v2.8b
> > +
> > + b.ne 2b
> > +3:
> > + uaddlv s20, v26.8h
> > + fmov w0, s20
> > +
> > + ret
> > +
> > +endfunc
> > +
> > +function ff_pix_abs8_xy2_neon, export=1
> > + // x0 unused
> > + // x1 uint8_t *pix1
> > + // x2 uint8_t *pix2
> > + // x3 ptrdiff_t stride
> > + // w4 int h
> > +
> > + movi v31.8h, #0
> > + add x0, x2, 1 // pix2 + 1
> > +
> > + add x5, x2, x3 // pix2 + stride = pix3
> > + cmp w4, #4
> > + add x6, x5, 1 // pix3 + stride + 1
> > +
> > + b.lt 2f
> > +
> > + ld1 {v0.8b}, [x2], x3
> > + ld1 {v1.8b}, [x0], x3
> > + uaddl v2.8h, v0.8b, v1.8b
>
> If we'd start out with h<4, we'd jump to the 2f label above, missing the
> setup of v0-v2 here.
>
> Other than that, the patch looks reasonable to me.
>
> // Martin
>
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
prev parent reply other threads:[~2022-09-28 12:22 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-09-26 9:15 Grzegorz Bernacki
2022-09-26 9:15 ` [FFmpeg-devel] [PATCH 2/4] lavc/aarch64: Provide neon implementation of nsse8 Grzegorz Bernacki
2022-09-28 9:08 ` Martin Storsjö
2022-09-26 9:15 ` [FFmpeg-devel] [PATCH 3/4] lavc/aarch64: Provide optimized implementation of vsse8 for arm64 Grzegorz Bernacki
2022-09-28 9:09 ` Martin Storsjö
2022-09-26 9:15 ` [FFmpeg-devel] [PATCH 4/4] lavc/aarch64: Add neon implementation for vsse_intra8 Grzegorz Bernacki
2022-09-28 9:11 ` Martin Storsjö
2022-09-28 9:06 ` [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: Add neon implementation for pix_abs8 functions Martin Storsjö
2022-09-28 12:22 ` Grzegorz Bernacki [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAA2Cew5bXGsEO-ykYKcuaUeOP+gDOS9b4ovLnfxEQmoKtEVTfg@mail.gmail.com \
--to=gjb@semihalf.com \
--cc=ffmpeg-devel@ffmpeg.org \
--cc=hum@semihalf.com \
--cc=jswinney@amazon.com \
--cc=martin@martin.st \
--cc=mw@semihalf.com \
--cc=spop@amazon.com \
--cc=upstream@semihalf.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
This inbox may be cloned and mirrored by anyone:
git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git
# If you have public-inbox 1.1+ installed, you may
# initialize and index your mirror using the following commands:
public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
ffmpegdev@gitmailbox.com
public-inbox-index ffmpegdev
Example config snippet for mirrors.
AGPL code for this site: git clone https://public-inbox.org/public-inbox.git