Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
 help / color / mirror / Atom feed
From: Grzegorz Bernacki <gjb@semihalf.com>
To: "Martin Storsjö" <martin@martin.st>
Cc: upstream@semihalf.com, jswinney@amazon.com, hum@semihalf.com,
	ffmpeg-devel@ffmpeg.org, mw@semihalf.com, spop@amazon.com
Subject: Re: [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: Add neon implementation for pix_abs8 functions.
Date: Wed, 28 Sep 2022 14:22:20 +0200
Message-ID: <CAA2Cew5bXGsEO-ykYKcuaUeOP+gDOS9b4ovLnfxEQmoKtEVTfg@mail.gmail.com> (raw)
In-Reply-To: <376a4a83-724d-263d-fe5d-aaf2944fc157@martin.st>

Hi Martin,

I resent the patchset, because the first try did not reach ffmpeg-devel
maillist. I apologize, I should have mentioned about that in cover letter.
Thanks a lot for your review, I will apply the changes and send v2 soon.

thanks,
grzegorz

śr., 28 wrz 2022 o 11:07 Martin Storsjö <martin@martin.st> napisał(a):

> On Mon, 26 Sep 2022, Grzegorz Bernacki wrote:
>
> > Provide optimized implementation of pix_abs8 function for arm64.
> >
> > Performance comparison tests are shown below:
> > pix_abs_1_1_c: 162.5
> > pix_abs_1_1_neon: 27.0
> > pix_abs_1_2_c: 174.0
> > pix_abs_1_2_neon: 23.5
> > pix_abs_1_3_c: 203.2
> > pix_abs_1_3_neon: 34.7
> >
> > Benchmarks and tests are run with checkasm tool on AWS Graviton 3.
> >
> > Signed-off-by: Grzegorz Bernacki <gjb@semihalf.com>
> > ---
> > libavcodec/aarch64/me_cmp_init_aarch64.c |   9 ++
> > libavcodec/aarch64/me_cmp_neon.S         | 193 +++++++++++++++++++++++
> > 2 files changed, 202 insertions(+)
>
> I don't see any changes compared to the version you sent last week? If
> reposting a patchset, please do mention what has changed - or ping the old
> one. (I had it on my radar to review, but reviews of larger amounts of
> code doesn't happen immediately, unfortunately.)
>
> Overall, you seem to have mixed in tabs among regular spaces. Please do
> fix that. (I would have kinda expected us to have a fate test that checks
> this, but apparently we don't.)
>
> > diff --git a/libavcodec/aarch64/me_cmp_init_aarch64.c
> b/libavcodec/aarch64/me_cmp_init_aarch64.c
> > index e143f0816e..3459403ee5 100644
> > --- a/libavcodec/aarch64/me_cmp_init_aarch64.c
> > +++ b/libavcodec/aarch64/me_cmp_init_aarch64.c
> > @@ -59,6 +59,12 @@ int pix_median_abs16_neon(MpegEncContext *v, const
> uint8_t *pix1, const uint8_t
> >                           ptrdiff_t stride, int h);
> > int pix_median_abs8_neon(MpegEncContext *v, const uint8_t *pix1, const
> uint8_t *pix2,
> >                          ptrdiff_t stride, int h);
> > +int ff_pix_abs8_x2_neon(MpegEncContext *v, const uint8_t *pix1, const
> uint8_t *pix2,
> > +                        ptrdiff_t stride, int h);
> > +int ff_pix_abs8_y2_neon(MpegEncContext *v, const uint8_t *pix1, const
> uint8_t *pix2,
> > +                        ptrdiff_t stride, int h);
> > +int ff_pix_abs8_xy2_neon(MpegEncContext *v, const uint8_t *pix1, const
> uint8_t *pix2,
> > +                         ptrdiff_t stride, int h);
> >
> > av_cold void ff_me_cmp_init_aarch64(MECmpContext *c, AVCodecContext
> *avctx)
> > {
> > @@ -70,6 +76,9 @@ av_cold void ff_me_cmp_init_aarch64(MECmpContext *c,
> AVCodecContext *avctx)
> >         c->pix_abs[0][2] = ff_pix_abs16_y2_neon;
> >         c->pix_abs[0][3] = ff_pix_abs16_xy2_neon;
> >         c->pix_abs[1][0] = ff_pix_abs8_neon;
> > +         c->pix_abs[1][1] = ff_pix_abs8_x2_neon;
> > +         c->pix_abs[1][2] = ff_pix_abs8_y2_neon;
> > +         c->pix_abs[1][3] = ff_pix_abs8_xy2_neon;
>
> In most mediums, the diff here shows that there's something off - tabs.
>
> >
> >         c->sad[0] = ff_pix_abs16_neon;
> >         c->sad[1] = ff_pix_abs8_neon;
> > diff --git a/libavcodec/aarch64/me_cmp_neon.S
> b/libavcodec/aarch64/me_cmp_neon.S
> > index 11af4849f9..e03c0c26cd 100644
> > --- a/libavcodec/aarch64/me_cmp_neon.S
> > +++ b/libavcodec/aarch64/me_cmp_neon.S
> > @@ -119,6 +119,199 @@ function ff_pix_abs8_neon, export=1
> >         ret
> > endfunc
> >
> > +function ff_pix_abs8_x2_neon, export=1
> > +        // x0           unused
> > +        // x1           uint8_t *pix1
> > +        // x2           uint8_t *pix2
> > +        // x3           ptrdiff_t stride
> > +        // w4           int h
> > +
> > +        cmp             w4, #4
> > +        movi            v26.8h, #0
> > +        add             x5, x2, #1 // pix2 + 1
> > +        b.lt            2f
> > +
> > +// make 4 iterations at once
> > +1:
> > +        ld1             {v1.8b}, [x2], x3
> > +        ld1             {v2.8b}, [x5], x3
> > +        ld1             {v0.8b}, [x1], x3
> > +        ld1             {v4.8b}, [x2], x3
> > +     urhadd          v30.8b, v1.8b, v2.8b
> > +        ld1             {v5.8b}, [x5], x3
> > +        uabal           v26.8h, v0.8b, v30.8b
> > +        ld1             {v6.8b}, [x1], x3
> > +     urhadd          v29.8b, v4.8b, v5.8b
> > +        ld1             {v7.8b}, [x2], x3
> > +        ld1             {v20.8b}, [x5], x3
> > +        uabal           v26.8h, v6.8b, v29.8b
> > +        ld1             {v21.8b}, [x1], x3
> > +     urhadd          v28.8b, v7.8b, v20.8b
> > +        ld1             {v22.8b}, [x2], x3
> > +        ld1             {v23.8b}, [x5], x3
> > +        uabal           v26.8h, v21.8b, v28.8b
> > +        sub             w4, w4, #4
> > +        ld1             {v24.8b}, [x1], x3
> > +     urhadd          v27.8b, v22.8b, v23.8b
> > +        cmp             w4, #4
> > +        uabal           v26.8h, v24.8b, v27.8b
> > +
> > +        b.ge            1b
> > +        cbz             w4, 3f
> > +
> > +// iterate by one
> > +2:
> > +        ld1             {v1.8b}, [x2], x3
> > +        ld1             {v2.8b}, [x5], x3
> > +        ld1             {v0.8b}, [x1], x3
> > +     urhadd          v30.8b, v1.8b, v2.8b
> > +        subs            w4, w4, #1
> > +        uabal           v26.8h, v0.8b, v30.8b
> > +
> > +        b.ne            2b
> > +3:
> > +        uaddlv          s20, v26.8h
> > +        fmov            w0, s20
> > +
> > +        ret
> > +
> > +endfunc
> > +
> > +function ff_pix_abs8_y2_neon, export=1
> > +        // x0           unused
> > +        // x1           uint8_t *pix1
> > +        // x2           uint8_t *pix2
> > +        // x3           ptrdiff_t stride
> > +        // w4           int h
> > +
> > +        cmp             w4, #4
> > +        movi            v26.8h, #0
> > +        ld1             {v1.8b}, [x2], x3
> > +        b.lt            2f
> > +
> > +// make 4 iterations at once
> > +1:
> > +        ld1             {v2.8b}, [x2], x3
> > +        ld1             {v0.8b}, [x1], x3
> > +        ld1             {v6.8b}, [x1], x3
> > +     urhadd          v30.8b, v1.8b, v2.8b
>
> Good thing that you wrote this better than the current
> ff_pix_abs16_y2_neon, reusing the data from the previous round. But the
> scheduling could be much improved here (why load v6 directly after v0 - x1
> will delay it due to being updated in the previous instruction, and v6
> won't be needed for a long time here yet).
>
> By fixing the scheduling of this function (and getting rid of the
> unnecessary mov instruction at the end of the unrolled loop), I got it
> down from 74 cycles to 62 cycles on a Cortex A53. I'm attaching the fixup
> I did, so you can apply it on top instead of having to describe it
> verbally.
>
> > +        ld1             {v5.8b}, [x2], x3
> > +        ld1             {v21.8b}, [x1], x3
> > +        uabal           v26.8h, v0.8b, v30.8b
> > +     urhadd          v29.8b, v2.8b, v5.8b
> > +        ld1             {v20.8b}, [x2], x3
> > +        ld1             {v24.8b}, [x1], x3
> > +        uabal           v26.8h, v6.8b, v29.8b
> > +     urhadd          v28.8b, v5.8b, v20.8b
> > +        uabal           v26.8h, v21.8b, v28.8b
> > +        ld1             {v23.8b}, [x2], x3
> > +        mov             v1.8b, v23.8b
> > +        sub             w4, w4, #4
> > +     urhadd          v27.8b, v20.8b, v23.8b
> > +        cmp             w4, #4
> > +        uabal           v26.8h, v24.8b, v27.8b
> > +
> > +        b.ge            1b
> > +        cbz             w4, 3f
> > +
> > +// iterate by one
> > +2:
> > +        ld1             {v0.8b}, [x1], x3
> > +        ld1             {v2.8b}, [x2], x3
> > +     urhadd          v30.8b, v1.8b, v2.8b
> > +        subs            w4, w4, #1
> > +        uabal           v26.8h, v0.8b, v30.8b
> > +        mov             v1.8b, v2.8b
> > +
> > +        b.ne            2b
> > +3:
> > +        uaddlv          s20, v26.8h
> > +        fmov            w0, s20
> > +
> > +        ret
> > +
> > +endfunc
> > +
> > +function ff_pix_abs8_xy2_neon, export=1
> > +        // x0           unused
> > +        // x1           uint8_t *pix1
> > +        // x2           uint8_t *pix2
> > +        // x3           ptrdiff_t stride
> > +        // w4           int h
> > +
> > +        movi            v31.8h, #0
> > +        add             x0, x2, 1  // pix2 + 1
> > +
> > +        add             x5, x2, x3 // pix2 + stride = pix3
> > +        cmp             w4, #4
> > +        add             x6, x5, 1 // pix3 + stride + 1
> > +
> > +        b.lt            2f
> > +
> > +        ld1             {v0.8b}, [x2], x3
> > +        ld1             {v1.8b}, [x0], x3
> > +        uaddl           v2.8h, v0.8b, v1.8b
>
> If we'd start out with h<4, we'd jump to the 2f label above, missing the
> setup of v0-v2 here.
>
> Other than that, the patch looks reasonable to me.
>
> // Martin
>
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

      reply	other threads:[~2022-09-28 12:22 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-26  9:15 Grzegorz Bernacki
2022-09-26  9:15 ` [FFmpeg-devel] [PATCH 2/4] lavc/aarch64: Provide neon implementation of nsse8 Grzegorz Bernacki
2022-09-28  9:08   ` Martin Storsjö
2022-09-26  9:15 ` [FFmpeg-devel] [PATCH 3/4] lavc/aarch64: Provide optimized implementation of vsse8 for arm64 Grzegorz Bernacki
2022-09-28  9:09   ` Martin Storsjö
2022-09-26  9:15 ` [FFmpeg-devel] [PATCH 4/4] lavc/aarch64: Add neon implementation for vsse_intra8 Grzegorz Bernacki
2022-09-28  9:11   ` Martin Storsjö
2022-09-28  9:06 ` [FFmpeg-devel] [PATCH 1/4] lavc/aarch64: Add neon implementation for pix_abs8 functions Martin Storsjö
2022-09-28 12:22   ` Grzegorz Bernacki [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAA2Cew5bXGsEO-ykYKcuaUeOP+gDOS9b4ovLnfxEQmoKtEVTfg@mail.gmail.com \
    --to=gjb@semihalf.com \
    --cc=ffmpeg-devel@ffmpeg.org \
    --cc=hum@semihalf.com \
    --cc=jswinney@amazon.com \
    --cc=martin@martin.st \
    --cc=mw@semihalf.com \
    --cc=spop@amazon.com \
    --cc=upstream@semihalf.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git