Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
From: "Swinney, Jonathan" <jswinney@amazon.com>
To: "FFmpeg development discussions and patches"
	<ffmpeg-devel@ffmpeg.org>, "Martin Storsjö" <martin@martin.st>
Subject: Re: [FFmpeg-devel] [PATCH 4/5] aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon
Date: Fri, 15 Jul 2022 19:38:08 +0000
Message-ID: <C8514E1B-E302-449C-807C-E9813E2617AA@amazon.com>
In-Reply-To: <20220713204854.3114817-4-martin@martin.st>

LGTM.

-- 

Jonathan Swinney

On 7/13/22, 3:49 PM, "ffmpeg-devel on behalf of Martin Storsjö" <ffmpeg-devel-bounces@ffmpeg.org on behalf of martin@martin.st> wrote:

    Using absolute-difference-accumulate does use twice the amount of
    absolute-difference instructions, but avoids the need for the
    uaddl and add instructions, reducing the total number of instructions
    by 3.

    These can be interleaved in the rest of the calculation, to avoid
    tight dependencies at the end. Unfortunately, this is marginally
    slower on Cortex A53, but faster on A72 and A73.

    Before:            Cortex A53     A72     A73   Graviton 3
    pix_abs_0_3_neon:       175.7   109.2    92.0         41.2
    After:
    pix_abs_0_3_neon:       179.7    96.7    87.5         41.2
    ---
     libavcodec/aarch64/me_cmp_neon.S | 32 +++++++++++---------------------
     1 file changed, 11 insertions(+), 21 deletions(-)

    diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
    index 0ae23d8922..89546869fb 100644
    --- a/libavcodec/aarch64/me_cmp_neon.S
    +++ b/libavcodec/aarch64/me_cmp_neon.S
    @@ -124,6 +124,9 @@ function ff_pix_abs16_xy2_neon, export=1
             add             v26.8h, v30.8h, v2.8h       // add up 0..7, using pix2 + pix2+1 values from pix3 above
             add             v27.8h, v31.8h, v3.8h       // add up 8..15, using pix2 + pix2+1 values from pix3 above

    +        uabdl           v24.8h, v1.8b,  v23.8b      // absolute difference 0..7, i=0
    +        uabdl2          v23.8h, v1.16b, v23.16b     // absolute difference 8..15, i=0
    +
             ld1             {v21.16b}, [x5], x3         // load pix3
             ld1             {v20.16b}, [x1], x3         // load pix1

    @@ -137,6 +140,9 @@ function ff_pix_abs16_xy2_neon, export=1
             rshrn           v28.8b, v28.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v28.16b, v29.8h, #2         // shift right 2 8..15

    +        uabal           v24.8h, v16.8b,  v26.8b     // absolute difference 0..7, i=1
    +        uabal2          v23.8h, v16.16b, v26.16b    // absolute difference 8..15, i=1
    +
             uaddl           v2.8h, v21.8b, v22.8b       // pix3 + pix3+1 0..7
             uaddl2          v3.8h, v21.16b, v22.16b     // pix3 + pix3+1 8..15
             add             v30.8h, v4.8h, v2.8h        // add up 0..7, using pix2 + pix2+1 values from pix3 above
    @@ -144,33 +150,17 @@ function ff_pix_abs16_xy2_neon, export=1
             rshrn           v30.8b, v30.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v30.16b, v31.8h, #2         // shift right 2 8..15

    -        // Averages are now stored in these registers:
    -        // v23, v16, v28, v30
    -        // pix1 values in these registers:
    -        // v1, v16, v17, v20
    -        // available:
    -        // v4, v5, v7, v18, v19, v24, v25, v27, v29, v31
    +        uabal           v24.8h, v17.8b,  v28.8b     // absolute difference 0..7, i=2
    +        uabal2          v23.8h, v17.16b, v28.16b    // absolute difference 8..15, i=2

             sub             w4, w4, #4                  // h -= 4

    -        // Using absolute-difference instructions instead of absolute-difference-accumulate allows
    -        // us to keep the results in 16b vectors instead of widening values with twice the instructions.
    -        // This approach also has fewer data dependencies, allowing better instruction level parallelism.
    -        uabd            v4.16b, v1.16b, v23.16b     // absolute difference 0..15, i=0
    -        uabd            v5.16b, v16.16b, v26.16b    // absolute difference 0..15, i=1
    -        uabd            v6.16b, v17.16b, v28.16b    // absolute difference 0..15, i=2
    -        uabd            v7.16b, v20.16b, v30.16b    // absolute difference 0..15, i=3
    +        uabal           v24.8h, v20.8b,  v30.8b     // absolute difference 0..7, i=3
    +        uabal2          v23.8h, v20.16b, v30.16b    // absolute difference 8..15, i=3

             cmp             w4, #4                      // loop if h >= 4

    -        // Now add up all the values in each vector, v4-v7 with widening adds
    -        uaddl           v19.8h, v4.8b, v5.8b
    -        uaddl2          v18.8h, v4.16b, v5.16b
    -        uaddl           v4.8h, v6.8b, v7.8b
    -        uaddl2          v5.8h, v6.16b, v7.16b
    -        add             v4.8h, v4.8h, v5.8h
    -        add             v4.8h, v4.8h, v18.8h
    -        add             v4.8h, v4.8h, v19.8h
    +        add             v4.8h, v23.8h, v24.8h
             uaddlv          s4, v4.8h                   // finish adding up accumulated values
             add             d0, d0, d4                  // add the value to the top level accumulator

    --
    2.25.1

    _______________________________________________
    ffmpeg-devel mailing list
    ffmpeg-devel@ffmpeg.org
    https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

    To unsubscribe, visit link above, or email
    ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
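
As a rough illustration of the trade-off described in the commit message above, the scalar C sketch below models one four-row loop iteration in both ways: the old uabd + uaddl/uaddl2 + add sequence, and the new uabdl/uabal accumulation. It is not FFmpeg code. The function names, the array layout, and the helper compare_schemes are made up for this sketch, and plain arrays only stand in for the NEON registers. The rounded 2x2 average is assumed to be (a + b + c + d + 2) >> 2, matching the "rounding shift right" comments in the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define W    16  /* pixels per row */
#define ROWS 4   /* rows handled per loop iteration */

/* Rounded 2x2 average of the reference block: (a + b + c + d + 2) >> 2,
 * modelling the uaddl/add/rshrn #2 sequence (hypothetical helper). */
void average_xy2(uint8_t ref[ROWS + 1][W + 1], uint8_t avg[ROWS][W])
{
    for (int i = 0; i < ROWS; i++)
        for (int x = 0; x < W; x++)
            avg[i][x] = (uint8_t)((ref[i][x] + ref[i][x + 1] +
                                   ref[i + 1][x] + ref[i + 1][x + 1] + 2) >> 2);
}

/* Old scheme: uabd keeps 8-bit differences, uaddl/uaddl2 widen and add
 * them pairwise, three adds fold the vectors, uaddlv sums the lanes. */
unsigned sum_uabd_uaddl(uint8_t pix1[ROWS][W], uint8_t avg[ROWS][W])
{
    uint8_t d[ROWS][W];
    unsigned total = 0;

    for (int i = 0; i < ROWS; i++)
        for (int x = 0; x < W; x++)
            d[i][x] = (uint8_t)abs(pix1[i][x] - avg[i][x]);    /* uabd */

    for (int l = 0; l < 8; l++) {
        uint16_t lo01 = (uint16_t)(d[0][l] + d[1][l]);         /* uaddl  */
        uint16_t hi01 = (uint16_t)(d[0][l + 8] + d[1][l + 8]); /* uaddl2 */
        uint16_t lo23 = (uint16_t)(d[2][l] + d[3][l]);         /* uaddl  */
        uint16_t hi23 = (uint16_t)(d[2][l + 8] + d[3][l + 8]); /* uaddl2 */
        uint16_t lane = (uint16_t)(lo01 + hi01 + lo23 + hi23); /* add x3 */
        total += lane;                                         /* uaddlv */
    }
    return total;
}

/* New scheme: uabdl/uabdl2 start two widened accumulators from row 0,
 * uabal/uabal2 accumulate rows 1..3, one add merges them before uaddlv. */
unsigned sum_uabdl_uabal(uint8_t pix1[ROWS][W], uint8_t avg[ROWS][W])
{
    uint16_t acc_lo[8], acc_hi[8];
    unsigned total = 0;

    for (int l = 0; l < 8; l++) {
        acc_lo[l] = (uint16_t)abs(pix1[0][l] - avg[0][l]);         /* uabdl  */
        acc_hi[l] = (uint16_t)abs(pix1[0][l + 8] - avg[0][l + 8]); /* uabdl2 */
    }
    for (int i = 1; i < ROWS; i++)
        for (int l = 0; l < 8; l++) {
            /* uabal on the low half, uabal2 on the high half */
            acc_lo[l] = (uint16_t)(acc_lo[l] + abs(pix1[i][l] - avg[i][l]));
            acc_hi[l] = (uint16_t)(acc_hi[l] + abs(pix1[i][l + 8] - avg[i][l + 8]));
        }
    for (int l = 0; l < 8; l++)
        total += (uint16_t)(acc_lo[l] + acc_hi[l]);  /* add, then uaddlv */
    return total;
}

/* Fills one iteration's worth of pseudo-random pixels, checks that both
 * schemes agree, and returns the common sum (hypothetical helper). */
unsigned compare_schemes(unsigned seed)
{
    uint8_t pix1[ROWS][W], ref[ROWS + 1][W + 1], avg[ROWS][W];

    srand(seed);
    for (int i = 0; i < ROWS; i++)
        for (int x = 0; x < W; x++)
            pix1[i][x] = (uint8_t)(rand() % 256);
    for (int i = 0; i < ROWS + 1; i++)
        for (int x = 0; x < W + 1; x++)
            ref[i][x] = (uint8_t)(rand() % 256);

    average_xy2(ref, avg);
    unsigned old_sum = sum_uabd_uaddl(pix1, avg);
    unsigned new_sum = sum_uabdl_uabal(pix1, avg);
    assert(old_sum == new_sum);
    return new_sum;
}
```

Calling compare_schemes() with a few different seeds from a small main() is enough to exercise it. In this model each 16-bit lane holds at most 8 * 255 = 2040 per iteration, so neither scheme overflows its lanes within one iteration. The sketch only checks that the results match and says nothing about the timing differences reported in the benchmark table.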

Thread overview: 20+ messages
2022-07-13 20:48 [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Martin Storsjö
2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 2/5] checkasm: motion: Make the benchmarks more stable Martin Storsjö
2022-07-15 19:35   ` Swinney, Jonathan
2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 3/5] aarch64: me_cmp: Interleave some of the loads in ff_pix_abs16_xy2_neon Martin Storsjö
2022-07-15 19:34   ` Swinney, Jonathan
2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 4/5] aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon Martin Storsjö
2022-07-15 19:38   ` Swinney, Jonathan [this message]
2022-07-13 20:48 ` [FFmpeg-devel] [PATCH 5/5] aarch64: me_cmp: Don't do uaddlv once per iteration Martin Storsjö
2022-07-15 19:32   ` Swinney, Jonathan
2022-07-15 19:56     ` Martin Storsjö
2022-07-15 21:19       ` Michael Niedermayer
2022-07-15 21:25         ` Martin Storsjö
2022-07-16 11:23           ` Michael Niedermayer
2022-07-16 12:30             ` Martin Storsjö
2022-07-16 13:20               ` Michael Niedermayer
2022-07-16 14:23                 ` Martin Storsjö
2022-07-16 12:50             ` Ronald S. Bultje
2022-07-16 13:06               ` Michael Niedermayer
2022-07-16  9:18         ` Martin Storsjö
2022-07-15 19:35 ` [FFmpeg-devel] [PATCH 1/5] libavcodec: aarch64: Don't clobber v8 in the h%4 case in ff_pix_abs16_xy2_neon Swinney, Jonathan
