From: "Swinney, Jonathan" <jswinney@amazon.com>
To: "FFmpeg development discussions and patches" <ffmpeg-devel@ffmpeg.org>,
	"Martin Storsjö" <martin@martin.st>
Subject: Re: [FFmpeg-devel] [PATCH 4/5] aarch64: me_cmp: Switch from uabd to uabal in ff_pix_abs16_xy2_neon
Date: Fri, 15 Jul 2022 19:38:08 +0000
Message-ID: <C8514E1B-E302-449C-807C-E9813E2617AA@amazon.com> (raw)
In-Reply-To: <20220713204854.3114817-4-martin@martin.st>

LGTM.

-- 

Jonathan Swinney

On 7/13/22, 3:49 PM, "ffmpeg-devel on behalf of Martin Storsjö" <ffmpeg-devel-bounces@ffmpeg.org on behalf of martin@martin.st> wrote:

    Using absolute-difference-accumulate does use twice the amount of
    absolute-difference instructions, but avoids the need for the uaddl
    and add instructions, reducing the total number of instructions by 3.

    These can be interleaved in the rest of the calculation, to avoid
    tight dependencies at the end.

    Unfortunately, this is marginally slower on Cortex A53, but faster
    on A72 and A73.
    Before:            Cortex A53     A72     A73   Graviton 3
    pix_abs_0_3_neon:       175.7   109.2    92.0         41.2
    After:
    pix_abs_0_3_neon:       179.7    96.7    87.5         41.2
    ---
     libavcodec/aarch64/me_cmp_neon.S | 32 +++++++++++---------------------
     1 file changed, 11 insertions(+), 21 deletions(-)

    diff --git a/libavcodec/aarch64/me_cmp_neon.S b/libavcodec/aarch64/me_cmp_neon.S
    index 0ae23d8922..89546869fb 100644
    --- a/libavcodec/aarch64/me_cmp_neon.S
    +++ b/libavcodec/aarch64/me_cmp_neon.S
    @@ -124,6 +124,9 @@ function ff_pix_abs16_xy2_neon, export=1
             add             v26.8h, v30.8h, v2.8h       // add up 0..7, using pix2 + pix2+1 values from pix3 above
             add             v27.8h, v31.8h, v3.8h       // add up 8..15, using pix2 + pix2+1 values from pix3 above

    +        uabdl           v24.8h, v1.8b,  v23.8b      // absolute difference 0..7, i=0
    +        uabdl2          v23.8h, v1.16b, v23.16b     // absolute difference 8..15, i=0
    +
             ld1             {v21.16b}, [x5], x3         // load pix3
             ld1             {v20.16b}, [x1], x3         // load pix1

    @@ -137,6 +140,9 @@ function ff_pix_abs16_xy2_neon, export=1
             rshrn           v28.8b, v28.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v28.16b, v29.8h, #2         // shift right 2 8..15

    +        uabal           v24.8h, v16.8b,  v26.8b     // absolute difference 0..7, i=1
    +        uabal2          v23.8h, v16.16b, v26.16b    // absolute difference 8..15, i=1
    +
             uaddl           v2.8h, v21.8b, v22.8b       // pix3 + pix3+1 0..7
             uaddl2          v3.8h, v21.16b, v22.16b     // pix3 + pix3+1 8..15
             add             v30.8h, v4.8h, v2.8h        // add up 0..7, using pix2 + pix2+1 values from pix3 above
    @@ -144,33 +150,17 @@ function ff_pix_abs16_xy2_neon, export=1
             rshrn           v30.8b, v30.8h, #2          // shift right 2 0..7 (rounding shift right)
             rshrn2          v30.16b, v31.8h, #2         // shift right 2 8..15

    -        // Averages are now stored in these registers:
    -        // v23, v16, v28, v30
    -        // pix1 values in these registers:
    -        // v1, v16, v17, v20
    -        // available:
    -        // v4, v5, v7, v18, v19, v24, v25, v27, v29, v31
    +        uabal           v24.8h, v17.8b,  v28.8b     // absolute difference 0..7, i=2
    +        uabal2          v23.8h, v17.16b, v28.16b    // absolute difference 8..15, i=2

             sub             w4, w4, #4                  // h -= 4

    -        // Using absolute-difference instructions instead of absolute-difference-accumulate allows
    -        // us to keep the results in 16b vectors instead of widening values with twice the instructions.
    -        // This approach also has fewer data dependencies, allowing better instruction level parallelism.
    -        uabd            v4.16b, v1.16b, v23.16b     // absolute difference 0..15, i=0
    -        uabd            v5.16b, v16.16b, v26.16b    // absolute difference 0..15, i=1
    -        uabd            v6.16b, v17.16b, v28.16b    // absolute difference 0..15, i=2
    -        uabd            v7.16b, v20.16b, v30.16b    // absolute difference 0..15, i=3
    +        uabal           v24.8h, v20.8b,  v30.8b     // absolute difference 0..7, i=3
    +        uabal2          v23.8h, v20.16b, v30.16b    // absolute difference 8..15, i=3

             cmp             w4, #4                      // loop if h >= 4

    -        // Now add up all the values in each vector, v4-v7 with widening adds
    -        uaddl           v19.8h, v4.8b, v5.8b
    -        uaddl2          v18.8h, v4.16b, v5.16b
    -        uaddl           v4.8h, v6.8b, v7.8b
    -        uaddl2          v5.8h, v6.16b, v7.16b
    -        add             v4.8h, v4.8h, v5.8h
    -        add             v4.8h, v4.8h, v18.8h
    -        add             v4.8h, v4.8h, v19.8h
    +        add             v4.8h, v23.8h, v24.8h
             uaddlv          s4, v4.8h                   // finish adding up accumulated values
             add             d0, d0, d4                  // add the value to the top level accumulator

    --
    2.25.1

    _______________________________________________
    ffmpeg-devel mailing list
    ffmpeg-devel@ffmpeg.org
    https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

    To unsubscribe, visit link above, or email
    ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
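
For readers who want to check the equivalence the commit message relies on, here is a minimal scalar sketch in C of the two accumulation schemes for one 4-row, 16-pixel block. It is only a model of the lane arithmetic, not the NEON code from the patch: the function names `sad_uabd_then_widen` and `sad_uabal`, the helper `absdiff`, and the assumption that the four rows are packed contiguously (stride 16) are all hypothetical and chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

enum { ROWS = 4, WIDTH = 16, HALF = 8 };

/* Absolute difference of two unsigned bytes (one lane of uabd). */
static uint8_t absdiff(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : b - a);
}

/* Old scheme: 8-bit differences for every row (uabd), then widening
 * pairwise adds (uaddl/uaddl2) followed by three 16-bit adds. */
unsigned sad_uabd_then_widen(const uint8_t *p, const uint8_t *q)
{
    uint8_t d[ROWS][WIDTH];
    unsigned sum = 0;

    assert(p && q);
    for (int r = 0; r < ROWS; r++)
        for (int x = 0; x < WIDTH; x++)
            d[r][x] = absdiff(p[r * WIDTH + x], q[r * WIDTH + x]);

    for (int x = 0; x < HALF; x++) {
        uint16_t lo01 = (uint16_t)(d[0][x] + d[1][x]);               /* uaddl  */
        uint16_t hi01 = (uint16_t)(d[0][x + HALF] + d[1][x + HALF]); /* uaddl2 */
        uint16_t lo23 = (uint16_t)(d[2][x] + d[3][x]);               /* uaddl  */
        uint16_t hi23 = (uint16_t)(d[2][x + HALF] + d[3][x + HALF]); /* uaddl2 */
        /* three adds per lane, then the cross-lane sum (uaddlv) */
        sum += (unsigned)lo01 + hi01 + lo23 + hi23;
    }
    return sum;
}

/* New scheme: two 16-bit accumulators (low and high half of the row),
 * started with a widening difference (uabdl/uabdl2) and then updated
 * by absolute-difference-accumulate (uabal/uabal2) for rows 1..3. */
unsigned sad_uabal(const uint8_t *p, const uint8_t *q)
{
    uint16_t acc_lo[HALF], acc_hi[HALF];
    unsigned sum = 0;

    assert(p && q);
    for (int x = 0; x < HALF; x++) {
        acc_lo[x] = absdiff(p[x], q[x]);               /* uabdl  */
        acc_hi[x] = absdiff(p[x + HALF], q[x + HALF]); /* uabdl2 */
    }
    for (int r = 1; r < ROWS; r++) {
        for (int x = 0; x < HALF; x++) {
            acc_lo[x] = (uint16_t)(acc_lo[x] +
                absdiff(p[r * WIDTH + x], q[r * WIDTH + x]));               /* uabal  */
            acc_hi[x] = (uint16_t)(acc_hi[x] +
                absdiff(p[r * WIDTH + x + HALF], q[r * WIDTH + x + HALF])); /* uabal2 */
        }
    }
    /* one add of the two accumulators, then the cross-lane sum (uaddlv) */
    for (int x = 0; x < HALF; x++)
        sum += (unsigned)acc_lo[x] + acc_hi[x];
    return sum;
}
```

Each 16-bit lane holds at most 4 * 255 = 1020 in either scheme, so neither can overflow within one loop iteration. The model says nothing about speed; the relative cost on each core is what the benchmark figures in the commit message measure.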