Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
 help / color / mirror / Atom feed
From: "Martin Storsjö" <martin@martin.st>
To: Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org>
Cc: Krzysztof Pyrkosz <ffmpeg@szaka.eu>
Subject: Re: [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12}
Date: Sun, 2 Mar 2025 00:21:23 +0200 (EET)
Message-ID: <ab5b78f9-20ae-a791-91c6-bd2647bbd2c@martin.st> (raw)
In-Reply-To: <20250219174010.3911-2-ffmpeg@szaka.eu>

On Wed, 19 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---

As you've noticed in later patches; most of this commentery _is_ valuable 
to keep in the commit message, so I'd keep most of this, including the 
performance diff, in the commit message (i.e. above the ---).

> This patch replaces integer widening with halving addition, and
> multi-step "emulated" rounding shift with a single asm instruction doing
> exactly that. This pattern repeats in other functions in this file, I
> fixed some in the succeeding patch. There's a lot of performance to be
> gained there.
>
> I didn't modify the existing function because it adds a few extra steps
> solely for the shared w_avg implementation (every cycle matters), but also
> because I find this linear version easier to digest and understand.

That's probably reasonable - but if the avg codepath in vvc_avg is unused 
now, we should remove it; that makes the patch clearer to see the change, 
when we see the removed old codepath together with the new added one in 
the same patch.

> Besides, I noticed that removing smin and smax instructions used for
> clamping the values for 10 and 12 bit_depth instantiations does not
> affect the checkasm result, but it breaks FATE.

It would probably be good if we could improve the checkasm to hit those 
cases too, but that's of course a separate question.

>
> Benchmarks before and after:
> A78
> avg_8_2x2_neon:                                         21.0 ( 1.55x)
> avg_8_4x4_neon:                                         25.8 ( 3.05x)
> avg_8_8x8_neon:                                         45.0 ( 5.86x)
> avg_8_16x16_neon:                                      178.5 ( 5.49x)
> avg_8_32x32_neon:                                      709.2 ( 6.20x)
> avg_8_64x64_neon:                                     2686.2 ( 6.12x)
> avg_8_128x128_neon:                                  10734.2 ( 5.88x)
> avg_10_2x2_neon:                                        19.0 ( 1.75x)
> avg_10_4x4_neon:                                        28.2 ( 2.76x)
> avg_10_8x8_neon:                                        44.0 ( 5.82x)
> avg_10_16x16_neon:                                     179.5 ( 4.81x)
> avg_10_32x32_neon:                                     680.8 ( 5.58x)
> avg_10_64x64_neon:                                    2536.8 ( 5.40x)
> avg_10_128x128_neon:                                 10079.0 ( 5.22x)
> avg_12_2x2_neon:                                        20.8 ( 1.59x)
> avg_12_4x4_neon:                                        25.2 ( 3.09x)
> avg_12_8x8_neon:                                        44.0 ( 5.79x)
> avg_12_16x16_neon:                                     182.2 ( 4.80x)
> avg_12_32x32_neon:                                     696.2 ( 5.46x)
> avg_12_64x64_neon:                                    2548.2 ( 5.38x)
> avg_12_128x128_neon:                                 10133.8 ( 5.19x)
>
> avg_8_2x2_neon:                                         16.5 ( 1.98x)
> avg_8_4x4_neon:                                         26.2 ( 2.93x)
> avg_8_8x8_neon:                                         31.8 ( 8.55x)
> avg_8_16x16_neon:                                       82.0 (12.02x)
> avg_8_32x32_neon:                                      310.2 (14.12x)
> avg_8_64x64_neon:                                      897.8 (18.26x)
> avg_8_128x128_neon:                                   3608.5 (17.37x)
> avg_10_2x2_neon:                                        19.5 ( 1.69x)
> avg_10_4x4_neon:                                        28.0 ( 2.79x)
> avg_10_8x8_neon:                                        34.8 ( 7.32x)
> avg_10_16x16_neon:                                     119.8 ( 7.35x)
> avg_10_32x32_neon:                                     444.2 ( 8.51x)
> avg_10_64x64_neon:                                    1711.8 ( 8.00x)
> avg_10_128x128_neon:                                  7065.2 ( 7.43x)
> avg_12_2x2_neon:                                        19.5 ( 1.71x)
> avg_12_4x4_neon:                                        24.2 ( 3.22x)
> avg_12_8x8_neon:                                        33.8 ( 7.57x)
> avg_12_16x16_neon:                                     120.2 ( 7.33x)
> avg_12_32x32_neon:                                     442.5 ( 8.53x)
> avg_12_64x64_neon:                                    1706.2 ( 8.02x)
> avg_12_128x128_neon:                                  7010.0 ( 7.46x)
>
> A72
> avg_8_2x2_neon:                                         30.2 ( 1.48x)
> avg_8_4x4_neon:                                         40.0 ( 3.10x)
> avg_8_8x8_neon:                                         91.0 ( 4.14x)
> avg_8_16x16_neon:                                      340.4 ( 3.92x)
> avg_8_32x32_neon:                                     1220.7 ( 4.67x)
> avg_8_64x64_neon:                                     5823.4 ( 3.88x)
> avg_8_128x128_neon:                                  17430.5 ( 4.73x)
> avg_10_2x2_neon:                                        34.0 ( 1.66x)
> avg_10_4x4_neon:                                        45.2 ( 2.73x)
> avg_10_8x8_neon:                                        97.5 ( 3.87x)
> avg_10_16x16_neon:                                     317.7 ( 3.90x)
> avg_10_32x32_neon:                                    1376.2 ( 4.21x)
> avg_10_64x64_neon:                                    5228.1 ( 3.71x)
> avg_10_128x128_neon:                                 16722.2 ( 4.17x)
> avg_12_2x2_neon:                                        31.7 ( 1.76x)
> avg_12_4x4_neon:                                        36.0 ( 3.44x)
> avg_12_8x8_neon:                                        91.7 ( 4.10x)
> avg_12_16x16_neon:                                     297.2 ( 4.13x)
> avg_12_32x32_neon:                                    1400.5 ( 4.14x)
> avg_12_64x64_neon:                                    5379.1 ( 3.51x)
> avg_12_128x128_neon:                                 16715.7 ( 4.17x)
>
> avg_8_2x2_neon:                                         33.7 ( 1.72x)
> avg_8_4x4_neon:                                         45.5 ( 2.84x)
> avg_8_8x8_neon:                                         65.0 ( 5.98x)
> avg_8_16x16_neon:                                      171.0 ( 7.81x)
> avg_8_32x32_neon:                                      558.2 (10.05x)
> avg_8_64x64_neon:                                     2006.5 (10.61x)
> avg_8_128x128_neon:                                   9158.7 ( 8.96x)
> avg_10_2x2_neon:                                        38.0 ( 1.92x)
> avg_10_4x4_neon:                                        53.2 ( 2.69x)
> avg_10_8x8_neon:                                        95.2 ( 4.08x)
> avg_10_16x16_neon:                                     243.0 ( 5.02x)
> avg_10_32x32_neon:                                     891.7 ( 5.64x)
> avg_10_64x64_neon:                                    3357.7 ( 5.60x)
> avg_10_128x128_neon:                                 12411.7 ( 5.56x)
> avg_12_2x2_neon:                                        34.7 ( 1.97x)
> avg_12_4x4_neon:                                        53.2 ( 2.68x)
> avg_12_8x8_neon:                                        91.7 ( 4.22x)
> avg_12_16x16_neon:                                     239.0 ( 5.08x)
> avg_12_32x32_neon:                                     895.7 ( 5.62x)
> avg_12_64x64_neon:                                    3317.5 ( 5.67x)
> avg_12_128x128_neon:                                 12358.5 ( 5.58x)
>
>
> A53
> avg_8_2x2_neon:                                         58.3 ( 1.41x)
> avg_8_4x4_neon:                                        101.8 ( 2.21x)
> avg_8_8x8_neon:                                        178.6 ( 4.53x)
> avg_8_16x16_neon:                                      569.5 ( 5.01x)
> avg_8_32x32_neon:                                     1962.5 ( 5.50x)
> avg_8_64x64_neon:                                     8327.8 ( 5.18x)
> avg_8_128x128_neon:                                  31631.3 ( 5.34x)
> avg_10_2x2_neon:                                        54.5 ( 1.56x)
> avg_10_4x4_neon:                                        88.8 ( 2.53x)
> avg_10_8x8_neon:                                       163.6 ( 4.97x)
> avg_10_16x16_neon:                                     550.5 ( 5.16x)
> avg_10_32x32_neon:                                    1942.5 ( 5.64x)
> avg_10_64x64_neon:                                    8783.5 ( 4.98x)
> avg_10_128x128_neon:                                 32617.0 ( 5.25x)
> avg_12_2x2_neon:                                        53.3 ( 1.66x)
> avg_12_4x4_neon:                                        86.8 ( 2.61x)
> avg_12_8x8_neon:                                       156.6 ( 5.12x)
> avg_12_16x16_neon:                                     541.3 ( 5.25x)
> avg_12_32x32_neon:                                    1955.3 ( 5.59x)
> avg_12_64x64_neon:                                    8686.0 ( 5.06x)
> avg_12_128x128_neon:                                 32487.5 ( 5.25x)
>
> avg_8_2x2_neon:                                         39.5 ( 1.96x)
> avg_8_4x4_neon:                                         65.3 ( 3.41x)
> avg_8_8x8_neon:                                        168.8 ( 4.79x)
> avg_8_16x16_neon:                                      348.0 ( 8.20x)
> avg_8_32x32_neon:                                     1207.5 ( 8.98x)
> avg_8_64x64_neon:                                     6032.3 ( 7.17x)
> avg_8_128x128_neon:                                  22008.5 ( 7.69x)
> avg_10_2x2_neon:                                        55.5 ( 1.52x)
> avg_10_4x4_neon:                                        73.8 ( 3.08x)
> avg_10_8x8_neon:                                       157.8 ( 5.12x)
> avg_10_16x16_neon:                                     445.0 ( 6.43x)
> avg_10_32x32_neon:                                    1587.3 ( 6.87x)
> avg_10_64x64_neon:                                    7738.0 ( 5.68x)
> avg_10_128x128_neon:                                 27813.8 ( 6.14x)
> avg_12_2x2_neon:                                        48.3 ( 1.80x)
> avg_12_4x4_neon:                                        77.0 ( 2.95x)
> avg_12_8x8_neon:                                       161.5 ( 4.98x)
> avg_12_16x16_neon:                                     433.5 ( 6.59x)
> avg_12_32x32_neon:                                    1622.0 ( 6.75x)
> avg_12_64x64_neon:                                    7844.5 ( 5.60x)
> avg_12_128x128_neon:                                 26999.5 ( 6.34x)
>
> Krzysztof
>
> libavcodec/aarch64/vvc/inter.S | 124 ++++++++++++++++++++++++++++++++-
> 1 file changed, 121 insertions(+), 3 deletions(-)

Overall the change looks reasonable to me, thanks, but remove the now 
unused parts and update the patch to include the valuable comments and 
benchmarks above the "---" bit.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".

      parent reply	other threads:[~2025-03-01 22:21 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-02-19 17:40 Krzysztof Pyrkosz via ffmpeg-devel
2025-02-19 17:40 ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
2025-02-20  8:08   ` Zhao Zhili
2025-03-01 22:34   ` Martin Storsjö
2025-02-20  7:20 ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
2025-02-20 18:49   ` Krzysztof Pyrkosz via ffmpeg-devel
2025-02-20 18:49     ` [FFmpeg-devel] [PATCH 2/2] avcodec/aarch64/vvc: Use rounding shift NEON instruction Krzysztof Pyrkosz via ffmpeg-devel
2025-02-26  8:54     ` [FFmpeg-devel] [PATCH 1/2] avcodec/aarch64/vvc: Optimize vvc_avg{8, 10, 12} Zhao Zhili
2025-03-01 22:21 ` Martin Storsjö [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ab5b78f9-20ae-a791-91c6-bd2647bbd2c@martin.st \
    --to=martin@martin.st \
    --cc=ffmpeg-devel@ffmpeg.org \
    --cc=ffmpeg@szaka.eu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

This inbox may be cloned and mirrored by anyone:

	git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
		ffmpegdev@gitmailbox.com
	public-inbox-index ffmpegdev

Example config snippet for mirrors.


AGPL code for this site: git clone https://public-inbox.org/public-inbox.git