From: "J. Dekker" <jdek@itanimul.li>
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH v4] avcodec/aarch64/hevc: add luma deblock NEON
Date: Wed, 28 Feb 2024 09:30:02 +0100
Message-ID: <87msrl7yq9.fsf@itanimul.li>
In-Reply-To: <9bd6fe36-1871-e34f-ae32-ca53a8e7911@martin.st>
Martin Storsjö <martin@martin.st> writes:
> On Wed, 28 Feb 2024, J. Dekker wrote:
>
>>
>> Martin Storsjö <martin@martin.st> writes:
>>
>>> On Tue, 27 Feb 2024, J. Dekker wrote:
>>>
>>>> Benched using single-threaded full decode on an Ampere Altra.
>>>>
>>>> Bpp  Before   After    Speedup
>>>> 8    73,3s    65,2s    1.124x
>>>> 10   114,2s   104,0s   1.098x
>>>> 12   125,8s   115,7s   1.087x
>>>>
>>>> Signed-off-by: J. Dekker <jdek@itanimul.li>
>>>> ---
>>>>
>>>> Slightly improved 12bit version.
>>>>
>>>> libavcodec/aarch64/hevcdsp_deblock_neon.S | 417 ++++++++++++++++++++++
>>>> libavcodec/aarch64/hevcdsp_init_aarch64.c | 18 +
>>>> 2 files changed, 435 insertions(+)
>>>>
>>>> diff --git a/libavcodec/aarch64/hevcdsp_deblock_neon.S b/libavcodec/aarch64/hevcdsp_deblock_neon.S
>>>> index 8227f65649..581056a91e 100644
>>>> --- a/libavcodec/aarch64/hevcdsp_deblock_neon.S
>>>> +++ b/libavcodec/aarch64/hevcdsp_deblock_neon.S
>>>> @@ -181,3 +181,420 @@ hevc_h_loop_filter_chroma 12
>>>> hevc_v_loop_filter_chroma 8
>>>> hevc_v_loop_filter_chroma 10
>>>> hevc_v_loop_filter_chroma 12
>>>> +
>>>> +.macro hevc_loop_filter_luma_body bitdepth
>>>> +function hevc_loop_filter_luma_body_\bitdepth\()_neon, export=0
>>>> +.if \bitdepth > 8
>>>> + lsl w2, w2, #(\bitdepth - 8) // beta <<= BIT_DEPTH - 8
>>>> +.else
>>>> + uxtl v0.8h, v0.8b
>>>> + uxtl v1.8h, v1.8b
>>>> + uxtl v2.8h, v2.8b
>>>> + uxtl v3.8h, v3.8b
>>>> + uxtl v4.8h, v4.8b
>>>> + uxtl v5.8h, v5.8b
>>>> + uxtl v6.8h, v6.8b
>>>> + uxtl v7.8h, v7.8b
>>>> +.endif
>>>> + ldr w7, [x3] // tc[0]
>>>> + ldr w8, [x3, #4] // tc[1]
>>>> + dup v18.4h, w7
>>>> + dup v19.4h, w8
>>>> + trn1 v18.2d, v18.2d, v19.2d
>>>> +.if \bitdepth > 8
>>>> + shl v18.8h, v18.8h, #(\bitdepth - 8)
>>>> +.endif
>>>> + dup v27.8h, w2 // beta
>>>> + // tc25
>>>> + shl v19.8h, v18.8h, #2 // * 4
>>>> + add v19.8h, v19.8h, v18.8h // (tc * 5)
>>>> + srshr v19.8h, v19.8h, #1 // (tc * 5 + 1) >> 1
>>>> + sshr v17.8h, v27.8h, #2 // beta2
>>>> +
>>>> + ////// beta_2 check
>>>> + // dp0 = abs(P2 - 2 * P1 + P0)
>>>> + add v22.8h, v3.8h, v1.8h
>>>> + shl v23.8h, v2.8h, #1
>>>> + sabd v30.8h, v22.8h, v23.8h
>>>> + // dq0 = abs(Q2 - 2 * Q1 + Q0)
>>>> + add v21.8h, v6.8h, v4.8h
>>>> + shl v26.8h, v5.8h, #1
>>>> + sabd v31.8h, v21.8h, v26.8h
>>>> + // d0 = dp0 + dq0
>>>> + add v20.8h, v30.8h, v31.8h
>>>> + shl v25.8h, v20.8h, #1
>>>> + // (d0 << 1) < beta_2
>>>> + cmgt v23.8h, v17.8h, v25.8h
>>>> +
>>>> + ////// beta check
>>>> + // d0 + d3 < beta
>>>> + mov x9, #0xFFFF00000000FFFF
>>>> + dup v24.2d, x9
>>>> + and v25.16b, v24.16b, v20.16b
>>>> + addp v25.8h, v25.8h, v25.8h // 1+0 0+1 1+0 0+1
>>>> + addp v25.4h, v25.4h, v25.4h // 1+0+0+1 1+0+0+1
>>>> + cmgt v25.4h, v27.4h, v25.4h // lower/upper mask in h[0/1]
>>>> + mov w9, v25.s[0]
>>>
>>> I don't quite understand what this sequence does and/or how our data is laid
>>> out in our registers - we have d0 on input in v20, where's d3? And doesn't the
>>> "and" throw away half of the input elements here?
>>>
>>> I see some similar patterns with the masking and handling below as well - I get
>>> a feeling that I don't quite understand the algorithm here, and/or the data
>>> layout.
>>
>> We have d0, d1, d2, d3 for both 4 line blocks in v20, mask out d1/d2 and
>> use pair-wise adds to move our data around and calculate d0+d3
>> together. The first addp just moves elements around, the second addp
>> adds d0 + 0 + 0 + d3.
>
> Right, I guess this is the bit that was surprising. I would have expected to
> have e.g. all the d0 values for e.g. the 8 individual pixels in one SIMD
> register, and all the d3 values for all pixels in another SIMD register.
>
> So as we're operating on 8 pixels in parallel, each of those 8 pixels have
> their own d0/d3 values, right? Or is this a case where we have just one d0/d3
> value for a range of pixels?

It's the latter: there is one d0/d1/d2/d3 set per block of 4 lines of 8
pixels, not per pixel. Each value is calculated within its own line, d0
from line 0 and d3 from line 3. It may be more confusing because we are
doing both halves of the filter at the same time: v20 contains
d0 d1 d2 d3 d0 d1 d2 d3, where the second d0..d3 belong to the second
block of 4 lines and are distinct from the first.
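
To make the lane movement concrete, here is a rough scalar model in C of
the and/addp/addp/cmgt sequence, assuming the v20 layout described above.
The names addp_h and beta_check_model are made up for illustration and
are not part of the patch; it is only a sketch of what ends up in w9.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of ADDP on n 16-bit lanes (n = 8 for .8h, n = 4 for .4h):
 * the low half of the result holds the pairwise sums of a, the high half
 * the pairwise sums of b. dst may alias a and b. */
static void addp_h(uint16_t *dst, const uint16_t *a, const uint16_t *b, int n)
{
    uint16_t tmp[8];
    for (int i = 0; i < n / 2; i++) {
        tmp[i]         = (uint16_t)(a[2 * i] + a[2 * i + 1]);
        tmp[n / 2 + i] = (uint16_t)(b[2 * i] + b[2 * i + 1]);
    }
    memcpy(dst, tmp, (size_t)n * sizeof(*tmp));
}

/* Hypothetical helper: d[] is v20 (d0 d1 d2 d3 | d0 d1 d2 d3), the return
 * value models what "mov w9, v25.s[0]" reads: the mask for the first
 * block of 4 lines in bits 0-15, for the second block in bits 16-31. */
static uint32_t beta_check_model(const uint16_t d[8], int16_t beta)
{
    /* 0xFFFF00000000FFFF duplicated into both 64-bit halves */
    static const uint16_t keep[8] = { 0xFFFF, 0, 0, 0xFFFF,
                                      0xFFFF, 0, 0, 0xFFFF };
    uint16_t v[8];
    uint32_t w9 = 0;

    for (int i = 0; i < 8; i++)
        v[i] = d[i] & keep[i];      /* and:  d0 0 0 d3 | d0 0 0 d3    */
    addp_h(v, v, v, 8);             /* addp: d0 d3 d0' d3' (repeated) */
    addp_h(v, v, v, 4);             /* addp: d0+d3 d0'+d3' (repeated) */

    for (int i = 0; i < 2; i++)     /* cmgt: signed beta > sum        */
        if (beta > (int16_t)v[i])
            w9 |= 0xFFFFu << (16 * i);
    return w9;
}
```

With d = {3, 100, 200, 4, 10, 0, 0, 10} and beta = 8, only the first
block passes (3 + 4 < 8), whatever d1 and d2 contain.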

Essentially we're doing the same operation across all 8 lines; the
filter just makes an overall skip decision for each block of 4 lines,
based on the sum of the results from lines 0 and 3.
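
For reference, the per-block decision described above can be sketched in
plain C roughly as follows. This is a simplified 8-bit-only illustration
of the dp0/dq0 formulas from the comments in the patch, not the actual C
template; line_activity and luma_blocks_to_filter are hypothetical names,
and pix is assumed to point at Q0 of line 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* dp + dq for one line: abs(P2 - 2 * P1 + P0) + abs(Q2 - 2 * Q1 + Q0).
 * xstride steps across the edge, so P0 is pix[-xstride]. */
static int line_activity(const uint8_t *pix, ptrdiff_t xstride)
{
    const int p2 = pix[-3 * xstride], p1 = pix[-2 * xstride];
    const int p0 = pix[-xstride];
    const int q0 = pix[0], q1 = pix[xstride], q2 = pix[2 * xstride];

    return abs(p2 - 2 * p1 + p0) + abs(q2 - 2 * q1 + q0);
}

/* Returns a bitmask over the two blocks of 4 lines: bit j is set when
 * block j passes the beta check, i.e. when it is not skipped. Only
 * lines 0 and 3 of each block take part in the decision. */
static unsigned luma_blocks_to_filter(const uint8_t *pix, ptrdiff_t xstride,
                                      ptrdiff_t ystride, int beta)
{
    unsigned mask = 0;

    for (int j = 0; j < 2; j++) {
        const uint8_t *blk = pix + 4 * j * ystride;
        const int d0 = line_activity(blk, xstride);
        const int d3 = line_activity(blk + 3 * ystride, xstride);

        if (d0 + d3 < beta)
            mask |= 1u << j;
    }
    return mask;
}
```

For a vertical edge in an 8x8 buffer one would call it as
luma_blocks_to_filter(buf + 4, 1, 8, beta). Disturbing a pixel in line 1
or 2 of a block does not change the result, while disturbing line 0 or 3
can.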
--
jd
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".