Re: [FFmpeg-devel] [PATCH v4] avcodec/aarch64/hevc: add luma deblock NEON

From: "J. Dekker" <jdek@itanimul.li>
To: ffmpeg-devel@ffmpeg.org
Subject: Re: [FFmpeg-devel] [PATCH v4] avcodec/aarch64/hevc: add luma deblock NEON
Date: Wed, 28 Feb 2024 09:30:02 +0100
Message-ID: <87msrl7yq9.fsf@itanimul.li>
In-Reply-To: <9bd6fe36-1871-e34f-ae32-ca53a8e7911@martin.st>

Martin Storsjö <martin@martin.st> writes:

> On Wed, 28 Feb 2024, J. Dekker wrote:
>
>>
>> Martin Storsjö <martin@martin.st> writes:
>>
>>> On Tue, 27 Feb 2024, J. Dekker wrote:
>>>
>>>> Benched using single-threaded full decode on an Ampere Altra.
>>>>
>>>> Bpp Before  After  Speedup
>>>> 8   73,3s   65,2s  1.124x
>>>> 10  114,2s  104,0s 1.098x
>>>> 12  125,8s  115,7s 1.087x
>>>>
>>>> Signed-off-by: J. Dekker <jdek@itanimul.li>
>>>> ---
>>>>
>>>> Slightly improved 12bit version.
>>>>
>>>> libavcodec/aarch64/hevcdsp_deblock_neon.S | 417 ++++++++++++++++++++++
>>>> libavcodec/aarch64/hevcdsp_init_aarch64.c |  18 +
>>>> 2 files changed, 435 insertions(+)
>>>>
>>>> diff --git a/libavcodec/aarch64/hevcdsp_deblock_neon.S b/libavcodec/aarch64/hevcdsp_deblock_neon.S
>>>> index 8227f65649..581056a91e 100644
>>>> --- a/libavcodec/aarch64/hevcdsp_deblock_neon.S
>>>> +++ b/libavcodec/aarch64/hevcdsp_deblock_neon.S
>>>> @@ -181,3 +181,420 @@ hevc_h_loop_filter_chroma 12
>>>> hevc_v_loop_filter_chroma 8
>>>> hevc_v_loop_filter_chroma 10
>>>> hevc_v_loop_filter_chroma 12
>>>> +
>>>> +.macro hevc_loop_filter_luma_body bitdepth
>>>> +function hevc_loop_filter_luma_body_\bitdepth\()_neon, export=0
>>>> +.if \bitdepth > 8
>>>> +        lsl             w2, w2, #(\bitdepth - 8) // beta <<= BIT_DEPTH - 8
>>>> +.else
>>>> +        uxtl            v0.8h, v0.8b
>>>> +        uxtl            v1.8h, v1.8b
>>>> +        uxtl            v2.8h, v2.8b
>>>> +        uxtl            v3.8h, v3.8b
>>>> +        uxtl            v4.8h, v4.8b
>>>> +        uxtl            v5.8h, v5.8b
>>>> +        uxtl            v6.8h, v6.8b
>>>> +        uxtl            v7.8h, v7.8b
>>>> +.endif
>>>> +        ldr             w7, [x3] // tc[0]
>>>> +        ldr             w8, [x3, #4] // tc[1]
>>>> +        dup             v18.4h, w7
>>>> +        dup             v19.4h, w8
>>>> +        trn1            v18.2d, v18.2d, v19.2d
>>>> +.if \bitdepth > 8
>>>> +        shl             v18.8h, v18.8h, #(\bitdepth - 8)
>>>> +.endif
>>>> +        dup             v27.8h, w2 // beta
>>>> +        // tc25
>>>> +        shl             v19.8h, v18.8h, #2 // * 4
>>>> +        add             v19.8h, v19.8h, v18.8h // (tc * 5)
>>>> +        srshr           v19.8h, v19.8h, #1 // (tc * 5 + 1) >> 1
>>>> +        sshr            v17.8h, v27.8h, #2 // beta2
>>>> +
>>>> +        ////// beta_2 check
>>>> +        // dp0  = abs(P2  - 2 * P1  + P0)
>>>> +        add             v22.8h, v3.8h, v1.8h
>>>> +        shl             v23.8h, v2.8h, #1
>>>> +        sabd            v30.8h, v22.8h, v23.8h
>>>> +        // dq0  = abs(Q2  - 2 * Q1  + Q0)
>>>> +        add             v21.8h, v6.8h, v4.8h
>>>> +        shl             v26.8h, v5.8h, #1
>>>> +        sabd            v31.8h, v21.8h, v26.8h
>>>> +        // d0   = dp0 + dq0
>>>> +        add             v20.8h, v30.8h, v31.8h
>>>> +        shl             v25.8h, v20.8h, #1
>>>> +        // (d0 << 1) < beta_2
>>>> +        cmgt            v23.8h, v17.8h, v25.8h
>>>> +
>>>> +        ////// beta check
>>>> +        // d0 + d3 < beta
>>>> +        mov             x9, #0xFFFF00000000FFFF
>>>> +        dup             v24.2d, x9
>>>> +        and             v25.16b, v24.16b, v20.16b
>>>> +        addp            v25.8h, v25.8h, v25.8h // 1+0 0+1 1+0 0+1
>>>> +        addp            v25.4h, v25.4h, v25.4h // 1+0+0+1 1+0+0+1
>>>> +        cmgt            v25.4h, v27.4h, v25.4h // lower/upper mask in h[0/1]
>>>> +        mov             w9, v25.s[0]
>>>
>>> I don't quite understand what this sequence does and/or how our data is laid
>>> out in our registers - we have d0 on input in v20, where's d3? And doesn't the
>>> "and" throw away half of the input elements here?
>>>
>>> I see some similar patterns with the masking and handling below as well - I get
>>> a feeling that I don't quite understand the algorithm here, and/or the data
>>> layout.
>>
>> We have d0, d1, d2, d3 for both 4 line blocks in v20, mask out d1/d2 and
>> use pair-wise adds to move our data around and calculate d0+d3
>> together. The first addp just moves elements around, the second addp
>> adds d0 + 0 + 0 + d3.
>
> Right, I guess this is the bit that was surprising. I would have expected to
> have e.g. all the d0 values for e.g. the 8 individual pixels in one SIMD
> register, and all the d3 values for all pixels in another SIMD register.
>
> So as we're operating on 8 pixels in parallel, each of those 8 pixels have
> their own d0/d3 values, right? Or is this a case where we have just one d0/d3
> value for a range of pixels?

Yes, there is one set of d0/d1/d2/d3 per block of 4 lines of 8 pixels,
not one per pixel: each value is calculated within its own line, d0 from
line 0 and d3 from line 3. It is probably more confusing because we are
doing both halves of the filter at the same time. v20 contains
d0 d1 d2 d3 d0 d1 d2 d3, where the second set belongs to the second
block of 4 lines and is distinct from the first.

But essentially we're doing the same operation across the entire 8
lines; the filter just makes an overall skip decision for each block of
4 lines based on the sum of the results from lines 0 and 3.
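
As a rough illustration of the and/addp/addp/cmgt sequence discussed
above, here is a scalar C model of what ends up in w9. It is only a
sketch: the names addp_8h, beta_check_model and beta_check_example are
made up for this mail and are not part of FFmpeg, and it assumes the
usual little-endian lane order, with d[] standing in for v20.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of ADDP Vd.8h, Vn.8h, Vm.8h: sums of adjacent pairs of n
 * fill the low four lanes, sums of adjacent pairs of m the high four. */
static void addp_8h(int16_t dst[8], const int16_t n[8], const int16_t m[8])
{
    int16_t tmp[8];
    for (int i = 0; i < 4; i++) {
        tmp[i]     = (int16_t)(n[2 * i] + n[2 * i + 1]);
        tmp[i + 4] = (int16_t)(m[2 * i] + m[2 * i + 1]);
    }
    for (int i = 0; i < 8; i++)
        dst[i] = tmp[i];
}

/* d[] models v20: d0 d1 d2 d3 of the first 4-line block, then d0 d1 d2 d3
 * of the second. Returns the value that "mov w9, v25.s[0]" would read:
 * bits 0-15 are all ones if d0 + d3 < beta for the first block,
 * bits 16-31 likewise for the second block. */
static uint32_t beta_check_model(const int16_t d[8], int16_t beta)
{
    /* Per-lane view of dup v24.2d, 0xFFFF00000000FFFF: keep lanes 0 and 3
     * of each 64-bit half. */
    static const int keep[8] = { 1, 0, 0, 1, 1, 0, 0, 1 };
    int16_t v[8];

    for (int i = 0; i < 8; i++)
        v[i] = keep[i] ? d[i] : 0;          /* and:  d0 0 0 d3 d0' 0 0 d3' */

    addp_8h(v, v, v);                       /* addp: d0 d3 d0' d3' (twice) */

    /* The second addp works on .4h, so only lanes 0-3 matter here. */
    int16_t sum0 = (int16_t)(v[0] + v[1]);  /* d0  + d3  */
    int16_t sum1 = (int16_t)(v[2] + v[3]);  /* d0' + d3' */

    /* cmgt is a signed compare: lane = beta > sum ? 0xFFFF : 0 */
    uint32_t lo = beta > sum0 ? 0xFFFFu : 0u;
    uint32_t hi = beta > sum1 ? 0xFFFFu : 0u;
    return lo | (hi << 16);
}

/* Example: the first block passes (1 + 2 < 6), the second does not (3 + 4). */
static void beta_check_example(void)
{
    const int16_t d[8] = { 1, 9, 9, 2, 3, 9, 9, 4 };
    assert(beta_check_model(d, 6) == 0x0000FFFFu);
}
```

The d1/d2 lanes (the 9s in the example) never influence the result,
which is the point of the "and": only lines 0 and 3 of each block take
part in the beta decision.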

-- 
jd