Hi, Martin,

Thanks for your comments.

I have now amended the unreasonable parts of ldp/stp that I have seen.  
And I updated patch 3 and patch 5. (Although I have attached all 5 patches)
In addition, I thought that q8-q15 was required to be saved according to 
the calling convention before, but later I confirmed that it is the 
lower 64bit, thank you for reminding.

Please take a look. If there are some small mistakes, please correct 
them directly. If there are still many problems, please remind me again, 
thank you!


在 2023/7/2 5:28, Martin Storsjö 写道:
> On Sun, 18 Jun 2023, Logan.Lyu wrote:
>
>> Hi, Martin,
>>
>> I modified it according to your comments. Please review again.
>
>> From 47b7f7af634add7680b56a216fff7dbe1f08cd11 Mon Sep 17 00:00:00 2001
>> From: Logan Lyu <Logan.Lyu@myais.com.cn>
>> Date: Sun, 28 May 2023 10:35:43 +0800
>> Subject: [PATCH 5/5] lavc/aarch64: new optimization for 8-bit
>>  hevc_epel_uni_w_hv
>>
>> Signed-off-by: Logan Lyu <Logan.Lyu@myais.com.cn>
>> ---
>>  libavcodec/aarch64/hevcdsp_epel_neon.S    | 694 ++++++++++++++++++++++
>>  libavcodec/aarch64/hevcdsp_init_aarch64.c |   6 +
>>  2 files changed, 700 insertions(+)
>>
>> diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S 
>> b/libavcodec/aarch64/hevcdsp_epel_neon.S
>> index 8b6f396a0b..355679af29 100644
>> --- a/libavcodec/aarch64/hevcdsp_epel_neon.S
>> +++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
>> @@ -717,6 +717,700 @@ function 
>> ff_hevc_put_hevc_epel_uni_w_h64_8_neon_i8mm, export=1
>>          ret
>>  endfunc
>>
>> +.macro epel_uni_w_hv_start
>> +        mov             x15, x5         //denom
>> +        mov             x16, x6         //wx
>> +        mov             x17, x7         //ox
>> +        add             w15, w15, #6    //shift = denom+6
>> +
>> +
>> +        ldp             x5, x6, [sp]
>> +        ldr             x7, [sp, #16]
>> +
>> +        stp             q12, q13, [sp, #-128]!
>> +        stp             q14, q15, [sp, #32]
>> +        stp             q8, q9,   [sp, #64]
>> +        stp             q10, q11, [sp, #96]
>
> Only need to back up 64 bytes, by backing up d8-d15. Also, the order
> is quite weird here, why not keep them in e.g. linear order?
>
>> +function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
>> +        epel_uni_w_hv_start
>> +        sxtw            x4, w4
>> +
>> +        add             x10, x4, #3
>> +        lsl             x10, x10, #7
>> +        sub             sp, sp, x10     // tmp_array
>> +        stp             xzr, x30, [sp, #-48]!
>
> As mentioned already in the previous review - why do you back up and
> restore xzr here? That's not necessary. Yes, you should keep the stack
> 16 byte aligned, but you can just leave an empty slot, and just do
> "str x30, [sp, #-48]!" here, and vice versa with "ldr" instead of ldp
> when restoring.
>
> The same goes in all functions here.
>
>> +2:
>> +        ldp             q14, q15, [sp, #32]
>> +        ldp             q8, q9,   [sp, #64]
>> +        ldp             q10, q11, [sp, #96]
>> +        ldp             q12, q13, [sp], #128
>
> Only need d8-d15, and weird register order here, and elsewhere.
>
>> +function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1
>> +        epel_uni_w_hv_start
>> +        sxtw            x4, w4
>
> FWIW, it's unusual to need an explicit sxtw instruction, but I guess
> if you use it in the form "add x10, x4, #3" it might be needed.
>
>> +function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1
>> +        ldp             x15, x16, [sp]
>> +        stp             x0, x30, [sp, #-16]!
>> +        stp             x1, x2, [sp, #-16]!
>> +        stp             x3, x4, [sp, #-16]!
>> +        stp             x5, x6, [sp, #-16]!
>
> Don't do consecutive stack pointer updates like this, but merge it
> into one large stack decrement followed by positive offsets, like in
> all the other cases of stp/ldp.
>
>> +        mov             x17, #16
>> +        stp             x17, x7, [sp, #-16]!
>> +        stp             x15, x16, [sp, #-16]!
>> +        bl X(ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm)
>> +        ldp             x15, x16, [sp], #16
>> +        ldp             x17, x7, [sp], #16
>> +        ldp             x5, x6, [sp], #16
>> +        ldp             x3, x4, [sp], #16
>> +        ldp             x1, x2, [sp], #16
>> +        ldr             x0, [sp]
>> +        add             x0, x0, #16
>> +        add             x2, x2, #16
>> +        mov             x17, #16
>> +        stp             x17, xzr, [sp, #-16]!
>> +        stp             x15, x16, [sp, #-16]!
>
> Don't do multiple stack decrements, don't needlessly store xzr here.
>
> The same goes for all the other functions in this patch.
>
> // Martin
>
> _______________________________________________
> ffmpeg-devel mailing list
> ffmpeg-devel@ffmpeg.org
> https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
>
> To unsubscribe, visit link above, or email
> ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".