Hi, Martin, Thanks for your comments. I have now amended the unreasonable parts of ldp/stp that I have seen.  And I updated patch 3 and patch 5. (Although I have attached all 5 patches) In addition, I thought that q8-q15 was required to be saved according to the calling convention before, but later I confirmed that it is the lower 64bit, thank you for reminding. Please take a look. If there are some small mistakes, please correct them directly. If there are still many problems, please remind me again, thank you! 在 2023/7/2 5:28, Martin Storsjö 写道: > On Sun, 18 Jun 2023, Logan.Lyu wrote: > >> Hi, Martin, >> >> I modified it according to your comments. Please review again. > >> From 47b7f7af634add7680b56a216fff7dbe1f08cd11 Mon Sep 17 00:00:00 2001 >> From: Logan Lyu >> Date: Sun, 28 May 2023 10:35:43 +0800 >> Subject: [PATCH 5/5] lavc/aarch64: new optimization for 8-bit >>  hevc_epel_uni_w_hv >> >> Signed-off-by: Logan Lyu >> --- >>  libavcodec/aarch64/hevcdsp_epel_neon.S    | 694 ++++++++++++++++++++++ >>  libavcodec/aarch64/hevcdsp_init_aarch64.c |   6 + >>  2 files changed, 700 insertions(+) >> >> diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S >> b/libavcodec/aarch64/hevcdsp_epel_neon.S >> index 8b6f396a0b..355679af29 100644 >> --- a/libavcodec/aarch64/hevcdsp_epel_neon.S >> +++ b/libavcodec/aarch64/hevcdsp_epel_neon.S >> @@ -717,6 +717,700 @@ function >> ff_hevc_put_hevc_epel_uni_w_h64_8_neon_i8mm, export=1 >>          ret >>  endfunc >> >> +.macro epel_uni_w_hv_start >> +        mov             x15, x5         //denom >> +        mov             x16, x6         //wx >> +        mov             x17, x7         //ox >> +        add             w15, w15, #6    //shift = denom+6 >> + >> + >> +        ldp             x5, x6, [sp] >> +        ldr             x7, [sp, #16] >> + >> +        stp             q12, q13, [sp, #-128]! >> +        stp             q14, q15, [sp, #32] >> +        stp             q8, q9,   [sp, #64] >> +        stp             q10, q11, [sp, #96] > > Only need to back up 64 bytes, by backing up d8-d15. Also, the order > is quite weird here, why not keep them in e.g. linear order? > >> +function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1 >> +        epel_uni_w_hv_start >> +        sxtw            x4, w4 >> + >> +        add             x10, x4, #3 >> +        lsl             x10, x10, #7 >> +        sub             sp, sp, x10     // tmp_array >> +        stp             xzr, x30, [sp, #-48]! > > As mentioned already in the previous review - why do you back up and > restore xzr here? That's not necessary. Yes, you should keep the stack > 16 byte aligned, but you can just leave an empty slot, and just do > "str x30, [sp, #-48]!" here, and vice versa with "ldr" instead of ldp > when restoring. > > The same goes in all functions here. > >> +2: >> +        ldp             q14, q15, [sp, #32] >> +        ldp             q8, q9,   [sp, #64] >> +        ldp             q10, q11, [sp, #96] >> +        ldp             q12, q13, [sp], #128 > > Only need d8-d15, and weird register order here, and elsewhere. > >> +function ff_hevc_put_hevc_epel_uni_w_hv24_8_neon_i8mm, export=1 >> +        epel_uni_w_hv_start >> +        sxtw            x4, w4 > > FWIW, it's unusual to need an explicit sxtw instruction, but I guess > if you use it in the form "add x10, x4, #3" it might be needed. > >> +function ff_hevc_put_hevc_epel_uni_w_hv32_8_neon_i8mm, export=1 >> +        ldp             x15, x16, [sp] >> +        stp             x0, x30, [sp, #-16]! >> +        stp             x1, x2, [sp, #-16]! >> +        stp             x3, x4, [sp, #-16]! >> +        stp             x5, x6, [sp, #-16]! > > Don't do consecutive stack pointer updates like this, but merge it > into one large stack decrement followed by positive offsets, like in > all the other cases of stp/ldp. > >> +        mov             x17, #16 >> +        stp             x17, x7, [sp, #-16]! >> +        stp             x15, x16, [sp, #-16]! >> +        bl X(ff_hevc_put_hevc_epel_uni_w_hv16_8_neon_i8mm) >> +        ldp             x15, x16, [sp], #16 >> +        ldp             x17, x7, [sp], #16 >> +        ldp             x5, x6, [sp], #16 >> +        ldp             x3, x4, [sp], #16 >> +        ldp             x1, x2, [sp], #16 >> +        ldr             x0, [sp] >> +        add             x0, x0, #16 >> +        add             x2, x2, #16 >> +        mov             x17, #16 >> +        stp             x17, xzr, [sp, #-16]! >> +        stp             x15, x16, [sp, #-16]! > > Don't do multiple stack decrements, don't needlessly store xzr here. > > The same goes for all the other functions in this patch. > > // Martin > > _______________________________________________ > ffmpeg-devel mailing list > ffmpeg-devel@ffmpeg.org > https://ffmpeg.org/mailman/listinfo/ffmpeg-devel > > To unsubscribe, visit link above, or email > ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".