Hi Martin,

I have modified the patch according to your comments. Please review it again.

Here are the checkasm benchmark results for the related functions:

put_hevc_epel_uni_w_v4_8_c: 116.1
put_hevc_epel_uni_w_v4_8_neon: 48.6
put_hevc_epel_uni_w_v6_8_c: 248.9
put_hevc_epel_uni_w_v6_8_neon: 80.6
put_hevc_epel_uni_w_v8_8_c: 383.9
put_hevc_epel_uni_w_v8_8_neon: 91.9
put_hevc_epel_uni_w_v12_8_c: 806.1
put_hevc_epel_uni_w_v12_8_neon: 202.9
put_hevc_epel_uni_w_v16_8_c: 1411.1
put_hevc_epel_uni_w_v16_8_neon: 289.9
put_hevc_epel_uni_w_v24_8_c: 3168.9
put_hevc_epel_uni_w_v24_8_neon: 619.4
put_hevc_epel_uni_w_v32_8_c: 5632.9
put_hevc_epel_uni_w_v32_8_neon: 1161.1
put_hevc_epel_uni_w_v48_8_c: 12406.1
put_hevc_epel_uni_w_v48_8_neon: 2476.4
put_hevc_epel_uni_w_v64_8_c: 22001.4
put_hevc_epel_uni_w_v64_8_neon: 4343.9

On 2023/6/12 16:09, Martin Storsjö wrote:
> On Sun, 4 Jun 2023, Logan.Lyu@myais.com.cn wrote:
>
>> From: Logan Lyu
>>
>> Signed-off-by: Logan Lyu
>> ---
>> libavcodec/aarch64/hevcdsp_epel_neon.S    | 504 ++++++++++++++++++++++
>> libavcodec/aarch64/hevcdsp_init_aarch64.c |   6 +
>> 2 files changed, 510 insertions(+)
>>
>> diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
>> index fe494dd843..4841f49dab 100644
>> --- a/libavcodec/aarch64/hevcdsp_epel_neon.S
>> +++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
>
>
>> +function ff_hevc_put_hevc_epel_uni_w_v48_8_neon, export=1
>> +        stp             q8, q9, [sp, #-32]
>> +        stp             q10, q11, [sp, #-64]
>
> This backs up values on the stack without decrementing the stack
> pointer, i.e. storing it in the red zone. Whether this is supported
> depends on the platform ABI. Linux and macOS have a 128 byte red zone
> on aarch64, while Windows only has 16 bytes. So for portability, don't
> rely on a red zone at all.
>
> I.e., here please decrement the stack pointer like in a previous patch:
>
>     stp q8,  q9,  [sp, #-64]!
>     stp q10, q11, [sp, #32]
>
> And inversely when restoring it.
>
>> +        EPEL_UNI_W_V_HEADER
>> +
>> +        ld1             {v16.16b, v17.16b, v18.16b}, [x2], x3
>> +        ld1             {v19.16b, v20.16b, v21.16b}, [x2], x3
>> +        ld1             {v22.16b, v23.16b, v24.16b}, [x2], x3
>> +1:
>> +        ld1             {v25.16b, v26.16b, v27.16b}, [x2], x3
>> +
>> +        EPEL_UNI_W_V16_CALC v4, v6, v16, v19, v22, v25, v8, v9, v10, v11
>> +        EPEL_UNI_W_V16_CALC v5, v7, v17, v20, v23, v26, v8, v9, v10, v11
>> +        EPEL_UNI_W_V16_CALC v6, v7, v18, v21, v24, v27, v8, v9, v10, v11
>> +        st1             {v4.16b, v5.16b, v6.16b}, [x0], x1
>> +        subs            w4, w4, #1
>> +        b.eq            2f
>> +        ld1             {v16.16b, v17.16b, v18.16b}, [x2], x3
>> +        EPEL_UNI_W_V16_CALC v4, v6, v19, v22, v25, v16, v8, v9, v10, v11
>> +        EPEL_UNI_W_V16_CALC v5, v7, v20, v23, v26, v17, v8, v9, v10, v11
>> +        EPEL_UNI_W_V16_CALC v6, v7, v21, v24, v27, v18, v8, v9, v10, v11
>> +        st1             {v4.16b, v5.16b, v6.16b}, [x0], x1
>> +        subs            w4, w4, #1
>> +        b.eq            2f
>> +        ld1             {v19.16b, v20.16b, v21.16b}, [x2], x3
>> +        EPEL_UNI_W_V16_CALC v4, v6,  v22, v25, v16, v19, v8, v9, v10, v11
>> +        EPEL_UNI_W_V16_CALC v5, v7,  v23, v26, v17, v20, v8, v9, v10, v11
>> +        EPEL_UNI_W_V16_CALC v6, v7,  v24, v27, v18, v21, v8, v9, v10, v11
>> +        st1             {v4.16b, v5.16b, v6.16b}, [x0], x1
>> +        subs            w4, w4, #1
>> +        b.eq            2f
>> +        ld1             {v22.16b, v23.16b, v24.16b}, [x2], x3
>> +        EPEL_UNI_W_V16_CALC v4, v6,  v25, v16, v19, v22, v8, v9, v10, v11
>> +        EPEL_UNI_W_V16_CALC v5, v7,  v26, v17, v20, v23, v8, v9, v10, v11
>> +        EPEL_UNI_W_V16_CALC v6, v7,  v27, v18, v21, v24, v8, v9, v10, v11
>> +        st1             {v4.16b, v5.16b, v6.16b}, [x0], x1
>> +        subs            w4, w4, #1
>> +        b.hi            1b
>> +2:
>> +        ldp             q8, q9, [sp, #-32]
>> +        ldp             q10, q11, [sp, #-64]
>> +        ret
>> +endfunc
>> +
>> +function ff_hevc_put_hevc_epel_uni_w_v64_8_neon, export=1
>> +        stp             q8, q9, [sp, #-32]
>> +        stp             q10, q11, [sp, #-64]
>> +        stp             q12, q13, [sp, #-96]
>> +        stp             q14, q15, [sp, #-128]
>
> Same thing here.
>
> // Martin
>
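
For the v64 function flagged at the end of the review, the same pattern extends to four register pairs. The following is only a minimal sketch of the save/restore sequence, assuming the function still needs q8-q15 as in the quoted patch; the function name and the `function`/`endfunc` macros are taken from the quoted diff, the body is elided, and this is not necessarily what the updated patch contains:

```asm
function ff_hevc_put_hevc_epel_uni_w_v64_8_neon, export=1
        // Reserve 128 bytes with a pre-indexed store, so that nothing
        // is written below the stack pointer (no red zone needed).
        stp             q8,  q9,  [sp, #-128]!
        stp             q10, q11, [sp, #32]
        stp             q12, q13, [sp, #64]
        stp             q14, q15, [sp, #96]

        // ... function body elided ...

        // Restore in the inverse order; the final post-indexed load
        // releases the 128 bytes again.
        ldp             q10, q11, [sp, #32]
        ldp             q12, q13, [sp, #64]
        ldp             q14, q15, [sp, #96]
        ldp             q8,  q9,  [sp], #128
        ret
endfunc
```

The 128-byte adjustment keeps sp 16-byte aligned, and every early exit from the loop has to pass through the restore sequence before `ret`.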