From: "Martin Storsjö" <martin@martin.st>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Logan Lyu <Logan.Lyu@myais.com.cn>
Subject: Re: [FFmpeg-devel] [PATCH 5/5] lavc/aarch64: new optimization for 8-bit hevc_epel_uni_w_hv
Date: Mon, 12 Jun 2023 11:19:29 +0300 (EEST)
Message-ID: <c07cf4d-6cd8-36c4-124c-11da8aed149@martin.st>
In-Reply-To: <20230604041756.5196-5-Logan.Lyu@myais.com.cn>

On Sun, 4 Jun 2023, Logan.Lyu@myais.com.cn wrote:

> From: Logan Lyu <Logan.Lyu@myais.com.cn>
>
> Signed-off-by: Logan Lyu <Logan.Lyu@myais.com.cn>
> ---
>  libavcodec/aarch64/hevcdsp_epel_neon.S    | 703 ++++++++++++++++++++++
>  libavcodec/aarch64/hevcdsp_init_aarch64.c |   7 +
>  2 files changed, 710 insertions(+)
>
> diff --git a/libavcodec/aarch64/hevcdsp_epel_neon.S b/libavcodec/aarch64/hevcdsp_epel_neon.S
> index 32f052a7b1..24a74d2c7d 100644
> --- a/libavcodec/aarch64/hevcdsp_epel_neon.S
> +++ b/libavcodec/aarch64/hevcdsp_epel_neon.S
> @@ -718,6 +718,709 @@ function ff_hevc_put_hevc_epel_uni_w_h64_8_neon_i8mm, export=1
>          ret
>  endfunc
>
> +.macro epel_uni_w_hv_start
> +        mov             x15, x5         //denom
> +        mov             x16, x6         //wx
> +        mov             x17, x7         //ox
> +        add             w15, w15, #6    //shift = denom+6
> +
> +
> +        ldp             x5, x6, [sp]
> +        ldp             x7, xzr, [sp, #16]

Why ldp into xzr, that seems pointless?

> +
> +        sub             sp, sp, #128
> +        stp             q12, q13, [sp]

This could be "stp q12, q13, [sp, #-128]!"

> +        stp             q14, q15, [sp, #32]
> +        stp             q8, q9, [sp, #64]
> +        stp             q10, q11, [sp, #96]
> +
> +        dup             v13.8h, w16     //wx
> +        dup             v14.4s, w17     //ox
> +
> +        mov             w17, #1
> +        lsl             w17, w17, w15
> +        lsr             w17, w17, #1
> +        dup             v15.4s, w17
> +
> +        neg             w15, w15        // -shift
> +        dup             v12.4s, w15     //shift
> +.endm
> +
> +.macro epel_uni_w_hv_end
> +        smull           v28.4s, v4.4h, v13.4h
> +        smull2          v29.4s, v4.8h, v13.8h
> +        add             v28.4s, v28.4s, v15.4s
> +        add             v29.4s, v29.4s, v15.4s
> +        sshl            v28.4s, v28.4s, v12.4s
> +        sshl            v29.4s, v29.4s, v12.4s
> +        add             v28.4s, v28.4s, v14.4s
> +        add             v29.4s, v29.4s, v14.4s
> +        sqxtn           v4.4h, v28.4s
> +        sqxtn2          v4.8h, v29.4s
> +.endm
> +
> +.macro epel_uni_w_hv_end2
> +        smull           v28.4s, v4.4h, v13.4h
> +        smull2          v29.4s, v4.8h, v13.8h
> +        smull           v30.4s, v5.4h, v13.4h
> +        smull2          v31.4s, v5.8h, v13.8h
> +        add             v28.4s, v28.4s, v15.4s
> +        add             v29.4s, v29.4s, v15.4s
> +        add             v30.4s, v30.4s, v15.4s
> +        add             v31.4s, v31.4s, v15.4s
> +
> +        sshl            v28.4s, v28.4s, v12.4s
> +        sshl            v29.4s, v29.4s, v12.4s
> +        sshl            v30.4s, v30.4s, v12.4s
> +        sshl            v31.4s, v31.4s, v12.4s
> +
> +        add             v28.4s, v28.4s, v14.4s
> +        add             v29.4s, v29.4s, v14.4s
> +        add             v30.4s, v30.4s, v14.4s
> +        add             v31.4s, v31.4s, v14.4s
> +
> +        sqxtn           v4.4h, v28.4s
> +        sqxtn2          v4.8h, v29.4s
> +        sqxtn           v5.4h, v30.4s
> +        sqxtn2          v5.8h, v31.4s
> +.endm
> +
> +.macro epel_uni_w_hv_end3
> +        smull           v1.4s, v4.4h, v13.4h
> +        smull2          v2.4s, v4.8h, v13.8h
> +        smull           v28.4s, v5.4h, v13.4h
> +        smull2          v29.4s, v5.8h, v13.8h
> +        smull           v30.4s, v6.4h, v13.4h
> +        smull2          v31.4s, v6.8h, v13.8h
> +        add             v1.4s, v1.4s, v15.4s
> +        add             v2.4s, v2.4s, v15.4s
> +        add             v28.4s, v28.4s, v15.4s
> +        add             v29.4s, v29.4s, v15.4s
> +        add             v30.4s, v30.4s, v15.4s
> +        add             v31.4s, v31.4s, v15.4s
> +
> +        sshl            v1.4s, v1.4s, v12.4s
> +        sshl            v2.4s, v2.4s, v12.4s
> +        sshl            v28.4s, v28.4s, v12.4s
> +        sshl            v29.4s, v29.4s, v12.4s
> +        sshl            v30.4s, v30.4s, v12.4s
> +        sshl            v31.4s, v31.4s, v12.4s
> +        add             v1.4s, v1.4s, v14.4s
> +        add             v2.4s, v2.4s, v14.4s
> +        add             v28.4s, v28.4s, v14.4s
> +        add             v29.4s, v29.4s, v14.4s
> +        add             v30.4s, v30.4s, v14.4s
> +        add             v31.4s, v31.4s, v14.4s
> +
> +        sqxtn           v4.4h, v1.4s
> +        sqxtn2          v4.8h, v2.4s
> +        sqxtn           v5.4h, v28.4s
> +        sqxtn2          v5.8h, v29.4s
> +        sqxtn           v6.4h, v30.4s
> +        sqxtn2          v6.8h, v31.4s
> +.endm
> +
> +.macro calc_epelh dst, src0, src1, src2, src3
> +        smull           \dst\().4s, \src0\().4h, v0.h[0]
> +        smlal           \dst\().4s, \src1\().4h, v0.h[1]
> +        smlal           \dst\().4s, \src2\().4h, v0.h[2]
> +        smlal           \dst\().4s, \src3\().4h, v0.h[3]
> +        sqshrn          \dst\().4h, \dst\().4s, #6
> +.endm
> +
> +.macro calc_epelh2 dst, tmp, src0, src1, src2, src3
> +        smull2          \tmp\().4s, \src0\().8h, v0.h[0]
> +        smlal2          \tmp\().4s, \src1\().8h, v0.h[1]
> +        smlal2          \tmp\().4s, \src2\().8h, v0.h[2]
> +        smlal2          \tmp\().4s, \src3\().8h, v0.h[3]
> +        sqshrn2         \dst\().8h, \tmp\().4s, #6
> +.endm
> +
> +.macro load_epel_filterh freg, xreg
> +        movrel          \xreg, epel_filters
> +        add             \xreg, \xreg, \freg, lsl #2
> +        ld1             {v0.8b}, [\xreg]
> +        sxtl            v0.8h, v0.8b
> +.endm
> +
> +function ff_hevc_put_hevc_epel_uni_w_hv4_8_neon_i8mm, export=1
> +        epel_uni_w_hv_start
> +        and             x4, x4, 0xffffffff

What does this "and" do here? Is it a case where the argument is "int",
while the upper bits of the register are undefined? In those cases, you're
best off just using "w4", possibly "w4, uxtw" (or sxtw) instead of
manually doing such an "and" here.

> +
> +        add             x10, x4, #3
> +        lsl             x10, x10, #7
> +        sub             sp, sp, x10     // tmp_array
> +        stp             x0, x1, [sp, #-16]!
> +        stp             x4, x6, [sp, #-16]!
> +        stp             xzr, x30, [sp, #-16]!

Don't do consecutive decrements like this, but do one
"stp ..., [sp, #-48]!" followed by "stp ..., [sp, #16]" etc.
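
To make the question about the "and" concrete, here is a small C model of the three options. It is only a sketch: the uint64_t parameter stands in for x4 with possibly stale upper bits, and the helper names are made up for illustration. The last helper mirrors the "add x10, x4, #3" / "lsl x10, x10, #7" pair, assuming the shift by 7 corresponds to rows of MAX_PB_SIZE * 2 bytes.

```c
#include <stdint.h>

/* Hypothetical helpers; x4 models a 64-bit register whose low 32 bits
 * hold the "int height" argument and whose upper bits may be garbage. */

/* and x4, x4, 0xffffffff: clears bits 63..32. */
static uint64_t and_mask(uint64_t x4)
{
    return x4 & 0xffffffffULL;
}

/* "w4, uxtw" used as an operand: zero-extends the low word,
 * which yields the same value as the explicit and. */
static uint64_t op_uxtw(uint64_t x4)
{
    return (uint64_t)(uint32_t)x4;
}

/* "w4, sxtw" used as an operand: sign-extends the low word.
 * The uint32_t -> int32_t conversion is implementation-defined before
 * C23 but behaves as two's complement on mainstream compilers. */
static int64_t op_sxtw(uint64_t x4)
{
    return (int64_t)(int32_t)(uint32_t)x4;
}

/* add x10, x4, #3 ; lsl x10, x10, #7
 * -> bytes reserved on the stack for tmp_array (height + 3 rows). */
static uint64_t tmp_array_bytes(uint64_t height)
{
    return (height + 3) << 7;
}
```

For a non-negative height all three give the same result, so the choice between them only affects how many instructions are needed.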

> +        add             x0, sp, #48
> +        sub             x1, x2, x3
> +        mov             x2, x3
> +        add             x3, x4, #3
> +        mov             x4, x5
> +        bl              X(ff_hevc_put_hevc_epel_h4_8_neon_i8mm)
> +        ldp             xzr, x30, [sp], #16
> +        ldp             x4, x6, [sp], #16
> +        ldp             x0, x1, [sp], #16
> +        load_epel_filterh x6, x5
> +        mov             x10, #(MAX_PB_SIZE * 2)
> +        ld1             {v16.4h}, [sp], x10
> +        ld1             {v17.4h}, [sp], x10
> +        ld1             {v18.4h}, [sp], x10
> +1:      ld1             {v19.4h}, [sp], x10
> +        calc_epelh      v4, v16, v17, v18, v19
> +        epel_uni_w_hv_end
> +        sqxtun          v4.8b, v4.8h
> +        str             s4, [x0]
> +        add             x0, x0, x1
> +        subs            x4, x4, #1
> +        b.eq            2f
> +
> +        ld1             {v16.4h}, [sp], x10
> +        calc_epelh      v4, v17, v18, v19, v16
> +        epel_uni_w_hv_end
> +        sqxtun          v4.8b, v4.8h
> +        str             s4, [x0]
> +        add             x0, x0, x1
> +        subs            x4, x4, #1
> +        b.eq            2f
> +
> +        ld1             {v17.4h}, [sp], x10
> +        calc_epelh      v4, v18, v19, v16, v17
> +        epel_uni_w_hv_end
> +        sqxtun          v4.8b, v4.8h
> +        str             s4, [x0]
> +        add             x0, x0, x1
> +        subs            x4, x4, #1
> +        b.eq            2f
> +
> +        ld1             {v18.4h}, [sp], x10
> +        calc_epelh      v4, v19, v16, v17, v18
> +        epel_uni_w_hv_end
> +        sqxtun          v4.8b, v4.8h
> +        str             s4, [x0]
> +        add             x0, x0, x1
> +        subs            x4, x4, #1
> +        b.ne            1b
> +2:
> +        ldp             q12, q13, [sp]
> +        ldp             q14, q15, [sp, #32]
> +        ldp             q8, q9, [sp, #64]
> +        ldp             q10, q11, [sp, #96]
> +        add             sp, sp, #128

Fold the stack increment into ldp, like "ldp q12, q13, [sp], #128".

The same thing applies to all other functions in this patch too.
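
As a reference for checking the hv4 loop above, the per-sample arithmetic can be written as a scalar C sketch. This is one reading of the calc_epelh and epel_uni_w_hv_end macros plus the final sqxtun, not code from the patch; the function and helper names are hypothetical, and the right shifts of negative values rely on the arithmetic-shift behaviour of mainstream compilers.

```c
#include <assert.h>
#include <stdint.h>

/* Saturate to int16_t, as sqshrn/sqxtn do. */
static int16_t sat16(int32_t v)
{
    return v > INT16_MAX ? INT16_MAX : v < INT16_MIN ? INT16_MIN : (int16_t)v;
}

/* Saturate to uint8_t, as sqxtun does. */
static uint8_t sat_u8(int32_t v)
{
    return v < 0 ? 0 : v > 255 ? 255 : (uint8_t)v;
}

/* rows: four vertically adjacent 16-bit intermediates from the h pass.
 * taps: the four vertical filter coefficients loaded into v0. */
static uint8_t epel_uni_w_hv_sample(const int16_t rows[4], const int8_t taps[4],
                                    int denom, int wx, int ox)
{
    int shift = denom + 6;               /* add w15, w15, #6            */
    assert(denom >= 0 && shift < 31);
    int32_t round = (1 << shift) >> 1;   /* lsl / lsr #1 into v15       */

    int32_t acc = 0;
    for (int i = 0; i < 4; i++)          /* smull + 3x smlal            */
        acc += rows[i] * taps[i];
    int16_t v = sat16(acc >> 6);         /* sqshrn #6 (no rounding)     */

    int32_t w = (int32_t)v * (int16_t)wx; /* smull by wx held as .8h    */
    w = ((w + round) >> shift) + ox;     /* add v15, sshl by -shift, add v14 */
    return sat_u8(sat16(w));             /* sqxtn, then sqxtun          */
}
```

With four equal rows of 6400, taps summing to 64 (for example -2, 58, 10, -2), denom = 0, wx = 1 and ox = 0, this returns 100.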

> diff --git a/libavcodec/aarch64/hevcdsp_init_aarch64.c b/libavcodec/aarch64/hevcdsp_init_aarch64.c
> index 348497bbbe..fbbc4e6071 100644
> --- a/libavcodec/aarch64/hevcdsp_init_aarch64.c
> +++ b/libavcodec/aarch64/hevcdsp_init_aarch64.c
> @@ -189,6 +189,11 @@ NEON8_FNPROTO(qpel_uni_w_h, (uint8_t *_dst, ptrdiff_t _dststride,
>          int height, int denom, int wx, int ox,
>          intptr_t mx, intptr_t my, int width), _i8mm);
>
> +NEON8_FNPROTO(epel_uni_w_hv, (uint8_t *_dst, ptrdiff_t _dststride,
> +        const uint8_t *_src, ptrdiff_t _srcstride,
> +        int height, int denom, int wx, int ox,
> +        intptr_t mx, intptr_t my, int width), _i8mm);
> +
>  NEON8_FNPROTO_PARTIAL_5(qpel_uni_w_hv, (uint8_t *_dst, ptrdiff_t _dststride,
>          const uint8_t *_src, ptrdiff_t _srcstride,
>          int height, int denom, int wx, int ox,
> @@ -286,11 +291,13 @@ av_cold void ff_hevc_dsp_init_aarch64(HEVCDSPContext *c, const int bit_depth)
>          NEON8_FNASSIGN(c->put_hevc_epel_uni_w, 1, 0, epel_uni_w_v,);
>          NEON8_FNASSIGN_PARTIAL_4(c->put_hevc_qpel_uni_w, 1, 0, qpel_uni_w_v,);
>
> +
>          if (have_i8mm(cpu_flags)) {

Stray whitespace change.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".