From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTP id 35FF745823 for ; Thu, 23 Feb 2023 08:55:39 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 2D58768C0A8; Thu, 23 Feb 2023 10:55:36 +0200 (EET) Received: from mail8.parnet.fi (mail8.parnet.fi [77.234.108.134]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 8EDD968BDE0 for ; Thu, 23 Feb 2023 10:55:29 +0200 (EET) Received: from mail9.parnet.fi (mail9.parnet.fi [77.234.108.21]) by mail8.parnet.fi with ESMTP id 31N8tReo028458-31N8tRep028458; Thu, 23 Feb 2023 10:55:27 +0200 Received: from foo.martin.st (host-97-187.parnet.fi [77.234.97.187]) by mail9.parnet.fi (Postfix) with ESMTPS id 79AEEA143A; Thu, 23 Feb 2023 10:55:27 +0200 (EET) Date: Thu, 23 Feb 2023 10:55:26 +0200 (EET) From: =?ISO-8859-15?Q?Martin_Storsj=F6?= To: FFmpeg development discussions and patches In-Reply-To: Message-ID: <494fea8f-5f25-7c20-e31b-edca80dfda70@martin.st> References: MIME-Version: 1.0 X-FE-Policy-ID: 3:14:2:SYSTEM Subject: Re: [FFmpeg-devel] [PATCH] libavcodec/hevc: add hevc idct_4x4_neon of aarch64 X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: xufuji456 <839789740@qq.com> Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: On Tue, 14 Feb 2023, xufuji456 wrote: > --- > libavcodec/aarch64/hevcdsp_idct_neon.S | 51 +++++++++++++++++++++++ > libavcodec/aarch64/hevcdsp_init_aarch64.c | 4 ++ > 2 files changed, 55 insertions(+) This looks mostly fine, but there's a couple issues that breaks it when built with toolchains other than what you've been using. (I presume you've built it with Clang on macOS, since that's the only toolchain that built it successfully and ran without failures.) > > diff --git a/libavcodec/aarch64/hevcdsp_idct_neon.S b/libavcodec/aarch64/hevcdsp_idct_neon.S > index 124c50998a..fe8baf1348 100644 > --- a/libavcodec/aarch64/hevcdsp_idct_neon.S > +++ b/libavcodec/aarch64/hevcdsp_idct_neon.S > @@ -245,6 +245,54 @@ function hevc_add_residual_32x32_16_neon, export=0 > ret > endfunc > > +.macro tr_4x4 in0, in1, in2, in3, out0, out1, out2, out3, shift, tmp0, tmp1, tmp2, tmp3, tmp4 > + sshll \tmp0, \in0, #6 Nitpick: the indentation of \tmp0 is off here > + smull \tmp2, \in1, v4.h[1] > + mov \tmp1, \tmp0 Both binutils and MS armasm64.exe refuse to assemble an instruction like "mov v0.4s, v1.4s" - for the mov instruction, they only accept .8b or .16b, not .4s - even if it doesn't make any functional difference for what the instruction does. Due to how these macros are used here, it's not very easy to come up with a neat solution. One possible solution is to just repeat "sshll \tmp1, \in0, #6"; that does the same shift operation twice, but reduces the chain depth for setting tmp1, so it's possible that it doesn't make any big difference in the end. (On Cortex A53/A72/A73, I didn't measure any significant difference here.) > + smull \tmp3, \in1, v4.h[3] > + smlal \tmp0, \in2, v4.h[0] //e0 > + smlsl \tmp1, \in2, v4.h[0] //e1 > + smlal \tmp2, \in3, v4.h[3] //o0 > + smlsl \tmp3, \in3, v4.h[1] //o1 > + > + add \tmp4, \tmp0, \tmp2 > + sub \tmp0, \tmp0, \tmp2 > + add \tmp2, \tmp1, \tmp3 > + sub \tmp1, \tmp1, \tmp3 > + sqrshrn \out0, \tmp4, #\shift > + sqrshrn \out3, \tmp0, #\shift > + sqrshrn \out1, \tmp2, #\shift > + sqrshrn \out2, \tmp1, #\shift > +.endm > + > +.macro transpose_4x4 r0, r1, r2, r3 > + trn1 v22.8h, \r0\().8h, \r1\().8h > + trn2 v23.8h, \r0\().8h, \r1\().8h > + trn1 v24.8h, \r2\().8h, \r3\().8h > + trn2 v25.8h, \r2\().8h, \r3\().8h > + trn1 \r0\().4s, v22.4s, v24.4s > + trn2 \r2\().4s, v22.4s, v24.4s > + trn1 \r1\().4s, v23.4s, v25.4s > + trn2 \r3\().4s, v23.4s, v25.4s > +.endm neon.S already has got a bunch of similar matching transposes. I see that this file already had some similar custom transposes; I sent https://patchwork.ffmpeg.org/project/ffmpeg/patch/20230223083826.3149827-1-martin@martin.st/ to fix that. Here, you can use transpose_4x8H from neon.S too. > + > +.macro idct_4x4 bitdepth > +function ff_hevc_idct_4x4_\bitdepth\()_neon, export=1 > + ld1 {v0.4h-v3.4h}, [x0] > + > + movrel x1, trans > + ld1 {v4.4h}, [x1] > + > + tr_4x4 v0.4h, v1.4h, v2.4h, v3.4h, v16.4h, v17.4h, v18.4h, v19.4h, 7, v10.4s, v11.4s, v12.4s, v13.4s, v15.4s You're not allowed to clobber v8-v15 without backing them up/restoring them. Checkasm on macOS doesn't detect this, but it does on other OSes. Here it should be trivial to use other temporary registers instead, without needing to backup/restore anything. // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".