From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTP id B657542792 for ; Wed, 30 Mar 2022 12:36:02 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id B0B6D68B21A; Wed, 30 Mar 2022 15:35:59 +0300 (EEST) Received: from mail8.parnet.fi (mail8.parnet.fi [77.234.108.134]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id BA5DC68A927 for ; Wed, 30 Mar 2022 15:35:53 +0300 (EEST) Received: from mail9.parnet.fi (mail9.parnet.fi [77.234.108.21]) by mail8.parnet.fi with ESMTP id 22UCZqag026728-22UCZqah026728; Wed, 30 Mar 2022 15:35:52 +0300 Received: from foo.martin.st (host-97-187.parnet.fi [77.234.97.187]) by mail9.parnet.fi (Postfix) with ESMTPS id 86E73A150E; Wed, 30 Mar 2022 15:35:52 +0300 (EEST) Date: Wed, 30 Mar 2022 15:35:51 +0300 (EEST) From: =?ISO-8859-15?Q?Martin_Storsj=F6?= To: FFmpeg development discussions and patches In-Reply-To: <20220325185257.513933-6-bavison@riscosopen.org> Message-ID: References: <20220317185819.466470-1-bavison@riscosopen.org> <20220325185257.513933-1-bavison@riscosopen.org> <20220325185257.513933-6-bavison@riscosopen.org> MIME-Version: 1.0 X-FE-Policy-ID: 3:14:2:SYSTEM Subject: Re: [FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON deblocking filter fast paths X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Ben Avison Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: On Fri, 25 Mar 2022, Ben Avison wrote: > checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C > version can still outperform the NEON version in specific cases. The balance > between different code paths is stream-dependent, but in practice the best > case happens about 5% of the time, the worst case happens about 40% of the > time, and the complexity of the remaining cases fall somewhere in between. > Therefore, taking the average of the best and worst case timings is > probably a conservative estimate of the degree by which the NEON code > improves performance. > > vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7 > vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5 > vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5 > vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7 > vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2 > vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2 > vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2 > vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2 > vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0 > vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7 > vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7 > vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5 > vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7 > vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0 > vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7 > vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0 > vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2 > vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7 > vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0 > vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2 > vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0 > vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0 > vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2 > vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2 > > Signed-off-by: Ben Avison > --- > libavcodec/aarch64/Makefile | 1 + > libavcodec/aarch64/vc1dsp_init_aarch64.c | 14 + > libavcodec/aarch64/vc1dsp_neon.S | 698 +++++++++++++++++++++++ > 3 files changed, 713 insertions(+) > create mode 100644 libavcodec/aarch64/vc1dsp_neon.S > > diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S > new file mode 100644 > index 0000000000..70391b4179 > --- /dev/null > +++ b/libavcodec/aarch64/vc1dsp_neon.S > @@ -0,0 +1,698 @@ > +/* > + * VC1 AArch64 NEON optimisations > + * > + * Copyright (c) 2022 Ben Avison > + * > + * This file is part of FFmpeg. > + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with FFmpeg; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA > + */ > + > +#include "libavutil/aarch64/asm.S" > + > +.align 5 > +.Lcoeffs: > +.quad 0x00050002 > + This constant is problematic when building with MSVC/armasm64.exe (via gas-preprocessor). (gas-preprocessor, processing assembly for use with armasm, can't handle any completely general gas assembly, but works fine as long as one sticks to the general code patterns/macros we use.) The issue here is that this is a naked label before the first function/endfunc/const/endconst block. In practice it works fine if this follows after an existing function though. (gas-preprocessor currently needs an explicit .text directive, to set it up to emit code to the .text section. I guess I could look into handling that implicitly too.) In practice, this issue disappears further ahead in the patch stack when other functions are added above this in the source file though, so it's not really an issue - I just thought I'd mention it. > +// VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of vertically-neighbouring blocks > +// On entry: > +// x0 -> top-left pel of lower block > +// w1 = row stride, bytes > +// w2 = PQUANT bitstream parameter > +function ff_vc1_v_loop_filter4_neon, export=1 > + sub x3, x0, w1, sxtw #2 > + sxtw x1, w1 // technically, stride is signed int > + ldr d0, .Lcoeffs > + ld1 {v1.s}[0], [x0], x1 // P5 > + ld1 {v2.s}[0], [x3], x1 // P1 > + ld1 {v3.s}[0], [x3], x1 // P2 > + ld1 {v4.s}[0], [x0], x1 // P6 > + ld1 {v5.s}[0], [x3], x1 // P3 > + ld1 {v6.s}[0], [x0], x1 // P7 > + ld1 {v7.s}[0], [x3] // P4 > + ld1 {v16.s}[0], [x0] // P8 > + ushll v17.8h, v1.8b, #1 // 2*P5 > + dup v18.8h, w2 // pq > + ushll v2.8h, v2.8b, #1 // 2*P1 > + uxtl v3.8h, v3.8b // P2 > + uxtl v4.8h, v4.8b // P6 > + uxtl v19.8h, v5.8b // P3 Overall, the code looks sensible to me. Would it make sense to share the core of the filter between the horizontal/vertical cases with e.g. a macro? (I didn't check in detail if there's much differences in the core of the filter. At most some differences in condition registers for partial writeout in the horizontal forms?) If it's shareable, I guess the core of the filter even could be factorized to a separate sub-function to avoid duplicating it across the functions? (For the smaller filters it's probably no big deal, but for the bigger filters it could maybe be worthwhile?) It probably costs a couple cycles extra though, so if that's a too costly here, just macroing it to avoid duplication is fine too. // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".