From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTP id B657542792
	for <ffmpegdev@gitmailbox.com>; Wed, 30 Mar 2022 12:36:02 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id B0B6D68B21A;
	Wed, 30 Mar 2022 15:35:59 +0300 (EEST)
Received: from mail8.parnet.fi (mail8.parnet.fi [77.234.108.134])
 by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id BA5DC68A927
 for <ffmpeg-devel@ffmpeg.org>; Wed, 30 Mar 2022 15:35:53 +0300 (EEST)
Received: from mail9.parnet.fi (mail9.parnet.fi [77.234.108.21])
 by mail8.parnet.fi  with ESMTP id 22UCZqag026728-22UCZqah026728;
 Wed, 30 Mar 2022 15:35:52 +0300
Received: from foo.martin.st (host-97-187.parnet.fi [77.234.97.187])
 by mail9.parnet.fi (Postfix) with ESMTPS id 86E73A150E;
 Wed, 30 Mar 2022 15:35:52 +0300 (EEST)
Date: Wed, 30 Mar 2022 15:35:51 +0300 (EEST)
From: =?ISO-8859-15?Q?Martin_Storsj=F6?= <martin@martin.st>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
In-Reply-To: <20220325185257.513933-6-bavison@riscosopen.org>
Message-ID: <ca27181c-534e-6d84-d062-72708a706aed@martin.st>
References: <20220317185819.466470-1-bavison@riscosopen.org>
 <20220325185257.513933-1-bavison@riscosopen.org>
 <20220325185257.513933-6-bavison@riscosopen.org>
MIME-Version: 1.0
X-FE-Policy-ID: 3:14:2:SYSTEM
Subject: Re: [FFmpeg-devel] [PATCH 05/10] avcodec/vc1: Arm 64-bit NEON
 deblocking filter fast paths
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Ben Avison <bavison@riscosopen.org>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/ca27181c-534e-6d84-d062-72708a706aed@martin.st/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

On Fri, 25 Mar 2022, Ben Avison wrote:

> checkasm benchmarks on 1.5 GHz Cortex-A72 are as follows. Note that the C
> version can still outperform the NEON version in specific cases. The balance
> between different code paths is stream-dependent, but in practice the best
> case happens about 5% of the time, the worst case happens about 40% of the
> time, and the complexity of the remaining cases fall somewhere in between.
> Therefore, taking the average of the best and worst case timings is
> probably a conservative estimate of the degree by which the NEON code
> improves performance.
>
> vc1dsp.vc1_h_loop_filter4_bestcase_c: 10.7
> vc1dsp.vc1_h_loop_filter4_bestcase_neon: 43.5
> vc1dsp.vc1_h_loop_filter4_worstcase_c: 184.5
> vc1dsp.vc1_h_loop_filter4_worstcase_neon: 73.7
> vc1dsp.vc1_h_loop_filter8_bestcase_c: 31.2
> vc1dsp.vc1_h_loop_filter8_bestcase_neon: 62.2
> vc1dsp.vc1_h_loop_filter8_worstcase_c: 358.2
> vc1dsp.vc1_h_loop_filter8_worstcase_neon: 88.2
> vc1dsp.vc1_h_loop_filter16_bestcase_c: 51.0
> vc1dsp.vc1_h_loop_filter16_bestcase_neon: 107.7
> vc1dsp.vc1_h_loop_filter16_worstcase_c: 722.7
> vc1dsp.vc1_h_loop_filter16_worstcase_neon: 140.5
> vc1dsp.vc1_v_loop_filter4_bestcase_c: 9.7
> vc1dsp.vc1_v_loop_filter4_bestcase_neon: 43.0
> vc1dsp.vc1_v_loop_filter4_worstcase_c: 178.7
> vc1dsp.vc1_v_loop_filter4_worstcase_neon: 69.0
> vc1dsp.vc1_v_loop_filter8_bestcase_c: 30.2
> vc1dsp.vc1_v_loop_filter8_bestcase_neon: 50.7
> vc1dsp.vc1_v_loop_filter8_worstcase_c: 353.0
> vc1dsp.vc1_v_loop_filter8_worstcase_neon: 69.2
> vc1dsp.vc1_v_loop_filter16_bestcase_c: 60.0
> vc1dsp.vc1_v_loop_filter16_bestcase_neon: 90.0
> vc1dsp.vc1_v_loop_filter16_worstcase_c: 714.2
> vc1dsp.vc1_v_loop_filter16_worstcase_neon: 97.2
>
> Signed-off-by: Ben Avison <bavison@riscosopen.org>
> ---
> libavcodec/aarch64/Makefile              |   1 +
> libavcodec/aarch64/vc1dsp_init_aarch64.c |  14 +
> libavcodec/aarch64/vc1dsp_neon.S         | 698 +++++++++++++++++++++++
> 3 files changed, 713 insertions(+)
> create mode 100644 libavcodec/aarch64/vc1dsp_neon.S
>
> diff --git a/libavcodec/aarch64/vc1dsp_neon.S b/libavcodec/aarch64/vc1dsp_neon.S
> new file mode 100644
> index 0000000000..70391b4179
> --- /dev/null
> +++ b/libavcodec/aarch64/vc1dsp_neon.S
> @@ -0,0 +1,698 @@
> +/*
> + * VC1 AArch64 NEON optimisations
> + *
> + * Copyright (c) 2022 Ben Avison <bavison@riscosopen.org>
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +.align  5
> +.Lcoeffs:
> +.quad   0x00050002
> +

This constant is problematic when building with MSVC/armasm64.exe (via 
gas-preprocessor). (gas-preprocessor, processing assembly for use with 
armasm, can't handle any completely general gas assembly, but works fine 
as long as one sticks to the general code patterns/macros we use.)

The issue here is that this is a naked label before the first 
function/endfunc/const/endconst block. In practice it works fine if this 
follows after an existing function though. (gas-preprocessor currently 
needs an explicit .text directive, to set it up to emit code to the .text 
section. I guess I could look into handling that implicitly too.)

In practice, this issue disappears further ahead in the patch stack when 
other functions are added above this in the source file though, so it's 
not really an issue - I just thought I'd mention it.

> +// VC-1 in-loop deblocking filter for 4 pixel pairs at boundary of vertically-neighbouring blocks
> +// On entry:
> +//   x0 -> top-left pel of lower block
> +//   w1 = row stride, bytes
> +//   w2 = PQUANT bitstream parameter
> +function ff_vc1_v_loop_filter4_neon, export=1
> +        sub             x3, x0, w1, sxtw #2
> +        sxtw            x1, w1                  // technically, stride is signed int
> +        ldr             d0, .Lcoeffs
> +        ld1             {v1.s}[0], [x0], x1     // P5
> +        ld1             {v2.s}[0], [x3], x1     // P1
> +        ld1             {v3.s}[0], [x3], x1     // P2
> +        ld1             {v4.s}[0], [x0], x1     // P6
> +        ld1             {v5.s}[0], [x3], x1     // P3
> +        ld1             {v6.s}[0], [x0], x1     // P7
> +        ld1             {v7.s}[0], [x3]         // P4
> +        ld1             {v16.s}[0], [x0]        // P8
> +        ushll           v17.8h, v1.8b, #1       // 2*P5
> +        dup             v18.8h, w2              // pq
> +        ushll           v2.8h, v2.8b, #1        // 2*P1
> +        uxtl            v3.8h, v3.8b            // P2
> +        uxtl            v4.8h, v4.8b            // P6
> +        uxtl            v19.8h, v5.8b           // P3

Overall, the code looks sensible to me.

Would it make sense to share the 
core of the filter between the horizontal/vertical cases with e.g. a 
macro? (I didn't check in detail if there's much differences in the core 
of the filter. At most some differences in condition registers for partial 
writeout in the horizontal forms?)

If it's shareable, I guess the core of the filter even could be factorized 
to a separate sub-function to avoid duplicating it across the functions? 
(For the smaller filters it's probably no big deal, but for the bigger 
filters it could maybe be worthwhile?) It probably costs a couple cycles 
extra though, so if that's a too costly here, just macroing it to avoid 
duplication is fine too.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".