Date: Thu, 19 Jun 2025 00:15:55 +0300 (EEST)
From: Martin Storsjö
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Dash Santosh Sathyanarayanan, Harshitha Sarangu Suresh
Message-ID: <1fb7d51-33b0-0a8-2ed6-c8553638c3df@martin.st>
Subject: Re: [FFmpeg-devel] [PATCH] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()

On Fri, 6 Jun 2025, Logaprakash Ramajayam wrote:

> Checked FATE tests and gha-aarch64 git workflow.
>
> From 34cdef26eaebcf98916e9881b3a04f4f698f09c6 Mon Sep 17 00:00:00 2001
> From: Logaprakash Ramajayam
> Date: Thu, 5 Jun 2025 01:33:39 -0700
> Subject: [PATCH] swscale/aarch64/output: Implement neon assembly for
>  yuv2planeX_10_c_template()
> ---
>  libswscale/aarch64/output.S  | 167 +++++++++++++++++++++++++++++++++++
>  libswscale/aarch64/swscale.c |  38 ++++++++
>  2 files changed, 205 insertions(+)

This is missing checkasm benchmarks for the function. That presumes
there is a checkasm test for it; if not, such a test needs to be
written.
> diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
> index 190c438870..e039e820ae 100644
> --- a/libswscale/aarch64/output.S
> +++ b/libswscale/aarch64/output.S
> @@ -20,6 +20,173 @@
>
>  #include "libavutil/aarch64/asm.S"
>
> +function ff_yuv2planeX_10_neon, export=1
> +// x0 = filter (int16_t*)
> +// w1 = filterSize
> +// x2 = src (int16_t**)
> +// x3 = dest (uint16_t*)
> +// w4 = dstW
> +// w5 = big_endian
> +// w6 = output_bits
> +
> +        mov     w8, #27
> +        sub     w8, w8, w6              // shift = 11 + 16 - output_bits
> +
> +        sub     w9, w8, #1
> +        mov     w10, #1
> +        lsl     w9, w10, w9             // val = 1 << (shift - 1)
> +
> +        dup     v1.4s, w9
> +        dup     v2.4s, w9               // Create vectors with val
> +
> +        mov     w17, #0
> +        sub     w16, w17, w8

You don't need to assign zero to a register and subtract in order to
negate a value; you can just do "neg w16, w8".

> +        dup     v8.4s, w16              // Create (-shift) vector for right shift
> +
> +        movi    v11.4s, #0
> +
> +        mov     w10, #1
> +        lsl     w10, w10, w6
> +        sub     w10, w10, #1            // (1U << output_bits) - 1
> +        dup     v12.4s, w10             // Create Clip vector for uppr bound
> +
> +        tst     w4, #15                 // if dstW divisible by 16, process 16 elements
> +        b.ne    4f                      // else process 8 elements

Same question as for the other patch: can we assume that it is ok to
always write in increments of 8? If not, we'd need a scalar loop to
handle the tail. And in any case, it's more efficient to use the most
unrolled version of the function for the majority of a line, instead of
running the whole line with a less efficient loop just because the tail
doesn't line up entirely.
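For reference, a scalar tail loop would only need to mirror the
per-pixel logic of the C template. A rough sketch (illustrative only,
not the exact FFmpeg code; the function name and the simplified
little-endian-only store are mine):

```c
#include <stdint.h>

/* Sketch of the per-pixel logic behind yuv2planeX_10_c_template():
 * accumulate the filter taps into a 32-bit value with a rounding bias,
 * shift down, and clip to [0, (1 << output_bits) - 1]. A scalar tail
 * loop would run this for the last few pixels of a line. */
static void yuv2planeX_tail_ref(const int16_t *filter, int filterSize,
                                const int16_t **src, uint16_t *dest,
                                int dstW, int output_bits)
{
    int shift  = 11 + 16 - output_bits;
    int maxval = (1 << output_bits) - 1;
    for (int i = 0; i < dstW; i++) {
        int val = 1 << (shift - 1);          /* rounding bias */
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];    /* accumulate taps */
        val >>= shift;
        dest[i] = val < 0 ? 0 : (val > maxval ? maxval : val);
    }
}
```

With output_bits = 10 this gives shift = 17, matching the "mov w8, #27 /
sub w8, w8, w6" setup in the quoted assembly.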
> +
> +        mov     x7, #0                  // i = 0
> +1:  // Loop
> +
> +        mov     v3.16b, v1.16b
> +        mov     v4.16b, v2.16b
> +        mov     v5.16b, v1.16b
> +        mov     v6.16b, v2.16b
> +
> +        mov     w11, w1                 // tmpfilterSize = filterSize
> +        mov     x12, x2                 // srcp = src
> +        mov     x13, x0                 // filterp = filter
> +
> +2:  // Filter loop
> +
> +        ldp     x14, x15, [x12], #16    // get 2 pointers: src[j] and src[j+1]
> +        ldr     s7, [x13], #4           // load filter coefficients
> +        add     x14, x14, x7, lsl #1
> +        add     x15, x15, x7, lsl #1
> +        ld1     {v16.8h, v17.8h}, [x14]
> +        ld1     {v18.8h, v19.8h}, [x15]
> +
> +        // Multiply-accumulate
> +        smlal   v3.4s, v16.4h, v7.h[0]
> +        smlal2  v4.4s, v16.8h, v7.h[0]
> +        smlal   v5.4s, v17.4h, v7.h[0]
> +        smlal2  v6.4s, v17.8h, v7.h[0]
> +
> +        smlal   v3.4s, v18.4h, v7.h[1]
> +        smlal2  v4.4s, v18.8h, v7.h[1]
> +        smlal   v5.4s, v19.4h, v7.h[1]
> +        smlal2  v6.4s, v19.8h, v7.h[1]
> +
> +        subs    w11, w11, #2            // tmpfilterSize -= 2
> +        b.gt    2b                      // continue filter loop
> +
> +        // Shift results
> +        sshl    v3.4s, v3.4s, v8.4s
> +        sshl    v4.4s, v4.4s, v8.4s
> +        sshl    v5.4s, v5.4s, v8.4s
> +        sshl    v6.4s, v6.4s, v8.4s
> +
> +        // Clamp to 0
> +        smax    v3.4s, v3.4s, v11.4s
> +        smax    v4.4s, v4.4s, v11.4s
> +        smax    v5.4s, v5.4s, v11.4s
> +        smax    v6.4s, v6.4s, v11.4s
> +
> +        // Clip upper bound
> +        smin    v3.4s, v3.4s, v12.4s
> +        smin    v4.4s, v4.4s, v12.4s
> +        smin    v5.4s, v5.4s, v12.4s
> +        smin    v6.4s, v6.4s, v12.4s
> +
> +        // Narrow to 16-bit
> +        xtn     v13.4h, v3.4s
> +        xtn2    v13.8h, v4.4s
> +        xtn     v14.4h, v5.4s
> +        xtn2    v14.8h, v6.4s

If we are going to narrow things to 16 bit here anyway, it would be more
efficient to narrow first, before clamping. You can do that with sqxtun;
then you also get the clamp to 0 part for free, so you only need to
clamp the upper bound, with half the number of instructions/registers.
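To spell out that suggestion in scalar terms: per lane, a saturating
signed-to-unsigned narrow (sqxtun) clamps negatives to 0 and anything
above 65535 to 65535, so only the format's upper bound remains to be
clamped afterwards. A C model of the per-lane behaviour (function names
are mine, purely illustrative):

```c
#include <stdint.h>

/* Scalar model of what sqxtun does per lane: saturate an int32_t into
 * the uint16_t range. Negative inputs become 0 "for free". */
static uint16_t sqxtun_lane(int32_t v)
{
    if (v < 0)     return 0;
    if (v > 65535) return 65535;
    return (uint16_t)v;
}

/* After the narrow, only the format's upper bound still needs clamping,
 * e.g. 1023 for 10-bit output -- half the clamp work of the quoted
 * smax + smin pairs. */
static uint16_t clip_output(int32_t v, int output_bits)
{
    uint16_t u = sqxtun_lane(v);
    uint16_t maxval = (uint16_t)((1 << output_bits) - 1);
    return u > maxval ? maxval : u;
}
```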
> +
> +        cbz     w5, 3f                  // Check if big endian
> +        rev16   v13.16b, v13.16b
> +        rev16   v14.16b, v14.16b        // Swap bits for big endian
> +3:
> +        // Store 16 pixels
> +        st1     {v13.8h}, [x3], #16
> +        st1     {v14.8h}, [x3], #16

Write both registers with one store: st1 {v13.8h, v14.8h}, [x3], #32.

> +
> +        add     x7, x7, #16             // i = i + 16
> +        subs    w4, w4, #16             // dstW = dstW - 16
> +        b.gt    1b                      // Continue loop

If possible, don't do the calculation that sets the condition codes
directly before the branch; that forces the branch to wait for the
previous instruction to finish. Instead you can move the "subs"
instruction a bit earlier; at least above the "add" here, but it could
also go e.g. after the 3: label.

> diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
> index 6e5a721c1f..23cdb7d26e 100644
> --- a/libswscale/aarch64/swscale.c
> +++ b/libswscale/aarch64/swscale.c
> @@ -158,6 +158,29 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
>
>  ALL_SCALE_FUNCS(neon);
>
> +void ff_yuv2planeX_10_neon(const int16_t *filter, int filterSize,
> +        const int16_t **src, uint16_t *dest, int dstW,
> +        int big_endian, int output_bits);

Align the parameters on the continuation lines with the parameters on
the first line. See ff_yuv2planeX_8_neon right below.
> +
> +#define yuv2NBPS(bits, BE_LE, is_be, template_size, typeX_t) \
> +static void yuv2planeX_ ## bits ## BE_LE ## _neon(const int16_t *filter, int filterSize, \
> +        const int16_t **src, uint8_t *dest, int dstW, \
> +        const uint8_t *dither, int offset)\

Same thing here.

> +{ \
> +    ff_yuv2planeX_## template_size ## _neon(filter, \
> +        filterSize, (const typeX_t **) src, \
> +        (uint16_t *) dest, dstW, is_be, bits); \

Same thing here.

> +}
> +
> +yuv2NBPS( 9, BE, 1, 10, int16_t)
> +yuv2NBPS( 9, LE, 0, 10, int16_t)
> +yuv2NBPS(10, BE, 1, 10, int16_t)
> +yuv2NBPS(10, LE, 0, 10, int16_t)
> +yuv2NBPS(12, BE, 1, 10, int16_t)
> +yuv2NBPS(12, LE, 0, 10, int16_t)
> +yuv2NBPS(14, BE, 1, 10, int16_t)
> +yuv2NBPS(14, LE, 0, 10, int16_t)

FWIW, I appreciate the effort to save code size here by not templating 8
different copies of the same function, but making it use one single
implementation for all the variants.

> +
>  void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
>                            const int16_t **src, uint8_t *dest, int dstW,
>                            const uint8_t *dither, int offset);
> @@ -268,6 +291,8 @@ av_cold void ff_sws_init_range_convert_aarch64(SwsInternal *c)
>  av_cold void ff_sws_init_swscale_aarch64(SwsInternal *c)
>  {
>      int cpu_flags = av_get_cpu_flags();
> +    enum AVPixelFormat dstFormat = c->opts.dst_format;
> +    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(dstFormat);
>
>      if (have_neon(cpu_flags)) {
>          ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon);
> @@ -276,6 +301,19 @@ av_cold void ff_sws_init_swscale_aarch64(SwsInternal *c)
>          if (c->dstBpc == 8) {
>              c->yuv2planeX = ff_yuv2planeX_8_neon;
>          }
> +
> +        if (isNBPS(dstFormat) && !isSemiPlanarYUV(dstFormat)) {
> +            if (desc->comp[0].depth == 9) {
> +                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_9BE_neon : yuv2planeX_9LE_neon;
> +            } else if (desc->comp[0].depth == 10) {
> +                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_10BE_neon : yuv2planeX_10LE_neon;
> +            } else if (desc->comp[0].depth == 12) {
> +                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_12BE_neon : yuv2planeX_12LE_neon;
> +            } else if (desc->comp[0].depth == 14) {
> +                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_14BE_neon : yuv2planeX_14LE_neon;
> +            } else
> +                av_assert0(0);

The av_assert is misindented.

// Martin