From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Sun, 2 Mar 2025 00:55:55 +0200 (EET)
From: =?ISO-8859-15?Q?Martin_Storsj=F6?= <martin@martin.st>
To: Krzysztof Pyrkosz via ffmpeg-devel <ffmpeg-devel@ffmpeg.org>
In-Reply-To: <20250227224454.6776-2-ffmpeg@szaka.eu>
Message-ID: <eb9b4df7-2eaf-df0-33a1-cf718949be9@martin.st>
References: <20250227224454.6776-2-ffmpeg@szaka.eu>
MIME-Version: 1.0
Subject: Re: [FFmpeg-devel] [PATCH] swscale/aarch64: dotprod implementation
 of rgba32_to_Y
Cc: Krzysztof Pyrkosz <ffmpeg@szaka.eu>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/eb9b4df7-2eaf-df0-33a1-cf718949be9@martin.st/>

On Thu, 27 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---
> I was curious whether it's possible to implement this function without
> any widening, and it turns out it not only is, but it's quite
> performant at the same time!
>
> The idea is to split the 16 bit coefficients into lower and upper half,
> invoke udot for the lower half, shift by 8, and follow by udot for the
> upper half. The code is based upon existing version.

As with the other patches: this explanation and the benchmarks are worth 
keeping after the patch is committed, so please move them into the 
permanent part of the commit message, above the "---".
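
For readers of the commit message, the arithmetic behind the split could be 
sketched in scalar C roughly as below. This is only a model of the idea as 
described in the quoted text, not the code from the patch: the names 
y_reference, udot_lane and y_split are hypothetical, the coefficients are 
assumed to fit in 16 bits, and the final shift by 1 is my assumption to 
reach the total shift of RGB2YUV_SHIFT - 6.

```c
#include <assert.h>
#include <stdint.h>

#define RGB2YUV_SHIFT 15

/* Rounding offset, matching the comments in the quoted assembly:
 * (32 << (RGB2YUV_SHIFT - 1)) + (1 << (RGB2YUV_SHIFT - 7)) */
#define CONST_OFFSET ((32u << (RGB2YUV_SHIFT - 1)) + (1u << (RGB2YUV_SHIFT - 7)))

/* Hypothetical reference: full-width multiply-accumulate per pixel. */
static uint32_t y_reference(const uint8_t px[4], const uint32_t coef[3])
{
    uint32_t sum = CONST_OFFSET;
    for (int i = 0; i < 3; i++)
        sum += coef[i] * px[i];
    return sum >> (RGB2YUV_SHIFT - 6);
}

/* Scalar model of one 32-bit lane of udot: four u8*u8 products
 * added to a 32-bit accumulator. */
static uint32_t udot_lane(uint32_t acc, const uint8_t a[4], const uint8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)a[i] * b[i];
    return acc;
}

/* Hypothetical split version: low bytes first, shift by 8, then high bytes. */
static uint32_t y_split(const uint8_t px[4], const uint32_t coef[3])
{
    uint8_t lo[4] = { 0, 0, 0, 0 };  /* 4th entry stays 0: alpha is ignored */
    uint8_t hi[4] = { 0, 0, 0, 0 };
    for (int i = 0; i < 3; i++) {
        assert(coef[i] <= 0xffff);   /* assumption: 16-bit coefficients */
        lo[i] = coef[i] & 0xff;
        hi[i] = coef[i] >> 8;
    }

    uint32_t acc = CONST_OFFSET;
    acc = udot_lane(acc, px, lo);    /* offset + sum(lo * px) */
    acc >>= 8;                       /* drop the low 8 bits */
    acc = udot_lane(acc, px, hi);    /* + sum(hi * px), already scaled by 256 */
    return acc >> 1;                 /* 8 + 1 = RGB2YUV_SHIFT - 6 bits in total */
}
```

The two versions agree exactly because the high-half products contribute a 
multiple of 256 to the full sum, so shifting the low-half sum right by 8 
before adding them loses nothing that the reference shift would have kept.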

> Benchmark on A78:
> bgra_to_y_128_c:                                       682.0 ( 1.00x)
> bgra_to_y_128_neon:                                    181.2 ( 3.76x)
> bgra_to_y_128_dotprod:                                 117.8 ( 5.79x)
> bgra_to_y_1080_c:                                     5742.5 ( 1.00x)
> bgra_to_y_1080_neon:                                  1472.5 ( 3.90x)
> bgra_to_y_1080_dotprod:                                906.5 ( 6.33x)
> bgra_to_y_1920_c:                                    10194.0 ( 1.00x)
> bgra_to_y_1920_neon:                                  2589.8 ( 3.94x)
> bgra_to_y_1920_dotprod:                               1573.8 ( 6.48x)
>
> Krzysztof
>
> libswscale/aarch64/input.S   | 88 ++++++++++++++++++++++++++++++++++++
> libswscale/aarch64/swscale.c | 17 +++++++
> 2 files changed, 105 insertions(+)
>
> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
> index 5cb18711fb..5fe6c3f6f5 100644
> --- a/libswscale/aarch64/input.S
> +++ b/libswscale/aarch64/input.S
> @@ -313,3 +313,91 @@ rgbToUV_neon bgr24, rgb24, element=3
> rgbToUV_neon bgra32, rgba32, element=4
>
> rgbToUV_neon abgr32, argb32, element=4, alpha_first=1
> +
> +#if HAVE_DOTPROD
> +ENABLE_DOTPROD
> +
> +function ff_bgra32ToY_neon_dotprod, export=1
> +        cmp             w4, #0                  // check width > 0
> +        ldp             w12, w11, [x5]          // w12: ry, w11: gy
> +        ldr             w10, [x5, #8]           // w10: by
> +        b.gt            4f
> +        ret
> +endfunc
> +
> +function ff_rgba32ToY_neon_dotprod, export=1
> +        cmp             w4, #0                  // check width > 0
> +        ldp             w10, w11, [x5]          // w10: ry, w11: gy
> +        ldr             w12, [x5, #8]           // w12: by
> +        b.le            3f
> +4:
> +        mov             w9, #256                // w9 = 1 << (RGB2YUV_SHIFT - 7)
> +        movk            w9, #8, lsl #16         // w9 += 32 << (RGB2YUV_SHIFT - 1)
> +        dup             v6.4s, w9               // w9: const_offset
> +
> +        cmp             w4, #16
> +        mov             w7, w10
> +        bfi             w7, w11, 8, 8
> +        bfi             w7, w12, 16, 8

These bfi instructions are quite esoteric; it'd probably be good to add 
some comments to explain what you do here.
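
As a starting point for such comments, here is how I read the two bfi 
sequences, modelled in C. The helper names bfi32, pack_lo and pack_hi are 
hypothetical, and the sketch assumes each coefficient fits in 16 bits, so 
that the top byte of each packed word ends up zero.

```c
#include <stdint.h>

/* Model of "bfi wd, wn, #lsb, #width" for 1 <= width < 32:
 * copy the low `width` bits of src into dst at bit position lsb,
 * leaving all other bits of dst unchanged. */
static uint32_t bfi32(uint32_t dst, uint32_t src, unsigned lsb, unsigned width)
{
    uint32_t mask = ((1u << width) - 1u) << lsb;
    return (dst & ~mask) | ((src << lsb) & mask);
}

/* Low bytes of the three coefficients packed as bytes 0, 1, 2 of one word
 * (mov w7, w10; bfi w7, w11, 8, 8; bfi w7, w12, 16, 8). */
static uint32_t pack_lo(uint32_t c0, uint32_t c1, uint32_t c2)
{
    uint32_t w = c0;            /* byte 0: low byte of c0 */
    w = bfi32(w, c1, 8, 8);     /* byte 1: low byte of c1 (overwrites c0's high byte) */
    w = bfi32(w, c2, 16, 8);    /* byte 2: low byte of c2 */
    return w;                   /* byte 3: zero for 16-bit c0 */
}

/* High bytes packed the same way, after shifting each coefficient right by 8
 * (lsr ..., #8 followed by the second pair of bfi). */
static uint32_t pack_hi(uint32_t c0, uint32_t c1, uint32_t c2)
{
    uint32_t w = c0 >> 8;
    w = bfi32(w, c1 >> 8, 8, 8);
    w = bfi32(w, c2 >> 8, 16, 8);
    return w;
}
```

With the example values 0x20DE, 0x4087 and 0x0C88, pack_lo gives 0x008887DE 
and pack_hi gives 0x000C4020. Broadcasting such a word with dup lines the 
three coefficient bytes up with the three colour bytes of every pixel, and 
the zero top byte cancels the alpha channel in the dot product.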

> +        dup             v0.4s, w7
> +
> +        lsr             w6, w10, #8
> +        lsr             w7, w11, #8
> +        lsr             w8, w12, #8
> +
> +        bfi             w6, w7, 8, 8
> +        bfi             w6, w8, 16, 8
> +        dup             v1.4s, w6
> +        b.lt            2f
> +1:
> +        ld1             { v16.16b, v17.16b, v18.16b, v19.16b }, [x1], #64
> +        sub             w4, w4, #16             // width -= 16
> +        cmp             w4, #16                 // width >= 16 ?

The cmp could be moved further down, e.g. below the mov.

Other than that, this patch looks really good to me, thanks!

And while swscale is being rewritten elsewhere, adding this function 
shouldn't make the transition to a rewrite any harder, so I don't see any 
problem with adding this in the meantime.

// Martin
