Date: Sun, 2 Mar 2025 00:55:55 +0200 (EET)
From: Martin Storsjö
To: Krzysztof Pyrkosz via ffmpeg-devel
Cc: Krzysztof Pyrkosz
In-Reply-To: <20250227224454.6776-2-ffmpeg@szaka.eu>
References: <20250227224454.6776-2-ffmpeg@szaka.eu>
Subject: Re: [FFmpeg-devel] [PATCH] swscale/aarch64: dotprod implementation of rgba32_to_Y

On Thu, 27 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---
> I was curious whether it's possible to implement this function without
> any widening, and it turns out it not only is, but it's quite
> performant at the same time!
>
> The idea is to split the 16 bit coefficients into lower and upper half,
> invoke udot for the lower half, shift by 8, and follow by udot for the
> upper half. The code is based upon existing version.

As with the other patches: this explanation and the benchmarks are worth
keeping after the patch is committed, so please include them in the
permanent part of the commit message, above the "---".
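For readers following along, the arithmetic behind the split can be modelled in scalar C. This is only a sketch under stated assumptions, not the code from the patch: udot4() is a hypothetical stand-in for one 32-bit lane of udot, the alpha byte is assumed to meet a zero coefficient, the coefficient values in self_check() are illustrative, and the placement of the two shifts is my reading of the description above.

```c
#include <assert.h>
#include <stdint.h>

#define RGB2YUV_SHIFT 15

/* Offset plus rounding term, the same value the quoted patch builds in w9. */
static const uint32_t CONST_OFFSET =
    (32u << (RGB2YUV_SHIFT - 1)) + (1u << (RGB2YUV_SHIFT - 7));

/* Hypothetical scalar model of one udot lane: four u8 * u8 products
 * accumulated into a 32-bit value. */
static uint32_t udot4(uint32_t acc, const uint8_t a[4], const uint8_t b[4])
{
    for (int i = 0; i < 4; i++)
        acc += (uint32_t)a[i] * b[i];
    return acc;
}

/* Reference: multiply each channel by the full 16-bit coefficient. */
static uint16_t luma_ref(const uint8_t px[4], const uint16_t coef[3])
{
    uint32_t sum = CONST_OFFSET;
    for (int i = 0; i < 3; i++)
        sum += (uint32_t)coef[i] * px[i];
    return (uint16_t)(sum >> (RGB2YUV_SHIFT - 6));
}

/* Split version: dot product with the low bytes, shift by 8, dot product
 * with the high bytes, then the remaining shift. */
static uint16_t luma_split(const uint8_t px[4], const uint16_t coef[3])
{
    uint8_t lo[4] = { 0 }, hi[4] = { 0 };   /* alpha coefficient stays 0 (assumption) */
    for (int i = 0; i < 3; i++) {
        lo[i] = (uint8_t)(coef[i] & 0xff);
        hi[i] = (uint8_t)(coef[i] >> 8);
    }
    uint32_t acc = udot4(CONST_OFFSET, px, lo);
    acc >>= 8;
    acc = udot4(acc, px, hi);
    return (uint16_t)(acc >> (RGB2YUV_SHIFT - 6 - 8));
}

/* Count disagreements between the two versions over a grid of pixels. */
static int count_mismatches(const uint16_t coef[3])
{
    int bad = 0;
    for (int r = 0; r < 256; r += 3)
        for (int g = 0; g < 256; g += 3)
            for (int b = 0; b < 256; b += 3) {
                const uint8_t px[4] = { (uint8_t)r, (uint8_t)g, (uint8_t)b, 255 };
                bad += luma_ref(px, coef) != luma_split(px, coef);
            }
    return bad;
}

/* Usage example with illustrative coefficient values (an assumption). */
static void self_check(void)
{
    const uint16_t coef[3] = { 8414, 16519, 3208 };
    assert(count_mismatches(coef) == 0);
}
```

The two versions agree because the bits dropped by the intermediate shift by 8 would have been dropped by the final shift anyway: shifting the low-half sum right by 8, adding the high-half sum and then shifting by the remaining amount gives the same floor as shifting the full sum once.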
> Benchmark on A78:
> bgra_to_y_128_c:                                       682.0 ( 1.00x)
> bgra_to_y_128_neon:                                    181.2 ( 3.76x)
> bgra_to_y_128_dotprod:                                 117.8 ( 5.79x)
> bgra_to_y_1080_c:                                     5742.5 ( 1.00x)
> bgra_to_y_1080_neon:                                  1472.5 ( 3.90x)
> bgra_to_y_1080_dotprod:                                906.5 ( 6.33x)
> bgra_to_y_1920_c:                                    10194.0 ( 1.00x)
> bgra_to_y_1920_neon:                                  2589.8 ( 3.94x)
> bgra_to_y_1920_dotprod:                               1573.8 ( 6.48x)
>
> Krzysztof
>
>  libswscale/aarch64/input.S   | 88 ++++++++++++++++++++++++++++++++++++
>  libswscale/aarch64/swscale.c | 17 +++++++
>  2 files changed, 105 insertions(+)
>
> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
> index 5cb18711fb..5fe6c3f6f5 100644
> --- a/libswscale/aarch64/input.S
> +++ b/libswscale/aarch64/input.S
> @@ -313,3 +313,91 @@ rgbToUV_neon bgr24, rgb24, element=3
>  rgbToUV_neon bgra32, rgba32, element=4
>
>  rgbToUV_neon abgr32, argb32, element=4, alpha_first=1
> +
> +#if HAVE_DOTPROD
> +ENABLE_DOTPROD
> +
> +function ff_bgra32ToY_neon_dotprod, export=1
> +        cmp             w4, #0                  // check width > 0
> +        ldp             w12, w11, [x5]          // w12: ry, w11: gy
> +        ldr             w10, [x5, #8]           // w10: by
> +        b.gt            4f
> +        ret
> +endfunc
> +
> +function ff_rgba32ToY_neon_dotprod, export=1
> +        cmp             w4, #0                  // check width > 0
> +        ldp             w10, w11, [x5]          // w10: ry, w11: gy
> +        ldr             w12, [x5, #8]           // w12: by
> +        b.le            3f
> +4:
> +        mov             w9, #256                // w9 = 1 << (RGB2YUV_SHIFT - 7)
> +        movk            w9, #8, lsl #16         // w9 += 32 << (RGB2YUV_SHIFT - 1)
> +        dup             v6.4s, w9               // w9: const_offset
> +
> +        cmp             w4, #16
> +        mov             w7, w10
> +        bfi             w7, w11, 8, 8
> +        bfi             w7, w12, 16, 8

These bfi instructions are quite esoteric; it'd probably be good to add
some comments to explain what you do here.

> +        dup             v0.4s, w7
> +
> +        lsr             w6, w10, #8
> +        lsr             w7, w11, #8
> +        lsr             w8, w12, #8
> +
> +        bfi             w6, w7, 8, 8
> +        bfi             w6, w8, 16, 8
> +        dup             v1.4s, w6
> +        b.lt            2f
> +1:
> +        ld1             { v16.16b, v17.16b, v18.16b, v19.16b }, [x1], #64
> +        sub             w4, w4, #16             // width -= 16
> +        cmp             w4, #16                 // width >= 16 ?

The cmp could be moved e.g. below the mov

Other than that, this patch looks really good to me, thanks!

And while swscale is being rewritten elsewhere, adding this function
shouldn't make the transition to a rewrite any harder, so I don't see any
problem with adding this in the meantime.

// Martin
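On the bfi remark above, a scalar C model of the packing may be a useful starting point for those comments. It is a sketch, not part of the patch: bfi(), pack_lo() and pack_hi() are hypothetical helper names, and the model assumes the coefficients fit in 16 bits, so that the top byte of each packed word (the one that meets the alpha byte) ends up zero.

```c
#include <assert.h>
#include <stdint.h>

/* Model of A64 "bfi Wd, Wn, #lsb, #width": copy the low `width` bits of src
 * into dst starting at bit `lsb`, leaving all other bits of dst unchanged. */
static uint32_t bfi(uint32_t dst, uint32_t src, unsigned lsb, unsigned width)
{
    assert(width >= 1 && width < 32 && lsb + width <= 32);
    uint32_t mask = ((1u << width) - 1u) << lsb;
    return (dst & ~mask) | ((src << lsb) & mask);
}

/* mov w7, w10; bfi w7, w11, 8, 8; bfi w7, w12, 16, 8
 * Result: byte 0 = low byte of ry, byte 1 = low byte of gy,
 * byte 2 = low byte of by. Byte 1 of ry is overwritten by the first bfi,
 * so byte 3 is zero as long as ry fits in 16 bits. */
static uint32_t pack_lo(uint32_t ry, uint32_t gy, uint32_t by)
{
    uint32_t w7 = ry;
    w7 = bfi(w7, gy, 8, 8);
    w7 = bfi(w7, by, 16, 8);
    return w7;
}

/* lsr w6, w10, #8; lsr w7, w11, #8; lsr w8, w12, #8;
 * bfi w6, w7, 8, 8; bfi w6, w8, 16, 8
 * Same layout, built from the high bytes of the coefficients. */
static uint32_t pack_hi(uint32_t ry, uint32_t gy, uint32_t by)
{
    uint32_t w6 = ry >> 8;
    uint32_t w7 = gy >> 8;
    uint32_t w8 = by >> 8;
    w6 = bfi(w6, w7, 8, 8);
    w6 = bfi(w6, w8, 16, 8);
    return w6;
}
```

With little-endian lanes, duplicating such a word across a vector would place the three coefficient bytes in the same order as the R, G, B bytes of each pixel, which is presumably why the bgra entry point swaps the ry and by registers before falling into the shared code.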