Date: Wed, 14 Aug 2024 15:31:12 +0300 (EEST)
From: Martin Storsjö
To: FFmpeg development discussions and patches
Subject: Re: [FFmpeg-devel] [PATCH 4/4] swscale/aarch64: add nv24/nv42 to yuv420p unscaled converter
Message-ID: <79af6ca-f4b2-6f51-1b4d-e35f8331991@martin.st>
In-Reply-To: <20240809112612.107000-4-ramiro.polla@gmail.com>
References: <20240809112612.107000-1-ramiro.polla@gmail.com> <20240809112612.107000-4-ramiro.polla@gmail.com>

On Fri, 9 Aug 2024, Ramiro Polla wrote:

> checkasm --bench for Raspberry Pi 5 Model B Rev 1.0:
> nv24_yuv420p_128_c: 423.0
> nv24_yuv420p_128_neon: 115.7
> nv24_yuv420p_1920_c: 5939.5
> nv24_yuv420p_1920_neon: 1339.7
> nv42_yuv420p_128_c: 423.2
> nv42_yuv420p_128_neon: 115.7
> nv42_yuv420p_1920_c: 5907.5
> nv42_yuv420p_1920_neon: 1342.5
> ---
>  libswscale/aarch64/Makefile                |  1 +
>  libswscale/aarch64/swscale_unscaled.c      | 30 +++++++++
>  libswscale/aarch64/swscale_unscaled_neon.S | 75 ++++++++++++++++++++++
>  3 files changed, 106 insertions(+)
>  create mode 100644 libswscale/aarch64/swscale_unscaled_neon.S

> diff --git a/libswscale/aarch64/swscale_unscaled_neon.S b/libswscale/aarch64/swscale_unscaled_neon.S
> new file mode 100644
> index 0000000000..a206fda41f
> --- /dev/null
> +++ b/libswscale/aarch64/swscale_unscaled_neon.S
> @@ -0,0 +1,75 @@
> +/*
> + * Copyright (c) 2024 Ramiro Polla
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +function ff_nv24_to_yuv420p_chroma_neon, export=1
> +// x0  uint8_t *dst1
> +// x1  int dstStride1
> +// x2  uint8_t *dst2
> +// x3  int dstStride2
> +// x4  const uint8_t *src
> +// x5  int srcStride
> +// w6  int w
> +// w7  int h
> +
> +        uxtw            x1, w1
> +        uxtw            x3, w3
> +        uxtw            x5, w5

You can often avoid the explicit uxtw instructions, if you can fold an
uxtw attribute into the cases where the register is used.
(If it's used often, it may be slightly more performant to do it upfront
like this, but often it can be omitted entirely.) And whenever you do an
operation with a wN register as destination, the upper half of the
register gets explicitly cleared, so these may also be avoided that way.

> +
> +        add             x9, x4, x5              // x9 = src + srcStride
> +        lsl             w5, w5, #1              // srcStride *= 2
> +
> +1:
> +        mov             w10, w6                 // w10 = w
> +        mov             x11, x4                 // x11 = src1 (line 1)
> +        mov             x12, x9                 // x12 = src2 (line 2)
> +        mov             x13, x0                 // x13 = dst1 (dstU)
> +        mov             x14, x2                 // x14 = dst2 (dstV)
> +
> +2:
> +        ld2             { v0.16b, v1.16b }, [x11], #32  // v0 = U1, v1 = V1
> +        ld2             { v2.16b, v3.16b }, [x12], #32  // v2 = U2, v3 = V2
> +
> +        uaddlp          v0.8h, v0.16b           // pairwise add U1 into v0
> +        uaddlp          v1.8h, v1.16b           // pairwise add V1 into v1
> +        uadalp          v0.8h, v2.16b           // pairwise add U2, accumulate into v0
> +        uadalp          v1.8h, v3.16b           // pairwise add V2, accumulate into v1
> +
> +        shrn            v0.8b, v0.8h, #2        // divide by 4
> +        shrn            v1.8b, v1.8h, #2        // divide by 4
> +
> +        st1             { v0.8b }, [x13], #8    // store U into dst1
> +        st1             { v1.8b }, [x14], #8    // store V into dst2
> +
> +        subs            w10, w10, #8
> +        b.gt            2b
> +
> +        // next row
> +        add             x4, x4, x5              // src1 += srcStride * 2
> +        add             x9, x9, x5              // src2 += srcStride * 2
> +        add             x0, x0, x1              // dst1 += dstStride1
> +        add             x2, x2, x3              // dst2 += dstStride2

It's often possible to avoid the extra step of moving the pointers back
into the x11/x12/x13/x14 registers, if you subtract the width from the
stride at the start of the function. Then you don't need two separate
registers for each pointer, and it shortens the dependency chain when
moving on to the next line.

If the width can be any uneven value, but we in practice write in
increments of 8 pixels, you may need to align the width up to 8 before
using it to decrement the stride that way, though.
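
To illustrate the stride adjustment, here is a rough scalar sketch in C
rather than NEON assembly. The function name chroma_2x2_avg_sketch, the
ptrdiff_t stride types, and the reading of w as the number of output
samples per row are assumptions made for illustration; none of them come
from the patch. The inner loop mirrors the quoted one (32 source bytes
per line in, 8 U and 8 V bytes out), and the per-row bookkeeping uses
strides that already have the aligned width subtracted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar stand-in for the quoted NEON loop (not FFmpeg API).
 * Assumption: w counts output samples per row and is consumed in blocks
 * of 8, so the buffers must have room for w rounded up to 8. */
void chroma_2x2_avg_sketch(uint8_t *dst1, ptrdiff_t dstStride1,
                           uint8_t *dst2, ptrdiff_t dstStride2,
                           const uint8_t *src, ptrdiff_t srcStride,
                           int w, int h)
{
    /* Align the width up to the block size before adjusting the strides. */
    ptrdiff_t aligned_w = (w + 7) & ~7;

    const uint8_t *src1 = src;              /* first source line  */
    const uint8_t *src2 = src + srcStride;  /* second source line */

    /* Each output sample consumes 4 interleaved source bytes per line, and
     * each output row consumes two source lines. Subtracting what the inner
     * loop advances leaves only the remainder to add at the end of a row. */
    ptrdiff_t src_adj  = 2 * srcStride - 4 * aligned_w;
    ptrdiff_t dst1_adj = dstStride1 - aligned_w;
    ptrdiff_t dst2_adj = dstStride2 - aligned_w;

    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x += 8) {
            for (int i = 0; i < 8; i++) {
                /* Even bytes are the first plane, odd bytes the second;
                 * truncating shift, like shrn #2 in the quoted code. */
                dst1[i] = (src1[4 * i]     + src1[4 * i + 2] +
                           src2[4 * i]     + src2[4 * i + 2]) >> 2;
                dst2[i] = (src1[4 * i + 1] + src1[4 * i + 3] +
                           src2[4 * i + 1] + src2[4 * i + 3]) >> 2;
            }
            src1 += 32; src2 += 32;
            dst1 += 8;  dst2 += 8;
        }
        /* No per-row copies of the base pointers are needed: the working
         * pointers themselves step to the next row. */
        src1 += src_adj;  src2 += src_adj;
        dst1 += dst1_adj; dst2 += dst2_adj;
    }
}
```

In assembly the same idea would mean post-incrementing x0/x2/x4/x9
directly in the inner loop and adding the pre-adjusted strides at the end
of each row, instead of copying into x11..x14 first.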

// Martin