From: Martin Storsjö
To: Krzysztof Pyrkosz via ffmpeg-devel
Cc: Krzysztof Pyrkosz
Date: Sun, 19 Jan 2025 23:28:30 +0200 (EET)
Subject: Re: [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_butterflies_float_neon
In-Reply-To: <20250119180706.3075-2-ffmpeg@szaka.eu>
References: <20250119180706.3075-2-ffmpeg@szaka.eu>

On Sun, 19 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> Modified the main loop to handle 8 floats in one go. A special case of
> length not being multiple of 8 is handled at the beginning. The speed
> increased from 3.90 to 4.50.

The same request applies here: please state what the benchmark numbers
are and where they were measured.

>
> Krzysztof
>
> ---
>  libavutil/aarch64/float_dsp_neon.S | 30 ++++++++++++++++++++++--------
>  1 file changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
> index 35e2715b87..f99cc2e6b2 100644
> --- a/libavutil/aarch64/float_dsp_neon.S
> +++ b/libavutil/aarch64/float_dsp_neon.S
> @@ -178,14 +178,28 @@ function ff_vector_fmul_reverse_neon, export=1
>  endfunc
>
>  function ff_butterflies_float_neon, export=1
> -1:      ld1     {v0.4s}, [x0]
> -        ld1     {v1.4s}, [x1]
> -        subs    w2, w2, #4
> -        fsub    v2.4s, v0.4s, v1.4s
> -        fadd    v3.4s, v0.4s, v1.4s
> -        st1     {v2.4s}, [x1], #16
> -        st1     {v3.4s}, [x0], #16
> -        b.gt    1b
> +        tst     w2, #4

Note how you're altering the indentation of the whole function here -
don't do that, keep things aligned as they were. (The patch for
ff_vector_fmul_add_neon also had that issue.)

> +        b.eq    bf8
> +        ldr     q0, [x0]
> +        ldr     q1, [x1]
> +        fadd    v2.4s, v0.4s, v1.4s
> +        fsub    v3.4s, v0.4s, v1.4s
> +        str     q2, [x0], #16
> +        str     q3, [x1], #16
> +        sub     w2, w2, #4
> +        cbnz    w2, bf8
> +        ret
> +bf8:

This will emit an actual symbol to the symbol table. That's not an issue
per se, but the output of some tools (disassemblers etc) can be nicer if
we avoid them.

If we really do want a textual label, there are also ways of making them
into local labels that don't end up in the symbol table; prefixing it
with .L will do that, for ELF. We have a couple of cases of that in our
assembly, but there's not a lot of it. However, for Mach-O, it has to be
prefixed by a plain "L", not ".L".
So if we want to start doing this, we could add this macro to asm.S:
https://code.videolan.org/videolan/dav1d/-/blob/master/src/arm/asm.S?ref_type=heads#L352-356

But for this case, a plain numeric 8: could probably be just as good.

> +        ldp     q0, q1, [x0]
> +        ldp     q4, q5, [x1]
> +        fadd    v2.4s, v0.4s, v4.4s
> +        fadd    v3.4s, v1.4s, v5.4s
> +        stp     q2, q3, [x0], #32

Why are you doing the store of q3 directly after the calculation of it?
This will cause stalls on in-order cores.

> +        fsub    v2.4s, v0.4s, v4.4s
> +        fsub    v3.4s, v1.4s, v5.4s
> +        stp     q2, q3, [x1], #32
> +        sub     w2, w2, #8
> +        cbnz    w2, bf8

Likewise, move the sub up, between the fsub and the stp, so that it no
longer sits directly before the cbnz; that improves things for in-order
cores.

Overall, this change does seem to be helpful (except for A73,
surprisingly), but it can be made better with the suggestions I made
above.

Before:                  Cortex A53     A72     A73     A78
butterflies_float_neon:       782.0   336.8   405.8   163.0
After this patch:
butterflies_float_neon:       718.5   323.8   525.2   147.8
With extra modifications:
butterflies_float_neon:       660.5   328.8   472.8   147.0

That is with the following changes on top:

+++ b/libavutil/aarch64/float_dsp_neon.S
@@ -195,12 +195,12 @@ bf8:
         ldp     q4, q5, [x1]
         fadd    v2.4s, v0.4s, v4.4s
         fadd    v3.4s, v1.4s, v5.4s
+        fsub    v6.4s, v0.4s, v4.4s
         stp     q2, q3, [x0], #32
-        fsub    v2.4s, v0.4s, v4.4s
-        fsub    v3.4s, v1.4s, v5.4s
-        stp     q2, q3, [x1], #32
-        sub     w2, w2, #8
-        cbnz    w2, bf8
+        fsub    v7.4s, v1.4s, v5.4s
+        subs    w2, w2, #8
+        stp     q6, q7, [x1], #32
+        b.gt    bf8
         ret

// Martin
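
For readers who want to check the control flow discussed in this review, here is a minimal C model of it. It is only a sketch, not FFmpeg code: the names butterflies_ref and butterflies_blocked are hypothetical, and it assumes len is a positive multiple of 4, which the assembly above also relies on. The first function is a plain scalar butterfly (sum written to v1, difference written to v2, matching where the assembly stores its fadd and fsub results). The second handles one odd group of 4 floats first and then runs the loop 8 floats at a time, as the patch does.

```c
/* Scalar reference: v1[i] receives the sum, v2[i] the difference. */
void butterflies_ref(float *v1, float *v2, int len)
{
    for (int i = 0; i < len; i++) {
        float diff = v1[i] - v2[i];
        v1[i] += v2[i];
        v2[i] = diff;
    }
}

/*
 * Hypothetical model of the patched loop structure.
 * Assumption: len is a positive multiple of 4.
 */
void butterflies_blocked(float *v1, float *v2, int len)
{
    if (len & 4) {                   /* corresponds to "tst w2, #4" */
        butterflies_ref(v1, v2, 4);  /* one 128-bit vector per input */
        v1 += 4;
        v2 += 4;
        len -= 4;
        if (len == 0)                /* nothing left for the 8-wide loop */
            return;
    }
    do {
        butterflies_ref(v1, v2, 8);  /* two 128-bit vectors per input */
        v1 += 8;
        v2 += 8;
        len -= 8;                    /* corresponds to "subs w2, w2, #8" */
    } while (len > 0);               /* corresponds to "b.gt" */
}
```

Running both functions on copies of the same input with lengths such as 4, 8 and 12 should give identical results. The model says nothing about instruction scheduling, which is what the benchmark differences above are about.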