From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTPS id 49E7C4CD0B for ; Sat, 25 Jan 2025 23:29:52 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id E457268B63D; Sun, 26 Jan 2025 01:29:50 +0200 (EET) Received: from mail-lf1-f46.google.com (mail-lf1-f46.google.com [209.85.167.46]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 4C742689DB3 for ; Sun, 26 Jan 2025 01:29:44 +0200 (EET) Received: by mail-lf1-f46.google.com with SMTP id 2adb3069b0e04-5401bd6cdb4so3623016e87.2 for ; Sat, 25 Jan 2025 15:29:44 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=martin-st.20230601.gappssmtp.com; s=20230601; t=1737847783; x=1738452583; darn=ffmpeg.org; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=yMoc7IALmOPKUJvNIbIeNmPPtOgidAGa7sNUbYA2lJI=; b=w1XTW4u7p5moaNwcxSTpd6zetO8j5U4YNpRR5wwGZl89F8+4xhL5StrK+VXRk1aw4R O+q/14XEKRnga0oVGflM87hvoym1pG/OKxb3z41Oqt4eXNWWGQ1vqvmS+UXBsKi5WboR /TdwuWoLsN+E7+X9cGgzW4XiVz3wwhjfGNS9VSenPdK9pWm02HfEA4pNctgAqvGvi54g 6Vc8ONx0kxEkyycwM5NMVk3FQJbwXs9nag3RqwGtPwIsVNTQ6SajtxHdX3g78vqY/juO PWm3TYguRCitMlMwewP5CxvOxxyv9UtaQyp8JafjCrkDvDQioNHVr4e1PjenHCuIW12Q b9YQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1737847783; x=1738452583; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=yMoc7IALmOPKUJvNIbIeNmPPtOgidAGa7sNUbYA2lJI=; b=oAdoPOIxkL9OLlSNLEBgBdSKnt6EL1H6F4JTTpvJCG2M8V4ylTF2faaVVr42aiG9lg Puc4NF9FMbbiWtS4jXF5W4yYdoQN1weI1vkKqXrpU+LWe/YjAAmgzj3BrRg/KudAohyW IOxFJkIGoOv5olRy9m2Gj/kcWOkd7BVrn2oEGxwNsgNPhIGwMLCLe+0UTBHYfML0AkxB 4oiAgvqo7yNZIWXmRLa5OEM0LrWoBZp0JPqZuV6wSoewnRZqyTu/CUl+m6ViKTb7SI5o 0c0ucyJmqUYTfrSbO1+aDtZEhreIohSreDgsXjoPUXI5SJoDVG3iY+3/20SAaqn6JOBB E61Q== X-Gm-Message-State: AOJu0YzNiupzBg2N2d7PAKNkXZQ1+HLbzSX8QLf1ZK1yeMhOHiUqyCA3 qZDn6T20W0RDea3qff0iReFVu8UoviC2fITgDv+osamdB/1+DDtHRliwy3Laxf4T1ENd0obHkkb TPQ== X-Gm-Gg: ASbGncvfC2sr5xekzJTx9PzpsCBXqMnz/eOyiPp/RHp2+sQHuLWw4Yswou76cb7yBhg QXYVVI7JMDmoi7hFhtpJkRm5i9uakgbP/QrjTjcMXfnyWh/7mschXffEi3TCC/NkxVbtzmK9p4k XXCYZn04JkvHa4/q4+rFZBSbcCENzU66AngcSusOLJMZd76oG6dLK4UYgUHz5mt7R5foE79YUhp E65T8FQtIA1rSEQDoYPFeKznzHsEmrbdQK1mMNfmD8shciKm7ASJENJ9TYPejBcu3/DCPDM733+ rETunzmyh7Wl+8ft2CCnkQgaMB/XeOicQqsTbpeYSnUlT6H6HAMVism/iN4eslxUjxuVtQwVuJh x X-Google-Smtp-Source: AGHT+IG4lz0EPuNAocIjb0cQTj57NrorfcKfETzhDMYygVgdlGr/UfV3PCq4bVsqUYjkgGb1FQPLPw== X-Received: by 2002:ac2:5541:0:b0:542:2335:c43a with SMTP id 2adb3069b0e04-5439c246321mr9288933e87.21.1737847782453; Sat, 25 Jan 2025 15:29:42 -0800 (PST) Received: from tunnel335574-pt.tunnel.tserv24.sto1.ipv6.he.net (tunnel335574-pt.tunnel.tserv24.sto1.ipv6.he.net. [2001:470:27:11::2]) by smtp.gmail.com with ESMTPSA id 2adb3069b0e04-543c8381707sm734827e87.238.2025.01.25.15.29.41 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Sat, 25 Jan 2025 15:29:41 -0800 (PST) Date: Sun, 26 Jan 2025 01:29:38 +0200 (EET) From: =?ISO-8859-15?Q?Martin_Storsj=F6?= To: Krzysztof Pyrkosz via ffmpeg-devel In-Reply-To: <20250124185825.1323-3-ffmpeg@szaka.eu> Message-ID: References: <20250124185825.1323-3-ffmpeg@szaka.eu> MIME-Version: 1.0 Subject: Re: [FFmpeg-devel] [PATCH] avcodec/aarch64/aacencdsp: NEON implementation X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Krzysztof Pyrkosz Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: On Fri, 24 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote: > This patch supplies handwritten NEON code for AAC. > > The benchmarks below were collected by invoking these two commands on > each of my boards, A78, A72 and Thinkpad x13s: > 1) ./tests/checkasm/checkasm --test=aacencdsp --bench --runs=12 > 2) ./ffmpeg -y -t 10:00 -f lavfi -i sine /tmp/foo.aac (the first line is > speed without the patch, second, with) > > > - A78 > abs_pow34_c: 4161.5 ( 1.00x) > abs_pow34_neon: 3586.2 ( 1.16x) > quant_bands_signed_c: 5548.0 ( 1.00x) > quant_bands_signed_neon: 1126.8 ( 4.92x) > quant_bands_unsigned_c: 3979.2 ( 1.00x) > quant_bands_unsigned_neon: 800.2 ( 4.97x) > > size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=71.6x > size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=82.3x > > - A72 > abs_pow34_c: 15362.2 ( 1.00x) > abs_pow34_neon: 15382.5 ( 1.00x) > quant_bands_signed_c: 9926.5 ( 1.00x) > quant_bands_signed_neon: 2467.8 ( 4.02x) > quant_bands_unsigned_c: 5469.8 ( 1.00x) > quant_bands_unsigned_neon: 2089.5 ( 2.62x) > > size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=34.3x > size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed=37.8 > > - x13s > abs_pow34_c: 2413.4 ( 1.00x) > abs_pow34_neon: 1796.2 ( 1.34x) > quant_bands_signed_c: 2968.9 ( 1.00x) > quant_bands_signed_neon: 675.6 ( 4.39x) > quant_bands_unsigned_c: 2311.9 ( 1.00x) > quant_bands_unsigned_neon: 477.1 ( 4.85x) > > size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 135x > size= 5251KiB time=00:10:00.00 bitrate= 71.7kbits/s speed= 159x Thanks for the numbers! It's interesting how the numbers for abs_pow34 end up so identical on the A72, while they do differ on the other cores. In my test, I'm getting the following numbers for that function: Cortex A53 A72 A78 abs_pow34_c: 36889.0 15359.5 14345.0 abs_pow34_neon: 9237.0 15359.0 3592.5 This test is done with the same binary across all three devices. The test is made with a fairly old C compiler so that may explain the much larger difference in speed for the C version - but surprisingly, it is still almost identical in speed compared with the SIMD version on A72. Very surprising, but anyway, the code seems to add value. The code overall looks ok, but there are some minor opportunities to make things a little bit better, primarily by looking at the scheduling. Normally, scheduling mostly helps for the in-order cores, but I do see some smaller speedups on out-of-order cores here as well. > diff --git a/libavcodec/aarch64/aacencdsp_neon.S b/libavcodec/aarch64/aacencdsp_neon.S > new file mode 100644 > index 0000000000..f0bf013c85 > --- /dev/null > +++ b/libavcodec/aarch64/aacencdsp_neon.S > @@ -0,0 +1,62 @@ > +/* > + * Copyright (c) 2025 Krzysztof Aleksander Pyrkosz > + * > + * This file is part of FFmpeg. > + * > + * FFmpeg is free software; you can redistribute it and/or > + * modify it under the terms of the GNU Lesser General Public > + * License as published by the Free Software Foundation; either > + * version 2.1 of the License, or (at your option) any later version. > + * > + * FFmpeg is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU > + * Lesser General Public License for more details. > + * > + * You should have received a copy of the GNU Lesser General Public > + * License along with FFmpeg; if not, write to the Free Software > + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA > + */ > + > +#include "libavutil/aarch64/asm.S" > + > +function ff_abs_pow34_neon, export=1 > +1: subs w2, w2, #4 > + ld1 {v0.4s}, [x1], #16 > + fabs v0.4s, v0.4s > + fsqrt v2.4s, v0.4s > + fmul v0.4s, v2.4s, v0.4s > + fsqrt v0.4s, v0.4s > + st1 {v0.4s}, [x0], #16 > + b.ne 1b > + ret > +endfunc This mostly looks straightforward. As the whole instruction sequence mostly depends on the results of the previous instruction, there's not really much one can do here. But the main usual thing to do is to place the loop decrement instruction where it can help hide latency the best; either between a load ands the use of it (if there's nothing else that can be placed there), or e.g. between the last calculation and the store of it). Here, I did the following modification: @@ -21,8 +21,9 @@ #include "libavutil/aarch64/asm.S" function ff_abs_pow34_neon, export=1 -1: subs w2, w2, #4 +1: ld1 {v0.4s}, [x1], #16 + subs w2, w2, #4 fabs v0.4s, v0.4s fsqrt v2.4s, v0.4s fmul v0.4s, v2.4s, v0.4s And I got the following results: Before: Cortex A53 A72 A78 abs_pow34_neon: 9488.0 15359.0 3586.5 Aftere: abs_pow34_neon: 9232.0 15359.0 3586.5 Thus, no change on the big out-of-order cores, but a small consistent improvement on the in-order core. I.e. no reason not to do it here. > + > +function ff_aac_quant_bands_neon, export=1 > + scvtf s2, w5 > + dup v1.4s, v1.s[0] > + dup v2.4s, v2.s[0] > + cbz w4, 0f > + movi v5.4s, 0x80, lsl #24 > +.irp signed,1,0 > +\signed: > + subs w3, w3, #4 > + ld1 {v3.4s}, [x2], #16 > + fmul v3.4s, v3.4s, v0.s[0] > +.if \signed > + ld1 {v4.4s}, [x1], #16 > +.endif Here, you could do the same - do the "subs" after the first ld1 (while we're waiting for the result to be loaded) anyway. I see that you've interleaved the extra operations on v4 here, that's good. We could try doing the load of v4 before the first fmul, but that doesn't seem to make any difference at all. With the following diff: @@ -40,8 +41,8 @@ function ff_aac_quant_bands_neon, export=1 movi v5.4s, 0x80, lsl #24 .irp signed,1,0 \signed: - subs w3, w3, #4 ld1 {v3.4s}, [x2], #16 + subs w3, w3, #4 fmul v3.4s, v3.4s, v0.s[0] .if \signed ld1 {v4.4s}, [x1], #16 I'm getting the following improvement: Before: Cortex A53 A72 A78 quant_bands_signed_neon: 5661.0 2383.2 1113.2 quant_bands_unsigned_neon: 5401.5 2067.8 811.8 After: quant_bands_signed_neon: 5402.5 2385.5 1090.0 quant_bands_unsigned_neon: 5145.5 2067.8 809.5 No change on the A72 here, but apparently a (very) small improvement on the A78, and a bigger improvement on the A53 as expected. If you don't mind these changes, we could land the change with that tweaked. (I guess the numbers in the commit message could be re-measured, but I'm not sure if they change enough to make much of a difference there, especially on the cores you've measured on.) // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".