From: Martin Storsjö
To: Krzysztof Pyrkosz via ffmpeg-devel
Cc: Krzysztof Pyrkosz
Date: Sun, 19 Jan 2025 22:41:22 +0200 (EET)
Subject: Re: [FFmpeg-devel] [PATCH] avutil/aarch64/float_dsp_neon: Refactor ff_vector_fmac_scalar_neon
In-Reply-To: <20250119150440.26868-2-ffmpeg@szaka.eu>
References: <20250119150440.26868-1-ffmpeg@szaka.eu> <20250119150440.26868-2-ffmpeg@szaka.eu>

On Sun, 19 Jan 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> ---
>  libavutil/aarch64/float_dsp_neon.S | 13 ++++++-------
>  1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/libavutil/aarch64/float_dsp_neon.S b/libavutil/aarch64/float_dsp_neon.S
> index 35e2715b87..b21f34c084 100644
> --- a/libavutil/aarch64/float_dsp_neon.S
> +++ b/libavutil/aarch64/float_dsp_neon.S
> @@ -40,18 +40,17 @@ function ff_vector_fmul_neon, export=1
>  endfunc
>
>  function ff_vector_fmac_scalar_neon, export=1
> -        mov             x3,  #-32
>  1:      subs            w2,  w2,  #16
> -        ld1             {v16.4s, v17.4s}, [x0], #32
> -        ld1             {v18.4s, v19.4s}, [x0], x3
> -        ld1             {v4.4s,  v5.4s},  [x1], #32
> -        ld1             {v6.4s,  v7.4s},  [x1], #32
> +        ldp             q16, q17, [x0]
> +        ldp             q4, q5, [x1], #32
>          fmla            v16.4s, v4.4s,  v0.s[0]
>          fmla            v17.4s, v5.4s,  v0.s[0]
> +        stp             q16, q17, [x0], #32
> +        ldp             q18, q19, [x0]
> +        ldp             q6, q7, [x1], #32
>          fmla            v18.4s, v6.4s,  v0.s[0]
>          fmla            v19.4s, v7.4s,  v0.s[0]
> -        st1             {v16.4s, v17.4s}, [x0], #32
> -        st1             {v18.4s, v19.4s}, [x0], #32
> +        stp             q18, q19, [x0], #32
>          b.ne            1b
>          ret
>  endfunc
> --

The change looks ok to me, but for changes like this, please do quote
actual checkasm benchmark numbers, with the actual numbers before/after
(not just the change in speedup factor), and include such numbers in the
commit message for reference. Also include info about what core you have
benchmarked it on.

The reason for that is that various tunings of assembly like this can
have a lot of effect on one core, and little/no effect on others, or even
contradictory effects on others. So in order not to go back and forth (if
someone else comes along next week and does a different tuning for a
different core), please include info about the actual numbers and what
you have benchmarked on.
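
As a reading aid for the quoted patch, here is a minimal Python sketch of
what the patched loop appears to compute, judging only from the diff: x0
is read and written (the destination), x1 is only read (the source),
v0.s[0] holds the scalar, and w2 counts elements in steps of 16. The name
fmac_scalar_model and the list-based interface are hypothetical and exist
only for illustration. Python floats are doubles, so this models the
arithmetic, not the bit-exact single-precision fused multiply-add.

```python
def fmac_scalar_model(dst, src, mul):
    """Hypothetical model of the patched loop: dst[i] += src[i] * mul.

    Each outer iteration handles 16 floats as two halves of 8, mirroring
    the ldp / fmla / stp groups in the quoted diff.
    """
    n = len(dst)
    # The assembly decrements the count by 16 and loops until it reaches
    # zero, so the model only accepts a positive multiple of 16.
    if n != len(src) or n == 0 or n % 16 != 0:
        raise ValueError("length must be a positive multiple of 16")

    out = list(dst)
    for base in range(0, n, 16):          # 1: subs w2, w2, #16 ... b.ne 1b
        for half in (0, 8):               # first q16/q17, then q18/q19
            lo = base + half
            d = out[lo:lo + 8]            # ldp qN, qN+1, [x0]
            s = src[lo:lo + 8]            # ldp qM, qM+1, [x1], #32
            # fmla: multiply-accumulate by the scalar, then stp back to dst
            out[lo:lo + 8] = [a + b * mul for a, b in zip(d, s)]
    return out


if __name__ == "__main__":
    result = fmac_scalar_model([1.0] * 16, [2.0] * 16, 0.5)
    print(result[:4])
```

The point of the model is that the patch only changes how loads and
stores are grouped and ordered; the result is the same as before, which
is why only benchmark numbers can tell the variants apart.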
I benchmarked this change on a handful of different cores, and got the
following numbers:

Before:                  Cortex A53     A72     A73     A78
vector_fmac_scalar_neon:      606.0   213.8   321.8   115.5
After:
vector_fmac_scalar_neon:      608.0   198.2   303.2    90.8

So it shows no change (but no regression either) on the in-order A53,
and a 5-27% speedup on the out-of-order cores (A72, A73, A78). So it
looks like a reasonable tradeoff.

However, to show the contrast, if one were to favour optimizations for
the A53, one could do this simple change:

--- a/libavutil/aarch64/float_dsp_neon.S
+++ b/libavutil/aarch64/float_dsp_neon.S
@@ -43,8 +43,8 @@ function ff_vector_fmac_scalar_neon, export=1
         mov             x3,  #-32
 1:      subs            w2,  w2,  #16
         ld1             {v16.4s, v17.4s}, [x0], #32
-        ld1             {v18.4s, v19.4s}, [x0], x3
         ld1             {v4.4s,  v5.4s},  [x1], #32
+        ld1             {v18.4s, v19.4s}, [x0], x3
         ld1             {v6.4s,  v7.4s},  [x1], #32
         fmla            v16.4s, v4.4s,  v0.s[0]
         fmla            v17.4s, v5.4s,  v0.s[0]

This gives the following numbers:

Before:                  Cortex A53     A72     A73     A78
vector_fmac_scalar_neon:      606.0   213.8   321.8   115.5
After:
vector_fmac_scalar_neon:      572.0   213.8   308.0   115.2

I.e. such a change would be almost as good for the A73 and even better
for the A53, while having essentially no effect on the A72 and A78.

In this case, the submitted change is probably fine and a good
compromise. But I wanted to show this example to highlight the fact that
tuning of functions like this can be very subjective.

// Martin
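
The relative changes implied by the two tables above can be recomputed
with a short script. This is only a sketch: the numbers are copied from
the tables, lower is taken to mean faster, and the helper names
(speedup_percent, summarize) are made up for illustration. Speedup is
defined here as before/after - 1, which is one of several reasonable
conventions.

```python
CORES = ("A53", "A72", "A73", "A78")

# checkasm numbers for vector_fmac_scalar_neon, as quoted in the tables.
BEFORE = (606.0, 213.8, 321.8, 115.5)
AFTER_LDP_STP = (608.0, 198.2, 303.2, 90.8)    # the submitted patch
AFTER_REORDER = (572.0, 213.8, 308.0, 115.2)   # the A53-friendly reordering


def speedup_percent(before, after):
    """Speedup in percent; positive means the new code is faster."""
    return (before / after - 1.0) * 100.0


def summarize(after):
    """Map each core name to its speedup, rounded to one decimal."""
    return {
        core: round(speedup_percent(b, a), 1)
        for core, b, a in zip(CORES, BEFORE, after)
    }


if __name__ == "__main__":
    for label, after in (("ldp/stp patch", AFTER_LDP_STP),
                         ("ld1 reordering", AFTER_REORDER)):
        print(label, summarize(after))
```

Laying the two variants side by side like this makes the tradeoff
explicit: each tuning helps a different subset of cores, which is the
reason for asking that per-core numbers go into the commit message.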