Date: Mon, 10 Feb 2025 14:57:16 +0200 (EET)
From: Martin Storsjö
To: Krzysztof Pyrkosz via ffmpeg-devel
Cc: Krzysztof Pyrkosz
Subject: Re: [FFmpeg-devel] [PATCH] avcodec/aarch64/opusdsp_neon: Simplify opus_postfilter_neon
References: <20250207194210.5469-2-ffmpeg@szaka.eu>

On Sat, 8 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote:

> On Sat, Feb 08, 2025 at 01:59:32AM +0100, Lynne wrote:
>> On 07/02/2025 20:42, Krzysztof Pyrkosz via ffmpeg-devel wrote:
>>> This change removes one extra floating point operation and simplifies
>>> load operations at the beginning of the loop by using dedicated register
>>> for each of the 5 pointers and interleaving it with calculations. The
>>> first case seems to be a bit slower, but the performance increase is
>>> substantial in the other two.
>>>
>>> A78 before:
>>> postfilter_15_neon:      1684.8 ( 4.23x)
>>> postfilter_512_neon:     1395.5 ( 5.10x)
>>> postfilter_1022_neon:    1357.0 ( 5.25x)
>>>
>>> After:
>>> postfilter_15_neon:      1742.2 ( 4.09x)
>>> postfilter_512_neon:     1169.8 ( 6.09x)
>>> postfilter_1022_neon:    1160.0 ( 6.12x)
>>>
>>>
>>> A72 before:
>>> postfilter_15_neon:      3144.8 ( 2.39x)
>>> postfilter_512_neon:     3141.2 ( 2.39x)
>>> postfilter_1022_neon:    3230.0 ( 2.33x)
>>>
>>> After:
>>> postfilter_15_neon:      2847.8 ( 2.64x)
>>> postfilter_512_neon:     2877.8 ( 2.61x)
>>> postfilter_1022_neon:    2837.2 ( 2.65x)
>>>
>>>
>>> x13s before:
>>> postfilter_15_neon:      1615.4 ( 2.61x)
>>> postfilter_512_neon:      963.1 ( 4.39x)
>>> postfilter_1022_neon:     963.6 ( 4.39x)
>>>
>>> After:
>>> postfilter_15_neon:      1749.6 ( 2.41x)
>>> postfilter_512_neon:      707.1 ( 5.97x)
>>> postfilter_1022_neon:     706.1 ( 5.99x)
>>>
>>>
>>> Krzysztof
>>>
>>> ---
>>>  libavcodec/aarch64/opusdsp_neon.S | 31 ++++++++++++-------------------
>>>  1 file changed, 12 insertions(+), 19 deletions(-)
>>>
>>> diff --git a/libavcodec/aarch64/opusdsp_neon.S b/libavcodec/aarch64/opusdsp_neon.S
>>> index 253825aa61..990fc44c70 100644
>>> --- a/libavcodec/aarch64/opusdsp_neon.S
>>> +++ b/libavcodec/aarch64/opusdsp_neon.S
>>> @@ -55,35 +55,28 @@ endfunc
>>>  function ff_opus_postfilter_neon, export=1
>>>          ld1             {v0.4s}, [x2]
>>> +        sub             x5, x0, w1, sxtw #2
>>> +        sub             x1, x5, #8
>>>          dup             v1.4s, v0.s[1]
>>>          dup             v2.4s, v0.s[2]
>>>          dup             v0.4s, v0.s[0]
>>> -        add             w1, w1, #2
>>> -        sub             x1, x0, x1, lsl #2
>>> -
>>> -        ld1             {v3.4s}, [x1]
>>> +        ld1             {v3.4s}, [x1], #16
>>> +        sub             x4, x5, #4
>>> +        add             x6, x5, #4
>>>          fmul            v3.4s, v3.4s, v2.4s
>>> -1:      add             x1, x1, #4
>>> -        ld1             {v4.4s}, [x1]
>>> -        add             x1, x1, #4
>>> -        ld1             {v5.4s}, [x1]
>>> -        add             x1, x1, #4
>>> -        ld1             {v6.4s}, [x1]
>>> -        add             x1, x1, #4
>>> -        ld1             {v7.4s}, [x1]
>>> -
>>> +1:      ld1             {v7.4s}, [x1], #16
>>> +        ld1             {v4.4s}, [x4], #16
>>>          fmla            v3.4s, v7.4s, v2.4s
>>> +        ld1             {v6.4s}, [x6], #16
>>> +        ld1             {v5.4s}, [x5], #16
>>>          fadd            v6.4s, v6.4s, v4.4s
>>> +        fmla            v3.4s, v5.4s, v0.4s
>>>          ld1             {v4.4s}, [x0]
>>> -        fmla            v4.4s, v5.4s, v0.4s
>>> -
>>> -        fmul            v6.4s, v6.4s, v1.4s
>>> -        fadd            v6.4s, v6.4s, v3.4s
>>> -
>>> -        fadd            v4.4s, v4.4s, v6.4s
>>> +        fmla            v3.4s, v6.4s, v1.4s
>>> +        fadd            v4.4s, v4.4s, v3.4s
>>>          fmul            v3.4s, v7.4s, v2.4s
>>>          st1             {v4.4s}, [x0], #16
>>
>> This function was written and optimized for A53 cores. The fmla chain is
>> very performance sensitive on in-order CPUs?
>> Could you post a perf difference from an in-order CPU?
>
> Here are my benchmarks on Raspberry Pi 3B+.
>
> Before:
> deemphasis_neon:         4121.8 ( 2.11x)
> postfilter_15_neon:      9405.2 ( 2.46x)
> postfilter_512_neon:     9457.0 ( 2.44x)
> postfilter_1022_neon:    9398.0 ( 2.46x)
>
> After:
> deemphasis_neon:         4135.8 ( 2.10x)
> postfilter_15_neon:      7967.2 ( 2.90x)
> postfilter_512_neon:     7980.2 ( 2.89x)
> postfilter_1022_neon:    7980.0 ( 2.89x)

I'm also getting similar numbers on another A53.

This patch looks good to me, so I pushed it. Thanks!

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
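
For readers following the diff, here is a rough scalar sketch of the two
operation orders in the loop body, read off the quoted assembly. It is not
the FFmpeg code: the function names, the gain values and the buffer layout
are made up for illustration, it handles one sample per pass where the NEON
code handles four, and Python's multiply-then-add is not fused, so rounding
will not match real fmla bit for bit. It only shows that the new order
produces the same filter with six floating point steps per pass instead of
seven.

```python
import random


def postfilter_old_order(data, start, period, gains, length):
    """Pre-patch operation order (hypothetical scalar model): 7 FP steps."""
    g0, g1, g2 = gains
    for i in range(start, start + length):
        c = i - period                      # centre tap of the 5-tap window
        acc = data[c - 2] * g2              # fmul (end of previous pass in the asm)
        acc += data[c + 2] * g2             # fmla v3, v7, v2
        side = data[c + 1] + data[c - 1]    # fadd v6, v6, v4
        out = data[i] + data[c] * g0        # fmla v4, v5, v0
        side *= g1                          # fmul v6, v6, v1
        side += acc                         # fadd v6, v6, v3
        data[i] = out + side                # fadd v4, v4, v6


def postfilter_new_order(data, start, period, gains, length):
    """Post-patch operation order (hypothetical scalar model): 6 FP steps."""
    g0, g1, g2 = gains
    for i in range(start, start + length):
        c = i - period
        acc = data[c - 2] * g2              # fmul (end of previous pass in the asm)
        acc += data[c + 2] * g2             # fmla v3, v7, v2
        side = data[c + 1] + data[c - 1]    # fadd v6, v6, v4
        acc += data[c] * g0                 # fmla v3, v5, v0
        acc += side * g1                    # fmla v3, v6, v1
        data[i] += acc                      # fadd v4, v4, v3


def max_order_difference(period=15, length=64, gains=(0.3, 0.2, 0.1), seed=0):
    """Run both orders on the same random input and return the largest gap.

    The gains are illustrative values, not taken from the Opus decoder.
    """
    rng = random.Random(seed)
    start = period + 2                      # history needed before the first output
    buf = [rng.uniform(-1.0, 1.0) for _ in range(start + length)]
    old, new = list(buf), list(buf)
    postfilter_old_order(old, start, period, gains, length)
    postfilter_new_order(new, start, period, gains, length)
    return max(abs(a - b) for a, b in zip(old, new))
```

Calling max_order_difference() with the defaults should return a value at
the level of floating point rounding noise, since the two orders differ
only in how the partial sums are grouped.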