From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.ffmpeg.org (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTPS id C89134DE9B for ; Thu, 5 Jun 2025 12:13:34 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.ffmpeg.org (Postfix) with ESMTP id 7535468CB73; Thu, 5 Jun 2025 15:13:30 +0300 (EEST) Received: from mail-lj1-f174.google.com (mail-lj1-f174.google.com [209.85.208.174]) by ffbox0-bg.ffmpeg.org (Postfix) with ESMTPS id 8CB8568C90D for ; Thu, 5 Jun 2025 15:13:23 +0300 (EEST) Received: by mail-lj1-f174.google.com with SMTP id 38308e7fff4ca-32a64f8480eso19945731fa.1 for ; Thu, 05 Jun 2025 05:13:23 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=martin-st.20230601.gappssmtp.com; s=20230601; t=1749125603; x=1749730403; darn=ffmpeg.org; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=SS/OxcwGLFnFXTwp7RmkKPtbxGYQoAaEAN/5TnNEEQ4=; b=mXWnQ7Xpwbpbkn5RmBDov7fZwEGN5stM/41U+9T+742/yK4K13gOLsaxVR0hZ9SSIY o2zjZJAiiS0PzVqUFsHvt7jWxFlZo0ERuMbnODV8Tps9dMjx+jPnZKe69uRkgYQuJiJW Vqrxa9//vH1uAY177FugFkueVCIbYJtg5M8cewOBfJu4lTZjP44/o43duHVKKFbpMffv OKI3ERAIO/V9PPNyAkgkaSzpPrOB8J8tF8vGBXG4PmfPskJtbrz6OLIAUWOqwT/Hq1Lp AiIaxY2eiQxW+2f458629L3OEAFyNf4Cct1nPwN3QZSj3pA1IbGhKYMuP0w3Ew01sWp3 pwLQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1749125603; x=1749730403; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=SS/OxcwGLFnFXTwp7RmkKPtbxGYQoAaEAN/5TnNEEQ4=; b=bXMpctD9NHm6eqGcZyvry3gOu3e/30r59jgGOPAzv7t1vuY0NgvchhIMR6G4G4mK4N fzDP+LKuKGzmgpDC0K045Levbu79IjBXJhdamLM/hMTn7yAjbnGJP7umYLISXQl77+1a HNVcVvJg9kg5Wn6YZRQMU2wx6+YAX/kjCORf2WlAyPWH+oLtIYH2FiVYaHZLzG1hk1p+ BrrON/dkdZEuofmCi+vNk1aqh54yRME4NVW+qFlzdrAQZ5Pdblfc4s/tcOOl7Frbjtrn EWB6uZddhboaxXNfm2/RNqugIEZnrCONWU9EoVctWUGYyxKS5NaT/jZRL36iCFcUYY7N AEGw== X-Gm-Message-State: AOJu0YyfioGlfuLFm/IH7tDoFF1aX1g4sy/3xCy2yupAYOdSV5U4qbRi yfVrGoaIl8+xY2AaihZiRO5q/H01b4FTQqdg4MBA0wbUwJKw6WIaCF5HJ1jKK11YNLFuG9FGIN7 JsBLvMQ== X-Gm-Gg: ASbGncs7CQ6hlTsllZ3xOJwBkwdz7ilUvAiMtifzQaQWuolkidSAztKzyVnej43qkZP 6URS/oIWAKRzVFmx/CVT99vjLtOlKo+wpsp4lbd4I+evW38EnQjg2XouZbOlpwN6BPIfFSi3rDY N3pPxbU5B/SwtKsd7NIeNk6VzaTb25iYhYXyAjaDDNmvuZ5/lP8RAKiAsaqcN3jq3VAdQXOQEom HJWMKKq8qX2+A1YtPlmHhDQZcfQBpx/4w5gjdGK/WiNqmGVngNEoYwc0NHglFM/V+XuE6hTe0qR uaV+QuaA7q/O8bz8rxPg659NMi8gAYXHpCUJtYkotXAk1oeiW3K7/T6pcpOc6EL2KP0IO7mqsGE ilyHRsmFXezjPTiI63Q9oDmquGZBx8Op8UDHGUCXLHu3XLAEUJKAdOukHGg== X-Google-Smtp-Source: AGHT+IEHNsyTBcpp5GFBoMlQPNjhJ8J8Ff0z4E4UEY8CDcWI9JWFEtzYSTA7v7cqVjP/ObIARhEgbA== X-Received: by 2002:a2e:a009:0:10b0:308:e956:66e with SMTP id 38308e7fff4ca-32ad10a771bmr5633421fa.0.1749125602452; Thu, 05 Jun 2025 05:13:22 -0700 (PDT) Received: from tunnel335574-pt.tunnel.tserv24.sto1.ipv6.he.net (tunnel335574-pt.tunnel.tserv24.sto1.ipv6.he.net. [2001:470:27:11::2]) by smtp.gmail.com with ESMTPSA id 38308e7fff4ca-32a85b1a699sm26648211fa.23.2025.06.05.05.13.21 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Thu, 05 Jun 2025 05:13:21 -0700 (PDT) Date: Thu, 5 Jun 2025 15:13:21 +0300 (EEST) From: =?ISO-8859-15?Q?Martin_Storsj=F6?= To: FFmpeg development discussions and patches In-Reply-To: Message-ID: <333d901d-591b-baeb-c56a-8ac028d3105d@martin.st> References: <20250531091631.45342-1-dmtr.kovalenko@outlook.com> MIME-Version: 1.0 Subject: Re: [FFmpeg-devel] [PATCH 2/2] swscale: Neon rgb_to_yuv_half process 32 pixels at a time X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Dmitriy Kovalenko Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: On Sat, 31 May 2025, Dmitriy Kovalenko wrote: > This patch integrates so called double bufferring when we are loading > 2 batch of elements at a time and then processing them in parallel. On the > moden arm processors especially Apple Silicon it gives a visible > benefit, for subsampled pixel processing it is especially nice because > it allows to read elements w/ 2 instructions and write with a single one > (especially visible on a platforms with slower memory like ios). > > Including the previous patch in a stack on macbook pro m4 max rgb_to_yuv_half > in checkasm goes up 2x of the c version Please quote actual checkasm numbers, which shows exactly which tests were run and their numbers, before/after the change. > --- > libswscale/aarch64/input.S | 130 ++++++++++++++++++++++++++----------- > 1 file changed, 91 insertions(+), 39 deletions(-) > > diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S > index 260a26e965..b90ca05996 100644 > --- a/libswscale/aarch64/input.S > +++ b/libswscale/aarch64/input.S > @@ -178,7 +178,7 @@ rgbToY_neon abgr32, argb32, element=4, alpha_first=1 > > .macro rgbToUV_half_neon fmt_bgr, fmt_rgb, element, alpha_first=0 > function ff_\fmt_bgr\()ToUV_half_neon, export=1 > - cbz w5, 3f // check width > 0 > + cbz w5, 3f Unrelated change, you're just removing a comment. Don't intermix unrelated changes in a patch. > ldp w12, w11, [x6, #12] > ldp w10, w15, [x6, #20] > @@ -187,49 +187,101 @@ function ff_\fmt_bgr\()ToUV_half_neon, export=1 > endfunc > > function ff_\fmt_rgb\()ToUV_half_neon, export=1 > - cmp w5, #0 // check width > 0 > + cmp w5, #0 > b.le 3f Unrelated change; only removing a comment. > > - ldp w10, w11, [x6, #12] // w10: ru, w11: gu > - ldp w12, w13, [x6, #20] // w12: bu, w13: rv > - ldp w14, w15, [x6, #28] // w14: gv, w15: bv > + ldp w10, w11, [x6, #12] > + ldp w12, w13, [x6, #20] > + ldp w14, w15, [x6, #28] Unrelated change, removing comments. > 4: > - cmp w5, #8 > rgb_set_uv_coeff half=1 > + > + cmp w5, #16 > b.lt 2f Any specific reason for moving the cmp instruction to after the rgb_set_uv_coeff macro? By having it directly before the b.lt instruction that depends on the cmp instruction, you're introducing stalls on in-order cores. > -1: // load 16 pixels > + > +1: > .if \element == 3 > ld3 { v16.16b, v17.16b, v18.16b }, [x3], #48 > + ld3 { v26.16b, v27.16b, v28.16b }, [x3], #48 > .else > ld4 { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64 > + ld4 { v26.16b, v27.16b, v28.16b, v29.16b }, [x3], #64 > .endif > > .if \alpha_first > - uaddlp v21.8h, v19.16b // v21: summed b pairs > - uaddlp v20.8h, v18.16b // v20: summed g pairs > - uaddlp v19.8h, v17.16b // v19: summed r pairs > + uaddlp v21.8h, v19.16b > + uaddlp v20.8h, v18.16b > + uaddlp v19.8h, v17.16b > + uaddlp v31.8h, v29.16b > + uaddlp v30.8h, v28.16b > + uaddlp v29.8h, v27.16b Here, you're removing comments that you yourself added in the preceding commit. Don't do that; if you don't want the comments there, don't add them in the first commit. With that in mind, you could entirely drop that change in the first commit anyway, there's no need to touch those lines there as you're only adding comments anyway. > .else > - uaddlp v19.8h, v16.16b // v19: summed r pairs > - uaddlp v20.8h, v17.16b // v20: summed g pairs > - uaddlp v21.8h, v18.16b // v21: summed b pairs > + uaddlp v19.8h, v16.16b > + uaddlp v20.8h, v17.16b > + uaddlp v21.8h, v18.16b > + uaddlp v29.8h, v26.16b > + uaddlp v30.8h, v27.16b > + uaddlp v31.8h, v28.16b > .endif > > - mov v22.16b, v6.16b // U first half > - mov v23.16b, v6.16b // U second half > - mov v24.16b, v6.16b // V first half > - mov v25.16b, v6.16b // V second half > - > - rgb_to_uv_interleaved_product v19, v20, v21, v0, v1, v2, v3, v4, v5, v22, v23, v24, v25, v16, v17, #10 > - > - str q16, [x0], #16 // store dst_u > - str q17, [x1], #16 // store dst_v > + mov v7.16b, v6.16b > + mov v16.16b, v6.16b > + mov v17.16b, v6.16b > + mov v18.16b, v6.16b > + mov v26.16b, v6.16b > + mov v27.16b, v6.16b > + mov v28.16b, v6.16b > + mov v25.16b, v6.16b > > - sub w5, w5, #8 // width -= 8 > - cmp w5, #8 // width >= 8 ? > + smlal v7.4s, v0.4h, v19.4h > + smlal v17.4s, v3.4h, v19.4h > + smlal v26.4s, v0.4h, v29.4h > + smlal v28.4s, v3.4h, v29.4h Here you deinline the macro. Is there a significant perfomance benefit of doing that, over just calling the macro twice? The long commentless smlal/smlal2 sequence here feels less readable than what we had before. > + > + smlal2 v16.4s, v0.8h, v19.8h > + smlal2 v18.4s, v3.8h, v19.8h > + smlal2 v27.4s, v0.8h, v29.8h > + smlal2 v25.4s, v3.8h, v29.8h > + > + smlal v7.4s, v1.4h, v20.4h > + smlal v17.4s, v4.4h, v20.4h > + smlal v26.4s, v1.4h, v30.4h > + smlal v28.4s, v4.4h, v30.4h > + > + smlal2 v16.4s, v1.8h, v20.8h > + smlal2 v18.4s, v4.8h, v20.8h > + smlal2 v27.4s, v1.8h, v30.8h > + smlal2 v25.4s, v4.8h, v30.8h > + > + smlal v7.4s, v2.4h, v21.4h > + smlal v17.4s, v5.4h, v21.4h > + smlal v26.4s, v2.4h, v31.4h > + smlal v28.4s, v5.4h, v31.4h > + > + smlal2 v16.4s, v2.8h, v21.8h > + smlal2 v18.4s, v5.8h, v21.8h > + smlal2 v27.4s, v2.8h, v31.8h > + smlal2 v25.4s, v5.8h, v31.8h > + > + sqshrn v19.4h, v7.4s, #10 > + sqshrn v20.4h, v17.4s, #10 > + sqshrn v22.4h, v26.4s, #10 > + sqshrn v23.4h, v28.4s, #10 > + > + sqshrn2 v19.8h, v16.4s, #10 > + sqshrn2 v20.8h, v18.4s, #10 > + sqshrn2 v22.8h, v27.4s, #10 > + sqshrn2 v23.8h, v25.4s, #10 > + > + stp q19, q22, [x0], #32 > + stp q20, q23, [x1], #32 > + > + sub w5, w5, #16 > + cmp w5, #16 > b.ge 1b > - cbz w5, 3f // No pixels left? Exit > + cbz w5, 3f > > -2: // Scalar fallback for remaining pixels > +2: > .if \alpha_first > rgb_load_add_half 1, 5, 2, 6, 3, 7 > .else > @@ -239,24 +291,24 @@ function ff_\fmt_rgb\()ToUV_half_neon, export=1 > rgb_load_add_half 0, 4, 1, 5, 2, 6 > .endif > .endif > - smaddl x8, w2, w10, x9 // dst_u = ru * r + const_offset > - smaddl x16, w2, w13, x9 // dst_v = rv * r + const_offset (parallel) > + smaddl x8, w2, w10, x9 > + smaddl x16, w2, w13, x9 Unrelated change. > > - smaddl x8, w4, w11, x8 // dst_u += gu * g > - smaddl x16, w4, w14, x16 // dst_v += gv * g (parallel) > + smaddl x8, w4, w11, x8 > + smaddl x16, w4, w14, x16 Unrelated change. > > - smaddl x8, w7, w12, x8 // dst_u += bu * b > - smaddl x16, w7, w15, x16 // dst_v += bv * b (parallel) > + smaddl x8, w7, w12, x8 > + smaddl x16, w7, w15, x16 Unrelated change. > > - asr w8, w8, #10 // dst_u >>= 10 > - asr w16, w16, #10 // dst_v >>= 10 > + asr w8, w8, #10 > + asr w16, w16, #10 Unrelated change. > > - strh w8, [x0], #2 // store dst_u > - strh w16, [x1], #2 // store dst_v > + strh w8, [x0], #2 > + strh w16, [x1], #2 Unrelated change. > > - sub w5, w5, #1 // width-- > - add x3, x3, #(2*\element) // Advance source pointer > - cbnz w5, 2b // Process next pixel if any left > + sub w5, w5, #1 > + add x3, x3, #(2*\element) > + cbnz w5, 2b Unrelated change. // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".