From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.ffmpeg.org (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTPS id C89134DE9B
	for <ffmpegdev@gitmailbox.com>; Thu,  5 Jun 2025 12:13:34 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.ffmpeg.org (Postfix) with ESMTP id 7535468CB73;
	Thu,  5 Jun 2025 15:13:30 +0300 (EEST)
Received: from mail-lj1-f174.google.com (mail-lj1-f174.google.com
 [209.85.208.174])
 by ffbox0-bg.ffmpeg.org (Postfix) with ESMTPS id 8CB8568C90D
 for <ffmpeg-devel@ffmpeg.org>; Thu,  5 Jun 2025 15:13:23 +0300 (EEST)
Received: by mail-lj1-f174.google.com with SMTP id
 38308e7fff4ca-32a64f8480eso19945731fa.1
 for <ffmpeg-devel@ffmpeg.org>; Thu, 05 Jun 2025 05:13:23 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=martin-st.20230601.gappssmtp.com; s=20230601; t=1749125603; x=1749730403;
 darn=ffmpeg.org; 
 h=mime-version:references:message-id:in-reply-to:subject:cc:to:from
 :date:from:to:cc:subject:date:message-id:reply-to;
 bh=SS/OxcwGLFnFXTwp7RmkKPtbxGYQoAaEAN/5TnNEEQ4=;
 b=mXWnQ7Xpwbpbkn5RmBDov7fZwEGN5stM/41U+9T+742/yK4K13gOLsaxVR0hZ9SSIY
 o2zjZJAiiS0PzVqUFsHvt7jWxFlZo0ERuMbnODV8Tps9dMjx+jPnZKe69uRkgYQuJiJW
 Vqrxa9//vH1uAY177FugFkueVCIbYJtg5M8cewOBfJu4lTZjP44/o43duHVKKFbpMffv
 OKI3ERAIO/V9PPNyAkgkaSzpPrOB8J8tF8vGBXG4PmfPskJtbrz6OLIAUWOqwT/Hq1Lp
 AiIaxY2eiQxW+2f458629L3OEAFyNf4Cct1nPwN3QZSj3pA1IbGhKYMuP0w3Ew01sWp3
 pwLQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
 d=1e100.net; s=20230601; t=1749125603; x=1749730403;
 h=mime-version:references:message-id:in-reply-to:subject:cc:to:from
 :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
 bh=SS/OxcwGLFnFXTwp7RmkKPtbxGYQoAaEAN/5TnNEEQ4=;
 b=bXMpctD9NHm6eqGcZyvry3gOu3e/30r59jgGOPAzv7t1vuY0NgvchhIMR6G4G4mK4N
 fzDP+LKuKGzmgpDC0K045Levbu79IjBXJhdamLM/hMTn7yAjbnGJP7umYLISXQl77+1a
 HNVcVvJg9kg5Wn6YZRQMU2wx6+YAX/kjCORf2WlAyPWH+oLtIYH2FiVYaHZLzG1hk1p+
 BrrON/dkdZEuofmCi+vNk1aqh54yRME4NVW+qFlzdrAQZ5Pdblfc4s/tcOOl7Frbjtrn
 EWB6uZddhboaxXNfm2/RNqugIEZnrCONWU9EoVctWUGYyxKS5NaT/jZRL36iCFcUYY7N
 AEGw==
X-Gm-Message-State: AOJu0YyfioGlfuLFm/IH7tDoFF1aX1g4sy/3xCy2yupAYOdSV5U4qbRi
 yfVrGoaIl8+xY2AaihZiRO5q/H01b4FTQqdg4MBA0wbUwJKw6WIaCF5HJ1jKK11YNLFuG9FGIN7
 JsBLvMQ==
X-Gm-Gg: ASbGncs7CQ6hlTsllZ3xOJwBkwdz7ilUvAiMtifzQaQWuolkidSAztKzyVnej43qkZP
 6URS/oIWAKRzVFmx/CVT99vjLtOlKo+wpsp4lbd4I+evW38EnQjg2XouZbOlpwN6BPIfFSi3rDY
 N3pPxbU5B/SwtKsd7NIeNk6VzaTb25iYhYXyAjaDDNmvuZ5/lP8RAKiAsaqcN3jq3VAdQXOQEom
 HJWMKKq8qX2+A1YtPlmHhDQZcfQBpx/4w5gjdGK/WiNqmGVngNEoYwc0NHglFM/V+XuE6hTe0qR
 uaV+QuaA7q/O8bz8rxPg659NMi8gAYXHpCUJtYkotXAk1oeiW3K7/T6pcpOc6EL2KP0IO7mqsGE
 ilyHRsmFXezjPTiI63Q9oDmquGZBx8Op8UDHGUCXLHu3XLAEUJKAdOukHGg==
X-Google-Smtp-Source: AGHT+IEHNsyTBcpp5GFBoMlQPNjhJ8J8Ff0z4E4UEY8CDcWI9JWFEtzYSTA7v7cqVjP/ObIARhEgbA==
X-Received: by 2002:a2e:a009:0:10b0:308:e956:66e with SMTP id
 38308e7fff4ca-32ad10a771bmr5633421fa.0.1749125602452; 
 Thu, 05 Jun 2025 05:13:22 -0700 (PDT)
Received: from tunnel335574-pt.tunnel.tserv24.sto1.ipv6.he.net
 (tunnel335574-pt.tunnel.tserv24.sto1.ipv6.he.net. [2001:470:27:11::2])
 by smtp.gmail.com with ESMTPSA id
 38308e7fff4ca-32a85b1a699sm26648211fa.23.2025.06.05.05.13.21
 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
 Thu, 05 Jun 2025 05:13:21 -0700 (PDT)
Date: Thu, 5 Jun 2025 15:13:21 +0300 (EEST)
From: =?ISO-8859-15?Q?Martin_Storsj=F6?= <martin@martin.st>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
In-Reply-To: <DBAP193MB0956E3B62E794F364ED459C98D60A@DBAP193MB0956.EURP193.PROD.OUTLOOK.COM>
Message-ID: <333d901d-591b-baeb-c56a-8ac028d3105d@martin.st>
References: <20250531091631.45342-1-dmtr.kovalenko@outlook.com>
 <DBAP193MB0956E3B62E794F364ED459C98D60A@DBAP193MB0956.EURP193.PROD.OUTLOOK.COM>
MIME-Version: 1.0
Subject: Re: [FFmpeg-devel] [PATCH 2/2] swscale: Neon rgb_to_yuv_half
 process 32 pixels at a time
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Dmitriy Kovalenko <dmtr.kovalenko@outlook.com>
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/333d901d-591b-baeb-c56a-8ac028d3105d@martin.st/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

On Sat, 31 May 2025, Dmitriy Kovalenko wrote:

> This patch integrates so called double bufferring when we are loading
> 2 batch of elements at a time and then processing them in parallel. On the
> moden arm processors especially Apple Silicon it gives a visible
> benefit, for subsampled pixel processing it is especially nice because
> it allows to read elements w/ 2 instructions and write with a single one
> (especially visible on a platforms with slower memory like ios).
>
> Including the previous patch in a stack on macbook pro m4 max rgb_to_yuv_half
> in checkasm goes up 2x of the c version

Please quote actual checkasm numbers, which shows exactly which tests were 
run and their numbers, before/after the change.

> ---
> libswscale/aarch64/input.S | 130 ++++++++++++++++++++++++++-----------
> 1 file changed, 91 insertions(+), 39 deletions(-)
>
> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
> index 260a26e965..b90ca05996 100644
> --- a/libswscale/aarch64/input.S
> +++ b/libswscale/aarch64/input.S
> @@ -178,7 +178,7 @@ rgbToY_neon abgr32, argb32, element=4, alpha_first=1
>
> .macro rgbToUV_half_neon fmt_bgr, fmt_rgb, element, alpha_first=0
> function ff_\fmt_bgr\()ToUV_half_neon, export=1
> -        cbz             w5, 3f          // check width > 0
> +        cbz             w5, 3f

Unrelated change, you're just removing a comment. Don't intermix unrelated 
changes in a patch.

>         ldp             w12, w11, [x6, #12]
>         ldp             w10, w15, [x6, #20]
> @@ -187,49 +187,101 @@ function ff_\fmt_bgr\()ToUV_half_neon, export=1
> endfunc
>
> function ff_\fmt_rgb\()ToUV_half_neon, export=1
> -        cmp             w5, #0                  // check width > 0
> +        cmp             w5, #0
>         b.le            3f

Unrelated change; only removing a comment.

>
> -        ldp             w10, w11, [x6, #12]     // w10: ru, w11: gu
> -        ldp             w12, w13, [x6, #20]     // w12: bu, w13: rv
> -        ldp             w14, w15, [x6, #28]     // w14: gv, w15: bv
> +        ldp             w10, w11, [x6, #12]
> +        ldp             w12, w13, [x6, #20]
> +        ldp             w14, w15, [x6, #28]

Unrelated change, removing comments.

> 4:
> -        cmp             w5, #8
>         rgb_set_uv_coeff half=1
> +
> +        cmp             w5, #16
>         b.lt            2f

Any specific reason for moving the cmp instruction to after the 
rgb_set_uv_coeff macro? By having it directly before the b.lt instruction 
that depends on the cmp instruction, you're introducing stalls on in-order 
cores.

> -1:  // load 16 pixels
> +
> +1:
>     .if \element == 3
>         ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48
> +        ld3             { v26.16b, v27.16b, v28.16b }, [x3], #48
>     .else
>         ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64
> +        ld4             { v26.16b, v27.16b, v28.16b, v29.16b }, [x3], #64
>     .endif
>
>     .if \alpha_first
> -        uaddlp          v21.8h, v19.16b         // v21: summed b pairs
> -        uaddlp          v20.8h, v18.16b         // v20: summed g pairs
> -        uaddlp          v19.8h, v17.16b         // v19: summed r pairs
> +        uaddlp          v21.8h, v19.16b
> +        uaddlp          v20.8h, v18.16b
> +        uaddlp          v19.8h, v17.16b
> +        uaddlp          v31.8h, v29.16b
> +        uaddlp          v30.8h, v28.16b
> +        uaddlp          v29.8h, v27.16b

Here, you're removing comments that you yourself added in the preceding 
commit. Don't do that; if you don't want the comments there, don't add 
them in the first commit.

With that in mind, you could entirely drop that change in the first commit 
anyway, there's no need to touch those lines there as you're only adding 
comments anyway.

>     .else
> -        uaddlp          v19.8h, v16.16b         // v19: summed r pairs
> -        uaddlp          v20.8h, v17.16b         // v20: summed g pairs
> -        uaddlp          v21.8h, v18.16b         // v21: summed b pairs
> +        uaddlp          v19.8h, v16.16b
> +        uaddlp          v20.8h, v17.16b
> +        uaddlp          v21.8h, v18.16b
> +        uaddlp          v29.8h, v26.16b
> +        uaddlp          v30.8h, v27.16b
> +        uaddlp          v31.8h, v28.16b
>     .endif
>
> -        mov             v22.16b, v6.16b         // U first half
> -        mov             v23.16b, v6.16b         // U second half
> -        mov             v24.16b, v6.16b         // V first half
> -        mov             v25.16b, v6.16b         // V second half
> -
> -        rgb_to_uv_interleaved_product v19, v20, v21, v0, v1, v2, v3, v4, v5, v22, v23, v24, v25, v16, v17, #10
> -
> -        str             q16, [x0], #16          // store dst_u
> -        str             q17, [x1], #16          // store dst_v
> +        mov             v7.16b, v6.16b
> +        mov             v16.16b, v6.16b
> +        mov             v17.16b, v6.16b
> +        mov             v18.16b, v6.16b
> +        mov             v26.16b, v6.16b
> +        mov             v27.16b, v6.16b
> +        mov             v28.16b, v6.16b
> +        mov             v25.16b, v6.16b
>
> -        sub             w5, w5, #8              // width -= 8
> -        cmp             w5, #8                  // width >= 8 ?
> +        smlal           v7.4s, v0.4h, v19.4h
> +        smlal           v17.4s, v3.4h, v19.4h
> +        smlal           v26.4s, v0.4h, v29.4h
> +        smlal           v28.4s, v3.4h, v29.4h

Here you deinline the macro. Is there a significant perfomance benefit of 
doing that, over just calling the macro twice? The long commentless 
smlal/smlal2 sequence here feels less readable than what we had before.

> +
> +        smlal2          v16.4s, v0.8h, v19.8h
> +        smlal2          v18.4s, v3.8h, v19.8h
> +        smlal2          v27.4s, v0.8h, v29.8h
> +        smlal2          v25.4s, v3.8h, v29.8h
> +
> +        smlal           v7.4s, v1.4h, v20.4h
> +        smlal           v17.4s, v4.4h, v20.4h
> +        smlal           v26.4s, v1.4h, v30.4h
> +        smlal           v28.4s, v4.4h, v30.4h
> +
> +        smlal2          v16.4s, v1.8h, v20.8h
> +        smlal2          v18.4s, v4.8h, v20.8h
> +        smlal2          v27.4s, v1.8h, v30.8h
> +        smlal2          v25.4s, v4.8h, v30.8h
> +
> +        smlal           v7.4s, v2.4h, v21.4h
> +        smlal           v17.4s, v5.4h, v21.4h
> +        smlal           v26.4s, v2.4h, v31.4h
> +        smlal           v28.4s, v5.4h, v31.4h
> +
> +        smlal2          v16.4s, v2.8h, v21.8h
> +        smlal2          v18.4s, v5.8h, v21.8h
> +        smlal2          v27.4s, v2.8h, v31.8h
> +        smlal2          v25.4s, v5.8h, v31.8h
> +
> +        sqshrn          v19.4h, v7.4s, #10
> +        sqshrn          v20.4h, v17.4s, #10
> +        sqshrn          v22.4h, v26.4s, #10
> +        sqshrn          v23.4h, v28.4s, #10
> +
> +        sqshrn2         v19.8h, v16.4s, #10
> +        sqshrn2         v20.8h, v18.4s, #10
> +        sqshrn2         v22.8h, v27.4s, #10
> +        sqshrn2         v23.8h, v25.4s, #10
> +
> +        stp             q19, q22, [x0], #32
> +        stp             q20, q23, [x1], #32
> +
> +        sub             w5, w5, #16
> +        cmp             w5, #16
>         b.ge            1b
> -        cbz             w5, 3f                  // No pixels left? Exit
> +        cbz             w5, 3f
>
> -2:      // Scalar fallback for remaining pixels
> +2:
> .if \alpha_first
>         rgb_load_add_half 1, 5, 2, 6, 3, 7
> .else
> @@ -239,24 +291,24 @@ function ff_\fmt_rgb\()ToUV_half_neon, export=1
>         rgb_load_add_half 0, 4, 1, 5, 2, 6
>     .endif
> .endif
> -        smaddl          x8, w2, w10, x9         // dst_u = ru * r + const_offset
> -        smaddl          x16, w2, w13, x9        // dst_v = rv * r + const_offset (parallel)
> +        smaddl          x8, w2, w10, x9
> +        smaddl          x16, w2, w13, x9

Unrelated change.

>
> -        smaddl          x8, w4, w11, x8         // dst_u += gu * g
> -        smaddl          x16, w4, w14, x16       // dst_v += gv * g (parallel)
> +        smaddl          x8, w4, w11, x8
> +        smaddl          x16, w4, w14, x16

Unrelated change.

>
> -        smaddl          x8, w7, w12, x8         // dst_u += bu * b
> -        smaddl          x16, w7, w15, x16       // dst_v += bv * b (parallel)
> +        smaddl          x8, w7, w12, x8
> +        smaddl          x16, w7, w15, x16

Unrelated change.

>
> -        asr             w8, w8, #10             // dst_u >>= 10
> -        asr             w16, w16, #10           // dst_v >>= 10
> +        asr             w8, w8, #10
> +        asr             w16, w16, #10

Unrelated change.

>
> -        strh            w8, [x0], #2            // store dst_u
> -        strh            w16, [x1], #2           // store dst_v
> +        strh            w8, [x0], #2
> +        strh            w16, [x1], #2

Unrelated change.

>
> -        sub             w5, w5, #1              // width--
> -        add             x3, x3, #(2*\element)   // Advance source pointer
> -        cbnz            w5, 2b                  // Process next pixel if any left
> +        sub             w5, w5, #1
> +        add             x3, x3, #(2*\element)
> +        cbnz            w5, 2b


Unrelated change.

// Martin

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".