Date: Thu, 5 Jun 2025 15:00:44 +0300 (EEST)
From: Martin Storsjö <martin@martin.st>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
In-Reply-To: <DBAP193MB0956F9E8C72D5F6E84578A7E8D60A@DBAP193MB0956.EURP193.PROD.OUTLOOK.COM>
Message-ID: <a9af9ffb-be49-4f35-79e5-edb589a1d19b@martin.st>
References: <20250531091631.45342-1-dmtr.kovalenko@outlook.com>
 <DBAP193MB0956F9E8C72D5F6E84578A7E8D60A@DBAP193MB0956.EURP193.PROD.OUTLOOK.COM>
MIME-Version: 1.0
Subject: Re: [FFmpeg-devel] [PATCH 1/2] swscale: rgb_to_yuv neon
 optimizations
Cc: Dmitriy Kovalenko <dmtr.kovalenko@outlook.com>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/a9af9ffb-be49-4f35-79e5-edb589a1d19b@martin.st/>

On Sat, 31 May 2025, Dmitriy Kovalenko wrote:

> I've found quite a few ways to optimize ffmpeg's existing rgb to yuv
> subsampled conversion. In this patch stack I'll try to
> improve the performance.
>
> This particular set of changes is a small improvement to all the
> existing functions and macros. The biggest performance gain is
> coming from post loading increment of the pointer and immediate
> ~~prefetching of the memory blocks~~(was moved to the next patch in the stack) and interleaving the multiplication shifting operations of
> different registers for better scheduling.

Why keep the mention of prefetching here, when it is no longer included in
the patch at all? Remember that this text is what you are proposing as the
final, immutable commit message describing this change.

I have further inline comments below, please read them all.

> Also changed a bunch of places where cmp + b.le was used instead
> of one instruction cbnz/tbnz and some other small cleanups.
>
> Here are checkasm results on the macbook pro with the latest M4 max
>
> <before>
>
> bgra_to_uv_1080_c:                                     257.5 ( 1.00x)
> bgra_to_uv_1080_neon:                                  211.9 ( 1.22x)
> bgra_to_uv_1920_c:                                     467.1 ( 1.00x)
> bgra_to_uv_1920_neon:                                  379.3 ( 1.23x)
> bgra_to_uv_half_1080_c:                                198.9 ( 1.00x)
> bgra_to_uv_half_1080_neon:                             125.7 ( 1.58x)
> bgra_to_uv_half_1920_c:                                346.3 ( 1.00x)
> bgra_to_uv_half_1920_neon:                             223.7 ( 1.55x)
>
> <after>
>
> bgra_to_uv_1080_c:                                     268.3 ( 1.00x)
> bgra_to_uv_1080_neon:                                  176.0 ( 1.53x)
> bgra_to_uv_1920_c:                                     456.6 ( 1.00x)
> bgra_to_uv_1920_neon:                                  307.7 ( 1.48x)
> bgra_to_uv_half_1080_c:                                193.2 ( 1.00x)
> bgra_to_uv_half_1080_neon:                              96.8 ( 2.00x)
> bgra_to_uv_half_1920_c:                                347.2 ( 1.00x)
> bgra_to_uv_half_1920_neon:                             182.6 ( 1.92x)
>
> With my proprietary test on iOS it gives around a 70% performance
> improvement when converting a bgra 1920x1920 image to yuv420p.
>
> On my linux arm cortex-r processor the performance improvement is not that
> visible, but it is still consistently faster by 5-10% than the current
> implementation.
> ---
> libswscale/aarch64/input.S | 143 +++++++++++++++++++++++--------------
> 1 file changed, 91 insertions(+), 52 deletions(-)

> @@ -292,7 +330,7 @@ function ff_\fmt_rgb\()ToUV_neon, export=1
>         smaddl          x8, w16, w10, x9        // x8 = ru * r + const_offset
>         smaddl          x8, w17, w11, x8        // x8 += gu * g
>         smaddl          x8, w4, w12, x8         // x8 += bu * b
> -        asr             w8, w8, #9              // x8 >>= 9
> +        asr             x8, x8, #9              // x8 >>= 9
>         strh            w8, [x0], #2            // store to dst_u
>

Here you _still_ have one instance of these unrelated changes (the w8 -> x8
register width change in the asr) left in your patch.
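
As a rough illustration of why this hunk is separate from the optimization itself, the C sketch below models the halfword that `strh w8, [x0], #2` stores after each form of the shift. The function names are hypothetical and not part of libswscale, and the sketch assumes a compiler where `>>` on a negative signed value is an arithmetic shift and where narrowing to `int32_t` wraps (as GCC and Clang do). Under those assumptions, only bits 9..24 of the accumulator reach memory either way.

```c
#include <stdint.h>

/* Models `asr w8, w8, #9` followed by `strh w8, [x0]`:
 * only the low 32 bits of the accumulator take part in the shift.
 * Assumes narrowing to int32_t wraps and `>>` is an arithmetic shift. */
static uint16_t stored_after_asr_w(int64_t acc)
{
    int32_t low = (int32_t)(uint32_t)(uint64_t)acc;
    return (uint16_t)(uint32_t)(low >> 9);
}

/* Models `asr x8, x8, #9` followed by `strh w8, [x0]`:
 * the full 64-bit accumulator is shifted, then the low halfword is stored. */
static uint16_t stored_after_asr_x(int64_t acc)
{
    return (uint16_t)(uint64_t)(acc >> 9);
}

/* Hypothetical helper: counts accumulator values in [lo, hi], sampled every
 * `step`, for which the two variants would store a different halfword. */
static long count_mismatches(int64_t lo, int64_t hi, int64_t step)
{
    long mismatches = 0;
    for (int64_t acc = lo; acc <= hi; acc += step) {
        if (stored_after_asr_w(acc) != stored_after_asr_x(acc))
            mismatches++;
    }
    return mismatches;
}
```

Calling `count_mismatches()` over a wide range of accumulators, including values that do not fit in 32 bits, is one way to check that the register width in the `asr` does not affect the stored result in this model.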

>         smaddl          x8, w16, w13, x9        // x8 = rv * r + const_offset
> @@ -401,3 +439,4 @@ endfunc
>
> DISABLE_DOTPROD
> #endif
> +
> --

Here you are adding one unrelated empty line at the end of the file. Don't 
include any unrelated changes in your patches.

Before sending a patch, do review it yourself first, checking for any such 
unrelated stray changes.

Other than those details, the rest of the patch looks ok.

// Martin
