From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 14 Aug 2024 15:31:12 +0300 (EEST)
From: =?ISO-8859-15?Q?Martin_Storsj=F6?= <martin@martin.st>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
In-Reply-To: <20240809112612.107000-4-ramiro.polla@gmail.com>
Message-ID: <79af6ca-f4b2-6f51-1b4d-e35f8331991@martin.st>
References: <20240809112612.107000-1-ramiro.polla@gmail.com>
 <20240809112612.107000-4-ramiro.polla@gmail.com>
MIME-Version: 1.0
Subject: Re: [FFmpeg-devel] [PATCH 4/4] swscale/aarch64: add nv24/nv42 to
 yuv420p unscaled converter
Content-Transfer-Encoding: 7bit
Content-Type: text/plain; charset="us-ascii"; Format="flowed"
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/79af6ca-f4b2-6f51-1b4d-e35f8331991@martin.st/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>

On Fri, 9 Aug 2024, Ramiro Polla wrote:

> checkasm --bench for Raspberry Pi 5 Model B Rev 1.0:
> nv24_yuv420p_128_c: 423.0
> nv24_yuv420p_128_neon: 115.7
> nv24_yuv420p_1920_c: 5939.5
> nv24_yuv420p_1920_neon: 1339.7
> nv42_yuv420p_128_c: 423.2
> nv42_yuv420p_128_neon: 115.7
> nv42_yuv420p_1920_c: 5907.5
> nv42_yuv420p_1920_neon: 1342.5
> ---
> libswscale/aarch64/Makefile                |  1 +
> libswscale/aarch64/swscale_unscaled.c      | 30 +++++++++
> libswscale/aarch64/swscale_unscaled_neon.S | 75 ++++++++++++++++++++++
> 3 files changed, 106 insertions(+)
> create mode 100644 libswscale/aarch64/swscale_unscaled_neon.S

> diff --git a/libswscale/aarch64/swscale_unscaled_neon.S b/libswscale/aarch64/swscale_unscaled_neon.S
> new file mode 100644
> index 0000000000..a206fda41f
> --- /dev/null
> +++ b/libswscale/aarch64/swscale_unscaled_neon.S
> @@ -0,0 +1,75 @@
> +/*
> + * Copyright (c) 2024 Ramiro Polla
> + *
> + * This file is part of FFmpeg.
> + *
> + * FFmpeg is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * FFmpeg is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with FFmpeg; if not, write to the Free Software
> + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
> + */
> +
> +#include "libavutil/aarch64/asm.S"
> +
> +function ff_nv24_to_yuv420p_chroma_neon, export=1
> +// x0  uint8_t *dst1
> +// x1  int dstStride1
> +// x2  uint8_t *dst2
> +// x3  int dstStride2
> +// x4  const uint8_t *src
> +// x5  int srcStride
> +// w6  int w
> +// w7  int h
> +
> +        uxtw            x1, w1
> +        uxtw            x3, w3
> +        uxtw            x5, w5

You can often avoid the explicit uxtw instructions by folding a uxtw 
extend into the instructions that use the register, e.g. as the extend 
modifier on an add. (If the register is used many times, extending it 
once upfront like this may be slightly faster, but often the separate 
instruction can be omitted entirely.) Also, any operation that writes to 
a wN register implicitly clears the upper half of the corresponding xN 
register, so the extension may come for free that way.

> +
> +        add             x9, x4, x5                  // x9 = src + srcStride
> +        lsl             w5, w5, #1                  // srcStride *= 2
> +
> +1:
> +        mov             w10, w6                     // w10 = w
> +        mov             x11, x4                     // x11 = src1 (line 1)
> +        mov             x12, x9                     // x12 = src2 (line 2)
> +        mov             x13, x0                     // x13 = dst1 (dstU)
> +        mov             x14, x2                     // x14 = dst2 (dstV)
> +
> +2:
> +        ld2             { v0.16b, v1.16b }, [x11], #32 // v0 = U1, v1 = V1
> +        ld2             { v2.16b, v3.16b }, [x12], #32 // v2 = U2, v3 = V2
> +
> +        uaddlp          v0.8h, v0.16b               // pairwise add U1 into v0
> +        uaddlp          v1.8h, v1.16b               // pairwise add V1 into v1
> +        uadalp          v0.8h, v2.16b               // pairwise add U2, accumulate into v0
> +        uadalp          v1.8h, v3.16b               // pairwise add V2, accumulate into v1
> +
> +        shrn            v0.8b, v0.8h, #2            // divide by 4
> +        shrn            v1.8b, v1.8h, #2            // divide by 4
> +
> +        st1             { v0.8b }, [x13], #8        // store U into dst1
> +        st1             { v1.8b }, [x14], #8        // store V into dst2
> +
> +        subs            w10, w10, #8
> +        b.gt            2b
> +
> +        // next row
> +        add             x4, x4, x5                  // src1 += srcStride * 2
> +        add             x9, x9, x5                  // src2 += srcStride * 2
> +        add             x0, x0, x1                  // dst1 += dstStride1
> +        add             x2, x2, x3                  // dst2 += dstStride2

It's often possible to avoid the extra step of moving the pointers back 
into the x11/x12/x13/x14 registers if you subtract the width from the 
stride at the start of the function. Then you don't need two separate 
registers for each pointer, and it shortens the dependency chain when 
moving on to the next line.

If the width can be arbitrary (not a multiple of 8) while we in practice 
write in increments of 8 pixels, you may need to align the width up to a 
multiple of 8 before subtracting it from the stride.
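
As a rough C model of that loop structure (only a sketch, not the patch's 
code: the function name is made up, I'm assuming that h counts output 
rows, and that the buffers are padded so that accesses up to the aligned 
width are fine), it could look something like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the suggested loop structure, not the actual asm. */
void nv24_to_yuv420p_chroma_model(uint8_t *dst1, int dstStride1,
                                  uint8_t *dst2, int dstStride2,
                                  const uint8_t *src, int srcStride,
                                  int w, int h)
{
    /* The inner loop works in blocks of 8 output pixels, so align up. */
    int aligned_w = (w + 7) & ~7;

    /* Done once upfront: what remains of each stride after the inner
     * loop has already advanced the pointer across one row. */
    ptrdiff_t dst1_skip = (ptrdiff_t)dstStride1 - aligned_w;
    ptrdiff_t dst2_skip = (ptrdiff_t)dstStride2 - aligned_w;
    /* One output row consumes two source rows, and each output pixel
     * consumes 4 bytes (two UV pairs) from each of them. */
    ptrdiff_t src_skip = 2 * (ptrdiff_t)srcStride - 4 * (ptrdiff_t)aligned_w;

    for (int y = 0; y < h; y++) {          /* assumption: h = output rows */
        for (int x = 0; x < aligned_w; x += 8) {
            const uint8_t *src2 = src + srcStride;   /* second source line */
            for (int i = 0; i < 8; i++) {
                /* 2x2 average, truncating like shrn #2 */
                dst1[i] = (src[4 * i]     + src[4 * i + 2] +
                           src2[4 * i]    + src2[4 * i + 2]) >> 2;
                dst2[i] = (src[4 * i + 1] + src[4 * i + 3] +
                           src2[4 * i + 1] + src2[4 * i + 3]) >> 2;
            }
            src  += 32;
            dst1 += 8;
            dst2 += 8;
        }
        /* Next row: no copies of the row start pointers needed. */
        src  += src_skip;
        dst1 += dst1_skip;
        dst2 += dst2_skip;
    }
}
```

Here the pointers are only ever advanced, so there is nothing to restore 
at the end of a row. Without the rounding up to 8, the skip values would 
be too large whenever w isn't a multiple of 8, and every following row 
would start at the wrong offset.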

// Martin
