From: Martin Storsjö <martin@martin.st>
Date: Fri, 30 May 2025 10:08:18 +0300 (EEST)
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Dmitriy Kovalenko <dmtr.kovalenko@outlook.com>
Subject: Re: [FFmpeg-devel] [PATCH 2/2] swscale: Neon rgb_to_yuv_half process 16 pixels at a time
Message-ID: <d9663f3e-44be-266f-3d75-a72e31882697@martin.st>
In-Reply-To: <f82f1d48-8e54-e5be-714-9e963ec3188@martin.st>
References: <20250527165800.17159-1-dmtr.kovalenko@outlook.com> <DBAP193MB095619210123E3F5CEC66E698D64A@DBAP193MB0956.EURP193.PROD.OUTLOOK.COM> <f82f1d48-8e54-e5be-714-9e963ec3188@martin.st>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/d9663f3e-44be-266f-3d75-a72e31882697@martin.st/>

On
Thu, 29 May 2025, Martin Storsjö wrote:

> On Tue, 27 May 2025, Dmitriy Kovalenko wrote:
>
>> This patch integrates so-called double buffering, where we load
>> 2 batch elements at a time and then process them in parallel. On
>> modern Arm processors, especially Apple Silicon, it gives a visible
>> benefit; for subsampled pixel processing it is especially nice because
>> it allows reading elements with 2 instructions and writing with a single
>> one (which is usually the slowest part).
>>
>> Including the previous patch in the stack, on a MacBook Pro M4 Max
>> rgb_to_yuv_half in checkasm goes up to 2x of the C version
>> ---
>>  libswscale/aarch64/input.S | 332 ++++++++++++++++++++++++++++++++++---
>>  1 file changed, 309 insertions(+), 23 deletions(-)
>>
>> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
>> index ee8eb24c14..59d66d0022 100644
>> --- a/libswscale/aarch64/input.S
>> +++ b/libswscale/aarch64/input.S
>> @@ -194,40 +194,94 @@ function ff_\fmt_rgb\()ToUV_half_neon, export=1
>>          ldp             w12, w13, [x6, #20]     // w12: bu, w13: rv
>>          ldp             w14, w15, [x6, #28]     // w14: gv, w15: bv
>>  4:
>> -        cmp             w5, #8
>>          rgb_set_uv_coeff half=1
>> -        b.lt            2f
>> -1:      // load 16 pixels and prefetch memory for the next block
>> +
>> +        cmp             w5, #16
>> +        b.lt            2f                      // Go directly to scalar if < 16
>> +
>> +1:
>>      .if \element == 3
>> -        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48
>> -        prfm            pldl1strm, [x3, #48]
>> +        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48  // First 16 pixels
>> +        ld3             { v26.16b, v27.16b, v28.16b }, [x3], #48  // Second 16 pixels
>> +        prfm            pldl1keep, [x3, #96]
>>      .else
>> -        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64
>> -        prfm            pldl1strm, [x3, #64]
>> +        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64  // First 16 pixels
>> +        ld4             { v26.16b, v27.16b, v28.16b, v29.16b }, [x3], #64  // Second 16 pixels
>> +        prfm            pldl1keep, [x3, #128]
>>      .endif
>>
>> +        // **Sum adjacent pixel pairs**
>>      .if \alpha_first
>> -        uaddlp          v21.8h, v19.16b         // v21: summed b pairs
>> -        uaddlp          v20.8h, v18.16b         // v20: summed g pairs
>> -        uaddlp          v19.8h, v17.16b         // v19: summed r pairs
>> +        uaddlp          v21.8h, v19.16b         // Block 1: B sums
>> +        uaddlp          v20.8h, v18.16b         // Block 1: G sums
>> +        uaddlp          v19.8h, v17.16b         // Block 1: R sums
>> +        uaddlp          v31.8h, v29.16b         // Block 2: B sums
>> +        uaddlp          v30.8h, v28.16b         // Block 2: G sums
>> +        uaddlp          v29.8h, v27.16b         // Block 2: R sums
>>      .else
>> -        uaddlp          v19.8h, v16.16b         // v19: summed r pairs
>> -        uaddlp          v20.8h, v17.16b         // v20: summed g pairs
>> -        uaddlp          v21.8h, v18.16b         // v21: summed b pairs
>> +        uaddlp          v19.8h, v16.16b         // Block 1: R sums
>> +        uaddlp          v20.8h, v17.16b         // Block 1: G sums
>> +        uaddlp          v21.8h, v18.16b         // Block 1: B sums
>> +        uaddlp          v29.8h, v26.16b         // Block 2: R sums
>> +        uaddlp          v30.8h, v27.16b         // Block 2: G sums
>> +        uaddlp          v31.8h, v28.16b         // Block 2: B sums
>>      .endif
>>
>> -        mov             v22.16b, v6.16b         // U first half
>> -        mov             v23.16b, v6.16b         // U second half
>> -        mov             v24.16b, v6.16b         // V first half
>> -        mov             v25.16b, v6.16b         // V second half
>> -
>> -        rgb_to_uv_interleaved_product v19, v20, v21, v0, v1, v2, v3, v4, v5, v22, v23, v24, v25, v16, v17, #10
>> +        // init accumulatos for both blocks
>> +        mov             v7.16b, v6.16b          // U_low
>> +        mov             v8.16b, v6.16b          // U_high
>> +        mov             v9.16b, v6.16b          // V_low
>> +        mov             v10.16b, v6.16b         // V_high
>> +        mov             v11.16b, v6.16b         // U_low
>> +        mov             v12.16b, v6.16b         // U_high
>> +        mov             v13.16b, v6.16b         // V_low
>> +        mov             v14.16b, v6.16b         // V_high
>> +
>> +        smlal           v7.4s, v0.4h, v19.4h    // U += ru * r (0-3)
>> +        smlal           v9.4s, v3.4h, v19.4h    // V += rv * r (0-3)
>> +        smlal           v11.4s, v0.4h, v29.4h   // U += ru * r (0-3)
>> +        smlal           v13.4s, v3.4h, v29.4h   // V += rv * r (0-3)
>> +
>> +        smlal2          v8.4s, v0.8h, v19.8h    // U += ru * r (4-7)
>> +        smlal2          v10.4s, v3.8h, v19.8h   // V += rv * r (4-7)
>> +        smlal2          v12.4s, v0.8h, v29.8h   // U += ru * r (4-7)
>> +        smlal2          v14.4s, v3.8h, v29.8h   // V += rv * r (4-7)
>> +
>> +        smlal           v7.4s, v1.4h, v20.4h    // U += gu * g (0-3)
>> +        smlal           v9.4s, v4.4h, v20.4h    // V += gv * g (0-3)
>> +        smlal           v11.4s, v1.4h, v30.4h   // U += gu * g (0-3)
>> +        smlal           v13.4s, v4.4h, v30.4h   // V += gv * g (0-3)
>> +
>> +        smlal2          v8.4s, v1.8h, v20.8h    // U += gu * g (4-7)
>> +        smlal2          v10.4s, v4.8h, v20.8h   // V += gv * g (4-7)
>> +        smlal2          v12.4s, v1.8h, v30.8h   // U += gu * g (4-7)
>> +        smlal2          v14.4s, v4.8h, v30.8h   // V += gv * g (4-7)
>> +
>> +        smlal           v7.4s, v2.4h, v21.4h    // U += bu * b (0-3)
>> +        smlal           v9.4s, v5.4h, v21.4h    // V += bv * b (0-3)
>> +        smlal           v11.4s, v2.4h, v31.4h   // U += bu * b (0-3)
>> +        smlal           v13.4s, v5.4h, v31.4h   // V += bv * b (0-3)
>> +
>> +        smlal2          v8.4s, v2.8h, v21.8h    // U += bu * b (4-7)
>> +        smlal2          v10.4s, v5.8h, v21.8h   // V += bv * b (4-7)
>> +        smlal2          v12.4s, v2.8h, v31.8h   // U += bu * b (4-7)
>> +        smlal2          v14.4s, v5.8h, v31.8h   // V += bv * b (4-7)
>> +
>> +        sqshrn          v16.4h, v7.4s, #10      // U (0-3)
>> +        sqshrn          v17.4h, v9.4s, #10      // V (0-3)
>> +        sqshrn          v22.4h, v11.4s, #10     // U (0-3)
>> +        sqshrn          v23.4h, v13.4s, #10     // V (0-3)
>> +
>> +        sqshrn2         v16.8h, v8.4s, #10      // U (0-7)
>> +        sqshrn2         v17.8h, v10.4s, #10     // V (0-7)
>> +        sqshrn2         v22.8h, v12.4s, #10     // U (0-7)
>> +        sqshrn2         v23.8h, v14.4s, #10     // V (0-7)
>>
>> -        str             q16, [x0], #16          // store dst_u
>> -        str             q17, [x1], #16          // store dst_v
>> +        stp             q16, q22, [x0], #32     // Store all 16 U values
>> +        stp             q17, q23, [x1], #32     // Store all 16 V values
>>
>> -        sub             w5, w5, #8              // width -= 8
>> -        cmp             w5, #8                  // width >= 8 ?
>> +        sub             w5, w5, #16             // width -= 16
>> +        cmp             w5, #16                 // width >= 16 ?
>>          b.ge            1b
>>          cbz             w5, 3f                  // No pixels left? Exit
>>
>> @@ -459,3 +513,235 @@ endfunc
>>
>>  DISABLE_DOTPROD
>>  #endif
>> +
>> +.macro rgbToUV_half_neon_double fmt_bgr, fmt_rgb, element, alpha_first=0
>> +function ff_\fmt_bgr\()ToUV_half_neon_double, export=1
>> +        cbz             w5, 9f                  // exit immediately if width is 0
>> +        cmp             w5, #16                 // check if we have at least 16 pixels
>> +        b.lt            _ff_\fmt_bgr\()ToUV_half_neon
>
> Also, with that fixed, this fails to properly back up and restore registers
> v8-v15; checkasm doesn't notice this on macOS, but on Linux and Windows,
> checkasm has a call wrapper which does detect such issues.

This comment is still unaddressed; checkasm still fails on Linux and
Windows.

// Martin
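
For reference, a minimal sketch of the kind of backup and restore meant
above. Under the AArch64 procedure call standard only the low 64 bits of
v8-v15 (that is, d8-d15) are callee-saved, so storing the d registers is
enough. The function name below is hypothetical, and the sketch assumes the
function/endfunc macros from libavutil/aarch64/asm.S; it is not taken from
the patch.

```
// Hypothetical function, for illustration only.
function ff_example_clobbers_v8_v15_neon, export=1
        // Reserve 64 bytes (sp stays 16-byte aligned) and save d8-d15.
        stp             d8,  d9,  [sp, #-64]!
        stp             d10, d11, [sp, #16]
        stp             d12, d13, [sp, #32]
        stp             d14, d15, [sp, #48]

        // ... body that uses v8-v15 as scratch registers ...

        // Restore in any order; the last load also releases the stack space.
        ldp             d10, d11, [sp, #16]
        ldp             d12, d13, [sp, #32]
        ldp             d14, d15, [sp, #48]
        ldp             d8,  d9,  [sp], #64
        ret
endfunc
```

Any exit path reached after the registers have been clobbered would need to
go through the restore sequence, including a branch out to a scalar tail.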