Date: Thu, 29 May 2025 22:09:39 +0300 (EEST)
From: Martin Storsjö <martin@martin.st>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Dmitriy Kovalenko <dmtr.kovalenko@outlook.com>
Subject: Re: [FFmpeg-devel] [PATCH 2/2] swscale: Neon rgb_to_yuv_half process 16 pixels at a time
Message-ID: <f82f1d48-8e54-e5be-714-9e963ec3188@martin.st>
In-Reply-To: <DBAP193MB095619210123E3F5CEC66E698D64A@DBAP193MB0956.EURP193.PROD.OUTLOOK.COM>
References: <20250527165800.17159-1-dmtr.kovalenko@outlook.com> <DBAP193MB095619210123E3F5CEC66E698D64A@DBAP193MB0956.EURP193.PROD.OUTLOOK.COM>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/f82f1d48-8e54-e5be-714-9e963ec3188@martin.st/>

On Tue, 27 May 2025, Dmitriy Kovalenko wrote:

> This patches integrates so called
> double bufferring when we are loading
> 2 batch elements at a time and then processing them in parallel. On the
> moden arm processors especially Apple Silicon it gives a visible
> benefit, for subsampled pixel processing it is especially nice because
> it allows to read elements w/ 2 instructions and write with a single one
> (which is usually the slowest part).
>
> Including the previous patch in a stack on macbook pro m4 max rgb_to_yuv_half
> in checkasm goes up 2x of the c version
> ---
>  libswscale/aarch64/input.S | 332 ++++++++++++++++++++++++++++++++++---
>  1 file changed, 309 insertions(+), 23 deletions(-)
>
> diff --git a/libswscale/aarch64/input.S b/libswscale/aarch64/input.S
> index ee8eb24c14..59d66d0022 100644
> --- a/libswscale/aarch64/input.S
> +++ b/libswscale/aarch64/input.S
> @@ -194,40 +194,94 @@ function ff_\fmt_rgb\()ToUV_half_neon, export=1
>          ldp             w12, w13, [x6, #20] // w12: bu, w13: rv
>          ldp             w14, w15, [x6, #28] // w14: gv, w15: bv
>  4:
> -        cmp             w5, #8
>          rgb_set_uv_coeff half=1
> -        b.lt            2f
> -1: // load 16 pixels and prefetch memory for the next block
> +
> +        cmp             w5, #16
> +        b.lt            2f // Go directly to scalar if < 16
> +
> +1:
>      .if \element == 3
> -        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48
> -        prfm            pldl1strm, [x3, #48]
> +        ld3             { v16.16b, v17.16b, v18.16b }, [x3], #48 // First 16 pixels
> +        ld3             { v26.16b, v27.16b, v28.16b }, [x3], #48 // Second 16 pixels
> +        prfm            pldl1keep, [x3, #96]
>      .else
> -        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64
> -        prfm            pldl1strm, [x3, #64]
> +        ld4             { v16.16b, v17.16b, v18.16b, v19.16b }, [x3], #64 // First 16 pixels
> +        ld4             { v26.16b, v27.16b, v28.16b, v29.16b }, [x3], #64 // Second 16 pixels
> +        prfm            pldl1keep, [x3, #128]
>      .endif
>
> +        // **Sum adjacent pixel pairs**
>      .if \alpha_first
> -        uaddlp          v21.8h, v19.16b // v21: summed b pairs
> -        uaddlp          v20.8h, v18.16b // v20: summed g pairs
> -        uaddlp          v19.8h, v17.16b // v19: summed r pairs
> +        uaddlp          v21.8h, v19.16b // Block 1: B sums
> +        uaddlp          v20.8h, v18.16b // Block 1: G sums
> +        uaddlp          v19.8h, v17.16b // Block 1: R sums
> +        uaddlp          v31.8h, v29.16b // Block 2: B sums
> +        uaddlp          v30.8h, v28.16b // Block 2: G sums
> +        uaddlp          v29.8h, v27.16b // Block 2: R sums
>      .else
> -        uaddlp          v19.8h, v16.16b // v19: summed r pairs
> -        uaddlp          v20.8h, v17.16b // v20: summed g pairs
> -        uaddlp          v21.8h, v18.16b // v21: summed b pairs
> +        uaddlp          v19.8h, v16.16b // Block 1: R sums
> +        uaddlp          v20.8h, v17.16b // Block 1: G sums
> +        uaddlp          v21.8h, v18.16b // Block 1: B sums
> +        uaddlp          v29.8h, v26.16b // Block 2: R sums
> +        uaddlp          v30.8h, v27.16b // Block 2: G sums
> +        uaddlp          v31.8h, v28.16b // Block 2: B sums
>      .endif
>
> -        mov             v22.16b, v6.16b // U first half
> -        mov             v23.16b, v6.16b // U second half
> -        mov             v24.16b, v6.16b // V first half
> -        mov             v25.16b, v6.16b // V second half
> -
> -        rgb_to_uv_interleaved_product v19, v20, v21, v0, v1, v2, v3, v4, v5, v22, v23, v24, v25, v16, v17, #10
> +        // init accumulatos for both blocks
> +        mov             v7.16b, v6.16b // U_low
> +        mov             v8.16b, v6.16b // U_high
> +        mov             v9.16b, v6.16b // V_low
> +        mov             v10.16b, v6.16b // V_high
> +        mov             v11.16b, v6.16b // U_low
> +        mov             v12.16b, v6.16b // U_high
> +        mov             v13.16b, v6.16b // V_low
> +        mov             v14.16b, v6.16b // V_high
> +
> +        smlal           v7.4s, v0.4h, v19.4h // U += ru * r (0-3)
> +        smlal           v9.4s, v3.4h, v19.4h // V += rv * r (0-3)
> +        smlal           v11.4s, v0.4h, v29.4h // U += ru * r (0-3)
> +        smlal           v13.4s, v3.4h, v29.4h // V += rv * r (0-3)
> +
> +        smlal2          v8.4s, v0.8h, v19.8h // U += ru * r (4-7)
> +        smlal2          v10.4s, v3.8h, v19.8h // V += rv * r (4-7)
> +        smlal2          v12.4s, v0.8h, v29.8h // U += ru * r (4-7)
> +        smlal2          v14.4s, v3.8h, v29.8h // V += rv * r (4-7)
> +
> +        smlal           v7.4s, v1.4h, v20.4h // U += gu * g (0-3)
> +        smlal           v9.4s, v4.4h, v20.4h // V += gv * g (0-3)
> +        smlal           v11.4s, v1.4h, v30.4h // U += gu * g (0-3)
> +        smlal           v13.4s, v4.4h, v30.4h // V += gv * g (0-3)
> +
> +        smlal2          v8.4s, v1.8h, v20.8h // U += gu * g (4-7)
> +        smlal2          v10.4s, v4.8h, v20.8h // V += gv * g (4-7)
> +        smlal2          v12.4s, v1.8h, v30.8h // U += gu * g (4-7)
> +        smlal2          v14.4s, v4.8h, v30.8h // V += gv * g (4-7)
> +
> +        smlal           v7.4s, v2.4h, v21.4h // U += bu * b (0-3)
> +        smlal           v9.4s, v5.4h, v21.4h // V += bv * b (0-3)
> +        smlal           v11.4s, v2.4h, v31.4h // U += bu * b (0-3)
> +        smlal           v13.4s, v5.4h, v31.4h // V += bv * b (0-3)
> +
> +        smlal2          v8.4s, v2.8h, v21.8h // U += bu * b (4-7)
> +        smlal2          v10.4s, v5.8h, v21.8h // V += bv * b (4-7)
> +        smlal2          v12.4s, v2.8h, v31.8h // U += bu * b (4-7)
> +        smlal2          v14.4s, v5.8h, v31.8h // V += bv * b (4-7)
> +
> +        sqshrn          v16.4h, v7.4s, #10 // U (0-3)
> +        sqshrn          v17.4h, v9.4s, #10 // V (0-3)
> +        sqshrn          v22.4h, v11.4s, #10 // U (0-3)
> +        sqshrn          v23.4h, v13.4s, #10 // V (0-3)
> +
> +        sqshrn2         v16.8h, v8.4s, #10 // U (0-7)
> +        sqshrn2         v17.8h, v10.4s, #10 // V (0-7)
> +        sqshrn2         v22.8h, v12.4s, #10 // U (0-7)
> +        sqshrn2         v23.8h, v14.4s, #10 // V (0-7)
>
> -        str             q16, [x0], #16 // store dst_u
> -        str             q17, [x1], #16 // store dst_v
> +        stp             q16, q22, [x0], #32 // Store all 16 U values
> +        stp             q17, q23, [x1], #32 // Store all 16 V values
>
> -        sub             w5, w5, #8 // width -= 8
> -        cmp             w5, #8 // width >= 8 ?
> +        sub             w5, w5, #16 // width -= 16
> +        cmp             w5, #16 // width >= 16 ?
>          b.ge            1b
>          cbz             w5, 3f // No pixels left? Exit
>
> @@ -459,3 +513,235 @@ endfunc
>
>  DISABLE_DOTPROD
>  #endif
> +
> +.macro rgbToUV_half_neon_double fmt_bgr, fmt_rgb, element, alpha_first=0
> +function ff_\fmt_bgr\()ToUV_half_neon_double, export=1
> +        cbz             w5, 9f // exit immediately if width is 0
> +        cmp             w5, #16 // check if we have at least 16 pixels
> +        b.lt            _ff_\fmt_bgr\()ToUV_half_neon

This fails to link on anything other than Darwin targets; other platforms
don't have an underscore prefix on symbols. Use the X() macro around
symbol names to get the right external symbol name for the function.
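
A minimal sketch of that change, assuming X() is the symbol-prefixing helper
from FFmpeg's shared aarch64 assembly header (libavutil/aarch64/asm.S) and
that the branch target keeps its current name. It is untested here and only
meant to show the shape of the fix:

```
        // Assumption: X() prepends the platform's external symbol prefix
        // ("_" on Darwin, nothing elsewhere), replacing the hardcoded "_".
        b.lt            X(ff_\fmt_bgr\()ToUV_half_neon)
```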

Also, with that fixed, this fails to properly back up and restore
registers v8-v15; checkasm doesn't notice this on macOS, but on Linux and
Windows, checkasm has a call wrapper which does detect such issues.

I have set up a set of test configurations for aarch64 assembly on GitHub;
if you fetch the branch
https://github.com/mstorsjo/ffmpeg/commits/gha-aarch64, append your own
commits on top, and push this to your own fork on GitHub, it'll test
building it in all the relevant configurations (most relevant
platforms/toolchains, including rare ones that not everybody may have
available).

(You may need to activate the actions by visiting
http://github.com/<yourusername>/ffmpeg/actions.) It also checks that the
indentation of the assembly matches the common style.

// Martin