Date: Thu, 19 Jun 2025 00:15:55 +0300 (EEST)
From: Martin Storsjö
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Dash Santosh Sathyanarayanan, Harshitha Sarangu Suresh
Message-ID: <1fb7d51-33b0-0a8-2ed6-c8553638c3df@martin.st>
Subject: Re: [FFmpeg-devel] [PATCH] swscale/aarch64/output: Implement neon assembly for yuv2planeX_10_c_template()

On Fri, 6 Jun 2025, Logaprakash Ramajayam wrote:

> Checked FATE tests and gha-aarch64 git workflow.
>
> From 34cdef26eaebcf98916e9881b3a04f4f698f09c6 Mon Sep 17 00:00:00 2001
> From: Logaprakash Ramajayam
> Date: Thu, 5 Jun 2025 01:33:39 -0700
> Subject: [PATCH] swscale/aarch64/output: Implement neon assembly for
>  yuv2planeX_10_c_template()
> ---
>  libswscale/aarch64/output.S  | 167 +++++++++++++++++++++++++++++++++++
>  libswscale/aarch64/swscale.c |  38 ++++++++
>  2 files changed, 205 insertions(+)

This is missing checkasm benchmarks for the function. That presumes
there is a checkasm test for it; if not, such a test needs to be
written.
> diff --git a/libswscale/aarch64/output.S b/libswscale/aarch64/output.S
> index 190c438870..e039e820ae 100644
> --- a/libswscale/aarch64/output.S
> +++ b/libswscale/aarch64/output.S
> @@ -20,6 +20,173 @@
>
>  #include "libavutil/aarch64/asm.S"
>
> +function ff_yuv2planeX_10_neon, export=1
> +// x0 = filter (int16_t*)
> +// w1 = filterSize
> +// x2 = src (int16_t**)
> +// x3 = dest (uint16_t*)
> +// w4 = dstW
> +// w5 = big_endian
> +// w6 = output_bits
> +
> +        mov     w8, #27
> +        sub     w8, w8, w6              // shift = 11 + 16 - output_bits
> +
> +        sub     w9, w8, #1
> +        mov     w10, #1
> +        lsl     w9, w10, w9             // val = 1 << (shift - 1)
> +
> +        dup     v1.4s, w9
> +        dup     v2.4s, w9               // Create vectors with val
> +
> +        mov     w17, #0
> +        sub     w16, w17, w8

You don't need to assign zero to a register and subtract in order to
negate a value; you can just do "neg w16, w8".

> +        dup     v8.4s, w16              // Create (-shift) vector for right shift
> +
> +        movi    v11.4s, #0
> +
> +        mov     w10, #1
> +        lsl     w10, w10, w6
> +        sub     w10, w10, #1            // (1U << output_bits) - 1
> +        dup     v12.4s, w10             // Create Clip vector for uppr bound
> +
> +        tst     w4, #15                 // if dstW divisible by 16, process 16 elements
> +        b.ne    4f                      // else process 8 elements

Same question as for the other patch: can we assume that it is ok to
always write in increments of 8? If not, we'd need a scalar loop to
handle the tail. And in any case, it's more efficient to use the most
unrolled version of the function for the majority of a line, instead of
running the whole line with a less efficient loop just because the tail
doesn't line up entirely.
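For reference, a scalar tail loop would only need to mirror the
per-pixel logic of the C template. A rough sketch (illustrative only,
not the exact FFmpeg code; the function name and the simplified
little-endian-only store are mine):

```c
#include <stdint.h>

/* Sketch of the per-pixel logic behind yuv2planeX_10_c_template():
 * accumulate the filter taps into a 32-bit value with a rounding bias,
 * shift down, and clip to [0, (1 << output_bits) - 1]. A scalar tail
 * loop would run this for the last few pixels of a line. */
static void yuv2planeX_tail_ref(const int16_t *filter, int filterSize,
                                const int16_t **src, uint16_t *dest,
                                int dstW, int output_bits)
{
    int shift  = 11 + 16 - output_bits;
    int maxval = (1 << output_bits) - 1;
    for (int i = 0; i < dstW; i++) {
        int val = 1 << (shift - 1);          /* rounding bias */
        for (int j = 0; j < filterSize; j++)
            val += src[j][i] * filter[j];    /* accumulate taps */
        val >>= shift;
        dest[i] = val < 0 ? 0 : (val > maxval ? maxval : val);
    }
}
```

With output_bits = 10 this gives shift = 17, matching the "mov w8, #27 /
sub w8, w8, w6" setup in the quoted assembly.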
> +
> +        mov     x7, #0                  // i = 0
> +1:  // Loop
> +
> +        mov     v3.16b, v1.16b
> +        mov     v4.16b, v2.16b
> +        mov     v5.16b, v1.16b
> +        mov     v6.16b, v2.16b
> +
> +        mov     w11, w1                 // tmpfilterSize = filterSize
> +        mov     x12, x2                 // srcp = src
> +        mov     x13, x0                 // filterp = filter
> +
> +2:  // Filter loop
> +
> +        ldp     x14, x15, [x12], #16    // get 2 pointers: src[j] and src[j+1]
> +        ldr     s7, [x13], #4           // load filter coefficients
> +        add     x14, x14, x7, lsl #1
> +        add     x15, x15, x7, lsl #1
> +        ld1     {v16.8h, v17.8h}, [x14]
> +        ld1     {v18.8h, v19.8h}, [x15]
> +
> +        // Multiply-accumulate
> +        smlal   v3.4s, v16.4h, v7.h[0]
> +        smlal2  v4.4s, v16.8h, v7.h[0]
> +        smlal   v5.4s, v17.4h, v7.h[0]
> +        smlal2  v6.4s, v17.8h, v7.h[0]
> +
> +        smlal   v3.4s, v18.4h, v7.h[1]
> +        smlal2  v4.4s, v18.8h, v7.h[1]
> +        smlal   v5.4s, v19.4h, v7.h[1]
> +        smlal2  v6.4s, v19.8h, v7.h[1]
> +
> +        subs    w11, w11, #2            // tmpfilterSize -= 2
> +        b.gt    2b                      // continue filter loop
> +
> +        // Shift results
> +        sshl    v3.4s, v3.4s, v8.4s
> +        sshl    v4.4s, v4.4s, v8.4s
> +        sshl    v5.4s, v5.4s, v8.4s
> +        sshl    v6.4s, v6.4s, v8.4s
> +
> +        // Clamp to 0
> +        smax    v3.4s, v3.4s, v11.4s
> +        smax    v4.4s, v4.4s, v11.4s
> +        smax    v5.4s, v5.4s, v11.4s
> +        smax    v6.4s, v6.4s, v11.4s
> +
> +        // Clip upper bound
> +        smin    v3.4s, v3.4s, v12.4s
> +        smin    v4.4s, v4.4s, v12.4s
> +        smin    v5.4s, v5.4s, v12.4s
> +        smin    v6.4s, v6.4s, v12.4s
> +
> +        // Narrow to 16-bit
> +        xtn     v13.4h, v3.4s
> +        xtn2    v13.8h, v4.4s
> +        xtn     v14.4h, v5.4s
> +        xtn2    v14.8h, v6.4s

If we are going to narrow things to 16 bit here anyway, it would be more
efficient to narrow first, before clamping. You can do that with sqxtun;
then you also get the clamp to 0 part for free, so you only need to
clamp the upper bound, with half the number of instructions/registers.
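To spell out that suggestion in scalar terms: per lane, a saturating
signed-to-unsigned narrow (sqxtun) clamps negatives to 0 and anything
above 65535 to 65535, so only the format's upper bound remains to be
clamped afterwards. A C model of the per-lane behaviour (function names
are mine, purely illustrative):

```c
#include <stdint.h>

/* Scalar model of what sqxtun does per lane: saturate an int32_t into
 * the uint16_t range. Negative inputs become 0 "for free". */
static uint16_t sqxtun_lane(int32_t v)
{
    if (v < 0)     return 0;
    if (v > 65535) return 65535;
    return (uint16_t)v;
}

/* After the narrow, only the format's upper bound still needs clamping,
 * e.g. 1023 for 10-bit output -- half the clamp work of the quoted
 * smax + smin pairs. */
static uint16_t clip_output(int32_t v, int output_bits)
{
    uint16_t u = sqxtun_lane(v);
    uint16_t maxval = (uint16_t)((1 << output_bits) - 1);
    return u > maxval ? maxval : u;
}
```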
> +
> +        cbz     w5, 3f                  // Check if big endian
> +        rev16   v13.16b, v13.16b
> +        rev16   v14.16b, v14.16b        // Swap bits for big endian
> +3:
> +        // Store 16 pixels
> +        st1     {v13.8h}, [x3], #16
> +        st1     {v14.8h}, [x3], #16

Write both registers with one store: st1 {v13.8h, v14.8h}, [x3], #32.

> +
> +        add     x7, x7, #16             // i = i + 16
> +        subs    w4, w4, #16             // dstW = dstW - 16
> +        b.gt    1b                      // Continue loop

If possible, don't do the calculation that sets the condition codes
directly before the branch; that forces the branch to wait for the
previous instruction to finish. Instead you can move the "subs"
instruction a bit earlier; at least above the "add" here, but it could
also go e.g. after the 3: label.

> diff --git a/libswscale/aarch64/swscale.c b/libswscale/aarch64/swscale.c
> index 6e5a721c1f..23cdb7d26e 100644
> --- a/libswscale/aarch64/swscale.c
> +++ b/libswscale/aarch64/swscale.c
> @@ -158,6 +158,29 @@ void ff_hscale ## from_bpc ## to ## to_bpc ## _ ## filter_n ## _ ## opt( \
>
>  ALL_SCALE_FUNCS(neon);
>
> +void ff_yuv2planeX_10_neon(const int16_t *filter, int filterSize,
> +        const int16_t **src, uint16_t *dest, int dstW,
> +        int big_endian, int output_bits);

Align the parameters on the continuation lines with the parameters on
the first line. See ff_yuv2planeX_8_neon right below.
> +
> +#define yuv2NBPS(bits, BE_LE, is_be, template_size, typeX_t) \
> +static void yuv2planeX_ ## bits ## BE_LE ## _neon(const int16_t *filter, int filterSize, \
> +        const int16_t **src, uint8_t *dest, int dstW, \
> +        const uint8_t *dither, int offset)\

Same thing here.

> +{ \
> +    ff_yuv2planeX_## template_size ## _neon(filter, \
> +        filterSize, (const typeX_t **) src, \
> +        (uint16_t *) dest, dstW, is_be, bits); \

Same thing here.

> +}
> +
> +yuv2NBPS( 9, BE, 1, 10, int16_t)
> +yuv2NBPS( 9, LE, 0, 10, int16_t)
> +yuv2NBPS(10, BE, 1, 10, int16_t)
> +yuv2NBPS(10, LE, 0, 10, int16_t)
> +yuv2NBPS(12, BE, 1, 10, int16_t)
> +yuv2NBPS(12, LE, 0, 10, int16_t)
> +yuv2NBPS(14, BE, 1, 10, int16_t)
> +yuv2NBPS(14, LE, 0, 10, int16_t)

FWIW, I appreciate the effort to save code size here by not templating 8
different copies of the same function, but making it use one single
implementation for all the variants.

> +
>  void ff_yuv2planeX_8_neon(const int16_t *filter, int filterSize,
>                            const int16_t **src, uint8_t *dest, int dstW,
>                            const uint8_t *dither, int offset);
> @@ -268,6 +291,8 @@ av_cold void ff_sws_init_range_convert_aarch64(SwsInternal *c)
>  av_cold void ff_sws_init_swscale_aarch64(SwsInternal *c)
>  {
>      int cpu_flags = av_get_cpu_flags();
> +    enum AVPixelFormat dstFormat = c->opts.dst_format;
> +    const AVPixFmtDescriptor *desc = av_pix_fmt_desc_get(dstFormat);
>
>      if (have_neon(cpu_flags)) {
>          ASSIGN_SCALE_FUNC(c->hyScale, c->hLumFilterSize, neon);
> @@ -276,6 +301,19 @@ av_cold void ff_sws_init_swscale_aarch64(SwsInternal *c)
>          if (c->dstBpc == 8) {
>              c->yuv2planeX = ff_yuv2planeX_8_neon;
>          }
> +
> +        if (isNBPS(dstFormat) && !isSemiPlanarYUV(dstFormat)) {
> +            if (desc->comp[0].depth == 9) {
> +                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_9BE_neon : yuv2planeX_9LE_neon;
> +            } else if (desc->comp[0].depth == 10) {
> +                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_10BE_neon : yuv2planeX_10LE_neon;
> +            } else if (desc->comp[0].depth == 12) {
> +                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_12BE_neon : yuv2planeX_12LE_neon;
> +            } else if (desc->comp[0].depth == 14) {
> +                c->yuv2planeX = isBE(dstFormat) ? yuv2planeX_14BE_neon : yuv2planeX_14LE_neon;
> +            } else
> +                av_assert0(0);

The av_assert is misindented.

// Martin