From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from ffbox0-bg.mplayerhq.hu (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTPS id C7B3E4C9C8 for ; Mon, 10 Feb 2025 13:15:50 +0000 (UTC) Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTP id 43B7568BDF9; Mon, 10 Feb 2025 15:15:47 +0200 (EET) Received: from mail-lj1-f179.google.com (mail-lj1-f179.google.com [209.85.208.179]) by ffbox0-bg.mplayerhq.hu (Postfix) with ESMTPS id 195FB68B882 for ; Mon, 10 Feb 2025 15:15:40 +0200 (EET) Received: by mail-lj1-f179.google.com with SMTP id 38308e7fff4ca-308f32984bdso5184631fa.1 for ; Mon, 10 Feb 2025 05:15:40 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=martin-st.20230601.gappssmtp.com; s=20230601; t=1739193339; x=1739798139; darn=ffmpeg.org; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:from:to:cc:subject:date:message-id:reply-to; bh=8PL7ZIM3vyyk0LEDmWe/3/uL3PRPZiUSjs1TrDOBryI=; b=MOaokO4RUu1loBDYbj3o8EDwoZE+OSZivLE0XDRBTid2ZnYWAw7gbp8DK8S136sqhM YLqK8d1iK0FBnVcFcnQG3oJnPDX61mgHbFiRxKb7GJQQJiYY9pEPkbTOb+9fP4vKbh1x bPRZnJGZHwQ98tm6h+KjFNBePOTgkqgUAMzaVaSVcfe686Qy2ts7Iv/1R+4Xgf0XJaMP g0J/TyhYh6T097HLZE5raIrljPENBlOpbwS5eHbfPoe/0vRpEdbXW0S5I5btCawwvDmF cPXehS9dOVGIr4yuDqGyu9dRbvmdtP8MHuzf9vqX8+cOZy2CmCvlBs3E7alYMxKsG3wy RTYA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1739193339; x=1739798139; h=mime-version:references:message-id:in-reply-to:subject:cc:to:from :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=8PL7ZIM3vyyk0LEDmWe/3/uL3PRPZiUSjs1TrDOBryI=; b=Xh+IDBQU7U54AMeLsORk9uU9pFra4vGDN2bQFH8m/Ldou8ESEJaQ30zSdzPCkWYDn+ /h8QO3U4KyBWHG0Xk18+uikndJhKhrLdvwBsbD039o2Vbrc4A1ZvlA/h+oWoeL+gYOty mSzwYZYzTC7+JEhiHb71wLQKCajtuxcvRr4Qy0hUeHpkBOYwrZduc8NmJLfKzxwxgE9t 2g1zFdw+kl5936loS9LBd+6zkyXrvu0skgP4PCu+BIfxpHGWNQymd2KGnrwopaC70wHp gTKxXYp6CPXJRRJS4obrksEyY8+5TGLEBBW1/7yc8XSyHzX+yE4WHorrvF6nGejA1j1L eP/A== X-Gm-Message-State: AOJu0YxBhEt34jQQhSkv/j9BeO5oOKzMn+yu824I49l4wYlE6PDqN0E7 3CnaEWOr1VCx6Izo4JPe0dB/jsyjeLN0wkz7TSfvFMtxI8BRUu0HMWc0pk4Fao8VNRyn9iJDk2L 4kQ== X-Gm-Gg: ASbGncu9ObKfIpUxegOwhzEleZ36XuiK3ZAiTaPqvu/nSEmRj64DS3jAVOznbedBS4o K1ChPXq73f6R9uGQZCu+Cuj0Xy5edc2ZSGnT6gySQ+k5nA0vRVLcGtgHFkmVq7IW0qvx1qSUivC 7SNZ2PVj1q1APhnsU4URaZKONDdDViOkky37UPcU3XdiCVrNvHZyNglIho+LPV1i+kQKdZF1EjZ mpOkcrre0erWnbcD6Heu0A1VQB4AdiGDHv9OxOuktYbCxHzEIL6UzswDj7LxejGXYBaQeBUHLK1 gV8asxy9h6ZUphYQ+n7YI6YDEx/LImPQkGsr6rij50w5umOb7e9hAsnS2BlafF5MpoDp5+ir2l5 C8Pb+fqpnCu0= X-Google-Smtp-Source: AGHT+IH5MmTWPmiFNZ+8dGmxuyyYg+8myhp9Y3BBwE/8z3msMgQpA4BBBlyXHP5GLWfpxhpwdHB9Gg== X-Received: by 2002:a2e:a98e:0:b0:302:17e7:e176 with SMTP id 38308e7fff4ca-307e57be353mr39885391fa.5.1739193338925; Mon, 10 Feb 2025 05:15:38 -0800 (PST) Received: from tunnel335574-pt.tunnel.tserv24.sto1.ipv6.he.net (tunnel335574-pt.tunnel.tserv24.sto1.ipv6.he.net. [2001:470:27:11::2]) by smtp.gmail.com with ESMTPSA id 38308e7fff4ca-308f4e60ebfsm1088691fa.60.2025.02.10.05.15.37 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Feb 2025 05:15:38 -0800 (PST) Date: Mon, 10 Feb 2025 15:15:35 +0200 (EET) From: =?ISO-8859-15?Q?Martin_Storsj=F6?= To: Krzysztof Pyrkosz via ffmpeg-devel In-Reply-To: <20250207190647.4729-2-ffmpeg@szaka.eu> Message-ID: <6f582baf-e5db-8259-87ba-aed38bfe26ce@martin.st> References: <20250207190647.4729-2-ffmpeg@szaka.eu> MIME-Version: 1.0 Subject: Re: [FFmpeg-devel] [PATCH] swscale/aarch64/rgb2rgb_neon: Implemented uyvytoyuv422 X-BeenThere: ffmpeg-devel@ffmpeg.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: FFmpeg development discussions and patches List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Reply-To: FFmpeg development discussions and patches Cc: Krzysztof Pyrkosz Content-Transfer-Encoding: 7bit Content-Type: text/plain; charset="us-ascii"; Format="flowed" Errors-To: ffmpeg-devel-bounces@ffmpeg.org Sender: "ffmpeg-devel" Archived-At: List-Archive: List-Post: On Fri, 7 Feb 2025, Krzysztof Pyrkosz via ffmpeg-devel wrote: > The patch contains NEON code that splits the uyvy input array into 3 > separate buffers. > > The existing test cases are covering scenarios with odd height and odd > stride, but width is even in every instance. Is it safe to make that > assumption about the width? For something like uyvy, I'm kinda ok with assuming that, especially if we don't have test coverage for it. But ideally, it would of course be clear how those aspects are handled wrt assembly functions, indeed! > Just as I'm about to send this patch, I'm thinking if non-interleaved > read followed by 4 invocations of TBL wouldn't be more performant. One > call to generate a contiguous vector of u, second for v and two for y. > I'm curious to find out. My guess is that it may be more performant on more modern cores, but probably not on older ones. > > The speed: > > A78: > uyvytoyuv422_c: 42213.5 ( 1.00x) > uyvytoyuv422_neon: 5298.8 ( 7.97x) > > A72: > uyvytoyuv422_c: 61797.6 ( 1.00x) > uyvytoyuv422_neon: 12141.9 ( 5.09x) > > x13s: > uyvytoyuv422_c: 28581.7 ( 1.00x) > uyvytoyuv422_neon: 4882.4 ( 5.85x) > > Krzysztof > > --- > libswscale/aarch64/rgb2rgb.c | 5 +++ > libswscale/aarch64/rgb2rgb_neon.S | 51 +++++++++++++++++++++++++++++++ > 2 files changed, 56 insertions(+) > > diff --git a/libswscale/aarch64/rgb2rgb.c b/libswscale/aarch64/rgb2rgb.c > index 7e1dba572d..096ed9f363 100644 > --- a/libswscale/aarch64/rgb2rgb.c > +++ b/libswscale/aarch64/rgb2rgb.c > @@ -67,6 +67,10 @@ void ff_shuffle_bytes_2013_neon(const uint8_t *src, uint8_t *dst, int src_size); > void ff_shuffle_bytes_2130_neon(const uint8_t *src, uint8_t *dst, int src_size); > void ff_shuffle_bytes_1203_neon(const uint8_t *src, uint8_t *dst, int src_size); > > +void ff_uyvytoyuv422_neon(uint8_t *ydst, uint8_t *udst, uint8_t *vdst, > + const uint8_t *src, int width, int height, > + int lumStride, int chromStride, int srcStride); > + > av_cold void rgb2rgb_init_aarch64(void) > { > int cpu_flags = av_get_cpu_flags(); > @@ -84,5 +88,6 @@ av_cold void rgb2rgb_init_aarch64(void) > shuffle_bytes_2013 = ff_shuffle_bytes_2013_neon; > shuffle_bytes_2130 = ff_shuffle_bytes_2130_neon; > shuffle_bytes_1203 = ff_shuffle_bytes_1203_neon; > + uyvytoyuv422 = ff_uyvytoyuv422_neon; > } > } > diff --git a/libswscale/aarch64/rgb2rgb_neon.S b/libswscale/aarch64/rgb2rgb_neon.S > index 22ecdf7ac8..bdbee7df2e 100644 > --- a/libswscale/aarch64/rgb2rgb_neon.S > +++ b/libswscale/aarch64/rgb2rgb_neon.S > @@ -427,3 +427,54 @@ neon_shuf 2013 > neon_shuf 1203 > neon_shuf 2130 > neon_shuf 3210 > + > +function ff_uyvytoyuv422_neon, export=1 > + sxtw x6, w6 The indentation of the arguments column is off by one char within this whole function. For testing various aarch64 assembly details (building with unusual toolchains etc), I've set up a little extra CI for that on github; have a look at the branch at the topmost 4 commits at https://github.com/mstorsjo/ffmpeg/commits/gha-aarch64. If you include those 4 commits in a branch and push to a personal fork at github, you'll get test results for that push like this: https://github.com/mstorsjo/FFmpeg/actions/runs/13240513523 For these two patches, I got the following results: https://github.com/mstorsjo/FFmpeg/actions/runs/13240526767 This points out the unexpected indentation here. > + sxtw x7, w7 > + ldrsw x8, [sp] > + ubfx x10, x4, #1, #31 The ubfx instruction is kinda esoteric; I presume what you're doing here is essentially the same as "lsr #1"? That'd be much more idiomatic and readable. > + sub x8, x8, w4, sxtw #1 // src offset > + sub x6, x6, w4, sxtw // lum offset > + sub x7, x7, x10 // chr offset > +1: > + sub w5, w5, #1 It feels a bit unusual to do the decrement of the outer loop counter here at this point; I feel that it would be more readable in context if it was done at the end after the 3: label. (There can of course be good reasons for doing it early due to scheduling etc, but I don't see such a case here.) > + ands w10, w4, #~31 > + and w9, w4, #15 > + and w11, w4, #16 > + b.eq 7f > +4: > + ld4 {v0.16b - v3.16b}, [x3], #64 // handle 16 uyvy tuples per iteration > + subs w10, w10, #32 > + st1 {v2.16b}, [x2], #16 > + st1 {v0.16b}, [x1], #16 > + mov v2.16b, v1.16b > + st2 {v2.16b,v3.16b}, [x0], #32 > + b.ne 4b > +7: > + cbz w11, 5f // 8 - 15 remaining > + ld4 {v0.8b - v3.8b}, [x3], #32 > + st1 {v2.8b}, [x2], #8 > + st1 {v0.8b}, [x1], #8 > + mov v2.8b, v1.8b > + st2 {v2.8b,v3.8b}, [x0], #16 > +5: > + cbz w9, 3f > +2: > + subs w9, w9, #2 // 0 - 7 left > + ldrb w12, [x3], #1 > + strb w12, [x1], #1 > + ldrb w12, [x3], #1 > + strb w12, [x0], #1 > + ldrb w12, [x3], #1 > + strb w12, [x2], #1 > + ldrb w12, [x3], #1 > + strb w12, [x0], #1 > + b.ne 2b > +3: > + add x3, x3, x8 > + add x0, x0, x6 > + add x1, x1, x7 > + add x2, x2, x7 > + cbnz w5, 1b > + ret > +endfunc If the height decrement is moved into the end here, it can be a subs with a regular b.gt branch. Other than that, this looks quite reasonable, thanks! // Martin _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".