From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from ffbox0-bg.ffmpeg.org (ffbox0-bg.ffmpeg.org [79.124.17.100]) by master.gitmailbox.com (Postfix) with ESMTPS id 25A3D4CF4D for ; Thu, 15 May 2025 20:20:09 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1]) by ffbox0-bg.ffmpeg.org (Postfix) with ESMTP id D14FE68C37D; Thu, 15 May 2025 23:20:05 +0300 (EEST)
Received: from relay7-d.mail.gandi.net (relay7-d.mail.gandi.net [217.70.183.200]) by ffbox0-bg.ffmpeg.org (Postfix) with ESMTPS id E0DAE68C18E for ; Thu, 15 May 2025 23:19:58 +0300 (EEST)
Received: by mail.gandi.net (Postfix) with ESMTPSA id 413364321C for ; Thu, 15 May 2025 20:19:58 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=niedermayer.cc; s=gm1; t=1747340398; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=fhv50v9xBnJRuVrJH2Ez0dtpk/vgZEqzc4K2vGkBuI0=; b=FLAu+s2hp6wvt79fZrRPY+FGszqnr7gSbEi28bOFcun6babIVn3ahQDRNaLTi9rGawWkKi iWPfY3j/yiSRP13f1cc+vPoW1RnNyMP7MNLtAlIgWflM1yjn7DI4IKCY0BWG8RjQGqWqMZ 9p+KvPRNB/WFxPcgNwqbBiiYKBnkZLYO9mpd+HgCWbSVXiqvhLlwKD8V+TUp/unWBcTZeO hckc/Ib8edDNBaPqmzSY12MTjTMipXucHDLri/W9QrZHG0Hcfj4MPRN1K96uwsDGh0lfFa 7sTKv3lyq8A1uhqZ2atgW4xBaQZRf9r9wsWDnRNJ9yhJuckdww/QjrM+28KSfQ==
Date: Thu, 15 May 2025 22:19:57 +0200
From: Michael Niedermayer
To: FFmpeg development discussions and patches
Message-ID: <20250515201957.GW29660@pb2>
References: <20250514164003.GL29660@pb2>
MIME-Version: 1.0
In-Reply-To: <20250514164003.GL29660@pb2>
X-GND-State: clean
X-GND-Score: -70
X-GND-Cause:
gggruggvucftvghtrhhoucdtuddrgeefvddrtddtgdefuddtkedvucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenfghrlhcuvffnffculdeftddmnecujfgurhepfffhvffukfhfgggtuggjsehgtderredttddvnecuhfhrohhmpefoihgthhgrvghlucfpihgvuggvrhhmrgihvghruceomhhitghhrggvlhesnhhivgguvghrmhgrhigvrhdrtggtqeenucggtffrrghtthgvrhhnpeeigeektdejudffjefhteegjedtgeettefggedthfejgfevhfetgeekjedtvdfhveenucfkphepgedurdeiiedrieejrdduudefnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepgedurdeiiedrieejrdduudefpdhhvghloheplhhotggrlhhhohhsthdpmhgrihhlfhhrohhmpehmihgthhgrvghlsehnihgvuggvrhhmrgihvghrrdgttgdpnhgspghrtghpthhtohepuddprhgtphhtthhopehffhhmphgvghdquggvvhgvlhesfhhfmhhpvghgrdhorhhg
X-GND-Sasl: michael@niedermayer.cc
Subject: Re: [FFmpeg-devel] [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: FFmpeg development discussions and patches
Content-Type: multipart/mixed; boundary="===============3855613678243751320=="
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel"
Archived-At:
List-Archive:
List-Post:

--===============3855613678243751320==
Content-Type: multipart/signed; micalg=pgp-sha512; protocol="application/pgp-signature"; boundary="TdDkFkGUMNACHLaO"
Content-Disposition: inline

--TdDkFkGUMNACHLaO
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

Hi

On Wed, May 14, 2025 at 06:40:03PM +0200, Michael Niedermayer wrote:
> Hi Chitra
>
> On Wed, May 14, 2025 at 03:55:59AM +0000, Chitra Dey Sarkar via ffmpeg-devel wrote:
> > Original Implementation:
> > ---------------------------------
> > In the original implementation,
> > the "VER_SD" section processes image data stored in *data using strided memory access in a vertical fashion. This leads to inefficient memory access patterns and cache thrashing due to non-sequential data access across multiple inner loops.
> >
> > Proposed Refactor:
> > ---------------------------------
> > The proposed refactor replaces this by allocating a cache-friendly 2D array buffer. This change eliminates strided memory access across the three inner loops, significantly improving cache locality and reducing cache thrashing.
> >
> > Additionally, the data is transposed outside the lp loop, which allows for efficient per-line access and write-back to the l buffer, further optimizing performance.
> >
> > Performance improvements
> > -------------------------------------------------------
> > This change results in a substantial performance improvement. Sharing the FPS data benchmarked on our end for the file 'Tears of Steel' using HandBrake
> >
> > Device / CPU Model                                Official FPS   Optimized FPS   % Improvement
> > Surface Laptop 11 (10-core X1P64100, L2: 36MB)    3.18           6.15            +93%
> > Surface Laptop 11 (10-core X1P64100, L2: 36MB)    5.16           7.31            +41%
> > Surface Laptop 11 (10-core X1P64100, L2: 36MB)    5.57           9.21            +65%
> > AMD Ryzen + NVIDIA RTX 4060 Laptop (12C/24T)      9.97           11.22           +12%
> > Mac Mini Apple M4 Chip                            9.00           12.00           +30%
> >
> > -------------------------------------------------------------------------------------------------------
> > ---
> >  libavcodec/jpeg2000dwt.c | 72 +++++++++++++++++++++++++++++++---------
> >  1 file changed, 57 insertions(+), 15 deletions(-)
> >
> > diff --git a/libavcodec/jpeg2000dwt.c b/libavcodec/jpeg2000dwt.c
> > index 9ee8122658..45d7897893 100644
> > --- a/libavcodec/jpeg2000dwt.c
> > +++ b/libavcodec/jpeg2000dwt.c
> > @@ -409,6 +409,15 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> >      /* position at index O of line range [0-5,w+5] cf.
> >      extend function */
> >      line += 5;
> >
>
> > +    /* Find the largest lv and lv to allocate a 2D Array*/
>
> lv and lv ?
> you mean lv and lh ?
>
>
> > +    int max_dim = 0;
> > +    for (lev = 0; lev < s->ndeclevels; lev++) {
> > +        if (s->linelen[lev][0] > max_dim) max_dim = s->linelen[lev][0];
> > +        if (s->linelen[lev][1] > max_dim) max_dim = s->linelen[lev][1];
>
> FFMAX()
>
>
> > +    }
> > +    float *array2DBlock = av_malloc(max_dim * max_dim * sizeof(float));
> > +    int useFallback = !array2DBlock;
>
> also is this supposed to be max_dim_h * max_dim_v ?
>
>
>
> > +
> >      for (lev = 0; lev < s->ndeclevels; lev++) {
> >          int lh = s->linelen[lev][0],
> >              lv = s->linelen[lev][1],
> > @@ -431,23 +440,56 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> >              for (i = 0; i < lh; i++)
> >                  data[w * lp + i] = l[i];
> >          }
> > -
> > -        // VER_SD
> > -        l = line + mv;
> > -        for (lp = 0; lp < lh; lp++) {
> > -            int i, j = 0;
> > -            // copy with interleaving
> > -            for (i = mv; i < lv; i += 2, j++)
> > -                l[i] = data[w * j + lp];
> > -            for (i = 1 - mv; i < lv; i += 2, j++)
> > -                l[i] = data[w * j + lp];
> > -
>
> > -            sr_1d97_float(line, mv, mv + lv);
>
> this should be run linewise not columnwise
> if you don't understand what I mean here, please say so and I'll elaborate

For the record (this may be interesting for others, or others may have comments too):
The 1D transform is made of 4 passes of lifting transforms. Currently all pixels of a column are run through the first lifting step before anything runs through the 2nd.
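
As a rough illustration of the "linewise not columnwise" remark, here is a minimal C sketch of a single vertical lifting pass written both ways. It is not the patch's code and not FFmpeg's sr_1d97_float(): the names lift_columnwise, lift_linewise and mirror_row, the border rule, and the coefficient passed in are all assumptions made for the example. Both functions compute the same values; only the memory access order differs.

```c
#include <assert.h>

/* Mirror a row index at the image border (assumed rule: -1 -> 1, h -> h - 2). */
static int mirror_row(int y, int h)
{
    if (y < 0)
        return -y;
    if (y >= h)
        return 2 * (h - 1) - y;
    return y;
}

/* One vertical lifting pass, columnwise: a whole column is finished before
 * the next one starts, so successive accesses are w floats apart. */
static void lift_columnwise(float *data, int w, int h, int parity, float coef)
{
    assert(h >= 2 && (parity == 0 || parity == 1));
    for (int x = 0; x < w; x++)
        for (int y = parity; y < h; y += 2)
            data[y * w + x] += coef * (data[mirror_row(y - 1, h) * w + x] +
                                       data[mirror_row(y + 1, h) * w + x]);
}

/* The same pass, linewise: a whole row is updated from the rows above and
 * below it, so the inner loop walks memory with unit stride. */
static void lift_linewise(float *data, int w, int h, int parity, float coef)
{
    assert(h >= 2 && (parity == 0 || parity == 1));
    for (int y = parity; y < h; y += 2) {
        const float *up   = data + mirror_row(y - 1, h) * w;
        const float *down = data + mirror_row(y + 1, h) * w;
        float *cur        = data + y * w;
        for (int x = 0; x < w; x++)
            cur[x] += coef * (up[x] + down[x]);
    }
}
```

Rows of one parity only read rows of the other parity, so the order in which the rows of a pass are visited does not change the result. That is what makes the linewise form a drop-in replacement in this simplified setting.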
One could try to run the first pixel through the 4th lifting step once the 2nd pixel has finished the 3rd lifting step and the 3rd pixel of the column has finished the 2nd lifting step,
and then do this for all pixels in a row, so that the whole transform is finished for the whole first row before more than 5 rows or so have been touched.

This _COULD_ be faster, but it needs to be tried to be sure.

Basically, there would be an area covering a small number of whole rows,
and this area would move down by 1 or 2 rows in each iteration.
Above it, both horizontal and vertical transforms are finished;
below it is the untouched input.

As long as this sliding window fits in the cache, this should outperform
anything that copies the data around.

I think it's important to look into this before optimizing the code for
CPU or GPU, because the structure of this is different from the transpose-based
code. In fact the horizontal transform is quite unfriendly for SIMD,
so the code as is in C might be the worst possible starting point for SIMD:
it turns the easy vertical transform, via a transpose, into a horizontal one ...

thx

[...]
-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The greatest way to live with honor in this world is to be what we pretend
to be. -- Socrates

--TdDkFkGUMNACHLaO
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iF0EABEKAB0WIQSf8hKLFH72cwut8TNhHseHBAsPqwUCaCZMaQAKCRBhHseHBAsP
q5cIAJoDrYzVCYUNDdt1iTIFQ7qGZ7k2sgCeN+rKkpZRH3Ho+7dEs87CVfDHClE=
=5/7v
-----END PGP SIGNATURE-----

--TdDkFkGUMNACHLaO--

--===============3855613678243751320==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline

_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel

To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
--===============3855613678243751320==--
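
The sliding-window idea from the message can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not FFmpeg code and not a measured optimization: it covers the vertical lifting steps only (no interleaving, scaling, or horizontal transform), the border mirroring is simplified, the coefficients in coef[] are placeholders rather than the real 9/7 constants, and the names lift_row, lift_pass_by_pass and lift_sliding_window are hypothetical. The reference version sweeps the whole image once per lifting step; the sliding-window version lets step s trail s rows behind step 0, so each iteration writes 4 adjacent rows and reads at most 6.

```c
#include <assert.h>

#define N_STEPS 4

/* Placeholder lifting coefficients, one per step (NOT the real 9/7 values). */
static const float coef[N_STEPS] = { 0.5f, -0.25f, 0.125f, -0.5f };

/* Apply lifting step s to row y, reading its vertical neighbours.
 * Assumed border rule: row -1 mirrors to row 1, row h to row h - 2. */
static void lift_row(float *data, int w, int h, int y, int s)
{
    const float *up   = data + (y > 0     ? y - 1 : 1)     * w;
    const float *down = data + (y < h - 1 ? y + 1 : h - 2) * w;
    float *cur        = data + y * w;
    for (int x = 0; x < w; x++)
        cur[x] += coef[s] * (up[x] + down[x]);
}

/* Reference: each step sweeps the whole image before the next step starts.
 * Steps 0 and 2 update odd rows, steps 1 and 3 update even rows. */
static void lift_pass_by_pass(float *data, int w, int h)
{
    assert(h >= 2);
    for (int s = 0; s < N_STEPS; s++)
        for (int y = (s + 1) & 1; y < h; y += 2)
            lift_row(data, w, h, y, s);
}

/* Sliding window: in iteration t (odd), step s processes row t - s.
 * Every step still sees its neighbours in exactly the state the reference
 * version would give them, but only rows t - 4 .. t + 1 are touched. */
static void lift_sliding_window(float *data, int w, int h)
{
    assert(h >= 2);
    for (int t = 1; t - (N_STEPS - 1) < h; t += 2)
        for (int s = 0; s < N_STEPS; s++) {
            int y = t - s;
            if (y >= 0 && y < h)
                lift_row(data, w, h, y, s);
        }
}
```

Whether this ordering is actually faster than the transpose-based approach would still have to be benchmarked, as the message says; the sketch only shows that the reordering preserves the result.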