Date: Fri, 16 May 2025 11:43:35 +0200
From: Michael Niedermayer
To: FFmpeg development discussions and patches
Message-ID: <20250516094335.GY29660@pb2>
References: <20250514164003.GL29660@pb2> <20250515201957.GW29660@pb2>
In-Reply-To: <20250515201957.GW29660@pb2>
Subject: Re: [FFmpeg-devel] [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]

On Thu, May 15, 2025 at 10:19:57PM +0200, Michael Niedermayer wrote:
> Hi
>
> On Wed, May 14, 2025 at 06:40:03PM +0200, Michael Niedermayer wrote:
> > Hi Chitra
> >
> > On Wed, May 14, 2025 at 03:55:59AM +0000, Chitra Dey Sarkar via ffmpeg-devel wrote:
> > > Original Implementation:
> > > ---------------------------------
> > > In the original implementation, the "VER_SD" section processes image data stored in *data using strided memory access in a vertical fashion. This leads to inefficient memory access patterns and cache thrashing due to non-sequential data access across multiple inner loops.
> > >
> > > Proposed Refactor:
> > > ---------------------------------
> > > The proposed refactor replaces this by allocating a cache-friendly 2D array buffer. This change eliminates strided memory access across the three inner loops, significantly improving cache locality and reducing cache thrashing.
> > >
> > > Additionally, the data is transposed outside the lp loop, which allows for efficient per-line access and write-back to the l buffer, further optimizing performance.
> > >
> > > Performance improvements
> > > -------------------------------------------------------
> > > This change results in a substantial performance improvement. Sharing the FPS data benchmarked on our end for the file 'Tears of Steel' using HandBrake
> > >
> > > Device / CPU Model                               Official FPS   Optimized FPS   % Improvement
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)   3.18           6.15            +93%
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)   5.16           7.31            +41%
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)   5.57           9.21            +65%
> > > AMD Ryzen + NVIDIA RTX 4060 Laptop (12C/24T)     9.97           11.22           +12%
> > > Mac Mini Apple M4 Chip                           9.00           12.00           +30%
> > >
> > > -------------------------------------------------------------------------------------------------------
> > > ---
> > >  libavcodec/jpeg2000dwt.c | 72 +++++++++++++++++++++++++++++++---------
> > >  1 file changed, 57 insertions(+), 15 deletions(-)
> > >
> > > diff --git a/libavcodec/jpeg2000dwt.c b/libavcodec/jpeg2000dwt.c index 9ee8122658..45d7897893 100644
> > > --- a/libavcodec/jpeg2000dwt.c
> > > +++ b/libavcodec/jpeg2000dwt.c
> > > @@ -409,6 +409,15 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> > >      /* position at index O of line range [0-5,w+5] cf. extend function */
> > >      line += 5;
> > >
> >
> > > +    /* Find the largest lv and lv to allocate a 2D Array*/
> >
> > lv and lv ?
> > you mean lv and lh ?
> >
> >
> > > +    int max_dim = 0;
> > > +    for (lev = 0; lev < s->ndeclevels; lev++) {
> > > +        if (s->linelen[lev][0] > max_dim) max_dim = s->linelen[lev][0];
> > > +        if (s->linelen[lev][1] > max_dim) max_dim = s->linelen[lev][1];
> >
> > FFMAX()
> >
> >
> > > +    }
> > > +    float *array2DBlock = av_malloc(max_dim * max_dim * sizeof(float));
> > > +    int useFallback = !array2DBlock;
> >
> > also is this supposed to be max_dim_h * max_dim_v ?
> >
> >
> >
> > > +
> > >      for (lev = 0; lev < s->ndeclevels; lev++) {
> > >          int lh = s->linelen[lev][0],
> > >              lv = s->linelen[lev][1],
> > > @@ -431,23 +440,56 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> > >              for (i = 0; i < lh; i++)
> > >                  data[w * lp + i] = l[i];
> > >          }
> > > -
> > > -        // VER_SD
> > > -        l = line + mv;
> > > -        for (lp = 0; lp < lh; lp++) {
> > > -            int i, j = 0;
> > > -            // copy with interleaving
> > > -            for (i = mv; i < lv; i += 2, j++)
> > > -                l[i] = data[w * j + lp];
> > > -            for (i = 1 - mv; i < lv; i += 2, j++)
> > > -                l[i] = data[w * j + lp];
> > > -
> >
> > > -            sr_1d97_float(line, mv, mv + lv);
> >
> > this should be run linewise not columnwise
> > if you don't understand what I mean here, please say so and I'll elaborate
>
> For the record: (may be interesting for others, or others may have comments too)
>
> The 1D transform is made of 4 passes of lifting transforms. Currently all pixels of a column are run through the first lifting step before anything runs through the 2nd. One could try to run the first pixel through the 4th lifting step after the 2nd pixel finishes the 3rd lifting step and the 3rd pixel of the column finishes the 2nd lifting step
> and then do this for all pixels in a row so that the whole transform is finished for the whole first row before more than 5 rows or so have been touched
> this _COULD_ be faster, but it needs to be tried to be sure
>
> basically, there would be an area covering a small number of whole rows and
> this area would move down by 1 or 2 rows in each iteration
> above it both horizontal and vertical transforms are finished; below it
> is the untouched input.
> as long as this sliding window fits in the cache this should outperform
> anything that copies the data around
>
> I think it's important to look into this before optimizing the code for
> CPU or GPU because the structure of this is different than the transpose
> based code. In fact the horizontal transform is quite unfriendly for SIMD
>
> so the code as is in C might be the worst possible starting point for SIMD
> it transforms the easy vertical transform with a transpose into a horizontal
> one ...

More elaboration, I think what I wrote is still unclear

If one looks at spatial_decompose97i() in libavcodec/snow_dwt.c
that's approximately what I had in mind. It's just one page of code

[...]

-- 
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The difference between a dictatorship and a democracy is that every 4 years
the population together is allowed to provide 1 bit of input to the government.
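To illustrate the "linewise not columnwise" point above, here is a minimal C sketch. It is not the patch under review and not FFmpeg code. It assumes a small row-major plane, mirrored borders, and the widely published CDF 9/7 lifting constants with the final scaling left out. All names (W, H, COEF, mirror, vertical_columnwise, vertical_rowwise, vertical_passes_max_diff) are hypothetical. It only shows that a vertical lifting pass can be applied to whole rows with unit-stride access and still give the same result as gathering each column into a temporary line. It does not fuse the four passes into the sliding window described above, and it ignores the mv offset and the low/high band interleaving of the real decoder.

```c
enum { W = 8, H = 6 };  /* small illustrative plane, row-major, stride W */

/* Widely published CDF 9/7 lifting constants, in the order an inverse
 * transform undoes them (delta, gamma, beta, alpha). Scaling is omitted. */
static const float COEF[4] = {
    0.443506852f, 0.882911075f, -0.052980118f, -1.586134342f
};

/* Mirror an index at the borders (whole-sample symmetry, an assumption). */
static int mirror(int y, int n)
{
    if (y < 0)
        return -y;
    if (y >= n)
        return 2 * (n - 1) - y;
    return y;
}

/* Column-wise: gather one column into a temporary line (strided reads),
 * run all four lifting passes on it, then scatter it back. */
static void vertical_columnwise(float *data)
{
    float tmp[H];
    for (int x = 0; x < W; x++) {
        for (int y = 0; y < H; y++)
            tmp[y] = data[y * W + x];
        for (int pass = 0; pass < 4; pass++)
            for (int y = pass & 1; y < H; y += 2)
                tmp[y] -= COEF[pass] * (tmp[mirror(y - 1, H)] +
                                        tmp[mirror(y + 1, H)]);
        for (int y = 0; y < H; y++)
            data[y * W + x] = tmp[y];
    }
}

/* Row-wise: the same vertical lifting, but every update reads two whole
 * neighbouring rows and writes one whole row, all with unit stride, so no
 * gather, scatter or transpose is needed. */
static void vertical_rowwise(float *data)
{
    for (int pass = 0; pass < 4; pass++)
        for (int y = pass & 1; y < H; y += 2) {
            float       *cur   = data + y * W;
            const float *above = data + mirror(y - 1, H) * W;
            const float *below = data + mirror(y + 1, H) * W;
            for (int x = 0; x < W; x++)
                cur[x] -= COEF[pass] * (above[x] + below[x]);
        }
}

/* Run both variants on the same input and return the largest difference. */
float vertical_passes_max_diff(void)
{
    float a[W * H], b[W * H], max_diff = 0.0f;
    for (int i = 0; i < W * H; i++)
        a[i] = b[i] = (float)((i * 37) % 101) / 10.0f;
    vertical_columnwise(a);
    vertical_rowwise(b);
    for (int i = 0; i < W * H; i++) {
        float d = a[i] - b[i];
        if (d < 0.0f)
            d = -d;
        if (d > max_diff)
            max_diff = d;
    }
    return max_diff;
}
```

Calling vertical_passes_max_diff() runs both variants on the same test plane. The two should agree up to float rounding, because within one pass the updated rows only read rows of the other parity. The inner loop over x in vertical_rowwise is the part that would map naturally onto SIMD.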