From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <ffmpeg-devel-bounces@ffmpeg.org>
Received: from ffbox0-bg.ffmpeg.org (ffbox0-bg.ffmpeg.org [79.124.17.100])
	by master.gitmailbox.com (Postfix) with ESMTPS id AC9874D19F
	for <ffmpegdev@gitmailbox.com>; Fri, 16 May 2025 09:43:47 +0000 (UTC)
Received: from [127.0.1.1] (localhost [127.0.0.1])
	by ffbox0-bg.ffmpeg.org (Postfix) with ESMTP id A71D868C9CA;
	Fri, 16 May 2025 12:43:43 +0300 (EEST)
Received: from relay6-d.mail.gandi.net (relay6-d.mail.gandi.net
 [217.70.183.198])
 by ffbox0-bg.ffmpeg.org (Postfix) with ESMTPS id 2564A68C8D5
 for <ffmpeg-devel@ffmpeg.org>; Fri, 16 May 2025 12:43:37 +0300 (EEST)
Received: by mail.gandi.net (Postfix) with ESMTPSA id 7BFE843292
 for <ffmpeg-devel@ffmpeg.org>; Fri, 16 May 2025 09:43:36 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=niedermayer.cc;
 s=gm1; t=1747388616;
 h=from:from:reply-to:subject:subject:date:date:message-id:message-id:
 to:to:cc:mime-version:mime-version:content-type:content-type:
 in-reply-to:in-reply-to:references:references;
 bh=rmlkZ4E/jYHO9Qz0zliTvq/e4DoKXlSIpBsNnFW2IAg=;
 b=ZrYZb2pqWGjZPCXIlHiucjB/LP3GXbxSD/dFLmmFwp00DePZsd+jy4kvebQh7DmxHOsHxH
 478ebTtWksBaMkbDpnRE4LmzZKtGcrKBYdzifFsAMwZb7TXuZ9J8iunaUXBiS1oyjRWQnq
 7oQFeuw4Vlte4gptrrfic2G78VrjVyN0vfQaCh7HhWtvswN8pyqCNWZ21ReqFaRu5NOqC8
 qUu4uGVUbEWVIO/NxjcopzAW6BhpajIHXQHq3iS4pTiNpB2mkWE3ub+rByTReRlK48Z68D
 pCQVwe/hHceOtaVf+g91hQ5k3ZtvZRHX14g8Yqgm9xdYzBI0V7kHtRjaKcAEJQ==
Date: Fri, 16 May 2025 11:43:35 +0200
From: Michael Niedermayer <michael@niedermayer.cc>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Message-ID: <20250516094335.GY29660@pb2>
References: <DM4PR21MB4619D5DBB0A33C2A32699BA79291A@DM4PR21MB4619.namprd21.prod.outlook.com>
 <DM4PR21MB4619014F8408D7881D8EA42D9291A@DM4PR21MB4619.namprd21.prod.outlook.com>
 <20250514164003.GL29660@pb2> <20250515201957.GW29660@pb2>
MIME-Version: 1.0
In-Reply-To: <20250515201957.GW29660@pb2>
X-GND-State: clean
X-GND-Score: -70
X-GND-Cause: gggruggvucftvghtrhhoucdtuddrgeefvddrtddtgdefuddvgedvucetufdoteggodetrfdotffvucfrrhhofhhilhgvmecuifetpfffkfdpucggtfgfnhhsuhgsshgtrhhisggvnecuuegrihhlohhuthemuceftddunecusecvtfgvtghiphhivghnthhsucdlqddutddtmdenfghrlhcuvffnffculdeftddmnecujfgurhepfffhvffukfhfgggtuggjsehgtderredttddvnecuhfhrohhmpefoihgthhgrvghlucfpihgvuggvrhhmrgihvghruceomhhitghhrggvlhesnhhivgguvghrmhgrhigvrhdrtggtqeenucggtffrrghtthgvrhhnpeeigeektdejudffjefhteegjedtgeettefggedthfejgfevhfetgeekjedtvdfhveenucfkphepgedurdeiiedrieejrdduudefnecuvehluhhsthgvrhfuihiivgeptdenucfrrghrrghmpehinhgvthepgedurdeiiedrieejrdduudefpdhhvghloheplhhotggrlhhhohhsthdpmhgrihhlfhhrohhmpehmihgthhgrvghlsehnihgvuggvrhhmrgihvghrrdgttgdpnhgspghrtghpthhtohepuddprhgtphhtthhopehffhhmphgvghdquggvvhgvlhesfhhfmhhpvghgrdhorhhg
X-GND-Sasl: michael@niedermayer.cc
Subject: Re: [FFmpeg-devel] [PATCH] Boost FPS and performance: Optimize
 vertical loop for cache-friendly access
 [libavcodec/jpeg2000dwt.c:dwt_decode97_float]
X-BeenThere: ffmpeg-devel@ffmpeg.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: FFmpeg development discussions and patches <ffmpeg-devel.ffmpeg.org>
List-Unsubscribe: <https://ffmpeg.org/mailman/options/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=unsubscribe>
List-Archive: <https://ffmpeg.org/pipermail/ffmpeg-devel>
List-Post: <mailto:ffmpeg-devel@ffmpeg.org>
List-Help: <mailto:ffmpeg-devel-request@ffmpeg.org?subject=help>
List-Subscribe: <https://ffmpeg.org/mailman/listinfo/ffmpeg-devel>,
 <mailto:ffmpeg-devel-request@ffmpeg.org?subject=subscribe>
Reply-To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Content-Type: multipart/mixed; boundary="===============0286181493410625712=="
Errors-To: ffmpeg-devel-bounces@ffmpeg.org
Sender: "ffmpeg-devel" <ffmpeg-devel-bounces@ffmpeg.org>
Archived-At: <https://master.gitmailbox.com/ffmpegdev/20250516094335.GY29660@pb2/>
List-Archive: <https://master.gitmailbox.com/ffmpegdev/>
List-Post: <mailto:ffmpegdev@gitmailbox.com>


--===============0286181493410625712==
Content-Type: multipart/signed; micalg=pgp-sha512;
	protocol="application/pgp-signature"; boundary="HQLHrCACpVdwMl/9"
Content-Disposition: inline


--HQLHrCACpVdwMl/9
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Content-Transfer-Encoding: quoted-printable

On Thu, May 15, 2025 at 10:19:57PM +0200, Michael Niedermayer wrote:
> Hi
>=20
> On Wed, May 14, 2025 at 06:40:03PM +0200, Michael Niedermayer wrote:
> > Hi Chitra
> >=20
> > On Wed, May 14, 2025 at 03:55:59AM +0000, Chitra Dey Sarkar via ffmpeg-=
devel wrote:
> > > Original Implementation:
> > > ---------------------------------
> > > In the original implementation, the "VER_SD" section processes image =
data stored in *data using strided memory access in a vertical fashion. =
This leads to inefficient memory access patterns and cache thrashing due =
to non-sequential data access across multiple inner loops.
> > >=20
> > > Proposed Refactor:
> > > ---------------------------------
> > > The proposed refactor replaces this  by allocating a cache-friendly 2=
D array buffer. This change eliminates strided memory access across the thr=
ee inner loops, significantly improving cache locality and reducing cache t=
hrashing.
> > >=20
> > > Additionally, the data is transposed outside the lp loop, which allow=
s for efficient per-line access and write-back to the l buffer, further opt=
imizing performance.
> > >=20
> > > Performance improvements
> > > -------------------------------------------------------
> > > This change results in a substantial performance improvement. The =
FPS data below was benchmarked on our end for the file 'Tears of Steel' =
using HandBrake:
> > >=20
> > > Device / CPU model                              =
Official FPS  Optimized FPS  % Improvement
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)  =
        3.18           6.15           +93%
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)  =
        5.16           7.31           +41%
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)  =
        5.57           9.21           +65%
> > > AMD Ryzen + NVIDIA RTX 4060 Laptop (12C/24T)    =
        9.97          11.22           +12%
> > > Mac Mini Apple M4 Chip                          =
        9.00          12.00           +30%
> > >=20
> > > ---------------------------------------------------------------------=
----------------------------------
> > > ---
> > >  libavcodec/jpeg2000dwt.c | 72 +++++++++++++++++++++++++++++++-------=
--
> > >  1 file changed, 57 insertions(+), 15 deletions(-)
> > >=20
> > > diff --git a/libavcodec/jpeg2000dwt.c b/libavcodec/jpeg2000dwt.c inde=
x 9ee8122658..45d7897893 100644
> > > --- a/libavcodec/jpeg2000dwt.c
> > > +++ b/libavcodec/jpeg2000dwt.c
> > > @@ -409,6 +409,15 @@ static void dwt_decode97_float(DWTContext *s, fl=
oat *t)
> > >      /* position at index O of line range [0-5,w+5] cf. extend functi=
on */
> > >      line +=3D 5;
> > >=20
> >=20
> > > +    /* Find the largest lv and lv to allocate a 2D Array*/
> >=20
> > lv and lv ?
> > you mean lv and lh ?
> >=20
> >=20
> > > +    int max_dim =3D 0;
> > > +    for (lev =3D 0; lev < s->ndeclevels; lev++) {
> > > +        if (s->linelen[lev][0]  > max_dim) max_dim =3D s->linelen[le=
v][0];
> > > +        if (s->linelen[lev][1] > max_dim) max_dim =3D s->linelen[lev=
][1];
> >=20
> > FFMAX()
> >=20
> >=20
> > > +    }
> > > +    float *array2DBlock =3D av_malloc(max_dim * max_dim * sizeof(flo=
at));
> > > +    int useFallback =3D !array2DBlock;
> >=20
> > also is this supposed to be max_dim_h * max_dim_v ?
> >=20
> >=20
> >=20
> > > +
> > >      for (lev =3D 0; lev < s->ndeclevels; lev++) {
> > >          int lh =3D s->linelen[lev][0],
> > >              lv =3D s->linelen[lev][1],
> > > @@ -431,23 +440,56 @@ static void dwt_decode97_float(DWTContext *s, f=
loat *t)
> > >              for (i =3D 0; i < lh; i++)
> > >                  data[w * lp + i] =3D l[i];
> > >          }
> > > -
> > > -        // VER_SD
> > > -        l =3D line + mv;
> > > -        for (lp =3D 0; lp < lh; lp++) {
> > > -            int i, j =3D 0;
> > > -            // copy with interleaving
> > > -            for (i =3D mv; i < lv; i +=3D 2, j++)
> > > -                l[i] =3D data[w * j + lp];
> > > -            for (i =3D 1 - mv; i < lv; i +=3D 2, j++)
> > > -                l[i] =3D data[w * j + lp];
> > > -
> >=20
> > > -            sr_1d97_float(line, mv, mv + lv);
> >=20
> > this should be run linewise not columnwise
> > if you dont understand what i mean here, please say so and ill elaborate
>=20
> For the record: (may be interesting for others, or others may have comme=
nts too)
>=20
> <michaelni> The 1D transform is made of 4 passes of lifting transforms, c=
urrently all pixels of a column are run through the first lifting step befo=
re anything runs through the 2nd. One could try to run the first through th=
e 4th lifting step after the 2nd finishes the 3rd lifting step and the 3rd =
pixel of the column finishes the 2nd lifting step
> <michaelni> and then do this for all pixels in a row so that the whole tr=
ansform is finished for the whole first row before more than 5 rows or so h=
ave been touched
> <michaelni> this _COULD_ be faster, but it needs to be tried to be sure
>=20
> basically, there would be an area covering a small number of whole rows
> and this area would move down by 1 or 2 rows in each iteration.
> above it both horizontal and vertical transforms are finished, below it
> is the untouched input.
> as long as this sliding window fits in the cache this should outperform
> anything that copies the data around
>=20
> I think its important to look into this before optimizing the code for
> CPU or GPU because the structure of this is different than the transpose
> based code. In fact the horizontal transform is quite unfriendly for SIMD
>=20
> so the code as is in C might be the worst possible starting point for SIMD
> it transforms the easy vertical transform with a transpose into a horizon=
tal
> one ...

More elaboration, as I think what I wrote is still unclear:

If one looks at spatial_decompose97i() in libavcodec/snow_dwt.c,
that's approximately what I had in mind.
It's just one page of code.
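
A minimal standalone sketch of the sliding-window idea, for illustration
only: it is not FFmpeg code and not a copy of spatial_decompose97i().
It assumes a generic in-place four-step lifting scheme with the commonly
quoted 9/7 constants, mirrored borders and no scaling. All names (K,
mirror, lift_row, vertical_columnwise, vertical_sliding,
compare_variants) are hypothetical. vertical_columnwise() gathers every
column with strided reads, while vertical_sliding() applies each lifting
step to whole rows, so only a handful of rows are in flight at a time:

```c
#include <assert.h>
#include <stdlib.h>

/* Commonly quoted 9/7 lifting constants; scaling is left out. */
static const float K[4] = { -1.586134342f, -0.052980118f,
                             0.882911075f,  0.443506852f };

/* Reflect a row index one step outside [0, h-1]; needs h of at least 2. */
static int mirror(int y, int h)
{
    return y < 0 ? -y : (y >= h ? 2 * (h - 1) - y : y);
}

/* Apply one lifting step to the whole row y; inner loop is contiguous. */
static void lift_row(float *d, int w, int h, int y, float k)
{
    if (y < 0 || y >= h)
        return;
    const float *up = d + w * mirror(y - 1, h);
    const float *dn = d + w * mirror(y + 1, h);
    float *row = d + w * y;
    for (int x = 0; x < w; x++)
        row[x] += k * (up[x] + dn[x]);
}

/* Reference: gather each column (strided), run 4 full passes, scatter. */
static void vertical_columnwise(float *d, int w, int h, float *col)
{
    for (int x = 0; x < w; x++) {
        for (int y = 0; y < h; y++)
            col[y] = d[w * y + x];
        /* steps 1 and 3 update odd samples, steps 2 and 4 even ones */
        for (int s = 0; s < 4; s++)
            for (int y = (s + 1) & 1; y < h; y += 2)
                col[y] += K[s] * (col[mirror(y - 1, h)] +
                                  col[mirror(y + 1, h)]);
        for (int y = 0; y < h; y++)
            d[w * y + x] = col[y];
    }
}

/* Sliding window: step 1 runs on row t+1 while step 4 finishes row t-2,
 * so each step only reads neighbours that are in the state it expects. */
static void vertical_sliding(float *d, int w, int h)
{
    for (int t = 0; t <= h + 1; t += 2)
        for (int s = 0; s < 4; s++)
            lift_row(d, w, h, t + 1 - s, K[s]);
}

/* Run both variants on the same input; return the largest deviation,
 * or -1 if an allocation fails. */
static float compare_variants(int w, int h)
{
    assert(w > 0 && h >= 2);
    float *a = malloc(sizeof(*a) * w * h);
    float *b = malloc(sizeof(*b) * w * h);
    float *col = malloc(sizeof(*col) * h);
    float worst = -1.0f;
    if (a && b && col) {
        for (int i = 0; i < w * h; i++)
            a[i] = b[i] = (float)((i * 37) % 101) - 50.0f;
        vertical_columnwise(a, w, h, col);
        vertical_sliding(b, w, h);
        worst = 0.0f;
        for (int i = 0; i < w * h; i++) {
            float e = a[i] - b[i];
            if (e < 0.0f)
                e = -e;
            if (e > worst)
                worst = e;
        }
    }
    free(a);
    free(b);
    free(col);
    return worst;
}
```

Both variants perform the same arithmetic per sample, so
compare_variants(w, h) should return a value at or near zero for any
size with at least 2 rows. Whether the sliding variant is actually
faster still has to be measured.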

[...]

--=20
Michael     GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB

The difference between a dictatorship and a democracy is that every 4 years
the population together is allowed to provide 1 bit of input to the governm=
ent.

--HQLHrCACpVdwMl/9
Content-Type: application/pgp-signature; name="signature.asc"

-----BEGIN PGP SIGNATURE-----

iF0EABEKAB0WIQSf8hKLFH72cwut8TNhHseHBAsPqwUCaCcIwwAKCRBhHseHBAsP
qyeoAKCbmvNozfVHuRaDoWLLpa8Ck78uOACggGSJDdDW/LUCRIWPfd83RootwbY=
=wuCL
-----END PGP SIGNATURE-----

--HQLHrCACpVdwMl/9--

--===============0286181493410625712==
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
Content-Disposition: inline


--===============0286181493410625712==--