From: Michael Niedermayer <michael@niedermayer.cc>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: Re: [FFmpeg-devel] [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]
Date: Thu, 15 May 2025 22:19:57 +0200
Message-ID: <20250515201957.GW29660@pb2> (raw)
In-Reply-To: <20250514164003.GL29660@pb2>
[-- Attachment #1.1: Type: text/plain, Size: 5650 bytes --]
Hi
On Wed, May 14, 2025 at 06:40:03PM +0200, Michael Niedermayer wrote:
> Hi Chitra
>
> On Wed, May 14, 2025 at 03:55:59AM +0000, Chitra Dey Sarkar via ffmpeg-devel wrote:
> > Original Implementation:
> > ---------------------------------
> > In the original implementation, the "VER_SD" section processes image data stored in *data using strided memory access in a vertical fashion This leads to inefficient memory access patterns and cache thrashing due to non-sequential data access across multiple inner loops.
> >
> > Proposed Refactor:
> > ---------------------------------
> > The proposed refactor replaces this by allocating a cache-friendly 2D array buffer. This change eliminates strided memory access across the three inner loops, significantly improving cache locality and reducing cache thrashing.
> >
> > Additionally, the data is transposed outside the lp loop, which allows for efficient per-line access and write-back to the l buffer, further optimizing performance.
> >
> > Performance improvements
> > -------------------------------------------------------
> > This change results in a substantial performance improvement Sharing the FPS data benchmarked on our end for the file 'Tears of Steel' using HandBrake
> >
> > Device / CPU Model Official FPS Optimized FPS % Improvement
> > Surface Laptop 11 (10-core X1P64100, L2: 36MB) 3.18 6.15 +93%
> > Surface Laptop 11(10-core X1P64100, L2: 36MB) 5.16 7.31 +41%
> > Surface Laptop 11 (10-core X1P64100, L2: 36MB) 5.57 9.21 +65%
> > AMD Ryzen + NVIDIA RTX 4060 Laptop (12C/24T) 9.97 11.22 +12%
> > Mac Mini Apple M4 Chip 9.00 12.00 +30%
> >
> > -------------------------------------------------------------------------------------------------------
> > ---
> > libavcodec/jpeg2000dwt.c | 72 +++++++++++++++++++++++++++++++---------
> > 1 file changed, 57 insertions(+), 15 deletions(-)
> >
> > diff --git a/libavcodec/jpeg2000dwt.c b/libavcodec/jpeg2000dwt.c index 9ee8122658..45d7897893 100644
> > --- a/libavcodec/jpeg2000dwt.c
> > +++ b/libavcodec/jpeg2000dwt.c
> > @@ -409,6 +409,15 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> > /* position at index O of line range [0-5,w+5] cf. extend function */
> > line += 5;
> >
>
> > + /* Find the largest lv and lv to allocate a 2D Array*/
>
> lv and lv ?
> you mean lv anf lh ?
>
>
> > + int max_dim = 0;
> > + for (lev = 0; lev < s->ndeclevels; lev++) {
> > + if (s->linelen[lev][0] > max_dim) max_dim = s->linelen[lev][0];
> > + if (s->linelen[lev][1] > max_dim) max_dim = s->linelen[lev][1];
>
> FFMAX()
>
>
> > + }
> > + float *array2DBlock = av_malloc(max_dim * max_dim * sizeof(float));
> > + int useFallback = !array2DBlock;
>
> also is this supposed to be max_dim_h * max_dim_v ?
>
>
>
> > +
> > for (lev = 0; lev < s->ndeclevels; lev++) {
> > int lh = s->linelen[lev][0],
> > lv = s->linelen[lev][1],
> > @@ -431,23 +440,56 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> > for (i = 0; i < lh; i++)
> > data[w * lp + i] = l[i];
> > }
> > -
> > - // VER_SD
> > - l = line + mv;
> > - for (lp = 0; lp < lh; lp++) {
> > - int i, j = 0;
> > - // copy with interleaving
> > - for (i = mv; i < lv; i += 2, j++)
> > - l[i] = data[w * j + lp];
> > - for (i = 1 - mv; i < lv; i += 2, j++)
> > - l[i] = data[w * j + lp];
> > -
>
> > - sr_1d97_float(line, mv, mv + lv);
>
> this should be run linewise not columnwise
> if you dont understand what i mean here, please say so and ill elaborate
For the record: (may be interresting for others, or others may have comments too)
<michaelni> The 1D transform is made of 4 passes of lifting transforms, currently all pixels of a column are run through the first lifting step before anything runs through the 2nd. One could try to run the first through the 4th lifting step after the 2nd finishes the 3rd lifting step and the 3rd pixel of the column finishes the 2nd lifting step
<michaelni> and then do this for all pixels in a row so that the whole transform is finished for the whole first row before more than 5 rows or so have been touched
<michaelni> this _COULD_ be faster, but it needs to be tried to be sure
basically, there would be a area covering s small number of whole rows and
this area would move down by 1 or 2 rows in each iteration
above it both horizontal and vertical transforms are finished below it
its the untouched input.
as long as this sliding window fits in the cache this should outperform
anything that copyies the data around
I think its important to look into this before optimizing the code for
CPU or GPU because the structure of this is different than the transpose
based code. In fact the horizontal transform is quite unfriendly for SIMD
so the code as is in C might be to worst possible starting point for SIMD
it transforms the easy vertical transform with a transpose into a horizontal
one ...
thx
[...]
--
Michael GnuPG fingerprint: 9FF2128B147EF6730BADF133611EC787040B0FAB
The greatest way to live with honor in this world is to be what we pretend
to be. -- Socrates
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 195 bytes --]
[-- Attachment #2: Type: text/plain, Size: 251 bytes --]
_______________________________________________
ffmpeg-devel mailing list
ffmpeg-devel@ffmpeg.org
https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
To unsubscribe, visit link above, or email
ffmpeg-devel-request@ffmpeg.org with subject "unsubscribe".
next prev parent reply other threads:[~2025-05-15 20:20 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <DM4PR21MB4619D5DBB0A33C2A32699BA79291A@DM4PR21MB4619.namprd21.prod.outlook.com>
2025-05-14 3:55 ` Chitra Dey Sarkar via ffmpeg-devel
2025-05-14 16:40 ` Michael Niedermayer
2025-05-14 17:42 ` [FFmpeg-devel] [EXTERNAL] " Chitra Dey Sarkar via ffmpeg-devel
2025-05-15 20:19 ` Michael Niedermayer [this message]
2025-05-16 9:43 ` [FFmpeg-devel] " Michael Niedermayer
2025-05-16 20:06 ` [FFmpeg-devel] [EXTERNAL] " Chitra Dey Sarkar via ffmpeg-devel
2025-05-14 19:12 [FFmpeg-devel] " Chitra Dey Sarkar via ffmpeg-devel
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250515201957.GW29660@pb2 \
--to=michael@niedermayer.cc \
--cc=ffmpeg-devel@ffmpeg.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
This inbox may be cloned and mirrored by anyone:
git clone --mirror https://master.gitmailbox.com/ffmpegdev/0 ffmpegdev/git/0.git
# If you have public-inbox 1.1+ installed, you may
# initialize and index your mirror using the following commands:
public-inbox-init -V2 ffmpegdev ffmpegdev/ https://master.gitmailbox.com/ffmpegdev \
ffmpegdev@gitmailbox.com
public-inbox-index ffmpegdev
Example config snippet for mirrors.
AGPL code for this site: git clone https://public-inbox.org/public-inbox.git