Git Inbox Mirror of the ffmpeg-devel mailing list - see https://ffmpeg.org/mailman/listinfo/ffmpeg-devel
From: Chitra Dey Sarkar via ffmpeg-devel <ffmpeg-devel@ffmpeg.org>
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Cc: Chitra Dey Sarkar <Chitra.Dey@microsoft.com>,
	Michael Niedermayer <michael@niedermayer.cc>,
	Linda Zhang <Zhang.Linda@microsoft.com>,
	Vittal Pai <vipai@microsoft.com>
Subject: Re: [FFmpeg-devel] [EXTERNAL] Re: [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]
Date: Fri, 16 May 2025 20:06:05 +0000
Message-ID: <DM4PR21MB46196DFF7B0128C31619F6569293A@DM4PR21MB4619.namprd21.prod.outlook.com> (raw)
In-Reply-To: <20250516094335.GY29660@pb2>



-----Original Message-----
From: ffmpeg-devel <ffmpeg-devel-bounces@ffmpeg.org> On Behalf Of Michael Niedermayer
Sent: Friday, May 16, 2025 2:44 AM
To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org>
Subject: [EXTERNAL] Re: [FFmpeg-devel] [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]

On Thu, May 15, 2025 at 10:19:57PM +0200, Michael Niedermayer wrote:
> Hi
> 
> On Wed, May 14, 2025 at 06:40:03PM +0200, Michael Niedermayer wrote:
> > Hi Chitra
> > 
> > On Wed, May 14, 2025 at 03:55:59AM +0000, Chitra Dey Sarkar via ffmpeg-devel wrote:
> > > Original Implementation:
> > > ---------------------------------
> > > In the original implementation, the "VER_SD" section processes image data stored in *data column by column, using strided memory access. This leads to inefficient memory access patterns and cache thrashing due to non-sequential data access across multiple inner loops.
> > > 
> > > Proposed Refactor:
> > > ---------------------------------
> > > The proposed refactor replaces this by allocating a cache-friendly 2D array buffer. This change eliminates strided memory access across the three inner loops, improving cache locality and reducing cache thrashing.
> > > 
> > > Additionally, the data is transposed outside the lp loop, which allows for efficient per-line access and write-back to the l buffer, further optimizing performance.
> > > 
> > > Performance improvements
> > > -------------------------------------------------------
> > > This change results in a substantial performance improvement.
> > > Below is the FPS data benchmarked on our end for the file 'Tears of
> > > Steel' using HandBrake:
> > > 
> > > Device / CPU Model                              Official FPS  Optimized FPS  % Improvement
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)          3.18           6.15           +93%
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)          5.16           7.31           +41%
> > > Surface Laptop 11 (10-core X1P64100, L2: 36MB)          5.57           9.21           +65%
> > > AMD Ryzen + NVIDIA RTX 4060 Laptop (12C/24T)            9.97          11.22           +12%
> > > Mac Mini Apple M4 Chip                                  9.00          12.00           +30%
> > > 
> > > -------------------------------------------------------------------------------------------------------
> > > ---
> > >  libavcodec/jpeg2000dwt.c | 72 +++++++++++++++++++++++++++++++---------
> > >  1 file changed, 57 insertions(+), 15 deletions(-)
> > > 
> > > diff --git a/libavcodec/jpeg2000dwt.c b/libavcodec/jpeg2000dwt.c 
> > > index 9ee8122658..45d7897893 100644
> > > --- a/libavcodec/jpeg2000dwt.c
> > > +++ b/libavcodec/jpeg2000dwt.c
> > > @@ -409,6 +409,15 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> > > +
> > >      for (lev = 0; lev < s->ndeclevels; lev++) {
> > >          int lh = s->linelen[lev][0],
> > >              lv = s->linelen[lev][1],
> > > @@ -431,23 +440,56 @@ static void dwt_decode97_float(DWTContext *s, float *t)
> > >              for (i = 0; i < lh; i++)
> > >                  data[w * lp + i] = l[i];
> > >          }
> > > -
> > > -        // VER_SD
> > > -        l = line + mv;
> > > -        for (lp = 0; lp < lh; lp++) {
> > > -            int i, j = 0;
> > > -            // copy with interleaving
> > > -            for (i = mv; i < lv; i += 2, j++)
> > > -                l[i] = data[w * j + lp];
> > > -            for (i = 1 - mv; i < lv; i += 2, j++)
> > > -                l[i] = data[w * j + lp];
> > > -
> > 
> > > -            sr_1d97_float(line, mv, mv + lv);
> > 
> > This should be run linewise, not columnwise. If you don't understand
> > what I mean here, please say so and I'll elaborate.
> 
> For the record (may be interesting for others, or others may have
> comments too):
> 
> <michaelni> The 1D transform is made of 4 passes of lifting
>   transforms; currently all pixels of a column are run through the
>   first lifting step before anything runs through the 2nd. One could
>   try to run the first pixel through the 4th lifting step after the
>   2nd pixel finishes the 3rd lifting step and the 3rd pixel of the
>   column finishes the 2nd lifting step
> <michaelni> and then do this for all pixels in a row so that the
>   whole transform is finished for the whole first row before more
>   than 5 rows or so have been touched
> <michaelni> this _COULD_ be faster, but it needs to be tried to be
>   sure
> 
> Basically, there would be an area covering a small number of whole
> rows, and this area would move down by 1 or 2 rows in each iteration.
> Above it, both horizontal and vertical transforms are finished; below
> it is the untouched input.
> As long as this sliding window fits in the cache, this should
> outperform anything that copies the data around.
> 
> I think it's important to look into this before optimizing the code for
> CPU or GPU, because the structure of this is different from the
> transpose-based code. In fact, the horizontal transform is quite
> unfriendly for SIMD,
> 
> so the code as is in C might be the worst possible starting point for
> SIMD: it transforms the easy vertical transform with a transpose into a
> horizontal one ...

More elaboration, as I think what I wrote is still unclear:

If one looks at spatial_decompose97i() in libavcodec/snow_dwt.c, that's approximately what I had in mind.
It's just one page of code.

==========> Acknowledged. I will look at this part of the code to see whether I can do something similar.

Thread overview: 6+ messages
     [not found] <DM4PR21MB4619D5DBB0A33C2A32699BA79291A@DM4PR21MB4619.namprd21.prod.outlook.com>
2025-05-14  3:55 ` [FFmpeg-devel] " Chitra Dey Sarkar via ffmpeg-devel
2025-05-14 16:40   ` Michael Niedermayer
2025-05-14 17:42     ` [FFmpeg-devel] [EXTERNAL] " Chitra Dey Sarkar via ffmpeg-devel
2025-05-15 20:19     ` [FFmpeg-devel] " Michael Niedermayer
2025-05-16  9:43       ` Michael Niedermayer
2025-05-16 20:06         ` Chitra Dey Sarkar via ffmpeg-devel [this message]
