-----Original Message----- From: ffmpeg-devel <ffmpeg-devel-boun...@ffmpeg.org> On Behalf Of Michael Niedermayer Sent: Friday, May 16, 2025 2:44 AM To: FFmpeg development discussions and patches <ffmpeg-devel@ffmpeg.org> Subject: [EXTERNAL] Re: [FFmpeg-devel] [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float] On Thu, May 15, 2025 at 10:19:57PM +0200, Michael Niedermayer wrote: > Hi > > On Wed, May 14, 2025 at 06:40:03PM +0200, Michael Niedermayer wrote: > > Hi Chitra > > > > On Wed, May 14, 2025 at 03:55:59AM +0000, Chitra Dey Sarkar via > > ffmpeg-devel wrote: > > > Original Implementation: > > > --------------------------------- > > > In the original implementation, the "VER_SD" section processes image data > > > stored in *data using strided memory access in a vertical fashion This > > > leads to inefficient memory access patterns and cache thrashing due to > > > non-sequential data access across multiple inner loops. > > > > > > Proposed Refactor: > > > --------------------------------- > > > The proposed refactor replaces this by allocating a cache-friendly 2D > > > array buffer. This change eliminates strided memory access across the > > > three inner loops, significantly improving cache locality and reducing > > > cache thrashing. > > > > > > Additionally, the data is transposed outside the lp loop, which allows > > > for efficient per-line access and write-back to the l buffer, further > > > optimizing performance. > > > > > > Performance improvements > > > ------------------------------------------------------- > > > This change results in a substantial performance improvement > > > Sharing the FPS data benchmarked on our end for the file 'Tears of > > > Steel' using HandBrake > > > > > > Device / CPU Model Official FPS > > > Optimized FPS % Improvement > > > Surface Laptop 11 (10-core X1P64100, L2: 36MB) 3.18 > > > 6.15 +93% > > > Surface Laptop 11(10-core X1P64100, L2: 36MB) 5.16 7.31 > > > +41% > > > Surface Laptop 11 (10-core X1P64100, L2: 36MB) 5.57 > > > 9.21 +65% > > > AMD Ryzen + NVIDIA RTX 4060 Laptop (12C/24T) 9.97 > > > 11.22 +12% > > > Mac Mini Apple M4 Chip 9.00 > > > 12.00 +30% > > > > > > ------------------------------------------------------------------ > > > ------------------------------------- > > > --- > > > libavcodec/jpeg2000dwt.c | 72 > > > +++++++++++++++++++++++++++++++--------- > > > 1 file changed, 57 insertions(+), 15 deletions(-) > > > > > > diff --git a/libavcodec/jpeg2000dwt.c b/libavcodec/jpeg2000dwt.c > > > index 9ee8122658..45d7897893 100644 > > > --- a/libavcodec/jpeg2000dwt.c > > > +++ b/libavcodec/jpeg2000dwt.c > > > @@ -409,6 +409,15 @@ static void dwt_decode97_float(DWTContext *s, float > > > *t) > > > + > > > for (lev = 0; lev < s->ndeclevels; lev++) { > > > int lh = s->linelen[lev][0], > > > lv = s->linelen[lev][1], @@ -431,23 +440,56 @@ static > > > void dwt_decode97_float(DWTContext *s, float *t) > > > for (i = 0; i < lh; i++) > > > data[w * lp + i] = l[i]; > > > } > > > - > > > - // VER_SD > > > - l = line + mv; > > > - for (lp = 0; lp < lh; lp++) { > > > - int i, j = 0; > > > - // copy with interleaving > > > - for (i = mv; i < lv; i += 2, j++) > > > - l[i] = data[w * j + lp]; > > > - for (i = 1 - mv; i < lv; i += 2, j++) > > > - l[i] = data[w * j + lp]; > > > - > > > > > - sr_1d97_float(line, mv, mv + lv); > > > > this should be run linewise not columnwise if you dont understand > > what i mean here, please say so and ill elaborate > > For the record: (may be interresting for others, or others may have > comments too) > > <michaelni> The 1D transform is made of 4 passes of lifting > transforms, currently all pixels of a column are run through the first > lifting step before anything runs through the 2nd. One could try to > run the first through the 4th lifting step after the 2nd finishes the > 3rd lifting step and the 3rd pixel of the column finishes the 2nd > lifting step <michaelni> and then do this for all pixels in a row so > that the whole transform is finished for the whole first row before > more than 5 rows or so have been touched <michaelni> this _COULD_ be > faster, but it needs to be tried to be sure > > basically, there would be a area covering s small number of whole rows > and this area would move down by 1 or 2 rows in each iteration above > it both horizontal and vertical transforms are finished below it its > the untouched input. > as long as this sliding window fits in the cache this should > outperform anything that copyies the data around > > I think its important to look into this before optimizing the code for > CPU or GPU because the structure of this is different than the > transpose based code. In fact the horizontal transform is quite > unfriendly for SIMD > > so the code as is in C might be to worst possible starting point for > SIMD it transforms the easy vertical transform with a transpose into a > horizontal one ... More elaboration, i think what i wrote is still unclear If one looks at spatial_decompose97i() in libavcodec/snow_dwt.c thats approximately what i had in mind. Its just one page of code ==========> Acknowledged . I will look at this part of the code to see if I am able to do something similar _______________________________________________ ffmpeg-devel mailing list ffmpeg-devel@ffmpeg.org https://ffmpeg.org/mailman/listinfo/ffmpeg-devel To unsubscribe, visit link above, or email ffmpeg-devel-requ...@ffmpeg.org with subject "unsubscribe".
Re: [FFmpeg-devel] [EXTERNAL] Re: [PATCH] Boost FPS and performance: Optimize vertical loop for cache-friendly access [libavcodec/jpeg2000dwt.c:dwt_decode97_float]
Chitra Dey Sarkar via ffmpeg-devel Fri, 16 May 2025 13:06:50 -0700
- [FFmpeg-devel] [PATCH] Boost FPS and pe... Chitra Dey Sarkar via ffmpeg-devel
- Re: [FFmpeg-devel] [PATCH] Boost F... Michael Niedermayer
- Re: [FFmpeg-devel] [EXTERNAL] ... Chitra Dey Sarkar via ffmpeg-devel
- Re: [FFmpeg-devel] [PATCH] Boo... Michael Niedermayer
- Re: [FFmpeg-devel] [PATCH]... Michael Niedermayer
- Re: [FFmpeg-devel] [EX... Chitra Dey Sarkar via ffmpeg-devel
- [FFmpeg-devel] [PATCH] Boost FPS a... Chitra Dey Sarkar via ffmpeg-devel