On Thu, Jul 18, 2024 at 5:31 PM Anton Khirnov <an...@khirnov.net> wrote:
> Quoting Michael Niedermayer (2024-07-18 16:48:06)
> > On Thu, Jul 18, 2024 at 11:08:59AM +0200, Anton Khirnov wrote:
> > > Quoting Michael Niedermayer (2024-07-18 00:42:05)
> > > > all the stuff should be put together close so it's efficiently
> > > > using CPU caches
> > >
> > > Which is why it shares its cacheline with PutBitContext, because
> > > the code benefits from having both in the cache, right? And the
> > > 4-byte hole in PutBitContext is there presumably to aerate the
> > > cache for smoother data streaming.
> >
> > thanks for spotting these, can you fix these?
>
> I have no interest in optimizing the performance of this code. My
> primary goal here is to remove FFV1-specific hacks from the frame
> threading code for patch 33/39, which is in turn needed for 38/39.
>
> As a public service, I also spent some effort on making the ffv1 code
> easier to understand, but if you insist on keeping the code as it is
> I can also just drop its non-compliant frame threading implementation.
>
> > > More seriously, this is not how caches work. Being close together
> > > matters mainly so long as your data fits in a cacheline; beyond
> > > that, physical proximity matters little. On the stack, the
> > > bitreader is likely to share the cacheline with other data that
> > > is currently needed, thus improving cache utilization.
> >
> > Caches are complex, and being close does matter. Having things in
> > separate allocations risks hitting aliasing cases (that is, things
> > that cannot be in the cache at the same time), so when the
> > bitstream, the frame buffer, and the context are already in 3
> > independent locations, adding a few more increases the risk of
> > hitting these cases. Also, sequential memory access is faster than
> > non-sequential access, so it makes sense to put things together in
> > a few places rather than scatter them.
> >
> > It's years since I've done hardcore optimization work, but I don't
> > think the principles have changed so much that random access is
> > now faster than sequential access, or that caches work
> > fundamentally differently.
>
> I don't see how any of these arguments are relevant - I am not moving
> the bitreader to a new allocation, but to the stack, which is already
> highly likely to be in cache.
>
> > > Another factor that matters in efficient cache use is e.g. not
> > > having multiple copies of the same constant data scattered
> > > around, which you're objecting to in my other patches.
> >
> > Copying the actually used small data together per slice, where it
> > is accessed per pixel, should improve the speed per pixel while
> > making the per-slice code a little slower. We have maybe 4 slices
> > and millions of pixels; that is why this can give an overall gain.
>
> This all sounds like premature optimization, AKA the root of all
> evil. As I said above, I intended to make this code more readable,
> not faster. Yet somehow it became faster anyway, which suggests this
> code is not very optimized. So arguing whether this or that specific
> change adds or removes a few cycles per frame seems like a waste of
> time to me.

Will merge this into Librempeg if it does not get into FFmpeg.

> --
> Anton Khirnov
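
For reference, the "4-byte hole" being joked about above is ordinary
struct padding. A minimal sketch of how such a hole arises, using a
simplified bit-writer layout (an illustration, not the actual FFmpeg
PutBitContext definition): on a typical LP64 ABI, pointers require
8-byte alignment, so the compiler inserts 4 bytes of padding after a
4-byte int.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Simplified layout: a 64-bit accumulator, a 32-bit counter,
     * then pointers. The pointers need 8-byte alignment, so a
     * 4-byte hole follows bit_left. */
    typedef struct ExamplePutBits {
        uint64_t bit_buf;   /* offset 0,  size 8 */
        int      bit_left;  /* offset 8,  size 4 */
                            /* 4-byte hole here  */
        uint8_t *buf;       /* offset 16, size 8 */
        uint8_t *buf_ptr;   /* offset 24, size 8 */
        uint8_t *buf_end;   /* offset 32, size 8 */
    } ExamplePutBits;

    int main(void)
    {
        /* buf starting at offset 16 rather than 12 exposes the hole. */
        printf("bit_left ends at %zu, buf starts at %zu\n",
               offsetof(ExamplePutBits, bit_left) + sizeof(int),
               offsetof(ExamplePutBits, buf));
        printf("sizeof = %zu\n", sizeof(ExamplePutBits));
        return 0;
    }

Tools such as pahole report these holes; moving the int to the end of
the struct would remove this one, though whether that matters depends
on what else shares the cacheline.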
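
The "bitreader on the stack" point amounts to something like the
following sketch (a made-up minimal reader, not FFmpeg's GetBitContext
and not the actual patch): the reader state lives in the decoding
function's stack frame, next to the other locals the loop is touching
anyway, rather than inside a separate heap-allocated context.

    #include <stddef.h>
    #include <stdint.h>

    /* Made-up minimal bit reader; illustrative only. */
    typedef struct Reader {
        const uint8_t *ptr, *end;
        uint32_t cache;   /* up to 24 buffered bits */
        int      bits;    /* valid bits in cache    */
    } Reader;

    static unsigned read_bits(Reader *r, int n) /* n <= 24 */
    {
        while (r->bits < n && r->ptr < r->end) {
            r->cache = (r->cache << 8) | *r->ptr++;
            r->bits += 8;
        }
        if (r->bits < n)
            return 0; /* truncated input */
        r->bits -= n;
        return (r->cache >> r->bits) & ((1u << n) - 1);
    }

    static void decode_slice(const uint8_t *buf, size_t size)
    {
        /* Stack allocation: no new heap location is introduced, and
         * the frame is almost certainly in cache already. */
        Reader r = { buf, buf + size, 0, 0 };

        while (r.ptr < r.end) {
            unsigned sym = read_bits(&r, 4);
            (void)sym;  /* ... use sym ... */
        }
    }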
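
Likewise, "copying the actually used small data together per slice"
can be sketched as hoisting the hot context fields into locals before
the per-pixel loop: the copy is paid once per slice, and the loop then
avoids dereferencing the context millions of times (field and function
names here are hypothetical, not the actual ffv1 decoder).

    #include <stdint.h>

    /* Hypothetical slice context; field names are illustrative. */
    typedef struct SliceCtx {
        uint8_t *frame_data;  /* one heap allocation */
        int width, height, stride;
        int quant_shift;
        /* ... many fields the pixel loop never touches ... */
    } SliceCtx;

    static void filter_slice(SliceCtx *sc)
    {
        /* Paid once per slice: pull the handful of hot fields out of
         * the context, so the compiler can keep them in registers. */
        uint8_t  *dst = sc->frame_data;
        const int w   = sc->width;
        const int h   = sc->height;
        const int str = sc->stride;
        const int sh  = sc->quant_shift;

        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                dst[y * str + x] >>= sh;   /* per-pixel work */
    }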