On Sun, Mar 17, 2024 at 3:21 PM Tomas Vondra
<tomas.von...@enterprisedb.com> wrote:
>
> On 3/14/24 22:39, Melanie Plageman wrote:
> > On Thu, Mar 14, 2024 at 5:26 PM Tomas Vondra
> > <tomas.von...@enterprisedb.com> wrote:
> >>
> >> On 3/14/24 19:16, Melanie Plageman wrote:
> >>> On Thu, Mar 14, 2024 at 03:32:04PM +0200, Heikki Linnakangas wrote:
> >>>> ...
> >>>>
> >>>> Ok, committed that for now. Thanks for looking!
> >>>
> >>> Attached v6 is rebased over your new commit. It also has the "fix" in
> >>> 0010 which moves BitmapAdjustPrefetchIterator() back above
> >>> table_scan_bitmap_next_block(). I've also updated the Streaming Read
> >>> API commit (0013) to Thomas' v7 version from [1]. This has the update
> >>> that we theorize should address some of the regressions in the
> >>> bitmapheapscan streaming read user in 0014.
> >>>
> >>
> >> Should I rerun the benchmarks with these new patches, to see if it
> >> really helps with the regressions?
> >
> > That would be awesome!
> >
>
> OK, here's a couple charts comparing the effect of v6 patches to master.
> These are from 1M and 10M data sets, same as the runs presented earlier
> in this thread (the 10M is still running, but should be good enough for
> this kind of visual comparison).
Thanks for doing this!

> What is even more obvious is that 0014 behaves *VERY* differently. I'm
> not sure if this is a good thing or a problem is debatable/unclear. I'm
> sure we don't want to cause regressions, but perhaps those are due to
> the prefetch issue discussed elsewhere in this thread (identified by
> Andres and Melanie). There are also many cases that got much faster, but
> the question is whether this is due to better efficiency or maybe the
> new code being more aggressive in some way (not sure).

Are these with the default effective_io_concurrency (1)? If so, the
"effective" prefetch distance in many cases will be higher with the
streaming read code applied. With effective_io_concurrency 1,
"max_ios" will always be 1, but the number of blocks prefetched may
exceed this (up to MAX_BUFFERS_PER_TRANSFER) because the streaming
read code is always trying to build bigger IOs. And, if prefetching,
it will prefetch IOs not yet in shared buffers before reading them.

It's hard to tell without going into a specific repro why this would
cause some queries to be much slower. In the forced bitmapheapscan, it
would make sense that more prefetching is worse -- which is why a
bitmapheapscan plan wouldn't have been chosen. But in the optimal
cases, it is unclear why it would be worse.

I don't think there is any way it could be the issue Andres identified,
because there is only one iterator -- nothing to get out of sync. It
could be that the fadvises are being issued too close to the reads and
aren't effective enough at covering up read latency on slower, older
hardware. But that doesn't explain why master would sometimes be
faster.

Probably the only thing we can do is get into a repro. It would, of
course, be easiest to do this with a serial query. I can dig into the
scripts you shared earlier and try to find a good repro.
Because the regressions may have shifted with Thomas' new version, it
would help if you shared a category (cyclic/uniform/etc, parallel or
serial, eic value, work mem, etc) where you now see the most
regressions.

- Melanie