On Mon, Mar 11, 2024 at 5:31 AM Melanie Plageman
<melanieplage...@gmail.com> wrote:
> On Wed, Mar 6, 2024 at 6:47 PM Melanie Plageman
> <melanieplage...@gmail.com> wrote:
> > Performance results:
> >
> > The TL;DR of my performance results is that streaming read vacuum is
> > faster. However there is an issue with the interaction of the streaming
> > read code and the vacuum buffer access strategy which must be addressed.
Woo.

> I have investigated the interaction between
> maintenance_io_concurrency, streaming reads, and the vacuum buffer
> access strategy (BAS_VACUUM).
>
> The streaming read API limits max_pinned_buffers to a pinned buffer
> multiplier (currently 4) * maintenance_io_concurrency buffers with the
> goal of constructing reads of at least MAX_BUFFERS_PER_TRANSFER size.
>
> Since the BAS_VACUUM ring buffer is size 256 kB or 32 buffers with
> default block size, that means that for a fully uncached vacuum in
> which all blocks must be vacuumed and will be dirtied, you'd have to
> set maintenance_io_concurrency at 8 or lower to see the same number of
> reuses (and shared buffer consumption) as master.
>
> Given that we allow users to specify BUFFER_USAGE_LIMIT to vacuum, it
> seems like we should force max_pinned_buffers to a value that
> guarantees the expected shared buffer usage by vacuum. But that means
> that maintenance_io_concurrency does not have a predictable impact on
> streaming read vacuum.
>
> What is the right thing to do here?
>
> At the least, the default size of the BAS_VACUUM ring buffer should be
> BLCKSZ * pinned_buffer_multiplier * default maintenance_io_concurrency
> (probably rounded up to the next power of two) bytes.

Hmm, does the v6 look-ahead distance control algorithm mitigate that
problem?  Using the ABC classifications from the streaming read
thread, I think for A it should now pin only 1, for B 16, and for C it
depends on the size of the random 'chunks': if you have a lot of
size-1 random reads then it shouldn't go above 10 because of (default)
maintenance_io_concurrency.  The only way to get up to very high
numbers would be to have a lot of random chunks triggering behaviour
C, but each made up of long runs of misses.  For example, one can
contrive a BHS query that happens to read pages 0-15, then 20-35, then
40-55, and so on, so that we want lots of wide I/Os running
concurrently.  Unless vacuum manages to do something like that, it
shouldn't be able to exceed 32 buffers very easily.

I suspect that if we taught streaming_read.c to ask the
BufferAccessStrategy (if one is passed in) what its recommended pin
limit is (strategy->nbuffers?), we could just clamp
max_pinned_buffers.  It would be hard to find a workload where that
makes a difference, and we could think about more complicated logic
later.  (A rough sketch of what I mean is at the end of this mail.)

In other words, I think/hope that the fix for your complaints about
excessive pinning from v5 WRT all-cached heap scans may have already
improved this case too, by happy coincidence.  I haven't tried it out
though, I just read your description of the problem...
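To make the clamping idea concrete, here is a minimal sketch of what the
streaming read setup code could do.  It is not taken from the patch: the
function name and variables are made up for illustration, and it assumes
we can get the ring size from the strategy, e.g. via
GetAccessStrategyBufferCount() (added for BUFFER_USAGE_LIMIT) or an
equivalent accessor for strategy->nbuffers.

    #include "postgres.h"

    #include "storage/bufmgr.h"

    /*
     * Illustrative sketch only: clamp the stream's planned pin count to the
     * ring size of the caller's BufferAccessStrategy, so that look-ahead can
     * never pin more buffers than the ring contains (e.g. 32 for the 256 kB
     * BAS_VACUUM ring).
     */
    static void
    clamp_pin_limit_to_strategy(BufferAccessStrategy strategy,
                                int *max_pinned_buffers)
    {
        if (strategy == NULL)
            return;             /* no ring: keep the computed limit */

        *max_pinned_buffers = Min(*max_pinned_buffers,
                                  GetAccessStrategyBufferCount(strategy));
    }

With defaults, that would take the computed limit of 4 *
maintenance_io_concurrency (10) = 40 down to the 32 buffers in the
BAS_VACUUM ring, so vacuum's shared buffer footprint should match master.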