On Tue, Mar 12, 2024 at 10:03 AM Melanie Plageman <melanieplage...@gmail.com> wrote:
> I've rebased the attached v10 over top of the changes to
> lazy_scan_heap() Heikki just committed and over the v6 streaming read
> patch set. I started testing them and see that you are right, we no
> longer pin too many buffers. However, the uncached example below is
> now slower with streaming read than on master -- it looks to be
> because it is doing twice as many WAL writes and syncs. I'm still
> investigating why that is.
That makes sense to me. We have 256kB of buffers in our ring, but now we're trying to read ahead 128kB at a time, so it works out that we can only flush the WAL accumulated while dirtying half the blocks at a time, so we flush twice as often.

If I change the ring size to 384kB, allowing for that read-ahead window, I see approximately the same number of WAL flushes as on master. Surely we'd never be able to get the behaviour to match *and* keep the same ring size? We simply need those 16 extra buffers to have a chance of accumulating 32 dirty buffers, and the associated WAL. Do you see the same result, or do you think something more than that is wrong here?

Here are some system call traces using your test that helped me see the behaviour:

1. Unpatched, ie no streaming read, we flush 90kB of WAL generated by 32 pages before we write them out one at a time, just before we read in their replacements. One flush covers the LSNs of all the pages that will be written, even though it's only called for the first page to be written. That's because XLogFlush(lsn), if it decides to do anything, flushes as far as it can...
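As a back-of-envelope model of that arithmetic (illustrative only; the function name and accounting below are mine, not anything the buffer manager actually computes):

```python
BLOCK_SIZE = 8192  # PostgreSQL's default BLCKSZ

def dirty_blocks_per_wal_flush(ring_bytes, readahead_bytes):
    # Buffers occupied by the read-ahead window haven't been dirtied yet,
    # so only the remainder of the ring can accumulate dirty pages (and
    # the WAL that goes with them) before the oldest buffer must be
    # cleaned, forcing an XLogFlush().
    return (ring_bytes - readahead_bytes) // BLOCK_SIZE

print(dirty_blocks_per_wal_flush(256 * 1024, 0))           # master: 32
print(dirty_blocks_per_wal_flush(256 * 1024, 128 * 1024))  # streaming read: 16
print(dirty_blocks_per_wal_flush(384 * 1024, 128 * 1024))  # 384kB ring: 32
```

That matches the traces below: the unpatched flush covers ~32 pages' WAL (90112 bytes), the streaming-read one only ~half (40960 bytes).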
IOW when we hit the *oldest* dirty block, that's when we write out the WAL up to where we dirtied the *newest* block, which covers the 32 pwrite() calls here:

pwrite(30,...,90112,0xf90000) = 90112 (0x16000)
fdatasync(30) = 0 (0x0)
pwrite(27,...,8192,0x0) = 8192 (0x2000)
pread(27,...,8192,0x40000) = 8192 (0x2000)
pwrite(27,...,8192,0x2000) = 8192 (0x2000)
pread(27,...,8192,0x42000) = 8192 (0x2000)
pwrite(27,...,8192,0x4000) = 8192 (0x2000)
pread(27,...,8192,0x44000) = 8192 (0x2000)
pwrite(27,...,8192,0x6000) = 8192 (0x2000)
pread(27,...,8192,0x46000) = 8192 (0x2000)
pwrite(27,...,8192,0x8000) = 8192 (0x2000)
pread(27,...,8192,0x48000) = 8192 (0x2000)
pwrite(27,...,8192,0xa000) = 8192 (0x2000)
pread(27,...,8192,0x4a000) = 8192 (0x2000)
pwrite(27,...,8192,0xc000) = 8192 (0x2000)
pread(27,...,8192,0x4c000) = 8192 (0x2000)
pwrite(27,...,8192,0xe000) = 8192 (0x2000)
pread(27,...,8192,0x4e000) = 8192 (0x2000)
pwrite(27,...,8192,0x10000) = 8192 (0x2000)
pread(27,...,8192,0x50000) = 8192 (0x2000)
pwrite(27,...,8192,0x12000) = 8192 (0x2000)
pread(27,...,8192,0x52000) = 8192 (0x2000)
pwrite(27,...,8192,0x14000) = 8192 (0x2000)
pread(27,...,8192,0x54000) = 8192 (0x2000)
pwrite(27,...,8192,0x16000) = 8192 (0x2000)
pread(27,...,8192,0x56000) = 8192 (0x2000)
pwrite(27,...,8192,0x18000) = 8192 (0x2000)
pread(27,...,8192,0x58000) = 8192 (0x2000)
pwrite(27,...,8192,0x1a000) = 8192 (0x2000)
pread(27,...,8192,0x5a000) = 8192 (0x2000)
pwrite(27,...,8192,0x1c000) = 8192 (0x2000)
pread(27,...,8192,0x5c000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e000) = 8192 (0x2000)
pread(27,...,8192,0x5e000) = 8192 (0x2000)
pwrite(27,...,8192,0x20000) = 8192 (0x2000)
pread(27,...,8192,0x60000) = 8192 (0x2000)
pwrite(27,...,8192,0x22000) = 8192 (0x2000)
pread(27,...,8192,0x62000) = 8192 (0x2000)
pwrite(27,...,8192,0x24000) = 8192 (0x2000)
pread(27,...,8192,0x64000) = 8192 (0x2000)
pwrite(27,...,8192,0x26000) = 8192 (0x2000)
pread(27,...,8192,0x66000) = 8192 (0x2000)
pwrite(27,...,8192,0x28000) = 8192 (0x2000)
pread(27,...,8192,0x68000) = 8192 (0x2000)
pwrite(27,...,8192,0x2a000) = 8192 (0x2000)
pread(27,...,8192,0x6a000) = 8192 (0x2000)
pwrite(27,...,8192,0x2c000) = 8192 (0x2000)
pread(27,...,8192,0x6c000) = 8192 (0x2000)
pwrite(27,...,8192,0x2e000) = 8192 (0x2000)
pread(27,...,8192,0x6e000) = 8192 (0x2000)
pwrite(27,...,8192,0x30000) = 8192 (0x2000)
pread(27,...,8192,0x70000) = 8192 (0x2000)
pwrite(27,...,8192,0x32000) = 8192 (0x2000)
pread(27,...,8192,0x72000) = 8192 (0x2000)
pwrite(27,...,8192,0x34000) = 8192 (0x2000)
pread(27,...,8192,0x74000) = 8192 (0x2000)
pwrite(27,...,8192,0x36000) = 8192 (0x2000)
pread(27,...,8192,0x76000) = 8192 (0x2000)
pwrite(27,...,8192,0x38000) = 8192 (0x2000)
pread(27,...,8192,0x78000) = 8192 (0x2000)
pwrite(27,...,8192,0x3a000) = 8192 (0x2000)
pread(27,...,8192,0x7a000) = 8192 (0x2000)
pwrite(27,...,8192,0x3c000) = 8192 (0x2000)
pread(27,...,8192,0x7c000) = 8192 (0x2000)
pwrite(27,...,8192,0x3e000) = 8192 (0x2000)
pread(27,...,8192,0x7e000) = 8192 (0x2000)

(Digression: this alternating tail-write-head-read pattern defeats the read-ahead and write-behind heuristics on a bunch of OSes, but not Linux, because it only seems to worry about the reads, while other Unixes have write-behind detection too, and I believe at least some are confused by this pattern of tiny writes following along some distance behind tiny reads; Andrew Gierth figured that out after noticing poor ring buffer performance, and we eventually got that fixed for one such system[1], separating the sequence detection for reads and writes.)

2. With your patches, we replace all those little pread() calls with nice wide calls, yay!, but now we only manage to write out about half the amount of WAL at a time, as you discovered.
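To make that digression concrete: in the unpatched trace, the data-file write offset trails the read offset by exactly the ring size (0x40000 = 256kB), because we evict and write the oldest buffer just before reading its replacement. A throwaway sketch that reproduces the offset pattern, purely illustrative:

```python
RING_SIZE = 256 * 1024   # 256kB ring = 32 buffers
BLOCK_SIZE = 8192        # BLCKSZ

def tail_write_head_read(first_read, nblocks):
    # Each step writes the oldest (just-evicted) buffer, one ring-length
    # behind, then reads the next block at the head of the scan.
    calls = []
    for i in range(nblocks):
        read_off = first_read + i * BLOCK_SIZE
        calls.append(("pwrite", read_off - RING_SIZE))
        calls.append(("pread", read_off))
    return calls

for syscall, offset in tail_write_head_read(0x40000, 2):
    print(syscall, hex(offset))
# pwrite 0x0, pread 0x40000, pwrite 0x2000, pread 0x42000 -- the
# same interleaving of offsets seen in the trace above.
```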
The repeating blocks of system calls now look like this, but there are twice as many of them:

pwrite(32,...,40960,0x224000) = 40960 (0xa000)
fdatasync(32) = 0 (0x0)
pwrite(27,...,8192,0x5c000) = 8192 (0x2000)
preadv(27,[...],3,0x7e000) = 131072 (0x20000)
pwrite(27,...,8192,0x5e000) = 8192 (0x2000)
pwrite(27,...,8192,0x60000) = 8192 (0x2000)
pwrite(27,...,8192,0x62000) = 8192 (0x2000)
pwrite(27,...,8192,0x64000) = 8192 (0x2000)
pwrite(27,...,8192,0x66000) = 8192 (0x2000)
pwrite(27,...,8192,0x68000) = 8192 (0x2000)
pwrite(27,...,8192,0x6a000) = 8192 (0x2000)
pwrite(27,...,8192,0x6c000) = 8192 (0x2000)
pwrite(27,...,8192,0x6e000) = 8192 (0x2000)
pwrite(27,...,8192,0x70000) = 8192 (0x2000)
pwrite(27,...,8192,0x72000) = 8192 (0x2000)
pwrite(27,...,8192,0x74000) = 8192 (0x2000)
pwrite(27,...,8192,0x76000) = 8192 (0x2000)
pwrite(27,...,8192,0x78000) = 8192 (0x2000)
pwrite(27,...,8192,0x7a000) = 8192 (0x2000)

3. With your patches and test, but this time using VACUUM (BUFFER_USAGE_LIMIT = '384kB'), the repeating block grows bigger and we get the larger WAL flushes back again, because now we're able to collect 32 blocks' worth of WAL up front again:

pwrite(32,...,90112,0x50c000) = 90112 (0x16000)
fdatasync(32) = 0 (0x0)
pwrite(27,...,8192,0x1dc000) = 8192 (0x2000)
pread(27,...,131072,0x21e000) = 131072 (0x20000)
pwrite(27,...,8192,0x1de000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e0000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e2000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e4000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e6000) = 8192 (0x2000)
pwrite(27,...,8192,0x1e8000) = 8192 (0x2000)
pwrite(27,...,8192,0x1ea000) = 8192 (0x2000)
pwrite(27,...,8192,0x1ec000) = 8192 (0x2000)
pwrite(27,...,8192,0x1ee000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f0000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f2000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f4000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f6000) = 8192 (0x2000)
pwrite(27,...,8192,0x1f8000) = 8192 (0x2000)
pwrite(27,...,8192,0x1fa000) = 8192 (0x2000)
pwrite(27,...,8192,0x1fc000) = 8192 (0x2000)
preadv(27,[...],3,0x23e000) = 131072 (0x20000)
pwrite(27,...,8192,0x1fe000) = 8192 (0x2000)
pwrite(27,...,8192,0x200000) = 8192 (0x2000)
pwrite(27,...,8192,0x202000) = 8192 (0x2000)
pwrite(27,...,8192,0x204000) = 8192 (0x2000)
pwrite(27,...,8192,0x206000) = 8192 (0x2000)
pwrite(27,...,8192,0x208000) = 8192 (0x2000)
pwrite(27,...,8192,0x20a000) = 8192 (0x2000)
pwrite(27,...,8192,0x20c000) = 8192 (0x2000)
pwrite(27,...,8192,0x20e000) = 8192 (0x2000)
pwrite(27,...,8192,0x210000) = 8192 (0x2000)
pwrite(27,...,8192,0x212000) = 8192 (0x2000)
pwrite(27,...,8192,0x214000) = 8192 (0x2000)
pwrite(27,...,8192,0x216000) = 8192 (0x2000)
pwrite(27,...,8192,0x218000) = 8192 (0x2000)
pwrite(27,...,8192,0x21a000) = 8192 (0x2000)

4. For learning/exploration only, I rebased my experimental vectored FlushBuffers() patch, which teaches the checkpointer to write relation data out using smgrwritev(). The checkpointer explicitly sorts blocks, but I think ring buffers should naturally often contain consecutive blocks in ring order. Highly experimental POC code pushed to a public branch[2], but I am not proposing anything here, just trying to understand things. The nicest looking system call trace was with BUFFER_USAGE_LIMIT set to 512kB, so it could do its data writes, data reads and WAL writes 128kB at a time:

pwrite(32,...,131072,0xfc6000) = 131072 (0x20000)
fdatasync(32) = 0 (0x0)
pwrite(27,...,131072,0x6c0000) = 131072 (0x20000)
pread(27,...,131072,0x73e000) = 131072 (0x20000)
pwrite(27,...,131072,0x6e0000) = 131072 (0x20000)
pread(27,...,131072,0x75e000) = 131072 (0x20000)
pwritev(27,[...],3,0x77e000) = 131072 (0x20000)
preadv(27,[...],3,0x77e000) = 131072 (0x20000)

That was a fun experiment, but... I recognise that efficient cleaning of ring buffers is a Hard Problem requiring more concurrency: it's just too late to be flushing that WAL. But we also don't want to start writing back data immediately after dirtying pages (cf. OS write-behind for big sequential writes in traditional Unixes), because we're not allowed to write data out without writing the WAL first, and we currently need to build up bigger WAL writes to do so efficiently (cf. some other systems that can write out fragments of WAL concurrently, so the latency-vs-throughput trade-off doesn't have to be so extreme). So we want to defer writing it, but not for too long. We need something cleaning our buffers (or at least flushing the associated WAL, but preferably also writing the data) not too late and not too early, and more in sync with our scan than the WAL writer is. What that machinery should look like I don't know (but I believe Andres has ideas).

[1] https://github.com/freebsd/freebsd-src/commit/f2706588730a5d3b9a687ba8d4269e386650cc4f
[2] https://github.com/macdice/postgres/tree/vectored-ring-buffer
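PS: The coalescing idea behind the vectored-write experiment in point 4 can be sketched in a few lines. This is purely an illustration in Python, not the POC's actual C code; the 16-block cap stands in for the 128kB maximum transfer size:

```python
MAX_BLOCKS_PER_WRITE = 16  # 128kB of 8kB blocks per pwritev()

def coalesce(dirty_blocks):
    """Group block numbers into runs of consecutive blocks, each run a
    candidate for a single vectored write (cf. smgrwritev())."""
    runs = []
    for b in sorted(dirty_blocks):
        if runs and b == runs[-1][-1] + 1 and len(runs[-1]) < MAX_BLOCKS_PER_WRITE:
            runs[-1].append(b)
        else:
            runs.append([b])
    return runs

# A ring that happens to hold consecutive blocks flushes in few calls:
print(coalesce([7, 3, 4, 5, 6, 20]))  # [[3, 4, 5, 6, 7], [20]]
```

A ring filled by a sequential scan should mostly produce maximal 16-block runs, which is why the trace in point 4 shows 128kB data writes.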