Hello hackers,

Based on ideas from earlier discussions[1][2], here is an experimental
WIP patch to improve recovery speed by prefetching blocks. If you set
wal_prefetch_distance to a positive distance, measured in bytes, then
the recovery loop will look ahead in the WAL and call PrefetchBuffer()
for referenced blocks. This can speed things up with cold caches
(example: after a server reboot) and working sets that don't fit in
memory (example: large-scale pgbench).
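To make that a bit more concrete, here is a rough conceptual sketch of
the per-record part of the look-ahead (written for this mail, not code
taken from the attached patch; prefetch_block() is just a hypothetical
placeholder for the PrefetchBuffer()-based call, and the real code
differs in the details):

    #include "postgres.h"
    #include "access/xlogreader.h"
    #include "storage/relfilenode.h"

    /* Hypothetical helper standing in for the PrefetchBuffer() call. */
    static void prefetch_block(RelFileNode rnode, ForkNumber forknum,
                               BlockNumber blkno);

    /*
     * Conceptual sketch only: issue prefetch advice for every block
     * referenced by one decoded WAL record.  The real patch also has to
     * honour wal_prefetch_distance and the FPW-related GUC described
     * further down.
     */
    static void
    prefetch_record_blocks(XLogReaderState *record)
    {
        for (int block_id = 0; block_id <= record->max_block_id; block_id++)
        {
            RelFileNode rnode;
            ForkNumber  forknum;
            BlockNumber blkno;

            /* Not every block_id slot is used by every record. */
            if (!XLogRecGetBlockTag(record, block_id, &rnode, &forknum, &blkno))
                continue;

            /*
             * A block with a full-page image will be overwritten without
             * reading the old contents, so prefetching it is optional
             * (cf. wal_prefetch_fpw below).
             */
            if (XLogRecHasBlockImage(record, block_id))
                continue;

            prefetch_block(rnode, forknum, blkno);
        }
    }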
Results vary, but in contrived larger-than-memory pgbench crash
recovery experiments on a Linux development system, I've seen recovery
run as much as 20x faster with full_page_writes=off and
wal_prefetch_distance=8kB. FPWs reduce the potential speed-up, as
discussed in the other thread.

Some notes:

* PrefetchBuffer() is only beneficial if your kernel and filesystem
  have a working POSIX_FADV_WILLNEED implementation. That includes
  Linux ext4 and xfs, but excludes macOS and Windows. (A tiny
  standalone program showing that advice follows these notes.) In
  future we might use asynchronous I/O to bring data all the way into
  our own buffer pool; hopefully the PrefetchBuffer() interface
  wouldn't change much and this code would automatically benefit.

* For now, for proof-of-concept purposes, the patch uses a second
  XLogReader to read ahead in the WAL. I am thinking about how to
  write a two-cursor XLogReader that reads and decodes each record
  just once.

* It can handle simple crash recovery and streaming replication
  scenarios, but it doesn't yet deal with complications like timeline
  changes (the way to do that might depend on how the previous point
  works out). The integration with the WAL receiver probably needs
  some work, I've been testing pretty narrow cases so far, and the way
  I hijacked read_local_xlog_page() probably isn't right.

* On filesystems with block size <= BLCKSZ, it's a waste of a syscall
  to try to prefetch a block that we have an FPW for, but otherwise it
  can avoid a later stall due to a read-before-write at pwrite() time,
  so I added a second GUC, wal_prefetch_fpw, to make that optional.
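As promised above, here is a minimal standalone Linux program (not
part of the patch) that issues the same POSIX_FADV_WILLNEED hint that
PrefetchBuffer() relies on under the covers, for anyone who wants to
experiment with it outside PostgreSQL:

    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * Ask the kernel to start reading the first 8kB of a file into its
     * page cache, so that a later pread() of that range is less likely
     * to block.
     */
    int
    main(int argc, char **argv)
    {
        int     fd;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0)
        {
            fprintf(stderr, "usage: %s <file>\n", argv[0]);
            return 1;
        }

        /* posix_fadvise() returns an errno value directly, not -1. */
        if (posix_fadvise(fd, 0, 8192, POSIX_FADV_WILLNEED) != 0)
            fprintf(stderr, "POSIX_FADV_WILLNEED not supported here\n");

        close(fd);
        return 0;
    }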
Earlier work, and how this patch compares:

* Sean Chittenden wrote pg_prefaulter[1], an external process that
  uses worker threads to pread() referenced pages some time before
  recovery does, and demonstrated very good speed-up, triggering a lot
  of discussion of this topic. My WIP patch differs mainly in that
  it's integrated with PostgreSQL and uses POSIX_FADV_WILLNEED rather
  than synchronous I/O from worker threads/processes. Sean wouldn't
  have liked my patch much, because he was working on ZFS and that
  doesn't support POSIX_FADV_WILLNEED, but with a small patch to ZFS
  it works pretty well, and I'll try to get that upstreamed.

* Konstantin Knizhnik proposed a dedicated PostgreSQL process that
  would do approximately the same thing[2]. My WIP patch differs
  mainly in that it does the prefetching work in the recovery loop
  itself and uses PrefetchBuffer() rather than FilePrefetch()
  directly. This avoids a bunch of communication and complications,
  but admittedly does introduce new system calls into a hot loop (for
  now); perhaps I could pay for that by removing more lseek(SEEK_END)
  noise. It also deals with various edge cases relating to created,
  dropped and truncated relations a bit differently. It also tries to
  avoid generating sequential WILLNEED advice, based on experimental
  evidence[3] that that affects Linux's readahead heuristics
  negatively, though I don't understand the exact mechanism there.

Here are some cases where I expect this patch to perform badly:

* Your WAL has multiple intermixed sequential access streams (ie
  sequential access to N different relations), so that sequential
  access is not detected, and then all the WILLNEED advice prevents
  Linux's automagic readahead from working well. Perhaps that could be
  mitigated by having a system that can detect up to N concurrent
  streams, where N is more than the current 1, or by flagging buffers
  in the WAL as part of a sequential stream. I haven't looked into
  this.

* The data is always found in our buffer pool, so PrefetchBuffer() is
  doing nothing useful, and you might as well not be calling it or
  doing the extra work that leads up to it. Perhaps that could be
  mitigated with an adaptive approach: too many PrefetchBuffer() hits
  and we stop trying to prefetch, too many XLogReadBufferForRedo()
  misses and we start trying to prefetch. That might work nicely for
  systems that start out with cold caches but eventually warm up. I
  haven't looked into this.

* The data is actually always in the kernel's cache, so the advice is
  a waste of a syscall. That might imply that you should probably be
  running with a larger shared_buffers. It's technically possible on
  many systems to ask the operating system whether a region of a file
  is cached (for example with mincore() on Linux; see the standalone
  illustration after the reference list), which could in theory feed
  an adaptive heuristic that disables pointless prefetching, but I'm
  not proposing that. Ultimately this problem would be avoided by
  moving to true async I/O, where we'd be initiating the read all the
  way into our buffers (ie it replaces the later pread(), so it's a
  wash at worst).

* The prefetch distance is set too low, so pread() waits are not
  avoided, or your storage subsystem can't actually perform enough
  concurrent I/O to get ahead of the random access pattern you're
  generating, so no distance would be far enough ahead. To help with
  the former case, perhaps we could invent something smarter than a
  user-supplied distance (something like "N cold block references
  ahead", possibly using effective_io_concurrency, rather than "N
  bytes ahead").

[1] https://www.pgcon.org/2018/schedule/track/Case%20Studies/1204.en.html
[2] https://www.postgresql.org/message-id/flat/49df9cd2-7086-02d0-3f8d-535a32d44c82%40postgrespro.ru
[3] https://github.com/macdice/some-io-tests
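Finally, since the "is this region already cached?" question came up
above, here is a small standalone Linux illustration (again mine, not
part of the patch, and not something I'm proposing to use) of asking
the kernel about page cache residency with mincore():

    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /*
     * Map a file and ask the kernel which of its pages are currently
     * resident in the page cache.  Shown only to illustrate that the
     * information is obtainable, eg for some future adaptive scheme.
     */
    int
    main(int argc, char **argv)
    {
        int             fd;
        struct stat     st;
        void           *map;
        long            pagesize = sysconf(_SC_PAGESIZE);
        size_t          npages,
                        resident = 0;
        unsigned char  *vec;

        if (argc != 2 || (fd = open(argv[1], O_RDONLY)) < 0 ||
            fstat(fd, &st) != 0 || st.st_size == 0)
        {
            fprintf(stderr, "usage: %s <nonempty-file>\n", argv[0]);
            return 1;
        }

        map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
        {
            perror("mmap");
            return 1;
        }

        npages = (st.st_size + pagesize - 1) / pagesize;
        vec = malloc(npages);
        if (vec != NULL && mincore(map, st.st_size, vec) == 0)
        {
            for (size_t i = 0; i < npages; i++)
                resident += vec[i] & 1;     /* low bit = resident */
            printf("%zu of %zu pages resident\n", resident, npages);
        }

        free(vec);
        munmap(map, st.st_size);
        close(fd);
        return 0;
    }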
wal-prefetch-another-approach-v1.tgz
Description: application/compressed