Hi,

I was re-reviewing the proposed batch of GUCs for controlling the SLRU cache sizes[1], and I couldn't resist sketching out $SUBJECT as an obvious alternative. This patch is highly experimental and full of unresolved bits and pieces (see below for some), but it passes basic tests and is enough to start trying the idea out and figuring out where the real problems lie. The hypothesis here is that CLOG, multixact, etc. data should compete for space with relation data in one unified buffer pool, so you don't have to tune them, and they can benefit from the better common implementation (mapping, locking, replacement, bgwriter, checksums, etc., and eventually new things like AIO, TDE, ...).

I know that many people have talked about doing this, and maybe they already have patches along these lines too; I'd love to know what others imagined differently/better.

In the attached sketch, the SLRU caches are pseudo-relations in pseudo-database 9. Yeah. That's a straw-man idea stolen from the Zheap/undo project[2] (I also stole DiscardBuffer() from there); better ideas for identifying these buffers without making BufferTag bigger are very welcome. You can list SLRU buffers with:

  WITH slru(relfilenode, path) AS (
    VALUES (0, 'pg_xact'),
           (1, 'pg_multixact/offsets'),
           (2, 'pg_multixact/members'),
           (3, 'pg_subtrans'),
           (4, 'pg_serial'),
           (5, 'pg_commit_ts'),
           (6, 'pg_notify'))
  SELECT bufferid, path, relblocknumber, isdirty, usagecount, pinning_backends
    FROM pg_buffercache NATURAL JOIN slru
   WHERE reldatabase = 9
   ORDER BY path, relblocknumber;
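For concreteness, here is roughly what that naming scheme implies. This is an illustrative sketch only, not code from the patch: the tablespace choice is a guess, and slru_buffer_tag() is a hypothetical helper (BufferTag, RelFileNode and INIT_BUFFERTAG are the existing buf_internals.h ones):

#define SLRU_DB_ID 9				/* pseudo-database reserved for SLRUs */

/* One pseudo-relfilenode per SLRU, matching the VALUES list above. */
typedef enum SlruRelId
{
	SLRU_REL_XACT = 0,
	SLRU_REL_MULTIXACT_OFFSETS = 1,
	SLRU_REL_MULTIXACT_MEMBERS = 2,
	SLRU_REL_SUBTRANS = 3,
	SLRU_REL_SERIAL = 4,
	SLRU_REL_COMMIT_TS = 5,
	SLRU_REL_NOTIFY = 6
} SlruRelId;

static void
slru_buffer_tag(BufferTag *tag, SlruRelId slru, BlockNumber blockno)
{
	RelFileNode rnode;

	rnode.spcNode = GLOBALTABLESPACE_OID;	/* guess: SLRUs are cluster-wide */
	rnode.dbNode = SLRU_DB_ID;
	rnode.relNode = (Oid) slru;

	INIT_BUFFERTAG(*tag, rnode, MAIN_FORKNUM, blockno);
}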
Here are some per-cache starter hypotheses about locking that might be completely wrong and obviously need real analysis and testing:

pg_xact: I couldn't easily get rid of XactSLRULock, because it doesn't just protect buffers; it's also used to negotiate "group CLOG updates". (I think it'd be nice to replace that system with an atomic page update scheme so that concurrent committers stay on CPU, something like [3], but that's another topic.) I decided to try a model where readers only have to pin the page (the reads are sub-byte values that we can read atomically, and you'll see a value at least as fresh as the time you took the pin, right?), but writers have to take an exclusive content lock, because otherwise they'd clobber each other at byte level, and because they need to maintain the page LSN consistently. Writing back is done with a share lock as usual and log flushing can be done consistently. I also wanted to try avoiding the extra cost of locking and accessing the buffer mapping table in common cases, so I use ReadRecentBuffer() for repeat access to the same page (this applies to the other SLRUs too). Sketches of both paths follow after these notes.

pg_subtrans: I got rid of SubtransSLRULock because it only protected page contents. Can be read with only a pin. Exclusive page content lock to write.

pg_multixact: I got rid of the MultiXact{Offset,Members}SLRULock locks. Can be read with only a pin. Writers take an exclusive page content lock. The multixact.c module still has its own MultiXactGenLock.

pg_commit_ts: I got rid of CommitTsSLRULock since it only protected buffers, but here I had to take shared content locks to read pages, since the values can't be read atomically. Exclusive content lock to write.

pg_serial: I could not easily get rid of SerialSLRULock, because it protects the SLRU plus some variables in serialControl. Shared and exclusive page content locks.

pg_notify: I got rid of NotifySLRULock. Shared and exclusive page content locks are used for reading and writing. The module still has a separate lock, NotifyQueueLock, to coordinate queue positions.
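To make the pg_xact read model concrete, a pin-only status lookup might look something like this. Again, this is an illustrative sketch rather than the patch's actual code: ReadSlruBuffer() is a stand-in for however the patch pulls a pseudo-relation page into shared buffers, clog_rnode follows the naming sketch above, and the xid-to-position macros are clog.c's existing ones:

/* Identity of pg_xact under the straw-man naming scheme sketched earlier. */
static const RelFileNode clog_rnode =
	{GLOBALTABLESPACE_OID, SLRU_DB_ID, SLRU_REL_XACT};

/* Remember the last pg_xact buffer to skip the mapping table next time. */
static Buffer recent_clog_buffer = InvalidBuffer;

static XidStatus
TransactionIdGetStatusSketch(TransactionId xid)
{
	BlockNumber	blockno = TransactionIdToPage(xid);	/* existing clog.c macros */
	int			byteno = TransactionIdToByte(xid);
	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
	char	   *page;
	XidStatus	status;

	/* Fast path: try to repin the last buffer without a mapping table probe. */
	if (recent_clog_buffer == InvalidBuffer ||
		!ReadRecentBuffer(clog_rnode, MAIN_FORKNUM, blockno, recent_clog_buffer))
		recent_clog_buffer = ReadSlruBuffer(SLRU_REL_XACT, blockno);	/* stand-in */

	/*
	 * No content lock taken: the two-bit status fits in one byte and can be
	 * read atomically, and under the pin we see a value at least as fresh as
	 * the time the pin was taken.
	 */
	page = BufferGetPage(recent_clog_buffer);
	status = (page[byteno] >> bshift) & CLOG_XACT_BITMASK;

	ReleaseBuffer(recent_clog_buffer);
	return status;
}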
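And the corresponding write side, under an exclusive content lock so that concurrent writers don't clobber each other's bits within a byte and the LSN bookkeeping stays consistent; SetExternalPageLSN() is a hypothetical stand-in for the separate LSN array described below:

static void
TransactionIdSetStatusSketch(TransactionId xid, XidStatus status, XLogRecPtr lsn)
{
	BlockNumber	blockno = TransactionIdToPage(xid);
	int			byteno = TransactionIdToByte(xid);
	int			bshift = TransactionIdToBIndex(xid) * CLOG_BITS_PER_XACT;
	Buffer		buffer = ReadSlruBuffer(SLRU_REL_XACT, blockno);	/* stand-in */
	char	   *byteptr;

	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

	byteptr = BufferGetPage(buffer) + byteno;
	*byteptr = (*byteptr & ~(CLOG_XACT_BITMASK << bshift)) | (status << bshift);

	/*
	 * Raw SLRU pages have no header to hold a page LSN, so for async commits
	 * record it in the separate NBuffers-sized array (see below) to ensure
	 * WAL is flushed that far before the page can be written back.
	 */
	if (!XLogRecPtrIsInvalid(lsn))
		SetExternalPageLSN(buffer, lsn);	/* hypothetical helper */

	MarkBufferDirty(buffer);
	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
	ReleaseBuffer(buffer);
}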
Some problems tackled incompletely:

* I needed to disable checksums and in-page LSNs, since SLRU pages hold raw data with no header. We'd probably eventually want regular (standard? formatted?) pages (the real work here may be implementing FPI for SLRUs so that checksums don't break your database on torn writes). In the meantime, suppressing those things is done by the kludge of recognising database 9 as raw data, but there should be something better than this. A separate array of size NBuffers holds "external" page LSNs, to drive WAL flushing.

* The CLOG SLRU also tracks groups of async commit LSNs in a fixed-size array. The obvious translation would be very wasteful (an array big enough for NBuffers * groups per page), but I hope that there is a better way to do this... In the sketch patch I changed it to use the single per-page LSN for simplicity (a CLOG page holds 8192 * 4 = 32768 xacts, so the group size is effectively 32k instead of 32), which is certainly not good enough.

Some stupid problems not tackled yet:

* It holds onto the virtual file descriptor for the last segment accessed, but there is no invalidation for when segment files are recycled; that could be fixed with a cycle counter or something like that.

* It needs to pin buffers during the critical section in commit processing, but that crashes into the ban on allocating memory while doing resowner.c book-keeping. It's also hard to know how many buffers you'll need to pin in advance. For now, I just commented out the assertions...

* While hacking on the pg_stat_slru view I realised that there is support for "other" SLRUs, presumably for extensions to define their own. Does anyone actually do that? I, erm, didn't support that in this sketch (not too hard though, I guess).

* For some reason this is failing on Windows CI, but I haven't looked into that yet.

Thoughts on the general concept, or the technical details? Existing patches for this that are further ahead/better?

[1] https://commitfest.postgresql.org/36/2627/
[2] https://commitfest.postgresql.org/36/3228/
[3] http://www.vldb.org/pvldb/vol13/p3195-kodandaramaih.pdf
0001-Move-SLRU-data-into-the-regular-buffer-pool.patch.gz