On Wed, Apr 21, 2021 at 8:21 AM Robert Haas <robertmh...@gmail.com> wrote:
> Now, the reason for this is that when we discover dead TIDs, we only
> record them in memory, not on disk. So, as soon as VACUUM ends, we
> lose all knowledge of those TIDs and must rediscover them. Suppose we
> didn't do this, and instead had a "dead TID" fork for each table.
> Suppose further that this worked like a conveyor belt, similar to
> WAL, where every dead TID we store into the fork is assigned an
> identifying 64-bit number that is never reused.
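To make sure we're picturing the same thing, here is a rough sketch of the
conveyor belt idea in miniature. All of the names (DeadTidFork, dtf_append,
dtf_trim) are made up, and the in-memory array stands in for what would
really be an on-disk relation fork with WAL logging -- the only point is the
numbering scheme: every appended dead TID gets a 64-bit identifier that is
never reused, so heap vacuuming and the vacuuming of each index can track
independently how far along the belt they have gotten.

/*
 * Illustration only -- not from any patch.  An append-only "conveyor belt"
 * of dead TIDs, where identifiers increase monotonically and are never
 * handed out twice.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct DeadTid
{
	uint32_t	blkno;		/* heap block number */
	uint16_t	offnum;		/* line pointer offset within the block */
} DeadTid;

typedef struct DeadTidFork
{
	uint64_t	next_id;	/* identifier to assign to the next entry */
	uint64_t	oldest_id;	/* oldest entry still on the belt */
	DeadTid	   *entries;	/* entries oldest_id .. next_id - 1 */
	size_t		capacity;
} DeadTidFork;

/* Append one dead TID, returning its never-reused identifier */
static uint64_t
dtf_append(DeadTidFork *fork, DeadTid tid)
{
	uint64_t	id = fork->next_id++;

	if (id - fork->oldest_id >= fork->capacity)
	{
		fork->capacity = fork->capacity ? fork->capacity * 2 : 1024;
		fork->entries = realloc(fork->entries,
								fork->capacity * sizeof(DeadTid));
	}
	fork->entries[id - fork->oldest_id] = tid;
	return id;
}

/*
 * Once every index has been vacuumed up to "upto", and the corresponding
 * heap line pointers have been set LP_UNUSED, everything before "upto"
 * can be discarded -- the identifiers themselves are never reused.
 */
static void
dtf_trim(DeadTidFork *fork, uint64_t upto)
{
	memmove(fork->entries, fork->entries + (upto - fork->oldest_id),
			(fork->next_id - upto) * sizeof(DeadTid));
	fork->oldest_id = upto;
}

With something like that in place, "how far behind is this index (or the
heap) on this table?" becomes a comparison of two 64-bit counters rather
than a guess based on whole-table statistics.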
Have you started any work on this project? I think that it's a very good idea. Enabling index-only scans is a good enough reason to pursue this project, even on its own.

The flexibility that this design offers allows VACUUM to run far more aggressively, with little possible downside. It makes it possible for VACUUM to run so frequently that it rarely dirties pages -- at least in many important cases. Imagine if VACUUM kept almost in lockstep with inserters into an append-mostly table -- that would be great. The main blocker to making VACUUM behave like that is of course indexes.

Setting visibility map bits during VACUUM can make future vacuuming cheaper (for the obvious reason), which *also* makes it cheaper to set *most* visibility map bits as the table is further extended, which in turn makes future vacuuming cheaper...and so on. This virtuous circle seems like it might be really important, especially once you factor in the cost of dirtying pages a second or third time. I think that we can really keep the number of times VACUUM dirties pages under control, simply by decoupling. Decoupling is key to keeping the costs to a minimum.

I attached a POC autovacuum logging instrumentation patch that shows how VACUUM uses *and* sets VM bits. I wrote this for my TPC-C + FSM work. Seeing both things together, and seeing how both things *change* over time, was a real eye-opener for me: it turns out that the master branch keeps setting and resetting visibility map bits for pages in the two big append-mostly tables that are causing so much trouble for Postgres today. What we see right now is pretty disorderly -- the numbers don't trend in the right direction when they should. But it could be a lot more orderly, with a little work.

This instrumentation helped me to discover a better approach to indexing within TPC-C, based on index-only scans [1]. It also made me realize that it's possible for a table to have real problems with dead tuple cleanup in indexes, while nevertheless being an effective target for index-only scans. There is actually no good reason to think that one condition should preclude the other -- they may very well go together. You did say this yourself when talking about global indexes, but there is no reason to think that it's limited to partitioning cases.

The current "ANALYZE dead_tuples statistics" paradigm cannot recognize when both conditions go together, even though I now think that it's fairly common. I also like your idea here because it enables a more qualitative approach, based on recent information about recently modified blocks -- not whole-table statistics. Averages are notoriously misleading.

[1] https://github.com/pgsql-io/benchmarksql/pull/16

--
Peter Geoghegan
0001-Instrument-pages-skipped-by-VACUUM.patch
Description: Binary data