My recent commit 44fa8488 made vacuumlazy.c always scan the last page of every heap relation, even when it's all-visible, regardless of any other factor. The general idea is to always examine the last page in the relation when that might prevent us from unsuccessfully attempting to truncate the relation -- since even the attempt is costly (in particular, we have to get an AccessExclusiveLock, which is painful with a Hot Standby replica).

This created a new problem: if we manually VACUUM a smaller, static table many times, vac_estimate_reltuples() will treat the same scanned_pages as a random sample each time. Of course this page is the furthest possible thing from a random sample, both because it's the same heap page each time, and because the last page is generally more likely than average to have fewer tuples.
Every time you run VACUUM (without changing the table), pg_class.reltuples shrinks by a small amount. Run VACUUM like that enough times, and pg_class.reltuples eventually becomes quite distorted. It's easy to spot the effect: just run "VACUUM VERBOSE" in a loop against a static table, and note how "tuples: .... ??? remain" tends to shrink over time, even though the contents of the table haven't changed.

The "always scan the last page" behavior isn't really new, actually. It's closely related to an earlier, slightly different behavior with the same goal -- though one that conditioned scanning the last page on certain other factors that happened to be protective (a protection that was probably never a deliberate part of the design). This closely related behavior (essentially the same behavior) was added by commit e8429082 from 2015. I judged that it wasn't worth worrying about scanning an extra page unnecessarily (better to keep things simple), since it's not at all uncommon for VACUUM to scan a few all-visible pages anyway (due to the SKIP_PAGES_THRESHOLD stuff). And I haven't changed my mind about that -- I still think that we should just scan the last page in all cases, to keep things simple. But I definitely don't think that this "pg_class.reltuples gets distorted over time" business is okay -- that has to be fixed.

There has been at least one very similar bug in the past: the commit message of commit b4b6923e (from 2011) describes a similar issue with over-sampling from the beginning of the table (not the end), due to a similar bias involving visibility map implementation details that led to an extremely nonrandom scanned_pages sample. To me that suggests that the true problem here isn't really in vacuumlazy.c -- going back to the old approach (i.e. scanning the last page only conditionally) seems like a case of the tail wagging the dog. IMV the real problem is that vac_estimate_reltuples() is totally unconcerned about these kinds of systematic sampling problems.
Attached draft patch fixes the issue by making vac_estimate_reltuples() return the old/original reltuples from pg_class (ignoring the scanned_tuples value from its vacuumlazy.c caller) when both of the following hold:

1. The total_pages for the table hasn't changed (by even one page) since the last VACUUM (or last ANALYZE), and

2. The total number of scanned_pages is less than 2% of total_pages.

This fixes the observed problem directly. It also makes the code robust against other, similar problems that might arise in the future. Given these conditions, the risk that comes from trusting that scanned_pages is a truly random sample is very large, while the risk that comes from disbelieving it is vanishingly small.

--
Peter Geoghegan
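The guard described above amounts to a simple predicate. The sketch below uses illustrative names of my own choosing (it is not the exact patch), just to make the two conditions concrete:

```c
#include <stdbool.h>

/*
 * Sketch of the proposed guard (illustrative names, not the actual
 * patch): when the relation size is unchanged since the last
 * VACUUM/ANALYZE and under 2% of its pages were scanned, distrust the
 * "sample" and keep the existing pg_class.reltuples value.
 */
static bool
reltuples_sample_is_trustworthy(double old_rel_pages, double total_pages,
                                double scanned_pages)
{
    if (total_pages == old_rel_pages &&
        scanned_pages < 0.02 * total_pages)
        return false;           /* keep old/original reltuples */

    return true;                /* proceed with the usual extrapolation */
}
```

A VACUUM that scans only the last page of an unchanged 1000-page table fails the test, so the old estimate survives; a VACUUM that scans 2% or more of the pages, or that runs after the table grew or shrank, updates reltuples as before.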
v1-0001-Avoid-vac_estimate_reltuples-distortion.patch