The accounting that ANALYZE uses to count dead tuples in acquire_sample_rows() (actually in heapam_scan_analyze_next_tuple() these days) makes some dubious assumptions about what ought to count as a dead tuple. This is something that I discussed with Masahiko in the context of our Postgres 14 work on VACUUM, which ultimately led to better documentation of the issues (see commit 7136bf34). This is not a new issue, but I want to talk about it again now.
The ANALYZE dead tuple accounting takes a 100% quantitative approach -- it is entirely unconcerned with qualitative distinctions, such as the number of dead tuples per logical row. Sometimes that doesn't matter, but there are many important cases where it clearly does. I'll show one such case now. This is a case where the system frequently launches autovacuum workers that never manage to do truly useful work:

$ pgbench -i -s 50 -F 80
...
$ pgbench -s 50 -j 4 -c 32 -M prepared -T 300 --rate=15000
...

I've made the heap fill factor 80 (with -F). I've also set both autovacuum_vacuum_scale_factor and autovacuum_analyze_scale_factor to 0.02 here, which is aggressive but still basically reasonable. I've enabled autovacuum logging so we can see exactly what's going on with autovacuum when pgbench runs -- that's the interesting part.

The log output shows that an autovacuum worker was launched and ran VACUUM against pgbench_accounts on 11 separate occasions during the 5 minute pgbench benchmark. All 11 autovacuum log reports are virtually identical. Here is the 11th and final report concerning the accounts table (I could have used any of the other 10 just as easily):

p 593300/2021-05-28 16:16:47 PDT LOG:  automatic vacuum of table "regression.public.pgbench_accounts": index scans: 0
        pages: 0 removed, 102041 remain, 0 skipped due to pins, 0 skipped frozen
        tuples: 100300 removed, 5000000 remain, 0 are dead but not yet removable, oldest xmin: 7269905
        buffer usage: 204166 hits, 0 misses, 3586 dirtied
        index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
        avg read rate: 0.000 MB/s, avg write rate: 11.250 MB/s
        I/O Timings:
        system usage: CPU: user: 2.31 s, system: 0.02 s, elapsed: 2.49 s
        WAL usage: 200471 records, 31163 full page images, 44115415 bytes

Notice that we have 0 LP_DEAD items left behind by pruning -- either opportunistic pruning or pruning by VACUUM. Pruning by VACUUM inside lazy_scan_prune() does "remove 100300 dead tuples", so arguably VACUUM does some useful work. Though I would argue that it doesn't -- I think that this is a total waste of cycles. This particular quantitative measure has little to do with anything that matters to the workload. This workload shouldn't ever need to VACUUM the accounts table (except when the time comes to freeze its tuples) -- the backends can clean up after themselves opportunistically, without ever faltering (i.e. without ever failing to keep a HOT chain on the same page).

The picture we see here seems contradictory, even if you think about the problem in exactly the same way as vacuumlazy.c thinks about the problem. On the one hand autovacuum workers are launched because opportunistic cleanup techniques (mainly opportunistic heap page pruning) don't seem to be able to keep up with the workload. On the other hand, when VACUUM actually runs we consistently see 0 LP_DEAD stub items in heap pages, which is generally an excellent indicator that opportunistic HOT pruning is in fact working perfectly. Only one of those statements can be correct.

The absurdity of autovacuum's behavior with this workload becomes undeniable once you tweak just one detail and see what changes. For example, I find that if I repeat the same process but increase autovacuum_vacuum_scale_factor from 0.02 to 0.05, everything changes. Instead of getting 11 autovacuum runs against pgbench_accounts I get 0 autovacuum runs! This effect is stable, and won't change if the workload runs for more than 5 minutes.
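To make the arithmetic behind that crossover concrete (this is just a back-of-the-envelope illustration, not something taken from the test itself): autovacuum launches a worker against a table once the dead tuple count tracked by pgstat exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. With 5 million rows in pgbench_accounts and the default autovacuum_vacuum_threshold of 50, the trigger point is roughly 100,050 dead tuples at a scale factor of 0.02, versus roughly 250,050 at 0.05 -- which lines up with the ~100k "removed" figure in the log output above. A query along these lines shows the two numbers that get compared (it only looks at the global GUCs, and ignores per-table reloptions):

  -- Illustrative only: the values autovacuum compares when deciding
  -- whether a table needs to be vacuumed (per-table reloptions ignored)
  SELECT s.relname,
         s.n_dead_tup,
         current_setting('autovacuum_vacuum_threshold')::bigint
           + current_setting('autovacuum_vacuum_scale_factor')::float8
             * c.reltuples AS dead_tuple_trigger_point
  FROM pg_stat_user_tables s
  JOIN pg_class c ON c.oid = s.relid
  WHERE s.relname = 'pgbench_accounts';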
Apparently vacuuming less aggressively results in less need for vacuuming! I believe that there is a sharp discontinuity here -- a crossover point for autovacuum_vacuum_scale_factor at which the behavior of the system *flips*, from very *frequent* autovacuum runs against the accounts table to *zero* runs. This seems like a real problem to me. I bet it has real consequences that are hard to model. In any case this simple model seems convincing enough.

The dead tuple accounting makes it much harder to set autovacuum_vacuum_scale_factor very aggressively (say to 0.02 or so) -- nobody is going to want to do that as long as it makes the system launch useless autovacuum workers that never end up doing useful work in a subset of tables. Users are currently missing out on the benefit of very aggressive autovacuums against tables where it truly makes sense.

The code in acquire_sample_rows()/heapam_scan_analyze_next_tuple() counts tuples/line pointers on a physical heap page. Perhaps it should "operate against an imaginary version of the page" instead -- the page as it would be just *after* lazy_scan_prune() is called for the page during a future VACUUM. More concretely, if there is a HOT chain then acquire_sample_rows() could perhaps count either 0 or 1 of the chain's tuples as dead. The code might be taught to recognize that a total absence of LP_DEAD stub items on the heap page strongly indicates that the workload can manage HOT chains via opportunistic pruning. I'm just speculating about what alternative design might fix the issue at this point.

In any case I contend that the current behavior gets too much wrong, and should be fixed in Postgres 15.

--
Peter Geoghegan