The accounting that ANALYZE uses to count dead tuples in acquire_sample_rows() (actually in heapam_scan_analyze_next_tuple() these days) makes some dubious assumptions about what ought to count as a dead tuple. This is something that I discussed with Masahiko in the context of our Postgres 14 work on VACUUM, which ultimately led to better documentation of the issues (see commit 7136bf34). This is not a new issue, but I want to talk about it again now.
The ANALYZE dead tuple accounting takes a 100% quantitative approach -- it is entirely unconcerned with qualitative distinctions, such as the number of dead tuples per logical row. Sometimes that doesn't matter, but there are many important cases where it clearly does. I'll show one such case now. This is a case where the system frequently launches autovacuum workers that never manage to do truly useful work:

$ pgbench -i -s 50 -F 80
...
$ pgbench -s 50 -j 4 -c 32 -M prepared -T 300 --rate=15000
...

I've made the heap fill factor 80 (with -F). I've also set both autovacuum_vacuum_scale_factor and autovacuum_analyze_scale_factor to 0.02 here, which is aggressive but still basically reasonable. I've enabled autovacuum logging so we can see exactly what's going on with autovacuum when pgbench runs -- that's the interesting part.

The log output shows that an autovacuum worker was launched and ran VACUUM against pgbench_accounts on 11 separate occasions during the 5 minute pgbench benchmark. All 11 autovacuum log reports are virtually identical. Here is the 11th and final report concerning the accounts table (I could have used any of the other 10 just as easily):

p 593300/2021-05-28 16:16:47 PDT LOG:  automatic vacuum of table "regression.public.pgbench_accounts": index scans: 0
        pages: 0 removed, 102041 remain, 0 skipped due to pins, 0 skipped frozen
        tuples: 100300 removed, 5000000 remain, 0 are dead but not yet removable, oldest xmin: 7269905
        buffer usage: 204166 hits, 0 misses, 3586 dirtied
        index scan not needed: 0 pages from table (0.00% of total) had 0 dead item identifiers removed
        avg read rate: 0.000 MB/s, avg write rate: 11.250 MB/s
        I/O Timings:
        system usage: CPU: user: 2.31 s, system: 0.02 s, elapsed: 2.49 s
        WAL usage: 200471 records, 31163 full page images, 44115415 bytes

Notice that we have 0 LP_DEAD items left behind by pruning -- either opportunistic pruning or pruning by VACUUM. Pruning by VACUUM inside lazy_scan_prune() does "remove 100300 dead tuples", so arguably VACUUM does some useful work. Though I would argue that it doesn't -- I think that this is a total waste of cycles. This particular quantitative measure has little to do with anything that matters to the workload. This workload shouldn't ever need to VACUUM the accounts table (except when the time comes to freeze its tuples) -- the backends can clean up after themselves opportunistically, without ever faltering (i.e. without ever failing to keep a HOT chain on the same page).

The picture we see here seems contradictory, even if you think about the problem in exactly the same way as vacuumlazy.c thinks about the problem. On the one hand autovacuum workers are launched because opportunistic cleanup techniques (mainly opportunistic heap page pruning) don't seem to be able to keep up with the workload. On the other hand, when VACUUM actually runs we consistently see 0 LP_DEAD stub items in heap pages, which is generally an excellent indicator that opportunistic HOT pruning is in fact working perfectly. Only one of those statements can be correct.

The absurdity of autovacuum's behavior with this workload becomes undeniable once you tweak just one detail and see what changes. For example, I find that if I repeat the same process but increase autovacuum_vacuum_scale_factor from 0.02 to 0.05, everything changes. Instead of getting 11 autovacuum runs against pgbench_accounts I get 0 autovacuum runs! This effect is stable, and won't change if the workload runs for more than 5 minutes.
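To make the arithmetic behind that crossover concrete (this is just a back-of-the-envelope illustration, not something taken from the test itself): autovacuum launches a worker against a table once the dead tuple count tracked by pgstat exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples. With 5 million rows in pgbench_accounts and the default autovacuum_vacuum_threshold of 50, the trigger point is roughly 100,050 dead tuples at a scale factor of 0.02, versus roughly 250,050 at 0.05 -- which lines up with the ~100k "removed" figure in the log output above. A query along these lines shows the two numbers that get compared (it only looks at the global GUCs, and ignores per-table reloptions):

  -- Illustrative only: the values autovacuum compares when deciding
  -- whether a table needs to be vacuumed (per-table reloptions ignored)
  SELECT s.relname,
         s.n_dead_tup,
         current_setting('autovacuum_vacuum_threshold')::bigint
           + current_setting('autovacuum_vacuum_scale_factor')::float8
             * c.reltuples AS dead_tuple_trigger_point
  FROM pg_stat_user_tables s
  JOIN pg_class c ON c.oid = s.relid
  WHERE s.relname = 'pgbench_accounts';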
Apparently vacuuming less aggressively results in less need for vacuuming! I believe that there is a sharp discontinuity here -- a crossover point for autovacuum_vacuum_scale_factor at which the behavior of the system *flips*, from very *frequent* autovacuum runs against the accounts table to *zero* runs. This seems like a real problem to me. I bet it has real consequences that are hard to model. In any case this simple model seems convincing enough.

The dead tuple accounting makes it much harder to set autovacuum_vacuum_scale_factor very aggressively (say to 0.02 or so) -- nobody is going to want to do that as long as it makes the system launch useless autovacuum workers that never end up doing useful work in a subset of tables. Users are currently missing out on the benefit of very aggressive autovacuums against tables where it truly makes sense.

The code in acquire_sample_rows()/heapam_scan_analyze_next_tuple() counts tuples/line pointers on a physical heap page. Perhaps it should "operate against an imaginary version of the page" instead -- the page as it would be just *after* lazy_scan_prune() is called for the page during a future VACUUM. More concretely, if there is a HOT chain then acquire_sample_rows() could perhaps count either 0 or 1 of the chain's tuples as dead. The code might be taught to recognize that a total absence of LP_DEAD stub items on the heap page strongly indicates that the workload can manage HOT chains via opportunistic pruning. I'm just speculating about what alternative design might fix the issue at this point.

In any case I contend that the current behavior gets too much wrong, and should be fixed in Postgres 15.

--
Peter Geoghegan