Defining (and possibly skipping) useless VACUUM operations

Peter Geoghegan Sun, 12 Dec 2021 17:48:02 -0800

Robert Haas has written on the subject of useless vacuuming, here:

http://rhaas.blogspot.com/2020/02/useless-vacuuming.html


I'm sure at least a few of us have thought about the problem at some
point. I would like to discuss how we can actually avoid useless
vacuuming, and what our goals should be.

I am currently working on decoupling advancing relfrozenxid from tuple
freezing [1]. That is, I'm teaching VACUUM to keep track of
information that it uses to generate an "optimal value" for the
table's final relfrozenxid: the oldest XID value that might still
be in the table. This patch is based on the observation that we don't
actually have to use the FreezeLimit cutoff for our new
pg_class.relfrozenxid. We need only obey the basic relfrozenxid
invariant, which is that the final value must be <= any extant XID in
the table.  Using FreezeLimit is needlessly conservative.
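
To make the invariant concrete, here is a minimal, self-contained
sketch of the bookkeeping (not the patch itself). It assumes the
tracker starts out at OldestXmin and is only ever ratcheted backwards
when an older unfrozen XID is left behind. The names Xid, xid_precedes
and new_relfrozenxid are hypothetical stand-ins, and the array of
"remaining" XIDs stands in for whatever VACUUM observes while scanning
the heap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;               /* hypothetical stand-in for TransactionId */

#define FIRST_NORMAL_XID ((Xid) 3)  /* XIDs below this are special (e.g. frozen) */

/* Modulo-2^32 comparison: true if a is logically older than b. */
static bool
xid_precedes(Xid a, Xid b)
{
    return (int32_t) (a - b) < 0;
}

/*
 * Return the newest value that is still <= every unfrozen XID that
 * remains in the table.  We start from OldestXmin (assumption: no XID
 * newer than that needs tracking, since any such XID is >= OldestXmin
 * anyway) and move backwards whenever an older XID is encountered.
 */
Xid
new_relfrozenxid(const Xid *remaining, size_t n, Xid oldest_xmin)
{
    Xid candidate = oldest_xmin;

    assert(oldest_xmin >= FIRST_NORMAL_XID);

    for (size_t i = 0; i < n; i++)
    {
        if (remaining[i] < FIRST_NORMAL_XID)
            continue;           /* frozen/special XIDs impose no constraint */
        if (xid_precedes(remaining[i], candidate))
            candidate = remaining[i];
    }
    return candidate;
}
```

When nothing older than OldestXmin is left behind, the function returns
OldestXmin itself, which is the "precise lowest safe value" case.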

My draft patch to implement the optimization (which builds on the
patches already posted to [1]) will reliably set pg_class.relfrozenxid
to the same VACUUM's precise original OldestXmin once certain
conditions are met -- reasonably common conditions. For example, the
same precise OldestXmin XID is used for relfrozenxid in the event of a
manual VACUUM (without FREEZE) on a table that was just bulk-loaded,
assuming the system is otherwise idle. Setting relfrozenxid to the
precise lowest safe value happens on a best-effort basis, without
needlessly tying that to things like when or how we freeze tuples.

It now occurs to me to push this patch in another direction, on top of
all that: the OldestXmin behavior hints at a precise, robust way of
defining "useless vacuuming". We can condition skipping a VACUUM (i.e.
classifying it as one that "definitely won't be useful if allowed to
execute") on whether or not our preexisting pg_class.relfrozenxid
precisely equals our newly-acquired OldestXmin for an about-to-begin
VACUUM operation.  (We'd also want to add an "unchangeable
pg_class.relminmxid" test, I think.)
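
The test itself could be as simple as the following sketch. All of the
type and function names here are hypothetical, and the relminmxid half
is an assumption on my part: I model "unchangeable relminmxid" as
equality with the oldest MultiXactId cutoff acquired at the start of
VACUUM, which may not be the exact form the final test takes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t Xid;       /* hypothetical stand-in for TransactionId */
typedef uint32_t MultiXid;  /* hypothetical stand-in for MultiXactId */

/* Hypothetical copy of the two pg_class fields of interest. */
typedef struct RelHorizons
{
    Xid      relfrozenxid;
    MultiXid relminmxid;
} RelHorizons;

/* Hypothetical cutoffs acquired just before VACUUM would begin. */
typedef struct VacuumCutoffs
{
    Xid      oldest_xmin;
    MultiXid oldest_mxact;
} VacuumCutoffs;

/*
 * True when neither relfrozenxid nor relminmxid could possibly be
 * advanced by the about-to-begin VACUUM, because both already equal
 * the newest values that the VACUUM would be allowed to set.
 */
bool
vacuum_is_provably_useless(const RelHorizons *rel,
                           const VacuumCutoffs *cutoffs)
{
    assert(rel != NULL && cutoffs != NULL);

    return rel->relfrozenxid == cutoffs->oldest_xmin &&
           rel->relminmxid == cutoffs->oldest_mxact;
}
```

Note that this is a plain equality test, not a "precedes" test: the
point is that it only fires when we can be sure nothing has changed.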

This definition does seem to be close to ideal: We're virtually
assured that there will be no more useful work for us, in a way that
is grounded in theory but still quite practical. But it's not a slam
dunk. A person could still argue that we shouldn't cancel the VACUUM
before it has begun, even when all these conditions have been met.
This would not be a particularly strong argument, mind you, but it's
still worth taking seriously. We need an exact problem statement that
justifies whatever definition of "useless VACUUM" we settle on.

Here are arguments *against* the skipping behavior I sketched out:

* An aborted transaction might have left behind tuples that need to be
cleaned up, and that cleanup should be able to go ahead despite the
unchanged OldestXmin. (I think that this is the argument with the most
merit, by quite a bit.)

* In general index AMs may want to do deferred cleanup, say to place
previously deleted pages in the FSM. Although in practice the criteria
for recycling safety used by nbtree and GiST will make that
impossible, there is no fundamental reason why they need to work that
way (XIDs are used, but only because they provide a conveniently
available notion of "logical time" that is sufficient to implement
what Lanin & Shasha call "the drain technique"). Plus GIN really could
do real work in amvacuumcleanup, for the pending list. There are bound
to be a handful of marginal things like this.

* Who are we to intervene like this, anyway? (This objection makes
much more sense if we don't limit ourselves to autovacuum worker
operations.)

Offhand, I suspect that we should only consider skipping "useless"
anti-wraparound autovacuums (not other kinds of autovacuums, not
manual VACUUMs). The arguments against skipping are weakest for the
anti-wraparound case. And the arguments in favor are particularly
strong: we should specifically avoid starting a useless (and possibly
time-consuming) anti-wraparound autovacuum, because that could easily
block an actually-useful autovacuum launched some time later. We
should aim to be in a position to launch an anti-wraparound autovacuum
that can actually advance relfrozenxid as soon as that becomes
possible (e.g. when the DBA drops an old replication slot that was
holding back each VACUUM's OldestXmin). And so "skipping" makes us
much more responsive, which seems like it might matter a lot in
practice. It minimizes the risk of wraparound failure.
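
As a sketch of that policy (hypothetical names throughout, and only
one of several ways it could be expressed): the decision to skip is
gated on the kind of VACUUM first, and on the "provably useless" test
second. The second argument is assumed to be computed elsewhere, from
whatever definition of "useless" we settle on.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of how a VACUUM was launched. */
typedef enum VacuumKind
{
    VACUUM_MANUAL,              /* user-issued VACUUM command */
    VACUUM_AUTOVACUUM,          /* ordinary autovacuum */
    VACUUM_ANTI_WRAPAROUND      /* anti-wraparound autovacuum */
} VacuumKind;

/*
 * Only anti-wraparound autovacuums are candidates for skipping, and
 * only when the caller has already established that the operation
 * cannot advance relfrozenxid.
 */
bool
should_skip_vacuum(VacuumKind kind, bool provably_useless)
{
    switch (kind)
    {
        case VACUUM_ANTI_WRAPAROUND:
            return provably_useless;
        case VACUUM_MANUAL:
        case VACUUM_AUTOVACUUM:
            return false;       /* never second-guess these */
        default:
            assert(!"unknown VacuumKind");
            return false;
    }
}
```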

There is also a strong argument for logging our failure to clean up
anything in any autovacuum -- we don't do nearly enough alerting when
stuff like this happens (possibly because "useless" is such a squishy
concept right now?). But even just logging something requires defining
"useless VACUUM operation" in a way that is both reliable and
proportionate, so it still necessitates solving that hard problem.

[1] https://commitfest.postgresql.org/36/3433/
-- 
Peter Geoghegan
