On Thu, Jan 14, 2021 at 10:42 AM Robert Haas <robertmh...@gmail.com> wrote:
> There are also users I've seen get hosed by vacuuming too
> aggressively. I have seen this happen in two ways. One is too much
> dirty data. The other is too much read I/O, pushing hot data out of
> the cache, leading to a storm of random I/O later when the foreground
> workload needs to get that stuff back, basically killing the system,
> sometimes for hours, while it tries to get back the stuff it lost.
> TL;DR: This change is fine with me, but this whole system has much
> deeper issues.

It seems like there is a good chance that the customers of yours who complained about the read I/O (which was not accompanied by dirtying) were really bothered by all of their indexes being read by VACUUM. The freeze map is probably quite effective as far as that goes, but the burden of index vacuuming is what they tend to notice. This is perhaps made worse by the sudden, infrequent nature of index vacuuming against a big append-only or append-mostly table.

I imagine that the problem here is that we're doing index vacuuming when we shouldn't be -- these customers basically had it right. Their intuition that this work is unnecessary is in fact the right one. How can it be okay to vacuum an index when the table only has 10 dead tuples (just to freeze some pages at the end of the table)? That's ridiculous. And it has nothing to do with these settings. (Even if I'm wrong to suggest that that was what happened, the details and nuance of what actually happened are likely to be important.)

We should be avoiding index vacuuming in many more cases. If there are only a tiny number of garbage index tuples, then we really shouldn't bother (once again, please feel free to weigh in on Masahiko's patch over on the "New IndexAM API controlling index vacuum strategies" thread -- that's very interesting work IMV). Bottom-up index deletion creates far more opportunities for this kind of thing to occur naturally. It will now do ~99.9% of garbage tuple cleanup in the indexes that naturally use it all the time.

We can expect that intuitions held by DBAs with experience of other RDBMSs will start to have more predictive power when they think about Postgres and VACUUM, which seems like a very good thing (and something that we can and should continue to build on). Roughly speaking, we ought to look for more and more ways to make the physical representation of the data closer to the logical contents of the database (that's what these DBAs start with, and that's where their intuitions come from, which actually makes perfect sense).

Now back to the main topic: the GUC's default value. I believe that your experiences here (the experiences in both directions) are representative -- I think I've heard of all of that myself. Like you, I think that insufficient vacuuming is much more common than excessive vacuuming. You do still have some cases where an excessive amount of read I/O from VACUUM (without any dirtying) is the issue (or at least *seems* to be the issue, per my aside). I have a high-level theory that is consistent with what you say and may suggest a better direction for us, but it's tricky. I'll try to resolve the apparent contradictions in my own arguments as I go (I'm a little burnt out at the moment, so please indulge me).

I think that The Real Problem is *not*, exactly, that it's too hard to tune this stuff as a practical matter. The entire premise of these costing parameters is that the DBA can and should make a trade-off between query response time/workload throughput on the one hand and vacuuming on the other, as if the two were separate concerns that are simply unrelated. That sounds completely wrong to me. It sounds so wrong that I can't go down that mental path for more than 5 seconds without giving up on it. Are we really expected to believe that VACUUM generally has all the time in the world, and so should proceed at a leisurely pace?
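To make that premise concrete, here is a rough sketch of how the cost machinery behaves -- a Python paraphrase of the logic around vacuum_delay_point() and the buffer-access costing, simplified and written from memory rather than quoted from the real C code. Every buffer VACUUM touches charges a cost against a running balance; once the balance crosses vacuum_cost_limit, the worker naps for roughly vacuum_cost_delay:

    import time

    # GUC defaults shown here are illustrative: page hit/miss/dirty cost
    # 1/10/20, vacuum_cost_limit is 200, and autovacuum's cost delay is 2ms.
    VACUUM_COST_PAGE_HIT = 1
    VACUUM_COST_PAGE_MISS = 10    # the setting under discussion
    VACUUM_COST_PAGE_DIRTY = 20
    VACUUM_COST_LIMIT = 200
    VACUUM_COST_DELAY = 0.002     # seconds

    cost_balance = 0

    def charge(cost):
        # Conceptually charged for every page VACUUM hits, misses, or dirties.
        global cost_balance
        cost_balance += cost

    def vacuum_delay_point():
        # Called between pages: sleep once we're over budget, then reset.
        global cost_balance
        if cost_balance >= VACUUM_COST_LIMIT:
            naptime = min(VACUUM_COST_DELAY * cost_balance / VACUUM_COST_LIMIT,
                          VACUUM_COST_DELAY * 4)
            time.sleep(naptime)
            cost_balance = 0

    # Toy usage: a scan that misses on every page sleeps ~2ms per 20 pages.
    for page in range(1000):
        charge(VACUUM_COST_PAGE_MISS)
        vacuum_delay_point()

Notice that nothing in this loop knows anything about the workload; the only "model" is the relative weight of the three page costs, which is exactly the premise I'm questioning.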
It's almost as if the original designer imagined that the IT department should be made to wait on the result of one of those "VACUUM bigtable;" reports that they seem to be so keen on (other queries are running that deliver real business value, after all). I'm only half-joking here -- anybody reading this should take a moment to actively consider just how little sense any of this makes. It's so completely and implausibly wrong that it seems likely to actually be slightly right, if only by mistake.

There seems to be one important way in which the cost parameter design is accidentally useful: the page dirtying stuff probably works reasonably well. It really does make sense to throttle VACUUM in response to dirtying pages, optimistically assuming that VACUUM will eventually catch up. That part makes sense precisely because it treats VACUUM as something directly tied to the workload (an accidental happy exception to the bogus general rule behind the costing stuff).

Of course, the reason this optimism works out is not that these situations are guaranteed to work themselves out naturally in some general sense (that's definitely not something we should *truly* assume). There is a deeper way in which it works as a heuristic: it covers all of the possibilities at once (kind of), without truly predicting what will happen with the workload. We need to be adaptive whenever we think about GC/vacuuming, I believe -- worrying about the general or common case (for any value of common) is often a waste of time IME. I'll now go through two different scenarios in which this same assumption works in our favor.

I'll start with the simple case: it's possible that VACUUM is throttled consistently by dirtying pages, slowing it down considerably. The situation may naturally resolve itself in an uncomplicated way (e.g. the bursty application calms down following a spike), in which case the original foundational design assumption I described works out in the obvious way. Easy.

Now for the more complicated case: VACUUM is throttled consistently by dirtying pages (if anything), but the database load does not and will not go down. We *still* end up doing approximately the right thing here (again, assuming you reduce vacuum_cost_page_miss to something like 3). The fact that we're dirtying new pages (when throttled) suggests that the garbage in these pages really is quite old (it's "floating garbage" in GC theory terms). So we don't want to slow everything down by dirtying ever more pages. At the same time we want to get through already-dirty pages as quickly as possible, and also quickly move past pages that are not in shared buffers and don't get dirtied (those are relatively cheap on modern hardware).

Opportunistic heap pruning eventually runs often enough that VACUUM won't have to dirty most heap pages in a big table (in its first pass). So dirty pages are debt that VACUUM can easily create, whereas buffer misses are paid for directly by VACUUM -- they are their own backpressure, for the most part. Making the costing stuff highly sensitive to dirtying pages (but not to much else) works out because it either avoids making a bad situation worse, or has no real additional downside when the system is completely overwhelmed (i.e. bottlenecked on cleaning dirty pages). If it's just impossible for buffer cleaning to keep up, then things will naturally slow down. This might even be seen as a natural outcome -- workloads must live within their means.
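To put rough numbers on that -- a back-of-the-envelope sketch assuming the stock autovacuum settings of vacuum_cost_limit = 200 and autovacuum_vacuum_cost_delay = 2ms, 8KiB pages, and ignoring the time spent doing actual work between naps:

    # A throttled VACUUM gets to spend roughly cost_limit units per cost_delay:
    units_per_sec = 200 / 0.002            # ~100,000 cost units/second
    page_bytes = 8192

    def ceiling(cost_per_page):
        # Upper bound on throughput if this were the only cost being charged.
        return units_per_sec / cost_per_page * page_bytes / (1024 ** 2)

    print(f"dirtying (cost 20): ~{ceiling(20):.0f} MiB/s")   # ~39 MiB/s
    print(f"misses   (cost 10): ~{ceiling(10):.0f} MiB/s")   # ~78 MiB/s
    print(f"misses   (cost 3):  ~{ceiling(3):.0f} MiB/s")    # ~260 MiB/s
    print(f"hits     (cost 1):  ~{ceiling(1):.0f} MiB/s")    # ~781 MiB/s

With vacuum_cost_page_miss down at something like 3, plain reads can proceed at a rate that modern storage barely notices, while the 20-unit dirty cost still caps the rate at which VACUUM can pile up write-behind debt at a few tens of MiB/s -- which is the behavior I'm arguing is accidentally about right.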
We should do our best to make it less likely that things slow down, by improving various important subsystems, whatever those might be (e.g. the io_uring stuff from Andres). But ultimately there is no way to reliably avoid having the system get overwhelmed by dirty pages. The fact that such an outcome is always possible is no reason to double down on dubious top-down interventions from VACUUM. You cannot get out of debt by taking out another loan to pay off your existing creditors.

--
Peter Geoghegan