On Thu, Jan 14, 2021 at 10:42 AM Robert Haas <robertmh...@gmail.com> wrote:
> There are also users I've seen get hosed by vacuuming too
> aggressively. I have seen this happen in two ways. One is too much
> dirty data. The other is too much read I/O, pushing hot data out of
> the cache, leading to a storm of random I/O later when the foreground
> workload needs to get that stuff back, basically killing the system,
> sometimes for hours, while it tries to get back the stuff it lost.
> TL;DR: This change is fine with me, but this whole system has much
> deeper issues.

It seems like there is a good chance that the customers of yours who complained about the read I/O (which was not accompanied by dirtying) were really bothered by all of their indexes being read by VACUUM. The freeze map is probably quite effective as far as that goes, but the burden of index vacuuming is what they tend to notice. This is perhaps made worse by the sudden, infrequent nature of index vacuuming against a big append-only or append-mostly table.

I imagine that the problem here is that we're doing index vacuuming when we shouldn't be -- these customers basically had it right. Their intuition that this work is unnecessary is in fact the right one. How can it be okay to vacuum an index when the table only has 10 dead tuples (just to freeze some pages at the end of the table)? That's ridiculous. And it has nothing to do with these settings. (Even if I'm wrong to suggest that that was what happened, the details and nuance of what actually happened are likely to be important.)

We should be avoiding index vacuuming in many more cases. If there are only a tiny number of garbage index tuples, then we really shouldn't bother (once again, please feel free to weigh in on Masahiko's patch over on the "New IndexAM API controlling index vacuum strategies" thread -- that's very interesting work IMV). Bottom-up index deletion creates far more opportunities for this kind of thing to occur naturally. It will now do ~99.9% of garbage tuple cleanup in the indexes that naturally use it all the time.

We can expect that intuitions held by DBAs with experience of other RDBMSs will start to have more predictive power when they think about Postgres and VACUUM, which seems like a very good thing (and something that we can and should continue to build on). Roughly speaking, we ought to look for more and more ways to make the physical representation of the data closer to the logical contents of the database (that's what these DBAs start with, and that's where their intuitions come from, which actually makes perfect sense).

Now back to the main topic: the GUC's default value. I believe that your experiences here (the experiences in both directions) are representative -- I think I've heard of all of that myself. Like you, I think that insufficient vacuuming is much more common than excessive vacuuming. You do still have some cases where an excessive amount of read I/O from VACUUM (without any dirtying) is the issue (or at least *seems* to be the issue, per my aside). I have a high-level theory that is consistent with what you say and may suggest a better direction for us, but it's tricky. I'll try to resolve the apparent contradictions in my own arguments as I go (I'm a little burnt out at the moment, so please indulge me).

I think that The Real Problem is *not*, exactly, that it's too hard to tune this stuff as a practical matter. The entire premise of these costing parameters is that the DBA can and should make a trade-off between query response time/workload throughput on the one hand and vacuuming on the other, as if the two were separate concerns that are simply unrelated. That sounds completely wrong to me. It sounds so wrong that I can't go down that mental path for more than 5 seconds without giving up on it. Are we really expected to believe that VACUUM generally has all the time in the world, and so should proceed at a leisurely pace?
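To make that premise concrete, here is a rough sketch of how the cost machinery behaves -- a Python paraphrase of the logic around vacuum_delay_point() and the buffer-access costing, simplified and written from memory rather than quoted from the real C code. Every buffer VACUUM touches charges a cost against a running balance; once the balance crosses vacuum_cost_limit, the worker naps for roughly vacuum_cost_delay:

    import time

    # GUC defaults shown here are illustrative: page hit/miss/dirty cost
    # 1/10/20, vacuum_cost_limit is 200, and autovacuum's cost delay is 2ms.
    VACUUM_COST_PAGE_HIT = 1
    VACUUM_COST_PAGE_MISS = 10    # the setting under discussion
    VACUUM_COST_PAGE_DIRTY = 20
    VACUUM_COST_LIMIT = 200
    VACUUM_COST_DELAY = 0.002     # seconds

    cost_balance = 0

    def charge(cost):
        # Conceptually charged for every page VACUUM hits, misses, or dirties.
        global cost_balance
        cost_balance += cost

    def vacuum_delay_point():
        # Called between pages: sleep once we're over budget, then reset.
        global cost_balance
        if cost_balance >= VACUUM_COST_LIMIT:
            naptime = min(VACUUM_COST_DELAY * cost_balance / VACUUM_COST_LIMIT,
                          VACUUM_COST_DELAY * 4)
            time.sleep(naptime)
            cost_balance = 0

    # Toy usage: a scan that misses on every page sleeps ~2ms per 20 pages.
    for page in range(1000):
        charge(VACUUM_COST_PAGE_MISS)
        vacuum_delay_point()

Notice that nothing in this loop knows anything about the workload; the only "model" is the relative weight of the three page costs, which is exactly the premise I'm questioning.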
It's almost as if the original designer imagined that the IT department should be made to wait on the result of one of those "VACUUM bigtable;" reports that they seem to be so keen on (other queries are running that deliver real business value, after all). I'm only half-joking here -- anybody reading this should take a moment to actively consider just how little sense any of this makes. It's so completely and implausibly wrong that it seems likely to actually be slightly right, if only by mistake.

There seems to be one important way in which the cost parameter design is accidentally useful: the page dirtying stuff probably works reasonably well. It really does make sense to throttle VACUUM in response to dirtying pages, optimistically assuming that VACUUM will eventually catch up. That part makes sense precisely because it treats VACUUM as something directly tied to the workload (an accidental happy exception to the bogus general rule behind the costing stuff).

Of course, the reason this optimism works out is not that these situations are guaranteed to work themselves out naturally in some general sense (that's definitely not something we should *truly* assume). There is a deeper way in which it works as a heuristic: it covers all of the possibilities at once (kind of), without truly predicting what will happen with the workload. We need to be adaptive whenever we think about GC/vacuuming, I believe -- worrying about the general or common case (for any value of common) is often a waste of time IME. I'll now go through two different scenarios in which this same assumption works in our favor.

I'll start with the simple case: it's possible that VACUUM is throttled consistently by dirtying pages, slowing it down considerably. The situation may naturally resolve itself in an uncomplicated way (e.g. the bursty application calms down following a spike), in which case the original foundational design assumption I described works out in the obvious way. Easy.

Now for the more complicated case: VACUUM is throttled consistently by dirtying pages (if anything), but the database load does not and will not go down. We *still* end up doing approximately the right thing here (again, assuming you reduce vacuum_cost_page_miss to something like 3). The fact that we're dirtying new pages (when throttled) suggests that the garbage in these pages really is quite old (it's "floating garbage" in GC theory terms). So we don't want to slow everything down by dirtying ever more pages. At the same time we want to get through already-dirty pages as quickly as possible, and also quickly move past pages that are not in shared buffers and don't get dirtied (those are relatively cheap on modern hardware).

Opportunistic heap pruning eventually runs often enough that VACUUM won't have to dirty most heap pages in a big table (in its first pass). So dirty pages are debt that VACUUM can easily create, whereas buffer misses are paid for directly by VACUUM -- they are their own backpressure, for the most part. Making the costing stuff highly sensitive to dirtying pages (but not to much else) works out because it either avoids making a bad situation worse, or has no real additional downside when the system is completely overwhelmed (i.e. bottlenecked on cleaning dirty pages). If it's just impossible for buffer cleaning to keep up, then things will naturally slow down. This might even be seen as a natural outcome -- workloads must live within their means.
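To put rough numbers on that -- a back-of-the-envelope sketch assuming the stock autovacuum settings of vacuum_cost_limit = 200 and autovacuum_vacuum_cost_delay = 2ms, 8KiB pages, and ignoring the time spent doing actual work between naps:

    # A throttled VACUUM gets to spend roughly cost_limit units per cost_delay:
    units_per_sec = 200 / 0.002            # ~100,000 cost units/second
    page_bytes = 8192

    def ceiling(cost_per_page):
        # Upper bound on throughput if this were the only cost being charged.
        return units_per_sec / cost_per_page * page_bytes / (1024 ** 2)

    print(f"dirtying (cost 20): ~{ceiling(20):.0f} MiB/s")   # ~39 MiB/s
    print(f"misses   (cost 10): ~{ceiling(10):.0f} MiB/s")   # ~78 MiB/s
    print(f"misses   (cost 3):  ~{ceiling(3):.0f} MiB/s")    # ~260 MiB/s
    print(f"hits     (cost 1):  ~{ceiling(1):.0f} MiB/s")    # ~781 MiB/s

With vacuum_cost_page_miss down at something like 3, plain reads can proceed at a rate that modern storage barely notices, while the 20-unit dirty cost still caps the rate at which VACUUM can pile up write-behind debt at a few tens of MiB/s -- which is the behavior I'm arguing is accidentally about right.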
We should do our best to make it less likely that things slow down, by improving various important subsystems, whatever those might be (e.g. the io_uring stuff from Andres). But ultimately there is no way to reliably avoid having the system get overwhelmed by dirty pages. The fact that such an outcome is always possible is no reason to double down on dubious top-down interventions from VACUUM. You cannot get out of debt by taking out another loan to pay off your existing creditors.

--
Peter Geoghegan