On Mon, Nov 22, 2021 at 9:49 PM Andres Freund <and...@anarazel.de> wrote:
> > For example, we can definitely afford to wait a few more milliseconds
> > to get a cleanup lock just once
>
> We currently have no infrastructure to wait for an lwlock or pincount for a
> limited time. And at least for the former it'd not be easy to add. It may be
> worth adding that at some point, but I'm doubtful this is sufficient reason
> for nontrivial new infrastructure in very performance sensitive areas.
It was a hypothetical example. To be more practical about it: it seems
likely that we won't really benefit from waiting some amount of time (not
forever) for a cleanup lock in non-aggressive VACUUM, once we have some of
the relfrozenxid stuff we've talked about in place. In a world where we're
smarter about advancing relfrozenxid in non-aggressive VACUUMs, the choice
between waiting for a cleanup lock and not waiting (but also not advancing
relfrozenxid at all) matters less -- it's no longer a binary choice.

It's no longer a binary choice because we will have done away with the
current rigid scheme in which our new relfrozenxid for the relation is
either FreezeLimit or nothing at all. So far we've only talked about the
case where we can update relfrozenxid with a value that happens to be much
newer than FreezeLimit. If we can do that, that's great. But what about
setting relfrozenxid to an *older* value than FreezeLimit instead (in a
non-aggressive VACUUM)? That's also pretty good! There is still a decent
chance that the final "suboptimal" relfrozenxid that we can safely set in
pg_class at the end of our VACUUM will be far more recent than the
preexisting relfrozenxid -- especially with larger tables.

Advancing relfrozenxid should be thought of as something totally
independent of freezing tuples, at least in vacuumlazy.c itself. That's
kinda the case today, even, but *explicitly* decoupling advancing
relfrozenxid from actually freezing tuples seems like a good high level
goal for this project.

Remember, FreezeLimit is derived from vacuum_freeze_min_age in the obvious
way: OldestXmin for the VACUUM, minus the vacuum_freeze_min_age
GUC/reloption setting. I'm pretty sure that this means that making
autovacuum freeze tuples more aggressively (by reducing
vacuum_freeze_min_age) could have the perverse effect of making
non-aggressive VACUUMs less likely to advance relfrozenxid -- which is
exactly backwards. This effect could easily be missed, even by expert
users, since there is no convenient instrumentation that shows how and
when relfrozenxid is advanced.
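To spell that derivation out, here is a simplified sketch of how the
cutoff falls out of the GUC (the real computation lives in
vacuum_set_xid_limits(), which also clamps the effective freeze age and
worries about autovacuum_freeze_max_age; the function below is made up
purely for illustration):

/*
 * Sketch only, not the real thing: roughly how FreezeLimit is derived
 * from OldestXmin and vacuum_freeze_min_age.
 */
static TransactionId
sketch_freeze_limit(TransactionId oldestXmin, int freeze_min_age)
{
    TransactionId limit;

    Assert(TransactionIdIsNormal(oldestXmin));

    limit = oldestXmin - freeze_min_age;

    /* Don't let the cutoff wrap into the special/bootstrap XID range */
    if (!TransactionIdIsNormal(limit))
        limit = FirstNormalTransactionId;

    return limit;
}

The point being that the only relfrozenxid value we'll ever consider today
falls directly out of vacuum_freeze_min_age, which is what makes the
interaction described above possible.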
> > All of the autovacuums against the accounts table look similar to this
> > one -- you don't see anything about relfrozenxid being advanced
> > (because it isn't). Does that really make
> > sense, though?
>
> Does what make really sense?

Well, my accounts table example wasn't a particularly good one (it was
just a conveniently available example). I am now sure that you already got
the point I was trying to make here, based on what you go on to say about
non-aggressive VACUUMs optionally *not* skipping
all-visible-but-not-all-frozen heap pages in the hopes of advancing
relfrozenxid earlier (more on that idea below, in my response). On
reflection, the simplest way of expressing the same idea is what I just
said about decoupling (decoupling advancing relfrozenxid from freezing).

> I think pgbench_accounts is just a really poor showcase. Most importantly
> there are no even slightly longer running transactions that hold down the
> xid horizon. But in real workloads that's incredibly common IME. It's also
> quite uncommon in real workloads to have huge tables in which all records
> are updated. It's more common to have value ranges that are nearly static,
> and a more heavily changing range.

I agree.

> I think the most interesting cases where using the "measured" horizon will
> be advantageous are anti-wrap vacuums. Those obviously have to happen for
> rarely modified tables, including completely static ones, too. Using the
> "measured" horizon will allow us to reduce the frequency of anti-wrap
> autovacuums on old tables, because we'll be able to set a much more recent
> relfrozenxid.

That's probably true in practice -- but who knows these days, with the
autovacuum_vacuum_insert_scale_factor stuff? Either way I see no reason to
emphasize that case in the design itself. The "decoupling" concept now
seems like the key design-level concept -- everything else follows
naturally from that.

> This is becoming more common with the increased use of partitioning.

Also with bulk loading. There could easily be a tiny number of distinct
XIDs that are close together in time, for many many rows -- practically
one XID, or even exactly one XID.

> No, not quite. We treat anti-wraparound vacuums as an emergency (including
> logging messages, not cancelling). But the only mechanism we have against
> anti-wrap vacuums happening is vacuum_freeze_table_age. But as you say,
> that's not really a "real" mechanism, because it requires an "independent"
> reason to vacuum a table.

Got it.

> I've seen cases where anti-wraparound vacuums weren't a problem / never
> happened for important tables for a long time, because there always was an
> "independent" reason for autovacuum to start doing its thing before the
> table got to be autovacuum_freeze_max_age old. But at some point the
> important tables started to be big enough that autovacuum didn't schedule
> vacuums that got promoted to aggressive via vacuum_freeze_table_age before
> the anti-wrap vacuums.

Right. Not just because they were big; also because autovacuum runs at
geometric intervals -- the final reltuples from last time is used to
determine the point at which autovacuum runs this time. This might make
sense, or it might not make any sense -- it all depends (mostly on index
stuff).

> Then things started to burn, because of the unpaced anti-wrap vacuums
> clogging up all IO, or maybe it was the vacuums not cancelling - I don't
> quite remember the details.

Non-cancelling anti-wraparound VACUUMs that (all of a sudden) cause chaos
because they interact badly with automated DDL is something I've seen
several times -- I'm sure you have too. That was what the Manta/Joyent
blog post I referenced upthread went into.

> Behaviour that leads to a "sudden" falling over, rather than getting
> gradually worse, is bad - it somehow tends to happen on Friday evenings :).

These are among our most important challenges, IMV.

> Just that autovacuum should have a mechanism to trigger aggressive vacuums
> (i.e. ones that are guaranteed to be able to increase relfrozenxid unless
> cancelled) before getting to the "emergency"-ish anti-wraparound state.

Maybe, but that runs into the problem of needing another GUC that nobody
will ever be able to remember the name of. I consider the idea of adding a
variety of measures that make non-aggressive VACUUM much more likely to
advance relfrozenxid in practice to be far more promising.

> Or alternatively that we should have a separate threshold for the
> "harsher" anti-wraparound measures.

Or maybe just raise the default of autovacuum_freeze_max_age, which many
people don't change? That might be a lot safer than it once was. Or will
be, once we manage to teach VACUUM to advance relfrozenxid more often in
non-aggressive VACUUMs on Postgres 15. Imagine a world in which we have
that stuff in place, as well as related enhancements added in earlier
releases: autovacuum_vacuum_insert_scale_factor, the freeze map, and the
wraparound failsafe.
These add up to a lot; with all of that in place, the risk we'd be
introducing by increasing the default value of autovacuum_freeze_max_age
would be *far* lower than the risk of making the same change back in 2006.
I bring up 2006 because it was the year that commit 48188e1621 added
autovacuum_freeze_max_age -- the default hasn't changed since that time.

> I think workloads are a bit more varied than a realistic set of benchmarks
> that one person can run themselves.

No question. I absolutely accept that I only have to miss one important
detail with something like this -- that just goes with the territory. Just
saying that I have yet to see any evidence that the bypass-indexes
behavior really hurt anything. I do take the idea that I might have missed
something very seriously, despite all this.

> I gave you examples of cases that I see as likely being bitten by this,
> e.g. when the skipped index cleanup prevents IOS scans. When both the
> likely-to-be-modified and likely-to-be-queried value ranges are a small
> subset of the entire data, the 2% threshold can prevent vacuum from
> cleaning up LP_DEAD entries for a long time. Or when all index scans are
> bitmap index scans, and nothing ends up cleaning up the dead index entries
> in certain ranges, and even an explicit vacuum doesn't fix the issue. Even
> a relatively small rollback / non-HOT update rate can start to be really
> painful.

That does seem possible. But I consider it very unlikely to appear as a
regression caused by the bypass mechanism itself -- not in any way that
was consistent over time. As far as I can tell, autovacuum scheduling just
doesn't operate at that level of precision, and never has. I have
personally observed that ANALYZE does a very bad job at noticing LP_DEAD
items in tables/workloads where LP_DEAD items (not DEAD tuples) tend to
concentrate [1]. The whole idea that ANALYZE should count these items as
if they were normal tuples seems pretty bad to me.

Put it this way: imagine you run into trouble with the bypass thing, and
then you opt to disable it on that table (using the INDEX_CLEANUP
reloption). Why should this step solve the problem on its own? In order
for that to work, VACUUM would have to know to be very aggressive about
these LP_DEAD items. But there is good reason to believe that it just
won't ever notice them, as long as ANALYZE is expected to provide reliable
statistics that drive autovacuum -- they're just too concentrated for the
block-based approach to truly work.

I'm not minimizing the risk. Just telling you my thoughts on this.

> I'm a bit doubtful that's as important (which is not to say that it's not
> worth doing). For a heavily updated table the max space usage of the line
> pointer array just isn't as big a factor as ending up with only half the
> usable line pointers.

Agreed; by far the best chance we have of improving the line pointer bloat
situation is preventing it in the first place, by increasing
MaxHeapTuplesPerPage. Once we actually do that, our remaining options are
going to be much less helpful -- then it really is mostly just up to
VACUUM.

> And it's harder to diagnose why the
> cleanup isn't happening without knowledge that pages needing cleanup
> couldn't be cleaned up due to pins.
>
> If you want to improve the logic so that we only count pages that would
> have something to clean up, I'd be happy as well. It doesn't have to mean
> exactly what it means today.

It seems like what you really care about here are the remaining cases
where our inability to acquire a cleanup lock has real consequences -- you
want to hear about it when it happens, however unlikely it may be. In
other words, you want to keep something in log_autovacuum_* that indicates
that "less than the expected amount of work was completed" due to an
inability to acquire a cleanup lock. And so for you, this is a question of
keeping instrumentation that might still be useful, not a question of how
we define things fundamentally, at the design level. Sound right? If so,
then this proposal might be acceptable to you:

* Remaining DEAD tuples with storage (though not LP_DEAD items from
previous opportunistic pruning) will get counted separately in the
lazy_scan_noprune (no cleanup lock) path. Also count the total number of
distinct pages that were found to contain one or more such DEAD tuples.

* These two new counters will be reported on their own line in the log
output, though only in the cases where we actually have any such tuples --
which will presumably be much rarer than simply failing to get a cleanup
lock (that's now no big deal at all, because we now consistently do
certain cleanup steps, and because FreezeLimit isn't the only viable thing
that we can set relfrozenxid to, at least in the non-aggressive case).

* There is still a limited sense in which the same items get counted as
RECENTLY_DEAD -- though just those aspects that make the overall design
simpler. So the helpful aspects of this are still preserved. We only need
to tell pgstat_report_vacuum() that these items are "deadtuples"
(remaining dead tuples). That can work by having its caller add a new
int64 counter (the same new tuple-based counter used for the new log line)
to vacrel->new_dead_tuples. We'd also add the same new tuple counter in
about the same way at the point where we determine a final
vacrel->new_rel_tuples. So we wouldn't really be treating anything as
RECENTLY_DEAD anymore -- pgstat_report_vacuum() and
vacrel->new_dead_tuples don't specifically expect anything about
RECENTLY_DEAD-ness already.
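Concretely, that last point might look something like this at the call
site (just a hand-wavy sketch -- "missed_dead_tuples" is a made-up name
for the new counter accumulated in the lazy_scan_noprune path):

/*
 * Sketch: count tuples from pages where we couldn't get a cleanup lock
 * as plain "dead tuples" for pgstat purposes, rather than inventing a
 * new pgstat field.  vacrel->missed_dead_tuples is hypothetical.
 */
pgstat_report_vacuum(RelationGetRelid(rel),
                     rel->rd_rel->relisshared,
                     Max(new_live_tuples, 0),
                     vacrel->new_dead_tuples +
                     vacrel->missed_dead_tuples);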
> I was thinking of truncations, which I don't think vacuum-reltuples.spec
> tests.

Got it. I'll look into that for v2.

> Maybe. But we've had quite a few bugs because we ended up changing some
> detail of what is excluded in one of the counters, leading to wrong
> determination about whether we scanned everything or not.

Right. But let me just point out that my whole approach is to make that
impossible, by not needing to count pages, except in scanned_pages (and in
frozenskipped_pages + rel_pages). The processing performed for any page
that we actually read during VACUUM should be uniform (or practically
uniform), by definition -- with minimal fudging in the cleanup lock case
(because we mostly do the same work there too). There should be no reason
for any more page counters now, except for non-critical instrumentation.
For example, if you want to get the total number of pages skipped via the
visibility map (not just all-frozen pages), then you simply subtract
scanned_pages from rel_pages.
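That arithmetic is about all the instrumentation-only code would ever
need. A sketch, assuming the scheme described above (any page that isn't
read was skipped via the visibility map) and LVRelState counters with
roughly their current names -- the local variable names are made up:

/* Total pages skipped via the visibility map (all-frozen or not) */
BlockNumber vmskipped_pages = vacrel->rel_pages - vacrel->scanned_pages;

/* Of those, pages that were all-visible but not yet all-frozen */
BlockNumber allvisible_skipped = vmskipped_pages -
    vacrel->frozenskipped_pages;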
> > Fundamentally, this will only work if we decide to only skip all-frozen
> > pages, which (by definition) only happens within aggressive VACUUMs.
>
> Hm? Or if there's just no runs of all-visible pages of sufficient length,
> so we don't end up skipping at all.

Of course. But my point was: who knows when that'll happen?

> One reason for my doubt is the following:
>
> We can set all-visible on a page without an FPW image (well, as long as
> hint bits aren't logged). There's a significant difference between needing
> to WAL log FPIs for every heap page or not, and it's not that rare for
> data to live shorter than autovacuum_freeze_max_age or that limit never
> being reached.

This sounds like an objection to one specific heuristic, and not an
objection to the general idea. The only essential part is "opportunistic
freezing during vacuum, when the cost is clearly very low, and the benefit
is probably high". And so it now seems you were making a far more limited
statement than I first believed.

Obviously many variations are possible -- there is a spectrum. Example: a
heuristic that makes VACUUM notice when it is going to freeze at least one
tuple on a page, iff the page will be marked all-visible in any case -- in
that case we should instead freeze every tuple on the page and mark the
page all-frozen, batching the work (we could account for LP_DEAD items
here too, not counting them on the assumption that they'll become
LP_UNUSED during the second heap pass later on). If we see these
conditions, then the likely explanation is that the tuples on the heap
page happen to have XIDs that are "split" by the not-actually-important
FreezeLimit cutoff, despite being essentially similar in every way that
matters.

If you want to make the same heuristic more conservative: only do this
when no existing tuples are frozen, since that could be taken as a sign of
the original heuristic not quite working on the same heap page at an
earlier stage. I suspect that even very conservative versions of the same
basic idea would still help a lot.
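To make the shape of that heuristic concrete, it would sit at the end of
the first heap pass's per-page processing, roughly like this (purely a
sketch -- freeze_candidate_count and freeze_all_eligible_tuples() are
made-up names, and real code would go through heap_prepare_freeze_tuple()
and WAL-log the freezing as usual):

/*
 * Sketch of the "batch the freezing" heuristic, after pruning has run
 * for this page.  All names here are illustrative only.
 */
if (prunestate->all_visible && freeze_candidate_count > 0)
{
    /*
     * We're going to dirty the page and mark it all-visible anyway, and
     * at least one tuple already crosses FreezeLimit.  Freeze everything
     * that can legally be frozen (using OldestXmin as the cutoff rather
     * than FreezeLimit), so the page can be marked all-frozen too.
     */
    freeze_all_eligible_tuples(buf, vacrel->OldestXmin);
    prunestate->all_frozen = true;
}

The more conservative variant would just add one more condition, along the
lines of "no tuple on this page is already frozen".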
> Perhaps we can have most of the benefit even without that. If we were to
> freeze whenever it didn't cause an additional FPW, and perhaps didn't skip
> all-visible but not-all-frozen pages if they were less than x% of the
> to-be-scanned data, we should be able to still increase relfrozenxid in a
> lot of cases?

I bet that's true. I like that idea. If we had this policy, then the
number of "extra" visited-in-non-aggressive-vacuum pages (all-visible but
not yet all-frozen pages) could be managed over time through more
opportunistic freezing. This might make it work even better. These
all-visible (but not all-frozen) heap pages could be considered "tenured",
since they have survived at least one full VACUUM cycle without being
unset. So why not also freeze them, based on the assumption that they'll
probably stay that way forever? There won't be so many of these pages when
we do this anyway, by definition -- since we'd have a heuristic that
limited the total number (say to no more than 10% of the total relation
size, something like that). We're smoothing out the work that currently
takes place all together during an aggressive VACUUM this way.

Moreover, there is perhaps a good chance that the total number of
all-visible-not-all-frozen heap pages will *stay* low over time, as a
result of this policy actually working -- there may be a virtuous cycle
that totally prevents us from getting an aggressive VACUUM even once.

> > I have occasionally wondered if the whole idea of reading heap pages
> > with only a pin (and having cleanup locks in VACUUM) is really worth
> > it -- alternative designs seem possible. Obviously that's a BIG
> > discussion, and not one to have right now. But it seems kind of
> > relevant.
>
> With 'reading' do you mean reads-from-os, or just references to buffer
> contents?

The latter.

[1] https://postgr.es/m/CAH2-Wz=9r83wcwzcpuh4fvpedm4znzbzmvp3rt21+xhqwmu...@mail.gmail.com

--
Peter Geoghegan