On Tue, Jan 24, 2023 at 11:21 AM Robert Haas <robertmh...@gmail.com> wrote:
> > The whole article was about how this DROP TRIGGER pattern worked just
> > fine most of the time, because most of the time autovacuum was just
> > autocancelled. They say this at one point:
> >
> > "The normal autovacuum mechanism is skipped when locks are held in
> > order to minimize service disruption. However, because transaction
> > wraparound is such a severe problem, if the system gets too close to
> > wraparound, an autovacuum is launched that does not back off under
> > lock contention."
>
> If this isn't arguing in favor of exactly what I'm saying, I don't
> know what that would look like.

I'm happy to clear that up. What you said was:

"So I think this sounds like exactly the kind of case I was talking
about, where autovacuums keep getting cancelled until we decide to stop
cancelling them. If so, then they were going to have a problem whenever
that happened."

Just because *some* autovacuums get cancelled doesn't mean they *all*
get cancelled. And, even if the rate is quite high, that may not be
much of a problem in itself (especially now that we have the freeze
map). 200 million XIDs usually amounts to a lot of wall clock time.
Even if it is rather difficult to finish up, we only have to get lucky
once.

The fact that autovacuum eventually got to the point of requiring an
antiwraparound autovacuum on the problematic table does indeed strongly
suggest that any other, earlier autovacuums were relatively unlikely to
have advanced relfrozenxid in the end -- or at least couldn't on this
one occasion. But that in itself is just not relevant to our current
discussion, since even the tiniest perturbation would have been enough
to prevent a non-aggressive VACUUM from being able to advance
relfrozenxid. Before 15, non-aggressive VACUUMs would throw away the
opportunity to do so just because they couldn't immediately get a
cleanup lock on one single heap page.

It's quite possible that most or all prior aggressive VACUUMs were not
antiwraparound autovacuums, because the dead tuples accounting was
enough to launch an autovacuum at some point after age(relfrozenxid)
exceeded vacuum_freeze_table_age that was still before it could reach
autovacuum_freeze_max_age. That would give you a cancellable aggressive
VACUUM -- a VACUUM that actually has a non-zero chance of advancing
relfrozenxid.

Sure, it's possible that such a cancellable aggressive autovacuum was
indeed cancelled, and that that factor made the crucial difference.
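
To make the threshold relationship described above concrete, here is a
rough Python sketch of how an autovacuum might be classified. It is a
simplified model under stated assumptions, not the server's actual
logic: the names classify, AutovacuumKind, and dead_tuple_trigger are
made up for illustration, the two constants are the stock defaults of
the settings named above, and multixact ages and other triggers are
ignored.

```python
from dataclasses import dataclass

# Stock defaults of the two settings discussed above (assumed unchanged).
VACUUM_FREEZE_TABLE_AGE = 150_000_000
AUTOVACUUM_FREEZE_MAX_AGE = 200_000_000


@dataclass(frozen=True)
class AutovacuumKind:
    """Hypothetical summary of one autovacuum decision (illustration only)."""
    launched: bool
    aggressive: bool
    antiwraparound: bool
    cancellable: bool


def classify(relfrozenxid_age: int, dead_tuple_trigger: bool) -> AutovacuumKind:
    """Simplified model: which kind of autovacuum, if any, would run.

    relfrozenxid_age   -- stands in for age(relfrozenxid) of the table
    dead_tuple_trigger -- hypothetical flag: the dead tuple accounting
                          says the table is due for a vacuum
    """
    # Past autovacuum_freeze_max_age, a vacuum is forced regardless of
    # the dead tuple accounting, and it does not back off under lock
    # contention.
    antiwraparound = relfrozenxid_age > AUTOVACUUM_FREEZE_MAX_AGE
    launched = antiwraparound or dead_tuple_trigger
    # Any vacuum that runs past vacuum_freeze_table_age is aggressive,
    # meaning it can actually advance relfrozenxid.
    aggressive = launched and relfrozenxid_age > VACUUM_FREEZE_TABLE_AGE
    # Only non-antiwraparound autovacuums are auto-cancelled.
    cancellable = launched and not antiwraparound
    return AutovacuumKind(launched, aggressive, antiwraparound, cancellable)


if __name__ == "__main__":
    for age, trigger in [(100_000_000, True), (160_000_000, True),
                         (160_000_000, False), (210_000_000, False)]:
        print(age, trigger, classify(age, trigger))
```

The case of interest is an age between the two thresholds: with
dead_tuple_trigger set, the model yields an aggressive but still
cancellable vacuum; without it, nothing is launched at all until the
age passes autovacuum_freeze_max_age.
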
But I find it far easier to believe that there simply was no such
aggressive autovacuum in the first place (not this time), since it
could have only happened when autovacuum thinks that there are
sufficiently many dead tuples to justify launching an autovacuum in the
first place. Which, as we now all accept, is based on highly dubious
sampling by ANALYZE. So I think it's much more likely to be that factor
(dead tuple accounting is bad), as well as the absurd false dichotomy
between aggressive and non-aggressive -- plus the issue at hand, the
auto-cancellation behavior.

I don't claim to know what is inevitable, or what is guaranteed to work
or not work. I only claim that we can meaningfully reduce the absolute
risk by using a fairly simple approach, principally by not needlessly
coupling the auto-cancellation behavior to *all* autovacuums that are
specifically triggered by age(relfrozenxid). As Andres said at one
point, doing those two things at exactly the same time is just
arbitrary.

--
Peter Geoghegan