Hi, On 2019-03-29 20:51:38 +0900, Michael Paquier wrote: > So, coming back to this thread, and studying the problem again, it > looks that the diagnostic that a non-aggressive, anti-wraparound > vacuum could be triggered because the worker sees trouble in the > force because of some activity happening in parallel. Hence, if we > face this case, it looks right to skip the vacuum for this relation. > > Attached is an updated patch with a better error message, more > comments, and the removal of the anti-wraparound non-aggressive log > which was added in 28a8fa9. The error message could be better I > guess. Suggestions are welcome.
> diff --git a/src/backend/access/heap/vacuumlazy.c > b/src/backend/access/heap/vacuumlazy.c > index 5c554f9465..82be8c81f3 100644 > --- a/src/backend/access/heap/vacuumlazy.c > +++ b/src/backend/access/heap/vacuumlazy.c > @@ -248,6 +248,25 @@ heap_vacuum_rel(Relation onerel, VacuumParams *params, > if (params->options & VACOPT_DISABLE_PAGE_SKIPPING) > aggressive = true; > > + /* > + * When running an anti-wraparound vacuum, we expect relfrozenxid to be > + * old enough so as aggressive is always set. If this is not the case, > + * it could be possible that another concurrent vacuum process has done > + * the work for this relation so that relfrozenxid in relcache has > + * already been moved forward enough, causing this vacuum run to be > + * non-aggressive. If that happens, note that this relation no longer > + * needs to be vacuumed, so just skip it. > + */ > + if (params->is_wraparound && !aggressive) > + { > + ereport(LOG, > + (errmsg_internal("found vacuum to prevent > wraparound of table \"%s.%s.%s\" to be not aggressive, so skipping", > + > get_database_name(MyDatabaseId), > + > get_namespace_name(RelationGetNamespace(onerel)), > + > RelationGetRelationName(onerel)))); > + return; > + } > + Which valid scenario can lead to this? Neither the comment, nor commit message explain it. Unless you're thinking of scenarios where autovacuum and manual vacuum are mixed, I don't really see valid reasons? Normally autovacuum's locking + the table_recheck_autovac() check should prevent problematic scenarios. I do see a few scenarios that can trigger this - but they all more or less are bugs. It doesn't strike me as a good idea to work around such bugs by silently neutering heap_vacuum_rel(). The likelihood of that temporarily covering up more severe problems seems significant - they're likely to then later bite you with a cluster shutdown. Greetings, Andres Freund