Hi, On 2022-03-30 17:50:42 -0700, Peter Geoghegan wrote: > I tried several times to recreate this issue on CI. No luck with that, > though -- can't get it to fail again after 4 attempts.
It's really annoying that we don't have Assert variants that show the compared values, that might make it easier to intepret what's going on. Something vaguely like EXPECT_EQ_U32 in regress.c. Maybe AssertCmp(type, a, op, b), Then the assertion could have been something like AssertCmp(int32, diff, >, 0) Does the line number in the failed run actually correspond to the xid, rather than the mxid case? I didn't check. You could try to increase the likelihood of reproducing the failure by duplicating the invocation that lead to the crash a few times in the .cirrus.yml file in your dev branch. That might allow hitting the problem more quickly. Maybe reduce autovacuum_naptime in src/tools/ci/pg_ci_base.conf? Or locally - one thing that windows CI does different from the other platforms is that it runs isolation, contrib and a bunch of other tests using the same cluster. Which of course increases the likelihood of autovacuum having stuff to do, *particularly* on shared relations - normally there's probably not enough changes for that. You can do something similar locally on linux with make -Otarget -C contrib/ -j48 -s USE_MODULE_DB=1 installcheck prove_installcheck=true (the prove_installcheck=true to prevent tap tests from running, we don't seem to have another way for that) I don't think windows uses USE_MODULE_DB=1, but it allows to cause a lot more load concurrently than running tests serially... > We know that this particular assertion did not fail during the same VACUUM: > > Assert(vacrel->NewRelfrozenXid == OldestXmin || > TransactionIdPrecedesOrEquals(vacrel->relfrozenxid, > vacrel->NewRelfrozenXid)); The comment in your patch says "is either older or newer than FreezeLimit" - I assume that's some rephrasing damage? > So it's hard to see how this could be a bug in the patch -- the final > new relfrozenxid is presumably equal to VACUUM's OldestXmin in the > problem scenario seen on the CI Windows instance yesterday (that's why > this earlier assertion didn't fail). Perhaps it's worth commiting improved assertions on master? If this is indeed a pre-existing bug, and we're just missing due to slightly less stringent asserts, we could rectify that separately. > The surprising part of the CI failure must have taken place just after > this assertion, when VACUUM's call to vacuum_set_xid_limits() actually > updates pg_class.relfrozenxid with vacrel->NewRelfrozenXid -- > presumably because the existing relfrozenxid appeared to be "in the > future" when we examine it in pg_class again. We see evidence that > this must have happened afterwards, when the closely related assertion > (used only in instrumentation code) fails: Hm. This triggers some vague memories. There's some oddities around shared relations being vacuumed separately in all the databases and thus having separate horizons. After "remembering" that, I looked in the cirrus log for the failed run, and the worker was processing shared a shared relation last: 2022-03-30 03:48:30.238 GMT [5984][autovacuum worker] LOG: automatic analyze of table "contrib_regression.pg_catalog.pg_authid" Obviously that's not a guarantee that the next table processed also is a shared catalog, but ... Oh, the relid is actually in the stack trace. 0x4ee = 1262 = pg_database. Which makes sense, the test ends up with a high percentage of dead rows in pg_database, due to all the different contrib tests creating/dropping a database. > From my patch: > > > if (frozenxid_updated) > > { > > - diff = (int32) (FreezeLimit - vacrel->relfrozenxid); > > + diff = (int32) (vacrel->NewRelfrozenXid - > > vacrel->relfrozenxid); > > + Assert(diff > 0); > > appendStringInfo(&buf, > > _("new relfrozenxid: %u, which is %d xids > > ahead of previous value\n"), > > - FreezeLimit, diff); > > + vacrel->NewRelfrozenXid, diff); > > } Perhaps this ought to be an elog() instead of an Assert()? Something has gone pear shaped if we get here... It's a bit annoying though, because it'd have to be a PANIC to be visible on the bf / CI :(. Greetings, Andres Freund