On Thu, Dec 28, 2023 at 11:42 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.mu...@gmail.com> writes:
> > In CommitTransaction() there is a stretch of code beginning s->state =
> > TRANS_COMMIT and ending s->state = TRANS_DEFAULT, from which we call
> > out to various subsystems' AtEOXact_XXX() functions.  There is no way
> > to roll back in that state, so anything that throws ERROR from those
> > routines is going to get something much like $SUBJECT.  Hmm, we'd know
> > which exact code path got that EIO from your smoldering core if we'd
> > put an explicit critical section there (if we're going to PANIC
> > anyway, it might as well not be from a different stack after
> > longjmp()...).
>
> +1, there's basically no hope of debugging this sort of problem
> as things stand.

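To make the quoted point concrete, here is a toy, self-contained C
sketch. It is not PostgreSQL code: every toy_* name is made up for
illustration, and the counter only imitates the idea of a critical
section. It shows that with the counter raised, the failure is flagged
while the failing frame is still on the stack; without it, the failure
is only noticed after longjmp() has thrown that frame away.

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>

/* Toy stand-ins only; none of these names exist in PostgreSQL. */
typedef enum { TOY_TRANS_INPROGRESS, TOY_TRANS_COMMIT, TOY_TRANS_DEFAULT } ToyTransState;

typedef enum
{
    TOY_NO_PANIC,               /* commit finished normally */
    TOY_PANIC_AT_FAILURE_SITE,  /* flagged with the failing frame still on the stack */
    TOY_PANIC_AFTER_UNWIND      /* flagged later, after longjmp() discarded that frame */
} ToyPanicSite;

static ToyTransState toy_state = TOY_TRANS_INPROGRESS;
static int           toy_crit_section_count = 0;
static ToyPanicSite  toy_panic_site = TOY_NO_PANIC;
static jmp_buf       toy_top_level;

/* Toy version of raising ERROR from inside a cleanup routine. */
static void
toy_raise_error(void)
{
    /* Inside the toy critical section the error is promoted on the spot. */
    if (toy_crit_section_count > 0)
        toy_panic_site = TOY_PANIC_AT_FAILURE_SITE;

    /*
     * The toy unwinds in both cases so the caller can report the outcome;
     * a real PANIC would abort() here instead of unwinding.
     */
    longjmp(toy_top_level, 1);
}

/* Toy version of one AtEOXact_XXX()-style step that may hit an I/O error. */
static void
toy_at_eoxact_step(bool io_fails)
{
    assert(toy_state == TOY_TRANS_COMMIT);
    if (io_fails)
        toy_raise_error();
}

/* Run the toy commit sequence and report where the failure was flagged. */
static ToyPanicSite
toy_commit(bool use_crit_section, bool io_fails)
{
    toy_panic_site = TOY_NO_PANIC;
    toy_crit_section_count = 0;
    toy_state = TOY_TRANS_COMMIT;   /* no way to roll back from here on */

    if (setjmp(toy_top_level) != 0)
    {
        /* We arrive here via longjmp(), still in TOY_TRANS_COMMIT. */
        if (toy_panic_site == TOY_NO_PANIC)
            toy_panic_site = TOY_PANIC_AFTER_UNWIND;
        return toy_panic_site;
    }

    if (use_crit_section)
        toy_crit_section_count++;
    toy_at_eoxact_step(io_fails);
    if (use_crit_section)
        toy_crit_section_count--;

    toy_state = TOY_TRANS_DEFAULT;
    return TOY_NO_PANIC;
}
```

Calling toy_commit(false, true) reports TOY_PANIC_AFTER_UNWIND, while
toy_commit(true, true) reports TOY_PANIC_AT_FAILURE_SITE, which is the
case where a core file would still show the code path that hit the
error.
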
I was reminded of this thread by Justin's other file system snafu
thread.

Naively defining a critical section to match the extent of the
TRANS_COMMIT state doesn't work, as a bunch of code under there uses
palloc().  That reminds me of the nearby RelationTruncate() thread, and
there is possibly even some overlap, plus more in this case... ugh.

Hmm, AtEOXact_RelationMap() is one of those steps, but it lives just
outside the crypto-critical-section created by TRANS_COMMIT, though it
has its own normal CS for logging.  I wonder, given that "updating the
map file is effectively commit of the relocation", why wouldn't it have
a variant of the problem solved by DELAY_CHKPT_START for normal commit
records, under diabolical scheduling?  It's a stretch, but:

1. You log XLOG_RELMAP_UPDATE.
2. A concurrent checkpoint runs with REDO after that record.
3. You crash before/during durable_rename().
4. You perform crash recovery.

Now your catalog is still using the old relfilenode on the primary, but
any replica following along replays XLOG_RELMAP_UPDATE and is using the
new relfilenode, frozen in time, for queries, while replaying changes to
the old relfilenode.  Right?
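
The interleaving asked about above can be written down as a toy model in
C. This is only a sketch of the scenario as posed: it does not show
whether PostgreSQL actually permits that schedule. All names and
numbers (toy_relmap_crash, the relfilenode values 100 and 200, LSN 10)
are hypothetical, and the checkpoint_waits_for_map_write flag is an
assumed stand-in for a DELAY_CHKPT_START-style interlock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: all names and numbers here are hypothetical. */
enum { TOY_OLD_RELFILENODE = 100, TOY_NEW_RELFILENODE = 200 };
enum { TOY_RELMAP_UPDATE_LSN = 10 };

typedef struct ToyOutcome
{
    int primary_relfilenode;    /* primary's map after crash recovery */
    int replica_relfilenode;    /* replica's map after replaying the WAL */
} ToyOutcome;

static ToyOutcome
toy_relmap_crash(bool checkpoint_waits_for_map_write)
{
    ToyOutcome out;
    int primary_map_on_disk = TOY_OLD_RELFILENODE;
    int redo_lsn = 0;           /* REDO pointer of the last completed checkpoint */

    /* Step 1: the map update is WAL-logged at TOY_RELMAP_UPDATE_LSN. */

    /*
     * Step 2: a concurrent checkpoint.  If nothing makes it wait for the
     * map file write, it completes with REDO after the update record.
     * If it has to wait, it has not completed when the crash happens, so
     * the previous REDO pointer stays in effect.
     */
    if (!checkpoint_waits_for_map_write)
        redo_lsn = TOY_RELMAP_UPDATE_LSN + 1;

    /* Step 3: crash before the rename; the map file on disk is unchanged. */

    /* Step 4: crash recovery replays only records at or after REDO. */
    if (TOY_RELMAP_UPDATE_LSN >= redo_lsn)
        primary_map_on_disk = TOY_NEW_RELFILENODE;

    /* A replica following along replays every record, including this one. */
    out.primary_relfilenode = primary_map_on_disk;
    out.replica_relfilenode = TOY_NEW_RELFILENODE;
    assert(out.replica_relfilenode != TOY_OLD_RELFILENODE);
    return out;
}
```

In this model toy_relmap_crash(false) leaves the primary on 100 and the
replica on 200, which is the divergence described above, while
toy_relmap_crash(true) leaves both on 200.
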