On Thu, Jan 16, 2025 at 12:52:54PM -0800, Noah Misch wrote: > On Thu, Jan 16, 2025 at 04:50:09PM +0900, Michael Paquier wrote: > > As far as I understand, the most important point of the logic is to > > detect and discard the future files first in restoreTwoPhaseData() -> > > ProcessTwoPhaseBuffer() when scanning the contents of pg_twophase at > > the beginning of recovery. Once this filtering is done, it should be > > safe to use your FullTransactionIdFromAllowableAt() when doing > > the fxid <-> xid transitions between the records and the files on disk > > flushed by a restartpoint which store an XID, and the shmem state of > > GlobalTransactionData with a fxid. > > (I did not expect that a function called restoreTwoPhaseData() would run > before > a function called PrescanPreparedTransactions(), but so it is.) > > How is it that restoreTwoPhaseData() -> ProcessTwoPhaseBuffer() can safely > call TransactionIdDidAbort() when we've not replayed WAL to make CLOG > consistent? What can it assume about the value of TransamVariables->nextXid > at that early time?
I think it can't use CLOG safely. I regret answering my own question now as opposed to refraining from asking it, but hopefully this will save you time. Since EndPrepare() flushes the XLOG_XACT_PREPARE record before anything writes to pg_twophase, presence of a pg_twophase file tells us its record once existed in the flushed WAL stream. Even before we've started the server on a base backup, each pg_twophase file pertains to an XLOG_XACT_PREPARE LSN no later than the end-of-backup checkpoint. If a later file exists, the backup copied a file after the end-of-backup checkpoint started, which is a violation of the base backup spec. Elsewhere in recovery, we have the principle that data directory content means little until we reach a consistent state by replaying the end-of-backup checkpoint or by crash recovery reaching the end of WAL. For twophase, that calls for ignoring pg_twophase content. If a restartpoint has a pg_twophase file to write, we should allow that to overwrite an old file iff we've not reached consistency. (We must not read the old pg_twophase file, which may contain an unfinished write.) By the time we reach consistency, every file in pg_twophase will be applicable (not committed or aborted). Each file either predates the start-of-backup checkpoint (or crash recovery equivalent), or recovery wrote that file. We need readdir(pg_twophase) only at end of recovery, when we need _twophase_recover callbacks to restore enough state to accept writes. (During hot standby, state from XLOG_RUNNING_XACTS suffices[1] for the actions possible in that mode.) In other words, the root problem that led to commits e358425 and 7e125b2 was recovery interpreting pg_twophase file content before reaching consistency. We can't make the ProcessTwoPhaseBuffer() checks safe before reaching consistency. Is that right? Once ProcessTwoPhaseBuffer() stops running before consistency, it won't strictly need the range checks. We may as well have something like those range checks as a best-effort way to detect base backup spec violations. Thanks, nm [1] Incidentally, the comment lock_twophase_standby_recover incorrectly claims it runs "when starting up into hot standby mode."