Re: An improvement of ProcessTwoPhaseBuffer logic

Noah Misch Thu, 16 Jan 2025 16:52:43 -0800

On Thu, Jan 16, 2025 at 12:52:54PM -0800, Noah Misch wrote:
> On Thu, Jan 16, 2025 at 04:50:09PM +0900, Michael Paquier wrote:
> > As far as I understand, the most important point of the logic is to
> > detect and discard the future files first in restoreTwoPhaseData() ->
> > ProcessTwoPhaseBuffer() when scanning the contents of pg_twophase at
> > the beginning of recovery.  Once this filtering is done, it should be
> > safe to use your FullTransactionIdFromAllowableAt() when doing
> > the fxid <-> xid transitions between the records and the files on disk
> > flushed by a restartpoint which store an XID, and the shmem state of
> > GlobalTransactionData with a fxid.
> 
> (I did not expect that a function called restoreTwoPhaseData() would run 
> before
> a function called PrescanPreparedTransactions(), but so it is.)
> 
> How is it that restoreTwoPhaseData() -> ProcessTwoPhaseBuffer() can safely
> call TransactionIdDidAbort() when we've not replayed WAL to make CLOG
> consistent?  What can it assume about the value of TransamVariables->nextXid
> at that early time?


I think it can't use CLOG safely.  I regret answering my own question now as
opposed to refraining from asking it, but hopefully this will save you time.

Since EndPrepare() flushes the XLOG_XACT_PREPARE record before anything writes
to pg_twophase, presence of a pg_twophase file tells us its record once
existed in the flushed WAL stream.  Even before we've started the server on a
base backup, each pg_twophase file pertains to an XLOG_XACT_PREPARE LSN no
later than the end-of-backup checkpoint.  If a later file exists, the backup
copied a file after the end-of-backup checkpoint started, which is a violation
of the base backup spec.

Elsewhere in recovery, we have the principle that data directory content means
little until we reach a consistent state by replaying the end-of-backup
checkpoint or by crash recovery reaching the end of WAL.  For twophase, that
calls for ignoring pg_twophase content.  If a restartpoint has a pg_twophase
file to write, we should allow that to overwrite an old file iff we've not
reached consistency.  (We must not read the old pg_twophase file, which may
contain an unfinished write.)

By the time we reach consistency, every file in pg_twophase will be applicable
(not committed or aborted).  Each file either predates the start-of-backup
checkpoint (or crash recovery equivalent), or recovery wrote that file.  We
need readdir(pg_twophase) only at end of recovery, when we need
_twophase_recover callbacks to restore enough state to accept writes.  (During
hot standby, state from XLOG_RUNNING_XACTS suffices[1] for the actions
possible in that mode.)

In other words, the root problem that led to commits e358425 and 7e125b2 was
recovery interpreting pg_twophase file content before reaching consistency.
We can't make the ProcessTwoPhaseBuffer() checks safe before reaching
consistency.  Is that right?  Once ProcessTwoPhaseBuffer() stops running
before consistency, it won't strictly need the range checks.  We may as well
have something like those range checks as a best-effort way to detect base
backup spec violations.

Thanks,
nm

[1] Incidentally, the comment lock_twophase_standby_recover incorrectly claims
it runs "when starting up into hot standby mode."

Re: An improvement of ProcessTwoPhaseBuffer logic

Reply via email to