Hey everyone,

I've discovered a serious bug that leads to a server crash upon promoting an 
instance that crashed previously and did
recovery in standby mode.

The bug is present in PostgreSQL versions 13 and 14 (and in earlier versions, 
though it doesn't manifest itself so
catastrophically).
The circumstances to trigger the bug are as follows:
- postgresql is configured for hot_standby, archiving, and prepared transactions
- prepare a transaction
- crash postgresql
- create standby.signal file
- start postgresql, wait for recovery to finish
- promote

The promotion will fail with a FATAL error, stating that "requested WAL segment 
.* has already been removed".
The FATAL error causes the startup process to exit, so postmaster shuts down 
again.

Here's an exemplary log output, maybe this helps people to find this issue when 
they search for it online:

LOG:  consistent recovery state reached at 0/15D8AB0
LOG:  database system is ready to accept read only connections
LOG:  received promote request
LOG:  redo done at 0/15D89B8
LOG:  last completed transaction was at log time 2023-06-16 13:09:53.71118+02
LOG:  selected new timeline ID: 2
LOG:  archive recovery complete
FATAL:  requested WAL segment pg_wal/000000010000000000000001 has already been 
removed
LOG:  startup process (PID 1650358) exited with exit code 1
LOG:  terminating any other active server processes
LOG:  database system is shut down


The cause of this failure is an oversight (rather obvious in hindsight):
The renaming of the WAL file (that was last written to before the crash 
happened) to .partial is done *before* PostgreSQL
might have to read this very file to recover prepared transactions from it.
The relevant function calls here are durable_rename() and 
RecoverPreparedTransactions() in xlog.c .

Note that it is important that the PREPARE entry is in the WAL file that 
PostgreSQL is writing to prior to the inital
crash.
This has happened repeatedly in production already with a customer that uses 
prepared transactions quite frequently.
I assume that this has happened for others too, but the circumstances of the 
crash and the cause are very dubious, and
troubleshooting it is pretty difficult.


This behaviour has - apparently unintentionally - been fixed in PG 15 and 
upwards (see commit 811051c ), as part of a
general restructure and reorganization of this portion of xlog.c (see commit 
6df1543 ).

Furthermore, it seems this behaviour does not appear in PG 12 and older, due to 
another possible bug:
In PG 13 and newer, the XLogReaderState is reset in XLogBeginRead() before 
reading WAL in XlogReadTwoPhaseData() in
twophase.c .
In the older releases (PG <= 12), this reset is not done, so the requested LSN 
containing the prepared transaction can
(by happy coincidence) be read from in-memory buffers, and PostgreSQL 
consequently manages to come up just fine (as the
WAL has already been read into buffers prior to the .partial rename).
If the older releases also where to properly reset the XLogReaderState, they 
would also fail to find the LSN on disk, and
hence PostgreSQL would crash again.

I've attached patches for PG 14 and PG 13 that mimic the change in PG15 (commit 
811051c ) and reorder the crucial events,
placing the recovery of prepared transactions *before* renaming the file.
I've also attached recovery test scripts for PG >= 12 and PG <= 11 that can be 
used to verify that promote after recovery
with prepared transactions works.

A note for myself in the future and whomever may find it useful:
The test can be copied to src/test/recovery/t/ and selectively run (after 
you've ./configure'd for TAP testing and
compiled everything) from within the src/test/recovery directory using 
something like:
    make check PROVE_TESTS='t/PG_geq_12_promote_prepare_xact.pl'


My humble opinion is that this fix should be backpatched to PG 14 and PG 13.
It's debatable whether the fix needs to be brought back to 12 and older also, 
as those do not exhibit this issue, but the
order of renaming is still wrong.
I'm not sure if there could be cases where the in-memory buffers of the 
walreader are too small to cover a whole WAL
file.
There could also be other issues from operations that require reading WAL that 
happen after the .partial rename, I
haven't checked in depth what else happens in the affected codepath.
Please let me know if you think this should also be fixed in PG 12 and earlier, 
so I can produce the patches for those
versions as well.


Kind regards
Julian

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a05a82119e..e0d9b89d78 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7858,6 +7858,60 @@ StartupXLOG(void)
 			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
 	}
 
+	/*
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
+	TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up the commit log and subtrans, if not already done for hot
+	 * standby.  (commit timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+	{
+		StartupCLOG();
+		StartupSUBTRANS(oldestActiveXID);
+	}
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
 	if (ArchiveRecoveryRequested)
 	{
 		/*
@@ -7939,60 +7993,6 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
-	TransactionIdRetreat(ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up the commit log and subtrans, if not already done for hot
-	 * standby.  (commit timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-	{
-		StartupCLOG();
-		StartupSUBTRANS(oldestActiveXID);
-	}
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
-
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
 	 * commit timestamp.
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 134a6481b9..fa1d3b8536 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8011,6 +8011,57 @@ StartupXLOG(void)
 			CreateCheckPoint(CHECKPOINT_END_OF_RECOVERY | CHECKPOINT_IMMEDIATE);
 	}
 
+	/*
+	 * Preallocate additional log files, if wanted.
+	 */
+	PreallocXlogFiles(EndOfLog);
+
+	/*
+	 * Okay, we're officially UP.
+	 */
+	InRecovery = false;
+
+	/* start the archive_timeout timer and LSN running */
+	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+	XLogCtl->lastSegSwitchLSN = EndOfLog;
+
+	/* also initialize latestCompletedXid, to nextXid - 1 */
+	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
+	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
+	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
+	LWLockRelease(ProcArrayLock);
+
+	/*
+	 * Start up subtrans, if not already done for hot standby.  (commit
+	 * timestamps are started below, if necessary.)
+	 */
+	if (standbyState == STANDBY_DISABLED)
+		StartupSUBTRANS(oldestActiveXID);
+
+	/*
+	 * Perform end of recovery actions for any SLRUs that need it.
+	 */
+	TrimCLOG();
+	TrimMultiXact();
+
+	/* Reload shared-memory state for prepared transactions */
+	RecoverPreparedTransactions();
+
+	/* Shut down xlogreader */
+	if (readFile >= 0)
+	{
+		close(readFile);
+		readFile = -1;
+	}
+	XLogReaderFree(xlogreader);
+
+	/*
+	 * If any of the critical GUCs have changed, log them before we allow
+	 * backends to write WAL.
+	 */
+	LocalSetXLogInsertAllowed();
+	XLogReportParameters();
+
 	if (ArchiveRecoveryRequested)
 	{
 		/*
@@ -8092,57 +8143,6 @@ StartupXLOG(void)
 		}
 	}
 
-	/*
-	 * Preallocate additional log files, if wanted.
-	 */
-	PreallocXlogFiles(EndOfLog);
-
-	/*
-	 * Okay, we're officially UP.
-	 */
-	InRecovery = false;
-
-	/* start the archive_timeout timer and LSN running */
-	XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-	XLogCtl->lastSegSwitchLSN = EndOfLog;
-
-	/* also initialize latestCompletedXid, to nextXid - 1 */
-	LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
-	ShmemVariableCache->latestCompletedXid = ShmemVariableCache->nextXid;
-	FullTransactionIdRetreat(&ShmemVariableCache->latestCompletedXid);
-	LWLockRelease(ProcArrayLock);
-
-	/*
-	 * Start up subtrans, if not already done for hot standby.  (commit
-	 * timestamps are started below, if necessary.)
-	 */
-	if (standbyState == STANDBY_DISABLED)
-		StartupSUBTRANS(oldestActiveXID);
-
-	/*
-	 * Perform end of recovery actions for any SLRUs that need it.
-	 */
-	TrimCLOG();
-	TrimMultiXact();
-
-	/* Reload shared-memory state for prepared transactions */
-	RecoverPreparedTransactions();
-
-	/* Shut down xlogreader */
-	if (readFile >= 0)
-	{
-		close(readFile);
-		readFile = -1;
-	}
-	XLogReaderFree(xlogreader);
-
-	/*
-	 * If any of the critical GUCs have changed, log them before we allow
-	 * backends to write WAL.
-	 */
-	LocalSetXLogInsertAllowed();
-	XLogReportParameters();
-
 	/*
 	 * Local WAL inserts enabled, so it's time to finish initialization of
 	 * commit timestamp.

Attachment: PG_geq_12_promote_prepare_xact.pl
Description: Perl program

Attachment: PG_leq_11_promote_prepare_xact.pl
Description: Perl program

Reply via email to