From bb906c85b1ac96e3ae2b9cda3edd50d6d979b442 Mon Sep 17 00:00:00 2001
From: Pavan Deolasee <pavan.deolasee@gmail.com>
Date: Wed, 11 Apr 2018 22:56:03 +0530
Subject: [PATCH 1/2] Do not overwrite the nextOid counter while replaying
 ONLINE checkpoint

This is a long standing bug in the replay of ONLINE checkpoint record which may
lead to duplicate OID assignment post a crash recovery or standby promotion.
For example, if may have a situation where an ONLINE checkpoint begins,we then
consume all cached OIDs, allocate a new set of OIDs and generate a XLOG_NEXTOID
record and consume more OIDs. The ONLINE checkpoint then ends and we write a
checkpoint record. This checkpoint record will now have OID counter less than
the OIDs written to the disk.

For that reason, in case of a crash, followed by REDO recovery recovery or
while replaying the ONLINE checkpoint on the standby, we shouldn't overwrite
the current nextOid counter with what's stored in the checkpoint record.

This bug prominently brought out another bug in the TOAST and should be
backpatched to all supported releases.
---
 src/backend/access/transam/xlog.c | 28 +++++++++++++++++++++++-----
 1 file changed, 23 insertions(+), 5 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a51df6b0b9..ab1e209cd3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9785,11 +9785,29 @@ xlog_redo(XLogReaderState *record)
 								  checkPoint.nextXid))
 			ShmemVariableCache->nextXid = checkPoint.nextXid;
 		LWLockRelease(XidGenLock);
-		/* ... but still treat OID counter as exact */
-		LWLockAcquire(OidGenLock, LW_EXCLUSIVE);
-		ShmemVariableCache->nextOid = checkPoint.nextOid;
-		ShmemVariableCache->oidCount = 0;
-		LWLockRelease(OidGenLock);
+
+		/*
+		 * In an ONLINE checkpoint, don't update the nextOid counter. When the
+		 * REDO recovery began, we must have initialized nextOid counter to the
+		 * value stored in the REDO checkpoint record. If there was no
+		 * XLOG_NEXTOID record replayed since the REDO checkpoint, then we know
+		 * that the current value is correct. If we replayed a XLOG_NEXTOID
+		 * record, then we should rather use the value advanced by the
+		 * XLOG_NEXTOID and not overwrite with a potentially old value, thus
+		 * issuing duplicate OIDs when we are fully up.
+		 *
+		 * If you're still curious why that can happen, note that the ONLINE
+		 * checkpoint may have started before we consumed all the cached OIDs
+		 * and requested another block of OIDs, generating a XLOG_NEXTOID
+		 * record. So the value noted in the checkpoint record would be stale.
+		 *
+		 * Note that since we don't have a concept of wrap-around detection for
+		 * OIDs, like we have for XIDs, we can't merely check if the value
+		 * stored in the ONLINE checkpoint record is greater than the current
+		 * value and apply the change only in that case like we do for nextXid.
+		 */
+
+		/* Handle multixact */
 		MultiXactAdvanceNextMXact(checkPoint.nextMulti,
 								  checkPoint.nextMultiOffset);
 
-- 
2.14.3 (Apple Git-98)