[draft] demote primary to standby

Jehan-Guillaume de Rorthais Thu, 12 Jun 2025 10:15:26 -0700

Hi all,

I went back to my old patch about the "demote" action[1].

# Why?

The goal of the "demote" action is the opposite of the "promote" one: it
turns an "in production" instance into a standby one. This is a required step
to make cluster management easier, for instance when switching over the
primary role from one node to another. This is a common cluster scenario
during, e.g., software maintenance or upgrades.

This can also be useful for clusterware where an instance's state can be
changed in either direction. At least Patroni, pg_auto_failover and the
Pacemaker agents have some demote-related code that actually shuts down the
instance and sometimes starts it back as a standby.

# State of the patch

This patch is a draft. I removed a lot of code and complexity compared to my
previous version. It only focuses on demoting a node using a "demote
checkpoint" after stopping all existing user backends.
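
As a rough illustration of how the "demote checkpoint" relates to the
existing checkpoint kinds, here is a minimal standalone sketch. The SKETCH_*
names and bit values are made up for this example and are not the PostgreSQL
definitions; only the decision logic mirrors what the draft does in
CreateCheckPoint().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bits, for illustration only (not the real values). */
#define SKETCH_IS_SHUTDOWN		0x01
#define SKETCH_END_OF_RECOVERY	0x02
#define SKETCH_IMMEDIATE		0x04
#define SKETCH_IS_DEMOTE		0x08
#define SKETCH_ALL_FLAGS		0x0F

typedef enum SketchRecordKind
{
	SKETCH_REC_ONLINE,
	SKETCH_REC_SHUTDOWN,
	SKETCH_REC_DEMOTE
} SketchRecordKind;

/* A demote checkpoint is handled like a shutdown checkpoint... */
bool
sketch_is_shutdown_like(int flags)
{
	assert((flags & ~SKETCH_ALL_FLAGS) == 0);
	return (flags & (SKETCH_IS_SHUTDOWN | SKETCH_END_OF_RECOVERY |
					 SKETCH_IS_DEMOTE)) != 0;
}

/* ...but it is logged under its own WAL record type. */
SketchRecordKind
sketch_record_kind(int flags)
{
	if (flags & SKETCH_IS_DEMOTE)
		return SKETCH_REC_DEMOTE;
	if (sketch_is_shutdown_like(flags))
		return SKETCH_REC_SHUTDOWN;
	return SKETCH_REC_ONLINE;
}
```

Keeping a distinct record type is what lets the startup process tell a demote
apart from a plain shutdown when it reads the checkpoint back.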

This is a first incremental step just to settle various complexities before
moving ahead.

The tests focus on 2PC, some limited writes and checking if everything is
following/surviving gracefully during demote/promote/switchover.
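
The control file state sequence exercised by a demote/promote round trip can
be pictured with a small standalone model. This is a sketch under
assumptions: the SKETCH_* names are invented here, and the transitions are a
simplified reading of how this draft sets ControlFile->state (DB_DEMOTING in
CreateCheckPoint(), DB_IN_ARCHIVE_RECOVERY in InitWalRecovery()), not the
actual implementation.

```c
#include <assert.h>

typedef enum SketchDBState
{
	SKETCH_DB_IN_PRODUCTION,
	SKETCH_DB_DEMOTING,
	SKETCH_DB_IN_ARCHIVE_RECOVERY
} SketchDBState;

typedef enum SketchEvent
{
	SKETCH_EV_DEMOTE_CHECKPOINT,	/* checkpointer writes the demote checkpoint */
	SKETCH_EV_RECOVERY_STARTS,		/* startup process enters archive recovery */
	SKETCH_EV_PROMOTE				/* standby is promoted */
} SketchEvent;

/* Return the next state; events that do not apply leave the state as is. */
SketchDBState
sketch_next_state(SketchDBState state, SketchEvent event)
{
	assert(state <= SKETCH_DB_IN_ARCHIVE_RECOVERY);

	switch (event)
	{
		case SKETCH_EV_DEMOTE_CHECKPOINT:
			if (state == SKETCH_DB_IN_PRODUCTION)
				return SKETCH_DB_DEMOTING;
			break;
		case SKETCH_EV_RECOVERY_STARTS:
			if (state == SKETCH_DB_DEMOTING)
				return SKETCH_DB_IN_ARCHIVE_RECOVERY;
			break;
		case SKETCH_EV_PROMOTE:
			if (state == SKETCH_DB_IN_ARCHIVE_RECOVERY)
				return SKETCH_DB_IN_PRODUCTION;
			break;
	}
	return state;
}
```

A switchover is then one such demote on the old primary combined with a
promote on the chosen standby.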

# TODO

* cleaner code to handle various subsystems

The current draft only loosely deals with various subsystems from
ShutdownXLOG() and StartupXLOG(). The code in StartupXLOG() came out of a
test/fail/retry session, so things /seem/ to work.

A serious study and a better code flow are needed to re-initialize the
subsystems correctly, with better explanations and comments.

I try to deal with MultiXact and Prepared Transactions in this patch, let me
know what you think of this code design.

* remove the checkpoint

Robert Haas pointed out in a previous discussion that the checkpoint was an
issue, as the demote action needs to be fast to be useful. Waiting minutes
for the checkpoint to finish in various scenarios is not acceptable, as
expressed here:

https://www.postgresql.org/message-id/flat/CA%2BTgmoYe8uCgtYFGfnv3vWpZTygsdkSu2F4MNiqhkar_UKbWfQ%40mail.gmail.com#2a488f8c2ea1696197d9edf98fcb4472

I still have to think about how the demote can be executed without this
checkpoint. Any advice, warning or idea about this topic is appreciated.
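
Regarding the Prepared Transactions handling from the first TODO item:
ShutdownPreparedTransactions() walks each 2PC state file record by record
until an end marker, acting only on lock records. The standalone sketch below
shows that kind of aligned record walk. All SKETCH_* names, the header layout
and the 8-byte alignment are assumptions made for this example, not the
PostgreSQL definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for illustration; not the PostgreSQL definitions. */
#define SKETCH_ALIGN(len)	(((size_t) (len) + 7) & ~(size_t) 7)
#define SKETCH_RM_END_ID	0
#define SKETCH_RM_LOCK_ID	1
#define SKETCH_RM_OTHER_ID	2

typedef struct SketchRecordHeader
{
	uint32_t	len;		/* payload length, header excluded */
	uint8_t		rmid;		/* resource manager owning the payload */
} SketchRecordHeader;

/* Append one aligned record to buf; returns the next write offset. */
size_t
sketch_append(char *buf, size_t cap, size_t off, uint8_t rmid,
			  const void *data, uint32_t len)
{
	SketchRecordHeader hdr;

	assert(off + SKETCH_ALIGN(sizeof(hdr)) + SKETCH_ALIGN(len) <= cap);
	memset(&hdr, 0, sizeof(hdr));
	hdr.len = len;
	hdr.rmid = rmid;
	memcpy(buf + off, &hdr, sizeof(hdr));
	off += SKETCH_ALIGN(sizeof(hdr));
	if (len > 0)
		memcpy(buf + off, data, len);
	return off + SKETCH_ALIGN(len);
}

/* Walk the records up to the end marker, counting the lock records. */
int
sketch_count_lock_records(const char *buf)
{
	size_t		off = 0;
	int			nlocks = 0;

	for (;;)
	{
		SketchRecordHeader hdr;

		memcpy(&hdr, buf + off, sizeof(hdr));
		if (hdr.rmid == SKETCH_RM_END_ID)
			break;
		off += SKETCH_ALIGN(sizeof(hdr));	/* skip the header */
		if (hdr.rmid == SKETCH_RM_LOCK_ID)
			nlocks++;	/* the patch calls lock_twophase_shutdown() here */
		off += SKETCH_ALIGN(hdr.len);		/* skip the payload */
	}
	return nlocks;
}

/* Build a small buffer: lock, other, lock, end marker; then walk it. */
int
sketch_demo(void)
{
	char		buf[128];
	size_t		off = 0;

	memset(buf, 0, sizeof(buf));
	off = sketch_append(buf, sizeof(buf), off, SKETCH_RM_LOCK_ID, "abc", 3);
	off = sketch_append(buf, sizeof(buf), off, SKETCH_RM_OTHER_ID, "0123456789", 10);
	off = sketch_append(buf, sizeof(buf), off, SKETCH_RM_LOCK_ID, NULL, 0);
	(void) sketch_append(buf, sizeof(buf), off, SKETCH_RM_END_ID, NULL, 0);
	return sketch_count_lock_records(buf);
}
```

The walk relies on every record being written to disk first, which is why the
draft requires the cleanup to run after the demote checkpoint.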

Thank you!

[1] https://www.postgresql.org/message-id/flat/20200617174451.222078b4%40firost

From ba5cabbbdb8adadf264cc1e9e02c9560050429c6 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <j...@dalibo.com>
Date: Thu, 12 Jun 2025 18:36:42 +0200
Subject: [PATCH v1 1/2] Demote draft using dedicated checkpoint

---
 src/backend/access/rmgrdesc/xlogdesc.c    |  11 +-
 src/backend/access/transam/multixact.c    |  16 ++
 src/backend/access/transam/twophase.c     |  97 +++++++++
 src/backend/access/transam/xlog.c         | 250 +++++++++++++---------
 src/backend/access/transam/xlogrecovery.c |  62 +++++-
 src/backend/postmaster/checkpointer.c     |  28 +++
 src/backend/postmaster/postmaster.c       | 127 ++++++++++-
 src/backend/storage/ipc/procsignal.c      |   5 +
 src/backend/storage/lmgr/lock.c           |  12 ++
 src/backend/tcop/backend_startup.c        |   5 +
 src/bin/pg_controldata/pg_controldata.c   |   2 +
 src/bin/pg_ctl/pg_ctl.c                   | 112 +++++++++-
 src/include/access/multixact.h            |   1 +
 src/include/access/twophase.h             |   1 +
 src/include/access/xlog.h                 |  21 +-
 src/include/access/xlogrecovery.h         |   2 +
 src/include/catalog/pg_control.h          |  28 +--
 src/include/postmaster/bgwriter.h         |   2 +
 src/include/storage/lock.h                |   2 +
 src/include/storage/procsignal.h          |   1 +
 src/include/tcop/backend_startup.h        |   1 +
 src/include/utils/pidfile.h               |   1 +
 22 files changed, 646 insertions(+), 141 deletions(-)

diff --git a/src/backend/access/rmgrdesc/xlogdesc.c b/src/backend/access/rmgrdesc/xlogdesc.c
index 58040f28656..fb180f52107 100644
--- a/src/backend/access/rmgrdesc/xlogdesc.c
+++ b/src/backend/access/rmgrdesc/xlogdesc.c
@@ -60,8 +60,8 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 	char	   *rec = XLogRecGetData(record);
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
 
-	if (info == XLOG_CHECKPOINT_SHUTDOWN ||
-		info == XLOG_CHECKPOINT_ONLINE)
+	if (info == XLOG_CHECKPOINT_SHUTDOWN || info == XLOG_CHECKPOINT_ONLINE ||
+		info == XLOG_CHECKPOINT_DEMOTE)
 	{
 		CheckPoint *checkpoint = (CheckPoint *) rec;
 
@@ -87,7 +87,9 @@ xlog_desc(StringInfo buf, XLogReaderState *record)
 						 checkpoint->oldestCommitTsXid,
 						 checkpoint->newestCommitTsXid,
 						 checkpoint->oldestActiveXid,
-						 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" : "online");
+						 (info == XLOG_CHECKPOINT_SHUTDOWN) ? "shutdown" :
+							(info == XLOG_CHECKPOINT_DEMOTE) ? "demote" :
+								"online");
 	}
 	else if (info == XLOG_NEXTOID)
 	{
@@ -179,6 +181,9 @@ xlog_identify(uint8 info)
 		case XLOG_CHECKPOINT_SHUTDOWN:
 			id = "CHECKPOINT_SHUTDOWN";
 			break;
+		case XLOG_CHECKPOINT_DEMOTE:
+			id = "CHECKPOINT_DEMOTE";
+			break;
 		case XLOG_CHECKPOINT_ONLINE:
 			id = "CHECKPOINT_ONLINE";
 			break;
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 3c06ac45532..651e8272820 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -2171,6 +2171,22 @@ StartupMultiXact(void)
 						pageno);
 }
 
+/*
+ * This must be called during demote.
+ */
+void
+ShutdownMultiXact(void)
+{
+	/* FIXME: This seems enough for now, but maybe it would be best to call
+	 * MultiXactShmemInit() instead of this half-baked reinit, hoping for
+	 * StartupXLOG to do the right thing from there?
+	 */
+	/* signal that we're officially down */
+	LWLockAcquire(MultiXactGenLock, LW_EXCLUSIVE);
+	MultiXactState->finishedStartup = false;
+	LWLockRelease(MultiXactGenLock);
+}
+
 /*
  * This must be called ONCE at the end of startup/recovery.
  */
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 73a80559194..d9e9c78382b 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -1673,6 +1673,103 @@ FinishPreparedTransaction(const char *gid, bool isCommit)
 	pfree(buf);
 }
 
+/*
+ * ShutdownPreparedTransactions: clean prepared xacts from shared memory
+ *
+ * This is called during the demote process to clean the shared memory
+ * before the startup process loads everything back in correctly
+ * for the standby mode.
+ *
+ * Note: this function assumes all prepared transactions have been
+ * written to disk. In consequence, it must be called AFTER the demote
+ * shutdown checkpoint.
+ */
+void
+ShutdownPreparedTransactions(void)
+{
+	int i;
+
+	for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
+	{
+		GlobalTransaction gxact;
+		PGPROC     *proc;
+		TransactionId xid;
+		char       *buf;
+		char       *bufptr;
+		TwoPhaseFileHeader *hdr;
+		TransactionId latestXid;
+		TransactionId *children;
+
+		gxact = TwoPhaseState->prepXacts[i];
+		proc = &ProcGlobal->allProcs[gxact->pgprocno];
+		xid = gxact->xid;
+
+		/* Read and validate 2PC state data */
+		Assert(gxact->ondisk);
+		buf = ReadTwoPhaseFile(xid, false);
+
+		/*
+		 * Disassemble the header area
+		 */
+		hdr = (TwoPhaseFileHeader *) buf;
+		Assert(TransactionIdEquals(hdr->xid, xid));
+		bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader))
+				 + MAXALIGN(hdr->gidlen);
+		children = (TransactionId *) bufptr;
+		bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId))
+				  + MAXALIGN(hdr->ncommitrels * sizeof(RelFileLocator))
+				  + MAXALIGN(hdr->nabortrels * sizeof(RelFileLocator))
+				  + MAXALIGN(hdr->ncommitstats * sizeof(xl_xact_stats_item))
+				  + MAXALIGN(hdr->nabortstats * sizeof(xl_xact_stats_item))
+				  + MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage));
+
+		/* compute latestXid among all children */
+		latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children);
+
+		/* remove dummy proc associated with the gxact */
+		ProcArrayRemove(proc, latestXid);
+
+		/*
+		 * FIXME: This lock is probably not needed during the demote process
+		 * as all backends are already gone.
+		 */
+		LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE);
+
+		/* cleanup locks */
+		for (;;)
+		{
+			TwoPhaseRecordOnDisk *record = (TwoPhaseRecordOnDisk *) bufptr;
+
+			Assert(record->rmid <= TWOPHASE_RM_MAX_ID);
+			if (record->rmid == TWOPHASE_RM_END_ID)
+				break;
+
+			bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk));
+
+			if (record->rmid == TWOPHASE_RM_LOCK_ID)
+				lock_twophase_shutdown(xid, record->info,
+									   (void *) bufptr, record->len);
+
+			bufptr += MAXALIGN(record->len);
+		}
+
+		/* and put it back in the freelist */
+		gxact->next = TwoPhaseState->freeGXacts;
+		TwoPhaseState->freeGXacts = gxact;
+
+		/*
+		 * Release the lock as all callbacks are called and shared memory cleanup
+		 * is done.
+		 */
+		LWLockRelease(TwoPhaseStateLock);
+
+		pfree(buf);
+	}
+
+	TwoPhaseState->numPrepXacts -= i;
+	Assert(TwoPhaseState->numPrepXacts == 0);
+}
+
 /*
  * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record.
  */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1914859b2ee..7b701dd6c57 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5628,6 +5628,7 @@ StartupXLOG(void)
 	XLogRecPtr	missingContrecPtr;
 	TransactionId oldestActiveXID;
 	bool		promoted = false;
+	bool		is_demoting = false;
 
 	/*
 	 * We should have an aux process resource owner to use, and we should not
@@ -5693,6 +5694,13 @@ StartupXLOG(void)
 							str_time(ControlFile->time))));
 			break;
 
+		case DB_DEMOTING:
+			ereport(LOG,
+					(errmsg("database system was demoted at %s",
+							str_time(ControlFile->time))));
+			is_demoting = true;
+			break;
+
 		default:
 			ereport(FATAL,
 					(errcode(ERRCODE_DATA_CORRUPTED),
@@ -5732,7 +5740,8 @@ StartupXLOG(void)
 	 *   persisted.  To avoid that, fsync the entire data directory.
 	 */
 	if (ControlFile->state != DB_SHUTDOWNED &&
-		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY)
+		ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY &&
+		ControlFile->state != DB_DEMOTING)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
@@ -5822,26 +5831,55 @@ StartupXLOG(void)
 	 * control file. On recovery, all unlogged relations are blown away, so
 	 * the unlogged LSN counter can be reset too.
 	 */
-	if (ControlFile->state == DB_SHUTDOWNED)
+	if (ControlFile->state == DB_SHUTDOWNED ||
+		ControlFile->state == DB_DEMOTING)
 		pg_atomic_write_membarrier_u64(&XLogCtl->unloggedLSN,
 									   ControlFile->unloggedLSN);
 	else
 		pg_atomic_write_membarrier_u64(&XLogCtl->unloggedLSN,
 									   FirstNormalUnloggedLSN);
 
+	if (!is_demoting)
+	{
+		/*
+		 * Copy any missing timeline history files between 'now' and the
+		 * recovery target timeline from archive to pg_wal. While we don't need
+		 * those files ourselves - the history file of the recovery target
+		 * timeline covers all the previous timelines in the history too - a
+		 * cascading standby server might be interested in them. Or, if you
+		 * archive the WAL from this server to a different archive than the
+		 * primary, it'd be good for all the history files to get archived
+		 * there after failover, so that you can use one of the old timelines
+		 * as a PITR target. Timeline history files are small, so it's better
+		 * to copy them unnecessarily than not copy them and regret later.
+		 */
+		restoreTimeLineHistoryFiles(checkPoint.ThisTimeLineID, recoveryTargetTLI);
+	}
+
 	/*
-	 * Copy any missing timeline history files between 'now' and the recovery
-	 * target timeline from archive to pg_wal. While we don't need those files
-	 * ourselves - the history file of the recovery target timeline covers all
-	 * the previous timelines in the history too - a cascading standby server
-	 * might be interested in them. Or, if you archive the WAL from this
-	 * server to a different archive than the primary, it'd be good for all
-	 * the history files to get archived there after failover, so that you can
-	 * use one of the old timelines as a PITR target. Timeline history files
-	 * are small, so it's better to copy them unnecessarily than not copy them
-	 * and regret later.
+	 * Following restoreTwoPhaseData() needs RecoveryInProgress() to return
+	 * the appropriate state as it calls PrepareRedoAdd()
 	 */
-	restoreTimeLineHistoryFiles(checkPoint.ThisTimeLineID, recoveryTargetTLI);
+	if (InRecovery)
+	{
+		/* Initialize state for RecoveryInProgress() */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		if (InArchiveRecovery)
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
+		else
+			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		/*
+		 * Update pg_control to show that we are recovering and to show the
+		 * selected checkpoint as the place we are starting from. We also mark
+		 * pg_control with any minimum recovery stop point obtained from a
+		 * backup history file.
+		 *
+		 * No need to hold ControlFileLock yet, we aren't up far enough.
+		 */
+		UpdateControlFile();
+	}
 
 	/*
 	 * Before running in recovery, scan pg_twophase and fill in its status to
@@ -5877,24 +5915,6 @@ StartupXLOG(void)
 	/* REDO */
 	if (InRecovery)
 	{
-		/* Initialize state for RecoveryInProgress() */
-		SpinLockAcquire(&XLogCtl->info_lck);
-		if (InArchiveRecovery)
-			XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE;
-		else
-			XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
-		SpinLockRelease(&XLogCtl->info_lck);
-
-		/*
-		 * Update pg_control to show that we are recovering and to show the
-		 * selected checkpoint as the place we are starting from. We also mark
-		 * pg_control with any minimum recovery stop point obtained from a
-		 * backup history file.
-		 *
-		 * No need to hold ControlFileLock yet, we aren't up far enough.
-		 */
-		UpdateControlFile();
-
 		/*
 		 * If there was a backup label file, it's done its job and the info
 		 * has now been propagated into pg_control.  We must get rid of the
@@ -5976,7 +5996,7 @@ StartupXLOG(void)
 
 			InitRecoveryTransactionEnvironment();
 
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 				oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids);
 			else
 				oldestActiveXID = checkPoint.oldestActiveXid;
@@ -5998,7 +6018,7 @@ StartupXLOG(void)
 			 * empty running-xacts record and use that here and now. Recover
 			 * additional standby state for prepared transactions.
 			 */
-			if (wasShutdown)
+			if (wasShutdown || is_demoting)
 			{
 				RunningTransactionsData running;
 				TransactionId latestCompletedXid;
@@ -6207,40 +6227,47 @@ StartupXLOG(void)
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
 	/*
-	 * Tricky point here: lastPage contains the *last* block that the LastRec
-	 * record spans, not the one it starts in.  The last block is indeed the
-	 * one we want to use.
+	 * XLog buffers initialization vars are already set after a demote, no need
+	 * to deal with them in this situation
 	 */
-	if (EndOfLog % XLOG_BLCKSZ != 0)
-	{
-		char	   *page;
-		int			len;
-		int			firstIdx;
-
-		firstIdx = XLogRecPtrToBufIdx(EndOfLog);
-		len = EndOfLog - endOfRecoveryInfo->lastPageBeginPtr;
-		Assert(len < XLOG_BLCKSZ);
-
-		/* Copy the valid part of the last block, and zero the rest */
-		page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
-		memcpy(page, endOfRecoveryInfo->lastPage, len);
-		memset(page + len, 0, XLOG_BLCKSZ - len);
-
-		pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
-		pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
-		XLogCtl->InitializedFrom = endOfRecoveryInfo->lastPageBeginPtr;
-	}
-	else
+	if (!is_demoting)
 	{
 		/*
-		 * There is no partial block to copy. Just set InitializedUpTo, and
-		 * let the first attempt to insert a log record to initialize the next
-		 * buffer.
+		 * Tricky point here: lastPage contains the *last* block that the LastRec
+		 * record spans, not the one it starts in.  The last block is indeed the
+		 * one we want to use.
 		 */
-		pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog);
-		XLogCtl->InitializedFrom = EndOfLog;
+		if (EndOfLog % XLOG_BLCKSZ != 0)
+		{
+			char	   *page;
+			int			len;
+			int			firstIdx;
+
+			firstIdx = XLogRecPtrToBufIdx(EndOfLog);
+			len = EndOfLog - endOfRecoveryInfo->lastPageBeginPtr;
+			Assert(len < XLOG_BLCKSZ);
+
+			/* Copy the valid part of the last block, and zero the rest */
+			page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
+			memcpy(page, endOfRecoveryInfo->lastPage, len);
+			memset(page + len, 0, XLOG_BLCKSZ - len);
+
+			pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
+			pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
+			XLogCtl->InitializedFrom = endOfRecoveryInfo->lastPageBeginPtr;
+		}
+		else
+		{
+			/*
+			 * There is no partial block to copy. Just set InitializedUpTo, and
+			 * let the first attempt to insert a log record to initialize the next
+			 * buffer.
+			 */
+			pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog);
+			XLogCtl->InitializedFrom = EndOfLog;
+		}
+		pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo));
 	}
-	pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo));
 
 	/*
 	 * Update local and shared status.  This is OK to do without any locks
@@ -6522,30 +6549,20 @@ bool
 RecoveryInProgress(void)
 {
 	/*
-	 * We check shared state each time only until we leave recovery mode. We
-	 * can't re-enter recovery, so there's no need to keep checking after the
-	 * shared variable has once been seen false.
+	 * use volatile pointer to make sure we make a fresh read of the
+	 * shared variable.
 	 */
-	if (!LocalRecoveryInProgress)
-		return false;
-	else
-	{
-		/*
-		 * use volatile pointer to make sure we make a fresh read of the
-		 * shared variable.
-		 */
-		volatile XLogCtlData *xlogctl = XLogCtl;
+	volatile XLogCtlData *xlogctl = XLogCtl;
 
-		LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
+	LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE);
 
-		/*
-		 * Note: We don't need a memory barrier when we're still in recovery.
-		 * We might exit recovery immediately after return, so the caller
-		 * can't rely on 'true' meaning that we're still in recovery anyway.
-		 */
+	/*
+	 * Note: We don't need a memory barrier when we're still in recovery.
+	 * We might exit recovery immediately after return, so the caller
+	 * can't rely on 'true' meaning that we're still in recovery anyway.
+	 */
 
-		return LocalRecoveryInProgress;
-	}
+	return LocalRecoveryInProgress;
 }
 
 /*
@@ -6789,6 +6806,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN)
 void
 ShutdownXLOG(int code, Datum arg)
 {
+	bool is_demoting = DatumGetBool(arg);
+
 	/*
 	 * We should have an aux process resource owner to use, and we should not
 	 * be in a transaction that's installed some other resowner.
@@ -6800,7 +6819,7 @@ ShutdownXLOG(int code, Datum arg)
 
 	/* Don't be chatty in standalone mode */
 	ereport(IsPostmasterEnvironment ? LOG : NOTICE,
-			(errmsg("shutting down")));
+			(is_demoting?errmsg("demoting"):errmsg("shutting down")));
 
 	/*
 	 * Signal walsenders to move to stopping state.
@@ -6826,7 +6845,21 @@ ShutdownXLOG(int code, Datum arg)
 		if (XLogArchivingActive())
 			RequestXLogSwitch(false);
 
-		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
+		if (is_demoting)
+		{
+			/*
+			 * FIXME demote: avoiding checkpoint?
+			 * A checkpoint is probably running during a demote action. If
+			 * we don't want to wait for the checkpoint during the demote,
+			 * we might need to cancel it as it will not be able to write
+			 * to the WAL after the demote.
+			 */
+			CreateCheckPoint(CHECKPOINT_IS_DEMOTE | CHECKPOINT_IMMEDIATE);
+			ShutdownPreparedTransactions();
+			ShutdownMultiXact(); // FIXME: replace with MultiXactShmemInit()?
+		}
+		else
+			CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
 }
 
@@ -6851,8 +6884,9 @@ LogCheckpointStart(int flags, bool restartpoint)
 	else
 		ereport(LOG,
 		/* translator: the placeholders show checkpoint options */
-				(errmsg("checkpoint starting:%s%s%s%s%s%s%s%s",
+				(errmsg("checkpoint starting:%s%s%s%s%s%s%s%s%s",
 						(flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "",
+						(flags & CHECKPOINT_IS_DEMOTE) ? " demote" : "",
 						(flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "",
 						(flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "",
 						(flags & CHECKPOINT_FORCE) ? " force" : "",
@@ -7041,6 +7075,7 @@ update_checkpoint_display(int flags, bool restartpoint, bool reset)
  *
  * flags is a bitwise OR of the following:
  *	CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown.
+ *	CHECKPOINT_IS_DEMOTE: checkpoint is for demote.
  *	CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery.
  *	CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP,
  *		ignoring checkpoint_completion_target parameter.
@@ -7077,6 +7112,7 @@ bool
 CreateCheckPoint(int flags)
 {
 	bool		shutdown;
+	bool		demote;
 	CheckPoint	checkPoint;
 	XLogRecPtr	recptr;
 	XLogSegNo	_logSegNo;
@@ -7089,14 +7125,20 @@ CreateCheckPoint(int flags)
 	int			oldXLogAllowed = 0;
 
 	/*
-	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
-	 * issued at a different time.
+	 * An end-of-recovery checkpoint or demote is really a shutdown checkpoint,
+	 * just issued at a different time.
 	 */
-	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY))
+	if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
+				 CHECKPOINT_IS_DEMOTE))
 		shutdown = true;
 	else
 		shutdown = false;
 
+	if (flags & CHECKPOINT_IS_DEMOTE)
+		demote = true;
+	else
+		demote = false;
+
 	/* sanity check */
 	if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0)
 		elog(ERROR, "can't create a checkpoint during recovery");
@@ -7127,7 +7169,7 @@ CreateCheckPoint(int flags)
 	if (shutdown)
 	{
 		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
-		ControlFile->state = DB_SHUTDOWNING;
+		ControlFile->state = demote? DB_DEMOTING:DB_SHUTDOWNING;
 		UpdateControlFile();
 		LWLockRelease(ControlFileLock);
 	}
@@ -7158,7 +7200,7 @@ CreateCheckPoint(int flags)
 	 * avoid inserting duplicate checkpoints when the system is idle.
 	 */
 	if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY |
-				  CHECKPOINT_FORCE)) == 0)
+				  CHECKPOINT_FORCE | CHECKPOINT_IS_DEMOTE)) == 0)
 	{
 		if (last_important_lsn == ControlFile->checkPoint)
 		{
@@ -7386,8 +7428,8 @@ CreateCheckPoint(int flags)
 	 * allows us to reconstruct the state of running transactions during
 	 * archive recovery, if required. Skip, if this info disabled.
 	 *
-	 * If we are shutting down, or Startup process is completing crash
-	 * recovery we don't need to write running xact data.
+	 * If we are shutting down, demoting or Startup process is completing
+	 * crash recovery we don't need to write running xact data.
 	 */
 	if (!shutdown && XLogStandbyInfoActive())
 		LogStandbySnapshot();
@@ -7399,18 +7441,21 @@ CreateCheckPoint(int flags)
 	 */
 	XLogBeginInsert();
 	XLogRegisterData(&checkPoint, sizeof(checkPoint));
-	recptr = XLogInsert(RM_XLOG_ID,
-						shutdown ? XLOG_CHECKPOINT_SHUTDOWN :
-						XLOG_CHECKPOINT_ONLINE);
+	if (demote)
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_DEMOTE);
+	else if (shutdown)
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_SHUTDOWN);
+	else
+		recptr = XLogInsert(RM_XLOG_ID, XLOG_CHECKPOINT_ONLINE);
 
 	XLogFlush(recptr);
 
 	/*
-	 * We mustn't write any new WAL after a shutdown checkpoint, or it will be
-	 * overwritten at next startup.  No-one should even try, this just allows
-	 * sanity-checking.  In the case of an end-of-recovery checkpoint, we want
-	 * to just temporarily disable writing until the system has exited
-	 * recovery.
+	 * We mustn't write any new WAL after a shutdown or demote checkpoint, or
+	 * it will be overwritten at next startup.  No-one should even try, this
+	 * just allows sanity-checking.  In the case of an end-of-recovery
+	 * checkpoint, we want to just temporarily disable writing until the system
+	 * has exited recovery.
 	 */
 	if (shutdown)
 	{
@@ -7426,7 +7471,8 @@ CreateCheckPoint(int flags)
 	 */
 	if (shutdown && checkPoint.redo != ProcLastRecPtr)
 		ereport(PANIC,
-				(errmsg("concurrent write-ahead log activity while database system is shutting down")));
+				(errmsg("concurrent write-ahead log activity while database system is %s",
+						demote? "demoting":"shutting down")));
 
 	/*
 	 * Remember the prior checkpoint's redo ptr for
@@ -7439,7 +7485,7 @@ CreateCheckPoint(int flags)
 	 */
 	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
 	if (shutdown)
-		ControlFile->state = DB_SHUTDOWNED;
+		ControlFile->state = demote? DB_DEMOTING:DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
 	/* crash recovery should always recover to the end of WAL */
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 6ce979f2d8b..682e2a9e30d 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -527,8 +527,10 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	bool		haveBackupLabel = false;
 	CheckPoint	checkPoint;
 	bool		backupFromStandby = false;
+	bool		wasDemote = false;
 
 	dbstate_at_startup = ControlFile->state;
+	// wasDemote = (dbstate_at_startup == DB_DEMOTING);
 
 	/*
 	 * Initialize on the assumption we want to recover to the latest timeline
@@ -759,7 +761,8 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 			(ControlFile->minRecoveryPoint != InvalidXLogRecPtr ||
 			 ControlFile->backupEndRequired ||
 			 ControlFile->backupEndPoint != InvalidXLogRecPtr ||
-			 ControlFile->state == DB_SHUTDOWNED))
+			 ControlFile->state == DB_SHUTDOWNED ||
+			 ControlFile->state == DB_DEMOTING))
 		{
 			InArchiveRecovery = true;
 			if (StandbyModeRequested)
@@ -803,6 +806,7 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 		}
 		memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint));
 		wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN);
+		wasDemote = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_DEMOTE);
 	}
 
 	if (ArchiveRecoveryRequested)
@@ -875,10 +879,18 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 						LSN_FORMAT_ARGS(ControlFile->minRecoveryPoint),
 						ControlFile->minRecoveryPointTLI)));
 
-	ereport(DEBUG1,
-			(errmsg_internal("redo record is at %X/%X; shutdown %s",
-							 LSN_FORMAT_ARGS(checkPoint.redo),
-							 wasShutdown ? "true" : "false")));
+	if (wasShutdown)
+		ereport(DEBUG1,
+			(errmsg_internal("redo record is at %X/%X; shutdown",
+							 LSN_FORMAT_ARGS(checkPoint.redo))));
+	else if (wasDemote)
+		ereport(DEBUG1,
+			(errmsg_internal("redo record is at %X/%X; demote",
+							 LSN_FORMAT_ARGS(checkPoint.redo))));
+	else
+		ereport(DEBUG1,
+			(errmsg_internal("redo record is at %X/%X;",
+							 LSN_FORMAT_ARGS(checkPoint.redo))));
 	ereport(DEBUG1,
 			(errmsg_internal("next transaction ID: " UINT64_FORMAT "; next OID: %u",
 							 U64FromFullTransactionId(checkPoint.nextXid),
@@ -938,7 +950,21 @@ InitWalRecovery(ControlFileData *ControlFile, bool *wasShutdown_ptr,
 	{
 		if (InArchiveRecovery)
 		{
-			ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+			if (wasDemote)
+			{
+				// FIXME: why not take the lock whether we are demoting or not?
+				//		  Is ControlFileLock available or not when starting in
+				//		  recovery?
+				/*
+				 * Avoid concurrent access to the ControlFile data
+				 * during demotion.
+				*/
+				LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
+				LWLockRelease(ControlFileLock);
+			}
+			else
+				ControlFile->state = DB_IN_ARCHIVE_RECOVERY;
 		}
 		else
 		{
@@ -4100,6 +4126,7 @@ ReadCheckpointRecord(XLogPrefetcher *xlogprefetcher, XLogRecPtr RecPtr,
 	}
 	info = record->xl_info & ~XLR_INFO_MASK;
 	if (info != XLOG_CHECKPOINT_SHUTDOWN &&
+		info != XLOG_CHECKPOINT_DEMOTE &&
 		info != XLOG_CHECKPOINT_ONLINE)
 	{
 		ereport(LOG,
@@ -4490,6 +4517,29 @@ CheckPromoteSignal(void)
 	return false;
 }
 
+/*
+ * Remove the file signaling a demote request.
+ */
+void
+RemoveDemoteSignalFiles(void)
+{
+	unlink(DEMOTE_SIGNAL_FILE);
+}
+
+/*
+ * Check if a demote request appeared.
+ */
+bool
+CheckDemoteSignal(void)
+{
+	struct stat stat_buf;
+
+	if (stat(DEMOTE_SIGNAL_FILE, &stat_buf) == 0)
+		return true;
+
+	return false;
+}
+
 /*
  * Wake up startup process to replay newly arrived WAL, or to notice that
  * failover has been requested.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fda91ffd1ce..098d4671ef2 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -155,6 +155,8 @@ static double ckpt_cached_elapsed;
 
 static pg_time_t last_checkpoint_time;
 static pg_time_t last_xlog_switch_time;
+// FIXME: shouldn't it be in postmaster.c with other pending signals?
+static volatile sig_atomic_t pending_demote_request = false;
 
 /* Prototypes for private functions */
 
@@ -659,6 +661,23 @@ ProcessCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
+	if (pending_demote_request)
+	{
+		ereport(LOG, (errmsg("demote signal received by checkpointer")));
+		pending_demote_request = false;
+		/* Close down the database */
+		ShutdownXLOG(0, BoolGetDatum(true));
+		/*
+		 * Exit checkpointer. We could keep it around during demotion, but
+		 * exiting here has multiple benefits:
+		 * - to create a fresh process with clean local vars
+		 *   (eg. LocalRecoveryInProgress)
+		 * - to signal postmaster the demote shutdown checkpoint is done
+		 *   and keep going with next steps of the demotion
+		 */
+		cancel_before_shmem_exit(pgstat_before_server_shutdown, 0);
+		proc_exit(0);
+	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -782,6 +801,7 @@ CheckpointWriteDelay(int flags, double progress)
 		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
+		!pending_demote_request &&
 		IsCheckpointOnSchedule(progress))
 	{
 		if (ConfigReloadPending)
@@ -913,6 +933,14 @@ IsCheckpointOnSchedule(double progress)
  * --------------------------------
  */
 
+/* SIGUSR1: set flag to demote */
+void
+ReqCheckpointDemoteHandler(SIGNAL_ARGS)
+{
+	pending_demote_request = true;
+	SetLatch(MyLatch);
+}
+
 /* SIGINT: set flag to trigger writing of shutdown checkpoint */
 static void
 ReqShutdownXLOG(SIGNAL_ARGS)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 490f7ce3664..1a7c77878cc 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -290,8 +290,8 @@ static int	Shutdown = NoShutdown;
 static bool FatalError = false; /* T if recovering from backend crash */
 
 /*
- * We use a simple state machine to control startup, shutdown, and
- * crash recovery (which is rather like shutdown followed by startup).
+ * We use a simple state machine to control startup, shutdown, demote and
+ * crash recovery (both are rather like shutdown followed by startup).
  *
  * After doing all the postmaster initialization work, we enter PM_STARTUP
  * state and the startup process is launched. The startup process begins by
@@ -335,6 +335,7 @@ typedef enum
 {
 	PM_INIT,					/* postmaster starting */
 	PM_STARTUP,					/* waiting for startup subprocess */
+	PM_DEMOTING,				/* waiting for demote checkpoint to finish */
 	PM_RECOVERY,				/* in archive recovery mode */
 	PM_HOT_STANDBY,				/* in hot standby mode */
 	PM_RUN,						/* normal "database is alive" state */
@@ -391,6 +392,7 @@ static bool HaveCrashedWorker = false;
 static volatile sig_atomic_t pending_pm_pmsignal;
 static volatile sig_atomic_t pending_pm_child_exit;
 static volatile sig_atomic_t pending_pm_reload_request;
+static volatile sig_atomic_t pending_pm_demote_request;
 static volatile sig_atomic_t pending_pm_shutdown_request;
 static volatile sig_atomic_t pending_pm_fast_shutdown_request;
 static volatile sig_atomic_t pending_pm_immediate_shutdown_request;
@@ -443,6 +445,9 @@ static void signal_child(PMChild *pmchild, int signal);
 static bool SignalChildren(int signal, BackendTypeMask targetMask);
 static void TerminateChildren(int signal);
 static int	CountChildren(BackendTypeMask targetMask);
+
+
+
 static void LaunchMissingBackgroundProcesses(void);
 static void maybe_start_bgworkers(void);
 static bool maybe_reap_io_worker(int pid);
@@ -1815,14 +1820,16 @@ canAcceptConnections(BackendType backend_type)
 	Assert(backend_type == B_BACKEND || backend_type == B_AUTOVAC_WORKER);
 
 	/*
-	 * Can't start backends when in startup/shutdown/inconsistent recovery
-	 * state.  We treat autovac workers the same as user backends for this
-	 * purpose.
+	 * Can't start backends when in startup/demote/shutdown/inconsistent
+	 * recovery state.  We treat autovac workers the same as user backends for
+	 * this purpose.
 	 */
 	if (pmState != PM_RUN && pmState != PM_HOT_STANDBY)
 	{
 		if (Shutdown > NoShutdown)
 			return CAC_SHUTDOWN;	/* shutdown is pending */
+		else if (pending_pm_demote_request)
+			return CAC_DEMOTE;		/* demote is pending */
 		else if (!FatalError && pmState == PM_STARTUP)
 			return CAC_STARTUP; /* normal startup */
 		else if (!FatalError && pmState == PM_RECOVERY)
@@ -1969,6 +1976,7 @@ InitProcessGlobals(void)
 /*
  * Child processes use SIGUSR1 to notify us of 'pmsignals'.  pg_ctl uses
  * SIGUSR1 to ask postmaster to check for logrotate and promote files.
+ * FIXME: and demote as well?
  */
 static void
 handle_pm_pmsignal_signal(SIGNAL_ARGS)
@@ -2377,7 +2385,23 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_DEMOTING &&
+				pending_pm_demote_request)
+			{
+				/*
+				 * The checkpointer exit signals the demote checkpoint is done.
+				 * The startup recovery mode can be started from there.
+				 */
+				ereport(DEBUG1,
+						(errmsg_internal("checkpointer shutdown for demote")));
+
+				/*
+				 * FIXME: stop other subprocess we want to restart after demote
+				 *        here
+				 */
+				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+			}
+			else if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
@@ -2744,6 +2768,7 @@ HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
 			/* there might be more backends to wait for */
 			break;
 
+		case PM_DEMOTING:
 		case PM_WAIT_XLOG_SHUTDOWN:
 		case PM_WAIT_XLOG_ARCHIVAL:
 		case PM_WAIT_CHECKPOINTER:
@@ -2889,6 +2914,14 @@ PostmasterStateMachine(void)
 		}
 	}
 
+	// if (pmState == PM_DEMOTING)
+	// {
+	// 	/* FIXME: this was waiting for backend to finish their xact.
+	// 	 *
+	// 	 * should it be stop/force backend here?
+	// 	 */
+	// }
+
 	/*
 	 * In the PM_WAIT_BACKENDS state, wait for all the regular backends and
 	 * processes like autovacuum and background workers that are comparable to
@@ -3014,12 +3047,20 @@ PostmasterStateMachine(void)
 				 * entered FatalError state.
 				 */
 			}
+			else if (pending_pm_demote_request)
+			{
+				ereport(LOG, (errmsg("sending demote signal to checkpointer")));
+				SendProcSignal(CheckpointerPMChild->pid,
+							   PROCSIG_CHECKPOINTER_DEMOTING,
+							   INVALID_PROC_NUMBER);
+				UpdatePMState(PM_DEMOTING);
+			}
 			else
 			{
 				/*
 				 * If we get here, we are proceeding with normal shutdown. All
 				 * the regular children are gone, and it's time to tell the
-				 * checkpointer to do a shutdown checkpoint.
+				 * checkpointer to do a shutdown or demote checkpoint.
 				 */
 				Assert(Shutdown > NoShutdown);
 				/* Start the checkpointer if not running */
@@ -3188,6 +3229,20 @@ PostmasterStateMachine(void)
 		}
 	}
 
+	/* Demoting: start the Startup Process */
+	if (pending_pm_demote_request && pmState == PM_DEMOTING &&
+		CheckpointerPMChild == NULL)
+	{
+		/* stop archiver process if not required during standby */
+		if (!XLogArchivingAlways() && PgArchPMChild != NULL)
+			signal_child(PgArchPMChild, SIGQUIT);
+
+		StartupPMChild = StartChildProcess(B_STARTUP);
+		Assert(StartupPMChild != NULL);
+		StartupStatus = STARTUP_RUNNING;
+		UpdatePMState(PM_STARTUP);
+	}
+
 	/*
 	 * If we need to recover from a crash, wait for all non-syslogger children
 	 * to exit, then reset shmem and start the startup process.
@@ -3236,6 +3291,7 @@ pmstate_name(PMState state)
 	{
 			PM_TOSTR_CASE(PM_INIT);
 			PM_TOSTR_CASE(PM_STARTUP);
+			PM_TOSTR_CASE(PM_DEMOTING);
 			PM_TOSTR_CASE(PM_RECOVERY);
 			PM_TOSTR_CASE(PM_HOT_STANDBY);
 			PM_TOSTR_CASE(PM_RUN);
@@ -3717,6 +3773,7 @@ process_pm_pmsignal(void)
 		if (!EnableHotStandby)
 		{
 			AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY);
+			pending_pm_demote_request = false;
 #ifdef USE_SYSTEMD
 			sd_notify(0, "READY=1");
 #endif
@@ -3750,6 +3807,10 @@ process_pm_pmsignal(void)
 		StartWorkerNeeded = true;
 	}
 
+	/* FIXME: better included in previous conditional block? */
+	if (pending_pm_demote_request)
+		pending_pm_demote_request = false;
+
 	/* Process background worker state changes. */
 	if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
 	{
@@ -3886,6 +3947,56 @@ process_pm_pmsignal(void)
 		 */
 		signal_child(StartupPMChild, SIGUSR2);
 	}
+
+	if (CheckDemoteSignal() && pmState != PM_RUN)
+	{
+		pending_pm_demote_request = false;
+		RemoveDemoteSignalFiles();
+		ereport(LOG,
+				(errmsg("ignoring demote signal because already in standby mode")));
+	}
+	/* received demote signal */
+	else if (CheckDemoteSignal())
+	{
+		FILE	   *standby_file;
+
+		ereport(LOG, (errmsg("received demote request")));
+
+		RemoveDemoteSignalFiles();
+
+		/* create the standby signal file */
+		standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w");
+		if (!standby_file)
+		{
+			ereport(LOG, (errcode_for_file_access(),
+						  errmsg("could not create file \"%s\": %m",
+								 STANDBY_SIGNAL_FILE)));
+			goto out;
+		}
+
+		if (FreeFile(standby_file))
+		{
+			ereport(LOG, (errcode_for_file_access(),
+						  errmsg("could not write file \"%s\": %m",
+								 STANDBY_SIGNAL_FILE)));
+			unlink(STANDBY_SIGNAL_FILE);
+			goto out;
+		}
+
+		pending_pm_demote_request = true;
+		connsAllowed = false;
+
+		/* Report status */
+		AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING);
+
+		UpdatePMState(PM_STOP_BACKENDS);
+		PostmasterStateMachine();
+
+		// UpdatePMState(PM_DEMOTING);
+	}
+
+out:
+	return;
 }
 
 /*
@@ -3939,7 +4050,6 @@ CountChildren(BackendTypeMask targetMask)
 	return cnt;
 }
 
-
 /*
  * StartChildProcess -- start an auxiliary process for the postmaster
  *
@@ -4186,6 +4296,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 		case PM_WAIT_IO_WORKERS:
 		case PM_WAIT_BACKENDS:
 		case PM_STOP_BACKENDS:
+		case PM_DEMOTING:
 			break;
 
 		case PM_RUN:
diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c
index a9bb540b55a..6f5caaa25c3 100644
--- a/src/backend/storage/ipc/procsignal.c
+++ b/src/backend/storage/ipc/procsignal.c
@@ -32,6 +32,7 @@
 #include "storage/smgr.h"
 #include "tcop/tcopprot.h"
 #include "utils/memutils.h"
+#include "postmaster/bgwriter.h"
 
 /*
  * The SIGUSR1 signal is multiplexed to support signaling multiple event
@@ -715,6 +716,10 @@ procsignal_sigusr1_handler(SIGNAL_ARGS)
 	if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN))
 		HandleRecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN);
 
+	/* signal the checkpointer process to initiate the demote procedure */
+	if (CheckProcSignal(PROCSIG_CHECKPOINTER_DEMOTING))
+		ReqCheckpointDemoteHandler(PROCSIG_CHECKPOINTER_DEMOTING);
+
 	SetLatch(MyLatch);
 }
 
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 2776ceb295b..7f18ed17e40 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -4569,6 +4569,18 @@ lock_twophase_postabort(TransactionId xid, uint16 info,
 	lock_twophase_postcommit(xid, info, recdata, len);
 }
 
+/*
+ * 2PC shutdown from lock table.
+ *
+ * This is actually just the same as the COMMIT case.
+ */
+void
+lock_twophase_shutdown(TransactionId xid, uint16 info,
+					   void *recdata, uint32 len)
+{
+	lock_twophase_postcommit(xid, info, recdata, len);
+}
+
 /*
  *		VirtualXactLockTableInsert
  *
diff --git a/src/backend/tcop/backend_startup.c b/src/backend/tcop/backend_startup.c
index a7d1fec981f..a9f8d59dfd7 100644
--- a/src/backend/tcop/backend_startup.c
+++ b/src/backend/tcop/backend_startup.c
@@ -327,6 +327,11 @@ BackendInitialize(ClientSocket *client_sock, CAC_state cac)
 							 errmsg("the database system is not yet accepting connections"),
 							 errdetail("Consistent recovery state has not been yet reached.")));
 				break;
+			case CAC_DEMOTE:
+				ereport(FATAL,
+						(errcode(ERRCODE_CANNOT_CONNECT_NOW),
+						 errmsg("the database system is demoting")));
+				break;
 			case CAC_SHUTDOWN:
 				ereport(FATAL,
 						(errcode(ERRCODE_CANNOT_CONNECT_NOW),
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 7bb801bb886..128616513ee 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -57,6 +57,8 @@ dbState(DBState state)
 			return _("shut down");
 		case DB_SHUTDOWNED_IN_RECOVERY:
 			return _("shut down in recovery");
+		case DB_DEMOTING:
+			return _("demoting");
 		case DB_SHUTDOWNING:
 			return _("shutting down");
 		case DB_IN_CRASH_RECOVERY:
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 8a405ff122c..44ef5c6804c 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -33,7 +33,6 @@
 #include "pqexpbuffer.h"
 #endif
 
-
 typedef enum
 {
 	SMART_MODE,
@@ -58,6 +57,7 @@ typedef enum
 	RESTART_COMMAND,
 	RELOAD_COMMAND,
 	STATUS_COMMAND,
+	DEMOTE_COMMAND,
 	PROMOTE_COMMAND,
 	LOGROTATE_COMMAND,
 	KILL_COMMAND,
@@ -98,6 +98,7 @@ static char postopts_file[MAXPGPATH];
 static char version_file[MAXPGPATH];
 static char pid_file[MAXPGPATH];
 static char promote_file[MAXPGPATH];
+static char demote_file[MAXPGPATH];
 static char logrotate_file[MAXPGPATH];
 
 static volatile pid_t postmasterPID = -1;
@@ -125,6 +126,7 @@ static void do_restart(void);
 static void do_reload(void);
 static void do_status(void);
 static void do_promote(void);
+static void do_demote(void);
 static void do_logrotate(void);
 static void do_kill(pid_t pid);
 static void print_msg(const char *msg);
@@ -1259,6 +1261,109 @@ do_promote(void)
 		print_msg(_("server promoting\n"));
 }
 
+/*
+ * demote
+ */
+
+static void
+do_demote(void)
+{
+	int			cnt;
+	FILE		   *dmtfile;
+	pid_t		pid;
+	// FIXME struct stat	statbuf;
+
+	pid = get_pgpid(false);
+
+	if (pid == 0)			/* no pid file */
+	{
+		write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file);
+		write_stderr(_("Is server running?\n"));
+		exit(1);
+	}
+	else if (pid < 0)		/* standalone backend, not postmaster */
+	{
+		pid = -pid;
+		write_stderr(_("%s: cannot demote server; "
+					   "single-user server is running (PID: %d)\n"),
+					 progname, pid);
+		exit(1);
+	}
+
+	snprintf(demote_file, MAXPGPATH, "%s/demote", pg_data);
+
+	if ((dmtfile = fopen(demote_file, "w")) == NULL)
+	{
+		write_stderr(_("%s: could not create demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	if (fclose(dmtfile))
+	{
+		write_stderr(_("%s: could not write demote signal file \"%s\": %s\n"),
+					 progname, demote_file, strerror(errno));
+		exit(1);
+	}
+
+	sig = SIGUSR1;
+	if (kill((pid_t) pid, sig) != 0)
+	{
+		write_stderr(_("%s: could not send demote signal (PID: %d): %s\n"), progname, pid,
+					 strerror(errno));
+		exit(1);
+	}
+
+	if (!do_wait)
+	{
+		print_msg(_("server demoting\n"));
+		return;
+	}
+	else
+	{
+		// /*
+		//  * FIXME demote
+		//  * If backup_label exists, an online backup is running. Warn the user
+		//  * that smart demote will wait for it to finish. However, if the
+		//  * server is in archive recovery, we're recovering from an online
+		//  * backup instead of performing one.
+		//  */
+		// if (shutdown_mode == SMART_MODE &&
+		// 	stat(backup_file, &statbuf) == 0 &&
+		// 	get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		// {
+		// 	print_msg(_("WARNING: online backup mode is active\n"
+		// 			    "Demote will not complete until pg_stop_backup() is called.\n\n"));
+		// }
+
+		print_msg(_("waiting for server to demote..."));
+
+		for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++)
+		{
+			if (get_control_dbstate() == DB_IN_ARCHIVE_RECOVERY)
+				break;
+
+			if (cnt % WAITS_PER_SEC == 0)
+				print_msg(".");
+			pg_usleep(USEC_PER_SEC / WAITS_PER_SEC);
+		}
+
+		if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY)
+		{
+			print_msg(_(" failed\n"));
+
+			write_stderr(_("%s: server did not demote in time\n"), progname);
+			if (shutdown_mode == SMART_MODE)
+				write_stderr(_("HINT: The \"-m fast\" option immediately disconnects sessions rather than\n"
+							   "waiting for session-initiated disconnection.\n"));
+			exit(1);
+		}
+		print_msg(_(" done\n"));
+
+		print_msg(_("server demoted\n"));
+	}
+}
+
 /*
  * log rotate
  */
@@ -2378,6 +2483,8 @@ main(int argc, char **argv)
 			ctl_command = STATUS_COMMAND;
 		else if (strcmp(argv[optind], "promote") == 0)
 			ctl_command = PROMOTE_COMMAND;
+		else if (strcmp(argv[optind], "demote") == 0)
+			ctl_command = DEMOTE_COMMAND;
 		else if (strcmp(argv[optind], "logrotate") == 0)
 			ctl_command = LOGROTATE_COMMAND;
 		else if (strcmp(argv[optind], "kill") == 0)
@@ -2491,6 +2598,9 @@ main(int argc, char **argv)
 		case PROMOTE_COMMAND:
 			do_promote();
 			break;
+		case DEMOTE_COMMAND:
+			do_demote();
+			break;
 		case LOGROTATE_COMMAND:
 			do_logrotate();
 			break;
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 4e6b0eec2ff..de2008ff240 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -125,6 +125,7 @@ extern Size MultiXactShmemSize(void);
 extern void MultiXactShmemInit(void);
 extern void BootStrapMultiXact(void);
 extern void StartupMultiXact(void);
+extern void ShutdownMultiXact(void);
 extern void TrimMultiXact(void);
 extern void SetMultiXactIdLimit(MultiXactId oldest_datminmxid,
 								Oid oldest_datoid,
diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h
index 9fa82355033..2ed7a3f4b73 100644
--- a/src/include/access/twophase.h
+++ b/src/include/access/twophase.h
@@ -55,6 +55,7 @@ extern void RecoverPreparedTransactions(void);
 extern void CheckPointTwoPhase(XLogRecPtr redo_horizon);
 
 extern void FinishPreparedTransaction(const char *gid, bool isCommit);
+extern void ShutdownPreparedTransactions(void);
 
 extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn,
 						   XLogRecPtr end_lsn, RepOriginId origin_id);
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d313099c027..06d97e45fdd 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -137,18 +137,20 @@ extern PGDLLIMPORT bool XLOG_DEBUG;
 
 /* These directly affect the behavior of CreateCheckPoint and subsidiaries */
 #define CHECKPOINT_IS_SHUTDOWN	0x0001	/* Checkpoint is for shutdown */
-#define CHECKPOINT_END_OF_RECOVERY	0x0002	/* Like shutdown checkpoint, but
+#define CHECKPOINT_IS_DEMOTE	0x0002	/* Like shutdown checkpoint, but
+										 * issued at end of WAL production */
+#define CHECKPOINT_END_OF_RECOVERY	0x0004	/* Like shutdown checkpoint, but
 											 * issued at end of WAL recovery */
-#define CHECKPOINT_IMMEDIATE	0x0004	/* Do it without delays */
-#define CHECKPOINT_FORCE		0x0008	/* Force even if no activity */
-#define CHECKPOINT_FLUSH_ALL	0x0010	/* Flush all pages, including those
+#define CHECKPOINT_IMMEDIATE	0x0008	/* Do it without delays */
+#define CHECKPOINT_FORCE		0x0010	/* Force even if no activity */
+#define CHECKPOINT_FLUSH_ALL	0x0020	/* Flush all pages, including those
 										 * belonging to unlogged tables */
 /* These are important to RequestCheckpoint */
-#define CHECKPOINT_WAIT			0x0020	/* Wait for completion */
-#define CHECKPOINT_REQUESTED	0x0040	/* Checkpoint request has been made */
+#define CHECKPOINT_WAIT			0x0040	/* Wait for completion */
+#define CHECKPOINT_REQUESTED	0x0080	/* Checkpoint request has been made */
 /* These indicate the cause of a checkpoint request */
-#define CHECKPOINT_CAUSE_XLOG	0x0080	/* XLOG consumption */
-#define CHECKPOINT_CAUSE_TIME	0x0100	/* Elapsed time */
+#define CHECKPOINT_CAUSE_XLOG	0x0100	/* XLOG consumption */
+#define CHECKPOINT_CAUSE_TIME	0x0200	/* Elapsed time */
 
 /*
  * Flag bits for the record being inserted, set using XLogSetRecordFlags().
@@ -311,4 +313,7 @@ extern SessionBackupState get_backup_status(void);
 /* files to signal promotion to primary */
 #define PROMOTE_SIGNAL_FILE		"promote"
 
+/* file to signal demotion from production to standby */
+#define DEMOTE_SIGNAL_FILE		"demote"
+
 #endif							/* XLOG_H */
diff --git a/src/include/access/xlogrecovery.h b/src/include/access/xlogrecovery.h
index 91446303024..e8566b98b22 100644
--- a/src/include/access/xlogrecovery.h
+++ b/src/include/access/xlogrecovery.h
@@ -134,6 +134,7 @@ typedef struct
 extern EndOfWalRecoveryInfo *FinishWalRecovery(void);
 extern void ShutdownWalRecovery(void);
 extern void RemovePromoteSignalFiles(void);
+extern void RemoveDemoteSignalFiles(void);
 
 extern bool HotStandbyActive(void);
 extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI);
@@ -146,6 +147,7 @@ extern XLogRecPtr GetCurrentReplayRecPtr(TimeLineID *replayEndTLI);
 
 extern bool PromoteIsTriggered(void);
 extern bool CheckPromoteSignal(void);
+extern bool CheckDemoteSignal(void);
 extern void WakeupRecovery(void);
 
 extern void StartupRequestWalReceiverRestart(void);
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 63e834a6ce4..e33a8aba505 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -66,20 +66,21 @@ typedef struct CheckPoint
 
 /* XLOG info values for XLOG rmgr */
 #define XLOG_CHECKPOINT_SHUTDOWN		0x00
-#define XLOG_CHECKPOINT_ONLINE			0x10
-#define XLOG_NOOP						0x20
-#define XLOG_NEXTOID					0x30
-#define XLOG_SWITCH						0x40
-#define XLOG_BACKUP_END					0x50
-#define XLOG_PARAMETER_CHANGE			0x60
-#define XLOG_RESTORE_POINT				0x70
-#define XLOG_FPW_CHANGE					0x80
-#define XLOG_END_OF_RECOVERY			0x90
-#define XLOG_FPI_FOR_HINT				0xA0
-#define XLOG_FPI						0xB0
+#define XLOG_CHECKPOINT_DEMOTE			0x10
+#define XLOG_CHECKPOINT_ONLINE			0x20
+#define XLOG_NOOP						0x30
+#define XLOG_NEXTOID					0x40
+#define XLOG_SWITCH						0x50
+#define XLOG_BACKUP_END					0x60
+#define XLOG_PARAMETER_CHANGE			0x70
+#define XLOG_RESTORE_POINT				0x80
+#define XLOG_FPW_CHANGE					0x90
+#define XLOG_END_OF_RECOVERY			0xA0
+#define XLOG_FPI_FOR_HINT				0xB0
+#define XLOG_FPI						0xD0
 /* 0xC0 is used in Postgres 9.5-11 */
-#define XLOG_OVERWRITE_CONTRECORD		0xD0
-#define XLOG_CHECKPOINT_REDO			0xE0
+#define XLOG_OVERWRITE_CONTRECORD		0xE0
+#define XLOG_CHECKPOINT_REDO			0xF0
 
 
 /*
@@ -91,6 +92,7 @@ typedef enum DBState
 	DB_STARTUP = 0,
 	DB_SHUTDOWNED,
 	DB_SHUTDOWNED_IN_RECOVERY,
+	DB_DEMOTING,
 	DB_SHUTDOWNING,
 	DB_IN_CRASH_RECOVERY,
 	DB_IN_ARCHIVE_RECOVERY,
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 800ecbfd13b..8156fa3f225 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -42,4 +42,6 @@ extern void CheckpointerShmemInit(void);
 
 extern bool FirstCallSinceLastCheckpoint(void);
 
+extern void ReqCheckpointDemoteHandler(SIGNAL_ARGS);
+
 #endif							/* _BGWRITER_H */
diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h
index 4862b80eec3..ae4df734eff 100644
--- a/src/include/storage/lock.h
+++ b/src/include/storage/lock.h
@@ -603,6 +603,8 @@ extern void lock_twophase_postcommit(TransactionId xid, uint16 info,
 									 void *recdata, uint32 len);
 extern void lock_twophase_postabort(TransactionId xid, uint16 info,
 									void *recdata, uint32 len);
+extern void lock_twophase_shutdown(TransactionId xid, uint16 info,
+								   void *recdata, uint32 len);
 extern void lock_twophase_standby_recover(TransactionId xid, uint16 info,
 										  void *recdata, uint32 len);
 
diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h
index afeeb1ca019..12b5d062941 100644
--- a/src/include/storage/procsignal.h
+++ b/src/include/storage/procsignal.h
@@ -36,6 +36,7 @@ typedef enum
 	PROCSIG_BARRIER,			/* global barrier interrupt  */
 	PROCSIG_LOG_MEMORY_CONTEXT, /* ask backend to log the memory contexts */
 	PROCSIG_PARALLEL_APPLY_MESSAGE, /* Message from parallel apply workers */
+	PROCSIG_CHECKPOINTER_DEMOTING,	/* ask checkpointer to demote */
 
 	/* Recovery conflict reasons */
 	PROCSIG_RECOVERY_CONFLICT_FIRST,
diff --git a/src/include/tcop/backend_startup.h b/src/include/tcop/backend_startup.h
index dcb9d056643..1289c22ec0c 100644
--- a/src/include/tcop/backend_startup.h
+++ b/src/include/tcop/backend_startup.h
@@ -34,6 +34,7 @@ typedef enum CAC_state
 {
 	CAC_OK,
 	CAC_STARTUP,
+	CAC_DEMOTE,
 	CAC_SHUTDOWN,
 	CAC_RECOVERY,
 	CAC_NOTHOTSTANDBY,
diff --git a/src/include/utils/pidfile.h b/src/include/utils/pidfile.h
index c8f248fd924..ae9d0ffc102 100644
--- a/src/include/utils/pidfile.h
+++ b/src/include/utils/pidfile.h
@@ -50,6 +50,7 @@
  */
 #define PM_STATUS_STARTING		"starting"	/* still starting up */
 #define PM_STATUS_STOPPING		"stopping"	/* in shutdown sequence */
+#define PM_STATUS_DEMOTING		"demoting"	/* demote sequence */
 #define PM_STATUS_READY			"ready   "	/* ready for connections */
 #define PM_STATUS_STANDBY		"standby "	/* up, won't accept connections */
 
-- 
2.49.0
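
The demote sequence that the postmaster changes above spread over several hunks can be summarized as a small state machine. The following is only a minimal sketch of that flow as read from this draft, not code from the patch: every `Sketch*`, `SK_*`, `EV_*` and `sketch_*` name is hypothetical, and signals, child processes and failure paths are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-ins for the postmaster states used by demote. */
typedef enum
{
    SK_RUN,             /* primary, accepting connections */
    SK_STOP_BACKENDS,   /* demote requested, backends told to stop */
    SK_WAIT_BACKENDS,   /* waiting for the backends to exit */
    SK_DEMOTING,        /* checkpointer is writing the demote checkpoint */
    SK_STARTUP,         /* startup process launched again */
    SK_STANDBY          /* recovery started, demote request cleared */
} SketchState;

/* Hypothetical events; the real postmaster reacts to signals and child exits. */
typedef enum
{
    EV_DEMOTE_SIGNAL,
    EV_BACKEND_EXIT,
    EV_CHECKPOINTER_EXIT,
    EV_RECOVERY_STARTED
} SketchEvent;

typedef struct
{
    SketchState state;
    bool        demote_pending;     /* models pending_pm_demote_request */
    int         live_backends;
    bool        checkpointer_alive;
} SketchPostmaster;

/* Transitions that need no external event, in the spirit of PostmasterStateMachine(). */
static void
sketch_advance(SketchPostmaster *pm)
{
    if (pm->state == SK_STOP_BACKENDS)
        pm->state = SK_WAIT_BACKENDS;

    /* Once all backends are gone, ask the checkpointer for a demote checkpoint. */
    if (pm->state == SK_WAIT_BACKENDS &&
        pm->live_backends == 0 && pm->demote_pending)
        pm->state = SK_DEMOTING;
}

void
sketch_handle(SketchPostmaster *pm, SketchEvent ev)
{
    switch (ev)
    {
        case EV_DEMOTE_SIGNAL:
            /* A demote request is ignored unless the node is in production. */
            if (pm->state != SK_RUN)
                break;
            pm->demote_pending = true;
            pm->state = SK_STOP_BACKENDS;
            break;

        case EV_BACKEND_EXIT:
            if (pm->live_backends > 0)
                pm->live_backends--;
            break;

        case EV_CHECKPOINTER_EXIT:
            /* The checkpointer exiting means the demote checkpoint is done. */
            pm->checkpointer_alive = false;
            if (pm->state == SK_DEMOTING && pm->demote_pending)
                pm->state = SK_STARTUP;
            break;

        case EV_RECOVERY_STARTED:
            if (pm->state == SK_STARTUP)
            {
                pm->state = SK_STANDBY;
                pm->demote_pending = false;
            }
            break;
    }

    sketch_advance(pm);
}
```

Feeding the events in order (demote signal, one backend exit per live backend, checkpointer exit, recovery started) walks a node from `SK_RUN` to `SK_STANDBY`. A second demote signal is then a no-op, mirroring the "already in standby mode" branch of the patch.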

From 8882ae738219e6a9bec1f40eb67a658642ae2780 Mon Sep 17 00:00:00 2001
From: Jehan-Guillaume de Rorthais <j...@dalibo.com>
Date: Wed, 29 Jan 2025 13:03:51 +0100
Subject: [PATCH v1 2/2] Add demote tests

---
 src/test/perl/PostgreSQL/Test/Cluster.pm |  24 ++++
 src/test/recovery/t/046_demote.pl        | 165 +++++++++++++++++++++++
 2 files changed, 189 insertions(+)
 create mode 100644 src/test/recovery/t/046_demote.pl

diff --git a/src/test/perl/PostgreSQL/Test/Cluster.pm b/src/test/perl/PostgreSQL/Test/Cluster.pm
index 1c11750ac1d..2e7d8f7e950 100644
--- a/src/test/perl/PostgreSQL/Test/Cluster.pm
+++ b/src/test/perl/PostgreSQL/Test/Cluster.pm
@@ -1341,6 +1341,30 @@ sub promote
 
 =pod
 
+=item $node->demote()
+
+Wrapper for pg_ctl demote
+
+=cut
+
+sub demote
+{
+	my ($self) = @_;
+	my $pgdata = $self->data_dir;
+	my $logfile = $self->logfile;
+	my $name = $self->name;
+
+	local %ENV = $self->_get_env();
+
+	print "### Demoting node \"$name\"\n";
+
+	PostgreSQL::Test::Utils::system_or_bail('pg_ctl', '-D', $pgdata, '-l',
+		$logfile, 'demote');
+	return;
+}
+
+=pod
+
 =item $node->logrotate()
 
 Wrapper for pg_ctl logrotate
diff --git a/src/test/recovery/t/046_demote.pl b/src/test/recovery/t/046_demote.pl
new file mode 100644
index 00000000000..2b23d4603d7
--- /dev/null
+++ b/src/test/recovery/t/046_demote.pl
@@ -0,0 +1,165 @@
+
+# Copyright (c) 2021-2024, PostgreSQL Global Development Group
+
+# Test demote/promote actions in various scenarios using three
+# nodes: alpha, beta and gamma. We check the result of each action
+# and correct data replication and cascading across multiple
+# demote/promote cycles, manual switchover, smart and fast demote.
+
+use strict;
+use warnings FATAL => 'all';
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Time::HiRes qw(usleep);
+use Test::More;
+
+$ENV{PGDATABASE} = 'postgres';
+
+# Initialize node alpha
+my $node_alpha = PostgreSQL::Test::Cluster->new('alpha');
+$node_alpha->init(allows_streaming => 1);
+$node_alpha->append_conf(
+	'postgresql.conf', q[
+	max_prepared_transactions = 10
+	log_min_messages = debug1
+]);
+
+# Take backup
+my $backup_name = 'alpha_backup';
+$node_alpha->start;
+$node_alpha->backup($backup_name);
+
+# Create node beta from backup
+my $node_beta = PostgreSQL::Test::Cluster->new('beta');
+$node_beta->init_from_backup($node_alpha, $backup_name);
+$node_beta->enable_streaming($node_alpha);
+$node_beta->start;
+
+# Create node gamma from backup
+my $node_gamma = PostgreSQL::Test::Cluster->new('gamma');
+$node_gamma->init_from_backup($node_alpha, $backup_name);
+$node_gamma->enable_streaming($node_alpha);
+$node_gamma->start;
+
+# Create some 2PC on alpha for future tests
+$node_alpha->safe_psql('postgres', q{
+CREATE TABLE ins AS SELECT 1 AS i;
+BEGIN;
+CREATE TABLE new AS SELECT generate_series(1,5) AS i;
+PREPARE TRANSACTION 'pxact1';
+BEGIN;
+INSERT INTO ins VALUES (2);
+PREPARE TRANSACTION 'pxact2';
+});
+
+# Demote alpha.
+$node_alpha->demote;
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', 'Node "alpha" demoted to standby' );
+
+is( $node_alpha->safe_psql( 'postgres', "SELECT i FROM ins"),
+	'1', 'Can read from table "ins" after a demote' );
+
+# Promote alpha back to production.
+$node_alpha->promote;
+
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', 'Node "alpha" promoted after being demoted' );
+
+# Check that all prepared transactions have been restored
+is( $node_alpha->safe_psql(
+		'postgres',
+		"SELECT string_agg(gid, ',' order by gid asc) FROM pg_prepared_xacts"),
+	'pxact1,pxact2',
+	"Prepared transactions still exist after demote -> promote sequence" );
+
+# Check writing in table "ins"
+is ( $node_alpha->safe_psql( 'postgres',
+							 "insert into ins values (0) returning i"),
+	'0', 'Can write in table "ins" after demote -> promote sequence' );
+
+# Check reading from table "ins"
+is ( $node_alpha->safe_psql( 'postgres',
+	"SELECT array_agg(i::text ORDER BY i ASC) FROM ins"), '{0,1}',
+	'Can read data from "ins" after demote -> promote sequence' );
+
+# Commit one prepared transaction and check it on alpha, beta and gamma
+is ( $node_alpha->psql( 'postgres', "commit prepared 'pxact1'"), 0,
+	'Prepared transaction "pxact1" committed after a demote -> promote sequence' );
+
+is( $node_alpha->safe_psql(
+		'postgres', "SELECT 1 FROM pg_class WHERE relname = 'new'"),
+	'1', 'Table "new" created from committed prepared transaction "pxact1"' );
+
+# Check writing in table "new"
+is ( $node_alpha->safe_psql( 'postgres',
+	 "insert into new values (6) returning i"),
+	 '6', 'Can write in table "new"' );
+
+is( $node_alpha->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5,6}', "Can read data from table 'new'" );
+
+$node_alpha->wait_for_catchup($node_beta);
+$node_alpha->wait_for_catchup($node_gamma);
+
+is( $node_beta->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5,6}', 'Prepared transaction "pxact1" replicated to "beta"' );
+
+is( $node_gamma->safe_psql(
+		'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"),
+	'{1,2,3,4,5,6}', 'prepared transaction "pxact1" replicated to "gamma"' );
+
+# swap roles between alpha and beta
+
+# demote alpha and check it
+$node_alpha->demote;
+is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	't', "node alpha demoted again" );
+
+# promote beta and check it
+$node_beta->promote;
+is( $node_beta->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'),
+	'f', "node beta promoted" );
+
+# Setup alpha to replicate from beta
+$node_alpha->enable_streaming($node_beta);
+$node_alpha->reload;
+
+# check alpha is replicating from it
+$node_beta->wait_for_catchup($node_alpha);
+
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_alpha->name, 'alpha is replicating from beta' );
+
+# check gamma is still replicating from alpha
+$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive'));
+
+is( $node_alpha->safe_psql(
+		'postgres', 'SELECT application_name FROM pg_stat_replication'),
+	$node_gamma->name, 'gamma is still replicating from alpha' );
+
+# make sure the second prepared transaction is still available on beta
+is( $node_beta->safe_psql(
+		'postgres', 'SELECT gid FROM pg_prepared_xacts'),
+	'pxact2', 'Second prepared transaction still exists on "beta"' );
+
+# commit the second prepared transaction and check its result on all nodes
+$node_beta->safe_psql( 'postgres', "commit prepared 'pxact2'");
+
+is( $node_beta->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', 'prepared transaction "pxact2" committed on node "beta"' );
+
+$node_beta->wait_for_catchup($node_alpha);
+is( $node_alpha->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', 'prepared transaction "pxact2" streamed to node "alpha"' );
+
+# check the 2PC has been cascaded to gamma
+$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive'));
+is( $node_gamma->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'),
+	'1', 'Prepared transaction "pxact2" streamed to "gamma"' );
+
+done_testing();
-- 
2.49.0
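
The `$node->demote` wrapper used throughout the test returns only once `pg_ctl demote` does. Assuming pg_ctl's wait mode is in effect, `do_demote()` in the first patch polls the control file until it reports `DB_IN_ARCHIVE_RECOVERY` or the timeout expires. The following is a minimal sketch of that bounded polling loop, not the patch's code: the `sketch_*` and `SK_DB_*` names and the callback-based state reader are hypothetical, and the sleep between polls is omitted.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the control file states relevant to a demote. */
typedef enum
{
    SK_DB_IN_PRODUCTION,
    SK_DB_DEMOTING,
    SK_DB_IN_ARCHIVE_RECOVERY
} SketchDBState;

/* Hypothetical callback standing in for get_control_dbstate(). */
typedef SketchDBState (*sketch_state_reader) (void *arg);

/*
 * Poll until the node reports archive recovery or max_polls is exhausted.
 * Returns true on success; *polls_used receives the number of unsuccessful
 * polls made inside the loop.
 */
bool
sketch_wait_for_demote(sketch_state_reader read_state, void *arg,
                       int max_polls, int *polls_used)
{
    int         cnt;

    for (cnt = 0; cnt < max_polls; cnt++)
    {
        if (read_state(arg) == SK_DB_IN_ARCHIVE_RECOVERY)
        {
            *polls_used = cnt;
            return true;
        }
        /* The real loop sleeps here between two polls. */
    }

    *polls_used = cnt;

    /* Final check after the loop, as do_demote() does before reporting failure. */
    return read_state(arg) == SK_DB_IN_ARCHIVE_RECOVERY;
}

/* Example reader: reports "demoting" for a fixed number of calls, then recovery. */
SketchDBState
sketch_countdown_reader(void *arg)
{
    int        *remaining = (int *) arg;

    if (*remaining > 0)
    {
        (*remaining)--;
        return SK_DB_DEMOTING;
    }
    return SK_DB_IN_ARCHIVE_RECOVERY;
}
```

Passing the state reader as a callback keeps the loop testable without a running server. In pg_ctl itself the reader is the control file and the poll budget is `wait_seconds * WAITS_PER_SEC`.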
