Hi, Yet another summary + patch + tests.
Demote now keeps backends with no active xid alive. Smart mode keeps all backends: it waits for them to finish their xact and enter read-only. Fast mode terminate backends wit an active xid and keeps all other ones. Backends enters "read-only" using LocalXLogInsertAllowed=0 and flip it to -1 (check recovery state) once demoted. During demote, no new session is allowed. As backends with no active xid survive, a new SQL admin function "pg_demote(fast bool, wait bool, wait_seconds int)" had been added. Demote now relies on sigusr1 instead of hijacking sigterm/sigint and pmdie(). The resulting refactoring makes the code much simpler, cleaner, with better isolation of actions from the code point of view. Thanks to the refactoring, the patch now only adds one state to the state machine: PM_DEMOTING. A second one could be use to replace: /* Demoting: start the Startup Process */ if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0) with eg.: if (pmState == PM_DEMOTED) I believe it might be a bit simpler to understand, but the existing comment might be good enough as well. The full state machine path for demote is: PM_DEMOTING /* wait for active xid backend to finish */ PM_SHUTDOWN /* wait for checkpoint shutdown and its various shutdown tasks */ PM_SHUTDOWN && !CheckpointerPID /* aka PM_DEMOTED: start Startup process */ PM_STARTUP Tests in "recovery/t/021_promote-demote.pl" grows from 13 to 24 tests, adding tests on backend behaviors during demote and new function pg_demote(). On my todo: * cancel running checkpoint for fast demote ? * forbid demote when PITR backup is in progress * user documentation * Robert's concern about snapshot during hot standby * anything else reported to me Plus, I might be able to split the backend part and their signals of the patch 0002 in its own patch if it helps the review. It would apply after 0001 and before actual 0002. As there was no consensus and the discussions seemed to conclude this patch set should keep growing to see were it goes, I wonder if/when I should add it to the commitfest. Advice? Opinion? Regards,
>From da3c4575f8ea40c089483b9cfa209db4993148ff Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais <j...@dalibo.com> Date: Fri, 31 Jul 2020 10:58:40 +0200 Subject: [PATCH 1/4] demote: setter functions for LocalXLogInsert local variable Adds functions extern LocalSetXLogInsertNotAllowed() and LocalSetXLogInsertCheckRecovery() to set the local variable LocalXLogInsert respectively to 0 and -1. These functions are declared as extern for future need in the demote patch. Function LocalSetXLogInsertAllowed() already exists and declared as static as it is not needed outside of xlog.h. --- src/backend/access/transam/xlog.c | 27 +++++++++++++++++++++++---- src/include/access/xlog.h | 2 ++ 2 files changed, 25 insertions(+), 4 deletions(-) diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 756b838e6a..25a9f78690 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -7711,7 +7711,7 @@ StartupXLOG(void) Insert->fullPageWrites = lastFullPageWrites; LocalSetXLogInsertAllowed(); UpdateFullPageWrites(); - LocalXLogInsertAllowed = -1; + LocalSetXLogInsertCheckRecovery(); if (InRecovery) { @@ -8219,6 +8219,25 @@ LocalSetXLogInsertAllowed(void) InitXLOGAccess(); } +/* + * Make XLogInsertAllowed() return false in the current process only. + */ +void +LocalSetXLogInsertNotAllowed(void) +{ + LocalXLogInsertAllowed = 0; +} + +/* + * Make XLogInsertCheckRecovery() return false in the current process only. + */ +void +LocalSetXLogInsertCheckRecovery(void) +{ + LocalXLogInsertAllowed = -1; +} + + /* * Subroutine to try to fetch and validate a prior checkpoint record. * @@ -9004,9 +9023,9 @@ CreateCheckPoint(int flags) if (shutdown) { if (flags & CHECKPOINT_END_OF_RECOVERY) - LocalXLogInsertAllowed = -1; /* return to "check" state */ + LocalSetXLogInsertCheckRecovery(); /* return to "check" state */ else - LocalXLogInsertAllowed = 0; /* never again write WAL */ + LocalSetXLogInsertNotAllowed(); /* never again write WAL */ } /* @@ -9159,7 +9178,7 @@ CreateEndOfRecoveryRecord(void) END_CRIT_SECTION(); - LocalXLogInsertAllowed = -1; /* return to "check" state */ + LocalSetXLogInsertCheckRecovery(); /* return to "check" state */ } /* diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 221af87e71..8c9cadc6da 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -306,6 +306,8 @@ extern RecoveryState GetRecoveryState(void); extern bool HotStandbyActive(void); extern bool HotStandbyActiveInReplay(void); extern bool XLogInsertAllowed(void); +extern void LocalSetXLogInsertNotAllowed(void); +extern void LocalSetXLogInsertCheckRecovery(void); extern void GetXLogReceiptTime(TimestampTz *rtime, bool *fromStream); extern XLogRecPtr GetXLogReplayRecPtr(TimeLineID *replayTLI); extern XLogRecPtr GetXLogInsertRecPtr(void); -- 2.20.1
>From e25f4699deba41025e05ddf0d85755ad6dc8917e Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais <j...@dalibo.com> Date: Fri, 10 Apr 2020 18:01:45 +0200 Subject: [PATCH 2/4] demote: support demoting instance from production to standby Procedure and Architecture: * demote can be triggered using "pg_ctl [-m {fast|smart}] demote" or using the SQL admin function "pg_demote(fast bool, wait bool, wait_seconds int)" * on sigusr1, postmaster check for demote signal files "demote" or "demote_fast" in PGDATA * if in production and a demote signal file is found, the demote process starts by setting PM_DEMOTING * set DB_DEMOTING state in controlfile * PM_DEMOTING waits for all backends to be in read only * every idle backends set LocalXLogInsert=0 immediatly to forbid new writes * in fast mode, every backend holding a xid is terminated * in smart mode, wait for running xact to finish * once all backends are read only, set PM_SHUTDOWN to create a shutdown checkpoint * during the shutdown chechpoint, ShutdownXLOG now takes a boolean arg to handle demote differently than a normal shutdown * the shutdown checkpoint set the cluster state as DB_DEMOTING in the controlfile * the checkpointer then exits to have a fresh restart for code simplicity * Postmaster sets PM_STARTUP on checkpointer exit status * the startup process is then started from PostmasterStateMachine() and try to handle subsystems init correctly during demote * the demote procedure keeps some sub-processes alive: stat collector, bgwriter and optionally archiver and wal senders * at the end of demote, send USR1 to signal the backends and wal senders to set their environment as in recovery and cascading Discuss/Todo: * add doc * code reviewing * do not handle backup in progress during demote * investigate snapshots shmem needs/init during recovery compare to production * cancel running checkpoint during demote * replace with a END_OF_PRODUCTION xlog record? --- src/backend/access/transam/twophase.c | 95 +++++++ src/backend/access/transam/xlog.c | 315 ++++++++++++++++-------- src/backend/postmaster/checkpointer.c | 28 +++ src/backend/postmaster/pgstat.c | 3 + src/backend/postmaster/postmaster.c | 235 +++++++++++++++++- src/backend/storage/ipc/procarray.c | 2 + src/backend/storage/ipc/procsignal.c | 30 +++ src/backend/storage/lmgr/lock.c | 12 + src/backend/tcop/postgres.c | 44 ++++ src/backend/utils/init/globals.c | 1 + src/bin/pg_controldata/pg_controldata.c | 2 + src/bin/pg_ctl/pg_ctl.c | 117 +++++++++ src/include/access/twophase.h | 1 + src/include/access/xlog.h | 23 +- src/include/catalog/pg_control.h | 1 + src/include/libpq/libpq-be.h | 7 +- src/include/miscadmin.h | 1 + src/include/pgstat.h | 1 + src/include/postmaster/bgwriter.h | 1 + src/include/storage/lock.h | 2 + src/include/storage/procsignal.h | 4 + src/include/tcop/tcopprot.h | 2 + src/include/utils/pidfile.h | 1 + 23 files changed, 799 insertions(+), 129 deletions(-) diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c index 9b2e59bf0e..fda085631f 100644 --- a/src/backend/access/transam/twophase.c +++ b/src/backend/access/transam/twophase.c @@ -1565,6 +1565,101 @@ FinishPreparedTransaction(const char *gid, bool isCommit) pfree(buf); } +/* + * ShutdownPreparedTransactions: clean prepared from sheared memory + * + * This is called during the demote process to clean the shared memory + * before the startup process load everything back in correctly + * for the standby mode. + * + * Note: this function assue all prepared transaction have been + * written to disk. In consequence, it must be called AFTER the demote + * shutdown checkpoint. + */ +void +ShutdownPreparedTransactions(void) +{ + int i; + + for (i = 0; i < TwoPhaseState->numPrepXacts; i++) + { + GlobalTransaction gxact; + PGPROC *proc; + TransactionId xid; + char *buf; + char *bufptr; + TwoPhaseFileHeader *hdr; + TransactionId latestXid; + TransactionId *children; + + gxact = TwoPhaseState->prepXacts[i]; + proc = &ProcGlobal->allProcs[gxact->pgprocno]; + xid = ProcGlobal->allPgXact[gxact->pgprocno].xid; + + /* Read and validate 2PC state data */ + Assert(gxact->ondisk); + buf = ReadTwoPhaseFile(xid, false); + + /* + * Disassemble the header area + */ + hdr = (TwoPhaseFileHeader *) buf; + Assert(TransactionIdEquals(hdr->xid, xid)); + bufptr = buf + MAXALIGN(sizeof(TwoPhaseFileHeader)) + + MAXALIGN(hdr->gidlen); + children = (TransactionId *) bufptr; + bufptr += MAXALIGN(hdr->nsubxacts * sizeof(TransactionId)) + + MAXALIGN(hdr->ncommitrels * sizeof(RelFileNode)) + + MAXALIGN(hdr->nabortrels * sizeof(RelFileNode)) + + MAXALIGN(hdr->ninvalmsgs * sizeof(SharedInvalidationMessage)); + + /* compute latestXid among all children */ + latestXid = TransactionIdLatest(xid, hdr->nsubxacts, children); + + /* remove dummy proc associated to the gaxt */ + ProcArrayRemove(proc, latestXid); + + /* + * This lock is probably not needed during the demote process + * as all backends are already gone. + */ + LWLockAcquire(TwoPhaseStateLock, LW_EXCLUSIVE); + + /* cleanup locks */ + for (;;) + { + TwoPhaseRecordOnDisk *record = (TwoPhaseRecordOnDisk *) bufptr; + + Assert(record->rmid <= TWOPHASE_RM_MAX_ID); + if (record->rmid == TWOPHASE_RM_END_ID) + break; + + bufptr += MAXALIGN(sizeof(TwoPhaseRecordOnDisk)); + + if (record->rmid == TWOPHASE_RM_LOCK_ID) + lock_twophase_shutdown(xid, record->info, + (void *) bufptr, record->len); + + bufptr += MAXALIGN(record->len); + } + + /* and put it back in the freelist */ + gxact->next = TwoPhaseState->freeGXacts; + TwoPhaseState->freeGXacts = gxact; + + /* + * Release the lock as all callbacks are called and shared memory cleanup + * is done. + */ + LWLockRelease(TwoPhaseStateLock); + + pfree(buf); + } + + TwoPhaseState->numPrepXacts -= i; + Assert(TwoPhaseState->numPrepXacts == 0); +} + /* * Scan 2PC state data in memory and call the indicated callbacks for each 2PC record. */ diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c index 25a9f78690..4aaa138b1b 100644 --- a/src/backend/access/transam/xlog.c +++ b/src/backend/access/transam/xlog.c @@ -6298,6 +6298,11 @@ CheckRequiredParameterValues(void) /* * This must be called ONCE during postmaster or standalone-backend startup */ +/* + * FIXME demote: part of the code here assume there's no other active + * processes before signal PMSIGNAL_RECOVERY_STARTED is sent. + */ + void StartupXLOG(void) { @@ -6321,6 +6326,7 @@ StartupXLOG(void) XLogPageReadPrivate private; bool promoted = false; struct stat st; + bool is_demoting = false; /* * We should have an aux process resource owner to use, and we should not @@ -6385,6 +6391,25 @@ StartupXLOG(void) str_time(ControlFile->time)))); break; + case DB_DEMOTING: + ereport(LOG, + (errmsg("database system was demoted at %s", + str_time(ControlFile->time)))); + is_demoting = true; + bgwriterLaunched = true; + InArchiveRecovery = true; + StandbyMode = true; + + /* + * previous state was RECOVERY_STATE_DONE. We need to + * reinit it to something else so RecoveryInProgress() + * doesn't return false. + */ + SpinLockAcquire(&XLogCtl->info_lck); + XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE; + SpinLockRelease(&XLogCtl->info_lck); + break; + default: ereport(FATAL, (errmsg("control file contains invalid database cluster state"))); @@ -6418,7 +6443,8 @@ StartupXLOG(void) * persisted. To avoid that, fsync the entire data directory. */ if (ControlFile->state != DB_SHUTDOWNED && - ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY) + ControlFile->state != DB_SHUTDOWNED_IN_RECOVERY && + ControlFile->state != DB_DEMOTING) { RemoveTempXlogFiles(); SyncDataDirectory(); @@ -6674,7 +6700,8 @@ StartupXLOG(void) (errmsg("could not locate a valid checkpoint record"))); } memcpy(&checkPoint, XLogRecGetData(xlogreader), sizeof(CheckPoint)); - wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN); + wasShutdown = ((record->xl_info & ~XLR_INFO_MASK) == XLOG_CHECKPOINT_SHUTDOWN) && + !is_demoting; } /* @@ -6736,9 +6763,9 @@ StartupXLOG(void) LastRec = RecPtr = checkPointLoc; ereport(DEBUG1, - (errmsg_internal("redo record is at %X/%X; shutdown %s", + (errmsg_internal("redo record is at %X/%X; %s checkpoint", (uint32) (checkPoint.redo >> 32), (uint32) checkPoint.redo, - wasShutdown ? "true" : "false"))); + wasShutdown ? "shutdown" : is_demoting? "demote": ""))); ereport(DEBUG1, (errmsg_internal("next transaction ID: " UINT64_FORMAT "; next OID: %u", U64FromFullTransactionId(checkPoint.nextFullXid), @@ -6772,47 +6799,7 @@ StartupXLOG(void) checkPoint.newestCommitTsXid); XLogCtl->ckptFullXid = checkPoint.nextFullXid; - /* - * Initialize replication slots, before there's a chance to remove - * required resources. - */ - StartupReplicationSlots(); - - /* - * Startup logical state, needs to be setup now so we have proper data - * during crash recovery. - */ - StartupReorderBuffer(); - - /* - * Startup MultiXact. We need to do this early to be able to replay - * truncations. - */ - StartupMultiXact(); - - /* - * Ditto for commit timestamps. Activate the facility if the setting is - * enabled in the control file, as there should be no tracking of commit - * timestamps done when the setting was disabled. This facility can be - * started or stopped when replaying a XLOG_PARAMETER_CHANGE record. - */ - if (ControlFile->track_commit_timestamp) - StartupCommitTs(); - - /* - * Recover knowledge about replay progress of known replication partners. - */ - StartupReplicationOrigin(); - /* - * Initialize unlogged LSN. On a clean shutdown, it's restored from the - * control file. On recovery, all unlogged relations are blown away, so - * the unlogged LSN counter can be reset too. - */ - if (ControlFile->state == DB_SHUTDOWNED) - XLogCtl->unloggedLSN = ControlFile->unloggedLSN; - else - XLogCtl->unloggedLSN = FirstNormalUnloggedLSN; /* * We must replay WAL entries using the same TimeLineID they were created @@ -6821,19 +6808,64 @@ StartupXLOG(void) */ ThisTimeLineID = checkPoint.ThisTimeLineID; - /* - * Copy any missing timeline history files between 'now' and the recovery - * target timeline from archive to pg_wal. While we don't need those files - * ourselves - the history file of the recovery target timeline covers all - * the previous timelines in the history too - a cascading standby server - * might be interested in them. Or, if you archive the WAL from this - * server to a different archive than the primary, it'd be good for all the - * history files to get archived there after failover, so that you can use - * one of the old timelines as a PITR target. Timeline history files are - * small, so it's better to copy them unnecessarily than not copy them and - * regret later. - */ - restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI); + if (!is_demoting) + { + /* + * Initialize replication slots, before there's a chance to remove + * required resources. + */ + StartupReplicationSlots(); + + /* + * Startup logical state, needs to be setup now so we have proper data + * during crash recovery. + */ + StartupReorderBuffer(); + + /* + * Startup MultiXact. We need to do this early to be able to replay + * truncations. + */ + StartupMultiXact(); + + /* + * Ditto for commit timestamps. Activate the facility if the setting is + * enabled in the control file, as there should be no tracking of commit + * timestamps done when the setting was disabled. This facility can be + * started or stopped when replaying a XLOG_PARAMETER_CHANGE record. + */ + if (ControlFile->track_commit_timestamp) + StartupCommitTs(); + + /* + * Recover knowledge about replay progress of known replication partners. + */ + StartupReplicationOrigin(); + + /* + * Initialize unlogged LSN. On a clean shutdown, it's restored from the + * control file. On recovery, all unlogged relations are blown away, so + * the unlogged LSN counter can be reset too. + */ + if (ControlFile->state == DB_SHUTDOWNED) + XLogCtl->unloggedLSN = ControlFile->unloggedLSN; + else + XLogCtl->unloggedLSN = FirstNormalUnloggedLSN; + + /* + * Copy any missing timeline history files between 'now' and the recovery + * target timeline from archive to pg_wal. While we don't need those files + * ourselves - the history file of the recovery target timeline covers all + * the previous timelines in the history too - a cascading standby server + * might be interested in them. Or, if you archive the WAL from this + * server to a different archive than the master, it'd be good for all the + * history files to get archived there after failover, so that you can use + * one of the old timelines as a PITR target. Timeline history files are + * small, so it's better to copy them unnecessarily than not copy them and + * regret later. + */ + restoreTimeLineHistoryFiles(ThisTimeLineID, recoveryTargetTLI); + } /* * Before running in recovery, scan pg_twophase and fill in its status to @@ -6888,11 +6920,25 @@ StartupXLOG(void) dbstate_at_startup = ControlFile->state; if (InArchiveRecovery) { - ControlFile->state = DB_IN_ARCHIVE_RECOVERY; + if (is_demoting) + { + /* + * Avoid concurrent access to the ControlFile datas + * during demotion. + */ + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); + ControlFile->state = DB_IN_ARCHIVE_RECOVERY; + LWLockRelease(ControlFileLock); + } + else + { + ControlFile->state = DB_IN_ARCHIVE_RECOVERY; - SpinLockAcquire(&XLogCtl->info_lck); - XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE; - SpinLockRelease(&XLogCtl->info_lck); + /* This is already set if demoting */ + SpinLockAcquire(&XLogCtl->info_lck); + XLogCtl->SharedRecoveryState = RECOVERY_STATE_ARCHIVE; + SpinLockRelease(&XLogCtl->info_lck); + } } else { @@ -6982,7 +7028,8 @@ StartupXLOG(void) /* * Reset pgstat data, because it may be invalid after recovery. */ - pgstat_reset_all(); + if (!is_demoting) + pgstat_reset_all(); /* * If there was a backup label file, it's done its job and the info @@ -7044,7 +7091,7 @@ StartupXLOG(void) InitRecoveryTransactionEnvironment(); - if (wasShutdown) + if (wasShutdown || is_demoting) oldestActiveXID = PrescanPreparedTransactions(&xids, &nxids); else oldestActiveXID = checkPoint.oldestActiveXid; @@ -7057,6 +7104,11 @@ StartupXLOG(void) * Startup commit log and subtrans only. MultiXact and commit * timestamp have already been started up and other SLRUs are not * maintained during recovery and need not be started yet. + * + * Starting up commit log is technicaly not needed during demote + * as the in-memory data did not move. However, this is a + * lightweight initialization and this might seem expected as + * pure symmetry as ShutdownCLOG() is called during ShutdownXLog(). */ StartupCLOG(); StartupSUBTRANS(oldestActiveXID); @@ -7067,7 +7119,7 @@ StartupXLOG(void) * empty running-xacts record and use that here and now. Recover * additional standby state for prepared transactions. */ - if (wasShutdown) + if (wasShutdown || is_demoting) { RunningTransactionsData running; TransactionId latestCompletedXid; @@ -7938,6 +7990,7 @@ StartupXLOG(void) SpinLockAcquire(&XLogCtl->info_lck); XLogCtl->SharedRecoveryState = RECOVERY_STATE_DONE; + XLogCtl->SharedHotStandbyActive = false; SpinLockRelease(&XLogCtl->info_lck); UpdateControlFile(); @@ -8056,6 +8109,23 @@ CheckRecoveryConsistency(void) } } +/* + * Initialize the local TimeLineID + */ +bool +SetLocalRecoveryInProgress(void) +{ + /* + * use volatile pointer to make sure we make a fresh read of the + * shared variable. + */ + volatile XLogCtlData *xlogctl = XLogCtl; + + LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE); + + return LocalRecoveryInProgress; +} + /* * Is the system still in recovery? * @@ -8077,13 +8147,7 @@ RecoveryInProgress(void) return false; else { - /* - * use volatile pointer to make sure we make a fresh read of the - * shared variable. - */ - volatile XLogCtlData *xlogctl = XLogCtl; - - LocalRecoveryInProgress = (xlogctl->SharedRecoveryState != RECOVERY_STATE_DONE); + SetLocalRecoveryInProgress(); /* * Initialize TimeLineID and RedoRecPtr when we discover that recovery @@ -8503,6 +8567,8 @@ GetLastSegSwitchData(XLogRecPtr *lastSwitchLSN) void ShutdownXLOG(int code, Datum arg) { + bool is_demoting = DatumGetBool(arg); + /* * We should have an aux process resource owner to use, and we should not * be in a transaction that's installed some other resowner. @@ -8512,35 +8578,55 @@ ShutdownXLOG(int code, Datum arg) CurrentResourceOwner == AuxProcessResourceOwner); CurrentResourceOwner = AuxProcessResourceOwner; - /* Don't be chatty in standalone mode */ - ereport(IsPostmasterEnvironment ? LOG : NOTICE, - (errmsg("shutting down"))); - - /* - * Signal walsenders to move to stopping state. - */ - WalSndInitStopping(); - - /* - * Wait for WAL senders to be in stopping state. This prevents commands - * from writing new WAL. - */ - WalSndWaitStopping(); + if (is_demoting) + { + /* Don't be chatty in standalone mode */ + ereport(IsPostmasterEnvironment ? LOG : NOTICE, + (errmsg("demoting"))); - if (RecoveryInProgress()) - CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); + /* + * FIXME demote: avoiding checkpoint? + * A checkpoint is probably running during a demote action. If + * we don't want to wait for the checkpoint during the demote, + * we might need to cancel it as it will not be able to write + * to the WAL after the demote. + */ + CreateCheckPoint(CHECKPOINT_IS_DEMOTE | CHECKPOINT_IMMEDIATE); + ShutdownPreparedTransactions(); + LocalRecoveryInProgress = true; + } else { + /* Don't be chatty in standalone mode */ + ereport(IsPostmasterEnvironment ? LOG : NOTICE, + (errmsg("shutting down"))); + /* - * If archiving is enabled, rotate the last XLOG file so that all the - * remaining records are archived (postmaster wakes up the archiver - * process one more time at the end of shutdown). The checkpoint - * record will go to the next XLOG file and won't be archived (yet). + * Signal walsenders to move to stopping state. */ - if (XLogArchivingActive() && XLogArchiveCommandSet()) - RequestXLogSwitch(false); + WalSndInitStopping(); + + /* + * Wait for WAL senders to be in stopping state. This prevents commands + * from writing new WAL. + */ + WalSndWaitStopping(); + + if (RecoveryInProgress()) + CreateRestartPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); + else + { + /* + * If archiving is enabled, rotate the last XLOG file so that all the + * remaining records are archived (postmaster wakes up the archiver + * process one more time at the end of shutdown). The checkpoint + * record will go to the next XLOG file and won't be archived (yet). + */ + if (XLogArchivingActive() && XLogArchiveCommandSet()) + RequestXLogSwitch(false); - CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); + CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE); + } } ShutdownCLOG(); ShutdownCommitTs(); @@ -8554,9 +8640,10 @@ ShutdownXLOG(int code, Datum arg) static void LogCheckpointStart(int flags, bool restartpoint) { - elog(LOG, "%s starting:%s%s%s%s%s%s%s%s", + elog(LOG, "%s starting:%s%s%s%s%s%s%s%s%s", restartpoint ? "restartpoint" : "checkpoint", (flags & CHECKPOINT_IS_SHUTDOWN) ? " shutdown" : "", + (flags & CHECKPOINT_IS_DEMOTE) ? " demote" : "", (flags & CHECKPOINT_END_OF_RECOVERY) ? " end-of-recovery" : "", (flags & CHECKPOINT_IMMEDIATE) ? " immediate" : "", (flags & CHECKPOINT_FORCE) ? " force" : "", @@ -8692,6 +8779,7 @@ UpdateCheckPointDistanceEstimate(uint64 nbytes) * * flags is a bitwise OR of the following: * CHECKPOINT_IS_SHUTDOWN: checkpoint is for database shutdown. + * CHECKPOINT_IS_DEMOTE: checkpoint is for demote. * CHECKPOINT_END_OF_RECOVERY: checkpoint is for end of WAL recovery. * CHECKPOINT_IMMEDIATE: finish the checkpoint ASAP, * ignoring checkpoint_completion_target parameter. @@ -8720,6 +8808,7 @@ void CreateCheckPoint(int flags) { bool shutdown; + bool demote; CheckPoint checkPoint; XLogRecPtr recptr; XLogSegNo _logSegNo; @@ -8732,14 +8821,21 @@ CreateCheckPoint(int flags) int nvxids; /* - * An end-of-recovery checkpoint is really a shutdown checkpoint, just - * issued at a different time. + * An end-of-recovery or demote checkpoint is really a shutdown checkpoint, + * just issued at a different time. */ - if (flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY)) + if (flags & (CHECKPOINT_IS_SHUTDOWN | + CHECKPOINT_IS_DEMOTE | + CHECKPOINT_END_OF_RECOVERY)) shutdown = true; else shutdown = false; + if (flags & CHECKPOINT_IS_DEMOTE) + demote = true; + else + demote = false; + /* sanity check */ if (RecoveryInProgress() && (flags & CHECKPOINT_END_OF_RECOVERY) == 0) elog(ERROR, "can't create a checkpoint during recovery"); @@ -8780,7 +8876,7 @@ CreateCheckPoint(int flags) if (shutdown) { LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); - ControlFile->state = DB_SHUTDOWNING; + ControlFile->state = demote? DB_DEMOTING:DB_SHUTDOWNING; ControlFile->time = (pg_time_t) time(NULL); UpdateControlFile(); LWLockRelease(ControlFileLock); @@ -8826,7 +8922,7 @@ CreateCheckPoint(int flags) * avoid inserting duplicate checkpoints when the system is idle. */ if ((flags & (CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_END_OF_RECOVERY | - CHECKPOINT_FORCE)) == 0) + CHECKPOINT_IS_DEMOTE | CHECKPOINT_FORCE)) == 0) { if (last_important_lsn == ControlFile->checkPoint) { @@ -8994,8 +9090,8 @@ CreateCheckPoint(int flags) * allows us to reconstruct the state of running transactions during * archive recovery, if required. Skip, if this info disabled. * - * If we are shutting down, or Startup process is completing crash - * recovery we don't need to write running xact data. + * If we are shutting down, demoting or Startup process is completing + * crash recovery we don't need to write running xact data. */ if (!shutdown && XLogStandbyInfoActive()) LogStandbySnapshot(); @@ -9014,11 +9110,11 @@ CreateCheckPoint(int flags) XLogFlush(recptr); /* - * We mustn't write any new WAL after a shutdown checkpoint, or it will be - * overwritten at next startup. No-one should even try, this just allows - * sanity-checking. In the case of an end-of-recovery checkpoint, we want - * to just temporarily disable writing until the system has exited - * recovery. + * We mustn't write any new WAL after a shutdown or demote checkpoint, or + * it will be overwritten at next startup. No-one should even try, this + * just allows sanity-checking. In the case of an end-of-recovery + * checkpoint, we want to just temporarily disable writing until the system + * has exited recovery. */ if (shutdown) { @@ -9034,7 +9130,8 @@ CreateCheckPoint(int flags) */ if (shutdown && checkPoint.redo != ProcLastRecPtr) ereport(PANIC, - (errmsg("concurrent write-ahead log activity while database system is shutting down"))); + (errmsg("concurrent write-ahead log activity while database system is %s", + demote? "demoting":"shutting down"))); /* * Remember the prior checkpoint's redo ptr for @@ -9047,7 +9144,7 @@ CreateCheckPoint(int flags) */ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); if (shutdown) - ControlFile->state = DB_SHUTDOWNED; + ControlFile->state = demote? DB_DEMOTING:DB_SHUTDOWNED; ControlFile->checkPoint = ProcLastRecPtr; ControlFile->checkPointCopy = checkPoint; ControlFile->time = (pg_time_t) time(NULL); diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c index 624a3238b8..58473a61fd 100644 --- a/src/backend/postmaster/checkpointer.c +++ b/src/backend/postmaster/checkpointer.c @@ -151,6 +151,7 @@ double CheckPointCompletionTarget = 0.5; * Private state */ static bool ckpt_active = false; +static volatile sig_atomic_t demoteRequestPending = false; /* these values are valid when ckpt_active is true: */ static pg_time_t ckpt_start_time; @@ -552,6 +553,21 @@ HandleCheckpointerInterrupts(void) */ UpdateSharedMemoryConfig(); } + if (demoteRequestPending) + { + demoteRequestPending = false; + /* Close down the database */ + ShutdownXLOG(0, BoolGetDatum(true)); + /* + * Exit checkpointer. We could keep it around during demotion, but + * exiting here has multiple benefices: + * - to create a fresh process with clean local vars + * (eg. LocalRecoveryInProgress) + * - to signal postmaster the demote shutdown checkpoint is done + * and keep going with next steps of the demotion + */ + proc_exit(0); + } if (ShutdownRequestPending) { /* @@ -680,6 +696,7 @@ CheckpointWriteDelay(int flags, double progress) * in which case we just try to catch up as quickly as possible. */ if (!(flags & CHECKPOINT_IMMEDIATE) && + !demoteRequestPending && !ShutdownRequestPending && !ImmediateCheckpointRequested() && IsCheckpointOnSchedule(progress)) @@ -812,6 +829,17 @@ IsCheckpointOnSchedule(double progress) * -------------------------------- */ +/* SIGUSR1: set flag to demote */ +void +ReqCheckpointDemoteHandler(SIGNAL_ARGS) +{ + int save_errno = errno; + + demoteRequestPending = true; + + errno = save_errno; +} + /* SIGINT: set flag to run a normal checkpoint right away */ static void ReqCheckpointHandler(SIGNAL_ARGS) diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c index 15f92b66c6..d20b8d8530 100644 --- a/src/backend/postmaster/pgstat.c +++ b/src/backend/postmaster/pgstat.c @@ -3854,6 +3854,9 @@ pgstat_get_wait_ipc(WaitEventIPC w) case WAIT_EVENT_PROMOTE: event_name = "Promote"; break; + case WAIT_EVENT_DEMOTE: + event_name = "Demote"; + break; case WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT: event_name = "RecoveryConflictSnapshot"; break; diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c index 5b5fc97c72..8004770d8c 100644 --- a/src/backend/postmaster/postmaster.c +++ b/src/backend/postmaster/postmaster.c @@ -275,12 +275,13 @@ static StartupStatusEnum StartupStatus = STARTUP_NOT_RUNNING; #define ImmediateShutdown 3 static int Shutdown = NoShutdown; +static bool DemoteSignal = false; /* true on demote request */ static bool FatalError = false; /* T if recovering from backend crash */ /* - * We use a simple state machine to control startup, shutdown, and - * crash recovery (which is rather like shutdown followed by startup). + * We use a simple state machine to control startup, shutdown, demote and + * crash recovery (both are rather like shutdown followed by startup). * * After doing all the postmaster initialization work, we enter PM_STARTUP * state and the startup process is launched. The startup process begins by @@ -324,6 +325,7 @@ typedef enum { PM_INIT, /* postmaster starting */ PM_STARTUP, /* waiting for startup subprocess */ + PM_DEMOTING, /* waiting for idle or RO backends for demote */ PM_RECOVERY, /* in archive recovery mode */ PM_HOT_STANDBY, /* in hot standby mode */ PM_RUN, /* normal "database is alive" state */ @@ -414,10 +416,14 @@ static bool RandomCancelKey(int32 *cancel_key); static void signal_child(pid_t pid, int signal); static bool SignalSomeChildren(int signal, int targets); static void TerminateChildren(int signal); +static void RemoveDemoteSignalFiles(void); +static bool CheckDemoteSignal(void); + #define SignalChildren(sig) SignalSomeChildren(sig, BACKEND_TYPE_ALL) static int CountChildren(int target); +static int CountXacts(void); static bool assign_backendlist_entry(RegisteredBgWorker *rw); static void maybe_start_bgworkers(void); static bool CreateOptsFile(int argc, char *argv[], char *fullprogname); @@ -2305,6 +2311,11 @@ retry1: (errcode(ERRCODE_CANNOT_CONNECT_NOW), errmsg("the database system is starting up"))); break; + case CAC_DEMOTE: + ereport(FATAL, + (errcode(ERRCODE_CANNOT_CONNECT_NOW), + errmsg("the database system is demoting"))); + break; case CAC_SHUTDOWN: ereport(FATAL, (errcode(ERRCODE_CANNOT_CONNECT_NOW), @@ -2436,10 +2447,11 @@ canAcceptConnections(int backend_type) CAC_state result = CAC_OK; /* - * Can't start backends when in startup/shutdown/inconsistent recovery - * state. We treat autovac workers the same as user backends for this - * purpose. However, bgworkers are excluded from this test; we expect - * bgworker_should_start_now() decided whether the DB state allows them. + * Can't start backends when in startup/demote/shutdown/inconsistent + * recovery state. We treat autovac workers the same as user backends + * for this purpose. However, bgworkers are excluded from this test; + * we expect bgworker_should_start_now() decided whether the DB state + * allows them. * * In state PM_WAIT_BACKUP only superusers can connect (this must be * allowed so that a superuser can end online backup mode); we return @@ -2452,6 +2464,8 @@ canAcceptConnections(int backend_type) { if (pmState == PM_WAIT_BACKUP) result = CAC_WAITBACKUP; /* allow superusers only */ + else if (DemoteSignal) + return CAC_DEMOTE; /* demote is pending */ else if (Shutdown > NoShutdown) return CAC_SHUTDOWN; /* shutdown is pending */ else if (!FatalError && @@ -3108,7 +3122,18 @@ reaper(SIGNAL_ARGS) if (pid == CheckpointerPID) { CheckpointerPID = 0; - if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN) + if (EXIT_STATUS_0(exitstatus) && + DemoteSignal && + pmState == PM_SHUTDOWN) + { + /* + * The checkpointer exit signals the demote shutdown checkpoint + * is done. The startup recovery mode can be started from there. + */ + ereport(DEBUG1, + (errmsg_internal("checkpointer shutdown for demote"))); + } + else if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN) { /* * OK, we saw normal exit of the checkpointer after it's been @@ -3802,6 +3827,25 @@ PostmasterStateMachine(void) pmState = PM_WAIT_BACKENDS; } + if (pmState == PM_DEMOTING) + { + int numXacts = CountXacts(); + + /* + * PM_DEMOTING state ends when we have no active transactions + * and all backends set LocalXLogInsertAllowed=0 + */ + if (numXacts == 0) + { + ereport(LOG, (errmsg("all backends in read only"))); + + SendProcSignal(CheckpointerPID, PROCSIG_CHECKPOINTER_DEMOTING, InvalidBackendId); + pmState = PM_SHUTDOWN; + } + else + ereport(LOG, (errmsg("waiting for %d transactions to finish", numXacts))); + } + if (pmState == PM_WAIT_READONLY) { /* @@ -3995,6 +4039,20 @@ PostmasterStateMachine(void) (StartupStatus == STARTUP_CRASHED || !restart_after_crash)) ExitPostmaster(1); + + /* Demoting: start the Startup Process */ + if (DemoteSignal && pmState == PM_SHUTDOWN && CheckpointerPID == 0) + { + /* stop archiver process if not required during standby */ + if (!XLogArchivingAlways() && PgArchPID != 0) + signal_child(PgArchPID, SIGQUIT); + + StartupPID = StartupDataBase(); + Assert(StartupPID != 0); + StartupStatus = STARTUP_RUNNING; + pmState = PM_STARTUP; + } + /* * If we need to recover from a crash, wait for all non-syslogger children * to exit, then reset shmem and StartupDataBase. @@ -5205,8 +5263,12 @@ sigusr1_handler(SIGNAL_ARGS) * Crank up the background tasks. It doesn't matter if this fails, * we'll just try again later. */ + if (!DemoteSignal) + Assert(PgArchPID == 0); + Assert(CheckpointerPID == 0); CheckpointerPID = StartCheckpointer(); + Assert(BgWriterPID == 0); BgWriterPID = StartBackgroundWriter(); @@ -5214,8 +5276,7 @@ sigusr1_handler(SIGNAL_ARGS) * Start the archiver if we're responsible for (re-)archiving received * files. */ - Assert(PgArchPID == 0); - if (XLogArchivingAlways()) + if (PgArchPID == 0 && XLogArchivingAlways()) PgArchPID = pgarch_start(); /* @@ -5226,6 +5287,7 @@ sigusr1_handler(SIGNAL_ARGS) if (!EnableHotStandby) { AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_STANDBY); + DemoteSignal = false; #ifdef USE_SYSTEMD sd_notify(0, "READY=1"); #endif @@ -5236,11 +5298,15 @@ sigusr1_handler(SIGNAL_ARGS) if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) && pmState == PM_RECOVERY && Shutdown == NoShutdown) { + dlist_iter iter; + /* * Likewise, start other special children as needed. */ - Assert(PgStatPID == 0); - PgStatPID = pgstat_start(); + if (!DemoteSignal) + Assert(PgStatPID == 0); + if(PgStatPID == 0) + PgStatPID = pgstat_start(); ereport(LOG, (errmsg("database system is ready to accept read only connections"))); @@ -5251,7 +5317,17 @@ sigusr1_handler(SIGNAL_ARGS) sd_notify(0, "READY=1"); #endif + if (DemoteSignal) + dlist_foreach(iter, &BackendList) + { + Backend *bp = dlist_container(Backend, elem, iter.cur); + + if (!bp->dead_end && bp->bkend_type & (BACKEND_TYPE_NORMAL|BACKEND_TYPE_WALSND)) + SendProcSignal(bp->pid, PROCSIG_DEMOTED, InvalidBackendId); + } + pmState = PM_HOT_STANDBY; + DemoteSignal = false; /* Some workers may be scheduled to start now */ StartWorkerNeeded = true; } @@ -5342,6 +5418,97 @@ sigusr1_handler(SIGNAL_ARGS) signal_child(StartupPID, SIGUSR2); } + if (CheckDemoteSignal() && pmState != PM_RUN ) + { + DemoteSignal = false; + RemoveDemoteSignalFiles(); + ereport(LOG, + (errmsg("ignoring demote signal because already in standby mode"))); + } + /* received demote signal */ + else if (CheckDemoteSignal()) + { + FILE *standby_file; + dlist_iter iter; + bool fast_demote; + struct stat stat_buf; + + fast_demote = (stat(DEMOTE_FAST_SIGNAL_FILE, &stat_buf) == 0); + + DemoteSignal = true; + RemoveDemoteSignalFiles(); + + /* create the standby signal file */ + standby_file = AllocateFile(STANDBY_SIGNAL_FILE, "w"); + if (!standby_file) + { + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create file \"%s\": %m", + STANDBY_SIGNAL_FILE))); + goto out; + } + + if (FreeFile(standby_file)) + { + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write file \"%s\": %m", + STANDBY_SIGNAL_FILE))); + unlink(STANDBY_SIGNAL_FILE); + goto out; + } + + if (fast_demote == 0) + { + /* smart demote */ + ereport(LOG, (errmsg("received smart demote request"))); + + } + else + { + /* fast demote */ + ereport(LOG, (errmsg("received fast demote request"))); + } + + SignalSomeChildren(SIGTERM, + BACKEND_TYPE_AUTOVAC | BACKEND_TYPE_BGWORKER); + + /* and the autovac launcher too */ + if (AutoVacPID != 0) + signal_child(AutoVacPID, SIGTERM); + /* and the bgwriter too */ + if (BgWriterPID != 0) + signal_child(BgWriterPID, SIGTERM); + /* and the walwriter too */ + if (WalWriterPID != 0) + signal_child(WalWriterPID, SIGTERM); + + dlist_foreach(iter, &BackendList) + { + Backend *bp = dlist_container(Backend, elem, iter.cur); + + if (bp->dead_end) + continue; + /* + * Assign bkend_type for any recently announced WAL Sender + * processes. + */ + if (bp->bkend_type == BACKEND_TYPE_NORMAL && + ! IsPostmasterChildWalSender(bp->child_slot)) + SendProcSignal(bp->pid, + (fast_demote?PROCSIG_DEMOTING_FAST:PROCSIG_DEMOTING), + InvalidBackendId); + } + + pmState = PM_DEMOTING; + + /* Report status */ + AddToDataDirLockFile(LOCK_FILE_LINE_PM_STATUS, PM_STATUS_DEMOTING); + } + +out: + #ifdef WIN32 PG_SETMASK(&UnBlockSig); #endif @@ -5439,6 +5606,26 @@ CountChildren(int target) } +/* + * Count up the number of active transactions + */ +static int +CountXacts(void) +{ + int i; + int cnt = 0; + + for (i = 0; i < ProcGlobal->allProcCount; ++i) + { + PGXACT *xact = &ProcGlobal->allPgXact[i]; + if (TransactionIdIsValid(xact->xid)) + cnt++; + } + + return cnt; +} + + /* * StartChildProcess -- start an auxiliary process for the postmaster * @@ -5904,6 +6091,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time) case PM_WAIT_BACKENDS: case PM_WAIT_READONLY: case PM_WAIT_BACKUP: + case PM_DEMOTING: break; case PM_RUN: @@ -6652,3 +6840,28 @@ InitPostmasterDeathWatchHandle(void) GetLastError()))); #endif /* WIN32 */ } + +/* + * Remove the files signaling a demote request. + */ +static void +RemoveDemoteSignalFiles(void) +{ + unlink(DEMOTE_SIGNAL_FILE); + unlink(DEMOTE_FAST_SIGNAL_FILE); +} + +/* + * Check if a demote request appeared. + */ +static bool +CheckDemoteSignal(void) +{ + struct stat stat_buf; + + if (stat(DEMOTE_SIGNAL_FILE, &stat_buf) == 0 || + stat(DEMOTE_FAST_SIGNAL_FILE, &stat_buf) == 0) + return true; + + return false; +} diff --git a/src/backend/storage/ipc/procarray.c b/src/backend/storage/ipc/procarray.c index b448533564..0ccc32f4ce 100644 --- a/src/backend/storage/ipc/procarray.c +++ b/src/backend/storage/ipc/procarray.c @@ -191,6 +191,8 @@ ProcArrayShmemSize(void) size = add_size(size, mul_size(sizeof(int), PROCARRAY_MAXPROCS)); /* + * TODO demote: check safe hotStandby related init and snapshot mech. + * * During Hot Standby processing we have a data structure called * KnownAssignedXids, created in shared memory. Local data structures are * also created in various backends during GetSnapshotData(), diff --git a/src/backend/storage/ipc/procsignal.c b/src/backend/storage/ipc/procsignal.c index 4fa385b0ec..ac14c662d3 100644 --- a/src/backend/storage/ipc/procsignal.c +++ b/src/backend/storage/ipc/procsignal.c @@ -28,6 +28,7 @@ #include "storage/shmem.h" #include "storage/sinval.h" #include "tcop/tcopprot.h" +#include "postmaster/bgwriter.h" /* * The SIGUSR1 signal is multiplexed to support signaling multiple event @@ -585,6 +586,35 @@ procsignal_sigusr1_handler(SIGNAL_ARGS) if (CheckProcSignal(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN)) RecoveryConflictInterrupt(PROCSIG_RECOVERY_CONFLICT_BUFFERPIN); + /* signal checkpoint process to ignite a demote procedure */ + if (CheckProcSignal(PROCSIG_CHECKPOINTER_DEMOTING)) + ReqCheckpointDemoteHandler(PROCSIG_CHECKPOINTER_DEMOTING); + + /* + * ask backends to enter in read only by setting + * LocalXLogInsertAllowed = 0 as soon as their active xact + * finished + */ + if (CheckProcSignal(PROCSIG_DEMOTING)) + ReqDemoteHandler(PROCSIG_DEMOTING); + + /* + * ask backends to enter in read only by setting + * LocalXLogInsertAllowed = 0 if they are idle, or + * interrupt their current xact and terminate. + */ + if (CheckProcSignal(PROCSIG_DEMOTING_FAST)) + ReqDemoteHandler(PROCSIG_DEMOTING_FAST); + + /* + * demote complete. Ask beckends to rely on + * recovery status for LocalXLogInsertAllowed by + * setting it to -1. + * WAL sender set am_cascading. + */ + if (CheckProcSignal(PROCSIG_DEMOTED)) + ReqDemotedHandler(PROCSIG_DEMOTED); + SetLatch(MyLatch); latch_sigusr1_handler(); diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 95989ce79b..52f85cd1b3 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -4371,6 +4371,18 @@ lock_twophase_postabort(TransactionId xid, uint16 info, lock_twophase_postcommit(xid, info, recdata, len); } +/* + * 2PC shutdown from lock table. + * + * This is actually just the same as the COMMIT case. + */ +void +lock_twophase_shutdown(TransactionId xid, uint16 info, + void *recdata, uint32 len) +{ + lock_twophase_postcommit(xid, info, recdata, len); +} + /* * VirtualXactLockTableInsert * diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c index c9424f167c..6bd1e1e1d0 100644 --- a/src/backend/tcop/postgres.c +++ b/src/backend/tcop/postgres.c @@ -67,6 +67,7 @@ #include "rewrite/rewriteHandler.h" #include "storage/bufmgr.h" #include "storage/ipc.h" +#include "storage/pmsignal.h" #include "storage/proc.h" #include "storage/procsignal.h" #include "storage/sinval.h" @@ -3211,6 +3212,42 @@ ProcessInterrupts(void) HandleParallelMessages(); } +/* SIGUSR1: set flag to demote */ +void +ReqDemoteHandler(ProcSignalReason reason) +{ + if (MyBackendType != B_BACKEND) + return; + + if (TransactionIdIsValid(MyPgXact->xid)) + { + if (reason == PROCSIG_DEMOTING_FAST) + { + InterruptPending = true; + ProcDiePending = true; + SetLatch(MyLatch); + } + else + DemotePending = true; + } + else + LocalSetXLogInsertNotAllowed(); +} + +/* SIGUSR1: reset LocalRecoveryInProgress */ +void +ReqDemotedHandler(ProcSignalReason reason) +{ + ereport(LOG, + (errmsg("received demote complete signal"))); + + SetLocalRecoveryInProgress(); + LocalSetXLogInsertCheckRecovery(); + + if (MyBackendType == B_WAL_SENDER) + am_cascading_walsender = true; +} + /* * IA64-specific code to fetch the AR.BSP register for stack depth checks. @@ -4224,6 +4261,12 @@ PostgresMain(int argc, char *argv[], /* Send out notify signals and transmit self-notifies */ ProcessCompletedNotifies(); + if (DemotePending) { + LocalSetXLogInsertNotAllowed(); + DemotePending = false; + SendPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE); + } + /* * Also process incoming notifies, if any. This is mostly to * ensure stable behavior in tests: if any notifies were @@ -4285,6 +4328,7 @@ PostgresMain(int argc, char *argv[], { ConfigReloadPending = false; ProcessConfigFile(PGC_SIGHUP); + SetLocalRecoveryInProgress(); } /* diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index 6ab8216839..021f6af434 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -33,6 +33,7 @@ volatile sig_atomic_t ProcDiePending = false; volatile sig_atomic_t ClientConnectionLost = false; volatile sig_atomic_t IdleInTransactionSessionTimeoutPending = false; volatile sig_atomic_t ProcSignalBarrierPending = false; +volatile sig_atomic_t DemotePending = false; volatile uint32 InterruptHoldoffCount = 0; volatile uint32 QueryCancelHoldoffCount = 0; volatile uint32 CritSectionCount = 0; diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c index e73639df74..c144cc35d3 100644 --- a/src/bin/pg_controldata/pg_controldata.c +++ b/src/bin/pg_controldata/pg_controldata.c @@ -57,6 +57,8 @@ dbState(DBState state) return _("shut down"); case DB_SHUTDOWNED_IN_RECOVERY: return _("shut down in recovery"); + case DB_DEMOTING: + return _("demoting"); case DB_SHUTDOWNING: return _("shutting down"); case DB_IN_CRASH_RECOVERY: diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c index 1cdc3ebaa3..a7805bd219 100644 --- a/src/bin/pg_ctl/pg_ctl.c +++ b/src/bin/pg_ctl/pg_ctl.c @@ -62,6 +62,7 @@ typedef enum RESTART_COMMAND, RELOAD_COMMAND, STATUS_COMMAND, + DEMOTE_COMMAND, PROMOTE_COMMAND, LOGROTATE_COMMAND, KILL_COMMAND, @@ -103,6 +104,7 @@ static char version_file[MAXPGPATH]; static char pid_file[MAXPGPATH]; static char backup_file[MAXPGPATH]; static char promote_file[MAXPGPATH]; +static char demote_file[MAXPGPATH]; static char logrotate_file[MAXPGPATH]; static volatile pgpid_t postmasterPID = -1; @@ -129,6 +131,7 @@ static void do_stop(void); static void do_restart(void); static void do_reload(void); static void do_status(void); +static void do_demote(void); static void do_promote(void); static void do_logrotate(void); static void do_kill(pgpid_t pid); @@ -1029,6 +1032,115 @@ do_stop(void) } +static void +do_demote(void) +{ + int cnt; + FILE *dmtfile; + pgpid_t pid; + struct stat statbuf; + + pid = get_pgpid(false); + + if (pid == 0) /* no pid file */ + { + write_stderr(_("%s: PID file \"%s\" does not exist\n"), progname, pid_file); + write_stderr(_("Is server running?\n")); + exit(1); + } + else if (pid < 0) /* standalone backend, not postmaster */ + { + pid = -pid; + write_stderr(_("%s: cannot demote server; " + "single-user server is running (PID: %ld)\n"), + progname, pid); + exit(1); + } + + if (shutdown_mode == IMMEDIATE_MODE) + { + write_stderr(_("%s: cannot demote server using immediate mode"), + progname); + exit(1); + } + else if (shutdown_mode == FAST_MODE) + snprintf(demote_file, MAXPGPATH, "%s/demote_fast", pg_data); + else + snprintf(demote_file, MAXPGPATH, "%s/demote", pg_data); + + if ((dmtfile = fopen(demote_file, "w")) == NULL) + { + write_stderr(_("%s: could not create demote signal file \"%s\": %s\n"), + progname, demote_file, strerror(errno)); + exit(1); + } + + if (fclose(dmtfile)) + { + write_stderr(_("%s: could not write demote signal file \"%s\": %s\n"), + progname, demote_file, strerror(errno)); + exit(1); + } + + sig = SIGUSR1; + if (kill((pid_t) pid, sig) != 0) + { + write_stderr(_("%s: could not send demote signal (PID: %ld): %s\n"), progname, pid, + strerror(errno)); + exit(1); + } + + if (!do_wait) + { + print_msg(_("server demoting\n")); + return; + } + else + { + /* + * FIXME demote + * If backup_label exists, an online backup is running. Warn the user + * that smart demote will wait for it to finish. However, if the + * server is in archive recovery, we're recovering from an online + * backup instead of performing one. + */ + if (shutdown_mode == SMART_MODE && + stat(backup_file, &statbuf) == 0 && + get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY) + { + print_msg(_("WARNING: online backup mode is active\n" + "Demote will not complete until pg_stop_backup() is called.\n\n")); + } + + print_msg(_("waiting for server to demote...")); + + for (cnt = 0; cnt < wait_seconds * WAITS_PER_SEC; cnt++) + { + if (get_control_dbstate() == DB_IN_ARCHIVE_RECOVERY) + break; + + if (cnt % WAITS_PER_SEC == 0) + print_msg("."); + pg_usleep(USEC_PER_SEC / WAITS_PER_SEC); + } + + if (get_control_dbstate() != DB_IN_ARCHIVE_RECOVERY) + { + print_msg(_(" failed\n")); + + write_stderr(_("%s: server does not demote\n"), progname); + if (shutdown_mode == SMART_MODE) + write_stderr(_("HINT: The \"-m fast\" option immediately disconnects sessions rather than\n" + "waiting for session-initiated disconnection.\n")); + exit(1); + } + print_msg(_(" done\n")); + + print_msg(_("server demoted\n")); + } +} + + /* * restart/reload routines */ @@ -2447,6 +2559,8 @@ main(int argc, char **argv) ctl_command = RELOAD_COMMAND; else if (strcmp(argv[optind], "status") == 0) ctl_command = STATUS_COMMAND; + else if (strcmp(argv[optind], "demote") == 0) + ctl_command = DEMOTE_COMMAND; else if (strcmp(argv[optind], "promote") == 0) ctl_command = PROMOTE_COMMAND; else if (strcmp(argv[optind], "logrotate") == 0) @@ -2554,6 +2668,9 @@ main(int argc, char **argv) case RELOAD_COMMAND: do_reload(); break; + case DEMOTE_COMMAND: + do_demote(); + break; case PROMOTE_COMMAND: do_promote(); break; diff --git a/src/include/access/twophase.h b/src/include/access/twophase.h index 2ca71c3445..4b56f92181 100644 --- a/src/include/access/twophase.h +++ b/src/include/access/twophase.h @@ -53,6 +53,7 @@ extern void RecoverPreparedTransactions(void); extern void CheckPointTwoPhase(XLogRecPtr redo_horizon); extern void FinishPreparedTransaction(const char *gid, bool isCommit); +void ShutdownPreparedTransactions(void); extern void PrepareRedoAdd(char *buf, XLogRecPtr start_lsn, XLogRecPtr end_lsn, RepOriginId origin_id); diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h index 8c9cadc6da..b1b1ea67f9 100644 --- a/src/include/access/xlog.h +++ b/src/include/access/xlog.h @@ -219,18 +219,20 @@ extern bool XLOG_DEBUG; /* These directly affect the behavior of CreateCheckPoint and subsidiaries */ #define CHECKPOINT_IS_SHUTDOWN 0x0001 /* Checkpoint is for shutdown */ -#define CHECKPOINT_END_OF_RECOVERY 0x0002 /* Like shutdown checkpoint, but +#define CHECKPOINT_IS_DEMOTE 0x0002 /* Like shutdown checkpoint, but + * issued at end of WAL production */ +#define CHECKPOINT_END_OF_RECOVERY 0x0004 /* Like shutdown checkpoint, but * issued at end of WAL recovery */ -#define CHECKPOINT_IMMEDIATE 0x0004 /* Do it without delays */ -#define CHECKPOINT_FORCE 0x0008 /* Force even if no activity */ -#define CHECKPOINT_FLUSH_ALL 0x0010 /* Flush all pages, including those +#define CHECKPOINT_IMMEDIATE 0x0008 /* Do it without delays */ +#define CHECKPOINT_FORCE 0x0010 /* Force even if no activity */ +#define CHECKPOINT_FLUSH_ALL 0x0020 /* Flush all pages, including those * belonging to unlogged tables */ /* These are important to RequestCheckpoint */ -#define CHECKPOINT_WAIT 0x0020 /* Wait for completion */ -#define CHECKPOINT_REQUESTED 0x0040 /* Checkpoint request has been made */ +#define CHECKPOINT_WAIT 0x0040 /* Wait for completion */ +#define CHECKPOINT_REQUESTED 0x0080 /* Checkpoint request has been made */ /* These indicate the cause of a checkpoint request */ -#define CHECKPOINT_CAUSE_XLOG 0x0080 /* XLOG consumption */ -#define CHECKPOINT_CAUSE_TIME 0x0100 /* Elapsed time */ +#define CHECKPOINT_CAUSE_XLOG 0x0100 /* XLOG consumption */ +#define CHECKPOINT_CAUSE_TIME 0x0200 /* Elapsed time */ /* * Flag bits for the record being inserted, set using XLogSetRecordFlags(). @@ -301,6 +303,7 @@ extern const char *xlog_identify(uint8 info); extern void issue_xlog_fsync(int fd, XLogSegNo segno); +extern bool SetLocalRecoveryInProgress(void); extern bool RecoveryInProgress(void); extern RecoveryState GetRecoveryState(void); extern bool HotStandbyActive(void); @@ -397,4 +400,8 @@ extern SessionBackupState get_backup_status(void); /* files to signal promotion to primary */ #define PROMOTE_SIGNAL_FILE "promote" +/* files to signal demotion to standby */ +#define DEMOTE_SIGNAL_FILE "demote" +#define DEMOTE_FAST_SIGNAL_FILE "demote_fast" + #endif /* XLOG_H */ diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h index de5670e538..f529f8c7bd 100644 --- a/src/include/catalog/pg_control.h +++ b/src/include/catalog/pg_control.h @@ -87,6 +87,7 @@ typedef enum DBState DB_STARTUP = 0, DB_SHUTDOWNED, DB_SHUTDOWNED_IN_RECOVERY, + DB_DEMOTING, DB_SHUTDOWNING, DB_IN_CRASH_RECOVERY, DB_IN_ARCHIVE_RECOVERY, diff --git a/src/include/libpq/libpq-be.h b/src/include/libpq/libpq-be.h index 179ebaa104..a9e27f009e 100644 --- a/src/include/libpq/libpq-be.h +++ b/src/include/libpq/libpq-be.h @@ -70,7 +70,12 @@ typedef struct typedef enum CAC_state { - CAC_OK, CAC_STARTUP, CAC_SHUTDOWN, CAC_RECOVERY, CAC_TOOMANY, + CAC_OK, + CAC_STARTUP, + CAC_DEMOTE, + CAC_SHUTDOWN, + CAC_RECOVERY, + CAC_TOOMANY, CAC_WAITBACKUP } CAC_state; diff --git a/src/include/miscadmin.h b/src/include/miscadmin.h index 72e3352398..d60804208f 100644 --- a/src/include/miscadmin.h +++ b/src/include/miscadmin.h @@ -83,6 +83,7 @@ extern PGDLLIMPORT volatile sig_atomic_t QueryCancelPending; extern PGDLLIMPORT volatile sig_atomic_t ProcDiePending; extern PGDLLIMPORT volatile sig_atomic_t IdleInTransactionSessionTimeoutPending; extern PGDLLIMPORT volatile sig_atomic_t ProcSignalBarrierPending; +extern PGDLLIMPORT volatile sig_atomic_t DemotePending; extern PGDLLIMPORT volatile sig_atomic_t ClientConnectionLost; diff --git a/src/include/pgstat.h b/src/include/pgstat.h index 1387201382..f1c0a37e76 100644 --- a/src/include/pgstat.h +++ b/src/include/pgstat.h @@ -880,6 +880,7 @@ typedef enum WAIT_EVENT_PROCARRAY_GROUP_UPDATE, WAIT_EVENT_PROC_SIGNAL_BARRIER, WAIT_EVENT_PROMOTE, + WAIT_EVENT_DEMOTE, WAIT_EVENT_RECOVERY_CONFLICT_SNAPSHOT, WAIT_EVENT_RECOVERY_CONFLICT_TABLESPACE, WAIT_EVENT_RECOVERY_PAUSE, diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h index 0a5708b32e..4d4f0ea1dd 100644 --- a/src/include/postmaster/bgwriter.h +++ b/src/include/postmaster/bgwriter.h @@ -41,5 +41,6 @@ extern Size CheckpointerShmemSize(void); extern void CheckpointerShmemInit(void); extern bool FirstCallSinceLastCheckpoint(void); +extern void ReqCheckpointDemoteHandler(SIGNAL_ARGS); #endif /* _BGWRITER_H */ diff --git a/src/include/storage/lock.h b/src/include/storage/lock.h index fdabf42721..d3b08163a2 100644 --- a/src/include/storage/lock.h +++ b/src/include/storage/lock.h @@ -574,6 +574,8 @@ extern void lock_twophase_postcommit(TransactionId xid, uint16 info, void *recdata, uint32 len); extern void lock_twophase_postabort(TransactionId xid, uint16 info, void *recdata, uint32 len); +extern void lock_twophase_shutdown(TransactionId xid, uint16 info, + void *recdata, uint32 len); extern void lock_twophase_standby_recover(TransactionId xid, uint16 info, void *recdata, uint32 len); diff --git a/src/include/storage/procsignal.h b/src/include/storage/procsignal.h index 5cb39697f3..7264e9a705 100644 --- a/src/include/storage/procsignal.h +++ b/src/include/storage/procsignal.h @@ -34,6 +34,10 @@ typedef enum PROCSIG_PARALLEL_MESSAGE, /* message from cooperating parallel backend */ PROCSIG_WALSND_INIT_STOPPING, /* ask walsenders to prepare for shutdown */ PROCSIG_BARRIER, /* global barrier interrupt */ + PROCSIG_DEMOTING, /* ask backends to demote in smart mode */ + PROCSIG_DEMOTING_FAST, /* ask backends to demote in fast mode */ + PROCSIG_DEMOTED, /* ask backends to switch to recovery mode */ + PROCSIG_CHECKPOINTER_DEMOTING, /* ask checkpointer to demote */ /* Recovery conflict reasons */ PROCSIG_RECOVERY_CONFLICT_DATABASE, diff --git a/src/include/tcop/tcopprot.h b/src/include/tcop/tcopprot.h index bd30607b07..e5f42f9fec 100644 --- a/src/include/tcop/tcopprot.h +++ b/src/include/tcop/tcopprot.h @@ -68,6 +68,8 @@ extern void StatementCancelHandler(SIGNAL_ARGS); extern void FloatExceptionHandler(SIGNAL_ARGS) pg_attribute_noreturn(); extern void RecoveryConflictInterrupt(ProcSignalReason reason); /* called from SIGUSR1 * handler */ +extern void ReqDemoteHandler(ProcSignalReason reason); /* called from SIGUSR1 handler */ +extern void ReqDemotedHandler(ProcSignalReason reason); /* called from SIGUSR1 handler */ extern void ProcessClientReadInterrupt(bool blocked); extern void ProcessClientWriteInterrupt(bool blocked); diff --git a/src/include/utils/pidfile.h b/src/include/utils/pidfile.h index 63fefe5c4c..f761d2c4ef 100644 --- a/src/include/utils/pidfile.h +++ b/src/include/utils/pidfile.h @@ -50,6 +50,7 @@ */ #define PM_STATUS_STARTING "starting" /* still starting up */ #define PM_STATUS_STOPPING "stopping" /* in shutdown sequence */ +#define PM_STATUS_DEMOTING "demoting" /* demote sequence */ #define PM_STATUS_READY "ready " /* ready for connections */ #define PM_STATUS_STANDBY "standby " /* up, won't accept connections */ -- 2.20.1
>From 673494349f497af71978985531a1fd44b8fc71c0 Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais <j...@dalibo.com> Date: Fri, 31 Jul 2020 18:07:38 +0200 Subject: [PATCH 3/4] demote: add pg_demote() function --- src/backend/access/transam/xlogfuncs.c | 94 ++++++++++++++++++++++++++ src/backend/catalog/system_views.sql | 6 ++ src/include/catalog/pg_proc.dat | 4 ++ 3 files changed, 104 insertions(+) diff --git a/src/backend/access/transam/xlogfuncs.c b/src/backend/access/transam/xlogfuncs.c index 290658b22c..733f465d38 100644 --- a/src/backend/access/transam/xlogfuncs.c +++ b/src/backend/access/transam/xlogfuncs.c @@ -784,3 +784,97 @@ pg_promote(PG_FUNCTION_ARGS) (errmsg("server did not promote within %d seconds", wait_seconds))); PG_RETURN_BOOL(false); } + +/* + * Demotes a production server. + * + * A result of "true" means that demotion has been completed if "wait" is + * "true", or initiated if "wait" is false. + */ +Datum +pg_demote(PG_FUNCTION_ARGS) +{ + bool fast = PG_GETARG_BOOL(0); + bool wait = PG_GETARG_BOOL(1); + int wait_seconds = PG_GETARG_INT32(2); + char demote_filename[] = "demote_fast"; + FILE *demote_file; + int i; + + if (RecoveryInProgress()) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("recovery in progress"), + errhint("you can not demote while already in recovery."))); + + if (!EnableHotStandby) + ereport(ERROR, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("function pg_demote() requires hot_standby parameter to be enabled"), + errhint("The function can not return its status from a non hot_standby-enabled standby"))); + + if (wait_seconds <= 0) + ereport(ERROR, + (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE), + errmsg("\"wait_seconds\" must not be negative or zero"))); + + if (!fast) + demote_filename[6] = '\0'; + + /* create the demote signal file */ + demote_file = AllocateFile(demote_filename, "w"); + if (!demote_file) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not create file \"%s\": %m", + demote_filename))); + + if (FreeFile(demote_file)) + ereport(ERROR, + (errcode_for_file_access(), + errmsg("could not write file \"%s\": %m", + demote_filename))); + + /* signal the postmaster */ + if (kill(PostmasterPid, SIGUSR1) != 0) + { + ereport(WARNING, + (errmsg("failed to send signal to postmaster: %m"))); + (void) unlink(demote_filename); + PG_RETURN_BOOL(false); + } + + /* return immediately if waiting was not requested */ + if (!wait) + PG_RETURN_BOOL(true); + + /* wait for the amount of time wanted until demotion */ +#define WAITS_PER_SECOND 10 + for (i = 0; i < WAITS_PER_SECOND * wait_seconds; i++) + { + int rc; + + ResetLatch(MyLatch); + + if (RecoveryInProgress()) + PG_RETURN_BOOL(true); + + CHECK_FOR_INTERRUPTS(); + + rc = WaitLatch(MyLatch, + WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH, + 1000L / WAITS_PER_SECOND, + WAIT_EVENT_DEMOTE); + + /* + * Emergency bailout if postmaster has died. This is to avoid the + * necessity for manual cleanup of all postmaster children. + */ + if (rc & WL_POSTMASTER_DEATH) + PG_RETURN_BOOL(false); + } + + ereport(WARNING, + (errmsg("server did not demote within %d seconds", wait_seconds))); + PG_RETURN_BOOL(false); +} diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql index 8625cbeab6..573d7b46eb 100644 --- a/src/backend/catalog/system_views.sql +++ b/src/backend/catalog/system_views.sql @@ -1219,6 +1219,11 @@ CREATE OR REPLACE FUNCTION RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_promote' PARALLEL SAFE; +CREATE OR REPLACE FUNCTION + pg_demote(fast boolean DEFAULT true, wait boolean DEFAULT true, wait_seconds integer DEFAULT 60) + RETURNS boolean STRICT VOLATILE LANGUAGE INTERNAL AS 'pg_demote' + PARALLEL SAFE; + -- legacy definition for compatibility with 9.3 CREATE OR REPLACE FUNCTION json_populate_record(base anyelement, from_json json, use_json_as_text boolean DEFAULT false) @@ -1435,6 +1440,7 @@ REVOKE EXECUTE ON FUNCTION pg_reload_conf() FROM public; REVOKE EXECUTE ON FUNCTION pg_current_logfile() FROM public; REVOKE EXECUTE ON FUNCTION pg_current_logfile(text) FROM public; REVOKE EXECUTE ON FUNCTION pg_promote(boolean, integer) FROM public; +REVOKE EXECUTE ON FUNCTION pg_demote(boolean, boolean, integer) FROM public; REVOKE EXECUTE ON FUNCTION pg_stat_reset() FROM public; REVOKE EXECUTE ON FUNCTION pg_stat_reset_shared(text) FROM public; diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat index 082a11f270..9e4d000d00 100644 --- a/src/include/catalog/pg_proc.dat +++ b/src/include/catalog/pg_proc.dat @@ -6084,6 +6084,10 @@ proname => 'pg_promote', provolatile => 'v', prorettype => 'bool', proargtypes => 'bool int4', proargnames => '{wait,wait_seconds}', prosrc => 'pg_promote' }, +{ oid => '8967', descr => 'demote production server', + proname => 'pg_demote', provolatile => 'v', prorettype => 'bool', + proargtypes => 'bool bool int4', proargnames => '{fast,wait,wait_seconds}', + prosrc => 'pg_demote' }, { oid => '2848', descr => 'switch to new wal file', proname => 'pg_switch_wal', provolatile => 'v', prorettype => 'pg_lsn', proargtypes => '', prosrc => 'pg_switch_wal' }, -- 2.20.1
>From 4d0ad53e42f3385bab21588b7729008bcb10b6af Mon Sep 17 00:00:00 2001 From: Jehan-Guillaume de Rorthais <j...@dalibo.com> Date: Fri, 10 Jul 2020 02:00:38 +0200 Subject: [PATCH 4/4] demote: add various tests related to demote and promote actions * demote/promote with a standby replicating from the node * make sure 2PC survive a demote/promote cycle * commit 2PC and check the result * swap roles between primary and standby * make sure wal sender enters cascade mode * commit a 2PC on the new primary * confirm behavior of backends during smart/fast demote --- src/test/perl/PostgresNode.pm | 25 ++ src/test/recovery/t/021_promote-demote.pl | 287 ++++++++++++++++++++++ 2 files changed, 312 insertions(+) create mode 100644 src/test/recovery/t/021_promote-demote.pl diff --git a/src/test/perl/PostgresNode.pm b/src/test/perl/PostgresNode.pm index 8c1b77376f..4488365ffc 100644 --- a/src/test/perl/PostgresNode.pm +++ b/src/test/perl/PostgresNode.pm @@ -906,6 +906,31 @@ sub promote =pod +=item $node->demote() + +Wrapper for pg_ctl demote + +=cut + +sub demote +{ + my ($self, $mode) = @_; + my $port = $self->port; + my $pgdata = $self->data_dir; + my $logfile = $self->logfile; + my $name = $self->name; + + $mode = 'fast' unless defined $mode; + + print "### Demoting node \"$name\" using mode $mode\n"; + + TestLib::system_or_bail('pg_ctl', '-D', $pgdata, '-l', $logfile, + '-m', $mode, 'demote'); + return; +} + +=pod + =item $node->logrotate() Wrapper for pg_ctl logrotate diff --git a/src/test/recovery/t/021_promote-demote.pl b/src/test/recovery/t/021_promote-demote.pl new file mode 100644 index 0000000000..245acfb211 --- /dev/null +++ b/src/test/recovery/t/021_promote-demote.pl @@ -0,0 +1,287 @@ +# Test demote/promote actions in various scenarios using three +# nodes alpha, beta and gamma. We check proper actions results, +# correct data replication and cascade across multiple +# demote/promote, manual switchover, smart and fast demote. + +use strict; +use warnings; +use PostgresNode; +use TestLib; +use Test::More tests => 24; + +$ENV{PGDATABASE} = 'postgres'; + +# Initialize node alpha +my $node_alpha = get_new_node('alpha'); +$node_alpha->init(allows_streaming => 1); +$node_alpha->append_conf( + 'postgresql.conf', qq( + max_prepared_transactions = 10 +)); + +# Take backup +my $backup_name = 'alpha_backup'; +$node_alpha->start; +$node_alpha->backup($backup_name); + +# Create node beta from backup +my $node_beta = get_new_node('beta'); +$node_beta->init_from_backup($node_alpha, $backup_name); +$node_beta->enable_streaming($node_alpha); +$node_beta->start; + +# Create node gamma from backup +my $node_gamma = get_new_node('gamma'); +$node_gamma->init_from_backup($node_alpha, $backup_name); +$node_gamma->enable_streaming($node_alpha); +$node_gamma->start; + +# Create some 2PC on alpha for future tests +$node_alpha->safe_psql('postgres', q{ +CREATE TABLE ins AS SELECT 1 AS i; +BEGIN; +CREATE TABLE new AS SELECT generate_series(1,5) AS i; +PREPARE TRANSACTION 'pxact1'; +BEGIN; +INSERT INTO ins VALUES (2); +PREPARE TRANSACTION 'pxact2'; +}); + +# create an in idle in xact session +my ($sess1_in, $sess1_out, $sess1_err) = ('', '', ''); +my $sess1 = IPC::Run::start( + [ + 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d', + $node_alpha->connstr('postgres') + ], + '<', \$sess1_in, + '>', \$sess1_out, + '2>', \$sess1_err); + +$sess1_in = q{ +BEGIN; +CREATE TABLE public.test_aborted (i int); +SELECT pg_backend_pid(); +}; +$sess1->pump until $sess1_out =~ qr/[[:digit:]]+[\r\n]$/m; +my $sess1_pid = $sess1_out; +chomp $sess1_pid; + +# create an in idle session +my ($sess2_in, $sess2_out, $sess2_err) = ('', '', ''); +my $sess2 = IPC::Run::start( + [ + 'psql', '-X', '-qAt', '-v', 'ON_ERROR_STOP=1', '-f', '-', '-d', + $node_alpha->connstr('postgres') + ], + '<', \$sess2_in, + '>', \$sess2_out, + '2>', \$sess2_err); +$sess2_in = q{ +SELECT pg_backend_pid(); +}; +$sess2->pump until $sess2_out =~ qr/\d+\s*$/m; +my $sess2_pid = $sess2_out; +chomp $sess2_pid; + +$sess2_in = q{ +SELECT pg_is_in_recovery(); +}; +$sess2->pump until $sess2_out =~ qr/(t|f)\s*$/m; + +# idle session is not in recovery +is( $1, 'f', 'idle session is not in recovery' ); + +# Fast demote alpha. +# Secondaries beta and gamma should keep streaming from it as cascaded standbys. +# Idle in xact session should be terminate, idle session should stay alive. +$node_alpha->demote('fast'); + +is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'), + 't', 'node alpha demoted to standby' ); + +is( $node_alpha->safe_psql( + 'postgres', + 'SELECT array_agg(application_name ORDER BY application_name ASC) FROM pg_stat_replication'), + '{beta,gamma}', 'standbys keep replicating with alpha after demote' ); + +# the idle in xact session should not survive the demote +is( $node_alpha->safe_psql( + 'postgres', + qq{SELECT count(*) + FROM pg_catalog.pg_stat_activity + WHERE pid = $sess1_pid}), + '0', 'previous idle in transaction session should be terminated' ); + +# table "test_aborted" has been rollbacked +is( $node_alpha->safe_psql( + 'postgres', + q{SELECT count(*) FROM pg_catalog.pg_class + WHERE relname='test_aborted' + AND relnamespace = (SELECT oid FROM pg_namespace + WHERE nspname='public')}), + '0', 'the tansaction bas been aborted during fast demote' ); + +# the idle session should survive the demote +is( $node_alpha->safe_psql( + 'postgres', + qq{SELECT count(*) + FROM pg_catalog.pg_stat_activity + WHERE pid = $sess2_pid}), + '1', "the idle session should survive the demote: $sess2_pid" ); + +# the idle session should report in recovery +$sess2_out = ''; +$sess2_in = q{ +SELECT pg_is_in_recovery(); +}; +$sess2->pump until $sess2_out =~ qr/(t|f)\s*$/m; + +# idle session is not in recovery +is( $1, 't', 'the idle session reports in recovery' ); + +# close both sessions +$sess1_out = $sess2_out = $sess1_in = $sess2_in = ''; +$sess1->finish; +$sess2->finish; + +# Promote alpha back in production. +$node_alpha->promote; + +is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'), + 'f', "node alpha promoted" ); + +# Check all 2PC xact have been restored +is( $node_alpha->safe_psql( + 'postgres', + "SELECT string_agg(gid, ',' order by gid asc) FROM pg_prepared_xacts"), + 'pxact1,pxact2', "prepared transactions 'pxact1' and 'pxact2' exists" ); + +# Commit one 2PC and check it on alpha and beta +$node_alpha->safe_psql( 'postgres', "commit prepared 'pxact1'"); + +is( $node_alpha->safe_psql( + 'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"), + '{1,2,3,4,5}', "prepared transaction 'pxact1' commited" ); + +$node_alpha->wait_for_catchup($node_beta); +$node_alpha->wait_for_catchup($node_gamma); + +is( $node_beta->safe_psql( + 'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"), + '{1,2,3,4,5}', "prepared transaction 'pxact1' replicated to beta" ); + +is( $node_gamma->safe_psql( + 'postgres', "SELECT array_agg(i::text ORDER BY i ASC) FROM new"), + '{1,2,3,4,5}', "prepared transaction 'pxact1' replicated to gamma" ); + +# create another idle in xact session +$sess1_in = q{ +BEGIN; +CREATE TABLE public.test_succeed (i int); +SELECT pg_backend_pid(); +}; +$sess1->pump until $sess1_out =~ qr/\d+\s*$/m; +$sess1_pid = $sess1_out; +chomp $sess1_pid; + +# swap roles between alpha and beta + +# Demote alpha in smart mode. +# Don't wait for demote to complete here so we can use sess1 +# to keep doing some more write activity before commit and demote. +is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_demote(false, false)'), + 't', "demote signal sent to node alpha" ); + +# wait for the demote to begin and wait for active xact. +my $fh; +while (1) { + my $status; + open my $fh, '<', $node_alpha->data_dir . '/postmaster.pid'; + $status = $_ while <$fh>; + close $fh; + chomp($status); + last if $status eq 'demoting'; + sleep 1; +} + +# make sure the demote waits for running xacts +sleep 2; + +# test no new session possible during demote +$sess2_in = q{ +SELECT 1; +}; +$sess2->start; +$sess2->finish; +ok( $sess2_err =~ /FATAL: the database system is demoting\s$/, 'session rejected during demote process'); + +# add some write activity on demote-blocking session sess1 +$sess1_out = ''; +$sess1_in = q{ +INSERT INTO public.test_succeed VALUES (1) RETURNING i; +COMMIT; +}; +$sess1->pump until $sess1_out =~ qr/\d+\s*$/m; +$sess1->finish; + +chomp($sess1_out); +is($sess1_out, '1', 'session in active xact able to write the smart demote signal'); + +$node_alpha->poll_query_until('postgres', 'SELECT pg_is_in_recovery()', 't'); + +is( $node_alpha->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'), + 't', "node alpha demoted" ); + +# fetch the last REDO location from alpha and chek beta received everyting +my ($stdout, $stderr) = run_command([ 'pg_controldata', $node_alpha->data_dir ]); +$stdout =~ m{REDO location:\s+([0-9A-F]+/[0-9A-F]+)$}mg; +my $redo_loc = $1; + +is( $node_beta->safe_psql( + 'postgres', + "SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), '$redo_loc') > 0 "), + 't', "node beta received the demote checkpoint from alpha" ); + +# promote beta and check it +$node_beta->promote; +is( $node_beta->safe_psql( 'postgres', 'SELECT pg_is_in_recovery()'), + 'f', "node beta promoted" ); + +# Setup alpha to replicate from beta +$node_alpha->enable_streaming($node_beta); +$node_alpha->reload; + +# check alpha is replicating from it +$node_beta->wait_for_catchup($node_alpha); + +is( $node_beta->safe_psql( + 'postgres', 'SELECT application_name FROM pg_stat_replication'), + $node_alpha->name, 'alpha is replicating from beta' ); + +# check gamma is still replicating from from alpha +$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive')); + +is( $node_alpha->safe_psql( + 'postgres', 'SELECT application_name FROM pg_stat_replication'), + $node_gamma->name, 'gamma is replicating from beta' ); + +# make sure the second 2PC is still available on beta +is( $node_beta->safe_psql( + 'postgres', 'SELECT gid FROM pg_prepared_xacts'), + 'pxact2', "prepared transactions pxact2' exists" ); + +# commit the second 2PC and check its result on alpha and beta nodes +$node_beta->safe_psql( 'postgres', "commit prepared 'pxact2'"); + +is( $node_beta->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'), + '1', "prepared transaction 'pxact2' commited" ); + +$node_beta->wait_for_catchup($node_alpha); +is( $node_alpha->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'), + '1', "prepared transaction 'pxact2' streamed to alpha" ); + +# check the 2PC has been cascaded to gamma +$node_alpha->wait_for_catchup($node_gamma, 'write', $node_alpha->lsn('receive')); +is( $node_gamma->safe_psql( 'postgres', 'SELECT 1 FROM ins WHERE i=2'), + '1', "prepared transaction 'pxact2' streamed to gamma" ); -- 2.20.1