Re: Reorder shutdown sequence, to flush pgstats later

Andres Freund Wed, 08 Jan 2025 11:26:38 -0800

Hi,

On 2025-01-08 15:10:03 +0200, Heikki Linnakangas wrote:
> On 08/01/2025 04:10, Andres Freund wrote:
> > I instead opted to somewhat rearrange the shutdown sequence:
> > - I don't love the naming of the various PMState values (existing and new),
> >    but a larger renaming should probably be done separately?
>
> We currently have PM_WAIT_BACKENDS and PM_WAIT_DEAD_END. Maybe follow that
> example and name all the shutdown states after what you're waiting for:
>
>     PM_WAIT_BACKENDS, /* waiting for live backends to exit */
>     PM_WAIT_XLOG_SHUTDOWN, /* waiting for checkpointer to do shutdown ckpt
> */
>     PM_WAIT_XLOG_ARCHIVAL, /* waiting for archiver and walsenders to
>                               finish */
>     PM_WAIT_DEAD_END,      /* waiting for dead-end children to exit */
>     PM_WAIT_CHECKPOINTER,  /* waiting for checkpointer to shut down */
>     PM_NO_CHILDREN,        /* all important children have exited */


I think this is much better than before. I don't love PM_WAIT_XLOG_ARCHIVAL,
but I can't think of anything better.

It's probably best to break it out into a separate commit, see attached.


> > - There's logging in ShutdownXLOG() that's only triggered if called from
> >    checkpointer. Seems kinda odd - shouldn't we just move that to
> >    checkpointer.c?
>
> What logging do you mean? The "LOG: shutting down" message? No objections to
> moving it, although it doesn't bother me either way.

Yea. It's also an oddly unspecific message :).


> > The first two patches are just logging improvements that I found helpful to
> > write this patch:
> >
> > 0001: Instead of modifying pmState directly, do so via a new function
> >        UpdatePMState(), which logs the state transition at DEBUG2.
> >
> >        Requires a helper to stringify PMState.
> >
> >        I've written one-off versions of this patch quite a few times. 
> > Without
> >        knowing in what state postmaster is in, it's quite hard to debug the
> >        state machine.
>
> +1
>
> >     elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
> >              pmState, pmstate_name(pmState), newState, 
> > pmstate_name(newState));
>
> I think the state names are enough, no need to print the numerical values.

I thought the numeric value was mildly helpful as the numerically higher value
makes it slightly easier to understand the "direction" of state changes. But I
also wasn't sure it's worth it. Removed in the attached.


> > 0002: Make logging of postmaster signalling child processes more consistent
> >
> >        I found it somewhat hard to understand what's happening during state
> >        changes without being able to see the signals being sent. While we 
> > did
> >        have logging in SignalChildren(), we didn't in signal_child(), and 
> > most
> >        signals that are important for the shutdown sequence are sent via
> >        signal_child().
>
> +1
>
> I don't think the cases where DEBUG2 or DEBUG4 are used currently are very
> well thought out. To save a few lines of code, I'd be happy with merging
> signal_child_ext() and signal_child() and just always use DEBUG2 or DEBUG4.
> Or DEBUG3 as a compromise :-).

Heh. DEBUG2 seems a bit much compared to some other stuff, e.g. a config
reload will be logged at DEBUG2, seems a bit superfluous to then also use
DEBUG2 for all individual signals.

I was a bit hesitant to lower the level in sigquit_child(), but thinking about
it, the fact that we're gonna signal children is already obvious from the log.

So I went for your compromise position :)


> > The next is a small prerequisite:
> >
> > 0003: postmaster: Introduce variadic btmask_all_except()
> >
> >        I proposed this in another thread, where I also needed
> >        btmask_all_except3() but then removed the only call to
> >        btmask_all_except2(). Instead of adding/removing code, it seems 
> > better
> >        to just use a variadic function.
>
> Nice. A variadic btmask_add() might be handy too.

It does seem to make the code a bit more readable. See attached. Wasn't quite
sure in how many btmask_add() invocations to break it up...


> > And then 0004, the reason for this thread.

One problem I found is this comment:
         * Note: we deliberately ignore SIGTERM, because during a standard Unix
         * system shutdown cycle, init will SIGTERM all processes at once.  We
         * want to wait for the backends to exit, whereupon the postmaster will
         * tell us it's okay to shut down (via SIGUSR2).

Which would cause problems with v1 of the patch, as postmaster would write a
shutdown checkpoint in response to SIGTERM.

Unfortunately we don't have any obvious unused signals available, as SIGINT is
used to request a checkpoint.

To deal with that I added a new patch to the stack, which makes
RequestCheckpoint() use SetLatch() on checkpointer's latch to wake it up,
instead of kill(SIGINT).  Not that that matters, but using SetLatch()
directly, instead of doing SetLatch() in the signal handler, is slightly more
efficient, because it avoids a signal interrupt when checkpointer isn't
waiting on a latch.

The relevant comment mentioned "or is in process of restarting" - but that's
afaict bogus, we don't allow checkpointer to restart.

After that, postmaster can use SIGINT to request checkpointer to write the
shutdown checkpoint.


I'm currently to plan the four patches relatively soon, unless somebody speaks
up, of course. They seem fairly uncontroversial.  The renaming of the phases
doesn't need to wait much longer, I think.

The last two (0006: trigger checkpoints via SetLatch, 0007: change the
shutdown sequence), probably can use a few more eyes.

Greetings,

Andres Freund

>From 51052e513c195aa901fb0d4fbc548ff37627bccb Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 7 Jan 2025 17:59:43 -0500
Subject: [PATCH v2 1/7] postmaster: Update pmState in new function

This makes logging of state changes easier - state transitions are now logged
at DEBUG1.

Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Bertrand Drouvot <[email protected]>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/postmaster.c | 78 +++++++++++++++++++++--------
 1 file changed, 57 insertions(+), 21 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0a915dac4f7..de278b0850d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -411,6 +411,7 @@ static void HandleChildCrash(int pid, int exitstatus, const char *procname);
 static void LogChildExit(int lev, const char *procname,
 						 int pid, int exitstatus);
 static void PostmasterStateMachine(void);
+static void UpdatePMState(PMState newState);
 
 static void ExitPostmaster(int status) pg_attribute_noreturn();
 static int	ServerLoop(void);
@@ -1363,7 +1364,7 @@ PostmasterMain(int argc, char *argv[])
 	StartupPMChild = StartChildProcess(B_STARTUP);
 	Assert(StartupPMChild != NULL);
 	StartupStatus = STARTUP_RUNNING;
-	pmState = PM_STARTUP;
+	UpdatePMState(PM_STARTUP);
 
 	/* Some workers may be scheduled to start now */
 	maybe_start_bgworkers();
@@ -2099,7 +2100,7 @@ process_pm_shutdown_request(void)
 			else if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
 				/* There should be no clients, so proceed to stop children */
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 
 			/*
@@ -2133,7 +2134,7 @@ process_pm_shutdown_request(void)
 			if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
 				/* Just shut down background processes silently */
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 			else if (pmState == PM_RUN ||
 					 pmState == PM_HOT_STANDBY)
@@ -2141,7 +2142,7 @@ process_pm_shutdown_request(void)
 				/* Report that we're about to zap live client sessions */
 				ereport(LOG,
 						(errmsg("aborting any active transactions")));
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 
 			/*
@@ -2176,7 +2177,7 @@ process_pm_shutdown_request(void)
 			/* (note we don't apply send_abort_for_crash here) */
 			SetQuitSignalReason(PMQUIT_FOR_STOP);
 			TerminateChildren(SIGQUIT);
-			pmState = PM_WAIT_BACKENDS;
+			UpdatePMState(PM_WAIT_BACKENDS);
 
 			/* set stopwatch for them to die */
 			AbortStartTime = time(NULL);
@@ -2231,7 +2232,7 @@ process_pm_child_exit(void)
 				(EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
 			{
 				StartupStatus = STARTUP_NOT_RUNNING;
-				pmState = PM_WAIT_BACKENDS;
+				UpdatePMState(PM_WAIT_BACKENDS);
 				/* PostmasterStateMachine logic does the rest */
 				continue;
 			}
@@ -2243,7 +2244,7 @@ process_pm_child_exit(void)
 				StartupStatus = STARTUP_NOT_RUNNING;
 				Shutdown = Max(Shutdown, SmartShutdown);
 				TerminateChildren(SIGTERM);
-				pmState = PM_WAIT_BACKENDS;
+				UpdatePMState(PM_WAIT_BACKENDS);
 				/* PostmasterStateMachine logic does the rest */
 				continue;
 			}
@@ -2288,7 +2289,7 @@ process_pm_child_exit(void)
 				{
 					StartupStatus = STARTUP_NOT_RUNNING;
 					if (pmState == PM_STARTUP)
-						pmState = PM_WAIT_BACKENDS;
+						UpdatePMState(PM_WAIT_BACKENDS);
 				}
 				else
 					StartupStatus = STARTUP_CRASHED;
@@ -2304,7 +2305,7 @@ process_pm_child_exit(void)
 			FatalError = false;
 			AbortStartTime = 0;
 			ReachedNormalRunning = true;
-			pmState = PM_RUN;
+			UpdatePMState(PM_RUN);
 			connsAllowed = true;
 
 			/*
@@ -2377,7 +2378,7 @@ process_pm_child_exit(void)
 				 */
 				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
 
-				pmState = PM_SHUTDOWN_2;
+				UpdatePMState(PM_SHUTDOWN_2);
 			}
 			else
 			{
@@ -2729,7 +2730,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		pmState == PM_RUN ||
 		pmState == PM_STOP_BACKENDS ||
 		pmState == PM_SHUTDOWN)
-		pmState = PM_WAIT_BACKENDS;
+		UpdatePMState(PM_WAIT_BACKENDS);
 
 	/*
 	 * .. and if this doesn't happen quickly enough, now the clock is ticking
@@ -2821,7 +2822,7 @@ PostmasterStateMachine(void)
 			 * Then we're ready to stop other children.
 			 */
 			if (CountChildren(btmask(B_BACKEND)) == 0)
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 		}
 	}
 
@@ -2909,7 +2910,7 @@ PostmasterStateMachine(void)
 
 			SignalChildren(SIGTERM, targetMask);
 
-			pmState = PM_WAIT_BACKENDS;
+			UpdatePMState(PM_WAIT_BACKENDS);
 		}
 
 		/* Are any of the target processes still running? */
@@ -2920,7 +2921,7 @@ PostmasterStateMachine(void)
 				/*
 				 * Stop any dead-end children and stop creating new ones.
 				 */
-				pmState = PM_WAIT_DEAD_END;
+				UpdatePMState(PM_WAIT_DEAD_END);
 				ConfigurePostmasterWaitSet(false);
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
@@ -2945,7 +2946,7 @@ PostmasterStateMachine(void)
 				if (CheckpointerPMChild != NULL)
 				{
 					signal_child(CheckpointerPMChild, SIGUSR2);
-					pmState = PM_SHUTDOWN;
+					UpdatePMState(PM_SHUTDOWN);
 				}
 				else
 				{
@@ -2960,7 +2961,7 @@ PostmasterStateMachine(void)
 					 * for checkpointer fork failure.
 					 */
 					FatalError = true;
-					pmState = PM_WAIT_DEAD_END;
+					UpdatePMState(PM_WAIT_DEAD_END);
 					ConfigurePostmasterWaitSet(false);
 
 					/* Kill the walsenders and archiver too */
@@ -2980,7 +2981,7 @@ PostmasterStateMachine(void)
 		 */
 		if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
-			pmState = PM_WAIT_DEAD_END;
+			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
 			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
 		}
@@ -3013,7 +3014,7 @@ PostmasterStateMachine(void)
 			Assert(AutoVacLauncherPMChild == NULL);
 			Assert(SlotSyncWorkerPMChild == NULL);
 			/* syslogger is not considered here */
-			pmState = PM_NO_CHILDREN;
+			UpdatePMState(PM_NO_CHILDREN);
 		}
 	}
 
@@ -3097,7 +3098,7 @@ PostmasterStateMachine(void)
 		StartupPMChild = StartChildProcess(B_STARTUP);
 		Assert(StartupPMChild != NULL);
 		StartupStatus = STARTUP_RUNNING;
-		pmState = PM_STARTUP;
+		UpdatePMState(PM_STARTUP);
 		/* crash recovery started, reset SIGKILL flag */
 		AbortStartTime = 0;
 
@@ -3106,6 +3107,41 @@ PostmasterStateMachine(void)
 	}
 }
 
+static const char *
+pmstate_name(PMState state)
+{
+#define PM_TOSTR_CASE(state) case state: return #state
+	switch (state)
+	{
+			PM_TOSTR_CASE(PM_INIT);
+			PM_TOSTR_CASE(PM_STARTUP);
+			PM_TOSTR_CASE(PM_RECOVERY);
+			PM_TOSTR_CASE(PM_HOT_STANDBY);
+			PM_TOSTR_CASE(PM_RUN);
+			PM_TOSTR_CASE(PM_STOP_BACKENDS);
+			PM_TOSTR_CASE(PM_WAIT_BACKENDS);
+			PM_TOSTR_CASE(PM_SHUTDOWN);
+			PM_TOSTR_CASE(PM_SHUTDOWN_2);
+			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+			PM_TOSTR_CASE(PM_NO_CHILDREN);
+	}
+#undef PM_TOSTR_CASE
+
+	pg_unreachable();
+}
+
+/*
+ * Simple wrapper for updating pmState. The main reason to have this wrapper
+ * is that it makes it easy to log all state transitions.
+ */
+static void
+UpdatePMState(PMState newState)
+{
+	elog(DEBUG1, "updating PMState from %s to %s",
+		 pmstate_name(pmState), pmstate_name(newState));
+	pmState = newState;
+}
+
 /*
  * Launch background processes after state change, or relaunch after an
  * existing process has exited.
@@ -3524,7 +3560,7 @@ process_pm_pmsignal(void)
 #endif
 		}
 
-		pmState = PM_RECOVERY;
+		UpdatePMState(PM_RECOVERY);
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
@@ -3539,7 +3575,7 @@ process_pm_pmsignal(void)
 		sd_notify(0, "READY=1");
 #endif
 
-		pmState = PM_HOT_STANDBY;
+		UpdatePMState(PM_HOT_STANDBY);
 		connsAllowed = true;
 
 		/* Some workers may be scheduled to start now */
-- 
2.45.2.746.g06e570c0df.dirty

>From 28652f322f7a345a0c06cc46a9d5659afdc7f62f Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 7 Jan 2025 20:22:36 -0500
Subject: [PATCH v2 2/7] postmaster: Improve logging of signals sent by
 postmaster

Previously many, in some cases important, signals we never logged. In other
cases the signal name was only included numerically.

As part of this, change the debug log level the signal is logged at to DEBUG3,
previously some where DEBUG2, some DEBUG4.

Also move from direct use of kill() to signal the av launcher to
signal_child(). There doesn't seem to be a reason for directly using kill().

Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Bertrand Drouvot <[email protected]>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/backend/postmaster/postmaster.c | 55 ++++++++++++++++++++++-------
 1 file changed, 42 insertions(+), 13 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index de278b0850d..51ec097cd43 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1693,7 +1693,7 @@ ServerLoop(void)
 		{
 			avlauncher_needs_signal = false;
 			if (AutoVacLauncherPMChild != NULL)
-				kill(AutoVacLauncherPMChild->pid, SIGUSR2);
+				signal_child(AutoVacLauncherPMChild, SIGUSR2);
 		}
 
 #ifdef HAVE_PTHREAD_IS_THREADED_NP
@@ -3255,6 +3255,37 @@ LaunchMissingBackgroundProcesses(void)
 		maybe_start_bgworkers();
 }
 
+/*
+ * Return string representation of signal.
+ *
+ * Because this is only implemented in signals we already rely on in this file
+ * we don't need to deal with unimplemented or same-numeric-value signals (as
+ * we'd e.g. have to for EWOULDBLOCK / EAGAIN).
+ */
+static const char *
+pm_signame(int signal)
+{
+#define PM_TOSTR_CASE(state) case state: return #state
+	switch (signal)
+	{
+			PM_TOSTR_CASE(SIGABRT);
+			PM_TOSTR_CASE(SIGCHLD);
+			PM_TOSTR_CASE(SIGHUP);
+			PM_TOSTR_CASE(SIGINT);
+			PM_TOSTR_CASE(SIGKILL);
+			PM_TOSTR_CASE(SIGQUIT);
+			PM_TOSTR_CASE(SIGTERM);
+			PM_TOSTR_CASE(SIGUSR1);
+			PM_TOSTR_CASE(SIGUSR2);
+		default:
+			/* all signals sent by postmaster should be listed here */
+			Assert(false);
+			return "(unknown)";
+	}
+#undef PM_TOSTR_CASE
+	pg_unreachable();
+}
+
 /*
  * Send a signal to a postmaster child process
  *
@@ -3276,6 +3307,12 @@ signal_child(PMChild *pmchild, int signal)
 {
 	pid_t		pid = pmchild->pid;
 
+	ereport(DEBUG3,
+			(errmsg_internal("sending signal %d/%s to %s process %d",
+							 signal, pm_signame(signal),
+							 GetBackendTypeDesc(pmchild->bkend_type),
+							 (int) pmchild->pid)));
+
 	if (kill(pid, signal) < 0)
 		elog(DEBUG3, "kill(%ld,%d) failed: %m", (long) pid, signal);
 #ifdef HAVE_SETSID
@@ -3297,19 +3334,14 @@ signal_child(PMChild *pmchild, int signal)
 
 /*
  * Convenience function for killing a child process after a crash of some
- * other child process.  We log the action at a higher level than we would
- * otherwise do, and we apply send_abort_for_crash to decide which signal
- * to send.  Normally it's SIGQUIT -- and most other comments in this file
- * are written on the assumption that it is -- but developers might prefer
- * to use SIGABRT to collect per-child core dumps.
+ * other child process.  We apply send_abort_for_crash to decide which signal
+ * to send.  Normally it's SIGQUIT -- and most other comments in this file are
+ * written on the assumption that it is -- but developers might prefer to use
+ * SIGABRT to collect per-child core dumps.
  */
 static void
 sigquit_child(PMChild *pmchild)
 {
-	ereport(DEBUG2,
-			(errmsg_internal("sending %s to process %d",
-							 (send_abort_for_crash ? "SIGABRT" : "SIGQUIT"),
-							 (int) pmchild->pid)));
 	signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
 }
 
@@ -3341,9 +3373,6 @@ SignalChildren(int signal, BackendTypeMask targetMask)
 		if (!btmask_contains(targetMask, bp->bkend_type))
 			continue;
 
-		ereport(DEBUG4,
-				(errmsg_internal("sending signal %d to %s process %d",
-								 signal, GetBackendTypeDesc(bp->bkend_type), (int) bp->pid)));
 		signal_child(bp, signal);
 		signaled = true;
 	}
-- 
2.45.2.746.g06e570c0df.dirty

>From 0ce39702fc396075ee87f7063e775302fd6fb791 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 7 Jan 2025 19:51:00 -0500
Subject: [PATCH v2 3/7] postmaster: Introduce variadic btmask_all_except()

Upcoming patches would otherwise need btmask_all_except3().

Reviewed-by: Heikki Linnakangas <[email protected]>
Discussion: https://postgr.es/m/w3z6w3g4aovivs735nk4pzjhmegntecesm3kktpebchegm5o53@aonnq2kn27xi
---
 src/backend/postmaster/postmaster.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 51ec097cd43..f9d6c552341 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -164,23 +164,20 @@ btmask_del(BackendTypeMask mask, BackendType t)
 }
 
 static inline BackendTypeMask
-btmask_all_except(BackendType t)
+btmask_all_except_n(int nargs, BackendType *t)
 {
 	BackendTypeMask mask = BTYPE_MASK_ALL;
 
-	mask = btmask_del(mask, t);
+	for (int i = 0; i < nargs; i++)
+		mask = btmask_del(mask, t[i]);
 	return mask;
 }
 
-static inline BackendTypeMask
-btmask_all_except2(BackendType t1, BackendType t2)
-{
-	BackendTypeMask mask = BTYPE_MASK_ALL;
-
-	mask = btmask_del(mask, t1);
-	mask = btmask_del(mask, t2);
-	return mask;
-}
+#define btmask_all_except(...) \
+	btmask_all_except_n( \
+		lengthof(((BackendType[]){__VA_ARGS__})), \
+		(BackendType[]){__VA_ARGS__} \
+	)
 
 static inline bool
 btmask_contains(BackendTypeMask mask, BackendType t)
@@ -2979,7 +2976,7 @@ PostmasterStateMachine(void)
 		 * left by now anyway; what we're really waiting for is walsenders and
 		 * archiver.
 		 */
-		if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
 			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
-- 
2.45.2.746.g06e570c0df.dirty

>From d463ec7e1dfaae92d9bd4626448688b9269ebb1b Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Wed, 8 Jan 2025 14:10:49 -0500
Subject: [PATCH v2 4/7] postmaster: Make btmask_add() variadic

Suggested-by: Heikki Linnakangas <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/backend/postmaster/postmaster.c | 54 ++++++++++++++++++-----------
 1 file changed, 34 insertions(+), 20 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index f9d6c552341..05e93ac167b 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -150,12 +150,19 @@ btmask(BackendType t)
 }
 
 static inline BackendTypeMask
-btmask_add(BackendTypeMask mask, BackendType t)
+btmask_add_n(BackendTypeMask mask, int nargs, BackendType *t)
 {
-	mask.mask |= 1 << t;
+	for (int i = 0; i < nargs; i++)
+		mask.mask |= 1 << t[i];
 	return mask;
 }
 
+#define btmask_add(mask, ...) \
+	btmask_add_n(mask, \
+		lengthof(((BackendType[]){__VA_ARGS__})), \
+		(BackendType[]){__VA_ARGS__} \
+	)
+
 static inline BackendTypeMask
 btmask_del(BackendTypeMask mask, BackendType t)
 {
@@ -2840,22 +2847,27 @@ PostmasterStateMachine(void)
 		/*
 		 * PM_WAIT_BACKENDS state ends when we have no regular backends, no
 		 * autovac launcher or workers, and no bgworkers (including
-		 * unconnected ones).  No walwriter, bgwriter, slot sync worker, or
-		 * WAL summarizer either.
+		 * unconnected ones).
 		 */
-		targetMask = btmask_add(targetMask, B_BACKEND);
-		targetMask = btmask_add(targetMask, B_AUTOVAC_LAUNCHER);
-		targetMask = btmask_add(targetMask, B_AUTOVAC_WORKER);
-		targetMask = btmask_add(targetMask, B_BG_WORKER);
+		targetMask = btmask_add(targetMask,
+								B_BACKEND,
+								B_AUTOVAC_LAUNCHER,
+								B_AUTOVAC_WORKER,
+								B_BG_WORKER);
 
-		targetMask = btmask_add(targetMask, B_WAL_WRITER);
-		targetMask = btmask_add(targetMask, B_BG_WRITER);
-		targetMask = btmask_add(targetMask, B_SLOTSYNC_WORKER);
-		targetMask = btmask_add(targetMask, B_WAL_SUMMARIZER);
+		/*
+		 * No walwriter, bgwriter, slot sync worker, or WAL summarizer either.
+		 */
+		targetMask = btmask_add(targetMask,
+								B_WAL_WRITER,
+								B_BG_WRITER,
+								B_SLOTSYNC_WORKER,
+								B_WAL_SUMMARIZER);
 
 		/* If we're in recovery, also stop startup and walreceiver procs */
-		targetMask = btmask_add(targetMask, B_STARTUP);
-		targetMask = btmask_add(targetMask, B_WAL_RECEIVER);
+		targetMask = btmask_add(targetMask,
+								B_STARTUP,
+								B_WAL_RECEIVER);
 
 		/*
 		 * If we are doing crash recovery or an immediate shutdown then we
@@ -2878,17 +2890,19 @@ PostmasterStateMachine(void)
 		{
 			BackendTypeMask remainMask = BTYPE_MASK_NONE;
 
-			remainMask = btmask_add(remainMask, B_WAL_SENDER);
-			remainMask = btmask_add(remainMask, B_ARCHIVER);
-			remainMask = btmask_add(remainMask, B_DEAD_END_BACKEND);
-			remainMask = btmask_add(remainMask, B_LOGGER);
+			remainMask = btmask_add(remainMask,
+									B_WAL_SENDER,
+									B_ARCHIVER,
+									B_DEAD_END_BACKEND,
+									B_LOGGER);
 
 			/* checkpointer may or may not be in targetMask already */
 			remainMask = btmask_add(remainMask, B_CHECKPOINTER);
 
 			/* these are not real postmaster children */
-			remainMask = btmask_add(remainMask, B_INVALID);
-			remainMask = btmask_add(remainMask, B_STANDALONE_BACKEND);
+			remainMask = btmask_add(remainMask,
+									B_INVALID,
+									B_STANDALONE_BACKEND);
 
 			/* All types should be included in targetMask or remainMask */
 			Assert((remainMask.mask | targetMask.mask) == BTYPE_MASK_ALL.mask);
-- 
2.45.2.746.g06e570c0df.dirty

>From 8781ea45bb8e4f25215ebf5a4e31a3476dec6d99 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Wed, 8 Jan 2025 14:05:57 -0500
Subject: [PATCH v2 5/7] postmaster: Rename some shutdown related PMState phase
 names

The previous names weren't particularly clear. Future patches will add more
shutdown phases, making it even more important to have understandable shutdown
phases.

Suggested-by: Heikki Linnakangas <[email protected]>
Discussion: https://postgr.es/m/[email protected]
---
 src/backend/postmaster/postmaster.c | 48 +++++++++++++++--------------
 src/backend/replication/README      |  4 +--
 2 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 05e93ac167b..58f2619c45f 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -316,9 +316,10 @@ static bool FatalError = false; /* T if recovering from backend crash */
  * Notice that this state variable does not distinguish *why* we entered
  * states later than PM_RUN --- Shutdown and FatalError must be consulted
  * to find that out.  FatalError is never true in PM_RECOVERY, PM_HOT_STANDBY,
- * or PM_RUN states, nor in PM_SHUTDOWN states (because we don't enter those
- * states when trying to recover from a crash).  It can be true in PM_STARTUP
- * state, because we don't clear it until we've successfully started WAL redo.
+ * or PM_RUN states, nor in PM_WAIT_XLOG_SHUTDOWN states (because we don't
+ * enter those states when trying to recover from a crash).  It can be true in
+ * PM_STARTUP state, because we don't clear it until we've successfully
+ * started WAL redo.
  */
 typedef enum
 {
@@ -329,9 +330,9 @@ typedef enum
 	PM_RUN,						/* normal "database is alive" state */
 	PM_STOP_BACKENDS,			/* need to stop remaining backends */
 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
-	PM_SHUTDOWN,				/* waiting for checkpointer to do shutdown
+	PM_WAIT_XLOG_SHUTDOWN,		/* waiting for checkpointer to do shutdown
 								 * ckpt */
-	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
+	PM_WAIT_XLOG_ARCHIVAL,		/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
 	PM_NO_CHILDREN,				/* all important children have exited */
@@ -2354,7 +2355,7 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
@@ -2363,8 +2364,8 @@ process_pm_child_exit(void)
 				 * occur on next postmaster start.)
 				 *
 				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_SHUTDOWN state) but we might
-				 * have dead-end children to wait for.
+				 * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
+				 * we might have dead-end children to wait for.
 				 *
 				 * If we have an archiver subprocess, tell it to do a last
 				 * archive cycle and quit. Likewise, if we have walsender
@@ -2382,7 +2383,7 @@ process_pm_child_exit(void)
 				 */
 				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
 
-				UpdatePMState(PM_SHUTDOWN_2);
+				UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
 			}
 			else
 			{
@@ -2733,7 +2734,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		pmState == PM_HOT_STANDBY ||
 		pmState == PM_RUN ||
 		pmState == PM_STOP_BACKENDS ||
-		pmState == PM_SHUTDOWN)
+		pmState == PM_WAIT_XLOG_SHUTDOWN)
 		UpdatePMState(PM_WAIT_BACKENDS);
 
 	/*
@@ -2957,7 +2958,7 @@ PostmasterStateMachine(void)
 				if (CheckpointerPMChild != NULL)
 				{
 					signal_child(CheckpointerPMChild, SIGUSR2);
-					UpdatePMState(PM_SHUTDOWN);
+					UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
 				}
 				else
 				{
@@ -2982,13 +2983,13 @@ PostmasterStateMachine(void)
 		}
 	}
 
-	if (pmState == PM_SHUTDOWN_2)
+	if (pmState == PM_WAIT_XLOG_ARCHIVAL)
 	{
 		/*
-		 * PM_SHUTDOWN_2 state ends when there's no other children than
-		 * dead-end children left. There shouldn't be any regular backends
-		 * left by now anyway; what we're really waiting for is walsenders and
-		 * archiver.
+		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
+		 * than dead-end children left. There shouldn't be any regular
+		 * backends left by now anyway; what we're really waiting for is
+		 * walsenders and archiver.
 		 */
 		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
@@ -3131,8 +3132,8 @@ pmstate_name(PMState state)
 			PM_TOSTR_CASE(PM_RUN);
 			PM_TOSTR_CASE(PM_STOP_BACKENDS);
 			PM_TOSTR_CASE(PM_WAIT_BACKENDS);
-			PM_TOSTR_CASE(PM_SHUTDOWN);
-			PM_TOSTR_CASE(PM_SHUTDOWN_2);
+			PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
+			PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
 			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
 			PM_TOSTR_CASE(PM_NO_CHILDREN);
 	}
@@ -3172,9 +3173,10 @@ LaunchMissingBackgroundProcesses(void)
 	 * The checkpointer and the background writer are active from the start,
 	 * until shutdown is initiated.
 	 *
-	 * (If the checkpointer is not running when we enter the PM_SHUTDOWN
-	 * state, it is launched one more time to perform the shutdown checkpoint.
-	 * That's done in PostmasterStateMachine(), not here.)
+	 * (If the checkpointer is not running when we enter the
+	 * PM_WAIT_XLOG_SHUTDOWN state, it is launched one more time to perform
+	 * the shutdown checkpoint.  That's done in PostmasterStateMachine(), not
+	 * here.)
 	 */
 	if (pmState == PM_RUN || pmState == PM_RECOVERY ||
 		pmState == PM_HOT_STANDBY || pmState == PM_STARTUP)
@@ -3994,8 +3996,8 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	{
 		case PM_NO_CHILDREN:
 		case PM_WAIT_DEAD_END:
-		case PM_SHUTDOWN_2:
-		case PM_SHUTDOWN:
+		case PM_WAIT_XLOG_ARCHIVAL:
+		case PM_WAIT_XLOG_SHUTDOWN:
 		case PM_WAIT_BACKENDS:
 		case PM_STOP_BACKENDS:
 			break;
diff --git a/src/backend/replication/README b/src/backend/replication/README
index 8fcd78da9aa..f05d2aae5f9 100644
--- a/src/backend/replication/README
+++ b/src/backend/replication/README
@@ -45,8 +45,8 @@ shutdown checkpoint and terminating pgarch and other auxiliary processes, but
 that's not desirable for walsenders, because we want the standby servers to
 receive all the WAL, including the shutdown checkpoint, before the primary
 is shut down. Therefore postmaster treats walsenders like the pgarch process,
-and instructs them to terminate at PM_SHUTDOWN_2 phase, after all regular
-backends have died and checkpointer has issued the shutdown checkpoint.
+and instructs them to terminate at the PM_WAIT_XLOG_ARCHIVAL phase, after all
+regular backends have died and checkpointer has issued the shutdown checkpoint.
 
 When postmaster accepts a connection, it immediately forks a new process
 to handle the handshake and authentication, and the process initializes to
-- 
2.45.2.746.g06e570c0df.dirty

>From e956c53bea042524f7905e7d457149a559322d5d Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Wed, 8 Jan 2025 13:32:29 -0500
Subject: [PATCH v2 6/7] checkpointer: Request checkpoint via latch instead of
 signal

The main reason for this is that a future commit would like to use SIGINT for
another purpose. But it's also a tad nicer and tad more efficient to use
SetLatch(), as it avoids a signal when checkpointer already is busy.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/checkpointer.c | 60 +++++++++------------------
 1 file changed, 19 insertions(+), 41 deletions(-)

diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..cd76315bfa4 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -159,9 +159,6 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
-/* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
-
 
 /*
  * Main entry point for checkpointer process
@@ -191,7 +188,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+	pqsignal(SIGINT, SIG_IGN);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -860,23 +857,6 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
-/* --------------------------------
- *		signal handler routines
- * --------------------------------
- */
-
-/* SIGINT: set flag to run a normal checkpoint right away */
-static void
-ReqCheckpointHandler(SIGNAL_ARGS)
-{
-	/*
-	 * The signaling process should have set ckpt_flags nonzero, so all we
-	 * need do is ensure that our main loop gets kicked out of any wait.
-	 */
-	SetLatch(MyLatch);
-}
-
-
 /* --------------------------------
  *		communication with backends
  * --------------------------------
@@ -990,38 +970,36 @@ RequestCheckpoint(int flags)
 	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
 
 	/*
-	 * Send signal to request checkpoint.  It's possible that the checkpointer
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  (Actually, more than a few times, since on slow
-	 * or overloaded buildfarm machines, it's been observed that the
-	 * checkpointer can take several seconds to start.)  However, if not told
-	 * to wait for the checkpoint to occur, we consider failure to send the
-	 * signal to be nonfatal and merely LOG it.  The checkpointer should see
-	 * the request when it does start, with or without getting a signal.
+	 * Set checkpointer's latch to request checkpoint.  It's possible that the
+	 * checkpointer hasn't started yet, so we will retry a few times if
+	 * needed.  (Actually, more than a few times, since on slow or overloaded
+	 * buildfarm machines, it's been observed that the checkpointer can take
+	 * several seconds to start.)  However, if not told to wait for the
+	 * checkpoint to occur, we consider failure to set the latch to be
+	 * nonfatal and merely LOG it.  The checkpointer should see the request
+	 * when it does start, with or without getting a signal.
 	 */
 #define MAX_SIGNAL_TRIES 600	/* max wait 60.0 sec */
 	for (ntries = 0;; ntries++)
 	{
-		if (CheckpointerShmem->checkpointer_pid == 0)
+		volatile PROC_HDR *procglobal = ProcGlobal;
+		ProcNumber	checkpointerProc = procglobal->checkpointerProc;
+
+		if (checkpointerProc == INVALID_PROC_NUMBER)
 		{
 			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
 			{
 				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: checkpointer is not running");
-				break;
-			}
-		}
-		else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
-		{
-			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
+					 "could not notify checkpoint: checkpointer is not running");
 				break;
 			}
 		}
 		else
-			break;				/* signal sent successfully */
+		{
+			SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
+			/* notified successfully */
+			break;
+		}
 
 		CHECK_FOR_INTERRUPTS();
 		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
-- 
2.45.2.746.g06e570c0df.dirty

>From 02bbfc3b1cda477c41687972e20ac662e3483a5e Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Wed, 8 Jan 2025 14:09:28 -0500
Subject: [PATCH v2 7/7] Change shutdown sequence to terminate checkpointer
 last

The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. That already happens in
checkpointer, even though walsenders can be active longer.

The only reason the current state does not actively cause problems is that
walsender currently generate any stats. However, there is a patch to change
that.  Another need for this originates in the AIO patchset, where IO workers
can emit stats in some edge cases and need to run while the shutdown
checkpoint is being written.

This commit changes the shutdown sequence so checkpointer is signalled to
trigger writing the shutdown checkpoint without terminating it. Once
checkpointer wrote the checkpoint it will wait for a termination
signal.

Postmaster now triggers the shutdown checkpoint where we previously did so by
terminating checkpointer. Checkpointer is terminated after all other children
have been terminated, tracked using the new PM_WAIT_CHECKPOINTER PMState.

Reviewed-by: Heikki Linnakangas <[email protected]>
Reviewed-by: Bertrand Drouvot <[email protected]>
Discussion: https://postgr.es/m/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu@m3cfzxicm5kp
---
 src/include/storage/pmsignal.h                |   3 +-
 src/backend/postmaster/checkpointer.c         | 114 +++++++++++----
 src/backend/postmaster/postmaster.c           | 136 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   1 +
 4 files changed, 180 insertions(+), 74 deletions(-)

diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index cd76315bfa4..411aa66aa61 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -51,6 +51,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -141,6 +142,7 @@ double		CheckPointCompletionTarget = 0.9;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -159,6 +161,9 @@ static bool ImmediateCheckpointRequested(void);
 static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
+/* Signal handlers */
+static void ReqShutdownXLOG(SIGNAL_ARGS);
+
 
 /*
  * Main entry point for checkpointer process
@@ -188,7 +193,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, SIG_IGN);
+	pqsignal(SIGINT, ReqShutdownXLOG);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -211,8 +216,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * process during a normal shutdown, and since checkpointer is shut down
 	 * very late...
 	 *
-	 * Walsenders are shut down after the checkpointer, but currently don't
-	 * report stats. If that changes, we need a more complicated solution.
+	 * While e.g. walsenders are active after the shutdown checkpoint has been
+	 * written (and thus could produce more stats), checkpointer stays around
+	 * after the shutdown checkpoint has been written. postmaster will only
+	 * signal checkpointer to exit after all processes that could emit stats
+	 * have been shut down.
 	 */
 	before_shmem_exit(pgstat_before_server_shutdown, 0);
 
@@ -327,7 +335,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	ProcGlobal->checkpointerProc = MyProcNumber;
 
 	/*
-	 * Loop forever
+	 * Loop until we've been asked to write shutdown checkpoint or terminate.
 	 */
 	for (;;)
 	{
@@ -346,7 +354,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 		 * Process any requests or signals received recently.
 		 */
 		AbsorbSyncRequests();
+
 		HandleCheckpointerInterrupts();
+		if (ShutdownXLOGPending || ShutdownRequestPending)
+			break;
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -517,8 +528,13 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 			ckpt_active = false;
 
-			/* We may have received an interrupt during the checkpoint. */
+			/*
+			 * We may have received an interrupt during the checkpoint and the
+			 * latch might have been reset (e.g. in CheckpointWriteDelay).
+			 */
 			HandleCheckpointerInterrupts();
+			if (ShutdownXLOGPending || ShutdownRequestPending)
+				break;
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -557,6 +573,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 						 cur_timeout * 1000L /* convert to ms */ ,
 						 WAIT_EVENT_CHECKPOINTER_MAIN);
 	}
+
+	/*
+	 * From here on, elog(ERROR) should end with exit(1), not send control
+	 * back to the sigsetjmp block above.
+	 */
+	ExitOnAnyError = true;
+
+	if (ShutdownXLOGPending)
+	{
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+		 * updates the statistics, increment the checkpoint request and flush
+		 * out pending statistic.
+		 */
+		PendingCheckpointerStats.num_requested++;
+		ShutdownXLOG(0, 0);
+		pgstat_report_checkpointer();
+		pgstat_report_wal(true);
+
+		/*
+		 * Tell postmaster that we're done.
+		 */
+		SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+	}
+
+	/*
+	 * Wait until we're asked to shut down. By seperating the writing of the
+	 * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+	 * some should-be-as-late-as-possible work like writing out stats.
+	 */
+	for (;;)
+	{
+		/* Clear any already-pending wakeups */
+		ResetLatch(MyLatch);
+
+		HandleCheckpointerInterrupts();
+
+		if (ShutdownRequestPending)
+			break;
+
+		(void) WaitLatch(MyLatch,
+						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+						 0,
+						 WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+	}
+
+	/* Normal exit from the checkpointer is here */
+	proc_exit(0);				/* done */
 }
 
 /*
@@ -586,29 +652,6 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
-	if (ShutdownRequestPending)
-	{
-		/*
-		 * From here on, elog(ERROR) should end with exit(1), not send control
-		 * back to the sigsetjmp block above
-		 */
-		ExitOnAnyError = true;
-
-		/*
-		 * Close down the database.
-		 *
-		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
-		 * updates the statistics, increment the checkpoint request and flush
-		 * out pending statistic.
-		 */
-		PendingCheckpointerStats.num_requested++;
-		ShutdownXLOG(0, 0);
-		pgstat_report_checkpointer();
-		pgstat_report_wal(true);
-
-		/* Normal exit from the checkpointer is here */
-		proc_exit(0);			/* done */
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -729,6 +772,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -857,6 +901,20 @@ IsCheckpointOnSchedule(double progress)
 }
 
 
+/* --------------------------------
+ *		signal handler routines
+ * --------------------------------
+ */
+
+/* SIGINT: set flag to trigger writing of shutdown checkpoint */
+static void
+ReqShutdownXLOG(SIGNAL_ARGS)
+{
+	ShutdownXLOGPending = true;
+	SetLatch(MyLatch);
+}
+
+
 /* --------------------------------
  *		communication with backends
  * --------------------------------
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 58f2619c45f..87ddc7a2da0 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -335,6 +335,7 @@ typedef enum
 	PM_WAIT_XLOG_ARCHIVAL,		/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
+	PM_WAIT_CHECKPOINTER,		/* waiting for checkpointer to shut down */
 	PM_NO_CHILDREN,				/* all important children have exited */
 } PMState;
 
@@ -2355,35 +2356,16 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
-				 * told to shut down.  We expect that it wrote a shutdown
-				 * checkpoint.  (If for some reason it didn't, recovery will
-				 * occur on next postmaster start.)
+				 * told to shut down.  We know it wrote a shutdown checkpoint,
+				 * otherwise we'd still be in PM_SHUTDOWN.
 				 *
-				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
-				 * we might have dead-end children to wait for.
-				 *
-				 * If we have an archiver subprocess, tell it to do a last
-				 * archive cycle and quit. Likewise, if we have walsender
-				 * processes, tell them to send any remaining WAL and quit.
-				 */
-				Assert(Shutdown > NoShutdown);
-
-				/* Waken archiver for the last time */
-				if (PgArchPMChild != NULL)
-					signal_child(PgArchPMChild, SIGUSR2);
-
-				/*
-				 * Waken walsenders for the last time. No regular backends
-				 * should be around anymore.
+				 * At this point we have no children left.
 				 */
-				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
-				UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+				UpdatePMState(PM_NO_CHILDREN);
 			}
 			else
 			{
@@ -2938,9 +2920,9 @@ PostmasterStateMachine(void)
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
 				/*
-				 * We already SIGQUIT'd walsenders and the archiver, if any,
-				 * when we started immediate shutdown or entered FatalError
-				 * state.
+				 * We already SIGQUIT'd archiver, checkpointer and walsenders,
+				 * if any, when we started immediate shutdown or entered
+				 * FatalError state.
 				 */
 			}
 			else
@@ -2954,10 +2936,10 @@ PostmasterStateMachine(void)
 				/* Start the checkpointer if not running */
 				if (CheckpointerPMChild == NULL)
 					CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
-				/* And tell it to shut down */
+				/* And tell it to write the shutdown checkpoint */
 				if (CheckpointerPMChild != NULL)
 				{
-					signal_child(CheckpointerPMChild, SIGUSR2);
+					signal_child(CheckpointerPMChild, SIGINT);
 					UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
 				}
 				else
@@ -2986,16 +2968,16 @@ PostmasterStateMachine(void)
 	if (pmState == PM_WAIT_XLOG_ARCHIVAL)
 	{
 		/*
-		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
-		 * than dead-end children left. There shouldn't be any regular
-		 * backends left by now anyway; what we're really waiting for is
-		 * walsenders and archiver.
+		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no children other
+		 * than checkpointer and dead-end children left. There shouldn't be
+		 * any regular backends left by now anyway; what we're really waiting
+		 * for is for walsenders and archiver to exit.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND, B_CHECKPOINTER)) == 0)
 		{
 			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
-			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER, B_CHECKPOINTER));
 		}
 	}
 
@@ -3003,29 +2985,52 @@ PostmasterStateMachine(void)
 	{
 		/*
 		 * PM_WAIT_DEAD_END state ends when all other children are gone except
-		 * for the logger.  During normal shutdown, all that remains are
-		 * dead-end backends, but in FatalError processing we jump straight
-		 * here with more processes remaining.  Note that they have already
-		 * been sent appropriate shutdown signals, either during a normal
-		 * state transition leading up to PM_WAIT_DEAD_END, or during
-		 * FatalError processing.
+		 * for the logger and checkpointer.  During normal shutdown, all that
+		 * remains are dead-end backends and checkpointer, but in FatalError
+		 * processing we jump straight here with more processes remaining.
+		 * Note that they have already been sent appropriate shutdown signals,
+		 * either during a normal state transition leading up to
+		 * PM_WAIT_DEAD_END, or during FatalError processing.
 		 *
 		 * The reason we wait is to protect against a new postmaster starting
 		 * conflicting subprocesses; this isn't an ironclad protection, but it
 		 * at least helps in the shutdown-and-immediately-restart scenario.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_CHECKPOINTER)) == 0)
 		{
 			/* These other guys should be dead already */
 			Assert(StartupPMChild == NULL);
 			Assert(WalReceiverPMChild == NULL);
 			Assert(WalSummarizerPMChild == NULL);
 			Assert(BgWriterPMChild == NULL);
-			Assert(CheckpointerPMChild == NULL);
 			Assert(WalWriterPMChild == NULL);
 			Assert(AutoVacLauncherPMChild == NULL);
 			Assert(SlotSyncWorkerPMChild == NULL);
 			/* syslogger is not considered here */
+
+			UpdatePMState(PM_WAIT_CHECKPOINTER);
+
+			/*
+			 * Now that everyone else is gone, tell checkpointer to shut down
+			 * too. That allows checkpointer to perform some last bits of
+			 * cleanup without other processes interfering.
+			 */
+			if (CheckpointerPMChild != NULL)
+				signal_child(CheckpointerPMChild, SIGUSR2);
+		}
+	}
+
+	if (pmState == PM_WAIT_CHECKPOINTER)
+	{
+		/*
+		 * PM_WAIT_CHECKPOINTER ends when, drumroll, checkpointer has shut
+		 * down. Note that checkpointer already has completed the shutdown
+		 * checkpoint, we just wait for it to do some last bits of cleanup and
+		 * then exit.
+		 */
+		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		{
+			Assert(CheckpointerPMChild == NULL);
 			UpdatePMState(PM_NO_CHILDREN);
 		}
 	}
@@ -3135,6 +3140,7 @@ pmstate_name(PMState state)
 			PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
 			PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
 			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+			PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
 			PM_TOSTR_CASE(PM_NO_CHILDREN);
 	}
 #undef PM_TOSTR_CASE
@@ -3563,6 +3569,8 @@ ExitPostmaster(int status)
 static void
 process_pm_pmsignal(void)
 {
+	bool		request_state_update = false;
+
 	pending_pm_pmsignal = false;
 
 	ereport(DEBUG2,
@@ -3674,9 +3682,46 @@ process_pm_pmsignal(void)
 		WalReceiverRequested = true;
 	}
 
+	if (pmState == PM_WAIT_XLOG_SHUTDOWN &&
+		CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+	{
+		/*
+		 * Checkpointer completed the shutdown checkpoint.
+		 *
+		 * If we have an archiver subprocess, tell it to do a last archive
+		 * cycle and quit. Likewise, if we have walsender processes, tell them
+		 * to send any remaining WAL and quit.
+		 */
+		Assert(Shutdown > NoShutdown);
+
+		/* Waken archiver for the last time */
+		if (PgArchPMChild != NULL)
+			signal_child(PgArchPMChild, SIGUSR2);
+
+		/*
+		 * Waken walsenders for the last time. No regular backends should be
+		 * around anymore.
+		 */
+		SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+		UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+
+		/*
+		 * Need to run PostmasterStateMachine() to check if we already can go
+		 * to the next state.
+		 */
+		request_state_update = true;
+	}
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
-	 *
+	 */
+	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	{
+		request_state_update = true;
+	}
+
+	/*
 	 * Be careful about the order of this action relative to this function's
 	 * other actions.  Generally, this should be after other actions, in case
 	 * they have effects PostmasterStateMachine would need to know about.
@@ -3684,7 +3729,7 @@ process_pm_pmsignal(void)
 	 * cannot have any (immediate) effect on the state machine, but does
 	 * depend on what state we're in now.
 	 */
-	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	if (request_state_update)
 	{
 		PostmasterStateMachine();
 	}
@@ -3995,6 +4040,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
+		case PM_WAIT_CHECKPOINTER:
 		case PM_WAIT_DEAD_END:
 		case PM_WAIT_XLOG_ARCHIVAL:
 		case PM_WAIT_XLOG_SHUTDOWN:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN	"Waiting in main loop of autovacuum launcher process."
 BGWRITER_HIBERNATE	"Waiting in background writer process, hibernating."
 BGWRITER_MAIN	"Waiting in main loop of background writer process."
 CHECKPOINTER_MAIN	"Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN	"Waiting for checkpointer process to be terminated."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
-- 
2.45.2.746.g06e570c0df.dirty

Re: Reorder shutdown sequence, to flush pgstats later

Reply via email to