Reorder shutdown sequence, to flush pgstats later

Andres Freund Tue, 07 Jan 2025 18:11:13 -0800

Hi,

In the AIO patchset I just hit a corner case in which IO workers can emit
stats ([1]). That in turn can trigger an assertion failure, because by the
time the IO workers are shut down, checkpointer already has exited and written
out the stats. In this case the relevant IO has completed, but nothing
triggered a stats flush in the IO workers before checkpointer exited.


I think Bilal has also hit this problem somewhat recently, as part of his work
to track WAL IO in pg_stat_io. We intentionally allow walsenders to be active
after checkpointer writes out the shutdown checkpoint and exits.


Robert suggested, when chatting with him about this problem, that we could use
a global barrier to trigger pending stats to be flushed before checkpointer
exits. While that, I think, would fix "my" issue with pending stats in IO
workers, I don't think it'd fix Bilal's problem.

I think that to properly fix this we need a place to flush stats after
everything has shut down.

One way we could do this would be to introduce a new aux process that's
started at the end of the shutdown sequence. But that seems a bit too big a
hammer, without corresponding benefit.


I instead opted to somewhat rearrange the shutdown sequence:

    This commit changes the shutdown sequence so checkpointer is signalled to
    trigger writing the shutdown checkpoint without terminating it. Once
    checkpointer wrote the checkpoint it will wait for a termination
    signal.

    Postmaster now triggers the shutdown checkpoint where we previously did so 
by
    terminating checkpointer. Checkpointer is terminated after all other 
children
    have been terminated, tracked using the new PM_SHUTDOWN_CHECKPOINTER 
PMState.

    In addition, the existing PM_SHUTDOWN_2 state is renamed to
    PM_XLOG_IS_SHUTDOWN, that seems a lot more descriptive.



This patch is not fully polished, there are a few details I'm not sure about:

- Previously the shutdown checkpoint and process exit were triggered from
  HandleCheckpointerInterrupts().  I could have continued doing so in the
  patch, but it seemed quite weird to have a wait loop in a function called
  HandleCheckpointerInterrupts().

  Instead I made the main loop in CheckpointerMain() terminate if checkpointer
  was asked to write the shutdown checkpoint or exit

- In process_pm_pmsignal() the code now triggers a PostmasterStateMachine() if
  checkpointer signalled ShutdownXLOG having completed. We didn't really have
  a need for that before. process_pm_child_exit() does call
  PostmasterStateMachine(), but somehow calling it in process_pm_pmsignal()
  feels slightly off

- I don't love the naming of the various PMState values (existing and new),
  but a larger renaming should probably be done separately?

- There's logging in ShutdownXLOG() that's only triggered if called from
  checkpointer. Seems kinda odd - shouldn't we just move that to
  checkpointer.c?


The first two patches are just logging improvements that I found helpful to
write this patch:

0001: Instead of modifying pmState directly, do so via a new function
      UpdatePMState(), which logs the state transition at DEBUG2.

      Requires a helper to stringify PMState.

      I've written one-off versions of this patch quite a few times. Without
      knowing in what state postmaster is in, it's quite hard to debug the
      state machine.


0002: Make logging of postmaster signalling child processes more consistent

      I found it somewhat hard to understand what's happening during state
      changes without being able to see the signals being sent. While we did
      have logging in SignalChildren(), we didn't in signal_child(), and most
      signals that are important for the shutdown sequence are sent via
      signal_child().


The next is a small prerequisite:

0003: postmaster: Introduce variadic btmask_all_except()

      I proposed this in another thread, where I also needed
      btmask_all_except3() but then removed the only call to
      btmask_all_except2(). Instead of adding/removing code, it seems better
      to just use a variadic function.

And then 0004, the reason for this thread.


Comments? Better ideas?

Greetings,

Andres Freund


[1] A buffered write completes and RememberSyncRequest() fails, leading to the
    IO workers performing the flush itself.

>From 5978a67e17b2750b766d959b4529c09a8d5b3023 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 7 Jan 2025 17:59:43 -0500
Subject: [PATCH v1 1/4] postmaster: Update pmState in new function

This mainly makes logging of state changes easier.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 78 +++++++++++++++++++++--------
 1 file changed, 57 insertions(+), 21 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 0a915dac4f7..9d23c3482ae 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -411,6 +411,7 @@ static void HandleChildCrash(int pid, int exitstatus, const char *procname);
 static void LogChildExit(int lev, const char *procname,
 						 int pid, int exitstatus);
 static void PostmasterStateMachine(void);
+static void UpdatePMState(PMState newState);
 
 static void ExitPostmaster(int status) pg_attribute_noreturn();
 static int	ServerLoop(void);
@@ -1363,7 +1364,7 @@ PostmasterMain(int argc, char *argv[])
 	StartupPMChild = StartChildProcess(B_STARTUP);
 	Assert(StartupPMChild != NULL);
 	StartupStatus = STARTUP_RUNNING;
-	pmState = PM_STARTUP;
+	UpdatePMState(PM_STARTUP);
 
 	/* Some workers may be scheduled to start now */
 	maybe_start_bgworkers();
@@ -2099,7 +2100,7 @@ process_pm_shutdown_request(void)
 			else if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
 				/* There should be no clients, so proceed to stop children */
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 
 			/*
@@ -2133,7 +2134,7 @@ process_pm_shutdown_request(void)
 			if (pmState == PM_STARTUP || pmState == PM_RECOVERY)
 			{
 				/* Just shut down background processes silently */
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 			else if (pmState == PM_RUN ||
 					 pmState == PM_HOT_STANDBY)
@@ -2141,7 +2142,7 @@ process_pm_shutdown_request(void)
 				/* Report that we're about to zap live client sessions */
 				ereport(LOG,
 						(errmsg("aborting any active transactions")));
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 			}
 
 			/*
@@ -2176,7 +2177,7 @@ process_pm_shutdown_request(void)
 			/* (note we don't apply send_abort_for_crash here) */
 			SetQuitSignalReason(PMQUIT_FOR_STOP);
 			TerminateChildren(SIGQUIT);
-			pmState = PM_WAIT_BACKENDS;
+			UpdatePMState(PM_WAIT_BACKENDS);
 
 			/* set stopwatch for them to die */
 			AbortStartTime = time(NULL);
@@ -2231,7 +2232,7 @@ process_pm_child_exit(void)
 				(EXIT_STATUS_0(exitstatus) || EXIT_STATUS_1(exitstatus)))
 			{
 				StartupStatus = STARTUP_NOT_RUNNING;
-				pmState = PM_WAIT_BACKENDS;
+				UpdatePMState(PM_WAIT_BACKENDS);
 				/* PostmasterStateMachine logic does the rest */
 				continue;
 			}
@@ -2243,7 +2244,7 @@ process_pm_child_exit(void)
 				StartupStatus = STARTUP_NOT_RUNNING;
 				Shutdown = Max(Shutdown, SmartShutdown);
 				TerminateChildren(SIGTERM);
-				pmState = PM_WAIT_BACKENDS;
+				UpdatePMState(PM_WAIT_BACKENDS);
 				/* PostmasterStateMachine logic does the rest */
 				continue;
 			}
@@ -2288,7 +2289,7 @@ process_pm_child_exit(void)
 				{
 					StartupStatus = STARTUP_NOT_RUNNING;
 					if (pmState == PM_STARTUP)
-						pmState = PM_WAIT_BACKENDS;
+						UpdatePMState(PM_WAIT_BACKENDS);
 				}
 				else
 					StartupStatus = STARTUP_CRASHED;
@@ -2304,7 +2305,7 @@ process_pm_child_exit(void)
 			FatalError = false;
 			AbortStartTime = 0;
 			ReachedNormalRunning = true;
-			pmState = PM_RUN;
+			UpdatePMState(PM_RUN);
 			connsAllowed = true;
 
 			/*
@@ -2377,7 +2378,7 @@ process_pm_child_exit(void)
 				 */
 				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
 
-				pmState = PM_SHUTDOWN_2;
+				UpdatePMState(PM_SHUTDOWN_2);
 			}
 			else
 			{
@@ -2729,7 +2730,7 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 		pmState == PM_RUN ||
 		pmState == PM_STOP_BACKENDS ||
 		pmState == PM_SHUTDOWN)
-		pmState = PM_WAIT_BACKENDS;
+		UpdatePMState(PM_WAIT_BACKENDS);
 
 	/*
 	 * .. and if this doesn't happen quickly enough, now the clock is ticking
@@ -2821,7 +2822,7 @@ PostmasterStateMachine(void)
 			 * Then we're ready to stop other children.
 			 */
 			if (CountChildren(btmask(B_BACKEND)) == 0)
-				pmState = PM_STOP_BACKENDS;
+				UpdatePMState(PM_STOP_BACKENDS);
 		}
 	}
 
@@ -2909,7 +2910,7 @@ PostmasterStateMachine(void)
 
 			SignalChildren(SIGTERM, targetMask);
 
-			pmState = PM_WAIT_BACKENDS;
+			UpdatePMState(PM_WAIT_BACKENDS);
 		}
 
 		/* Are any of the target processes still running? */
@@ -2920,7 +2921,7 @@ PostmasterStateMachine(void)
 				/*
 				 * Stop any dead-end children and stop creating new ones.
 				 */
-				pmState = PM_WAIT_DEAD_END;
+				UpdatePMState(PM_WAIT_DEAD_END);
 				ConfigurePostmasterWaitSet(false);
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
@@ -2945,7 +2946,7 @@ PostmasterStateMachine(void)
 				if (CheckpointerPMChild != NULL)
 				{
 					signal_child(CheckpointerPMChild, SIGUSR2);
-					pmState = PM_SHUTDOWN;
+					UpdatePMState(PM_SHUTDOWN);
 				}
 				else
 				{
@@ -2960,7 +2961,7 @@ PostmasterStateMachine(void)
 					 * for checkpointer fork failure.
 					 */
 					FatalError = true;
-					pmState = PM_WAIT_DEAD_END;
+					UpdatePMState(PM_WAIT_DEAD_END);
 					ConfigurePostmasterWaitSet(false);
 
 					/* Kill the walsenders and archiver too */
@@ -2980,7 +2981,7 @@ PostmasterStateMachine(void)
 		 */
 		if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
-			pmState = PM_WAIT_DEAD_END;
+			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
 			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
 		}
@@ -3013,7 +3014,7 @@ PostmasterStateMachine(void)
 			Assert(AutoVacLauncherPMChild == NULL);
 			Assert(SlotSyncWorkerPMChild == NULL);
 			/* syslogger is not considered here */
-			pmState = PM_NO_CHILDREN;
+			UpdatePMState(PM_NO_CHILDREN);
 		}
 	}
 
@@ -3097,7 +3098,7 @@ PostmasterStateMachine(void)
 		StartupPMChild = StartChildProcess(B_STARTUP);
 		Assert(StartupPMChild != NULL);
 		StartupStatus = STARTUP_RUNNING;
-		pmState = PM_STARTUP;
+		UpdatePMState(PM_STARTUP);
 		/* crash recovery started, reset SIGKILL flag */
 		AbortStartTime = 0;
 
@@ -3106,6 +3107,41 @@ PostmasterStateMachine(void)
 	}
 }
 
+static const char *
+pmstate_name(PMState state)
+{
+#define PM_CASE(state) case state: return #state
+	switch (state)
+	{
+			PM_CASE(PM_INIT);
+			PM_CASE(PM_STARTUP);
+			PM_CASE(PM_RECOVERY);
+			PM_CASE(PM_HOT_STANDBY);
+			PM_CASE(PM_RUN);
+			PM_CASE(PM_STOP_BACKENDS);
+			PM_CASE(PM_WAIT_BACKENDS);
+			PM_CASE(PM_SHUTDOWN);
+			PM_CASE(PM_SHUTDOWN_2);
+			PM_CASE(PM_WAIT_DEAD_END);
+			PM_CASE(PM_NO_CHILDREN);
+	}
+#undef PM_CASE
+
+	pg_unreachable();
+}
+
+/*
+ * Simple wrapper for updating pmState. The main reason to have this wrapper
+ * is that it makes it easy to log all state transitions.
+ */
+static void
+UpdatePMState(PMState newState)
+{
+	elog(DEBUG1, "updating PMState from %d/%s to %d/%s",
+		 pmState, pmstate_name(pmState), newState, pmstate_name(newState));
+	pmState = newState;
+}
+
 /*
  * Launch background processes after state change, or relaunch after an
  * existing process has exited.
@@ -3524,7 +3560,7 @@ process_pm_pmsignal(void)
 #endif
 		}
 
-		pmState = PM_RECOVERY;
+		UpdatePMState(PM_RECOVERY);
 	}
 
 	if (CheckPostmasterSignal(PMSIGNAL_BEGIN_HOT_STANDBY) &&
@@ -3539,7 +3575,7 @@ process_pm_pmsignal(void)
 		sd_notify(0, "READY=1");
 #endif
 
-		pmState = PM_HOT_STANDBY;
+		UpdatePMState(PM_HOT_STANDBY);
 		connsAllowed = true;
 
 		/* Some workers may be scheduled to start now */
-- 
2.45.2.746.g06e570c0df.dirty

>From d7ecde678fb61cc8b56d9b10507ee98506c2bd11 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 7 Jan 2025 20:22:36 -0500
Subject: [PATCH v1 2/4] postmaster: Improve logging of signals sent by
 postmaster

Previously many, in some cases important, signals we never logged. In other
cases the signal name was only included numerically.

Also move from direct use of kill() to signal the av launcher to
signal_child(). There doesn't seem to be a reason for directly using kill().

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/backend/postmaster/postmaster.c | 61 ++++++++++++++++++++++++-----
 1 file changed, 51 insertions(+), 10 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 9d23c3482ae..d57fee54e0d 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -419,6 +419,7 @@ static int	BackendStartup(ClientSocket *client_sock);
 static void report_fork_failure_to_client(ClientSocket *client_sock, int errnum);
 static CAC_state canAcceptConnections(BackendType backend_type);
 static void signal_child(PMChild *pmchild, int signal);
+static void signal_child_ext(PMChild *pmchild, int signal, int elevel);
 static void sigquit_child(PMChild *pmchild);
 static bool SignalChildren(int signal, BackendTypeMask targetMask);
 static void TerminateChildren(int signal);
@@ -1693,7 +1694,7 @@ ServerLoop(void)
 		{
 			avlauncher_needs_signal = false;
 			if (AutoVacLauncherPMChild != NULL)
-				kill(AutoVacLauncherPMChild->pid, SIGUSR2);
+				signal_child(AutoVacLauncherPMChild, SIGUSR2);
 		}
 
 #ifdef HAVE_PTHREAD_IS_THREADED_NP
@@ -3255,6 +3256,37 @@ LaunchMissingBackgroundProcesses(void)
 		maybe_start_bgworkers();
 }
 
+/*
+ * Return string representation of signal.
+ *
+ * Because this is only implemented in signals we already rely on in this file
+ * we don't need to deal with unimplemented or same-numeric-value signals (as
+ * we'd e.g. have to for EWOULDBLOCK / EAGAIN).
+ */
+static const char *
+pm_signame(int signal)
+{
+#define PM_CASE(state) case state: return #state
+	switch (signal)
+	{
+			PM_CASE(SIGABRT);
+			PM_CASE(SIGCHLD);
+			PM_CASE(SIGHUP);
+			PM_CASE(SIGINT);
+			PM_CASE(SIGKILL);
+			PM_CASE(SIGQUIT);
+			PM_CASE(SIGTERM);
+			PM_CASE(SIGUSR1);
+			PM_CASE(SIGUSR2);
+		default:
+			/* all signals sent by postmaster should be listed here */
+			Assert(false);
+			return "(unknown)";
+	}
+#undef PM_CASE
+	pg_unreachable();
+}
+
 /*
  * Send a signal to a postmaster child process
  *
@@ -3272,10 +3304,16 @@ LaunchMissingBackgroundProcesses(void)
  * child twice will not cause any problems.
  */
 static void
-signal_child(PMChild *pmchild, int signal)
+signal_child_ext(PMChild *pmchild, int signal, int elevel)
 {
 	pid_t		pid = pmchild->pid;
 
+	ereport(elevel,
+			(errmsg_internal("sending signal %d/%s to %s process %d",
+							 signal, pm_signame(signal),
+							 GetBackendTypeDesc(pmchild->bkend_type),
+							 (int) pmchild->pid)));
+
 	if (kill(pid, signal) < 0)
 		elog(DEBUG3, "kill(%ld,%d) failed: %m", (long) pid, signal);
 #ifdef HAVE_SETSID
@@ -3295,6 +3333,15 @@ signal_child(PMChild *pmchild, int signal)
 #endif
 }
 
+/*
+ * Like signal_child_ext(), but with a default debug level.
+ */
+static void
+signal_child(PMChild *pmchild, int signal)
+{
+	signal_child_ext(pmchild, signal, DEBUG4);
+}
+
 /*
  * Convenience function for killing a child process after a crash of some
  * other child process.  We log the action at a higher level than we would
@@ -3306,11 +3353,8 @@ signal_child(PMChild *pmchild, int signal)
 static void
 sigquit_child(PMChild *pmchild)
 {
-	ereport(DEBUG2,
-			(errmsg_internal("sending %s to process %d",
-							 (send_abort_for_crash ? "SIGABRT" : "SIGQUIT"),
-							 (int) pmchild->pid)));
-	signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
+	signal_child_ext(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT),
+					 DEBUG2);
 }
 
 /*
@@ -3341,9 +3385,6 @@ SignalChildren(int signal, BackendTypeMask targetMask)
 		if (!btmask_contains(targetMask, bp->bkend_type))
 			continue;
 
-		ereport(DEBUG4,
-				(errmsg_internal("sending signal %d to %s process %d",
-								 signal, GetBackendTypeDesc(bp->bkend_type), (int) bp->pid)));
 		signal_child(bp, signal);
 		signaled = true;
 	}
-- 
2.45.2.746.g06e570c0df.dirty

>From 7f78e0a6609f9bfb33430137d083012d7bd15ad7 Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 7 Jan 2025 19:51:00 -0500
Subject: [PATCH v1 3/4] postmaster: Introduce variadic btmask_all_except()

Upcoming patches would otherwise need btmask_all_except3().

Author:
Reviewed-by:
Discussion: https://postgr.es/m/w3z6w3g4aovivs735nk4pzjhmegntecesm3kktpebchegm5o53@aonnq2kn27xi
Backpatch:
---
 src/backend/postmaster/postmaster.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d57fee54e0d..d65b00f9338 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -164,23 +164,20 @@ btmask_del(BackendTypeMask mask, BackendType t)
 }
 
 static inline BackendTypeMask
-btmask_all_except(BackendType t)
+btmask_all_except_n(int nargs, BackendType *t)
 {
 	BackendTypeMask mask = BTYPE_MASK_ALL;
 
-	mask = btmask_del(mask, t);
+	for (int i = 0; i < nargs; i++)
+		mask = btmask_del(mask, t[i]);
 	return mask;
 }
 
-static inline BackendTypeMask
-btmask_all_except2(BackendType t1, BackendType t2)
-{
-	BackendTypeMask mask = BTYPE_MASK_ALL;
-
-	mask = btmask_del(mask, t1);
-	mask = btmask_del(mask, t2);
-	return mask;
-}
+#define btmask_all_except(...) \
+	btmask_all_except_n( \
+		lengthof(((BackendType[]){__VA_ARGS__})), \
+		(BackendType[]){__VA_ARGS__} \
+	)
 
 static inline bool
 btmask_contains(BackendTypeMask mask, BackendType t)
@@ -2980,7 +2977,7 @@ PostmasterStateMachine(void)
 		 * left by now anyway; what we're really waiting for is walsenders and
 		 * archiver.
 		 */
-		if (CountChildren(btmask_all_except2(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
 			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
-- 
2.45.2.746.g06e570c0df.dirty

>From 5d7f521ea233fae2ac1e0956ada8d65db508c0ee Mon Sep 17 00:00:00 2001
From: Andres Freund <[email protected]>
Date: Tue, 7 Jan 2025 21:06:51 -0500
Subject: [PATCH v1 4/4] WIP: Change shutdown sequence to terminate
 checkpointer last

The main motivation for this change is to have a process that can serialize
stats after all other processes have terminated. That already happens in
checkpointer, even though walsenders can be active longer.

The only reason the current state does not actively cause problems is that
walsender currently generate any stats. However, there is a patch to change
that.  Another need for this originates in the AIO patchset, where IO workers
can emit stats in some edge cases and need to run while the shutdown
checkpoint is being written.

This commit changes the shutdown sequence so checkpointer is signalled to
trigger writing the shutdown checkpoint without terminating it. Once
checkpointer wrote the checkpoint it will wait for a termination
signal.

Postmaster now triggers the shutdown checkpoint where we previously did so by
terminating checkpointer. Checkpointer is terminated after all other children
have been terminated, tracked using the new PM_SHUTDOWN_CHECKPOINTER PMState.

In addition, the existing PM_SHUTDOWN_2 state is renamed to
PM_XLOG_IS_SHUTDOWN, that seems a lot more descriptive.

Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
 src/include/storage/pmsignal.h                |   3 +-
 src/backend/postmaster/checkpointer.c         | 103 +++++++++----
 src/backend/postmaster/postmaster.c           | 139 ++++++++++++------
 .../utils/activity/wait_event_names.txt       |   1 +
 4 files changed, 170 insertions(+), 76 deletions(-)

diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..44e25994ca7 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -51,6 +51,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -141,6 +142,7 @@ double		CheckPointCompletionTarget = 0.9;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -161,6 +163,7 @@ static void UpdateSharedMemoryConfig(void);
 
 /* Signal handlers */
 static void ReqCheckpointHandler(SIGNAL_ARGS);
+static void ReqShutdownXLOG(SIGNAL_ARGS);
 
 
 /*
@@ -192,12 +195,12 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
 	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
-	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
+	pqsignal(SIGTERM, SignalHandlerForShutdownRequest);
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
 	pqsignal(SIGPIPE, SIG_IGN);
 	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
-	pqsignal(SIGUSR2, SignalHandlerForShutdownRequest);
+	pqsignal(SIGUSR2, ReqShutdownXLOG);
 
 	/*
 	 * Reset some signals that are accepted by postmaster but not here
@@ -214,8 +217,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * process during a normal shutdown, and since checkpointer is shut down
 	 * very late...
 	 *
-	 * Walsenders are shut down after the checkpointer, but currently don't
-	 * report stats. If that changes, we need a more complicated solution.
+	 * While e.g. walsenders are active after the shutdown checkpoint has been
+	 * written (and thus could produce more stats), checkpointer stays around
+	 * after the shutdown checkpoint has been written. postmaster.c will only
+	 * signal checkpointer to exit after all processes that could emit stats
+	 * have been shut down.
 	 */
 	before_shmem_exit(pgstat_before_server_shutdown, 0);
 
@@ -330,7 +336,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	ProcGlobal->checkpointerProc = MyProcNumber;
 
 	/*
-	 * Loop forever
+	 * Loop until we've been asked to write shutdown checkpoint.
 	 */
 	for (;;)
 	{
@@ -349,7 +355,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 		 * Process any requests or signals received recently.
 		 */
 		AbsorbSyncRequests();
+
 		HandleCheckpointerInterrupts();
+		if (ShutdownXLOGPending || ShutdownRequestPending)
+			break;
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -522,6 +531,8 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 			/* We may have received an interrupt during the checkpoint. */
 			HandleCheckpointerInterrupts();
+			if (ShutdownXLOGPending || ShutdownRequestPending)
+				break;
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -560,6 +571,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 						 cur_timeout * 1000L /* convert to ms */ ,
 						 WAIT_EVENT_CHECKPOINTER_MAIN);
 	}
+
+	/*
+	 * From here on, elog(ERROR) should end with exit(1), not send control
+	 * back to the sigsetjmp block above.
+	 */
+	ExitOnAnyError = true;
+
+	if (ShutdownXLOGPending)
+	{
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+		 * updates the statistics, increment the checkpoint request and flush
+		 * out pending statistic.
+		 */
+		PendingCheckpointerStats.num_requested++;
+		ShutdownXLOG(0, 0);
+		pgstat_report_checkpointer();
+		pgstat_report_wal(true);
+
+		/*
+		 * Tell postmaster that we're done.
+		 */
+		SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+	}
+
+	/*
+	 * Wait until we're asked to shut down. By seperating the writing of the
+	 * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+	 * some should-be-as-late-as-possible work like writing out stats.
+	 */
+	for (;;)
+	{
+		/* Clear any already-pending wakeups */
+		ResetLatch(MyLatch);
+
+		HandleCheckpointerInterrupts();
+
+		if (ShutdownRequestPending)
+			break;
+
+		(void) WaitLatch(MyLatch,
+						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+						 0,
+						 WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+	}
+
+	/* Normal exit from the checkpointer is here */
+	proc_exit(0);				/* done */
 }
 
 /*
@@ -589,29 +650,6 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
-	if (ShutdownRequestPending)
-	{
-		/*
-		 * From here on, elog(ERROR) should end with exit(1), not send control
-		 * back to the sigsetjmp block above
-		 */
-		ExitOnAnyError = true;
-
-		/*
-		 * Close down the database.
-		 *
-		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
-		 * updates the statistics, increment the checkpoint request and flush
-		 * out pending statistic.
-		 */
-		PendingCheckpointerStats.num_requested++;
-		ShutdownXLOG(0, 0);
-		pgstat_report_checkpointer();
-		pgstat_report_wal(true);
-
-		/* Normal exit from the checkpointer is here */
-		proc_exit(0);			/* done */
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -732,6 +770,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -876,6 +915,14 @@ ReqCheckpointHandler(SIGNAL_ARGS)
 	SetLatch(MyLatch);
 }
 
+/* SIGUSR2: set flag to trigger writing of shutdown checkpoint */
+static void
+ReqShutdownXLOG(SIGNAL_ARGS)
+{
+	ShutdownXLOGPending = true;
+	SetLatch(MyLatch);
+}
+
 
 /* --------------------------------
  *		communication with backends
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index d65b00f9338..2fe8e2b23b1 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -324,9 +324,10 @@ typedef enum
 	PM_WAIT_BACKENDS,			/* waiting for live backends to exit */
 	PM_SHUTDOWN,				/* waiting for checkpointer to do shutdown
 								 * ckpt */
-	PM_SHUTDOWN_2,				/* waiting for archiver and walsenders to
+	PM_XLOG_IS_SHUTDOWN,		/* waiting for archiver and walsenders to
 								 * finish */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
+	PM_SHUTDOWN_CHECKPOINTER,	/* waiting for checkpointer to shut down */
 	PM_NO_CHILDREN,				/* all important children have exited */
 } PMState;
 
@@ -2348,35 +2349,16 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_SHUTDOWN_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
-				 * told to shut down.  We expect that it wrote a shutdown
-				 * checkpoint.  (If for some reason it didn't, recovery will
-				 * occur on next postmaster start.)
+				 * told to shut down.  We know it wrote a shutdown checkpoint,
+				 * otherwise we'd still be in PM_SHUTDOWN.
 				 *
-				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_SHUTDOWN state) but we might
-				 * have dead-end children to wait for.
-				 *
-				 * If we have an archiver subprocess, tell it to do a last
-				 * archive cycle and quit. Likewise, if we have walsender
-				 * processes, tell them to send any remaining WAL and quit.
-				 */
-				Assert(Shutdown > NoShutdown);
-
-				/* Waken archiver for the last time */
-				if (PgArchPMChild != NULL)
-					signal_child(PgArchPMChild, SIGUSR2);
-
-				/*
-				 * Waken walsenders for the last time. No regular backends
-				 * should be around anymore.
+				 * At this point we have no children left.
 				 */
-				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
-				UpdatePMState(PM_SHUTDOWN_2);
+				UpdatePMState(PM_NO_CHILDREN);
 			}
 			else
 			{
@@ -2924,9 +2906,9 @@ PostmasterStateMachine(void)
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
 				/*
-				 * We already SIGQUIT'd walsenders and the archiver, if any,
-				 * when we started immediate shutdown or entered FatalError
-				 * state.
+				 * We already SIGQUIT'd archiver, checkpointer and walsenders,
+				 * if any, when we started immediate shutdown or entered
+				 * FatalError state.
 				 */
 			}
 			else
@@ -2969,19 +2951,19 @@ PostmasterStateMachine(void)
 		}
 	}
 
-	if (pmState == PM_SHUTDOWN_2)
+	if (pmState == PM_XLOG_IS_SHUTDOWN)
 	{
 		/*
-		 * PM_SHUTDOWN_2 state ends when there's no other children than
-		 * dead-end children left. There shouldn't be any regular backends
-		 * left by now anyway; what we're really waiting for is walsenders and
-		 * archiver.
+		 * PM_XLOG_IS_SHUTDOWN state ends when there's no children other than
+		 * checkpointer and dead-end children left. There shouldn't be any
+		 * regular backends left by now anyway; what we're really waiting for
+		 * is for walsenders and archiver to exit.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND, B_CHECKPOINTER)) == 0)
 		{
 			UpdatePMState(PM_WAIT_DEAD_END);
 			ConfigurePostmasterWaitSet(false);
-			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER, B_CHECKPOINTER));
 		}
 	}
 
@@ -2989,29 +2971,51 @@ PostmasterStateMachine(void)
 	{
 		/*
 		 * PM_WAIT_DEAD_END state ends when all other children are gone except
-		 * for the logger.  During normal shutdown, all that remains are
-		 * dead-end backends, but in FatalError processing we jump straight
-		 * here with more processes remaining.  Note that they have already
-		 * been sent appropriate shutdown signals, either during a normal
-		 * state transition leading up to PM_WAIT_DEAD_END, or during
-		 * FatalError processing.
+		 * for the logger and checkpointer.  During normal shutdown, all that
+		 * remains are dead-end backends and checkpointer, but in FatalError
+		 * processing we jump straight here with more processes remaining.
+		 * Note that they have already been sent appropriate shutdown signals,
+		 * either during a normal state transition leading up to
+		 * PM_WAIT_DEAD_END, or during FatalError processing.
 		 *
 		 * The reason we wait is to protect against a new postmaster starting
 		 * conflicting subprocesses; this isn't an ironclad protection, but it
 		 * at least helps in the shutdown-and-immediately-restart scenario.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		if (CountChildren(btmask_all_except(B_LOGGER, B_CHECKPOINTER)) == 0)
 		{
 			/* These other guys should be dead already */
 			Assert(StartupPMChild == NULL);
 			Assert(WalReceiverPMChild == NULL);
 			Assert(WalSummarizerPMChild == NULL);
 			Assert(BgWriterPMChild == NULL);
-			Assert(CheckpointerPMChild == NULL);
 			Assert(WalWriterPMChild == NULL);
 			Assert(AutoVacLauncherPMChild == NULL);
 			Assert(SlotSyncWorkerPMChild == NULL);
 			/* syslogger is not considered here */
+
+			UpdatePMState(PM_SHUTDOWN_CHECKPOINTER);
+
+			/*
+			 * Now that everyone else is gone, tell checkpointer to shut down
+			 * too. That allows checkpointer to perform some last bits of of
+			 * cleanup without other processes interfering.
+			 */
+			SignalChildren(SIGTERM, btmask(B_CHECKPOINTER));
+		}
+	}
+
+	if (pmState == PM_SHUTDOWN_CHECKPOINTER)
+	{
+		/*
+		 * PM_SHUTDOWN_CHECKPOINTER ends when, drumroll, checkpointer has shut
+		 * down. Note that checkpointer already has completed the shutdown
+		 * checkpoint, we just wait for it to do some last bits of cleanup and
+		 * then exit.
+		 */
+		if (CountChildren(btmask_all_except(B_LOGGER)) == 0)
+		{
+			Assert(CheckpointerPMChild == NULL);
 			UpdatePMState(PM_NO_CHILDREN);
 		}
 	}
@@ -3119,8 +3123,9 @@ pmstate_name(PMState state)
 			PM_CASE(PM_STOP_BACKENDS);
 			PM_CASE(PM_WAIT_BACKENDS);
 			PM_CASE(PM_SHUTDOWN);
-			PM_CASE(PM_SHUTDOWN_2);
+			PM_CASE(PM_XLOG_IS_SHUTDOWN);
 			PM_CASE(PM_WAIT_DEAD_END);
+			PM_CASE(PM_SHUTDOWN_CHECKPOINTER);
 			PM_CASE(PM_NO_CHILDREN);
 	}
 #undef PM_CASE
@@ -3559,6 +3564,8 @@ ExitPostmaster(int status)
 static void
 process_pm_pmsignal(void)
 {
+	bool		request_state_update = false;
+
 	pending_pm_pmsignal = false;
 
 	ereport(DEBUG2,
@@ -3670,9 +3677,46 @@ process_pm_pmsignal(void)
 		WalReceiverRequested = true;
 	}
 
+	if (pmState == PM_SHUTDOWN &&
+		CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+	{
+		/*
+		 * Checkpointer completed the shutdown checkpoint.
+		 *
+		 * If we have an archiver subprocess, tell it to do a last archive
+		 * cycle and quit. Likewise, if we have walsender processes, tell them
+		 * to send any remaining WAL and quit.
+		 */
+		Assert(Shutdown > NoShutdown);
+
+		/* Waken archiver for the last time */
+		if (PgArchPMChild != NULL)
+			signal_child(PgArchPMChild, SIGUSR2);
+
+		/*
+		 * Waken walsenders for the last time. No regular backends should be
+		 * around anymore.
+		 */
+		SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+		UpdatePMState(PM_XLOG_IS_SHUTDOWN);
+
+		/*
+		 * Need to run PostmasterStateMachine() to check if we already can go
+		 * to the next state.
+		 */
+		request_state_update = true;
+	}
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
-	 *
+	 */
+	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	{
+		request_state_update = true;
+	}
+
+	/*
 	 * Be careful about the order of this action relative to this function's
 	 * other actions.  Generally, this should be after other actions, in case
 	 * they have effects PostmasterStateMachine would need to know about.
@@ -3680,7 +3724,7 @@ process_pm_pmsignal(void)
 	 * cannot have any (immediate) effect on the state machine, but does
 	 * depend on what state we're in now.
 	 */
-	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	if (request_state_update)
 	{
 		PostmasterStateMachine();
 	}
@@ -3991,8 +4035,9 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
+		case PM_SHUTDOWN_CHECKPOINTER:
 		case PM_WAIT_DEAD_END:
-		case PM_SHUTDOWN_2:
+		case PM_XLOG_IS_SHUTDOWN:
 		case PM_SHUTDOWN:
 		case PM_WAIT_BACKENDS:
 		case PM_STOP_BACKENDS:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN	"Waiting in main loop of autovacuum launcher process."
 BGWRITER_HIBERNATE	"Waiting in background writer process, hibernating."
 BGWRITER_MAIN	"Waiting in main loop of background writer process."
 CHECKPOINTER_MAIN	"Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN	"Waiting for checkpointer process to be terminated."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
-- 
2.45.2.746.g06e570c0df.dirty

Reorder shutdown sequence, to flush pgstats later

Reply via email to