Re: Show WAL write and fsync stats in pg_stat_io

Nazir Bilal Yavuz Fri, 24 Jan 2025 00:31:43 -0800

Hi,

On Wed, 22 Jan 2025 at 03:14, Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Thu, Jan 16, 2025 at 11:40:51AM +0300, Nazir Bilal Yavuz wrote:
> > I encountered another problem while rebasing the patch. The problem is
> > basically we do not expect any pending stats while restoring the stats
> > at the initdb. However, WAL IOs (WAL read and WAL init IOs for now)
> > may happen before restoring the stats, so we end up having pending
> > stats before restoring them and that causes initdb to fail.
>
> On top of 4feba03d8b92, I've reused something close to the patch you
> have posted previously in case, and the issue with allocations for
> pending stats should be gone.


Yes, they are fixed; thanks!

> Could it be possible to post a new version of the patch?  You should
> be able to reuse pgstat_count_backend_io_op[_time]() for your work
> with WAL data in pg_stat_io if you need a low-level control of things,
> but I suspect that calling pgstat_count_io_op() & the other should be
> enough to get the job done with a new IOObject.

I think there is only one problem remaining now. walsenders have stats
to report with this patch and they may shutdown after the
checkpointer, which causes '027_stream_regress.pl' test to fail.
Andres is already working on fixing that issue [1],
'027_stream_regress.pl' test passes after applying Andres' proposed
fix.

v9 is rebased and attached as three patches. The first one is a
squashed patch for the current version of Andres' proposed fix to pass
the CI, the second one is for adding WAL stats to pg_stat_io and the
third one is for fetching timing columns from pg_stat_io in the
pg_stat_wal view.

There is a change in the main patch (0002). Now, stats are being
flushed after the main loop in the PerformWalRecovery() function in
the xlogrecovery.c file. Stats were flushed in the main loop before
but I thought that might be costly so moved it to after main loop.

[1] 
postgr.es/m/flat/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu%40m3cfzxicm5kp

--
Regards,
Nazir Bilal Yavuz
Microsoft

From 5ee9f2ca206f201beb7264f460de9e8ad4c08ba0 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 10 Jan 2025 11:11:40 -0500
Subject: [PATCH v9 1/3] Squash reorder shutdown sequence patches

---
 src/include/storage/pmsignal.h                |   3 +-
 src/backend/postmaster/checkpointer.c         | 157 +++++---
 src/backend/postmaster/postmaster.c           | 341 +++++++++++-------
 .../utils/activity/wait_event_names.txt       |   1 +
 4 files changed, 318 insertions(+), 184 deletions(-)

diff --git a/src/include/storage/pmsignal.h b/src/include/storage/pmsignal.h
index 3fbe5bf1136..d84a383047e 100644
--- a/src/include/storage/pmsignal.h
+++ b/src/include/storage/pmsignal.h
@@ -40,9 +40,10 @@ typedef enum
 	PMSIGNAL_BACKGROUND_WORKER_CHANGE,	/* background worker state change */
 	PMSIGNAL_START_WALRECEIVER, /* start a walreceiver */
 	PMSIGNAL_ADVANCE_STATE_MACHINE, /* advance postmaster's state machine */
+	PMSIGNAL_XLOG_IS_SHUTDOWN,	/* ShutdownXLOG() completed */
 } PMSignalReason;
 
-#define NUM_PMSIGNALS (PMSIGNAL_ADVANCE_STATE_MACHINE+1)
+#define NUM_PMSIGNALS (PMSIGNAL_XLOG_IS_SHUTDOWN+1)
 
 /*
  * Reasons why the postmaster would send SIGQUIT to its children.
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9bfd0fd665c..767bf9f5cf8 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -10,10 +10,13 @@
  * fill WAL segments; the checkpointer itself doesn't watch for the
  * condition.)
  *
- * Normal termination is by SIGUSR2, which instructs the checkpointer to
- * execute a shutdown checkpoint and then exit(0).  (All backends must be
- * stopped before SIGUSR2 is issued!)  Emergency termination is by SIGQUIT;
- * like any backend, the checkpointer will simply abort and exit on SIGQUIT.
+ * The normal termination sequence is that checkpointer is instructed to
+ * execute the shutdown checkpoint by SIGINT.  After that checkpointer waits
+ * to be terminated via SIGUSR2, which instructs the checkpointer to exit(0).
+ * All backends must be stopped before SIGINT or SIGUSR2 is issued!
+ *
+ * Emergency termination is by SIGQUIT; like any backend, the checkpointer
+ * will simply abort and exit on SIGQUIT.
  *
  * If the checkpointer exits unexpectedly, the postmaster treats that the same
  * as a backend crash: shared memory may be corrupted, so remaining backends
@@ -51,6 +54,7 @@
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
+#include "storage/pmsignal.h"
 #include "storage/proc.h"
 #include "storage/procsignal.h"
 #include "storage/shmem.h"
@@ -141,6 +145,7 @@ double		CheckPointCompletionTarget = 0.9;
  * Private state
  */
 static bool ckpt_active = false;
+static volatile sig_atomic_t ShutdownXLOGPending = false;
 
 /* these values are valid when ckpt_active is true: */
 static pg_time_t ckpt_start_time;
@@ -160,7 +165,7 @@ static bool CompactCheckpointerRequestQueue(void);
 static void UpdateSharedMemoryConfig(void);
 
 /* Signal handlers */
-static void ReqCheckpointHandler(SIGNAL_ARGS);
+static void ReqShutdownXLOG(SIGNAL_ARGS);
 
 
 /*
@@ -191,7 +196,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * tell us it's okay to shut down (via SIGUSR2).
 	 */
 	pqsignal(SIGHUP, SignalHandlerForConfigReload);
-	pqsignal(SIGINT, ReqCheckpointHandler); /* request checkpoint */
+	pqsignal(SIGINT, ReqShutdownXLOG);
 	pqsignal(SIGTERM, SIG_IGN); /* ignore SIGTERM */
 	/* SIGQUIT handler was already set up by InitPostmasterChild */
 	pqsignal(SIGALRM, SIG_IGN);
@@ -214,8 +219,11 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	 * process during a normal shutdown, and since checkpointer is shut down
 	 * very late...
 	 *
-	 * Walsenders are shut down after the checkpointer, but currently don't
-	 * report stats. If that changes, we need a more complicated solution.
+	 * While e.g. walsenders are active after the shutdown checkpoint has been
+	 * written (and thus could produce more stats), checkpointer stays around
+	 * after the shutdown checkpoint has been written. postmaster will only
+	 * signal checkpointer to exit after all processes that could emit stats
+	 * have been shut down.
 	 */
 	before_shmem_exit(pgstat_before_server_shutdown, 0);
 
@@ -330,7 +338,7 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 	ProcGlobal->checkpointerProc = MyProcNumber;
 
 	/*
-	 * Loop forever
+	 * Loop until we've been asked to write shutdown checkpoint or terminate.
 	 */
 	for (;;)
 	{
@@ -349,7 +357,10 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 		 * Process any requests or signals received recently.
 		 */
 		AbsorbSyncRequests();
+
 		HandleCheckpointerInterrupts();
+		if (ShutdownXLOGPending || ShutdownRequestPending)
+			break;
 
 		/*
 		 * Detect a pending checkpoint request by checking whether the flags
@@ -520,8 +531,13 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 
 			ckpt_active = false;
 
-			/* We may have received an interrupt during the checkpoint. */
+			/*
+			 * We may have received an interrupt during the checkpoint and the
+			 * latch might have been reset (e.g. in CheckpointWriteDelay).
+			 */
 			HandleCheckpointerInterrupts();
+			if (ShutdownXLOGPending || ShutdownRequestPending)
+				break;
 		}
 
 		/* Check for archive_timeout and switch xlog files if necessary. */
@@ -560,6 +576,56 @@ CheckpointerMain(char *startup_data, size_t startup_data_len)
 						 cur_timeout * 1000L /* convert to ms */ ,
 						 WAIT_EVENT_CHECKPOINTER_MAIN);
 	}
+
+	/*
+	 * From here on, elog(ERROR) should end with exit(1), not send control
+	 * back to the sigsetjmp block above.
+	 */
+	ExitOnAnyError = true;
+
+	if (ShutdownXLOGPending)
+	{
+		/*
+		 * Close down the database.
+		 *
+		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
+		 * updates the statistics, increment the checkpoint request and flush
+		 * out pending statistic.
+		 */
+		PendingCheckpointerStats.num_requested++;
+		ShutdownXLOG(0, 0);
+		pgstat_report_checkpointer();
+		pgstat_report_wal(true);
+
+		/*
+		 * Tell postmaster that we're done.
+		 */
+		SendPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN);
+	}
+
+	/*
+	 * Wait until we're asked to shut down. By separating the writing of the
+	 * shutdown checkpoint from checkpointer exiting, checkpointer can perform
+	 * some should-be-as-late-as-possible work like writing out stats.
+	 */
+	for (;;)
+	{
+		/* Clear any already-pending wakeups */
+		ResetLatch(MyLatch);
+
+		HandleCheckpointerInterrupts();
+
+		if (ShutdownRequestPending)
+			break;
+
+		(void) WaitLatch(MyLatch,
+						 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
+						 0,
+						 WAIT_EVENT_CHECKPOINTER_SHUTDOWN);
+	}
+
+	/* Normal exit from the checkpointer is here */
+	proc_exit(0);				/* done */
 }
 
 /*
@@ -589,29 +655,6 @@ HandleCheckpointerInterrupts(void)
 		 */
 		UpdateSharedMemoryConfig();
 	}
-	if (ShutdownRequestPending)
-	{
-		/*
-		 * From here on, elog(ERROR) should end with exit(1), not send control
-		 * back to the sigsetjmp block above
-		 */
-		ExitOnAnyError = true;
-
-		/*
-		 * Close down the database.
-		 *
-		 * Since ShutdownXLOG() creates restartpoint or checkpoint, and
-		 * updates the statistics, increment the checkpoint request and flush
-		 * out pending statistic.
-		 */
-		PendingCheckpointerStats.num_requested++;
-		ShutdownXLOG(0, 0);
-		pgstat_report_checkpointer();
-		pgstat_report_wal(true);
-
-		/* Normal exit from the checkpointer is here */
-		proc_exit(0);			/* done */
-	}
 
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
@@ -732,6 +775,7 @@ CheckpointWriteDelay(int flags, double progress)
 	 * in which case we just try to catch up as quickly as possible.
 	 */
 	if (!(flags & CHECKPOINT_IMMEDIATE) &&
+		!ShutdownXLOGPending &&
 		!ShutdownRequestPending &&
 		!ImmediateCheckpointRequested() &&
 		IsCheckpointOnSchedule(progress))
@@ -865,14 +909,11 @@ IsCheckpointOnSchedule(double progress)
  * --------------------------------
  */
 
-/* SIGINT: set flag to run a normal checkpoint right away */
+/* SIGINT: set flag to trigger writing of shutdown checkpoint */
 static void
-ReqCheckpointHandler(SIGNAL_ARGS)
+ReqShutdownXLOG(SIGNAL_ARGS)
 {
-	/*
-	 * The signaling process should have set ckpt_flags nonzero, so all we
-	 * need do is ensure that our main loop gets kicked out of any wait.
-	 */
+	ShutdownXLOGPending = true;
 	SetLatch(MyLatch);
 }
 
@@ -990,38 +1031,36 @@ RequestCheckpoint(int flags)
 	SpinLockRelease(&CheckpointerShmem->ckpt_lck);
 
 	/*
-	 * Send signal to request checkpoint.  It's possible that the checkpointer
-	 * hasn't started yet, or is in process of restarting, so we will retry a
-	 * few times if needed.  (Actually, more than a few times, since on slow
-	 * or overloaded buildfarm machines, it's been observed that the
-	 * checkpointer can take several seconds to start.)  However, if not told
-	 * to wait for the checkpoint to occur, we consider failure to send the
-	 * signal to be nonfatal and merely LOG it.  The checkpointer should see
-	 * the request when it does start, with or without getting a signal.
+	 * Set checkpointer's latch to request checkpoint.  It's possible that the
+	 * checkpointer hasn't started yet, so we will retry a few times if
+	 * needed.  (Actually, more than a few times, since on slow or overloaded
+	 * buildfarm machines, it's been observed that the checkpointer can take
+	 * several seconds to start.)  However, if not told to wait for the
+	 * checkpoint to occur, we consider failure to set the latch to be
+	 * nonfatal and merely LOG it.  The checkpointer should see the request
+	 * when it does start, with or without the SetLatch().
 	 */
 #define MAX_SIGNAL_TRIES 600	/* max wait 60.0 sec */
 	for (ntries = 0;; ntries++)
 	{
-		if (CheckpointerShmem->checkpointer_pid == 0)
+		volatile PROC_HDR *procglobal = ProcGlobal;
+		ProcNumber	checkpointerProc = procglobal->checkpointerProc;
+
+		if (checkpointerProc == INVALID_PROC_NUMBER)
 		{
 			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
 			{
 				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: checkpointer is not running");
-				break;
-			}
-		}
-		else if (kill(CheckpointerShmem->checkpointer_pid, SIGINT) != 0)
-		{
-			if (ntries >= MAX_SIGNAL_TRIES || !(flags & CHECKPOINT_WAIT))
-			{
-				elog((flags & CHECKPOINT_WAIT) ? ERROR : LOG,
-					 "could not signal for checkpoint: %m");
+					 "could not notify checkpoint: checkpointer is not running");
 				break;
 			}
 		}
 		else
-			break;				/* signal sent successfully */
+		{
+			SetLatch(&GetPGProcByNumber(checkpointerProc)->procLatch);
+			/* notified successfully */
+			break;
+		}
 
 		CHECK_FOR_INTERRUPTS();
 		pg_usleep(100000L);		/* wait 0.1 sec, then retry */
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 5f615d0f605..115ad3d31d2 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -334,6 +334,7 @@ typedef enum
 								 * ckpt */
 	PM_WAIT_XLOG_ARCHIVAL,		/* waiting for archiver and walsenders to
 								 * finish */
+	PM_WAIT_CHECKPOINTER,		/* waiting for checkpointer to shut down */
 	PM_WAIT_DEAD_END,			/* waiting for dead-end children to exit */
 	PM_NO_CHILDREN,				/* all important children have exited */
 } PMState;
@@ -424,7 +425,6 @@ static int	BackendStartup(ClientSocket *client_sock);
 static void report_fork_failure_to_client(ClientSocket *client_sock, int errnum);
 static CAC_state canAcceptConnections(BackendType backend_type);
 static void signal_child(PMChild *pmchild, int signal);
-static void sigquit_child(PMChild *pmchild);
 static bool SignalChildren(int signal, BackendTypeMask targetMask);
 static void TerminateChildren(int signal);
 static int	CountChildren(BackendTypeMask targetMask);
@@ -2355,35 +2355,19 @@ process_pm_child_exit(void)
 		{
 			ReleasePostmasterChildSlot(CheckpointerPMChild);
 			CheckpointerPMChild = NULL;
-			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_XLOG_SHUTDOWN)
+			if (EXIT_STATUS_0(exitstatus) && pmState == PM_WAIT_CHECKPOINTER)
 			{
 				/*
 				 * OK, we saw normal exit of the checkpointer after it's been
-				 * told to shut down.  We expect that it wrote a shutdown
-				 * checkpoint.  (If for some reason it didn't, recovery will
-				 * occur on next postmaster start.)
+				 * told to shut down.  We know checkpointer wrote a shutdown
+				 * checkpoint, otherwise we'd still be in
+				 * PM_WAIT_XLOG_SHUTDOWN state.
 				 *
-				 * At this point we should have no normal backend children
-				 * left (else we'd not be in PM_WAIT_XLOG_SHUTDOWN state) but
-				 * we might have dead-end children to wait for.
-				 *
-				 * If we have an archiver subprocess, tell it to do a last
-				 * archive cycle and quit. Likewise, if we have walsender
-				 * processes, tell them to send any remaining WAL and quit.
+				 * At this point only dead-end children should be left.
 				 */
-				Assert(Shutdown > NoShutdown);
-
-				/* Waken archiver for the last time */
-				if (PgArchPMChild != NULL)
-					signal_child(PgArchPMChild, SIGUSR2);
-
-				/*
-				 * Waken walsenders for the last time. No regular backends
-				 * should be around anymore.
-				 */
-				SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
-
-				UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+				UpdatePMState(PM_WAIT_DEAD_END);
+				ConfigurePostmasterWaitSet(false);
+				SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
 			}
 			else
 			{
@@ -2665,6 +2649,86 @@ CleanupBackend(PMChild *bp,
 		LogChildExit(DEBUG2, procname, bp_pid, exitstatus);
 }
 
+/*
+ * Transition into FatalError state, in response to something bad having
+ * happened. Commonly the caller will have logged the reason for entering
+ * FatalError state.
+ */
+static void
+HandleFatalError(QuitSignalReason reason, bool consider_sigabrt)
+{
+	int			sigtosend;
+
+	SetQuitSignalReason(reason);
+
+	if (consider_sigabrt && send_abort_for_crash)
+		sigtosend = SIGABRT;
+	else
+		sigtosend = SIGQUIT;
+
+	/*
+	 * Signal all other child processes to exit.
+	 *
+	 * We could exclude dead-end children here, but at least when sending
+	 * SIGABRT it seems better to include them.
+	 */
+	TerminateChildren(sigtosend);
+
+	if (Shutdown != ImmediateShutdown)
+		FatalError = true;
+
+	/* We now transit into a state of waiting for children to die */
+	switch (pmState)
+	{
+		case PM_INIT:
+			/* shouldn't have any children */
+			Assert(false);
+			break;
+		case PM_STARTUP:
+			/* should have been handled in process_pm_child_exit */
+			Assert(false);
+			break;
+
+			/* wait for children to die */
+		case PM_RECOVERY:
+		case PM_HOT_STANDBY:
+		case PM_RUN:
+		case PM_STOP_BACKENDS:
+			UpdatePMState(PM_WAIT_BACKENDS);
+			break;
+
+		case PM_WAIT_BACKENDS:
+			/* there might be more backends to wait for */
+			break;
+
+		case PM_WAIT_XLOG_SHUTDOWN:
+		case PM_WAIT_XLOG_ARCHIVAL:
+		case PM_WAIT_CHECKPOINTER:
+
+			/*
+			 * Note that we switch *back* to PM_WAIT_BACKENDS here. This way
+			 * the PM_WAIT_BACKENDS && FatalError code in
+			 * PostmasterStateMachine does not have to be duplicated.
+			 *
+			 * XXX: This seems rather ugly, but it's not obvious if the
+			 * alternative is better.
+			 */
+			UpdatePMState(PM_WAIT_BACKENDS);
+			break;
+
+		case PM_WAIT_DEAD_END:
+		case PM_NO_CHILDREN:
+			break;
+	}
+
+	/*
+	 * .. and if this doesn't happen quickly enough, now the clock is ticking
+	 * for us to kill them without mercy.
+	 */
+	if (AbortStartTime == 0)
+		AbortStartTime = time(NULL);
+}
+
 /*
  * HandleChildCrash -- cleanup after failed backend, bgwriter, checkpointer,
  * walwriter, autovacuum, archiver, slot sync worker, or background worker.
@@ -2677,8 +2741,6 @@ CleanupBackend(PMChild *bp,
 static void
 HandleChildCrash(int pid, int exitstatus, const char *procname)
 {
-	bool		take_action;
-
 	/*
 	 * We only log messages and send signals if this is the first process
 	 * crash and we're not doing an immediate shutdown; otherwise, we're only
@@ -2686,63 +2748,18 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
 	 * signaled children, nonzero exit status is to be expected, so don't
 	 * clutter log.
 	 */
-	take_action = !FatalError && Shutdown != ImmediateShutdown;
+	if (FatalError || Shutdown == ImmediateShutdown)
+		return;
 
-	if (take_action)
-	{
-		LogChildExit(LOG, procname, pid, exitstatus);
-		ereport(LOG,
-				(errmsg("terminating any other active server processes")));
-		SetQuitSignalReason(PMQUIT_FOR_CRASH);
-	}
+	LogChildExit(LOG, procname, pid, exitstatus);
+	ereport(LOG,
+			(errmsg("terminating any other active server processes")));
 
 	/*
-	 * Signal all other child processes to exit.  The crashed process has
-	 * already been removed from ActiveChildList.
+	 * Switch into error state. The crashed process has already been removed
+	 * from ActiveChildList.
 	 */
-	if (take_action)
-	{
-		dlist_iter	iter;
-
-		dlist_foreach(iter, &ActiveChildList)
-		{
-			PMChild    *bp = dlist_container(PMChild, elem, iter.cur);
-
-			/* We do NOT restart the syslogger */
-			if (bp == SysLoggerPMChild)
-				continue;
-
-			if (bp == StartupPMChild)
-				StartupStatus = STARTUP_SIGNALED;
-
-			/*
-			 * This backend is still alive.  Unless we did so already, tell it
-			 * to commit hara-kiri.
-			 *
-			 * We could exclude dead-end children here, but at least when
-			 * sending SIGABRT it seems better to include them.
-			 */
-			sigquit_child(bp);
-		}
-	}
-
-	if (Shutdown != ImmediateShutdown)
-		FatalError = true;
-
-	/* We now transit into a state of waiting for children to die */
-	if (pmState == PM_RECOVERY ||
-		pmState == PM_HOT_STANDBY ||
-		pmState == PM_RUN ||
-		pmState == PM_STOP_BACKENDS ||
-		pmState == PM_WAIT_XLOG_SHUTDOWN)
-		UpdatePMState(PM_WAIT_BACKENDS);
-
-	/*
-	 * .. and if this doesn't happen quickly enough, now the clock is ticking
-	 * for us to kill them without mercy.
-	 */
-	if (AbortStartTime == 0)
-		AbortStartTime = time(NULL);
+	HandleFatalError(PMQUIT_FOR_CRASH, true);
 }
 
 /*
@@ -2840,6 +2857,9 @@ PostmasterStateMachine(void)
 	 * PM_WAIT_BACKENDS, but we signal the processes first, before waiting for
 	 * them.  Treating it as a distinct pmState allows us to share this code
 	 * across multiple shutdown code paths.
+	 *
+	 * Note that HandleFatalError() switches to PM_WAIT_BACKENDS even if we
+	 * were, before the fatal error, in a "more advanced" state.
 	 */
 	if (pmState == PM_STOP_BACKENDS || pmState == PM_WAIT_BACKENDS)
 	{
@@ -2872,16 +2892,20 @@ PostmasterStateMachine(void)
 
 		/*
 		 * If we are doing crash recovery or an immediate shutdown then we
-		 * expect the checkpointer to exit as well, otherwise not.
+		 * expect archiver, checkpointer and walsender to exit as well,
+		 * otherwise not.
 		 */
 		if (FatalError || Shutdown >= ImmediateShutdown)
-			targetMask = btmask_add(targetMask, B_CHECKPOINTER);
+			targetMask = btmask_add(targetMask,
+									B_CHECKPOINTER,
+									B_ARCHIVER,
+									B_WAL_SENDER);
 
 		/*
-		 * Walsenders and archiver will continue running; they will be
-		 * terminated later after writing the checkpoint record.  We also let
-		 * dead-end children to keep running for now.  The syslogger process
-		 * exits last.
+		 * Normally walsenders and archiver will continue running; they will
+		 * be terminated later after writing the checkpoint record.  We also
+		 * let dead-end children to keep running for now.  The syslogger
+		 * process exits last.
 		 *
 		 * This assertion checks that we have covered all backend types,
 		 * either by including them in targetMask, or by noting here that they
@@ -2892,13 +2916,17 @@ PostmasterStateMachine(void)
 			BackendTypeMask remainMask = BTYPE_MASK_NONE;
 
 			remainMask = btmask_add(remainMask,
-									B_WAL_SENDER,
-									B_ARCHIVER,
 									B_DEAD_END_BACKEND,
 									B_LOGGER);
 
-			/* checkpointer may or may not be in targetMask already */
-			remainMask = btmask_add(remainMask, B_CHECKPOINTER);
+			/*
+			 * Archiver, checkpointer and walsender may or may not be in
+			 * targetMask already.
+			 */
+			remainMask = btmask_add(remainMask,
+									B_ARCHIVER,
+									B_CHECKPOINTER,
+									B_WAL_SENDER);
 
 			/* these are not real postmaster children */
 			remainMask = btmask_add(remainMask,
@@ -2938,9 +2966,9 @@ PostmasterStateMachine(void)
 				SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
 
 				/*
-				 * We already SIGQUIT'd walsenders and the archiver, if any,
-				 * when we started immediate shutdown or entered FatalError
-				 * state.
+				 * We already SIGQUIT'd archiver, checkpointer and walsenders,
+				 * if any, when we started immediate shutdown or entered
+				 * FatalError state.
 				 */
 			}
 			else
@@ -2954,10 +2982,10 @@ PostmasterStateMachine(void)
 				/* Start the checkpointer if not running */
 				if (CheckpointerPMChild == NULL)
 					CheckpointerPMChild = StartChildProcess(B_CHECKPOINTER);
-				/* And tell it to shut down */
+				/* And tell it to write the shutdown checkpoint */
 				if (CheckpointerPMChild != NULL)
 				{
-					signal_child(CheckpointerPMChild, SIGUSR2);
+					signal_child(CheckpointerPMChild, SIGINT);
 					UpdatePMState(PM_WAIT_XLOG_SHUTDOWN);
 				}
 				else
@@ -2971,34 +2999,50 @@ PostmasterStateMachine(void)
 					 * We don't consult send_abort_for_crash here, as it's
 					 * unlikely that dumping cores would illuminate the reason
 					 * for checkpointer fork failure.
+					 *
+					 * XXX: Is it worth inventing a different PMQUIT value
+					 * that signals that the cluster is in a bad state,
+					 * without a process having crashed?
 					 */
-					FatalError = true;
-					UpdatePMState(PM_WAIT_DEAD_END);
-					ConfigurePostmasterWaitSet(false);
-
-					/* Kill the walsenders and archiver too */
-					SignalChildren(SIGQUIT, btmask_all_except(B_LOGGER));
+					HandleFatalError(PMQUIT_FOR_CRASH, false);
 				}
 			}
 		}
 	}
 
+	/*
+	 * The state transition from PM_WAIT_XLOG_SHUTDOWN to
+	 * PM_WAIT_XLOG_ARCHIVAL is in proccess_pm_pmsignal(), in response to
+	 * PMSIGNAL_XLOG_IS_SHUTDOWN.
+	 */
+
 	if (pmState == PM_WAIT_XLOG_ARCHIVAL)
 	{
 		/*
-		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no other children
-		 * than dead-end children left. There shouldn't be any regular
-		 * backends left by now anyway; what we're really waiting for is
-		 * walsenders and archiver.
+		 * PM_WAIT_XLOG_ARCHIVAL state ends when there's no children other
+		 * than checkpointer and dead-end children left. There shouldn't be
+		 * any regular backends left by now anyway; what we're really waiting
+		 * for is for walsenders and archiver to exit.
 		 */
-		if (CountChildren(btmask_all_except(B_LOGGER, B_DEAD_END_BACKEND)) == 0)
+		if (CountChildren(btmask_all_except(B_CHECKPOINTER, B_LOGGER, B_DEAD_END_BACKEND)) == 0)
 		{
-			UpdatePMState(PM_WAIT_DEAD_END);
-			ConfigurePostmasterWaitSet(false);
-			SignalChildren(SIGTERM, btmask_all_except(B_LOGGER));
+			UpdatePMState(PM_WAIT_CHECKPOINTER);
+
+			/*
+			 * Now that everyone important is gone, tell checkpointer to shut
+			 * down too. That allows checkpointer to perform some last bits of
+			 * cleanup without other processes interfering.
+			 */
+			if (CheckpointerPMChild != NULL)
+				signal_child(CheckpointerPMChild, SIGUSR2);
 		}
 	}
 
+	/*
+	 * The state transition from PM_WAIT_CHECKPOINTER to PM_WAIT_DEAD_END is
+	 * in proccess_pm_child_exit().
+	 */
+
 	if (pmState == PM_WAIT_DEAD_END)
 	{
 		/*
@@ -3135,6 +3179,7 @@ pmstate_name(PMState state)
 			PM_TOSTR_CASE(PM_WAIT_XLOG_SHUTDOWN);
 			PM_TOSTR_CASE(PM_WAIT_XLOG_ARCHIVAL);
 			PM_TOSTR_CASE(PM_WAIT_DEAD_END);
+			PM_TOSTR_CASE(PM_WAIT_CHECKPOINTER);
 			PM_TOSTR_CASE(PM_NO_CHILDREN);
 	}
 #undef PM_TOSTR_CASE
@@ -3347,19 +3392,6 @@ signal_child(PMChild *pmchild, int signal)
 #endif
 }
 
-/*
- * Convenience function for killing a child process after a crash of some
- * other child process.  We apply send_abort_for_crash to decide which signal
- * to send.  Normally it's SIGQUIT -- and most other comments in this file are
- * written on the assumption that it is -- but developers might prefer to use
- * SIGABRT to collect per-child core dumps.
- */
-static void
-sigquit_child(PMChild *pmchild)
-{
-	signal_child(pmchild, (send_abort_for_crash ? SIGABRT : SIGQUIT));
-}
-
 /*
  * Send a signal to the targeted children.
  */
@@ -3565,6 +3597,8 @@ ExitPostmaster(int status)
 static void
 process_pm_pmsignal(void)
 {
+	bool		request_state_update = false;
+
 	pending_pm_pmsignal = false;
 
 	ereport(DEBUG2,
@@ -3676,9 +3710,67 @@ process_pm_pmsignal(void)
 		WalReceiverRequested = true;
 	}
 
+	if (CheckPostmasterSignal(PMSIGNAL_XLOG_IS_SHUTDOWN))
+	{
+		/* Checkpointer completed the shutdown checkpoint */
+		if (pmState == PM_WAIT_XLOG_SHUTDOWN)
+		{
+			/*
+			 * If we have an archiver subprocess, tell it to do a last archive
+			 * cycle and quit. Likewise, if we have walsender processes, tell
+			 * them to send any remaining WAL and quit.
+			 */
+			Assert(Shutdown > NoShutdown);
+
+			/* Waken archiver for the last time */
+			if (PgArchPMChild != NULL)
+				signal_child(PgArchPMChild, SIGUSR2);
+
+			/*
+			 * Waken walsenders for the last time. No regular backends should
+			 * be around anymore.
+			 */
+			SignalChildren(SIGUSR2, btmask(B_WAL_SENDER));
+
+			UpdatePMState(PM_WAIT_XLOG_ARCHIVAL);
+		}
+		else if (!FatalError && Shutdown != ImmediateShutdown)
+		{
+			/*
+			 * Checkpointer only ought to perform the shutdown checkpoint
+			 * during shutdown.  If somehow checkpointer did so in another
+			 * situation, we have no choice but to crash-restart.
+			 *
+			 * It's possible however that we get PMSIGNAL_XLOG_IS_SHUTDOWN
+			 * outside of PM_WAIT_XLOG_SHUTDOWN if an orderly shutdown was
+			 * "interrupted" by a crash or an immediate shutdown.
+			 */
+			ereport(LOG,
+					(errmsg("WAL was shut down unexpectedly")));
+
+			/*
+			 * Doesn't seem likely to help to take send_abort_for_crash into
+			 * account here.
+			 */
+			HandleFatalError(PMQUIT_FOR_CRASH, false);
+		}
+
+		/*
+		 * Need to run PostmasterStateMachine() to check if we already can go
+		 * to the next state.
+		 */
+		request_state_update = true;
+	}
+
 	/*
 	 * Try to advance postmaster's state machine, if a child requests it.
-	 *
+	 */
+	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	{
+		request_state_update = true;
+	}
+
+	/*
 	 * Be careful about the order of this action relative to this function's
 	 * other actions.  Generally, this should be after other actions, in case
 	 * they have effects PostmasterStateMachine would need to know about.
@@ -3686,7 +3778,7 @@ process_pm_pmsignal(void)
 	 * cannot have any (immediate) effect on the state machine, but does
 	 * depend on what state we're in now.
 	 */
-	if (CheckPostmasterSignal(PMSIGNAL_ADVANCE_STATE_MACHINE))
+	if (request_state_update)
 	{
 		PostmasterStateMachine();
 	}
@@ -3997,6 +4089,7 @@ bgworker_should_start_now(BgWorkerStartTime start_time)
 	switch (pmState)
 	{
 		case PM_NO_CHILDREN:
+		case PM_WAIT_CHECKPOINTER:
 		case PM_WAIT_DEAD_END:
 		case PM_WAIT_XLOG_ARCHIVAL:
 		case PM_WAIT_XLOG_SHUTDOWN:
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index 0b53cba807d..e199f071628 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -56,6 +56,7 @@ AUTOVACUUM_MAIN	"Waiting in main loop of autovacuum launcher process."
 BGWRITER_HIBERNATE	"Waiting in background writer process, hibernating."
 BGWRITER_MAIN	"Waiting in main loop of background writer process."
 CHECKPOINTER_MAIN	"Waiting in main loop of checkpointer process."
+CHECKPOINTER_SHUTDOWN	"Waiting for checkpointer process to be terminated."
 LOGICAL_APPLY_MAIN	"Waiting in main loop of logical replication apply process."
 LOGICAL_LAUNCHER_MAIN	"Waiting in main loop of logical replication launcher process."
 LOGICAL_PARALLEL_APPLY_MAIN	"Waiting in main loop of logical replication parallel apply process."
-- 
2.47.2

From 67c5e26dd6da61e212ce45f19a366c109ce456d5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavu...@gmail.com>
Date: Thu, 23 Jan 2025 10:45:23 +0300
Subject: [PATCH v9 2/3] Add WAL I/O stats to both pg_stat_io view and per
 backend I/O statistics

This commit adds WAL I/O stats to both pg_stat_io view and per backend
I/O statistics (pg_stat_get_backend_io()).

This commit introduces a three new I/O concepts:

- IOObject IOOBJECT_WAL is used for tracking all WAL I/Os.
- IOOBJECT_WAL / IOCONTEXT_NORMAL is used for tracking I/O operations
  done on already created wal segments.
- IOOBJECT_WAL / IOCONTEXT_INIT is used for tracking I/O operations done
  while creating the WAL segments.

For now 027_stream_regress.pl -> 'check contents of pg_stat_statements on
regression database' test fails. It is because walsenders may shutdown
after the checkpointer. There is ongoing discussion to fix that problem:
postgr.es/m/flat/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu%40m3cfzxicm5kp
---
 src/include/pgstat.h                        |  4 +-
 src/backend/access/transam/xlog.c           | 57 ++++++--------
 src/backend/access/transam/xlogreader.c     |  8 ++
 src/backend/access/transam/xlogrecovery.c   | 14 ++++
 src/backend/utils/activity/pgstat_backend.c |  9 ++-
 src/backend/utils/activity/pgstat_io.c      | 84 ++++++++++++++++++---
 src/test/regress/expected/stats.out         | 53 +++++++++++++
 src/test/regress/sql/stats.sql              | 25 ++++++
 doc/src/sgml/monitoring.sgml                | 19 ++++-
 9 files changed, 221 insertions(+), 52 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index d0d45150977..0a456858c9c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -274,14 +274,16 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
 	IOCONTEXT_BULKREAD,
 	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_INIT,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
 } IOContext;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..ee26e9739bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2425,38 +2425,23 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			ssize_t		written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
 			nleft = nbytes;
+
+			io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 			do
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
-
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
-				}
-
 				PendingWalStats.wal_write++;
 
 				if (written <= 0)
@@ -2481,6 +2466,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				startoffset += written;
 			} while (nleft > 0);
 
+			pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+									IOOP_WRITE, io_start, npages, npages * XLOG_BLCKSZ);
+
 			npages = 0;
 
 			/*
@@ -3216,6 +3204,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3259,6 +3248,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3294,6 +3285,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1, wal_segment_size);
+
 	if (save_errno)
 	{
 		/*
@@ -3310,6 +3304,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3322,6 +3318,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1, 0);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8683,7 +8682,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8696,11 +8695,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8746,16 +8741,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
-	}
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1, 0);
 
 	PendingWalStats.wal_sync++;
 }
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3596af06172..0b6e3db353e 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1507,6 +1507,9 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	instr_time	io_start;
+#endif
 
 	p = buf;
 	recptr = startptr;
@@ -1552,6 +1555,8 @@ WALRead(XLogReaderState *state,
 			segbytes = nbytes;
 
 #ifndef FRONTEND
+		io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
 
@@ -1561,6 +1566,9 @@ WALRead(XLogReaderState *state,
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, readbytes);
 #endif
 
 		if (readbytes <= 0)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cf2b007806f..91fa72fadcc 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/datetime.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1836,6 +1837,9 @@ PerformWalRecovery(void)
 		 * end of main redo apply loop
 		 */
 
+		/* Report pending statistics to the cumulative stats system  */
+		pgstat_flush_io(false);
+
 		if (reachedRecoveryTarget)
 		{
 			if (!reachedConsistency)
@@ -3306,6 +3310,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3398,6 +3403,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3406,6 +3413,10 @@ retry:
 		int			save_errno = errno;
 
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, r);
+
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
 		if (r < 0)
 		{
@@ -3426,6 +3437,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1, r);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index bcf9e4b1487..338da73a9a9 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -24,6 +24,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "storage/bufmgr.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
@@ -43,7 +44,7 @@ void
 pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
 								IOOp io_op, instr_time io_time)
 {
-	Assert(track_io_timing);
+	Assert(track_io_timing || track_wal_io_timing);
 
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return;
@@ -236,9 +237,6 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 		case B_BG_WRITER:
 		case B_CHECKPOINTER:
 		case B_STARTUP:
@@ -249,7 +247,10 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_BG_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_SLOTSYNC_WORKER:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 6ff5d9e96a1..17924759e90 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
@@ -105,6 +106,20 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  */
@@ -112,14 +127,24 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt, uint64 bytes)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate timing data.  pgstat_count_buffer is for pgstat_database. As
+	 * pg_stat_database only counts blk_read_time and blk_write_time, it is
+	 * set for IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
+	 *
+	 * pgBufferUsage is for EXPLAIN.  pgBufferUsage has write and read stats
+	 * for shared, local and temporary blocks.  Temporary blocks are ignored
+	 * here.
+	 */
+	if (pgstat_should_track_io_time(io_object))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+		if (io_object != IOOBJECT_WAL &&
+			(io_op == IOOP_WRITE || io_op == IOOP_EXTEND))
 		{
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -127,7 +152,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			else if (io_object == IOOBJECT_TEMP_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
 		}
-		else if (io_op == IOOP_READ)
+		else if (io_object != IOOBJECT_WAL && io_op == IOOP_READ)
 		{
 			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -241,6 +266,8 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "bulkread";
 		case IOCONTEXT_BULKWRITE:
 			return "bulkwrite";
+		case IOCONTEXT_INIT:
+			return "init";
 		case IOCONTEXT_NORMAL:
 			return "normal";
 		case IOCONTEXT_VACUUM:
@@ -260,6 +287,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -332,8 +361,6 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-*   pg_stat_io for now
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -354,9 +381,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 			return false;
 
 		case B_AUTOVAC_LAUNCHER:
@@ -368,7 +392,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_SLOTSYNC_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
@@ -394,6 +421,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL objects can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -460,14 +496,28 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	/*
 	 * Some BackendTypes will not do certain IOOps.
 	 */
-	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+	if ((bktype == B_BG_WRITER) &&
 		(io_op == IOOP_READ || io_op == IOOP_EVICT || io_op == IOOP_HIT))
 		return false;
 
+	if ((bktype == B_CHECKPOINTER) &&
+		((io_object != IOOBJECT_WAL && io_op == IOOP_READ) ||
+		 (io_op == IOOP_EVICT || io_op == IOOP_HIT)))
+		return false;
+
 	if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Some BackendTypes don't do reads with IOOBJECT_WAL.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+		(bktype == B_WAL_RECEIVER || bktype == B_BG_WRITER ||
+		 bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_WORKER ||
+		 bktype == B_WAL_WRITER))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -492,6 +542,22 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * An IOCONTEXT_INIT done for an IOOBJECT_WAL io_object does writes and
+	 * syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	/*
+	 * An IOCONTEXT_NORMAL done for an IOOBJECT_WAL io_object does writes,
+	 * reads and syncs.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a0317b7208e..7dfcaf03b78 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -1256,6 +1283,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1273,6 +1301,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1335,6 +1366,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 399c72bbcf7..9e368637199 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -602,6 +616,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -621,6 +636,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -649,6 +667,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e5888fae2b5..c6f2c384512 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2581,9 +2581,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2636,6 +2637,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2660,6 +2666,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
-- 
2.47.2

From af0946112394a478e9424d1870e2200abdb77f7d Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavu...@gmail.com>
Date: Thu, 23 Jan 2025 13:52:31 +0300
Subject: [PATCH v9 3/3] Fetch timing columns from pg_stat_io in the
 pg_stat_wal view

In the prior commit WAL I/Os' timing stats are started to be tracked in
the pg_stat_io view. So, to avoid double accounting and to have same
timing values in both pg_stat_io and pg_stat_wal; fetch timing columns
from pg_stat_io in the pg_stat_wal view.
---
 src/include/catalog/pg_proc.dat         |  6 +++---
 src/include/pgstat.h                    |  4 ----
 src/backend/catalog/system_views.sql    | 14 ++++++++++---
 src/backend/utils/activity/pgstat_wal.c |  2 --
 src/backend/utils/adt/pgstatfuncs.c     | 14 +++----------
 src/test/regress/expected/rules.out     | 27 ++++++++++++++++---------
 doc/src/sgml/monitoring.sgml            | 10 ++++++---
 7 files changed, 41 insertions(+), 36 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 18560755d26..79f31c48d1d 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5925,9 +5925,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 0a456858c9c..faac508015e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -477,8 +477,6 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	PgStat_Counter wal_write_time;
-	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -493,8 +491,6 @@ typedef struct PgStat_PendingWalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	instr_time	wal_write_time;
-	instr_time	wal_sync_time;
 } PgStat_PendingWalStats;
 
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 46868bf7e89..9448e1ec355 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1180,6 +1180,13 @@ SELECT
 FROM pg_stat_get_io() b;
 
 CREATE VIEW pg_stat_wal AS
+    WITH pgsio_sum_write_fsync_time AS (
+        SELECT
+            sum(write_time) as sum_write_time,
+            sum(fsync_time) as sum_fsync_time
+        FROM pg_stat_io
+        WHERE context = 'normal' and object = 'wal'
+    )
     SELECT
         w.wal_records,
         w.wal_fpi,
@@ -1187,10 +1194,11 @@ CREATE VIEW pg_stat_wal AS
         w.wal_buffers_full,
         w.wal_write,
         w.wal_sync,
-        w.wal_write_time,
-        w.wal_sync_time,
+        p.sum_write_time as write_time,
+        p.sum_fsync_time as fsync_time,
         w.stats_reset
-    FROM pg_stat_get_wal() w;
+    FROM pg_stat_get_wal() w
+        CROSS JOIN pgsio_sum_write_fsync_time p;
 
 CREATE VIEW pg_stat_progress_analyze AS
     SELECT
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 18fa6b2936a..8ee650cdc30 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -126,8 +126,6 @@ pgstat_wal_flush_cb(bool nowait)
 	WALSTAT_ACC(wal_buffers_full, PendingWalStats);
 	WALSTAT_ACC(wal_write, PendingWalStats);
 	WALSTAT_ACC(wal_sync, PendingWalStats);
-	WALSTAT_ACC_INSTR_TIME(wal_write_time);
-	WALSTAT_ACC_INSTR_TIME(wal_sync_time);
 #undef WALSTAT_ACC_INSTR_TIME
 #undef WALSTAT_ACC
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0f5e0a9778d..5de31979bc0 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1609,7 +1609,7 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	7
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1630,11 +1630,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1657,11 +1653,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 856a8349c50..a15e0096d9c 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2241,16 +2241,23 @@ pg_stat_user_tables| SELECT relid,
     autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
-    wal_fpi,
-    wal_bytes,
-    wal_buffers_full,
-    wal_write,
-    wal_sync,
-    wal_write_time,
-    wal_sync_time,
-    stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+         SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+            sum(pg_stat_io.fsync_time) AS sum_fsync_time
+           FROM pg_stat_io
+          WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+        )
+ SELECT w.wal_records,
+    w.wal_fpi,
+    w.wal_bytes,
+    w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    p.sum_write_time AS write_time,
+    p.sum_fsync_time AS fsync_time,
+    w.stats_reset
+   FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+     CROSS JOIN pgsio_sum_write_fsync_time p);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c6f2c384512..516c4b96112 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4965,7 +4965,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. This resets the timing
+          counter stored in the <structname>pg_stat_wal</structname> view
+          too.
          </para>
         </listitem>
         <listitem>
@@ -4982,8 +4984,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. The timing
+          counters can be reset by calling pg_stat_reset_shared set to
+          <literal>io</literal>.
          </para>
         </listitem>
         <listitem>
-- 
2.47.2

Re: Show WAL write and fsync stats in pg_stat_io

Reply via email to