Hi,

On Tue, 28 Jan 2025 at 07:23, Michael Paquier <mich...@paquier.xyz> wrote:
>
> On Mon, Jan 27, 2025 at 05:13:39PM +0300, Nazir Bilal Yavuz wrote:
> > On Mon, 27 Jan 2025 at 16:59, Nazir Bilal Yavuz <byavu...@gmail.com> wrote:
> >>
> >> On Mon, 27 Jan 2025 at 03:52, Michael Paquier <mich...@paquier.xyz> wrote:
> >> We use PendingWalStats.wal_[ write | sync ]_time only to show timings
> >> on the pg_stat_wal view, right? And now these pg_stat_wal.wal_[ write
> >> | sync ]_time datas are fetched from the pg_stat_io view when the
> >> track_wal_io_timing is on. So, I think it is correct to remove these.
>
> As you say, removing the counters in the second patch does not matter
> as if you are going to combine them and..
>
> >>
> >> I made a mistake while splitting the patches. The places where
> >> 'PendingWalStats.wal_[ write | sync ]_time are incremented (the code
> >> piece you shared)' are removed in 0002 (0001 now), but they should be
> >> removed in 0003 (0002 now) instead. This is corrected in v11.
>
> My issue was in the first patch that should not have removed them.  My
> apologies for the confusion, I should have pointed out that this was
> likely an incorrect rebase and/or patch split.

Ah, I see. Thanks for clarifying!

> > If we agree with removing PendingWalStats.wal_[ write | sync ]_time
> > variables, then it would make sense to remove PgStat_PendingWalStats
> > struct completely. We have that struct because [1] it is cheap to
> > store PendingWalStats.wal_[ write | sync ]_time as instr_time instead
> > of PgStat_Counter.
> >
> > [1] ca7b3c4c00
>
> I agree that some simplification would be nice because it also makes
> Bertrand's patch around [1] to not have some special handling with
> PgStat_PendingWalStats without us losing monitoring capabilities, I
> hope.  So maximizing simplifications before integrating more
> capabilities should make the whole implementation effort easier.
>
> What you doing in 0001 is a first good step towards this goal, as this
> also plugs in a few things for backend statistics with the calls to
> pgstat_count_io_op[_time]().

I agree. Do you think that we need to do this simplification in this
thread or does it need its own thread?

>
> +       /* Report pending statistics to the cumulative stats system */
> +       pgstat_flush_io(false);
>
> This has been discussed previously under a pgstat_report_wal() call.
> Why do you need this specific call?  Perhaps this should be documented
> as a comment?

Without this change, stats in the startup process would only get
reported during shutdown of the startup process, so I added this but I
was wrong. It was already fixed [1], so yes; we do not need it now.
This is removed in v12.

>
> +   if (io_object == IOOBJECT_WAL)
> +       return track_wal_io_timing
>
> Hmm.  Andres has commented in the past that we want pg_stat_io to
> server as a central place for all the I/O statistics.  Thinking more
> about that, I am not really convinced that we actually need to make
> this area of the code in pgstat_io.c rely on two GUCs.  How about
> simplifying things so as we just rely on track_io_timing for
> everything, without creating a strange dependency on the IOObject with
> more routines like pgstat_should_track_io_time()?  I'd really want
> less of these GUCs, not more of them with cross-dependencies depending
> on the stats kinds we are dealing with.  If we replace the timings
> from pg_stat_wal with the ones in pg_stat_io, we should be in a good
> position to remove track_wal_io_timing entirely, as well.  This has
> the merit of making your patch a lot simpler, meaning extra bonus
> points.

I agree with you but it was discussed before in this thread [2]. It
was decided to use both track_wal_io_timing and track_io_timing
because of the overhead that track_wal_io_timing creates but we can
still re-discuss it. Do you think that this discussion needs its own
thread?

If we continue to discuss it in this thread, I am in favor of removing
track_wal_io_timing and using track_io_timing for all types of I/Os.
Like you said, this cross-dependency makes things more complex than
they used to be. Downside of removing track_wal_io_timing is affecting
people who:

1- Want to track timings of only WAL I/Os.
2- Want to track timings of all IOs except WAL I/Os.

I think the first group is more important than the second because
track_io_timing already creates overhead.

One additional thing is that I think track_io_timing is a general
word. When it exists, I do not expect there to be another GUC like
track_wal_io_timing to track WAL I/Os' timings.

[1] e3cb1a586c
[2] https://www.postgresql.org/message-id/ZUmuJ7P8THHz03nx%40paquier.xyz

--
Regards,
Nazir Bilal Yavuz
Microsoft
From 2f473008518de14154765193415dfcc7e7910ba5 Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavu...@gmail.com>
Date: Mon, 27 Jan 2025 14:03:54 +0300
Subject: [PATCH v12 1/2] Add WAL I/O stats to both pg_stat_io view and per
 backend I/O statistics

This commit adds WAL I/O stats to both pg_stat_io view and per backend
I/O statistics (pg_stat_get_backend_io()).

This commit introduces a three new I/O concepts:

- IOObject IOOBJECT_WAL is used for tracking all WAL I/Os.
- IOOBJECT_WAL / IOCONTEXT_NORMAL is used for tracking I/O operations
  done on already created wal segments.
- IOOBJECT_WAL / IOCONTEXT_INIT is used for tracking I/O operations done
  while creating the WAL segments.

For now 027_stream_regress.pl -> 'check contents of pg_stat_statements on
regression database' test fails. It is because walsenders may shutdown
after the checkpointer. There is ongoing discussion to fix that problem:
postgr.es/m/flat/kgng5nrvnlv335evmsuvpnh354rw7qyazl73kdysev2cr2v5zu%40m3cfzxicm5kp
---
 src/include/pgstat.h                        |  4 +-
 src/backend/access/transam/xlog.c           | 41 ++++++-----
 src/backend/access/transam/xlogreader.c     |  8 +++
 src/backend/access/transam/xlogrecovery.c   | 11 +++
 src/backend/utils/activity/pgstat_backend.c |  9 +--
 src/backend/utils/activity/pgstat_io.c      | 79 ++++++++++++++++++---
 src/test/regress/expected/stats.out         | 53 ++++++++++++++
 src/test/regress/sql/stats.sql              | 25 +++++++
 doc/src/sgml/monitoring.sgml                | 19 ++++-
 9 files changed, 216 insertions(+), 33 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 81ec0161c09..a61b488e8d8 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -274,14 +274,16 @@ typedef enum IOObject
 {
 	IOOBJECT_RELATION,
 	IOOBJECT_TEMP_RELATION,
+	IOOBJECT_WAL,
 } IOObject;
 
-#define IOOBJECT_NUM_TYPES (IOOBJECT_TEMP_RELATION + 1)
+#define IOOBJECT_NUM_TYPES (IOOBJECT_WAL + 1)
 
 typedef enum IOContext
 {
 	IOCONTEXT_BULKREAD,
 	IOCONTEXT_BULKWRITE,
+	IOCONTEXT_INIT,
 	IOCONTEXT_NORMAL,
 	IOCONTEXT_VACUUM,
 } IOContext;
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..14c4f62b6bb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2425,7 +2425,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			Size		nbytes;
 			Size		nleft;
 			ssize_t		written;
-			instr_time	start;
+			instr_time	io_start;
 
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2435,16 +2435,17 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 			{
 				errno = 0;
 
-				/* Measure I/O timing to write WAL data */
-				if (track_wal_io_timing)
-					INSTR_TIME_SET_CURRENT(start);
-				else
-					INSTR_TIME_SET_ZERO(start);
+				io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
 
+				pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL,
+										IOOP_WRITE, io_start, 1, written);
+
+				PendingWalStats.wal_write++;
+
 				/*
 				 * Increment the I/O timing and the number of times WAL data
 				 * were written out to disk.
@@ -2454,11 +2455,9 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 					instr_time	end;
 
 					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, start);
+					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, io_start);
 				}
 
-				PendingWalStats.wal_write++;
-
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -3216,6 +3215,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	int			fd;
 	int			save_errno;
 	int			open_flags = O_RDWR | O_CREAT | O_EXCL | PG_BINARY;
+	instr_time	io_start;
 
 	Assert(logtli != 0);
 
@@ -3259,6 +3259,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
 	if (wal_init_zero)
@@ -3294,6 +3296,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT, IOOP_WRITE,
+							io_start, 1, wal_segment_size);
+
 	if (save_errno)
 	{
 		/*
@@ -3310,6 +3315,8 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not write to file \"%s\": %m", tmppath)));
 	}
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
 	{
@@ -3322,6 +3329,9 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_INIT,
+							IOOP_FSYNC, io_start, 1, 0);
+
 	if (close(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -8683,7 +8693,7 @@ void
 issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 {
 	char	   *msg = NULL;
-	instr_time	start;
+	instr_time	io_start;
 
 	Assert(tli != 0);
 
@@ -8696,11 +8706,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		wal_sync_method == WAL_SYNC_METHOD_OPEN_DSYNC)
 		return;
 
-	/* Measure I/O timing to sync the WAL file */
-	if (track_wal_io_timing)
-		INSTR_TIME_SET_CURRENT(start);
-	else
-		INSTR_TIME_SET_ZERO(start);
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
@@ -8754,9 +8760,12 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 		instr_time	end;
 
 		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, start);
+		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, io_start);
 	}
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
+							io_start, 1, 0);
+
 	PendingWalStats.wal_sync++;
 }
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 91b6a91767d..e8d97cf0732 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1507,6 +1507,9 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	instr_time	io_start;
+#endif
 
 	p = buf;
 	recptr = startptr;
@@ -1552,6 +1555,8 @@ WALRead(XLogReaderState *state,
 			segbytes = nbytes;
 
 #ifndef FRONTEND
+		io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
 
@@ -1561,6 +1566,9 @@ WALRead(XLogReaderState *state,
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, readbytes);
 #endif
 
 		if (readbytes <= 0)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index cf2b007806f..c3d7cec1e09 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -60,6 +60,7 @@
 #include "utils/datetime.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
+#include "utils/pgstat_internal.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -3306,6 +3307,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	io_start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3398,6 +3400,8 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3406,6 +3410,10 @@ retry:
 		int			save_errno = errno;
 
 		pgstat_report_wait_end();
+
+		pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+								io_start, 1, r);
+
 		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
 		if (r < 0)
 		{
@@ -3426,6 +3434,9 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_READ,
+							io_start, 1, r);
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/utils/activity/pgstat_backend.c b/src/backend/utils/activity/pgstat_backend.c
index bcf9e4b1487..338da73a9a9 100644
--- a/src/backend/utils/activity/pgstat_backend.c
+++ b/src/backend/utils/activity/pgstat_backend.c
@@ -24,6 +24,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "storage/bufmgr.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
@@ -43,7 +44,7 @@ void
 pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
 								IOOp io_op, instr_time io_time)
 {
-	Assert(track_io_timing);
+	Assert(track_io_timing || track_wal_io_timing);
 
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return;
@@ -236,9 +237,6 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 		case B_BG_WRITER:
 		case B_CHECKPOINTER:
 		case B_STARTUP:
@@ -249,7 +247,10 @@ pgstat_tracks_backend_bktype(BackendType bktype)
 		case B_BG_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_SLOTSYNC_WORKER:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index 6ff5d9e96a1..f6e5d260c92 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -16,6 +16,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "executor/instrument.h"
 #include "storage/bufmgr.h"
 #include "utils/pgstat_internal.h"
@@ -105,6 +106,20 @@ pgstat_prepare_io_time(bool track_io_guc)
 	return io_start;
 }
 
+/*
+ * Decide if IO timings need to be tracked.  Timings associated to
+ * IOOBJECT_WAL objects are tracked if track_wal_io_timing is enabled,
+ * else rely on track_io_timing.
+ */
+static bool
+pgstat_should_track_io_time(IOObject io_object)
+{
+	if (io_object == IOOBJECT_WAL)
+		return track_wal_io_timing;
+
+	return track_io_timing;
+}
+
 /*
  * Like pgstat_count_io_op() except it also accumulates time.
  */
@@ -112,14 +127,24 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt, uint64 bytes)
 {
-	if (track_io_timing)
+	/*
+	 * Accumulate timing data.  pgstat_count_buffer is for pgstat_database. As
+	 * pg_stat_database only counts blk_read_time and blk_write_time, it is
+	 * set for IOOP_READ, IOOP_WRITE and IOOP_EXTEND.
+	 *
+	 * pgBufferUsage is for EXPLAIN.  pgBufferUsage has write and read stats
+	 * for shared, local and temporary blocks.  Temporary blocks are ignored
+	 * here.
+	 */
+	if (pgstat_should_track_io_time(io_object))
 	{
 		instr_time	io_time;
 
 		INSTR_TIME_SET_CURRENT(io_time);
 		INSTR_TIME_SUBTRACT(io_time, start_time);
 
-		if (io_op == IOOP_WRITE || io_op == IOOP_EXTEND)
+		if (io_object != IOOBJECT_WAL &&
+			(io_op == IOOP_WRITE || io_op == IOOP_EXTEND))
 		{
 			pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -127,7 +152,7 @@ pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 			else if (io_object == IOOBJECT_TEMP_RELATION)
 				INSTR_TIME_ADD(pgBufferUsage.local_blk_write_time, io_time);
 		}
-		else if (io_op == IOOP_READ)
+		else if (io_object != IOOBJECT_WAL && io_op == IOOP_READ)
 		{
 			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
 			if (io_object == IOOBJECT_RELATION)
@@ -241,6 +266,8 @@ pgstat_get_io_context_name(IOContext io_context)
 			return "bulkread";
 		case IOCONTEXT_BULKWRITE:
 			return "bulkwrite";
+		case IOCONTEXT_INIT:
+			return "init";
 		case IOCONTEXT_NORMAL:
 			return "normal";
 		case IOCONTEXT_VACUUM:
@@ -260,6 +287,8 @@ pgstat_get_io_object_name(IOObject io_object)
 			return "relation";
 		case IOOBJECT_TEMP_RELATION:
 			return "temp relation";
+		case IOOBJECT_WAL:
+			return "wal";
 	}
 
 	elog(ERROR, "unrecognized IOObject value: %d", io_object);
@@ -332,8 +361,6 @@ pgstat_io_snapshot_cb(void)
 * - Syslogger because it is not connected to shared memory
 * - Archiver because most relevant archiving IO is delegated to a
 *   specialized command or module
-* - WAL Receiver, WAL Writer, and WAL Summarizer IO are not tracked in
-*   pg_stat_io for now
 *
 * Function returns true if BackendType participates in the cumulative stats
 * subsystem for IO and false if it does not.
@@ -354,9 +381,6 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_DEAD_END_BACKEND:
 		case B_ARCHIVER:
 		case B_LOGGER:
-		case B_WAL_RECEIVER:
-		case B_WAL_WRITER:
-		case B_WAL_SUMMARIZER:
 			return false;
 
 		case B_AUTOVAC_LAUNCHER:
@@ -368,7 +392,10 @@ pgstat_tracks_io_bktype(BackendType bktype)
 		case B_SLOTSYNC_WORKER:
 		case B_STANDALONE_BACKEND:
 		case B_STARTUP:
+		case B_WAL_RECEIVER:
 		case B_WAL_SENDER:
+		case B_WAL_SUMMARIZER:
+		case B_WAL_WRITER:
 			return true;
 	}
 
@@ -394,6 +421,15 @@ pgstat_tracks_io_object(BackendType bktype, IOObject io_object,
 	if (!pgstat_tracks_io_bktype(bktype))
 		return false;
 
+	/*
+	 * Currently, IO on IOOBJECT_WAL objects can only occur in the
+	 * IOCONTEXT_NORMAL and IOCONTEXT_INIT IOContexts.
+	 */
+	if (io_object == IOOBJECT_WAL &&
+		(io_context != IOCONTEXT_NORMAL &&
+		 io_context != IOCONTEXT_INIT))
+		return false;
+
 	/*
 	 * Currently, IO on temporary relations can only occur in the
 	 * IOCONTEXT_NORMAL IOContext.
@@ -460,14 +496,28 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	/*
 	 * Some BackendTypes will not do certain IOOps.
 	 */
-	if ((bktype == B_BG_WRITER || bktype == B_CHECKPOINTER) &&
+	if ((bktype == B_BG_WRITER) &&
 		(io_op == IOOP_READ || io_op == IOOP_EVICT || io_op == IOOP_HIT))
 		return false;
 
+	if ((bktype == B_CHECKPOINTER) &&
+		((io_object != IOOBJECT_WAL && io_op == IOOP_READ) ||
+		 (io_op == IOOP_EVICT || io_op == IOOP_HIT)))
+		return false;
+
 	if ((bktype == B_AUTOVAC_LAUNCHER || bktype == B_BG_WRITER ||
 		 bktype == B_CHECKPOINTER) && io_op == IOOP_EXTEND)
 		return false;
 
+	/*
+	 * Some BackendTypes don't do reads with IOOBJECT_WAL.
+	 */
+	if (io_object == IOOBJECT_WAL && io_op == IOOP_READ &&
+		(bktype == B_WAL_RECEIVER || bktype == B_BG_WRITER ||
+		 bktype == B_AUTOVAC_WORKER || bktype == B_AUTOVAC_WORKER ||
+		 bktype == B_WAL_WRITER))
+		return false;
+
 	/*
 	 * Temporary tables are not logged and thus do not require fsync'ing.
 	 * Writeback is not requested for temporary tables.
@@ -492,6 +542,17 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 	if (!strategy_io_context && io_op == IOOP_REUSE)
 		return false;
 
+	/*
+	 * IOOBJECT_WAL IOObject will not do certain IOOps depending on IOContext.
+	 */
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_INIT &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_FSYNC))
+		return false;
+
+	if (io_object == IOOBJECT_WAL && io_context == IOCONTEXT_NORMAL &&
+		!(io_op == IOOP_WRITE || io_op == IOOP_READ || io_op == IOOP_FSYNC))
+		return false;
+
 	/*
 	 * IOOP_FSYNC IOOps done by a backend using a BufferAccessStrategy are
 	 * counted in the IOCONTEXT_NORMAL IOContext. See comment in
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index a0317b7208e..7dfcaf03b78 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -862,6 +862,33 @@ WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
  t
 (1 row)
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+ ?column? 
+----------
+ t
+(1 row)
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+ ?column? 
+----------
+ t
+(1 row)
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -1256,6 +1283,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
 -- Create a regular table and insert some data to generate IOCONTEXT_NORMAL
@@ -1273,6 +1301,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -1335,6 +1366,28 @@ SELECT current_setting('fsync') = 'off'
  t
 (1 row)
 
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+ ?column? 
+----------
+ t
+(1 row)
+
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
+ ?column? 
+----------
+ t
+(1 row)
+
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
 SELECT sum(reads) AS io_sum_shared_before_reads
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 399c72bbcf7..9e368637199 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -442,6 +442,20 @@ SELECT (current_schemas(true))[1] = ('pg_temp_' || beid::text) AS match
 FROM pg_stat_get_backend_idset() beid
 WHERE pg_stat_get_backend_pid(beid) = pg_backend_pid();
 
+-- Test pg_stat_io for WAL in an init context, that should do writes
+-- and syncs.
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'init' and object = 'wal' \gset io_sum_wal_init_
+SELECT :io_sum_wal_init_writes > 0;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_init_fsyncs > 0;
+
+-- Test pg_stat_io for WAL in a normal context, that should do reads as well.
+SELECT SUM(reads) > 0
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal';
+
 -----
 -- Test that resetting stats works for reset timestamp
 -----
@@ -602,6 +616,7 @@ SELECT pg_stat_get_subscription_stats(NULL);
 -- - extends of relations using shared buffers
 -- - fsyncs done to ensure the durability of data dirtying shared buffers
 -- - shared buffer hits
+-- - WAL writes and fsyncs in IOContext IOCONTEXT_NORMAL
 
 -- There is no test for blocks evicted from shared buffers, because we cannot
 -- be sure of the state of shared buffers at the point the test is run.
@@ -621,6 +636,9 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
   FROM pg_stat_get_backend_io(pg_backend_pid())
   WHERE object = 'relation' \gset my_io_sum_shared_before_
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_before_
 CREATE TABLE test_io_shared(a int);
 INSERT INTO test_io_shared SELECT i FROM generate_series(1,100)i;
 SELECT pg_stat_force_next_flush();
@@ -649,6 +667,13 @@ SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
 SELECT :my_io_sum_shared_after_writes >= :my_io_sum_shared_before_writes;
 SELECT current_setting('fsync') = 'off'
   OR :my_io_sum_shared_after_fsyncs >= :my_io_sum_shared_before_fsyncs;
+SELECT sum(writes) AS writes, sum(fsyncs) AS fsyncs
+  FROM pg_stat_io
+  WHERE context = 'normal' and object = 'wal' \gset io_sum_wal_normal_after_
+SELECT current_setting('synchronous_commit') = 'on';
+SELECT :io_sum_wal_normal_after_writes > :io_sum_wal_normal_before_writes;
+SELECT current_setting('fsync') = 'off'
+  OR :io_sum_wal_normal_after_fsyncs > :io_sum_wal_normal_before_fsyncs;
 
 -- Change the tablespace so that the table is rewritten directly, then SELECT
 -- from it to cause it to be read back into shared buffers.
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4e917f159aa..edc2470bcf9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2581,9 +2581,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
   </para>
 
   <para>
-   Currently, I/O on relations (e.g. tables, indexes) is tracked. However,
-   relation I/O which bypasses shared buffers (e.g. when moving a table from one
-   tablespace to another) is currently not tracked.
+   Currently, I/O on relations (e.g. tables, indexes) and WAL activity are
+   tracked. However, relation I/O which bypasses shared buffers
+   (e.g. when moving a table from one tablespace to another) is currently
+   not tracked.
   </para>
 
   <table id="pg-stat-io-view" xreflabel="pg_stat_io">
@@ -2636,6 +2637,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>temp relation</literal>: Temporary relations.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>wal</literal>: Write Ahead Logs.
+         </para>
+        </listitem>
        </itemizedlist>
        </para>
       </entry>
@@ -2660,6 +2666,13 @@ description | Waiting for a newly initialized WAL file to reach durable storage
           <literal>normal</literal>.
          </para>
         </listitem>
+        <listitem>
+         <para>
+          <literal>init</literal>: I/O operations performed while creating the
+          WAL segments are tracked in <varname>context</varname>
+          <literal>init</literal>.
+         </para>
+        </listitem>
         <listitem>
          <para>
           <literal>vacuum</literal>: I/O operations performed outside of shared
-- 
2.47.2

From 5ef211b0e2b83f6b3103d7cf948870951f9d2ccd Mon Sep 17 00:00:00 2001
From: Nazir Bilal Yavuz <byavu...@gmail.com>
Date: Mon, 27 Jan 2025 14:05:51 +0300
Subject: [PATCH v12 2/2] Fetch timing columns from pg_stat_io in the
 pg_stat_wal view

In the prior commit WAL I/Os' timing stats are started to be tracked in
the pg_stat_io view. So, to avoid double accounting and to have same
timing values in both pg_stat_io and pg_stat_wal; fetch timing columns
from pg_stat_io in the pg_stat_wal view.
---
 src/include/catalog/pg_proc.dat         |  6 +++---
 src/include/pgstat.h                    |  4 ----
 src/backend/access/transam/xlog.c       | 23 ---------------------
 src/backend/catalog/system_views.sql    | 14 ++++++++++---
 src/backend/utils/activity/pgstat_wal.c |  2 --
 src/backend/utils/adt/pgstatfuncs.c     | 14 +++----------
 src/test/regress/expected/rules.out     | 27 ++++++++++++++++---------
 doc/src/sgml/monitoring.sgml            | 10 ++++++---
 8 files changed, 41 insertions(+), 59 deletions(-)

diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 5b8c2ad2a54..e8409b4667f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5944,9 +5944,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index a61b488e8d8..c2b722268ce 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -482,8 +482,6 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	PgStat_Counter wal_write_time;
-	PgStat_Counter wal_sync_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
@@ -498,8 +496,6 @@ typedef struct PgStat_PendingWalStats
 	PgStat_Counter wal_buffers_full;
 	PgStat_Counter wal_write;
 	PgStat_Counter wal_sync;
-	instr_time	wal_write_time;
-	instr_time	wal_sync_time;
 } PgStat_PendingWalStats;
 
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 14c4f62b6bb..4404e7cbac9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2446,18 +2446,6 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 
 				PendingWalStats.wal_write++;
 
-				/*
-				 * Increment the I/O timing and the number of times WAL data
-				 * were written out to disk.
-				 */
-				if (track_wal_io_timing)
-				{
-					instr_time	end;
-
-					INSTR_TIME_SET_CURRENT(end);
-					INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_write_time, end, io_start);
-				}
-
 				if (written <= 0)
 				{
 					char		xlogfname[MAXFNAMELEN];
@@ -8752,17 +8740,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 
 	pgstat_report_wait_end();
 
-	/*
-	 * Increment the I/O timing and the number of times WAL files were synced.
-	 */
-	if (track_wal_io_timing)
-	{
-		instr_time	end;
-
-		INSTR_TIME_SET_CURRENT(end);
-		INSTR_TIME_ACCUM_DIFF(PendingWalStats.wal_sync_time, end, io_start);
-	}
-
 	pgstat_count_io_op_time(IOOBJECT_WAL, IOCONTEXT_NORMAL, IOOP_FSYNC,
 							io_start, 1, 0);
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index cddc3ea9b53..f954673bdc9 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1184,6 +1184,13 @@ SELECT
 FROM pg_stat_get_io() b;
 
 CREATE VIEW pg_stat_wal AS
+    WITH pgsio_sum_write_fsync_time AS (
+        SELECT
+            sum(write_time) as sum_write_time,
+            sum(fsync_time) as sum_fsync_time
+        FROM pg_stat_io
+        WHERE context = 'normal' and object = 'wal'
+    )
     SELECT
         w.wal_records,
         w.wal_fpi,
@@ -1191,10 +1198,11 @@ CREATE VIEW pg_stat_wal AS
         w.wal_buffers_full,
         w.wal_write,
         w.wal_sync,
-        w.wal_write_time,
-        w.wal_sync_time,
+        p.sum_write_time as write_time,
+        p.sum_fsync_time as fsync_time,
         w.stats_reset
-    FROM pg_stat_get_wal() w;
+    FROM pg_stat_get_wal() w
+        CROSS JOIN pgsio_sum_write_fsync_time p;
 
 CREATE VIEW pg_stat_progress_analyze AS
     SELECT
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index 18fa6b2936a..8ee650cdc30 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -126,8 +126,6 @@ pgstat_wal_flush_cb(bool nowait)
 	WALSTAT_ACC(wal_buffers_full, PendingWalStats);
 	WALSTAT_ACC(wal_write, PendingWalStats);
 	WALSTAT_ACC(wal_sync, PendingWalStats);
-	WALSTAT_ACC_INSTR_TIME(wal_write_time);
-	WALSTAT_ACC_INSTR_TIME(wal_sync_time);
 #undef WALSTAT_ACC_INSTR_TIME
 #undef WALSTAT_ACC
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index e9096a88492..cab87c74871 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1637,7 +1637,7 @@ pg_stat_get_backend_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	7
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1658,11 +1658,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   INT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 6, "wal_sync",
 					   INT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "wal_write_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
-					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 7, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1685,11 +1681,7 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[4] = Int64GetDatum(wal_stats->wal_write);
 	values[5] = Int64GetDatum(wal_stats->wal_sync);
 
-	/* Convert counters from microsec to millisec for display */
-	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
-	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
-
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	values[6] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 3361f6a69c9..c9a98ee0037 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2253,16 +2253,23 @@ pg_stat_user_tables| SELECT relid,
     total_autoanalyze_time
    FROM pg_stat_all_tables
   WHERE ((schemaname <> ALL (ARRAY['pg_catalog'::name, 'information_schema'::name])) AND (schemaname !~ '^pg_toast'::text));
-pg_stat_wal| SELECT wal_records,
-    wal_fpi,
-    wal_bytes,
-    wal_buffers_full,
-    wal_write,
-    wal_sync,
-    wal_write_time,
-    wal_sync_time,
-    stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+pg_stat_wal| WITH pgsio_sum_write_fsync_time AS (
+         SELECT sum(pg_stat_io.write_time) AS sum_write_time,
+            sum(pg_stat_io.fsync_time) AS sum_fsync_time
+           FROM pg_stat_io
+          WHERE ((pg_stat_io.context = 'normal'::text) AND (pg_stat_io.object = 'wal'::text))
+        )
+ SELECT w.wal_records,
+    w.wal_fpi,
+    w.wal_bytes,
+    w.wal_buffers_full,
+    w.wal_write,
+    w.wal_sync,
+    p.sum_write_time AS write_time,
+    p.sum_fsync_time AS fsync_time,
+    w.stats_reset
+   FROM (pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, stats_reset)
+     CROSS JOIN pgsio_sum_write_fsync_time p);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index edc2470bcf9..a98a027b53c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5003,7 +5003,9 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <listitem>
          <para>
           <literal>io</literal>: Reset all the counters shown in the
-          <structname>pg_stat_io</structname> view.
+          <structname>pg_stat_io</structname> view. This resets the timing
+          counter stored in the <structname>pg_stat_wal</structname> view
+          too.
          </para>
         </listitem>
         <listitem>
@@ -5020,8 +5022,10 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         </listitem>
         <listitem>
          <para>
-          <literal>wal</literal>: Reset all the counters shown in the
-          <structname>pg_stat_wal</structname> view.
+          <literal>wal</literal>: Reset all the counters except timings shown
+          in the <structname>pg_stat_wal</structname> view. The timing
+          counters can be reset by calling pg_stat_reset_shared set to
+          <literal>io</literal>.
          </para>
         </listitem>
         <listitem>
-- 
2.47.2

Reply via email to