On Mon, Feb 24, 2025 at 01:08:03PM +0000, Bertrand Drouvot wrote:
> I agree that we've to put everything in the picture (system with or without
> cheap timing functions, lock contention and WAL flush disk time) and that we
> can certainly find configuration/workload that would get benefits from a
> dedicated track_wal_io_timing GUC.
> 
> PFA a patch to re-introduce the track_wal_io_timing GUC and to ensure that the
> WAL write and fsync timing activities are tracked in pg_stat_io (and
> pg_stat_get_backend_io()) only if both track_io_timing and track_wal_io_timing
> are enabled.
> 
> Note that to change the a051e71e28a behavior, the attached patch adds an extra
> "track_io_guc" parameter to pgstat_count_io_op_time() (like 
> pgstat_prepare_io_time
> already had in a051e71e28a).

+      bool      track_timing = track_io_timing && track_wal_io_timing;
-      start = pgstat_prepare_io_time();
+      start = pgstat_prepare_io_time(track_timing);

This does not represent exactly what I am understanding from the
comment of upthread, because WAL IO timings would require both
track_wal_io_timing and track_io_timing to be enabled with this patch.
It seems to me that what we should do is to decouple completely the
timings for WAL and non-WAL IO across the two GUCs track_wal_io_timing
and track_io_timing, without any dependency between one and the other.
This way, it is possible to only enable one of them without affecting
the paths of the other, or both if you are ready to pay the price.

If you take that into account, XLogWrite(), XLogFileInitInternal(),
issue_xlog_fsync() should trigger clock calculations only for
track_wal_io_timing.  The write and fsyncs are the hot ones in
standalone servers.

Two new things tracked in pg_stat_io are WALRead() and XLogPageRead(),
which are used at recovery, and for consistency with the rest there is
a good argument for controlling these as well with
track_wal_io_timing, I guess.

 void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
-                        instr_time start_time, uint32 cnt, uint64 bytes)
+                        instr_time start_time, uint32 cnt, uint64 bytes,
+                        bool track_io_guc)

Not much a fan of the new argument to pass the GUC value, which could
be error prone.  It would be simpler to check that start_time is 0
instead.  There is no need to change the other callers of
pgstat_count_io_op_time() if we do that.

pgstat_count_backend_io_op_time() would have triggered an assertion
failure, it needs to be adjusted.

With more tweaks applied to the docs, the attached is my result.
--
Michael
From 196f84b8bd8f817b6e21e65de90464d32a27b286 Mon Sep 17 00:00:00 2001
From: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Date: Mon, 24 Feb 2025 11:29:35 +0000
Subject: [PATCH v2] re-introduce track_wal_io_timing

a051e71e28a added the "timing tracking" in the WAL code path with "only"
track_io_timing enabled (and track_wal_io_timing still false). Then it made
sense for 2421e9a51d2 to get rid of the track_wal_io_timing GUC.

Concerns have been raised about the fact that a051e71e28a added the "timing tracking"
in the WAL code path with "only" track_io_timing enabled. So this commit re-introduces
the track_wal_io_timing GUC and ensure that WAL timing is tracked in pg_stat_io
only if track_io_timing and track_wal_io_timing are both enabled.
---
 src/include/access/xlog.h                     |  1 +
 src/include/pgstat.h                          |  2 +-
 src/backend/access/transam/xlog.c             |  9 +++---
 src/backend/access/transam/xlogreader.c       |  2 +-
 src/backend/access/transam/xlogrecovery.c     |  2 +-
 src/backend/storage/buffer/bufmgr.c           | 10 +++---
 src/backend/storage/buffer/localbuf.c         |  4 +--
 src/backend/storage/smgr/md.c                 |  4 +--
 src/backend/utils/activity/pgstat_backend.c   |  3 +-
 src/backend/utils/activity/pgstat_io.c        |  9 +++---
 src/backend/utils/misc/guc_tables.c           |  9 ++++++
 src/backend/utils/misc/postgresql.conf.sample |  1 +
 doc/src/sgml/config.sgml                      | 25 +++++++++++++++
 doc/src/sgml/monitoring.sgml                  | 31 ++++++++++++++++---
 doc/src/sgml/wal.sgml                         |  4 +--
 15 files changed, 88 insertions(+), 28 deletions(-)

diff --git src/include/access/xlog.h src/include/access/xlog.h
index 0a50d160bba8..d313099c027f 100644
--- src/include/access/xlog.h
+++ src/include/access/xlog.h
@@ -54,6 +54,7 @@ extern PGDLLIMPORT char *wal_consistency_checking_string;
 extern PGDLLIMPORT bool log_checkpoints;
 extern PGDLLIMPORT int CommitDelay;
 extern PGDLLIMPORT int CommitSiblings;
+extern PGDLLIMPORT bool track_wal_io_timing;
 extern PGDLLIMPORT int wal_decode_buffer_size;
 
 extern PGDLLIMPORT int CheckPointSegments;
diff --git src/include/pgstat.h src/include/pgstat.h
index fc651d03cf97..e64c527265b9 100644
--- src/include/pgstat.h
+++ src/include/pgstat.h
@@ -564,7 +564,7 @@ extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context,
 							   IOOp io_op, uint32 cnt, uint64 bytes);
-extern instr_time pgstat_prepare_io_time(void);
+extern instr_time pgstat_prepare_io_time(bool track_io_guc);
 extern void pgstat_count_io_op_time(IOObject io_object, IOContext io_context,
 									IOOp io_op, instr_time start_time,
 									uint32 cnt, uint64 bytes);
diff --git src/backend/access/transam/xlog.c src/backend/access/transam/xlog.c
index 919314f82587..799fc739e18c 100644
--- src/backend/access/transam/xlog.c
+++ src/backend/access/transam/xlog.c
@@ -134,6 +134,7 @@ int			CommitSiblings = 5; /* # concurrent xacts needed to sleep */
 int			wal_retrieve_retry_interval = 5000;
 int			max_slot_wal_keep_size_mb = -1;
 int			wal_decode_buffer_size = 512 * 1024;
+bool		track_wal_io_timing = false;
 
 #ifdef WAL_DEBUG
 bool		XLOG_DEBUG = false;
@@ -2437,7 +2438,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 				/*
 				 * Measure I/O timing to write WAL data, for pg_stat_io.
 				 */
-				start = pgstat_prepare_io_time();
+				start = pgstat_prepare_io_time(track_wal_io_timing);
 
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
@@ -3248,7 +3249,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
 	/* Measure I/O timing when initializing segment */
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
 	save_errno = 0;
@@ -3310,7 +3311,7 @@ XLogFileInitInternal(XLogSegNo logsegno, TimeLineID logtli,
 	}
 
 	/* Measure I/O timing when flushing segment */
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
 	if (pg_fsync(fd) != 0)
@@ -8744,7 +8745,7 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
 	/*
 	 * Measure I/O timing to sync the WAL file for pg_stat_io.
 	 */
-	start = pgstat_prepare_io_time();
+	start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
 	switch (wal_sync_method)
diff --git src/backend/access/transam/xlogreader.c src/backend/access/transam/xlogreader.c
index 995e4f7c26fd..5483d4f0dd2b 100644
--- src/backend/access/transam/xlogreader.c
+++ src/backend/access/transam/xlogreader.c
@@ -1557,7 +1557,7 @@ WALRead(XLogReaderState *state,
 
 #ifndef FRONTEND
 		/* Measure I/O timing when reading segment */
-		io_start = pgstat_prepare_io_time();
+		io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
diff --git src/backend/access/transam/xlogrecovery.c src/backend/access/transam/xlogrecovery.c
index 915cb330295c..c5475aaa97a1 100644
--- src/backend/access/transam/xlogrecovery.c
+++ src/backend/access/transam/xlogrecovery.c
@@ -3401,7 +3401,7 @@ retry:
 	readOff = targetPageOff;
 
 	/* Measure I/O timing when reading segment */
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_wal_io_timing);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
diff --git src/backend/storage/buffer/bufmgr.c src/backend/storage/buffer/bufmgr.c
index d3216ef6ba75..75cfc2b6fe96 100644
--- src/backend/storage/buffer/bufmgr.c
+++ src/backend/storage/buffer/bufmgr.c
@@ -1509,7 +1509,7 @@ WaitReadBuffers(ReadBuffersOperation *operation)
 			io_pages[io_buffers_len++] = BufferGetBlock(buffers[i]);
 		}
 
-		io_start = pgstat_prepare_io_time();
+		io_start = pgstat_prepare_io_time(track_io_timing);
 		smgrreadv(operation->smgr, forknum, io_first_block, io_pages, io_buffers_len);
 		pgstat_count_io_op_time(io_object, io_context, IOOP_READ, io_start,
 								1, io_buffers_len * BLCKSZ);
@@ -2401,7 +2401,7 @@ ExtendBufferedRelShared(BufferManagerRelation bmr,
 		}
 	}
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/*
 	 * Note: if smgrzeroextend fails, we will end up with buffers that are
@@ -3858,7 +3858,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -4460,7 +4460,7 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-				io_start = pgstat_prepare_io_time();
+				io_start = pgstat_prepare_io_time(track_io_timing);
 
 				smgrwrite(srel,
 						  BufTagGetForkNum(&bufHdr->tag),
@@ -5917,7 +5917,7 @@ IssuePendingWritebacks(WritebackContext *wb_context, IOContext io_context)
 	sort_pending_writebacks(wb_context->pending_writebacks,
 							wb_context->nr_pending);
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
diff --git src/backend/storage/buffer/localbuf.c src/backend/storage/buffer/localbuf.c
index 78b65fb160ad..64931efaa756 100644
--- src/backend/storage/buffer/localbuf.c
+++ src/backend/storage/buffer/localbuf.c
@@ -244,7 +244,7 @@ GetLocalVictimBuffer(void)
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
-		io_start = pgstat_prepare_io_time();
+		io_start = pgstat_prepare_io_time(track_io_timing);
 
 		/* And write... */
 		smgrwrite(oreln,
@@ -414,7 +414,7 @@ ExtendBufferedRelLocal(BufferManagerRelation bmr,
 		}
 	}
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/* actually extend relation */
 	smgrzeroextend(bmr.smgr, fork, first_block, extend_by, false);
diff --git src/backend/storage/smgr/md.c src/backend/storage/smgr/md.c
index 298754d1b64a..7bf0b45e2c34 100644
--- src/backend/storage/smgr/md.c
+++ src/backend/storage/smgr/md.c
@@ -1381,7 +1381,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		ereport(DEBUG1,
 				(errmsg_internal("could not forward fsync request because request queue is full")));
 
-		io_start = pgstat_prepare_io_time();
+		io_start = pgstat_prepare_io_time(track_io_timing);
 
 		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
 			ereport(data_sync_elevel(ERROR),
@@ -1786,7 +1786,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
-	io_start = pgstat_prepare_io_time();
+	io_start = pgstat_prepare_io_time(track_io_timing);
 
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
diff --git src/backend/utils/activity/pgstat_backend.c src/backend/utils/activity/pgstat_backend.c
index 4a667e7019cc..338da73a9a91 100644
--- src/backend/utils/activity/pgstat_backend.c
+++ src/backend/utils/activity/pgstat_backend.c
@@ -24,6 +24,7 @@
 
 #include "postgres.h"
 
+#include "access/xlog.h"
 #include "storage/bufmgr.h"
 #include "utils/memutils.h"
 #include "utils/pgstat_internal.h"
@@ -43,7 +44,7 @@ void
 pgstat_count_backend_io_op_time(IOObject io_object, IOContext io_context,
 								IOOp io_op, instr_time io_time)
 {
-	Assert(track_io_timing);
+	Assert(track_io_timing || track_wal_io_timing);
 
 	if (!pgstat_tracks_backend_bktype(MyBackendType))
 		return;
diff --git src/backend/utils/activity/pgstat_io.c src/backend/utils/activity/pgstat_io.c
index e7362b52a37b..edb489f3a864 100644
--- src/backend/utils/activity/pgstat_io.c
+++ src/backend/utils/activity/pgstat_io.c
@@ -83,14 +83,15 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op,
 }
 
 /*
- * Initialize the internal timing for an IO operation.
+ * Initialize the internal timing for an IO operation, depending on an
+ * IO timing GUC.
  */
 instr_time
-pgstat_prepare_io_time(void)
+pgstat_prepare_io_time(bool track_io_guc)
 {
 	instr_time	io_start;
 
-	if (track_io_timing)
+	if (track_io_guc)
 		INSTR_TIME_SET_CURRENT(io_start);
 	else
 	{
@@ -119,7 +120,7 @@ void
 pgstat_count_io_op_time(IOObject io_object, IOContext io_context, IOOp io_op,
 						instr_time start_time, uint32 cnt, uint64 bytes)
 {
-	if (track_io_timing)
+	if (!INSTR_TIME_IS_ZERO(start_time))
 	{
 		instr_time	io_time;
 
diff --git src/backend/utils/misc/guc_tables.c src/backend/utils/misc/guc_tables.c
index 690bf96ef030..ad25cbb39c55 100644
--- src/backend/utils/misc/guc_tables.c
+++ src/backend/utils/misc/guc_tables.c
@@ -1501,6 +1501,15 @@ struct config_bool ConfigureNamesBool[] =
 		false,
 		NULL, NULL, NULL
 	},
+	{
+		{"track_wal_io_timing", PGC_SUSET, STATS_CUMULATIVE,
+			gettext_noop("Collects timing statistics for WAL I/O activity."),
+			NULL
+		},
+		&track_wal_io_timing,
+		false,
+		NULL, NULL, NULL
+	},
 
 	{
 		{"update_process_title", PGC_SUSET, PROCESS_TITLE,
diff --git src/backend/utils/misc/postgresql.conf.sample src/backend/utils/misc/postgresql.conf.sample
index e771d87da1fb..5362ff805195 100644
--- src/backend/utils/misc/postgresql.conf.sample
+++ src/backend/utils/misc/postgresql.conf.sample
@@ -641,6 +641,7 @@
 #track_counts = on
 #track_cost_delay_timing = off
 #track_io_timing = off
+#track_wal_io_timing = off
 #track_functions = none			# none, pl, all
 #stats_fetch_consistency = cache	# cache, none, snapshot
 
diff --git doc/src/sgml/config.sgml doc/src/sgml/config.sgml
index a83545761085..6186648ccaf5 100644
--- doc/src/sgml/config.sgml
+++ doc/src/sgml/config.sgml
@@ -8393,6 +8393,31 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-track-wal-io-timing" xreflabel="track_wal_io_timing">
+      <term><varname>track_wal_io_timing</varname> (<type>boolean</type>)
+      <indexterm>
+       <primary><varname>track_wal_io_timing</varname> configuration parameter</primary>
+      </indexterm>
+      </term>
+      <listitem>
+       <para>
+        Enables timing of WAL I/O calls. This parameter is off by default,
+        as it will repeatedly query the operating system for the current time,
+        which may cause significant overhead on some platforms.
+        You can use the <application>pg_test_timing</application> tool to
+        measure the overhead of timing on your system.
+        I/O timing information is displayed in
+        <link linkend="monitoring-pg-stat-io-view">
+        <structname>pg_stat_io</structname></link> for the
+        <varname>object</varname> <literal>wal</literal> and in the output of
+        the <link linkend="pg-stat-get-backend-io">
+        <function>pg_stat_get_backend_io()</function></link> function.
+        Only superusers and users with the appropriate <literal>SET</literal>
+        privilege can change this setting.
+       </para>
+      </listitem>
+     </varlistentry>
+
      <varlistentry id="guc-track-functions" xreflabel="track_functions">
       <term><varname>track_functions</varname> (<type>enum</type>)
       <indexterm>
diff --git doc/src/sgml/monitoring.sgml doc/src/sgml/monitoring.sgml
index 3dfd059b7ee8..9178f1d34efd 100644
--- doc/src/sgml/monitoring.sgml
+++ doc/src/sgml/monitoring.sgml
@@ -183,6 +183,11 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    of block read, write, extend, and fsync times.
   </para>
 
+  <para>
+   The parameter <xref linkend="guc-track-wal-io-timing"/> enables monitoring
+   of WAL read, write and fsync times.
+  </para>
+
   <para>
    Normally these parameters are set in <filename>postgresql.conf</filename> so
    that they apply to all server processes, but it is possible to turn
@@ -2723,7 +2728,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
        </para>
        <para>
         Time spent in read operations in milliseconds (if
-        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+        <xref linkend="guc-track-io-timing"/> is enabled and
+        <varname>object</varname> is not <literal>wal</literal>,
+        or if <xref linkend="guc-track-wal-io-timing"/> is enabled
+        and <varname>object</varname> is <literal>wal</literal>,
+        otherwise zero)
        </para>
       </entry>
      </row>
@@ -2757,7 +2766,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
        </para>
        <para>
         Time spent in write operations in milliseconds (if
-        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+        <xref linkend="guc-track-io-timing"/> is enabled and
+        <varname>object</varname> is not <literal>wal</literal>,
+        or if <xref linkend="guc-track-wal-io-timing"/> is enabled
+        and <varname>object</varname> is <literal>wal</literal>,
+        otherwise zero)
        </para>
       </entry>
      </row>
@@ -2816,8 +2829,12 @@ description | Waiting for a newly initialized WAL file to reach durable storage
         <structfield>extend_time</structfield> <type>double precision</type>
        </para>
        <para>
-        Time spent in extend operations in milliseconds (if
-        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+        Time spent in extend operations in milliseconds. (if
+        <xref linkend="guc-track-io-timing"/> is enabled and
+        <varname>object</varname> is not <literal>wal</literal>,
+        or if <xref linkend="guc-track-wal-io-timing"/> is enabled
+        and <varname>object</varname> is <literal>wal</literal>,
+        otherwise zero)
        </para>
       </entry>
      </row>
@@ -2887,7 +2904,11 @@ description | Waiting for a newly initialized WAL file to reach durable storage
        </para>
        <para>
         Time spent in fsync operations in milliseconds (if
-        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+        <xref linkend="guc-track-io-timing"/> is enabled and
+        <varname>object</varname> is not <literal>wal</literal>,
+        or if <xref linkend="guc-track-wal-io-timing"/> is enabled
+        and <varname>object</varname> is <literal>wal</literal>,
+        otherwise zero)
        </para>
       </entry>
      </row>
diff --git doc/src/sgml/wal.sgml doc/src/sgml/wal.sgml
index 4fc09761115e..f3b86b26be90 100644
--- doc/src/sgml/wal.sgml
+++ doc/src/sgml/wal.sgml
@@ -808,7 +808,7 @@
   <para>
    There are two internal functions to write WAL data to disk:
    <function>XLogWrite</function> and <function>issue_xlog_fsync</function>.
-   When <xref linkend="guc-track-io-timing"/> is enabled, the total
+   When <xref linkend="guc-track-wal-io-timing"/> is enabled, the total
    amounts of time <function>XLogWrite</function> writes and
    <function>issue_xlog_fsync</function> syncs WAL data to disk are counted as
    <varname>write_time</varname> and <varname>fsync_time</varname> in
@@ -828,7 +828,7 @@
    <literal>fsync</literal>, or <literal>fsync_writethrough</literal>,
    the write operation moves WAL buffers to kernel cache and
    <function>issue_xlog_fsync</function> syncs them to disk. Regardless
-   of the setting of <varname>track_io_timing</varname>, the number
+   of the setting of <varname>track_wal_io_timing</varname>, the number
    of times <function>XLogWrite</function> writes and
    <function>issue_xlog_fsync</function> syncs WAL data to disk are also
    counted as <varname>writes</varname> and <varname>fsyncs</varname>
-- 
2.47.2

Attachment: signature.asc
Description: PGP signature

Reply via email to