Thanks for the review!

On Wed, Apr 26, 2023 at 10:22 PM Kyotaro Horiguchi
<horikyota....@gmail.com> wrote:
>
> At Wed, 26 Apr 2023 17:08:14 -0400, Melanie Plageman 
> <melanieplage...@gmail.com> wrote in
> > On Mon, Apr 24, 2023 at 9:29 PM Melanie Plageman
> > <melanieplage...@gmail.com> wrote:
> > > I've yet to cook up a client backend test case (e.g. with COPY). I've 
> > > taken
> > > that as a todo.
> >
> > It was trivial to see client backend writebacks in almost any scenario
> > once I set backend_flush_after. I wonder if it is worth mentioning the
> > various "*flush_after" gucs in the docs?
> >
> > > I have a few outstanding questions:
> > >
> > > 1) Does it make sense for writebacks to count the number of blocks for
> > > which writeback was requested or the number of calls to smgrwriteback() or
> > > the number of actual syscalls made? We don't actually know from outside
> > > of mdwriteback() how many FileWriteback() calls we will make.
> >
> > So, in the attached v3, I've kept the first method: writebacks are the
> > number of blocks which the backend has requested writeback of. I'd like
> > it to be clear in the docs exactly what writebacks are (so that people
> > know not to add them together with writes or something like that). I
> > made an effort but could use further docs review.
>
> +        Number of units of size <varname>op_bytes</varname> which the backend
> +        requested the kernel write out to permanent storage.
>
> I just want to mention that it is not necessarily the same as the
> number of system calls, but I'm not sure what others think about that.
My thinking is that some other IO operations, for example, extends,
count the number of blocks extended and not the number of syscalls.

> +        Time spent in writeback operations in milliseconds (if
> +        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero). 
> This
> +        does not necessarily count the time spent by the kernel writing the
> +        data out. The backend will initiate write-out of the dirty pages and
> +        wait only if the request queue is full.
>
> The last sentence looks like it's taken from the sync_file_range() man
> page, but I think it's a bit too detailed. We could just say, "The
> time usually only includes the time it takes to queue write-out
> requests.", bit I'm not sure wh...

Ah, yes, I indeed took heavy inspiration from the sync_file_range()
man page :) I've modified this comment in the attached v4. I didn't want
to say "usually" since I imagine it is quite workload and configuration
dependent.

> > > 2) I'm a little nervous about not including IOObject in the writeback
> > > context. Technically, there is nothing stopping local buffer code from
> > > calling IssuePendingWritebacks(). Right now, local buffer code doesn't
> > > do ScheduleBufferTagForWriteback(). But it doesn't seem quite right to
> > > hardcode in IOOBJECT_RELATION when there is nothing wrong with
> > > requesting writeback of local buffers (AFAIK). What do you think?
> >
> > I've gone ahead and added IOObject to the WritebackContext.
>
> The smgropen call in IssuePendingWritebacks below clearly shows that
> the function only deals with shared buffers.
>
> >               /* and finally tell the kernel to write the data to storage */
> >               reln = smgropen(currlocator, InvalidBackendId);
> >               smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, 
> > nblocks);

Yes, as it is currently, IssuePendingWritebacks() is only used for shared
buffers. My rationale for including IOObject is that localbuf.c calls
smgr* functions and there isn't anything stopping it from calling
smgrwriteback() or using WritebackContexts (AFAICT).

> The callback-related code fully depends on callers following its
> expectation. So we can rewrite the following comment added to
> InitBufferPoll with a more confident tone.
>
> +        * Initialize per-backend file flush context. IOObject is initialized 
> to
> +        * IOOBJECT_RELATION and IOContext to IOCONTEXT_NORMAL since these 
> are the
> +        * most likely targets for writeback. The backend can overwrite these 
> as
> +        * appropriate.

I have updated this comment to be more confident and specific.

> Or I actually think we might not even need to pass around the io_*
> parameters and could just pass immediate values to the
> pgstat_count_io_op_time call. If we ever start using shared buffers
> for thing other than relation files (for example SLRU?), we'll have to
> consider the target individually for each buffer block. That being
> said, I'm fine with how it is either.

In IssuePendingWritebacks() we don't actually know which IOContext we
are issuing writebacks for when we call pgstat_count_io_op_time() (we do
issue pending writebacks for other IOContexts than IOCONTEXT_NORMAL). I
agree IOObject is not strictly necessary right now. I've kept IOObject a
member of WritebackContext for the reasons I mention above, however, I
am open to removing it if it adds confusion.

- Melanie
From 7cdd6fc78ed82180a705ab9667714f80d08c5f7d Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Mon, 24 Apr 2023 18:21:54 -0400
Subject: [PATCH v4] Add writeback to pg_stat_io

28e626bde00 added the notion of IOOps but neglected to include
writeback. With the addition of IO timing to pg_stat_io in ac8d53dae5,
the omission of writeback caused some confusion. Checkpointer write
timing in pg_stat_io often differed greatly from the write timing
written to the log. To fix this, add IOOp IOOP_WRITEBACK and track
writebacks and writeback timing in pg_stat_io.

Reviewed-by: Kyotaro Horiguchi <horikyota....@gmail.com>
Reported-by: Andres Freund <and...@anarazel.de>
Discussion: https://postgr.es/m/20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de
---
 doc/src/sgml/monitoring.sgml          | 26 ++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql  |  2 ++
 src/backend/postmaster/bgwriter.c     | 10 ++++++++--
 src/backend/storage/buffer/buf_init.c | 12 +++++++++---
 src/backend/storage/buffer/bufmgr.c   | 20 ++++++++++++++++++--
 src/backend/utils/adt/pgstatfuncs.c   |  5 +++++
 src/include/catalog/pg_proc.dat       |  6 +++---
 src/include/pgstat.h                  |  3 ++-
 src/include/storage/buf_internals.h   |  6 +++++-
 src/test/regress/expected/rules.out   |  4 +++-
 src/test/regress/expected/stats.out   |  4 ++--
 src/test/regress/sql/stats.sql        |  4 ++--
 12 files changed, 85 insertions(+), 17 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 99f7f95c39..27b6f1a0a0 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3867,6 +3867,32 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writebacks</structfield> <type>bigint</type>
+       </para>
+       <para>
+        Number of units of size <varname>op_bytes</varname> which the backend
+        requested the kernel write out to permanent storage.
+       </para>
+      </entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>writeback_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in writeback operations in milliseconds (if
+        <xref linkend="guc-track-io-timing"/> is enabled, otherwise zero). This
+        includes the time spent queueing write-out requests and, potentially,
+        the time spent to write out the dirty data.
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 48aacf66ee..d0c932ad0e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1131,6 +1131,8 @@ SELECT
        b.read_time,
        b.writes,
        b.write_time,
+       b.writebacks,
+       b.writeback_time,
        b.extends,
        b.extend_time,
        b.op_bytes,
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index caad642ec9..dbf866f262 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -129,7 +129,12 @@ BackgroundWriterMain(void)
 											 ALLOCSET_DEFAULT_SIZES);
 	MemoryContextSwitchTo(bgwriter_context);
 
-	WritebackContextInit(&wb_context, &bgwriter_flush_after);
+	/*
+	 * bgwriter will only request writeback for permanent relations in
+	 * IOCONTEXT_NORMAL.
+	 */
+	WritebackContextInit(&wb_context, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+						 &bgwriter_flush_after);
 
 	/*
 	 * If an exception is encountered, processing resumes here.
@@ -185,7 +190,8 @@ BackgroundWriterMain(void)
 		MemoryContextResetAndDeleteChildren(bgwriter_context);
 
 		/* re-initialize to avoid repeated errors causing problems */
-		WritebackContextInit(&wb_context, &bgwriter_flush_after);
+		WritebackContextInit(&wb_context, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+							 &bgwriter_flush_after);
 
 		/* Now we can allow interrupts again */
 		RESUME_INTERRUPTS();
diff --git a/src/backend/storage/buffer/buf_init.c b/src/backend/storage/buffer/buf_init.c
index 0057443f0c..a7182fe95a 100644
--- a/src/backend/storage/buffer/buf_init.c
+++ b/src/backend/storage/buffer/buf_init.c
@@ -145,9 +145,15 @@ InitBufferPool(void)
 	/* Init other shared buffer-management stuff */
 	StrategyInitialize(!foundDescs);
 
-	/* Initialize per-backend file flush context */
-	WritebackContextInit(&BackendWritebackContext,
-						 &backend_flush_after);
+	/*
+	 * Initialize per-backend file flush context. IOContext is initialized to
+	 * IOCONTEXT_NORMAL because this is the most common context. IOObject is
+	 * initialized to IOOBJECT_RELATION because writeback is currently only
+	 * requested for permanent relations in shared buffers. The backend can
+	 * overwrite these as appropriate.
+	 */
+	WritebackContextInit(&BackendWritebackContext, IOOBJECT_RELATION,
+						 IOCONTEXT_NORMAL, &backend_flush_after);
 }
 
 /*
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..116910cdfe 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1685,6 +1685,8 @@ again:
 		FlushBuffer(buf_hdr, NULL, IOOBJECT_RELATION, io_context);
 		LWLockRelease(content_lock);
 
+		BackendWritebackContext.io_object = IOOBJECT_RELATION;
+		BackendWritebackContext.io_context = io_context;
 		ScheduleBufferTagForWriteback(&BackendWritebackContext,
 									  &buf_hdr->tag);
 	}
@@ -2555,7 +2557,12 @@ BufferSync(int flags)
 	if (num_to_scan == 0)
 		return;					/* nothing to do */
 
-	WritebackContextInit(&wb_context, &checkpoint_flush_after);
+	/*
+	 * Checkpointer will only request writeback of permanent relations in
+	 * IOCONTEXT_NORMAL.
+	 */
+	WritebackContextInit(&wb_context, IOOBJECT_RELATION, IOCONTEXT_NORMAL,
+						 &checkpoint_flush_after);
 
 	TRACE_POSTGRESQL_BUFFER_SYNC_START(NBuffers, num_to_scan);
 
@@ -5433,10 +5440,13 @@ ts_ckpt_progress_comparator(Datum a, Datum b, void *arg)
  * writeback control will be performed.
  */
 void
-WritebackContextInit(WritebackContext *context, int *max_pending)
+WritebackContextInit(WritebackContext *context, IOObject io_object,
+					 IOContext io_context, int *max_pending)
 {
 	Assert(*max_pending <= WRITEBACK_MAX_PENDING_FLUSHES);
 
+	context->io_object = io_object;
+	context->io_context = io_context;
 	context->max_pending = max_pending;
 	context->nr_pending = 0;
 }
@@ -5491,6 +5501,7 @@ ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag)
 void
 IssuePendingWritebacks(WritebackContext *context)
 {
+	instr_time	io_start;
 	int			i;
 
 	if (context->nr_pending == 0)
@@ -5502,6 +5513,8 @@ IssuePendingWritebacks(WritebackContext *context)
 	 */
 	sort_pending_writebacks(context->pending_writebacks, context->nr_pending);
 
+	io_start = pgstat_prepare_io_time();
+
 	/*
 	 * Coalesce neighbouring writes, but nothing else. For that we iterate
 	 * through the, now sorted, array of pending flushes, and look forward to
@@ -5555,6 +5568,9 @@ IssuePendingWritebacks(WritebackContext *context)
 		smgrwriteback(reln, BufTagGetForkNum(&tag), tag.blockNum, nblocks);
 	}
 
+	pgstat_count_io_op_time(context->io_object, context->io_context,
+							IOOP_WRITEBACK, io_start, context->nr_pending);
+
 	context->nr_pending = 0;
 }
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 0d57b6a7c0..70da0a2de1 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1268,6 +1268,8 @@ typedef enum io_stat_col
 	IO_COL_READ_TIME,
 	IO_COL_WRITES,
 	IO_COL_WRITE_TIME,
+	IO_COL_WRITEBACKS,
+	IO_COL_WRITEBACK_TIME,
 	IO_COL_EXTENDS,
 	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
@@ -1303,6 +1305,8 @@ pgstat_get_io_op_index(IOOp io_op)
 			return IO_COL_REUSES;
 		case IOOP_WRITE:
 			return IO_COL_WRITES;
+		case IOOP_WRITEBACK:
+			return IO_COL_WRITEBACKS;
 	}
 
 	elog(ERROR, "unrecognized IOOp value: %d", io_op);
@@ -1322,6 +1326,7 @@ pgstat_get_io_time_index(IOOp io_op)
 	{
 		case IOOP_READ:
 		case IOOP_WRITE:
+		case IOOP_WRITEBACK:
 		case IOOP_EXTEND:
 		case IOOP_FSYNC:
 			return pgstat_get_io_op_index(io_op) + 1;
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index b2bc81b15f..e766b9b600 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5760,9 +5760,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,object,context,reads,read_time,writes,write_time,writebacks,writeback_time,extends,extend_time,op_bytes,hits,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff4ad5b6d..57a2c0866a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -300,9 +300,10 @@ typedef enum IOOp
 	IOOP_READ,
 	IOOP_REUSE,
 	IOOP_WRITE,
+	IOOP_WRITEBACK,
 } IOOp;
 
-#define IOOP_NUM_TYPES (IOOP_WRITE + 1)
+#define IOOP_NUM_TYPES (IOOP_WRITEBACK + 1)
 
 typedef struct PgStat_BktypeIO
 {
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 136cf8fbaf..0921710605 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -295,6 +295,9 @@ typedef struct PendingWriteback
 /* struct forward declared in bufmgr.h */
 typedef struct WritebackContext
 {
+	IOObject	io_object;
+	IOContext	io_context;
+
 	/* pointer to the max number of writeback requests to coalesce */
 	int		   *max_pending;
 
@@ -387,7 +390,8 @@ extern PGDLLIMPORT CkptSortItem *CkptBufferIds;
  * Internal buffer management routines
  */
 /* bufmgr.c */
-extern void WritebackContextInit(WritebackContext *context, int *max_pending);
+extern void WritebackContextInit(WritebackContext *context, IOObject io_object,
+								 IOContext io_context, int *max_pending);
 extern void IssuePendingWritebacks(WritebackContext *context);
 extern void ScheduleBufferTagForWriteback(WritebackContext *context, BufferTag *tag);
 
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 69957687fe..0bd6db2bf8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1887,6 +1887,8 @@ pg_stat_io| SELECT backend_type,
     read_time,
     writes,
     write_time,
+    writebacks,
+    writeback_time,
     extends,
     extend_time,
     op_bytes,
@@ -1896,7 +1898,7 @@ pg_stat_io| SELECT backend_type,
     fsyncs,
     fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, object, context, reads, read_time, writes, write_time, writebacks, writeback_time, extends, extend_time, op_bytes, hits, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
diff --git a/src/test/regress/expected/stats.out b/src/test/regress/expected/stats.out
index 813d6d39ea..26ff27911a 100644
--- a/src/test/regress/expected/stats.out
+++ b/src/test/regress/expected/stats.out
@@ -1384,7 +1384,7 @@ SELECT pg_stat_have_stats('io', 0, 0);
  t
 (1 row)
 
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_pre_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
   FROM pg_stat_io \gset
 SELECT pg_stat_reset_shared('io');
  pg_stat_reset_shared 
@@ -1392,7 +1392,7 @@ SELECT pg_stat_reset_shared('io');
  
 (1 row)
 
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_post_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
   FROM pg_stat_io \gset
 SELECT :io_stats_post_reset < :io_stats_pre_reset;
  ?column? 
diff --git a/src/test/regress/sql/stats.sql b/src/test/regress/sql/stats.sql
index 99a28bb79c..257e91d10d 100644
--- a/src/test/regress/sql/stats.sql
+++ b/src/test/regress/sql/stats.sql
@@ -694,10 +694,10 @@ SELECT :io_sum_bulkwrite_strategy_extends_after > :io_sum_bulkwrite_strategy_ext
 
 -- Test IO stats reset
 SELECT pg_stat_have_stats('io', 0, 0);
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_pre_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
   FROM pg_stat_io \gset
 SELECT pg_stat_reset_shared('io');
-SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(hits) AS io_stats_post_reset
+SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
   FROM pg_stat_io \gset
 SELECT :io_stats_post_reset < :io_stats_pre_reset;
 
-- 
2.37.2

Reply via email to