Re: Track IO times in pg_stat_io

Melanie Plageman Thu, 09 Mar 2023 08:51:15 -0800

Hi, v4 attached addresses these review comments.

On Tue, Mar 7, 2023 at 1:39 PM Andres Freund <and...@anarazel.de> wrote:
> On 2023-03-06 11:30:13 -0500, Melanie Plageman wrote:
> > > As pgstat_bktype_io_stats_valid() is called only in Assert(), I think 
> > > that would be a good idea
> > > to also check that if counts are not Zero then times are not Zero.
> >
> > Yes, I think adding some validation around the relationship between
> > counts and timing should help prevent developers from forgetting to call
> > pg_stat_count_io_op() when calling pgstat_count_io_time() (as relevant).
> >
> > However, I think that we cannot check that if IO counts are non-zero
> > that IO times are non-zero, because the user may not have
> > track_io_timing enabled. We can check that if IO times are not zero, IO
> > counts are not zero. I've done this in the attached v3.
>
> And even if track_io_timing is enabled, the timer granularity might be so low
> that we *still* get zeroes.
>
> I wonder if we should get rid of pgStatBlockReadTime, pgStatBlockWriteTime,


And then have pg_stat_reset_shared('io') reset pg_stat_database IO
stats?

> > @@ -1000,11 +1000,27 @@ ReadBuffer_common(SMgrRelation smgr, char 
> > relpersistence, ForkNumber forkNum,
> >
> >       if (isExtend)
> >       {
> > +             instr_time      io_start,
> > +                                     io_time;
> > +
> >               /* new buffers are zero-filled */
> >               MemSet((char *) bufBlock, 0, BLCKSZ);
> > +
> > +             if (track_io_timing)
> > +                     INSTR_TIME_SET_CURRENT(io_start);
> > +             else
> > +                     INSTR_TIME_SET_ZERO(io_start);
> > +
>
> I wonder if there's an argument for tracking this in the existing IO stats as
> well. But I guess we've lived with this for a long time...

Not sure I want to include that in this patchset.

> > @@ -2981,16 +2998,16 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, 
> > IOObject io_object,
> >        * When a strategy is not in use, the write can only be a "regular" 
> > write
> >        * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
> >        */
> > -     pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> > -
> >       if (track_io_timing)
> >       {
> >               INSTR_TIME_SET_CURRENT(io_time);
> >               INSTR_TIME_SUBTRACT(io_time, io_start);
> >               
> > pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
> >               INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
> > +             pgstat_count_io_time(IOOBJECT_RELATION, io_context, 
> > IOOP_WRITE, io_time);
> >       }
>
> I think this needs a bit of cleanup - pgstat_count_buffer_write_time(),
> pgBufferUsage.blk_write_time++, pgstat_count_io_time() is a bit excessive. We
> might not be able to reduce the whole duplication at this point, but at least
> it should be a bit more centralized.

So, in the attached v4, I've introduced pgstat_io_start() and
pgstat_io_end(...). The end IO function takes the IOObject, IOOp, and
IOContext, in addition to the start_time, so that we know which
pgBufferUsage field to increment and which pgstat_count_buffer_*_time()
to call.

I will note that calling this function now causes pgBufferUsage and
pgStatBlock*Time to be incremented in a couple of places that they were
not before. I think those might have been accidental omissions, so I
think it is okay.

The exception is pgstat_count_write_time() being only called for
relations in shared buffers and not temporary relations while
pgstat_count_buffer_read_time() is called for temporary relations and
relations in shared buffers. I left that behavior as is, though it seems
like it is wrong.

I added pgstat_io_start() to pgstat.c -- not sure if it is best there.

I could separate it into a commit that does this refactoring of the
existing counting (without adding pgstat_count_io_time()) and then
another that adds pgstat_count_io_time(). I hesitated to do that until I
knew that the new functions were viable.

> > +     pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
> >       pgBufferUsage.shared_blks_written++;
> >
> >       /*
> > @@ -3594,6 +3611,9 @@ FlushRelationBuffers(Relation rel)
> >
> >       if (RelationUsesLocalBuffers(rel))
> >       {
> > +             instr_time      io_start,
> > +                                     io_time;
> > +
> >               for (i = 0; i < NLocBuffer; i++)
> >               {
> >                       uint32          buf_state;
> > @@ -3616,6 +3636,11 @@ FlushRelationBuffers(Relation rel)
> >
> >                               PageSetChecksumInplace(localpage, 
> > bufHdr->tag.blockNum);
> >
> > +                             if (track_io_timing)
> > +                                     INSTR_TIME_SET_CURRENT(io_start);
> > +                             else
> > +                                     INSTR_TIME_SET_ZERO(io_start);
> > +
> >                               smgrwrite(RelationGetSmgr(rel),
> >                                                 
> > BufTagGetForkNum(&bufHdr->tag),
> >                                                 bufHdr->tag.blockNum,
>
> I don't think you need the INSTR_TIME_SET_ZERO() in the body of the loop, to
> silence the compiler warnings you can do it one level up.

So, I didn't move it out because I am using pgstat_io_start() which does
set zero. However, I could eschew the pgstat_io_start() helper function
and just do what is in the function inline. Do you think the overhead of
set zero is worth it?

> > @@ -228,6 +230,11 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber 
> > forkNum, BlockNumber blockNum,
> >
> >               PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
> >
> > +             if (track_io_timing)
> > +                     INSTR_TIME_SET_CURRENT(io_start);
> > +             else
> > +                     INSTR_TIME_SET_ZERO(io_start);
> > +
> >               /* And write... */
> >               smgrwrite(oreln,
> >                                 BufTagGetForkNum(&bufHdr->tag),
> > @@ -239,6 +246,13 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber 
> > forkNum, BlockNumber blockNum,
> >               buf_state &= ~BM_DIRTY;
> >               pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
> >
> > +             if (track_io_timing)
> > +             {
> > +                     INSTR_TIME_SET_CURRENT(io_time);
> > +                     INSTR_TIME_SUBTRACT(io_time, io_start);
> > +                     pgstat_count_io_time(IOOBJECT_TEMP_RELATION, 
> > IOCONTEXT_NORMAL, IOOP_WRITE, io_time);
> > +             }
> > +
> >               pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, 
> > IOOP_WRITE);
> >               pgBufferUsage.local_blks_written++;
> >       }
>
> Perhaps we can instead introduce a FlushLocalBuffer()? Then we don't need this
> in multiple write paths.

FlushLocalBuffer() is a good idea. It would be nice to have it contain
more than just
  pgstat_io_start()
  smgrwrite()
  pgstat_io_end()
e.g. to have it include checksumming and marking dirty (more like
normal FlushBuffer()). I noticed that LocalBufferAlloc() does not set up
error traceback support for ereport and FlushRelationBuffers() does. Is
this intentional?

> > diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
> > index 352958e1fe..052875d86a 100644
> > --- a/src/backend/storage/smgr/md.c
> > +++ b/src/backend/storage/smgr/md.c
> > @@ -1030,6 +1030,30 @@ register_dirty_segment(SMgrRelation reln, ForkNumber 
> > forknum, MdfdVec *seg)
> >
> >       if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ 
> > ))
> >       {
> > +             instr_time      io_start,
> > +                                     io_time;
> > +
> > +             if (track_io_timing)
> > +                     INSTR_TIME_SET_CURRENT(io_start);
> > +             else
> > +                     INSTR_TIME_SET_ZERO(io_start);
> > +
> > +             ereport(DEBUG1,
> > +                             (errmsg_internal("could not forward fsync 
> > request because request queue is full")));
> > +
> > +             if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
> > +                     ereport(data_sync_elevel(ERROR),
> > +                                     (errcode_for_file_access(),
> > +                                      errmsg("could not fsync file \"%s\": 
> > %m",
> > +                                                     
> > FilePathName(seg->mdfd_vfd))));
> > +
> > +             if (track_io_timing)
> > +             {
> > +                     INSTR_TIME_SET_CURRENT(io_time);
> > +                     INSTR_TIME_SUBTRACT(io_time, io_start);
> > +                     pgstat_count_io_time(IOOBJECT_RELATION, 
> > IOCONTEXT_NORMAL, IOOP_FSYNC, io_time);
> > +             }
> > +
> >               /*
> >                * We have no way of knowing if the current IOContext is
> >                * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, 
> > VACUUM] at this
> > @@ -1042,15 +1066,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber 
> > forknum, MdfdVec *seg)
> >                * backend fsyncs.
> >                */
> >               pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, 
> > IOOP_FSYNC);
> > -
> > -             ereport(DEBUG1,
> > -                             (errmsg_internal("could not forward fsync 
> > request because request queue is full")));
> > -
> > -             if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
> > -                     ereport(data_sync_elevel(ERROR),
> > -                                     (errcode_for_file_access(),
> > -                                      errmsg("could not fsync file \"%s\": 
> > %m",
> > -                                                     
> > FilePathName(seg->mdfd_vfd))));
> >       }
> >  }
> >
> > @@ -1399,6 +1414,8 @@ int
> >  mdsyncfiletag(const FileTag *ftag, char *path)
> >  {
> >       SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
> > +     instr_time      io_start,
> > +                             io_time;
> >       File            file;
> >       bool            need_to_close;
> >       int                     result,
> > @@ -1425,10 +1442,22 @@ mdsyncfiletag(const FileTag *ftag, char *path)
> >               need_to_close = true;
> >       }
> >
> > +     if (track_io_timing)
> > +             INSTR_TIME_SET_CURRENT(io_start);
> > +     else
> > +             INSTR_TIME_SET_ZERO(io_start);
> > +
> >       /* Sync the file. */
> >       result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
> >       save_errno = errno;
> >
> > +     if (track_io_timing)
> > +     {
> > +             INSTR_TIME_SET_CURRENT(io_time);
> > +             INSTR_TIME_SUBTRACT(io_time, io_start);
> > +             pgstat_count_io_time(IOOBJECT_RELATION, IOCONTEXT_NORMAL, 
> > IOOP_FSYNC, io_time);
> > +     }
> > +
> >       if (need_to_close)
> >               FileClose(file);
>
> Perhaps we could have mdsyncfd(), used by both mdsyncfiletag() and
> register_dirty_segment()?

I agree it would be nice, but it seems like it would take a little bit
of work and might not be worth doing that in this patchset.

> > @@ -1359,20 +1378,31 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
> >
> >                               for (int io_op = 0; io_op < IOOP_NUM_TYPES; 
> > io_op++)
> >                               {
> > -                                     int                     col_idx = 
> > pgstat_get_io_op_index(io_op);
> > +                                     int                     i = 
> > pgstat_get_io_op_index(io_op);
> >
> >                                       /*
> >                                        * Some combinations of BackendType 
> > and IOOp, of IOContext
> >                                        * and IOOp, and of IOObject and IOOp 
> > are not tracked. Set
> >                                        * these cells in the view NULL.
> >                                        */
> > -                                     nulls[col_idx] = 
> > !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
> > +                                     if (pgstat_tracks_io_op(bktype, 
> > io_obj, io_context, io_op))
> > +                                             values[i] = 
> > Int64GetDatum(bktype_stats->counts[io_obj][io_context][io_op]);
> > +                                     else
> > +                                             nulls[i] = true;
> > +                             }
>
> These lines were already too long, and it's getting worse with this change.

I've started using local variables.

> >  typedef struct PgStat_BktypeIO
> >  {
> > -     PgStat_Counter 
> > data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > +     PgStat_Counter 
> > counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> > +     instr_time      
> > times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
> >  } PgStat_BktypeIO;
>
> Ah, you're going to hate me. We can't store instr_time on disk. There's
> another patch that gets substantial peformance gains by varying the frequency
> at which instr_time keeps track of time based on the CPU frequency...

What does that have to do with what we can store on disk?

If so, would it not be enough to do this when reading/writing the stats
file?

void
instr_time_deserialize(instr_time *dest, int64 *src, int length)
{
    for (size_t i = 0; i < length; i++)
    {
        INSTR_TIME_SET_ZERO(dest[i]);
        dest[i].ticks = src[i];
    }
}

void
instr_time_serialize(int64 *dest, instr_time *src, int length)
{
    for (size_t i = 0; i < length; i++)
        dest[i] = INSTR_TIME_GET_NANOSEC(src[i]);

}

> It also just doesn't have enough range to keep track of system wide
> time on a larger system. A single backend won't run for 293 years, but
> with a few thousand backends that's a whole different story.
>
> I think we need to accumulate in instr_time, but convert to floating point
> when flushing stats.

Hmmm. So, are you saying that we need to read from disk when we query
the view and add that to what is in shared memory? That we only store
the delta since the last restart in the instr_time array?

But, I don't see how that avoids the problem of backend total runtime
being 293 years. We would have to reset and write out the delta whenever
we thought the times would overflow.

But, maybe I am misunderstanding something.

- Melanie

From ace28d989df51079033af67bb86a756bfeac2ba3 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplage...@gmail.com>
Date: Mon, 6 Mar 2023 10:41:51 -0500
Subject: [PATCH v4] Track IO times in pg_stat_io

Add IO timing for reads, writes, extends, and fsyncs to pg_stat_io.

Reviewed-by: Bertrand Drouvot <bertranddrouvot...@gmail.com>
Reviewed-by: Andres Freund <and...@anarazel.de>
Discussion: https://www.postgresql.org/message-id/flat/CAAKRu_ay5iKmnbXZ3DsauViF3eMxu4m1oNnJXqV_HyqYeg55Ww%40mail.gmail.com
---
 doc/src/sgml/monitoring.sgml           | 59 ++++++++++++++++
 src/backend/catalog/system_views.sql   |  4 ++
 src/backend/storage/buffer/bufmgr.c    | 56 ++++++---------
 src/backend/storage/buffer/localbuf.c  |  6 +-
 src/backend/storage/smgr/md.c          | 27 +++++---
 src/backend/utils/activity/pgstat.c    | 77 ++++++++++++++++++++-
 src/backend/utils/activity/pgstat_io.c | 96 +++++++++++++++++++++-----
 src/backend/utils/adt/pgstatfuncs.c    | 41 +++++++++--
 src/include/catalog/pg_proc.dat        |  6 +-
 src/include/pgstat.h                   | 10 ++-
 src/test/regress/expected/rules.out    |  6 +-
 11 files changed, 311 insertions(+), 77 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 6249bb50d0..ad3667f258 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -3814,6 +3814,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>read_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in read operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3826,6 +3838,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>write_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in write operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3838,6 +3862,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>extend_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in extend operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3902,6 +3938,18 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
       </entry>
      </row>
 
+     <row>
+      <entry role="catalog_table_entry">
+       <para role="column_definition">
+        <structfield>fsync_time</structfield> <type>double precision</type>
+       </para>
+       <para>
+        Time spent in fsync operations in milliseconds (if <xref
+        linkend="guc-track-io-timing"/> is enabled, otherwise zero)
+       </para>
+      </entry>
+     </row>
+
      <row>
       <entry role="catalog_table_entry">
        <para role="column_definition">
@@ -3967,6 +4015,17 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
    </itemizedlist>
   </para>
 
+  <note>
+   <para>
+    Columns tracking I/O time will only be non-zero when <xref
+    linkend="guc-track-io-timing"/> is enabled. The user should be careful when
+    using these columns in combination with their corresponding operations to
+    ensure that <varname>track_io_timing</varname> was enabled for the entire
+    time since the last reset.
+   </para>
+  </note>
+
+
 
  </sect2>
 
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..39391bc2fc 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1123,12 +1123,16 @@ SELECT
        b.io_object,
        b.io_context,
        b.reads,
+       b.read_time,
        b.writes,
+       b.write_time,
        b.extends,
+       b.extend_time,
        b.op_bytes,
        b.evictions,
        b.reuses,
        b.fsyncs,
+       b.fsync_time,
        b.stats_reset
 FROM pg_stat_get_io() b;
 
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 0a05577b68..64857f1ff6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1000,12 +1000,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	if (isExtend)
 	{
+		instr_time	io_start;
+
 		/* new buffers are zero-filled */
 		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		io_start = pgstat_io_start();
+
 		/* don't set checksum for all-zero page */
 		smgrextend(smgr, forkNum, blockNum, bufBlock, false);
 
-		pgstat_count_io_op(io_object, io_context, IOOP_EXTEND);
+		pgstat_io_end(io_start, io_object, io_context, IOOP_EXTEND);
 
 		/*
 		 * NB: we're *not* doing a ScheduleBufferTagForWriteback here;
@@ -1024,25 +1029,13 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
-			instr_time	io_start,
-						io_time;
+			instr_time	io_start;
 
-			if (track_io_timing)
-				INSTR_TIME_SET_CURRENT(io_start);
-			else
-				INSTR_TIME_SET_ZERO(io_start);
+			io_start = pgstat_io_start();
 
 			smgrread(smgr, forkNum, blockNum, bufBlock);
 
-			pgstat_count_io_op(io_object, io_context, IOOP_READ);
-
-			if (track_io_timing)
-			{
-				INSTR_TIME_SET_CURRENT(io_time);
-				INSTR_TIME_SUBTRACT(io_time, io_start);
-				pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
-				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
-			}
+			pgstat_io_end(io_start, io_object, io_context, IOOP_READ);
 
 			/* check for garbage data */
 			if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
@@ -2848,6 +2841,7 @@ BufferGetTag(Buffer buffer, RelFileLocator *rlocator, ForkNumber *forknum,
 	*blknum = bufHdr->tag.blockNum;
 }
 
+
 /*
  * FlushBuffer
  *		Physically write out a shared buffer.
@@ -2873,8 +2867,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 {
 	XLogRecPtr	recptr;
 	ErrorContextCallback errcallback;
-	instr_time	io_start,
-				io_time;
+	instr_time	io_start;
 	Block		bufBlock;
 	char	   *bufToWrite;
 	uint32		buf_state;
@@ -2949,10 +2942,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 */
 	bufToWrite = PageSetChecksumCopy((Page) bufBlock, buf->tag.blockNum);
 
-	if (track_io_timing)
-		INSTR_TIME_SET_CURRENT(io_start);
-	else
-		INSTR_TIME_SET_ZERO(io_start);
+	io_start = pgstat_io_start();
 
 	/*
 	 * bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -2981,17 +2971,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln, IOObject io_object,
 	 * When a strategy is not in use, the write can only be a "regular" write
 	 * of a dirty shared buffer (IOCONTEXT_NORMAL IOOP_WRITE).
 	 */
-	pgstat_count_io_op(IOOBJECT_RELATION, io_context, IOOP_WRITE);
-
-	if (track_io_timing)
-	{
-		INSTR_TIME_SET_CURRENT(io_time);
-		INSTR_TIME_SUBTRACT(io_time, io_start);
-		pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
-		INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
-	}
-
-	pgBufferUsage.shared_blks_written++;
+	pgstat_io_end(io_start, IOOBJECT_RELATION, io_context, IOOP_WRITE);
 
 	/*
 	 * Mark the buffer as clean (unless BM_JUST_DIRTIED has become set) and
@@ -3597,6 +3577,7 @@ FlushRelationBuffers(Relation rel)
 		for (i = 0; i < NLocBuffer; i++)
 		{
 			uint32		buf_state;
+			instr_time	io_start;
 
 			bufHdr = GetLocalBufferDescriptor(i);
 			if (BufTagMatchesRelFileLocator(&bufHdr->tag, &rel->rd_locator) &&
@@ -3616,6 +3597,12 @@ FlushRelationBuffers(Relation rel)
 
 				PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+				/*
+				 * TODO: is it okay that this formerly didn't count
+				 * pgBufferUsage or pgStatBlockReadTime
+				 */
+				io_start = pgstat_io_start();
+
 				smgrwrite(RelationGetSmgr(rel),
 						  BufTagGetForkNum(&bufHdr->tag),
 						  bufHdr->tag.blockNum,
@@ -3625,13 +3612,14 @@ FlushRelationBuffers(Relation rel)
 				buf_state &= ~(BM_DIRTY | BM_JUST_DIRTIED);
 				pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-				pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
+				pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 
 				/* Pop the error context stack */
 				error_context_stack = errcallback.previous;
 			}
 		}
 
+
 		return;
 	}
 
diff --git a/src/backend/storage/buffer/localbuf.c b/src/backend/storage/buffer/localbuf.c
index 5325ddb663..af45a72934 100644
--- a/src/backend/storage/buffer/localbuf.c
+++ b/src/backend/storage/buffer/localbuf.c
@@ -220,6 +220,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 	 */
 	if (buf_state & BM_DIRTY)
 	{
+		instr_time	io_start;
 		SMgrRelation oreln;
 		Page		localpage = (char *) LocalBufHdrGetBlock(bufHdr);
 
@@ -228,6 +229,8 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 
 		PageSetChecksumInplace(localpage, bufHdr->tag.blockNum);
 
+		io_start = pgstat_io_start();
+
 		/* And write... */
 		smgrwrite(oreln,
 				  BufTagGetForkNum(&bufHdr->tag),
@@ -239,8 +242,7 @@ LocalBufferAlloc(SMgrRelation smgr, ForkNumber forkNum, BlockNumber blockNum,
 		buf_state &= ~BM_DIRTY;
 		pg_atomic_unlocked_write_u32(&bufHdr->state, buf_state);
 
-		pgstat_count_io_op(IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
-		pgBufferUsage.local_blks_written++;
+		pgstat_io_end(io_start, IOOBJECT_TEMP_RELATION, IOCONTEXT_NORMAL, IOOP_WRITE);
 	}
 
 	/*
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 352958e1fe..2440211f03 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1030,6 +1030,17 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 
 	if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
 	{
+		instr_time	io_start = pgstat_io_start();
+
+		ereport(DEBUG1,
+				(errmsg_internal("could not forward fsync request because request queue is full")));
+
+		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(data_sync_elevel(ERROR),
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(seg->mdfd_vfd))));
+
 		/*
 		 * We have no way of knowing if the current IOContext is
 		 * IOCONTEXT_NORMAL or IOCONTEXT_[BULKREAD, BULKWRITE, VACUUM] at this
@@ -1041,16 +1052,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 		 * IOCONTEXT_NORMAL is likely clearer when investigating the number of
 		 * backend fsyncs.
 		 */
-		pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
-
-		ereport(DEBUG1,
-				(errmsg_internal("could not forward fsync request because request queue is full")));
-
-		if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
-			ereport(data_sync_elevel(ERROR),
-					(errcode_for_file_access(),
-					 errmsg("could not fsync file \"%s\": %m",
-							FilePathName(seg->mdfd_vfd))));
+		pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
 	}
 }
 
@@ -1399,6 +1401,7 @@ int
 mdsyncfiletag(const FileTag *ftag, char *path)
 {
 	SMgrRelation reln = smgropen(ftag->rlocator, InvalidBackendId);
+	instr_time	io_start;
 	File		file;
 	bool		need_to_close;
 	int			result,
@@ -1425,6 +1428,8 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 		need_to_close = true;
 	}
 
+	io_start = pgstat_io_start();
+
 	/* Sync the file. */
 	result = FileSync(file, WAIT_EVENT_DATA_FILE_SYNC);
 	save_errno = errno;
@@ -1432,7 +1437,7 @@ mdsyncfiletag(const FileTag *ftag, char *path)
 	if (need_to_close)
 		FileClose(file);
 
-	pgstat_count_io_op(IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
+	pgstat_io_end(io_start, IOOBJECT_RELATION, IOCONTEXT_NORMAL, IOOP_FSYNC);
 
 	errno = save_errno;
 	return result;
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 60fc4e761f..ed27dbfaab 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -95,9 +95,11 @@
 
 #include "access/transam.h"
 #include "access/xact.h"
+#include "executor/instrument.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
+#include "storage/bufmgr.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
 #include "storage/lwlock.h"
@@ -108,7 +110,6 @@
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 
-
 /* ----------
  * Timer definitions.
  *
@@ -906,6 +907,80 @@ pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid)
 	return pgstat_get_entry_ref(kind, dboid, objoid, false, NULL) != NULL;
 }
 
+instr_time
+pgstat_io_start(void)
+{
+	instr_time	io_start;
+
+	if (track_io_timing)
+		INSTR_TIME_SET_CURRENT(io_start);
+	else
+		INSTR_TIME_SET_ZERO(io_start);
+
+	return io_start;
+}
+
+void
+pgstat_io_end(instr_time io_start, IOObject io_object,
+					IOContext io_context, IOOp io_op)
+{
+	instr_time	io_time;
+
+	if (track_io_timing)
+	{
+		INSTR_TIME_SET_CURRENT(io_time);
+		INSTR_TIME_SUBTRACT(io_time, io_start);
+		pgstat_count_io_time(io_object, io_context, io_op, io_time);
+
+		if (io_op == IOOP_WRITE)
+		{
+			if (io_object == IOOBJECT_RELATION)
+			{
+				/* TODO: AFAICT, pgstat_count_buffer_write_time is only called */
+				/* for shared buffers whereas pgstat_count_buffer_read_time is */
+				/* called for temp relations and shared buffers. */
+				/*
+				 * is this intentional and should I match current behavior or
+				 * not?
+				 */
+				pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
+				INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
+			}
+		}
+		else if (io_op == IOOP_READ)
+		{
+			pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
+			if (io_object == IOOBJECT_RELATION)
+			{
+				INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
+			}
+		}
+	}
+
+	pgstat_count_io_op(io_object, io_context, io_op);
+
+	if (io_op == IOOP_WRITE)
+	{
+		if (io_object == IOOBJECT_RELATION)
+			pgBufferUsage.shared_blks_written++;
+		else if (io_object == IOOBJECT_TEMP_RELATION)
+			pgBufferUsage.local_blks_written++;
+	}
+
+	/*
+	 * TODO: this is normally done later in ReadBuffer_common() is it okay to
+	 * do here?
+	 */
+	else if (io_op == IOOP_READ)
+	{
+		if (io_object == IOOBJECT_RELATION)
+			pgBufferUsage.shared_blks_read++;
+		else if (io_object == IOOBJECT_TEMP_RELATION)
+			pgBufferUsage.local_blks_read++;
+	}
+}
+
+
 /*
  * Ensure snapshot for fixed-numbered 'kind' exists.
  *
diff --git a/src/backend/utils/activity/pgstat_io.c b/src/backend/utils/activity/pgstat_io.c
index af5d554610..4a151afed6 100644
--- a/src/backend/utils/activity/pgstat_io.c
+++ b/src/backend/utils/activity/pgstat_io.c
@@ -25,36 +25,48 @@ bool		have_iostats = false;
 
 /*
  * Check that stats have not been counted for any combination of IOObject,
- * IOContext, and IOOp which are not tracked for the passed-in BackendType. The
- * passed-in PgStat_BktypeIO must contain stats from the BackendType specified
- * by the second parameter. Caller is responsible for locking the passed-in
- * PgStat_BktypeIO, if needed.
+ * IOContext, and IOOp which are not tracked for the passed-in BackendType. If
+ * the IOOp is not counted for this combination but IO time is otherwise
+ * tracked for this IOOp, check that IO time has not been counted for this
+ * combination. If stats are tracked for this combination and IO times are
+ * non-zero, counts should be non-zero.
+ *
+ * The passed-in PgStat_BktypeIO must contain stats from the BackendType
+ * specified by the second parameter. Caller is responsible for locking the
+ * passed-in PgStat_BktypeIO, if needed.
  */
 bool
 pgstat_bktype_io_stats_valid(PgStat_BktypeIO *backend_io,
 							 BackendType bktype)
 {
-	bool		bktype_tracked = pgstat_tracks_io_bktype(bktype);
-
 	for (int io_object = 0; io_object < IOOBJECT_NUM_TYPES; io_object++)
 	{
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
-			/*
-			 * Don't bother trying to skip to the next loop iteration if
-			 * pgstat_tracks_io_object() would return false here. We still
-			 * need to validate that each counter is zero anyway.
-			 */
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 			{
-				/* No stats, so nothing to validate */
-				if (backend_io->data[io_object][io_context][io_op] == 0)
+				/* we do track it */
+				if (pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				{
+					/* ensure that if IO times are non-zero, counts are > 0 */
+					if (!INSTR_TIME_IS_ZERO(backend_io->times[io_object][io_context][io_op]) &&
+						backend_io->counts[io_object][io_context][io_op] <= 0)
+						return false;
+
 					continue;
+				}
 
-				/* There are stats and there shouldn't be */
-				if (!bktype_tracked ||
-					!pgstat_tracks_io_op(bktype, io_object, io_context, io_op))
+				/* we don't track it, and it is not 0 */
+				if (backend_io->counts[io_object][io_context][io_op] != 0)
 					return false;
+
+				/* we don't track this IOOp, so make sure its IO time is zero */
+				if (pgstat_tracks_io_time(io_op) > -1)
+				{
+					if (!INSTR_TIME_IS_ZERO(backend_io->times[io_object][io_context][io_op]))
+						return false;
+				}
+
 			}
 		}
 	}
@@ -70,7 +82,21 @@ pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op)
 	Assert((unsigned int) io_op < IOOP_NUM_TYPES);
 	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
 
-	PendingIOStats.data[io_object][io_context][io_op]++;
+	PendingIOStats.counts[io_object][io_context][io_op]++;
+
+	have_iostats = true;
+}
+
+void
+pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time)
+{
+	Assert(io_object < IOOBJECT_NUM_TYPES);
+	Assert(io_context < IOCONTEXT_NUM_TYPES);
+	Assert(io_op < IOOP_NUM_TYPES);
+	Assert(pgstat_tracks_io_op(MyBackendType, io_object, io_context, io_op));
+	Assert(pgstat_tracks_io_time(io_op) != -1);
+
+	INSTR_TIME_ADD(PendingIOStats.times[io_object][io_context][io_op], time);
 
 	have_iostats = true;
 }
@@ -114,8 +140,13 @@ pgstat_flush_io(bool nowait)
 		for (int io_context = 0; io_context < IOCONTEXT_NUM_TYPES; io_context++)
 		{
 			for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
-				bktype_shstats->data[io_object][io_context][io_op] +=
-					PendingIOStats.data[io_object][io_context][io_op];
+			{
+				bktype_shstats->counts[io_object][io_context][io_op] +=
+					PendingIOStats.counts[io_object][io_context][io_op];
+
+				INSTR_TIME_ADD(bktype_shstats->times[io_object][io_context][io_op],
+							   PendingIOStats.times[io_object][io_context][io_op]);
+			}
 		}
 	}
 
@@ -384,3 +415,30 @@ pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 
 	return true;
 }
+
+/*
+ * PgStat_BktypeIO->times contains IO times for IOOps. For simplicity this
+ * array has a spot for every IOOp. pgstat_tracks_io_time() is the source of
+ * truth for which IOOps have corresponding IO times.
+ */
+IOOp
+pgstat_tracks_io_time(IOOp io_op)
+{
+	switch (io_op)
+	{
+		case IOOP_READ:
+			return IOOP_READ;
+		case IOOP_WRITE:
+			return IOOP_WRITE;
+		case IOOP_EXTEND:
+			return IOOP_EXTEND;
+		case IOOP_FSYNC:
+			return IOOP_FSYNC;
+		case IOOP_EVICT:
+		case IOOP_REUSE:
+			return -1;
+	}
+
+	elog(ERROR, "unrecognized IOOp value: %d", io_op);
+	pg_unreachable();
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index b61a12382b..9c6e492fe6 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1255,12 +1255,16 @@ typedef enum io_stat_col
 	IO_COL_IO_OBJECT,
 	IO_COL_IO_CONTEXT,
 	IO_COL_READS,
+	IO_COL_READ_TIME,
 	IO_COL_WRITES,
+	IO_COL_WRITE_TIME,
 	IO_COL_EXTENDS,
+	IO_COL_EXTEND_TIME,
 	IO_COL_CONVERSION,
 	IO_COL_EVICTIONS,
 	IO_COL_REUSES,
 	IO_COL_FSYNCS,
+	IO_COL_FSYNC_TIME,
 	IO_COL_RESET_TIME,
 	IO_NUM_COLUMNS,
 } io_stat_col;
@@ -1292,6 +1296,21 @@ pgstat_get_io_op_index(IOOp io_op)
 	pg_unreachable();
 }
 
+/*
+ * Get the number of the column containing IO times for the specified IOOp. If
+ * the specified IOOp is one for which IO time is not tracked, return -1. Note
+ * that this function assumes that IO time for an IOOp is displayed in the view
+ * in the column directly after the IOOp counts.
+ */
+static io_stat_col
+pgstat_get_io_time_index(IOOp io_op)
+{
+	if (pgstat_tracks_io_time(io_op) == -1)
+		return -1;
+
+	return pgstat_get_io_op_index(io_op) + 1;
+}
+
 Datum
 pg_stat_get_io(PG_FUNCTION_ARGS)
 {
@@ -1359,20 +1378,32 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 
 				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
 				{
-					int			col_idx = pgstat_get_io_op_index(io_op);
+					PgStat_Counter count = bktype_stats->counts[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_op_index(io_op);
 
 					/*
 					 * Some combinations of BackendType and IOOp, of IOContext
 					 * and IOOp, and of IOObject and IOOp are not tracked. Set
 					 * these cells in the view NULL.
 					 */
-					nulls[col_idx] = !pgstat_tracks_io_op(bktype, io_obj, io_context, io_op);
+					if (pgstat_tracks_io_op(bktype, io_obj, io_context, io_op))
+						values[i] = Int64GetDatum(count);
+					else
+						nulls[i] = true;
+				}
+
+				for (int io_op = 0; io_op < IOOP_NUM_TYPES; io_op++)
+				{
+					instr_time	time = bktype_stats->times[io_obj][io_context][io_op];
+					int			i = pgstat_get_io_time_index(io_op);
 
-					if (nulls[col_idx])
+					if (i == -1)
 						continue;
 
-					values[col_idx] =
-						Int64GetDatum(bktype_stats->data[io_obj][io_context][io_op]);
+					if (!nulls[pgstat_get_io_op_index(io_op)])
+						values[i] = Float8GetDatum(INSTR_TIME_GET_MILLISEC(time));
+					else
+						nulls[i] = true;
 				}
 
 				tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 505595620e..80c0627209 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5721,9 +5721,9 @@
   proname => 'pg_stat_get_io', provolatile => 'v',
   prorows => '30', proretset => 't',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{text,text,text,int8,int8,int8,int8,int8,int8,int8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{backend_type,io_object,io_context,reads,writes,extends,op_bytes,evictions,reuses,fsyncs,stats_reset}',
+  proallargtypes => '{text,text,text,int8,float8,int8,float8,int8,float8,int8,int8,int8,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{backend_type,io_object,io_context,reads,read_time,writes,write_time,extends,extend_time,op_bytes,evictions,reuses,fsyncs,fsync_time,stats_reset}',
   prosrc => 'pg_stat_get_io' },
 
 { oid => '1136', descr => 'statistics: information about WAL activity',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f43fac09ed..a214176c1a 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -313,7 +313,8 @@ typedef enum IOOp
 
 typedef struct PgStat_BktypeIO
 {
-	PgStat_Counter data[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	PgStat_Counter counts[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
+	instr_time	times[IOOBJECT_NUM_TYPES][IOCONTEXT_NUM_TYPES][IOOP_NUM_TYPES];
 } PgStat_BktypeIO;
 
 typedef struct PgStat_IO
@@ -475,6 +476,11 @@ extern TimestampTz pgstat_get_stat_snapshot_timestamp(bool *have_snapshot);
 extern PgStat_Kind pgstat_get_kind_from_str(char *kind_str);
 extern bool pgstat_have_entry(PgStat_Kind kind, Oid dboid, Oid objoid);
 
+extern instr_time pgstat_io_start(void);
+
+extern void pgstat_io_end(instr_time io_start, IOObject io_object,
+								IOContext io_context, IOOp io_op);
+
 
 /*
  * Functions in pgstat_archiver.c
@@ -507,6 +513,7 @@ extern PgStat_CheckpointerStats *pgstat_fetch_stat_checkpointer(void);
 extern bool pgstat_bktype_io_stats_valid(PgStat_BktypeIO *context_ops,
 										 BackendType bktype);
 extern void pgstat_count_io_op(IOObject io_object, IOContext io_context, IOOp io_op);
+extern void pgstat_count_io_time(IOObject io_object, IOContext io_context, IOOp io_op, instr_time time);
 extern PgStat_IO *pgstat_fetch_stat_io(void);
 extern const char *pgstat_get_io_context_name(IOContext io_context);
 extern const char *pgstat_get_io_object_name(IOObject io_object);
@@ -516,6 +523,7 @@ extern bool pgstat_tracks_io_object(BackendType bktype,
 									IOObject io_object, IOContext io_context);
 extern bool pgstat_tracks_io_op(BackendType bktype, IOObject io_object,
 								IOContext io_context, IOOp io_op);
+extern IOOp pgstat_tracks_io_time(IOOp io_op);
 
 
 /*
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e953d1f515..5434851314 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1880,14 +1880,18 @@ pg_stat_io| SELECT backend_type,
     io_object,
     io_context,
     reads,
+    read_time,
     writes,
+    write_time,
     extends,
+    extend_time,
     op_bytes,
     evictions,
     reuses,
     fsyncs,
+    fsync_time,
     stats_reset
-   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, writes, extends, op_bytes, evictions, reuses, fsyncs, stats_reset);
+   FROM pg_stat_get_io() b(backend_type, io_object, io_context, reads, read_time, writes, write_time, extends, extend_time, op_bytes, evictions, reuses, fsyncs, fsync_time, stats_reset);
 pg_stat_progress_analyze| SELECT s.pid,
     s.datid,
     d.datname,
-- 
2.37.2

Re: Track IO times in pg_stat_io

Reply via email to