Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Andres Freund Tue, 12 Jul 2022 10:01:56 -0700

Hi,

On 2022-07-12 12:19:06 -0400, Melanie Plageman wrote:
> > > I also realized that I am not differentiating between IOPATH_SHARED and
> > > IOPATH_STRATEGY for IOOP_FSYNC. But, given that we don't know what type
> > > of buffer we are fsync'ing by the time we call register_dirty_segment(),
> > > I'm not sure how we would fix this.
> >
> > I think there scarcely happens flush for strategy-loaded buffers.  If
> > that is sensible, IOOP_FSYNC would not make much sense for
> > IOPATH_STRATEGY.
> >
> 
> Why would it be less likely for a backend to do its own fsync when
> flushing a dirty strategy buffer than a regular dirty shared buffer?


We really just don't expect a backend to do many segment fsyncs at
all. Otherwise there's something wrong with the forwarding mechanism.

It'd be different if we tracked WAL fsyncs more granularly - which would be
quite interesting - but that's something for another day^Wpatch.


> > > > Wonder if it's worth making the lock specific to the backend type?
> > > >
> > >
> > > I've added another Lock into PgStat_IOPathOps so that each BackendType
> > > can be locked separately. But, I've also kept the lock in
> > > PgStatShared_BackendIOPathOps so that reset_all and snapshot could be
> > > done easily.
> >
> > Looks fine about the lock separation.
> >
> 
> Actually, I think it is not safe to use both of these locks. So for
> picking one method, it is probably better to go with the locks in
> PgStat_IOPathOps, it will be more efficient for flush (and not for
> fetching and resetting), so that is probably the way to go, right?

I think it's good to just use one kind of lock, and efficiency of snapshotting
/ resetting is nearly irrelevant. But I don't see why it's not safe to use
both kinds of locks?


> > Looks fine, but I think pgstat_flush_io_ops() need more comments like
> > other pgstat_flush_* functions.
> >
> > +       for (int i = 0; i < BACKEND_NUM_TYPES; i++)
> > +               stats_shmem->stats[i].stat_reset_timestamp = ts;
> >
> > I'm not sure we need a separate reset timestamp for each backend type
> > but SLRU counter does the same thing..
> >
> 
> Yes, I think for SLRU stats it is because you can reset individual SLRU
> stats. Also there is no wrapper data structure to put it in. I could
> keep it in PgStatShared_BackendIOPathOps since you have to reset all IO
> operation stats at once, but I am thinking of getting rid of
> PgStatShared_BackendIOPathOps since it is not needed if I only keep the
> locks in PgStat_IOPathOps and make the global shared value an array of
> PgStat_IOPathOps.

I'm strongly against introducing super granular reset timestamps. I think that
was a mistake for SLRU stats, but we can't fix that as easily.


> Currently, strategy allocs count only reuses of a strategy buffer (not
> initial shared buffers which are added to the ring).
> strategy writes count only the writing out of dirty buffers which are
> already in the ring and are being reused.

That seems right to me.


> Alternatively, we could also count as strategy allocs all those buffers
> which are added to the ring and count as strategy writes  all those
> shared buffers which are dirty when initially added to the ring.

I don't think that'd provide valuable information. The whole reason that
strategy writes are interesting is that they can lead to writing out data a
lot sooner than they would be written out without a strategy being used.


> Subject: [PATCH v24 2/3] Track IO operation statistics
> 
> Introduce "IOOp", an IO operation done by a backend, and "IOPath", the
> location or type of IO done by a backend. For example, the checkpointer
> may write a shared buffer out. This would be counted as an IOOp write on
> an IOPath IOPATH_SHARED by BackendType "checkpointer".

I'm still not 100% happy with IOPath - seems a bit too easy to confuse with
the file path. What about 'origin'?


> Each IOOp (alloc, fsync, extend, write) is counted per IOPath
> (direct, local, shared, or strategy) through a call to
> pgstat_count_io_op().

It seems we should track reads too - it's quite interesting to know whether
reads happened because of a strategy, for example. You do reference reads in a
later part of the commit message even :)


> The primary concern of these statistics is IO operations on data blocks
> during the course of normal database operations. IO done by, for
> example, the archiver or syslogger is not counted in these statistics.

We could extend this at a later stage, if we really want to. But I'm not sure
it's interesting or fully possible. E.g. the archiver's write are largely not
done by the archiver itself, but by a command (or module these days) it shells
out to.


> Note that this commit does not add code to increment IOPATH_DIRECT. A
> future patch adding wrappers for smgrwrite(), smgrextend(), and
> smgrimmedsync() would provide a good location to call
> pgstat_count_io_op() for unbuffered IO and avoid regressions for future
> users of these functions.

Hm. Perhaps we should defer introducing IOPATH_DIRECT for now then?


> Stats on IOOps for all IOPaths for a backend are initially accumulated
> locally.
> 
> Later they are flushed to shared memory and accumulated with those from
> all other backends, exited and live.

Perhaps mention here that this later could be extended to make per-connection
stats visible?


> Some BackendTypes will not execute pgstat_report_stat() and thus must
> explicitly call pgstat_flush_io_ops() in order to flush their backend
> local IO operation statistics to shared memory.

Maybe add "flush ... during ongoing operation" or such? Because they'd all
flush at commit, IIRC.


> diff --git a/src/backend/bootstrap/bootstrap.c 
> b/src/backend/bootstrap/bootstrap.c
> index 088556ab54..963b05321e 100644
> --- a/src/backend/bootstrap/bootstrap.c
> +++ b/src/backend/bootstrap/bootstrap.c
> @@ -33,6 +33,7 @@
>  #include "miscadmin.h"
>  #include "nodes/makefuncs.h"
>  #include "pg_getopt.h"
> +#include "pgstat.h"
>  #include "storage/bufmgr.h"
>  #include "storage/bufpage.h"
>  #include "storage/condition_variable.h"

Hm?


> diff --git a/src/backend/postmaster/walwriter.c 
> b/src/backend/postmaster/walwriter.c
> index e926f8c27c..beb46dcb55 100644
> --- a/src/backend/postmaster/walwriter.c
> +++ b/src/backend/postmaster/walwriter.c
> @@ -293,18 +293,7 @@ HandleWalWriterInterrupts(void)
>       }
>  
>       if (ShutdownRequestPending)
> -     {
> -             /*
> -              * Force reporting remaining WAL statistics at process exit.
> -              *
> -              * Since pgstat_report_wal is invoked with 'force' is false in 
> main
> -              * loop to avoid overloading the cumulative stats system, there 
> may
> -              * exist unreported stats counters for the WAL writer.
> -              */
> -             pgstat_report_wal(true);
> -
>               proc_exit(0);
> -     }
>  
>       /* Perform logging of memory contexts of this process */
>       if (LogMemoryContextPending)

Let's do this in a separate commit and get it out of the way...


> @@ -682,16 +694,37 @@ AddBufferToRing(BufferAccessStrategy strategy, 
> BufferDesc *buf)
>   * if this buffer should be written and re-used.
>   */
>  bool
> -StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf)
> +StrategyRejectBuffer(BufferAccessStrategy strategy, BufferDesc *buf, bool 
> *write_from_ring)
>  {
> -     /* We only do this in bulkread mode */
> +
> +     /*
> +      * We only reject reusing and writing out the strategy buffer this in
> +      * bulkread mode.
> +      */
>       if (strategy->btype != BAS_BULKREAD)
> +     {
> +             /*
> +              * If the buffer was from the ring and we are not rejecting it, 
> consider it
> +              * a write of a strategy buffer.
> +              */
> +             if (strategy->current_was_in_ring)
> +                     *write_from_ring = true;

Hm. This is set even if the buffer wasn't dirty? I guess we don't expect
StrategyRejectBuffer() to be called for clean buffers...


>  /*
> diff --git a/src/backend/utils/activity/pgstat_database.c 
> b/src/backend/utils/activity/pgstat_database.c
> index d9275611f0..d3963f59d0 100644
> --- a/src/backend/utils/activity/pgstat_database.c
> +++ b/src/backend/utils/activity/pgstat_database.c
> @@ -47,7 +47,8 @@ pgstat_drop_database(Oid databaseid)
>  }
>  
>  /*
> - * Called from autovacuum.c to report startup of an autovacuum process.
> + * Called from autovacuum.c to report startup of an autovacuum process and
> + * flush IO Operation statistics.
>   * We are called before InitPostgres is done, so can't rely on MyDatabaseId;
>   * the db OID must be passed in, instead.
>   */
> @@ -72,6 +73,11 @@ pgstat_report_autovac(Oid dboid)
>       dbentry->stats.last_autovac_time = GetCurrentTimestamp();
>  
>       pgstat_unlock_entry(entry_ref);
> +
> +     /*
> +      * Report IO Operation statistics
> +      */
> +     pgstat_flush_io_ops(false);
>  }

Hm. I suspect this will always be zero - at this point we haven't connected to
a database, so there really can't have been much, if any, IO. I think I
suggested doing something here, but on a second look it really doesn't make
much sense.

Note that that's different from doing something in
pgstat_report_(vacuum|analyze) - clearly we've done something at that point.


>  /*
> - * Report that the table was just vacuumed.
> + * Report that the table was just vacuumed and flush IO Operation statistics.
>   */
>  void
>  pgstat_report_vacuum(Oid tableoid, bool shared,
> @@ -257,10 +257,15 @@ pgstat_report_vacuum(Oid tableoid, bool shared,
>       }
>  
>       pgstat_unlock_entry(entry_ref);
> +
> +     /*
> +      * Report IO Operations statistics
> +      */
> +     pgstat_flush_io_ops(false);
>  }
>  
>  /*
> - * Report that the table was just analyzed.
> + * Report that the table was just analyzed and flush IO Operation statistics.
>   *
>   * Caller must provide new live- and dead-tuples estimates, as well as a
>   * flag indicating whether to reset the changes_since_analyze counter.
> @@ -340,6 +345,11 @@ pgstat_report_analyze(Relation rel,
>       }
>  
>       pgstat_unlock_entry(entry_ref);
> +
> +     /*
> +      * Report IO Operations statistics
> +      */
> +     pgstat_flush_io_ops(false);
>  }

Think it'd be good to amend these comments to say that otherwise stats would
only get flushed after a multi-relatio autovacuum cycle is done / a
VACUUM/ANALYZE command processed all tables.  Perhaps add the comment to one
of the two functions, and just reference it in the other place?


> --- a/src/include/utils/backend_status.h
> +++ b/src/include/utils/backend_status.h
> @@ -306,6 +306,40 @@ extern const char 
> *pgstat_get_crashed_backend_activity(int pid, char *buffer,
>                                                                               
>                            int buflen);
>  extern uint64 pgstat_get_my_query_id(void);
>  
> +/* Utility functions */
> +
> +/*
> + * When maintaining an array of information about all valid BackendTypes, in
> + * order to avoid wasting the 0th spot, use this helper to convert a valid
> + * BackendType to a valid location in the array (given that no spot is
> + * maintained for B_INVALID BackendType).
> + */
> +static inline int backend_type_get_idx(BackendType backend_type)
> +{
> +     /*
> +      * backend_type must be one of the valid backend types. If caller is
> +      * maintaining backend information in an array that includes B_INVALID,
> +      * this function is unnecessary.
> +      */
> +     Assert(backend_type > B_INVALID && backend_type <= BACKEND_NUM_TYPES);
> +     return backend_type - 1;
> +}

In function definitions (vs declarations) we put the 'static inline int' in a
separate line from the rest of the function signature.

> +/*
> + * When using a value from an array of information about all valid
> + * BackendTypes, add 1 to the index before using it as a BackendType to 
> adjust
> + * for not maintaining a spot for B_INVALID BackendType.
> + */
> +static inline BackendType idx_get_backend_type(int idx)
> +{
> +     int backend_type = idx + 1;
> +     /*
> +      * If the array includes a spot for B_INVALID BackendType this function 
> is
> +      * not required.

The comments around this seem a bit over the top, but I also don't mind them
much.


> Add pg_stat_io, a system view which tracks the number of IOOp (allocs,
> writes, fsyncs, and extends) done through each IOPath (e.g. shared
> buffers, local buffers, unbuffered IO) by each type of backend.

Annoying question: pg_stat_io vs pg_statio? I'd not think of suggesting the
latter, except that we already have a bunch of views with that prefix.


> Some of these should always be zero. For example, checkpointer does not
> use a BufferAccessStrategy (currently), so the "strategy" IOPath for
> checkpointer will be 0 for all IOOps.

What do you think about returning NULL for the values that we except to never
be non-zero? Perhaps with an assert against non-zero values? Seems like it
might be helpful for understanding the view.


> +/*
> +* When adding a new column to the pg_stat_io view, add a new enum
> +* value here above IO_NUM_COLUMNS.
> +*/
> +enum
> +{
> +     IO_COLUMN_BACKEND_TYPE,
> +     IO_COLUMN_IO_PATH,
> +     IO_COLUMN_ALLOCS,
> +     IO_COLUMN_EXTENDS,
> +     IO_COLUMN_FSYNCS,
> +     IO_COLUMN_WRITES,
> +     IO_COLUMN_RESET_TIME,
> +     IO_NUM_COLUMNS,
> +};

We typedef pretty much every enum so the enum can be referenced without the
'enum' prefix. I'd do that here, even if we don't need it.



Greetings,

Andres Freund

Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

Reply via email to