Hi,

While working on [1], I needed to know WAL read stats (the number of times and the amount of WAL data read from disk, and the time taken to read it) to measure the benefit, and I had to write a developer patch to capture them, because pg_stat_wal currently emits only WAL write stats. Given the recent work on pg_stat_io, which emits data read I/O stats too, I think it's better not to miss WAL read stats. It might also help others who keep an eye on the IOPS of production servers for various reasons. The WAL read stats I think are useful are:

wal_read_bytes - total amount of WAL read from disk
wal_read - total number of times WAL is read from disk
wal_read_time - total amount of time spent reading WAL (tracked only when the existing GUC track_wal_io_timing is on)
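For instance, with the attached patch applied, the new counters could be watched alongside the existing write stats (a sketch; wal_read_time stays zero unless track_wal_io_timing is enabled):

-- Requires the attached patch; these columns don't exist without it.
SELECT wal_read,
       wal_read_bytes,
       wal_read_time
FROM pg_stat_wal;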
I came up with a patch, attached here. The WAL readers that add to WAL read stats are the WAL senders, the startup process, and other backends using xlogreader for logical replication or the pg_walinspect SQL functions (a way to exercise the pg_walinspect path is sketched in the postscript below). They all report stats to shared memory by calling pgstat_report_wal() in appropriate locations. In standby mode, calling pgstat_report_wal() for every record seems costly, so I chose to report stats every 1024 WAL records (a number picked at random; suggestions for a better way are welcome).

Note that the patch needs a bit more work, per [2]. With the patch, the WAL senders (processes that exit after the checkpointer) will generate stats, and we need to let either all WAL senders or only one of them write stats to disk at shutdown. Letting only one WAL sender write might be tricky. Letting all WAL senders write might cause too many writes to the stats file, and we would need a lock so that only one process writes at a time. I can't think of a good way to do this at the moment. Thoughts?

[1] https://www.postgresql.org/message-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=q...@mail.gmail.com

[2]
/*
 * Write out stats after shutdown. This needs to be called by exactly one
 * process during a normal shutdown, and since checkpointer is shut down
 * very late...
 *
 * Walsenders are shut down after the checkpointer, but currently don't
 * report stats. If that changes, we need a more complicated solution.
 */
before_shmem_exit(pgstat_before_server_shutdown, 0);

-- 
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
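P.S. Here's a rough way to watch the new counters move through the pg_walinspect path (a sketch; the start LSN is illustrative and must point to valid WAL on your server):

CREATE EXTENSION pg_walinspect;

-- Note down the baseline.
SELECT wal_read, wal_read_bytes, wal_read_time FROM pg_stat_wal;

-- Make this backend read WAL from disk via xlogreader; with the patch,
-- the pending stats are flushed when the reader is freed.
SELECT count(*) FROM pg_get_wal_records_info('0/01000000', pg_current_wal_lsn());

-- The counters should have advanced.
SELECT wal_read, wal_read_bytes, wal_read_time FROM pg_stat_wal;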
From c123e4f5e94cbb4881ce4d65c9dec3c4d5d72778 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 15 Feb 2023 13:29:22 +0000
Subject: [PATCH v1] Add WAL read stats to pg_stat_wal

---
 doc/src/sgml/monitoring.sgml              | 32 +++++++++++++++++++++
 src/backend/access/transam/xlogreader.c   | 32 +++++++++++++++++++++
 src/backend/access/transam/xlogrecovery.c | 34 +++++++++++++++++++++++
 src/backend/catalog/system_views.sql      |  3 ++
 src/backend/replication/walsender.c       | 12 ++++++++
 src/backend/utils/activity/pgstat_wal.c   |  6 +++-
 src/backend/utils/adt/pgstatfuncs.c       | 24 ++++++++++++++--
 src/include/catalog/pg_proc.dat           |  6 ++--
 src/include/pgstat.h                      |  9 ++++++
 src/test/regress/expected/rules.out       |  5 +++-
 10 files changed, 155 insertions(+), 8 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index dca50707ad..20b16405b5 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -4088,6 +4088,38 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
      </para></entry>
     </row>
 
+    <row>
+     <entry role="catalog_table_entry"><para role="column_definition">
+      <structfield>wal_read_bytes</structfield> <type>numeric</type>
+     </para>
+     <para>
+      Total amount of WAL read from disk in bytes, either by recovery process
+      for replay, or by WAL senders for replication, or by backend processes.
+     </para></entry>
+    </row>
+
+    <row>
+     <entry role="catalog_table_entry"><para role="column_definition">
+      <structfield>wal_read</structfield> <type>bigint</type>
+     </para>
+     <para>
+      Number of times WAL is read from disk, either by recovery process for
+      replay, or by WAL senders for replication, or by backend processes.
+     </para></entry>
+    </row>
+
+    <row>
+     <entry role="catalog_table_entry"><para role="column_definition">
+      <structfield>wal_read_time</structfield> <type>double precision</type>
+     </para>
+     <para>
+      Total amount of time spent reading WAL from disk, either by recovery
+      process for replay, or by WAL senders for replication, or by backend
+      processes, in milliseconds (if <xref linkend="guc-track-wal-io-timing"/>
+      is enabled, otherwise zero).
+     </para></entry>
+    </row>
+
     <row>
      <entry role="catalog_table_entry"><para role="column_definition">
       <structfield>stats_reset</structfield> <type>timestamp with time zone</type>
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index aa6c929477..29a0938dba 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -180,6 +180,11 @@ XLogReaderFree(XLogReaderState *state)
 		pfree(state->readRecordBuf);
 	pfree(state->readBuf);
 	pfree(state);
+
+#ifndef FRONTEND
+	/* Report pending statistics to the cumulative stats system */
+	pgstat_report_wal(true);
+#endif
 }
 
 /*
@@ -1506,6 +1511,9 @@ WALRead(XLogReaderState *state,
 		uint32		startoff;
 		int			segbytes;
 		int			readbytes;
+#ifndef FRONTEND
+		instr_time	start;
+#endif
 
 		startoff = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
@@ -1541,6 +1549,12 @@ WALRead(XLogReaderState *state,
 			segbytes = nbytes;
 
 #ifndef FRONTEND
+		/* Measure I/O timing to read WAL data */
+		if (track_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+		else
+			INSTR_TIME_SET_ZERO(start);
+
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
 
@@ -1550,6 +1564,21 @@ WALRead(XLogReaderState *state,
 
 #ifndef FRONTEND
 		pgstat_report_wait_end();
+
+		/*
+		 * Increment the I/O time to read WAL data and the number of times WAL
+		 * data was read from disk.
+		 */
+		if (track_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			PendingWalStats.wal_read_time += INSTR_TIME_GET_MICROSEC(duration);
+		}
+
+		PendingWalStats.wal_read++;
 #endif
 
 		if (readbytes <= 0)
@@ -1568,6 +1597,9 @@ WALRead(XLogReaderState *state,
 		p += readbytes;
 	}
 
+#ifndef FRONTEND
+	PendingWalStats.wal_read_bytes += count;
+#endif
 	return true;
 }
 
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index dbe9394762..636440dd8f 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -1587,6 +1587,7 @@ PerformWalRecovery(void)
 	XLogRecord *record;
 	bool		reachedRecoveryTarget = false;
 	TimeLineID	replayTLI;
+	uint32		pgstat_report_wal_frequency = 0;
 
 	/*
 	 * Initialize shared variables for tracking progress of WAL replay, as if
@@ -1745,6 +1746,16 @@ PerformWalRecovery(void)
 			 */
 			ApplyWalRecord(xlogreader, record, &replayTLI);
 
+			/*
+			 * Report pending statistics to the cumulative stats system once
+			 * every PGSTAT_REPORT_FREQUENCY times to not hinder performance.
+			 */
+			if (pgstat_report_wal_frequency++ == PGSTAT_REPORT_FREQUENCY)
+			{
+				pgstat_report_wal(false);
+				pgstat_report_wal_frequency = 0;
+			}
+
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
 			{
@@ -3201,6 +3212,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	uint32		targetPageOff;
 	XLogSegNo	targetSegNo PG_USED_FOR_ASSERTS_ONLY;
 	int			r;
+	instr_time	start;
 
 	XLByteToSeg(targetPagePtr, targetSegNo, wal_segment_size);
 	targetPageOff = XLogSegmentOffset(targetPagePtr, wal_segment_size);
@@ -3293,6 +3305,12 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
+	/* Measure I/O timing to read WAL data */
+	if (track_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
+	else
+		INSTR_TIME_SET_ZERO(start);
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
 	if (r != XLOG_BLCKSZ)
@@ -3321,6 +3339,22 @@ retry:
 	}
 	pgstat_report_wait_end();
 
+	/*
+	 * Increment the I/O time to read WAL data, the number of times and number
+	 * of bytes WAL data was read from disk.
+	 */
+	if (track_wal_io_timing)
+	{
+		instr_time	duration;
+
+		INSTR_TIME_SET_CURRENT(duration);
+		INSTR_TIME_SUBTRACT(duration, start);
+		PendingWalStats.wal_read_time += INSTR_TIME_GET_MICROSEC(duration);
+	}
+
+	PendingWalStats.wal_read++;
+	PendingWalStats.wal_read_bytes += XLOG_BLCKSZ;
+
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 34ca0e739f..e3fd48690a 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -1142,6 +1142,9 @@ CREATE VIEW pg_stat_wal AS
         w.wal_sync,
         w.wal_write_time,
        w.wal_sync_time,
+        w.wal_read_bytes,
+        w.wal_read,
+        w.wal_read_time,
         w.stats_reset
     FROM pg_stat_get_wal() w;
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 75e8363e24..8a7d9e08e3 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -318,6 +318,9 @@ WalSndErrorCleanup(void)
 	ConditionVariableCancelSleep();
 	pgstat_report_wait_end();
 
+	/* Report pending statistics to the cumulative stats system */
+	pgstat_report_wal(true);
+
 	if (xlogreader != NULL && xlogreader->seg.ws_file >= 0)
 		wal_segment_close(xlogreader);
 
@@ -384,6 +387,9 @@ WalSndShutdown(void)
 	if (whereToSendOutput == DestRemote)
 		whereToSendOutput = DestNone;
 
+	/* Report pending statistics to the cumulative stats system */
+	pgstat_report_wal(true);
+
 	proc_exit(0);
 	abort();					/* keep the compiler quiet */
 }
@@ -2561,6 +2567,9 @@ WalSndLoop(WalSndSendDataCallback send_data)
 			/* Sleep until something happens or we time out */
 			WalSndWait(wakeEvents, sleeptime, WAIT_EVENT_WAL_SENDER_MAIN);
 		}
+
+		/* Report pending statistics to the cumulative stats system */
+		pgstat_report_wal(false);
 	}
 }
 
@@ -3118,6 +3127,9 @@ WalSndDone(WalSndSendDataCallback send_data)
 	replicatedPtr = XLogRecPtrIsInvalid(MyWalSnd->flush) ?
 		MyWalSnd->write : MyWalSnd->flush;
 
+	/* Report pending statistics to the cumulative stats system */
+	pgstat_report_wal(true);
+
 	if (WalSndCaughtUp && sentPtr == replicatedPtr &&
 		!pq_is_send_pending())
 	{
diff --git a/src/backend/utils/activity/pgstat_wal.c b/src/backend/utils/activity/pgstat_wal.c
index e8598b2f4e..a939e614cd 100644
--- a/src/backend/utils/activity/pgstat_wal.c
+++ b/src/backend/utils/activity/pgstat_wal.c
@@ -107,6 +107,9 @@ pgstat_flush_wal(bool nowait)
 	WALSTAT_ACC(wal_sync);
 	WALSTAT_ACC(wal_write_time);
 	WALSTAT_ACC(wal_sync_time);
+	WALSTAT_ACC(wal_read_bytes);
+	WALSTAT_ACC(wal_read);
+	WALSTAT_ACC(wal_read_time);
 #undef WALSTAT_ACC
 
 	LWLockRelease(&stats_shmem->lock);
@@ -147,7 +150,8 @@ pgstat_have_pending_wal(void)
 {
 	return pgWalUsage.wal_records != prevWalUsage.wal_records ||
 		PendingWalStats.wal_write != 0 ||
-		PendingWalStats.wal_sync != 0;
+		PendingWalStats.wal_sync != 0 ||
+		PendingWalStats.wal_read != 0;
 }
 
 void
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 9d707c3521..99eebd5c88 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -1392,7 +1392,7 @@ pg_stat_get_io(PG_FUNCTION_ARGS)
 Datum
 pg_stat_get_wal(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_COLS	9
+#define PG_STAT_GET_WAL_COLS	12
 	TupleDesc	tupdesc;
 	Datum		values[PG_STAT_GET_WAL_COLS] = {0};
 	bool		nulls[PG_STAT_GET_WAL_COLS] = {0};
@@ -1417,7 +1417,13 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 					   FLOAT8OID, -1, 0);
 	TupleDescInitEntry(tupdesc, (AttrNumber) 8, "wal_sync_time",
 					   FLOAT8OID, -1, 0);
-	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "stats_reset",
+	TupleDescInitEntry(tupdesc, (AttrNumber) 9, "wal_read_bytes",
+					   NUMERICOID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 10, "wal_read",
+					   INT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 11, "wal_read_time",
+					   FLOAT8OID, -1, 0);
+	TupleDescInitEntry(tupdesc, (AttrNumber) 12, "stats_reset",
 					   TIMESTAMPTZOID, -1, 0);
 
 	BlessTupleDesc(tupdesc);
@@ -1444,7 +1450,19 @@ pg_stat_get_wal(PG_FUNCTION_ARGS)
 	values[6] = Float8GetDatum(((double) wal_stats->wal_write_time) / 1000.0);
 	values[7] = Float8GetDatum(((double) wal_stats->wal_sync_time) / 1000.0);
 
-	values[8] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
+	/* Convert to numeric. */
+	snprintf(buf, sizeof buf, UINT64_FORMAT, wal_stats->wal_read_bytes);
+	values[8] = DirectFunctionCall3(numeric_in,
+									CStringGetDatum(buf),
+									ObjectIdGetDatum(0),
+									Int32GetDatum(-1));
+
+	values[9] = Int64GetDatum(wal_stats->wal_read);
+
+	/* Convert counter from microsec to millisec for display */
+	values[10] = Float8GetDatum(((double) wal_stats->wal_read_time) / 1000.0);
+
+	values[11] = TimestampTzGetDatum(wal_stats->stat_reset_timestamp);
 
 	/* Returns the record as Datum */
 	PG_RETURN_DATUM(HeapTupleGetDatum(heap_form_tuple(tupdesc, values, nulls)));
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66b73c3900..ece28e6c6c 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5729,9 +5729,9 @@
 { oid => '1136', descr => 'statistics: information about WAL activity',
   proname => 'pg_stat_get_wal', proisstrict => 'f', provolatile => 's',
   proparallel => 'r', prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o}',
-  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,stats_reset}',
+  proallargtypes => '{int8,int8,numeric,int8,int8,int8,float8,float8,numeric,int8,float8,timestamptz}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{wal_records,wal_fpi,wal_bytes,wal_buffers_full,wal_write,wal_sync,wal_write_time,wal_sync_time,wal_read_bytes,wal_read,wal_read_time,stats_reset}',
   prosrc => 'pg_stat_get_wal' },
 { oid => '6248', descr => 'statistics: information about WAL prefetching',
   proname => 'pg_stat_get_recovery_prefetch', prorows => '1', proretset => 't',
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index db9675884f..d062fb6e52 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -442,10 +442,19 @@ typedef struct PgStat_WalStats
 	PgStat_Counter wal_sync;
 	PgStat_Counter wal_write_time;
 	PgStat_Counter wal_sync_time;
+	uint64		wal_read_bytes;
+	PgStat_Counter wal_read;
+	PgStat_Counter wal_read_time;
 	TimestampTz stat_reset_timestamp;
 } PgStat_WalStats;
 
 
+/*
+ * When stats captured in resource-intensive/tight loops, report pending stats
+ * once every PGSTAT_REPORT_FREQUENCY times.
+ */
+#define PGSTAT_REPORT_FREQUENCY	1024
+
 /*
  * Functions in pgstat.c
  */
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 174b725fff..8656adb870 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2215,8 +2215,11 @@ pg_stat_wal| SELECT wal_records,
     wal_sync,
     wal_write_time,
     wal_sync_time,
+    wal_read_bytes,
+    wal_read,
+    wal_read_time,
     stats_reset
-   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, stats_reset);
+   FROM pg_stat_get_wal() w(wal_records, wal_fpi, wal_bytes, wal_buffers_full, wal_write, wal_sync, wal_write_time, wal_sync_time, wal_read_bytes, wal_read, wal_read_time, stats_reset);
 pg_stat_wal_receiver| SELECT pid,
     status,
     receive_start_lsn,
-- 
2.34.1