On 2025-02-10 05:06, Jelte Fennema-Nio wrote:
Thanks for reviewing the patch and for your comments!
I've fixed the issues you pointed out and attached a v2 patch.
> On Sun, 9 Feb 2025 at 19:05, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Andres Freund <and...@anarazel.de> writes:
>>> I'm somewhat against this patch, as it's fairly fundamentally incompatible
>>> with AIO. There's no real way to get information in this manner if the IO
>>> isn't executed synchronously in process context...
>
> Hmm, I had not considered how this would interact with your AIO work.
> I agree that getting this info efficiently would be hard/impossible
> when IOs are done by background IO processes that interleave IOs from
> different queries. But I'd expect that AIOs done using io_uring would
> be tracked correctly without having to change this code at all
> (because I assume those are issued from the query backend process).
>
> One other thought: I think the primary benefit of this feature is
> being able to see how many read IOs actually hit the disk, as opposed
> to hitting the OS page cache. That benefit disappears when using
> Direct IO, because then there is no OS page cache.
>
> How many years away do you think widespread general use of AIO+Direct
> IO is, though? I think that for the N years between now and then, it
> would be very nice to have this feature to help debug query
> performance problems. Then, once the numbers become too
> inaccurate/useless at some point, we could simply remove them again.
The AIO effort is something I haven't fully grasped yet, but Jelte's
comments seem reasonable to me.
Of course, as the one proposing this, I'm naturally biased toward
thinking it's beneficial.
What do you think?
>> Even without looking ahead to AIO, there's bgwriter, walwriter, and
>> checkpointer processes that all take I/O load away from foreground
>> processes. I don't really believe that this will produce useful
>> numbers.
>
> The bgwriter, walwriter, and checkpointer should only take away
> *write* IOs. For read IOs the numbers should be very accurate, and as
> explained above, read IOs are where I think the primary benefit of
> this feature lies.
>
> But even for write IOs I think the numbers would be useful when
> looking at them with the goal of finding out why a particular query
> is slow: if the bgwriter or checkpointer do the writes, then the
> query should be roughly as fast as if no writes to the disk had taken
> place at all; but if the query process does the writes, then those
> writes are probably blocking further execution of the query and thus
> slowing it down.
I agree with this as well.
For example, in a SELECT query executed immediately after a large number
of INSERTs, we can observe writes to storage caused by hint bit updates
and the WAL writes they can generate.
This makes the query take longer than a run in which these writes do not
occur.
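
As background, these counters come from getrusage(2) deltas taken at the
start and end of planning and execution. Below is a minimal,
PostgreSQL-independent sketch of that pattern (run_workload() is a
hypothetical placeholder for the work being measured, not a function in
the patch):

    #include <stdio.h>
    #include <sys/resource.h>

    /* Placeholder for the planning/execution work being measured. */
    static void
    run_workload(void)
    {
    }

    int
    main(void)
    {
        struct rusage before, after;

        if (getrusage(RUSAGE_SELF, &before) != 0)
            return 1;

        run_workload();

        if (getrusage(RUSAGE_SELF, &after) != 0)
            return 1;

        /* On Linux, ru_inblock/ru_oublock count 512-byte units. */
        printf("read=%ld times write=%ld times\n",
               after.ru_inblock - before.ru_inblock,
               after.ru_oublock - before.ru_oublock);
        return 0;
    }

The patch applies this same delta at the start and end of planning and
execution (and once per parallel worker).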
I think we can guess what is happening from the output:
postgres=# insert into t1 (select i, repeat('a', 1000) from
generate_series(1, 1000000) i);
INSERT 0 1000000
postgres=# explain analyze table t1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..382665.25 rows=21409025 width=36) (actual
time=1.926..11035.531 rows=1000100 loops=1)
Buffers: shared read=168575 dirtied=142858 written=142479
Planning:
Buffers: shared hit=3 read=3 written=1
Storage I/O: read=48 times write=16 times
Planning Time: 4.472 ms
Execution:
Storage I/O: read=2697272 times write=4480096 times // many writes
Execution Time: 11099.424 ms // slow
(9 rows)
postgres=# explain analyze table t1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on t1 (cost=0.00..382665.25 rows=21409025 width=36) (actual
time=2.066..2926.394 rows=1000100 loops=1)
Buffers: shared read=168575 written=14
Planning Time: 0.295 ms
Execution:
Storage I/O: read=2697200 times write=224 times // few writes
Execution Time: 3016.257 ms // fast
(6 rows)
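
For reference, the definition of t1 isn't shown above; the example
assumes a simple two-column table along these lines, created before the
INSERT (the column names here are my guess):

postgres=# create table t1 (i int, v text);
CREATE TABLE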
Regards,

--
Atsushi Torikoshi
Seconded from NTT DATA GROUP CORPORATION to SRA OSS K.K.
From 5459326fcfca8e119ce5d694f2180839958b3d5c Mon Sep 17 00:00:00 2001
From: Atsushi Torikoshi <torikos...@oss.nttdata.com>
Date: Mon, 10 Feb 2025 21:53:56 +0900
Subject: [PATCH v2] Add storage I/O tracking to 'BUFFERS' option
The 'BUFFERS' option currently indicates whether a block hit shared
buffers, but it does not distinguish between a hit in the OS cache and
an actual storage I/O operation.
While shared buffers and the OS cache offer similar performance,
storage I/O is significantly slower. By counting storage I/O reads and
writes, we can better identify whether storage I/O is a performance
bottleneck.
Added tracking of storage I/O usage by calling getrusage(2) at both the
planning and execution phase start and end points.
A more granular approach, tracking at each plan node as the current
BUFFERS option does, was also considered, but it proved impractical due
to the high performance cost of frequent getrusage() calls.
TODO:
I believe this information will be mainly useful in auto_explain.
I'll implement that later.
Reviewed-by: Jelte Fennema-Nio <postg...@jeltef.nl>
---
doc/src/sgml/ref/explain.sgml | 25 ++++--
src/backend/access/brin/brin.c | 8 +-
src/backend/access/nbtree/nbtsort.c | 8 +-
src/backend/commands/explain.c | 110 +++++++++++++++++++++++++-
src/backend/commands/prepare.c | 10 +++
src/backend/commands/vacuumparallel.c | 8 +-
src/backend/executor/execParallel.c | 35 ++++++--
src/backend/executor/instrument.c | 62 ++++++++++++++-
src/include/commands/explain.h | 1 +
src/include/executor/execParallel.h | 2 +
src/include/executor/instrument.h | 19 ++++-
src/include/port/win32/sys/resource.h | 2 +
src/port/win32getrusage.c | 4 +
src/test/regress/expected/explain.out | 39 ++++++++-
src/tools/pgindent/typedefs.list | 1 +
15 files changed, 296 insertions(+), 38 deletions(-)
diff --git a/doc/src/sgml/ref/explain.sgml b/doc/src/sgml/ref/explain.sgml
index 6361a14e65..7b45a34fc2 100644
--- a/doc/src/sgml/ref/explain.sgml
+++ b/doc/src/sgml/ref/explain.sgml
@@ -198,13 +198,24 @@ ROLLBACK;
previously unmodified blocks that were changed by this query; while the
number of blocks <emphasis>written</emphasis> indicates the number of
previously-dirtied blocks evicted from cache by this backend during
- query processing.
- The number of blocks shown for an
- upper-level node includes those used by all its child nodes. In text
- format, only non-zero values are printed. Buffers information is
- included by default when <literal>ANALYZE</literal> is used but
- otherwise is not included by default, but can be enabled using this
- option.
+ query processing. In text format, only non-zero values are printed.
+ If possible, this option also displays the number of read and write
+ operations performed on storage during the planning and execution phases,
+ shown at the end of the plan. These values are obtained from the
+ <function>getrusage()</function> system call. Note that on platforms that
+ do not report these counters through <function>getrusage()</function>, such as Windows, no output
+ will be shown, even if reads or writes actually occur. Additionally, even
+ on platforms where <function>getrusage()</function> is supported, if the
+ kernel is built without the necessary options to track storage read and
+ write operations, no output will be shown.
+ The timing and unit of measurement for read and write operations may vary
+ depending on the platform. For example, on Linux, a read is counted only
+ if this process caused data to be fetched from the storage layer, and a
+ write is counted at the page-dirtying time. On Linux, the unit of
+ measurement for read and write operations is 512 bytes.
+ Buffers information is included by default when <literal>ANALYZE</literal>
+ is used but otherwise is not included by default, but can be enabled using
+ this option.
</para>
</listitem>
</varlistentry>
diff --git a/src/backend/access/brin/brin.c b/src/backend/access/brin/brin.c
index ccf824bbdb..fc711be2e1 100644
--- a/src/backend/access/brin/brin.c
+++ b/src/backend/access/brin/brin.c
@@ -2551,7 +2551,7 @@ _brin_end_parallel(BrinLeader *brinleader, BrinBuildState *state)
* or we might get incomplete data.)
*/
for (i = 0; i < brinleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&brinleader->bufferusage[i], &brinleader->walusage[i]);
+ InstrAccumParallelQuery(&brinleader->bufferusage[i], NULL, &brinleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(brinleader->snapshot))
@@ -2913,7 +2913,7 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
tuplesort_attach_shared(sharedsort, seg);
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/*
* Might as well use reliable figure when doling out maintenance_work_mem
@@ -2928,8 +2928,8 @@ _brin_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
index_close(indexRel, indexLockmode);
table_close(heapRel, heapLockmode);
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 7aba852db9..98cfde8875 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1619,7 +1619,7 @@ _bt_end_parallel(BTLeader *btleader)
* or we might get incomplete data.)
*/
for (i = 0; i < btleader->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&btleader->bufferusage[i], &btleader->walusage[i]);
+ InstrAccumParallelQuery(&btleader->bufferusage[i], NULL, &btleader->walusage[i]);
/* Free last reference to MVCC snapshot, if one was used */
if (IsMVCCSnapshot(btleader->snapshot))
@@ -1827,7 +1827,7 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
}
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Perform sorting of spool, and possibly a spool2 */
sortmem = maintenance_work_mem / btshared->scantuplesortstates;
@@ -1837,8 +1837,8 @@ _bt_parallel_build_main(dsm_segment *seg, shm_toc *toc)
/* Report WAL/buffer usage during parallel execution */
bufferusage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
walusage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber],
- &walusage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&bufferusage[ParallelWorkerNumber], NULL,
+ &walusage[ParallelWorkerNumber], NULL);
#ifdef BTREE_BUILD_STATS
if (log_btree_build_stats)
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index c24e66f82e..d23bbb3699 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -145,6 +145,8 @@ static void show_foreignscan_info(ForeignScanState *fsstate, ExplainState *es);
static const char *explain_get_index_name(Oid indexId);
static bool peek_buffer_usage(ExplainState *es, const BufferUsage *usage);
static void show_buffer_usage(ExplainState *es, const BufferUsage *usage);
+static bool peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
+static void show_storageio_usage(ExplainState *es, const StorageIOUsage *usage);
static void show_wal_usage(ExplainState *es, const WalUsage *usage);
static void show_memory_counters(ExplainState *es,
const MemoryContextCounters *mem_counters);
@@ -475,6 +477,8 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -496,7 +500,10 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
INSTR_TIME_SET_CURRENT(planstart);
/* plan the query */
@@ -516,11 +523,16 @@ standard_ExplainOneQuery(Query *query, int cursorOptions,
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+
+ GetStorageIOUsage(&storageio);
+ storageio.inblock -= storageio_start.inblock;
+ storageio.outblock -= storageio_start.outblock;
}
/* run it (if needed) and produce output */
ExplainOnePlan(plan, into, es, queryString, params, queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->buffers ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
}
@@ -644,7 +656,7 @@ void
ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
const char *queryString, ParamListInfo params,
QueryEnvironment *queryEnv, const instr_time *planduration,
- const BufferUsage *bufusage,
+ const BufferUsage *bufusage, const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters)
{
DestReceiver *dest;
@@ -654,6 +666,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
int eflags;
int instrument_option = 0;
SerializeMetrics serializeMetrics = {0};
+ StorageIOUsage storageio_start;
Assert(plannedstmt->commandType != CMD_UTILITY);
@@ -663,7 +676,19 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
instrument_option |= INSTRUMENT_ROWS;
if (es->buffers)
+ {
+ GetStorageIOUsage(&storageio_start);
+
+ /*
+ * Initialize the global counters that accumulate parallel workers'
+ * usage. Even if the query is cancelled partway through, EXPLAIN
+ * always passes through here, so it is safe to initialize them here.
+ */
+ pgStorageIOUsageParallel.inblock = 0;
+ pgStorageIOUsageParallel.outblock = 0;
+
instrument_option |= INSTRUMENT_BUFFERS;
+ }
if (es->wal)
instrument_option |= INSTRUMENT_WAL;
@@ -747,8 +772,9 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
/* Create textual dump of plan tree */
ExplainPrintPlan(es, queryDesc);
- /* Show buffer and/or memory usage in planning */
- if (peek_buffer_usage(es, bufusage) || mem_counters)
+ /* Show buffer, storage I/O, and/or memory usage in planning */
+ if (peek_buffer_usage(es, bufusage) || peek_storageio_usage(es, planstorageio) ||
+ mem_counters)
{
ExplainOpenGroup("Planning", "Planning", true, es);
@@ -760,8 +786,10 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
}
if (bufusage)
+ {
show_buffer_usage(es, bufusage);
-
+ show_storageio_usage(es, planstorageio);
+ }
if (mem_counters)
show_memory_counters(es, mem_counters);
@@ -813,6 +841,35 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
totaltime += elapsed_time(&starttime);
+ /* Show storage I/O usage in execution */
+ if (es->buffers)
+ {
+ StorageIOUsage storageio = {0};
+ StorageIOUsage storageio_end;
+
+ GetStorageIOUsage(&storageio_end);
+ StorageIOUsageAccumDiff(&storageio, &storageio_end, &storageio_start);
+ StorageIOUsageAdd(&storageio, &pgStorageIOUsageParallel);
+
+ if (peek_storageio_usage(es, &storageio))
+ {
+ ExplainOpenGroup("Execution", "Execution", true, es);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Execution:\n");
+ es->indent++;
+ }
+ show_storageio_usage(es, &storageio);
+
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ es->indent--;
+
+ ExplainCloseGroup("Execution", "Execution", true, es);
+ }
+ }
+
/*
* We only report execution time if we actually ran the query (that is,
* the user specified ANALYZE), and if summary reporting is enabled (the
@@ -4232,6 +4289,51 @@ show_buffer_usage(ExplainState *es, const BufferUsage *usage)
}
}
+/*
+ * Return whether show_storageio_usage would have anything to print, if given
+ * the same 'usage' data. Note that when the format is anything other than
+ * text, we print even if the counters are all zeroes.
+ */
+static bool
+peek_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (usage == NULL)
+ return false;
+
+ if (es->format != EXPLAIN_FORMAT_TEXT)
+ return true;
+
+ return usage->inblock > 0 || usage->outblock > 0;
+}
+
+/*
+ * Show storage I/O usage.
+ */
+static void
+show_storageio_usage(ExplainState *es, const StorageIOUsage *usage)
+{
+ if (es->format == EXPLAIN_FORMAT_TEXT)
+ {
+ /* Show only positive counter values. */
+ if (usage->inblock <= 0 && usage->outblock <= 0)
+ return;
+
+ ExplainIndentText(es);
+ appendStringInfoString(es->str, "Storage I/O:");
+ appendStringInfo(es->str, " read=%ld times", (long) usage->inblock);
+ appendStringInfo(es->str, " write=%ld times", (long) usage->outblock);
+
+ appendStringInfoChar(es->str, '\n');
+ }
+ else
+ {
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->inblock, es);
+ ExplainPropertyInteger("Storage I/O Read", NULL,
+ usage->outblock, es);
+ }
+}
+
/*
* Show WAL usage details.
*/
diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
index 8989c0c882..e11601b21d 100644
--- a/src/backend/commands/prepare.c
+++ b/src/backend/commands/prepare.c
@@ -579,6 +579,8 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
instr_time planduration;
BufferUsage bufusage_start,
bufusage;
+ StorageIOUsage storageio,
+ storageio_start;
MemoryContextCounters mem_counters;
MemoryContext planner_ctx = NULL;
MemoryContext saved_ctx = NULL;
@@ -594,7 +596,11 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
}
if (es->buffers)
+ {
bufusage_start = pgBufferUsage;
+ GetStorageIOUsage(&storageio_start);
+ }
+
INSTR_TIME_SET_CURRENT(planstart);
/* Look it up in the hash table */
@@ -644,6 +650,9 @@
{
memset(&bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(&bufusage, &pgBufferUsage, &bufusage_start);
+ GetStorageIOUsage(&storageio);
+ storageio.inblock -= storageio_start.inblock;
+ storageio.outblock -= storageio_start.outblock;
}
plan_list = cplan->stmt_list;
@@ -656,6 +665,7 @@
if (pstmt->commandType != CMD_UTILITY)
ExplainOnePlan(pstmt, into, es, query_string, paramLI, pstate->p_queryEnv,
&planduration, (es->buffers ? &bufusage : NULL),
+ (es->buffers ? &storageio : NULL),
es->memory ? &mem_counters : NULL);
else
ExplainOneUtility(pstmt->utilityStmt, into, es, pstate, paramLI);
diff --git a/src/backend/commands/vacuumparallel.c b/src/backend/commands/vacuumparallel.c
index dc3322c256..cd4bf22082 100644
--- a/src/backend/commands/vacuumparallel.c
+++ b/src/backend/commands/vacuumparallel.c
@@ -737,7 +737,7 @@ parallel_vacuum_process_all_indexes(ParallelVacuumState *pvs, int num_index_scan
WaitForParallelWorkersToFinish(pvs->pcxt);
for (int i = 0; i < pvs->pcxt->nworkers_launched; i++)
- InstrAccumParallelQuery(&pvs->buffer_usage[i], &pvs->wal_usage[i]);
+ InstrAccumParallelQuery(&pvs->buffer_usage[i], NULL, &pvs->wal_usage[i]);
}
/*
@@ -1083,7 +1083,7 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
error_context_stack = &errcallback;
/* Prepare to track buffer usage during parallel execution */
- InstrStartParallelQuery();
+ InstrStartParallelQuery(NULL);
/* Process indexes to perform vacuum/cleanup */
parallel_vacuum_process_safe_indexes(&pvs);
@@ -1091,8 +1091,8 @@ parallel_vacuum_main(dsm_segment *seg, shm_toc *toc)
/* Report buffer/WAL usage during parallel execution */
buffer_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_BUFFER_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_VACUUM_KEY_WAL_USAGE, false);
- InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber], NULL,
+ &wal_usage[ParallelWorkerNumber], NULL);
TidStoreDetach(dead_items);
diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
index 134ff62f5c..aced0bbe4b 100644
--- a/src/backend/executor/execParallel.c
+++ b/src/backend/executor/execParallel.c
@@ -64,6 +64,7 @@
#define PARALLEL_KEY_QUERY_TEXT UINT64CONST(0xE000000000000008)
#define PARALLEL_KEY_JIT_INSTRUMENTATION UINT64CONST(0xE000000000000009)
#define PARALLEL_KEY_WAL_USAGE UINT64CONST(0xE00000000000000A)
+#define PARALLEL_KEY_STORAGEIO_USAGE UINT64CONST(0xE00000000000000B)
#define PARALLEL_TUPLE_QUEUE_SIZE 65536
@@ -599,6 +600,7 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
char *pstmt_space;
char *paramlistinfo_space;
BufferUsage *bufusage_space;
+ StorageIOUsage *storageiousage_space;
WalUsage *walusage_space;
SharedExecutorInstrumentation *instrumentation = NULL;
SharedJitInstrumentation *jit_instrumentation = NULL;
@@ -680,6 +682,13 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
mul_size(sizeof(WalUsage), pcxt->nworkers));
shm_toc_estimate_keys(&pcxt->estimator, 1);
+ /*
+ * Same thing for StorageIOUsage.
+ */
+ shm_toc_estimate_chunk(&pcxt->estimator,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_estimate_keys(&pcxt->estimator, 1);
+
/* Estimate space for tuple queues. */
shm_toc_estimate_chunk(&pcxt->estimator,
mul_size(PARALLEL_TUPLE_QUEUE_SIZE, pcxt->nworkers));
@@ -775,6 +784,12 @@ ExecInitParallelPlan(PlanState *planstate, EState *estate,
shm_toc_insert(pcxt->toc, PARALLEL_KEY_WAL_USAGE, walusage_space);
pei->wal_usage = walusage_space;
+ /* Same for StorageIOUsage. */
+ storageiousage_space = shm_toc_allocate(pcxt->toc,
+ mul_size(sizeof(StorageIOUsage), pcxt->nworkers));
+ shm_toc_insert(pcxt->toc, PARALLEL_KEY_STORAGEIO_USAGE, storageiousage_space);
+ pei->storageio_usage = storageiousage_space;
+
/* Set up the tuple queues that the workers will write into. */
pei->tqueue = ExecParallelSetupTupleQueues(pcxt, false);
@@ -1170,11 +1185,11 @@ ExecParallelFinish(ParallelExecutorInfo *pei)
WaitForParallelWorkersToFinish(pei->pcxt);
/*
- * Next, accumulate buffer/WAL usage. (This must wait for the workers to
- * finish, or we might get incomplete data.)
+ * Next, accumulate buffer, WAL, and storage I/O usage. (This must wait
+ * for the workers to finish, or we might get incomplete data.)
*/
for (i = 0; i < nworkers; i++)
- InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->wal_usage[i]);
+ InstrAccumParallelQuery(&pei->buffer_usage[i], &pei->storageio_usage[i], &pei->wal_usage[i]);
pei->finished = true;
}
@@ -1406,6 +1421,8 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
{
FixedParallelExecutorState *fpes;
BufferUsage *buffer_usage;
+ StorageIOUsage *storageio_usage;
+ StorageIOUsage storageio_usage_start = {0};
WalUsage *wal_usage;
DestReceiver *receiver;
QueryDesc *queryDesc;
@@ -1459,13 +1476,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
ExecSetTupleBound(fpes->tuples_needed, queryDesc->planstate);
/*
- * Prepare to track buffer/WAL usage during query execution.
+ * Prepare to track buffer, WAL, and storage I/O usage during query
+ * execution.
*
* We do this after starting up the executor to match what happens in the
* leader, which also doesn't count buffer accesses and WAL activity that
* occur during executor startup.
*/
- InstrStartParallelQuery();
+ InstrStartParallelQuery(&storageio_usage_start);
/*
* Run the plan. If we specified a tuple bound, be careful not to demand
@@ -1478,11 +1496,14 @@ ParallelQueryMain(dsm_segment *seg, shm_toc *toc)
/* Shut down the executor */
ExecutorFinish(queryDesc);
- /* Report buffer/WAL usage during parallel execution. */
+ /* Report buffer, WAL, and storage I/O usage during parallel execution. */
buffer_usage = shm_toc_lookup(toc, PARALLEL_KEY_BUFFER_USAGE, false);
+ storageio_usage = shm_toc_lookup(toc, PARALLEL_KEY_STORAGEIO_USAGE, false);
wal_usage = shm_toc_lookup(toc, PARALLEL_KEY_WAL_USAGE, false);
InstrEndParallelQuery(&buffer_usage[ParallelWorkerNumber],
- &wal_usage[ParallelWorkerNumber]);
+ &storageio_usage[ParallelWorkerNumber],
+ &wal_usage[ParallelWorkerNumber],
+ &storageio_usage_start);
/* Report instrumentation data if any instrumentation options are set. */
if (instrumentation != NULL)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 2d3569b374..94664d5fab 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -13,16 +13,21 @@
*/
#include "postgres.h"
+#include <sys/resource.h>
#include <unistd.h>
#include "executor/instrument.h"
BufferUsage pgBufferUsage;
static BufferUsage save_pgBufferUsage;
+
+/* Only count parallel workers' usage */
+StorageIOUsage pgStorageIOUsageParallel;
WalUsage pgWalUsage;
static WalUsage save_pgWalUsage;
static void BufferUsageAdd(BufferUsage *dst, const BufferUsage *add);
+void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
static void WalUsageAdd(WalUsage *dst, WalUsage *add);
@@ -197,27 +202,46 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
/* note current values during parallel executor startup */
void
-InstrStartParallelQuery(void)
+InstrStartParallelQuery(StorageIOUsage *storageiousage)
{
save_pgBufferUsage = pgBufferUsage;
save_pgWalUsage = pgWalUsage;
+
+ if (storageiousage != NULL)
+ GetStorageIOUsage(storageiousage);
}
/* report usage after parallel executor shutdown */
void
-InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start)
{
memset(bufusage, 0, sizeof(BufferUsage));
BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
+
+ if (storageiousage != NULL && storageiousage_start != NULL)
+ {
+ StorageIOUsage storageiousage_end;
+
+ GetStorageIOUsage(&storageiousage_end);
+
+ memset(storageiousage, 0, sizeof(StorageIOUsage));
+ StorageIOUsageAccumDiff(storageiousage, &storageiousage_end, storageiousage_start);
+
+ ereport(DEBUG1,
+ (errmsg("Parallel worker's storage I/O times: inblock:%ld outblock:%ld",
+ storageiousage->inblock, storageiousage->outblock)));
+ }
memset(walusage, 0, sizeof(WalUsage));
WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
}
/* accumulate work done by workers in leader's stats */
void
-InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
+InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage)
{
BufferUsageAdd(&pgBufferUsage, bufusage);
+ if (storageiousage != NULL)
+ StorageIOUsageAdd(&pgStorageIOUsageParallel, storageiousage);
WalUsageAdd(&pgWalUsage, walusage);
}
@@ -273,6 +297,38 @@ BufferUsageAccumDiff(BufferUsage *dst,
add->temp_blk_write_time, sub->temp_blk_write_time);
}
+/* helper functions for storage I/O usage accumulation */
+void
+StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add)
+{
+ dst->inblock += add->inblock;
+ dst->outblock += add->outblock;
+}
+
+/* dst += add - sub */
+void
+StorageIOUsageAccumDiff(StorageIOUsage *dst, const StorageIOUsage *add, const StorageIOUsage *sub)
+{
+ dst->inblock += add->inblock - sub->inblock;
+ dst->outblock += add->outblock - sub->outblock;
+}
+
+/* Captures the current storage I/O usage statistics */
+void
+GetStorageIOUsage(StorageIOUsage *usage)
+{
+ struct rusage rusage;
+
+ if (getrusage(RUSAGE_SELF, &rusage))
+ {
+ ereport(ERROR,
+ (errcode(ERRCODE_SYSTEM_ERROR),
+ errmsg("getrusage() failed: %m")));
+ }
+ usage->inblock = rusage.ru_inblock;
+ usage->outblock = rusage.ru_oublock;
+}
+
/* helper functions for WAL usage accumulation */
static void
WalUsageAdd(WalUsage *dst, WalUsage *add)
diff --git a/src/include/commands/explain.h b/src/include/commands/explain.h
index ea7419951f..8e67dd57b3 100644
--- a/src/include/commands/explain.h
+++ b/src/include/commands/explain.h
@@ -108,6 +108,7 @@ extern void ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into,
ParamListInfo params, QueryEnvironment *queryEnv,
const instr_time *planduration,
const BufferUsage *bufusage,
+ const StorageIOUsage *planstorageio,
const MemoryContextCounters *mem_counters);
extern void ExplainPrintPlan(ExplainState *es, QueryDesc *queryDesc);
diff --git a/src/include/executor/execParallel.h b/src/include/executor/execParallel.h
index 5e7106c397..5c8bc76c53 100644
--- a/src/include/executor/execParallel.h
+++ b/src/include/executor/execParallel.h
@@ -26,6 +26,8 @@ typedef struct ParallelExecutorInfo
PlanState *planstate; /* plan subtree we're running in parallel */
ParallelContext *pcxt; /* parallel context we're using */
BufferUsage *buffer_usage; /* points to bufusage area in DSM */
+ StorageIOUsage *storageio_usage; /* points to storageio usage area in
+ * DSM */
WalUsage *wal_usage; /* walusage area in DSM */
SharedExecutorInstrumentation *instrumentation; /* optional */
struct SharedJitInstrumentation *jit_instrumentation; /* optional */
diff --git a/src/include/executor/instrument.h b/src/include/executor/instrument.h
index 5a6eff75c6..7fc7281b1a 100644
--- a/src/include/executor/instrument.h
+++ b/src/include/executor/instrument.h
@@ -41,6 +41,14 @@ typedef struct BufferUsage
instr_time temp_blk_write_time; /* time spent writing temp blocks */
} BufferUsage;
+typedef struct StorageIOUsage
+{
+ long inblock; /* # of times the file system had to perform
+ * input */
+ long outblock; /* # of times the file system had to perform
+ * output */
+} StorageIOUsage;
+
/*
* WalUsage tracks only WAL activity like WAL records generation that
* can be measured per query and is displayed by EXPLAIN command,
@@ -99,6 +107,7 @@ typedef struct WorkerInstrumentation
} WorkerInstrumentation;
extern PGDLLIMPORT BufferUsage pgBufferUsage;
+extern PGDLLIMPORT StorageIOUsage pgStorageIOUsageParallel;
extern PGDLLIMPORT WalUsage pgWalUsage;
extern Instrumentation *InstrAlloc(int n, int instrument_options,
@@ -109,11 +118,15 @@ extern void InstrStopNode(Instrumentation *instr, double nTuples);
extern void InstrUpdateTupleCount(Instrumentation *instr, double nTuples);
extern void InstrEndLoop(Instrumentation *instr);
extern void InstrAggNode(Instrumentation *dst, Instrumentation *add);
-extern void InstrStartParallelQuery(void);
-extern void InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
-extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage);
+extern void InstrStartParallelQuery(StorageIOUsage *storageiousage);
+extern void InstrEndParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage, StorageIOUsage *storageiousage_start);
+extern void InstrAccumParallelQuery(BufferUsage *bufusage, StorageIOUsage *storageiousage, WalUsage *walusage);
extern void BufferUsageAccumDiff(BufferUsage *dst,
const BufferUsage *add, const BufferUsage *sub);
+extern void StorageIOUsageAccumDiff(StorageIOUsage *dst,
+ const StorageIOUsage *add, const StorageIOUsage *sub);
+extern void StorageIOUsageAdd(StorageIOUsage *dst, const StorageIOUsage *add);
+extern void GetStorageIOUsage(StorageIOUsage *usage);
extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
const WalUsage *sub);
diff --git a/src/include/port/win32/sys/resource.h b/src/include/port/win32/sys/resource.h
index a14feeb584..270dc37c84 100644
--- a/src/include/port/win32/sys/resource.h
+++ b/src/include/port/win32/sys/resource.h
@@ -13,6 +13,8 @@ struct rusage
{
struct timeval ru_utime; /* user time used */
struct timeval ru_stime; /* system time used */
+ long ru_inblock; /* Currently always 0 for Windows */
+ long ru_oublock; /* Currently always 0 for Windows */
};
extern int getrusage(int who, struct rusage *rusage);
diff --git a/src/port/win32getrusage.c b/src/port/win32getrusage.c
index 6a197c9437..27f0ea052a 100644
--- a/src/port/win32getrusage.c
+++ b/src/port/win32getrusage.c
@@ -57,5 +57,9 @@ getrusage(int who, struct rusage *rusage)
rusage->ru_utime.tv_sec = li.QuadPart / 1000000L;
rusage->ru_utime.tv_usec = li.QuadPart % 1000000L;
+ /* Currently always 0 for Windows */
+ rusage->ru_inblock = 0;
+ rusage->ru_oublock = 0;
+
return 0;
}
diff --git a/src/test/regress/expected/explain.out b/src/test/regress/expected/explain.out
index ee31e41d50..f569d8f6fd 100644
--- a/src/test/regress/expected/explain.out
+++ b/src/test/regress/expected/explain.out
@@ -127,10 +127,16 @@ select explain_filter('explain (analyze, buffers, format xml) select * from int8
<Local-Written-Blocks>N</Local-Written-Blocks> +
<Temp-Read-Blocks>N</Temp-Read-Blocks> +
<Temp-Written-Blocks>N</Temp-Written-Blocks> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Write>N</Storage-I-O-Write> +
</Planning> +
<Planning-Time>N.N</Planning-Time> +
<Triggers> +
</Triggers> +
+ <Execution> +
+ <Storage-I-O-Read>N</Storage-I-O-Read> +
+ <Storage-I-O-Write>N</Storage-I-O-Write> +
+ </Execution> +
<Execution-Time>N.N</Execution-Time> +
</Query> +
</explain>
@@ -175,6 +181,8 @@ select explain_filter('explain (analyze, serialize, buffers, format yaml) select
Local Written Blocks: N +
Temp Read Blocks: N +
Temp Written Blocks: N +
+ Storage I/O Read: N +
+ Storage I/O Write: N +
Planning Time: N.N +
Triggers: +
Serialization: +
@@ -191,6 +199,9 @@ select explain_filter('explain (analyze, serialize, buffers, format yaml) select
Local Written Blocks: N +
Temp Read Blocks: N +
Temp Written Blocks: N +
+ Execution: +
+ Storage I/O Read: N +
+ Storage I/O Write: N +
Execution Time: N.N
(1 row)
@@ -237,7 +248,13 @@ select explain_filter('explain (buffers, format json) select * from int8_tbl i8'
"Local Dirtied Blocks": N, +
"Local Written Blocks": N, +
"Temp Read Blocks": N, +
- "Temp Written Blocks": N +
+ "Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
} +
} +
]
@@ -299,11 +316,17 @@ select explain_filter('explain (analyze, buffers, format json) select * from int
"Local I/O Read Time": N.N, +
"Local I/O Write Time": N.N, +
"Temp I/O Read Time": N.N, +
- "Temp I/O Write Time": N.N +
+ "Temp I/O Write Time": N.N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
}, +
"Planning Time": N.N, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
"Execution Time": N.N +
} +
]
@@ -423,12 +446,18 @@ select explain_filter('explain (memory, analyze, format json) select * from int8
"Local Written Blocks": N, +
"Temp Read Blocks": N, +
"Temp Written Blocks": N, +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N, +
"Memory Used": N, +
"Memory Allocated": N +
}, +
"Planning Time": N.N, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": N, +
+ "Storage I/O Read": N +
+ }, +
"Execution Time": N.N +
} +
]
@@ -641,6 +670,8 @@ select jsonb_pretty(
}, +
"Planning": { +
"Local Hit Blocks": 0, +
+ "Storage I/O Read": 0, +
"Temp Read Blocks": 0, +
"Local Read Blocks": 0, +
"Shared Hit Blocks": 0, +
+ "Storage I/O Write": 0, +
@@ -653,6 +684,10 @@
}, +
"Triggers": [ +
], +
+ "Execution": { +
+ "Storage I/O Read": 0, +
+ "Storage I/O Write": 0 +
+ }, +
"Planning Time": 0.0, +
"Execution Time": 0.0 +
} +
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index 9a3bee93de..a5eeaca55c 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2558,6 +2558,7 @@ SSL
SSLExtensionInfoContext
SSL_CTX
STARTUPINFO
+StorageIOUsage
STRLEN
SV
SYNCHRONIZATION_BARRIER
base-commit: 3d17d7d7fb7a4603b48acb275b5a416f110db464
--
2.43.0