Hi,
On 2025-10-31 00:18:04 -0700, Lukas Fittl wrote:
> Attached v4 patchset that addresses feedback (unless otherwise noted below)
> and is rebased on master. Other changes:
>
> [...]
> - Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
>   illustrates how the stack mechanism can be used to separate index and table
>   buffer accesses in the EXPLAIN for Index Scans

Nice! I pushed 0001. The only changes I made were to break a few long lines.


> From 324666b35d4513676783f0c352ad3a27371c08d8 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v4 2/7] Separate node and trigger instrumentation from other
>  use of Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This overuse of the same struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time and
> WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
> per-node instrumentation - these calls were added without any apparent
> benefit since the relevant fields were never read.
>
> Introduce the NodeInstrumentation struct to carry forward the per-node
> instrumentation information, and introduce TriggerInstrumentation to
> capture trigger timing and firings (previously counted in "ntuples").

Hm. It looks to me that after moving trigger instrumentation to its own
thing, there are no users of InstrAlloc() with n > 1. I think it may make
sense to drop that?


> diff --git a/src/backend/executor/execParallel.c b/src/backend/executor/execParallel.c
> index f098a5557cf..e87810d292e 100644
> --- a/src/backend/executor/execParallel.c
> +++ b/src/backend/executor/execParallel.c
> @@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
>   * instrument_options: Same meaning here as in instrument.c.
>   *
>   * instrument_offset: Offset, relative to the start of this structure,
> - * of the first Instrumentation object. This will depend on the length of
> + * of the first NodeInstrumentation object. This will depend on the length of
>   * the plan_node_id array.
>   *
>   * num_workers: Number of workers.
> @@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
>      int         num_workers;
>      int         num_plan_nodes;
>      int         plan_node_id[FLEXIBLE_ARRAY_MEMBER];
> -    /* array of num_plan_nodes * num_workers Instrumentation objects follows */
> +
> +    /*
> +     * array of num_plan_nodes * num_workers NodeInstrumentation objects
> +     * follows
> +     */
>  };

Spurious change, I think?


> +
> +/* Trigger instrumentation handling */
> +TriggerInstrumentation *
> +InstrAllocTrigger(int n, int instrument_options)
> +{
> +    TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
> +    bool        need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> +    bool        need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> +    bool        need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
> +    int         i;
> +
> +    for (i = 0; i < n; i++)
> +    {
> +        tginstr[i].instr.need_bufusage = need_buffers;
> +        tginstr[i].instr.need_walusage = need_wal;
> +        tginstr[i].instr.need_timer = need_timer;
> +    }
> +
> +    return tginstr;

Hm. Not that it's a huge difference, but I wonder if we ought to use
InstrInit() here and in InstrAlloc(), InstrAllocNode(), to avoid repetition?
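
For the trigger case I'm thinking of something roughly like the below -
untested sketch, just to illustrate, and assuming InstrInit() keeps its
current signature of translating instrument_options into the need_* flags:

TriggerInstrumentation *
InstrAllocTrigger(int n, int instrument_options)
{
    TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));

    /* let InstrInit() set need_bufusage/need_walusage/need_timer */
    for (int i = 0; i < n; i++)
        InstrInit(&tginstr[i].instr, instrument_options);

    return tginstr;
}
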

> +/* Trigger instrumentation */
> +typedef struct TriggerInstrumentation
> +{
> +    Instrumentation instr;
> +    int         firings;        /* number of times the instrumented trigger
> +                                 * was fired */
> +} TriggerInstrumentation;
> +
> +/*
> + * Specialized instrumentation for per-node execution statistics
> + */
> +typedef struct NodeInstrumentation
>  {
>      /* Parameters set at node creation: */
>      bool        need_timer;     /* true if we need timer data */
> @@ -92,25 +125,34 @@ typedef struct Instrumentation
>      double      nfiltered2;     /* # of tuples removed by "other" quals */
>      BufferUsage bufusage;       /* total buffer usage */
>      WalUsage    walusage;       /* total WAL usage */
> -} Instrumentation;
> +} NodeInstrumentation;

The indentation here will look better if you add TriggerInstrumentation,
NodeInstrumentation to typedefs.list


>  typedef struct WorkerInstrumentation
>  {
>      int         num_workers;    /* # of structures that follow */
> -    Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
> +    NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
>  } WorkerInstrumentation;

Hm. Shouldn't this be WorkerNodeInstrumentation now?


> From 74f44adc505a436a65d6069b286c8a878d4fe4af Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sun, 31 Aug 2025 16:34:42 -0700
> Subject: [PATCH v4 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
>  with INSTR_* macros

Looks sane to me. I'd move that to the head of the queue in the next revision.


> From e03c96cbd3079c03ae63b6427937b79edaa9562b Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through
>  stack-based instrumentation
>
> Previously, in order to determine the buffer/WAL usage of a given code
> section, we utilized continuously incrementing global counters that get
> updated when the actual activity (e.g. shared block read) occurred, and then
> calculated a diff when the code section ended. This resulted in a bottleneck
> for executor node instrumentation specifically, with the function
> BufferUsageAccumDiff showing up in profiles and in some cases adding up to
> 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
>
> Instead, introduce a stack-based mechanism, where the actual activity writes
> into the current stack entry. In the case of executor nodes, this means that
> each node gets its own stack entry that is pushed at InstrStartNode, and
> popped at InstrEndNode. Stack entries are zero initialized (avoiding the
> diff mechanism) and get added to their parent entry when they are finalized,
> i.e. no more modifications can occur.
>
> To correctly handle abort situations, any use of instrumentation stacks must
> involve either a top-level Instrumentation struct, and its associated
> InstrStart/InstrStop helpers (which use resource owners to handle aborts),
> or dedicated PG_TRY/PG_FINALLY calls that ensure the stack is in a
> consistent state after an abort.

I think it may be good to have some tests for edge cases. E.g. testing that
an explain analyze of a query that executes a plpgsql function, which in turn
does another explain analyze or just another query without explain analyze,
does something reasonable.
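
FWIW, the usage pattern I assume the commit message has in mind for the
dedicated PG_TRY/PG_FINALLY case is roughly the following - sketch only,
using the InstrStack/InstrPushStack/InstrPopStack names from the patch, and
whether the error path really should pass finalize=true is exactly the kind
of detail such tests could pin down:

    InstrStack *stack = palloc0(sizeof(InstrStack));    /* entries start zeroed */

    InstrPushStack(stack);
    PG_TRY();
    {
        /* ... instrumented code section ... */
    }
    PG_FINALLY();
    {
        /* pop on both success and error, so the stack stays consistent */
        InstrPopStack(stack, true);
    }
    PG_END_TRY();
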

> @@ -1089,8 +1074,13 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
>                  PGSS_EXEC,
>                  INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
>                  queryDesc->estate->es_total_processed,
> -                &queryDesc->totaltime->bufusage,
> -                &queryDesc->totaltime->walusage,
> +
> +                /*
> +                 * Check if stack is initialized - it is not when ExecutorRun wasn't
> +                 * called
> +                 */
> +                queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->bufusage : NULL,
> +                queryDesc->totaltime->stack ? &queryDesc->totaltime->stack->walusage : NULL,
>                  queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
>                  NULL,
>                  queryDesc->estate->es_parallel_workers_to_launch,

Maybe it's just the diff, but this looks a bit odd, indentation wise.


> @@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
>          resultRelInfo->ri_TrigWhenExprs = (ExprState **)
>              palloc0(n * sizeof(ExprState *));
>          if (instrument_options)
> -            resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, instrument_options);
> +        {
> +            /*
> +             * Triggers do not individually track buffer/WAL usage, even if
> +             * otherwise tracked
> +             */

Why is that?


> @@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr)
>          !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
>          elog(ERROR, "InstrStart called twice in a row");
>
> -    if (instr->need_bufusage)
> -        instr->bufusage_start = pgBufferUsage;
> +    if (instr->need_bufusage || instr->need_walusage)
> +    {
> +        Assert(CurrentResourceOwner != NULL);
> +        instr->owner = CurrentResourceOwner;
> +
> +        /*
> +         * Allocate the stack resource in a memory context that survives
> +         * during an abort. This will be freed by InstrStop (regular
> +         * execution) or ResOwnerReleaseInstrumentation (abort).
> +         *
> +         * We don't do this in InstrAlloc to avoid leaking when InstrStart +
> +         * InstrStop isn't called.
> +         */
> +        if (instr->stack == NULL)
> +            instr->stack = MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));
>
> -    if (instr->need_walusage)
> -        instr->walusage_start = pgWalUsage;
> +        ResourceOwnerEnlarge(instr->owner);
> +        ResourceOwnerRememberInstrStack(instr->owner, instr->stack);
> +
> +        InstrPushStack(instr->stack);
> +    }
>  }

I'm still worried that allocating something in CurTransactionContext will
bite us eventually. It seems like it'd be pretty sane to want to have
instrumentation around procedure execution, for example.


>  void
> -InstrStop(Instrumentation *instr)
> +InstrStop(Instrumentation *instr, bool finalize)
>  {
>      instr_time  endtime;
>
> @@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr)
>          INSTR_TIME_SET_ZERO(instr->starttime);
>      }
>
> -    /* Add delta of buffer usage since entry to node's totals */
> -    if (instr->need_bufusage)
> -        BufferUsageAccumDiff(&instr->bufusage,
> -                             &pgBufferUsage, &instr->bufusage_start);
> +    if (instr->need_bufusage || instr->need_walusage)
> +    {
> +        InstrPopStack(instr->stack, finalize);
>
> -    if (instr->need_walusage)
> -        WalUsageAccumDiff(&instr->walusage,
> -                          &pgWalUsage, &instr->walusage_start);
> +        Assert(instr->owner != NULL);
> +        ResourceOwnerForgetInstrStack(instr->owner, instr->stack);
> +        instr->owner = NULL;
> +
> +        if (finalize)
> +        {
> +            /*
> +             * To avoid keeping memory allocated beyond when its needed, copy
> +             * the result to the current memory context, and free it in the
> +             * transaction context.
> +             */
> +            InstrStack *stack = palloc(sizeof(InstrStack));
> +
> +            memcpy(stack, instr->stack, sizeof(InstrStack));
> +            pfree(instr->stack);
> +            instr->stack = stack;
> +        }
> +    }
>  }

Hm.
I'm not entirely sure I understand the gain due to the palloc & copy & pfree.
Won't that typically increase memory usage due to temporarily having two
versions of the InstrStack?


>  /* Trigger instrumentation handling */
> @@ -98,15 +194,20 @@ TriggerInstrumentation *
>  InstrAllocTrigger(int n, int instrument_options)
>  {
>      TriggerInstrumentation *tginstr = palloc0(n * sizeof(TriggerInstrumentation));
> -    bool        need_buffers = (instrument_options & INSTRUMENT_BUFFERS) != 0;
> -    bool        need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
>      bool        need_timer = (instrument_options & INSTRUMENT_TIMER) != 0;
>      int         i;
>
> +    /*
> +     * To avoid having to determine when the last trigger fired, we never
> +     * track WAL/buffer usage for now
> +     */
> +    Assert((instrument_options & INSTRUMENT_BUFFERS) == 0);
> +    Assert((instrument_options & INSTRUMENT_WAL) == 0);
> +
>      for (i = 0; i < n; i++)
>      {
> -        tginstr[i].instr.need_bufusage = need_buffers;
> -        tginstr[i].instr.need_walusage = need_wal;
> +        tginstr[i].instr.need_bufusage = false;
> +        tginstr[i].instr.need_walusage = false;
>          tginstr[i].instr.need_timer = need_timer;
>

That does seem problematic to me.


> @@ -400,21 +398,22 @@ InstrAggNode(NodeInstrumentation * dst, NodeInstrumentation * add)
>  }
>
>  /* start instrumentation during parallel executor startup */
> -void
> +Instrumentation *
>  InstrStartParallelQuery(void)
>  {
> -    save_pgBufferUsage = pgBufferUsage;
> -    save_pgWalUsage = pgWalUsage;
> +    Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | INSTRUMENT_WAL);
> +
> +    InstrStart(instr);
> +    return instr;
>  }
>
>  /* report usage after parallel executor shutdown */
>  void
> -InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> +InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, WalUsage *walusage)
>  {
> -    memset(bufusage, 0, sizeof(BufferUsage));
> -    BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
> -    memset(walusage, 0, sizeof(WalUsage));
> -    WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
> +    InstrStop(instr, true);
> +    memcpy(bufusage, &instr->stack->bufusage, sizeof(BufferUsage));
> +    memcpy(walusage, &instr->stack->walusage, sizeof(WalUsage));
>  }

Wonder if we should slap a pg_nodiscard on InstrStartParallelQuery() to make
it easier for extensions to notice this? OTOH, InstrEndParallelQuery() will
make that pretty clear...
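
If we do add it, the declaration in instrument.h would just be something like
(sketch):

extern pg_nodiscard Instrumentation *InstrStartParallelQuery(void);
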

>      if (verbose || params.log_vacuum_min_duration == 0 ||
>          TimestampDifferenceExceeds(starttime, endtime,
>                                     params.log_vacuum_min_duration))
>      {
>          long        secs_dur;
>          int         usecs_dur;
> -        WalUsage    walusage;
> -        BufferUsage bufferusage;
>          StringInfoData buf;
>          char       *msgfmt;
>          int32       diff;
> @@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
>          int64       total_blks_hit;
>          int64       total_blks_read;
>          int64       total_blks_dirtied;
> +        BufferUsage bufusage = instr->stack->bufusage;
> +        WalUsage    walusage = instr->stack->walusage;
>          TimestampDifference(starttime, endtime, &secs_dur,
>                              &usecs_dur);
> -        memset(&walusage, 0, sizeof(WalUsage));
> -        WalUsageAccumDiff(&walusage, &pgWalUsage, &startwalusage);
> -        memset(&bufferusage, 0, sizeof(BufferUsage));
> -        BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, &startbufferusage);
> -
> -        total_blks_hit = bufferusage.shared_blks_hit +
> -            bufferusage.local_blks_hit;
> -        total_blks_read = bufferusage.shared_blks_read +
> -            bufferusage.local_blks_read;
> -        total_blks_dirtied = bufferusage.shared_blks_dirtied +
> -            bufferusage.local_blks_dirtied;
> +
> +        total_blks_hit = bufusage.shared_blks_hit +
> +            bufusage.local_blks_hit;
> +        total_blks_read = bufusage.shared_blks_read +
> +            bufusage.local_blks_read;
> +        total_blks_dirtied = bufusage.shared_blks_dirtied +
> +            bufusage.local_blks_dirtied;

Why rename and move the declaration of bufferusage here? That seems to make
the diff unnecessarily noisy? Kinda going the other way: Why copy this on the
stack, rather than use a pointer?


> diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
> index 34b6410d6a2..77b4c59e71c 100644
> --- a/src/backend/commands/prepare.c
> +++ b/src/backend/commands/prepare.c
> @@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause *into, ExplainState *es,
>      ListCell   *p;
>      ParamListInfo paramLI = NULL;
>      EState     *estate = NULL;
> -    instr_time  planstart;
> -    instr_time  planduration;
> -    BufferUsage bufusage_start,
> -                bufusage;
> +    Instrumentation *instr = NULL;
>      MemoryContextCounters mem_counters;
>      MemoryContext planner_ctx = NULL;
>      MemoryContext saved_ctx = NULL;
>
> +    if (es->buffers)
> +        instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
> +    else
> +        instr = InstrAlloc(1, INSTRUMENT_TIMER);

I'd not duplicate the call to InstrAlloc().


> From ccea6e453872a0ae63351b3ba4360845035ec621 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Thu, 30 Oct 2025 22:27:30 -0700
> Subject: [PATCH v4 7/7] Index scans: Split heap and index buffer access
>  reporting in EXPLAIN
>
> This makes it clear whether activity was on the index directly, or
> on the table based on heap fetches.

Yay^2.

> ---
>  src/backend/commands/explain.c       | 56 ++++++++++++++++------------
>  src/backend/executor/execProcnode.c  | 13 +++++++
>  src/backend/executor/instrument.c    | 25 +++++++++++++
>  src/backend/executor/nodeIndexscan.c | 15 +++++++-
>  src/include/access/genam.h           |  3 ++
>  src/include/executor/instrument.h    |  3 ++
>  6 files changed, 91 insertions(+), 24 deletions(-)
>
> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
> index dd3bf615581..fb96dd5248c 100644

> diff --git a/src/backend/executor/execProcnode.c b/src/backend/executor/execProcnode.c
> index d00cf820a27..f19af428d97 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -854,7 +854,20 @@ ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
>      planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, context);
>
>      if (node->instrument && node->instrument->stack.previous)
> +    {

Perhaps we should instead (probably in the earlier patch) just make the
function return early if we don't have instrumentation / a previous node? We
might gain a few more special cases, and it's nice if they're less indented.

> +        /*
> +         * Index Scan nodes account for heap buffer usage separately, so we
> +         * need to explitly add here
> +         */

s/heap/table/


Greetings,

Andres Freund
