Re: Stack-based tracking of per-node WAL/buffer usage

Andres Freund Fri, 09 Jan 2026 11:38:52 -0800

Hi,


On 2025-10-31 00:18:04 -0700, Lukas Fittl wrote:
> Attached v4 patchset that addresses feedback (unless otherwise noted below)
> and is rebased on master. Other changes:
>
> [...]
> - Per our off-list conversation at PGConf.EU, added a patch (v4/0007) that
> illustrates how the stack mechanism can be used to separate index and table
> buffer accesses in the EXPLAIN for Index Scans

Nice!


I pushed 0001. The only changes I made were to break a few long lines.


> From 324666b35d4513676783f0c352ad3a27371c08d8 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v4 2/7] Separate node and trigger instrumentation from other
>  use of Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This overuse of the same struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time and
> WAL/buffer usage. Similarly, drop the use of InstrEndLoop outside of
> per-node instrumentation - these calls were added without any apparent
> benefit since the relevant fields were never read.
>
> Introduce the NodeInstrumentation struct to carry forward the per-node
> instrumentation information, and introduce TriggerInstrumentation to
> capture trigger timing and firings (previously counted in "ntuples").

Hm. It looks to me that after moving trigger instrumentation to its own thing,
there are no users of InstrAlloc() with n > 1. I think it may make sense to
drop that?


> diff --git a/src/backend/executor/execParallel.c 
> b/src/backend/executor/execParallel.c
> index f098a5557cf..e87810d292e 100644
> --- a/src/backend/executor/execParallel.c
> +++ b/src/backend/executor/execParallel.c
> @@ -85,7 +85,7 @@ typedef struct FixedParallelExecutorState
>   * instrument_options: Same meaning here as in instrument.c.
>   *
>   * instrument_offset: Offset, relative to the start of this structure,
> - * of the first Instrumentation object.  This will depend on the length of
> + * of the first NodeInstrumentation object.  This will depend on the length 
> of
>   * the plan_node_id array.
>   *
>   * num_workers: Number of workers.
> @@ -102,11 +102,15 @@ struct SharedExecutorInstrumentation
>       int                     num_workers;
>       int                     num_plan_nodes;
>       int                     plan_node_id[FLEXIBLE_ARRAY_MEMBER];
> -     /* array of num_plan_nodes * num_workers Instrumentation objects 
> follows */
> +
> +     /*
> +      * array of num_plan_nodes * num_workers NodeInstrumentation objects
> +      * follows
> +      */
>  };

Spurious change, I think?



> +
> +/* Trigger instrumentation handling */
> +TriggerInstrumentation *
> +InstrAllocTrigger(int n, int instrument_options)
> +{
> +     TriggerInstrumentation *tginstr = palloc0(n * 
> sizeof(TriggerInstrumentation));
> +     bool            need_buffers = (instrument_options & 
> INSTRUMENT_BUFFERS) != 0;
> +     bool            need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> +     bool            need_timer = (instrument_options & INSTRUMENT_TIMER) != 
> 0;
> +     int                     i;
> +
> +     for (i = 0; i < n; i++)
> +     {
> +             tginstr[i].instr.need_bufusage = need_buffers;
> +             tginstr[i].instr.need_walusage = need_wal;
> +             tginstr[i].instr.need_timer = need_timer;
> +     }
> +
> +     return tginstr;

Hm. Not that it's a huge difference, but I wonder if we ought to use
InstrInit() here and in InstrAlloc(), InstrAllocNode(), to avoid repetition?

> +/* Trigger instrumentation */
> +typedef struct TriggerInstrumentation
> +{
> +     Instrumentation instr;
> +     int                     firings;                /* number of times the 
> instrumented trigger
> +                                                              * was fired */
> +}                    TriggerInstrumentation;
> +
> +/*
> + * Specialized instrumentation for per-node execution statistics
> + */
> +typedef struct NodeInstrumentation
>  {
>       /* Parameters set at node creation: */
>       bool            need_timer;             /* true if we need timer data */
> @@ -92,25 +125,34 @@ typedef struct Instrumentation
>       double          nfiltered2;             /* # of tuples removed by 
> "other" quals */
>       BufferUsage bufusage;           /* total buffer usage */
>       WalUsage        walusage;               /* total WAL usage */
> -} Instrumentation;
> +}                    NodeInstrumentation;

The indentation here will look better if you add TriggerInstrumentation,
NodeInstrumentation to typedefs.list


>  typedef struct WorkerInstrumentation
>  {
>       int                     num_workers;    /* # of structures that follow 
> */
> -     Instrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
> +     NodeInstrumentation instrument[FLEXIBLE_ARRAY_MEMBER];
>  } WorkerInstrumentation;

Hm. Shouldn't this be WorkerNodeInstrumentation now?


> From 74f44adc505a436a65d6069b286c8a878d4fe4af Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sun, 31 Aug 2025 16:34:42 -0700
> Subject: [PATCH v4 3/7] Replace direct changes of pgBufferUsage/pgWalUsage
>  with INSTR_* macros

Looks sane to me.

I'd move that to the head of the queue in the next revision.


> From e03c96cbd3079c03ae63b6427937b79edaa9562b Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v4 4/7] Optimize measuring WAL/buffer usage through
>  stack-based instrumentation
>
> Previously, in order to determine the buffer/WAL usage of a given code 
> section,
> we utilized continuously incrementing global counters that get updated when 
> the
> actual activity (e.g. shared block read) occurred, and then calculated a diff 
> when
> the code section ended. This resulted in a bottleneck for executor node 
> instrumentation
> specifically, with the function BufferUsageAccumDiff showing up in profiles 
> and
> in some cases adding up to 10% overhead to an EXPLAIN (ANALYZE, BUFFERS) run.
>
> Instead, introduce a stack-based mechanism, where the actual activity writes
> into the current stack entry. In the case of executor nodes, this means that
> each node gets its own stack entry that is pushed at InstrStartNode, and 
> popped
> at InstrEndNode. Stack entries are zero initialized (avoiding the diff 
> mechanism)
> and get added to their parent entry when they are finalized, i.e. no more
> modifications can occur.
>
> To correctly handle abort situations, any use of instrumentation stacks must
> involve either a top-level Instrumentation struct, and its associated 
> InstrStart/
> InstrStop helpers (which use resource owners to handle aborts), or dedicated
> PG_TRY/PG_FINALLY calls that ensure the stack is in a consistent state after
> an abort.

I think it may be good to have some tests for edge cases. E.g. testing that a
query that an explain analyze of a query that executes a plpgsql function
which in turn does another explain analyze or just another query without
explain analyze, does something reasonable.


> @@ -1089,8 +1074,13 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
>                                  PGSS_EXEC,
>                                  
> INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
>                                  queryDesc->estate->es_total_processed,
> -                                &queryDesc->totaltime->bufusage,
> -                                &queryDesc->totaltime->walusage,
> +
> +             /*
> +              * Check if stack is initialized - it is not when ExecutorRun 
> wasn't
> +              * called
> +              */
> +                                queryDesc->totaltime->stack ? 
> &queryDesc->totaltime->stack->bufusage : NULL,
> +                                queryDesc->totaltime->stack ? 
> &queryDesc->totaltime->stack->walusage : NULL,
>                                  queryDesc->estate->es_jit ? 
> &queryDesc->estate->es_jit->instr : NULL,
>                                  NULL,
>                                  
> queryDesc->estate->es_parallel_workers_to_launch,

Maybe it's just the diff, but this looks a bit odd, indentation wise.



> @@ -1266,7 +1280,15 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
>               resultRelInfo->ri_TrigWhenExprs = (ExprState **)
>                       palloc0(n * sizeof(ExprState *));
>               if (instrument_options)
> -                     resultRelInfo->ri_TrigInstrument = InstrAllocTrigger(n, 
> instrument_options);
> +             {
> +                     /*
> +                      * Triggers do not individually track buffer/WAL usage, 
> even if
> +                      * otherwise tracked
> +                      */

Why is that?


> @@ -59,15 +125,31 @@ InstrStart(Instrumentation *instr)
>               !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
>               elog(ERROR, "InstrStart called twice in a row");
>
> -     if (instr->need_bufusage)
> -             instr->bufusage_start = pgBufferUsage;
> +     if (instr->need_bufusage || instr->need_walusage)
> +     {
> +             Assert(CurrentResourceOwner != NULL);
> +             instr->owner = CurrentResourceOwner;
> +
> +             /*
> +              * Allocate the stack resource in a memory context that survives
> +              * during an abort. This will be freed by InstrStop (regular
> +              * execution) or ResOwnerReleaseInstrumentation (abort).
> +              *
> +              * We don't do this in InstrAlloc to avoid leaking when 
> InstrStart +
> +              * InstrStop isn't called.
> +              */
> +             if (instr->stack == NULL)
> +                     instr->stack = 
> MemoryContextAllocZero(CurTransactionContext, sizeof(InstrStack));
>
> -     if (instr->need_walusage)
> -             instr->walusage_start = pgWalUsage;
> +             ResourceOwnerEnlarge(instr->owner);
> +             ResourceOwnerRememberInstrStack(instr->owner, instr->stack);
> +
> +             InstrPushStack(instr->stack);
> +     }
>  }

I'm still worried that allocating something in CurTransactionContext will bite
us eventually. It seems like it'd be pretty sane to want to have
instrumentation around procedure execution, for example.


>  void
> -InstrStop(Instrumentation *instr)
> +InstrStop(Instrumentation *instr, bool finalize)
>  {
>       instr_time      endtime;
>
> @@ -83,14 +165,28 @@ InstrStop(Instrumentation *instr)
>               INSTR_TIME_SET_ZERO(instr->starttime);
>       }
>
> -     /* Add delta of buffer usage since entry to node's totals */
> -     if (instr->need_bufusage)
> -             BufferUsageAccumDiff(&instr->bufusage,
> -                                                      &pgBufferUsage, 
> &instr->bufusage_start);
> +     if (instr->need_bufusage || instr->need_walusage)
> +     {
> +             InstrPopStack(instr->stack, finalize);
>
> -     if (instr->need_walusage)
> -             WalUsageAccumDiff(&instr->walusage,
> -                                               &pgWalUsage, 
> &instr->walusage_start);
> +             Assert(instr->owner != NULL);
> +             ResourceOwnerForgetInstrStack(instr->owner, instr->stack);
> +             instr->owner = NULL;
> +
> +             if (finalize)
> +             {
> +                     /*
> +                      * To avoid keeping memory allocated beyond when its 
> needed, copy
> +                      * the result to the current memory context, and free 
> it in the
> +                      * transaction context.
> +                      */
> +                     InstrStack *stack = palloc(sizeof(InstrStack));
> +
> +                     memcpy(stack, instr->stack, sizeof(InstrStack));
> +                     pfree(instr->stack);
> +                     instr->stack = stack;
> +             }
> +     }
>  }

Hm. I'm not entirely sure I understand the gain due to the palloc & copy &
pfree. Won't that typically increase memory usage due to temporarily having
two versions of the InstrStack?


>  /* Trigger instrumentation handling */
> @@ -98,15 +194,20 @@ TriggerInstrumentation *
>  InstrAllocTrigger(int n, int instrument_options)
>  {
>       TriggerInstrumentation *tginstr = palloc0(n * 
> sizeof(TriggerInstrumentation));
> -     bool            need_buffers = (instrument_options & 
> INSTRUMENT_BUFFERS) != 0;
> -     bool            need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
>       bool            need_timer = (instrument_options & INSTRUMENT_TIMER) != 
> 0;
>       int                     i;
>
> +     /*
> +      * To avoid having to determine when the last trigger fired, we never
> +      * track WAL/buffer usage for now
> +      */
> +     Assert((instrument_options & INSTRUMENT_BUFFERS) == 0);
> +     Assert((instrument_options & INSTRUMENT_WAL) == 0);
> +
>       for (i = 0; i < n; i++)
>       {
> -             tginstr[i].instr.need_bufusage = need_buffers;
> -             tginstr[i].instr.need_walusage = need_wal;
> +             tginstr[i].instr.need_bufusage = false;
> +             tginstr[i].instr.need_walusage = false;
>               tginstr[i].instr.need_timer = need_timer;
>       }
>

That does seem problematic to me.




> @@ -400,21 +398,22 @@ InstrAggNode(NodeInstrumentation * dst, 
> NodeInstrumentation * add)
>  }
>
>  /* start instrumentation during parallel executor startup */
> -void
> +Instrumentation *
>  InstrStartParallelQuery(void)
>  {
> -     save_pgBufferUsage = pgBufferUsage;
> -     save_pgWalUsage = pgWalUsage;
> +     Instrumentation *instr = InstrAlloc(1, INSTRUMENT_BUFFERS | 
> INSTRUMENT_WAL);
> +
> +     InstrStart(instr);
> +     return instr;
>  }

>  /* report usage after parallel executor shutdown */
>  void
> -InstrEndParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
> +InstrEndParallelQuery(Instrumentation *instr, BufferUsage *bufusage, 
> WalUsage *walusage)
>  {
> -     memset(bufusage, 0, sizeof(BufferUsage));
> -     BufferUsageAccumDiff(bufusage, &pgBufferUsage, &save_pgBufferUsage);
> -     memset(walusage, 0, sizeof(WalUsage));
> -     WalUsageAccumDiff(walusage, &pgWalUsage, &save_pgWalUsage);
> +     InstrStop(instr, true);
> +     memcpy(bufusage, &instr->stack->bufusage, sizeof(BufferUsage));
> +     memcpy(walusage, &instr->stack->walusage, sizeof(WalUsage));
>  }


Wonder if we should slap a pg_nodiscard on InstrStartParallelQuery() to make
it easier for extensions to notice this? OTOH, InstrEndParallelQuery() will
make that pretty clear...



>               if (verbose || params.log_vacuum_min_duration == 0 ||
>                       TimestampDifferenceExceeds(starttime, endtime,
>                                                                          
> params.log_vacuum_min_duration))
>               {
>                       long            secs_dur;
>                       int                     usecs_dur;
> -                     WalUsage        walusage;
> -                     BufferUsage bufferusage;
>                       StringInfoData buf;
>                       char       *msgfmt;
>                       int32           diff;
> @@ -975,19 +976,17 @@ heap_vacuum_rel(Relation rel, const VacuumParams params,
>                       int64           total_blks_hit;
>                       int64           total_blks_read;
>                       int64           total_blks_dirtied;
> +                     BufferUsage bufusage = instr->stack->bufusage;
> +                     WalUsage        walusage = instr->stack->walusage;

>                       TimestampDifference(starttime, endtime, &secs_dur, 
> &usecs_dur);
> -                     memset(&walusage, 0, sizeof(WalUsage));
> -                     WalUsageAccumDiff(&walusage, &pgWalUsage, 
> &startwalusage);
> -                     memset(&bufferusage, 0, sizeof(BufferUsage));
> -                     BufferUsageAccumDiff(&bufferusage, &pgBufferUsage, 
> &startbufferusage);
> -
> -                     total_blks_hit = bufferusage.shared_blks_hit +
> -                             bufferusage.local_blks_hit;
> -                     total_blks_read = bufferusage.shared_blks_read +
> -                             bufferusage.local_blks_read;
> -                     total_blks_dirtied = bufferusage.shared_blks_dirtied +
> -                             bufferusage.local_blks_dirtied;
> +
> +                     total_blks_hit = bufusage.shared_blks_hit +
> +                             bufusage.local_blks_hit;
> +                     total_blks_read = bufusage.shared_blks_read +
> +                             bufusage.local_blks_read;
> +                     total_blks_dirtied = bufusage.shared_blks_dirtied +
> +                             bufusage.local_blks_dirtied;

Why rename and move the declaration of bufferusage here? That seems to make
the diff unnecessarily noisy?

Kinda going the other way: Why copy this on the stack, rather than use a
pointer?

> diff --git a/src/backend/commands/prepare.c b/src/backend/commands/prepare.c
> index 34b6410d6a2..77b4c59e71c 100644
> --- a/src/backend/commands/prepare.c
> +++ b/src/backend/commands/prepare.c
> @@ -578,14 +578,16 @@ ExplainExecuteQuery(ExecuteStmt *execstmt, IntoClause 
> *into, ExplainState *es,
>       ListCell   *p;
>       ParamListInfo paramLI = NULL;
>       EState     *estate = NULL;
> -     instr_time      planstart;
> -     instr_time      planduration;
> -     BufferUsage bufusage_start,
> -                             bufusage;
> +     Instrumentation *instr = NULL;
>       MemoryContextCounters mem_counters;
>       MemoryContext planner_ctx = NULL;
>       MemoryContext saved_ctx = NULL;
>
> +     if (es->buffers)
> +             instr = InstrAlloc(1, INSTRUMENT_TIMER | INSTRUMENT_BUFFERS);
> +     else
> +             instr = InstrAlloc(1, INSTRUMENT_TIMER);

I'd not duplicate the call to InstrAlloc().



> From ccea6e453872a0ae63351b3ba4360845035ec621 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Thu, 30 Oct 2025 22:27:30 -0700
> Subject: [PATCH v4 7/7] Index scans: Split heap and index buffer access
>  reporting in EXPLAIN
>
> This makes it clear whether activity was on the index directly, or
> on the table based on heap fetches.

Yay^2.


> ---
>  src/backend/commands/explain.c       | 56 ++++++++++++++++------------
>  src/backend/executor/execProcnode.c  | 13 +++++++
>  src/backend/executor/instrument.c    | 25 +++++++++++++
>  src/backend/executor/nodeIndexscan.c | 15 +++++++-
>  src/include/access/genam.h           |  3 ++
>  src/include/executor/instrument.h    |  3 ++
>  6 files changed, 91 insertions(+), 24 deletions(-)
>
> diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
> index dd3bf615581..fb96dd5248c 100644

> diff --git a/src/backend/executor/execProcnode.c 
> b/src/backend/executor/execProcnode.c
> index d00cf820a27..f19af428d97 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -854,7 +854,20 @@ ExecAccumNodeInstrumentation_walker(PlanState *node, 
> void *context)
>       planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, 
> context);
>
>       if (node->instrument && node->instrument->stack.previous)
> +     {

Perhaps we should instead (probably in the earlier patch) just make the
function return early if we don't have instrumentation / a previous node? We
might gain a few more special cases, and it's nice if they're less indented.


> +             /*
> +              * Index Scan nodes account for heap buffer usage separately, 
> so we
> +              * need to explitly add here
> +              */

s/heap/table/


Greetings,

Andres Freund

Re: Stack-based tracking of per-node WAL/buffer usage

Reply via email to