On 2025-10-22 14:28:24 +0300, Lukas Fittl wrote:
> On Tue, Sep 9, 2025 at 10:35 PM Lukas Fittl <[email protected]> wrote:
>
> > Attached an updated patch set that addresses the feedback, and also adds
> > the complete removal of the global pgBufferUsage variable in later patches
> > (0005-0007), to avoid counting both the stack and the variable.
> >
>
> See attached the same patch set rebased on latest master.

> From d40f69cce15dfa10479c8be31917b33a49d01477 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v3 1/7] Instrumentation: Keep time fields as instrtime,
>  require caller to convert
>
> Previously the Instrumentation logic always converted to seconds, only for 
> many
> of the callers to do unnecessary division to get to milliseconds. Since an 
> upcoming
> refactoring will split the Instrumentation struct, utilize instrtime always to
> keep things simpler.

LGTM, think we should apply this regardless of the rest of the patches.


> @@ -173,14 +169,14 @@ InstrAggNode(Instrumentation *dst, Instrumentation *add)
>               dst->running = true;
>               dst->firsttuple = add->firsttuple;
>       }
> -     else if (dst->running && add->running && dst->firsttuple > 
> add->firsttuple)
> +     else if (dst->running && add->running && 
> INSTR_TIME_CMP_LT(dst->firsttuple, add->firsttuple))
>               dst->firsttuple = add->firsttuple;

This isn't due to this patch, but it seems a bit odd that we use the minimum
time for the first tuple, but the average time for the node's completion...


> diff --git a/src/include/portability/instr_time.h 
> b/src/include/portability/instr_time.h
> index f71a851b18d..646934020d1 100644
> --- a/src/include/portability/instr_time.h
> +++ b/src/include/portability/instr_time.h
> @@ -184,6 +184,8 @@ GetTimerFrequency(void)
>  #define INSTR_TIME_ACCUM_DIFF(x,y,z) \
>       ((x).ticks += (y).ticks - (z).ticks)
>
> +#define INSTR_TIME_CMP_LT(x,y) \
> +     ((x).ticks > (y).ticks)
>
>  #define INSTR_TIME_GET_DOUBLE(t) \
>       ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
> --
> 2.47.1

Any reason to actually have _CMP_ in the name? Other operations like _ADD
don't have such an additional verb in the name.


> From 7546f855d138d0dac0d8c22ea5915314810f13e5 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Sat, 1 Mar 2025 19:31:30 -0800
> Subject: [PATCH v3 2/7] Separate node instrumentation from other use of
>  Instrumentation struct
>
> Previously different places (e.g. query "total time") were repurposing
> the Instrumentation struct initially introduced for capturing per-node
> statistics during execution. This dual use of the struct is confusing,
> e.g. by cluttering calls of InstrStartNode/InstrStopNode in unrelated
> code paths, and prevents future refactorings.
>
> Instead, simplify the Instrumentation struct to only track time,
> WAL/buffer usage, and tuple counts. Similarly, drop the use of InstrEndLoop
> outside of per-node instrumentation. Introduce the NodeInstrumentation
> struct to carry forward the per-node instrumentation information.


> @@ -381,12 +381,6 @@ explain_ExecutorEnd(QueryDesc *queryDesc)
>                */
>               oldcxt = MemoryContextSwitchTo(queryDesc->estate->es_query_cxt);
>
> -             /*
> -              * Make sure stats accumulation is done.  (Note: it's okay if 
> several
> -              * levels of hook all do this.)
> -              */
> -             InstrEndLoop(queryDesc->totaltime);
> -
>               /* Log plan if duration is exceeded. */
>               msec = INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total);
>               if (msec >= auto_explain_log_min_duration)

Maybe add a comment about the removal of these InstrEndLoop() calls to the
commit message? If I understand correctly they were superfluous before, but
that's not entirely obvious when just looking at the patch.



> +/*
> + * General purpose instrumentation that can capture time, WAL/buffer usage 
> and tuples
> + *
> + * Initialized through InstrAlloc, followed by one or more calls to a pair of
> + * InstrStart/InstrStop (activity is measured inbetween).
> + */
>  typedef struct Instrumentation
> +{
> +     /* Parameters set at creation: */
> +     bool            need_timer;             /* true if we need timer data */
> +     bool            need_bufusage;  /* true if we need buffer usage data */
> +     bool            need_walusage;  /* true if we need WAL usage data */
> +     /* Internal state keeping: */
> +     instr_time      starttime;              /* start time of last 
> InstrStart */
> +     BufferUsage bufusage_start; /* buffer usage at start */
> +     WalUsage        walusage_start; /* WAL usage at start */
> +     /* Accumulated statistics: */
> +     instr_time      total;                  /* total runtime */
> +     double          ntuples;                /* total tuples counted in 
> InstrStop */
> +     BufferUsage bufusage;           /* total buffer usage */
> +     WalUsage        walusage;               /* total WAL usage */
> +} Instrumentation;

Maybe add a comment explaining why ntuples is in here?



> From aa1acccb3dfa6a5d81a9a049d8cb63762a3d7cf7 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <[email protected]>
> Date: Tue, 9 Sep 2025 02:16:59 -0700
> Subject: [PATCH v3 4/7] Introduce stack for tracking per-node WAL/buffer usage

Could use a commit message :)


> ---
>  .../pg_stat_statements/pg_stat_statements.c   |   4 +-
>  src/backend/commands/explain.c                |   8 +-
>  src/backend/commands/trigger.c                |   4 +-
>  src/backend/executor/execMain.c               |  25 ++-
>  src/backend/executor/execProcnode.c           |  29 +++
>  src/backend/executor/instrument.c             | 199 ++++++++++++++----
>  src/include/executor/executor.h               |   1 +
>  src/include/executor/instrument.h             |  53 ++++-
>  8 files changed, 260 insertions(+), 63 deletions(-)
>
> diff --git a/contrib/pg_stat_statements/pg_stat_statements.c 
> b/contrib/pg_stat_statements/pg_stat_statements.c
> index f43a33b3787..eeabd820d8e 100644
> --- a/contrib/pg_stat_statements/pg_stat_statements.c
> +++ b/contrib/pg_stat_statements/pg_stat_statements.c
> @@ -1089,8 +1089,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
>                                  PGSS_EXEC,
>                                  
> INSTR_TIME_GET_MILLISEC(queryDesc->totaltime->total),
>                                  queryDesc->estate->es_total_processed,
> -                                &queryDesc->totaltime->bufusage,
> -                                &queryDesc->totaltime->walusage,
> +                                &INSTR_GET_BUFUSAGE(queryDesc->totaltime),
> +                                &INSTR_GET_WALUSAGE(queryDesc->totaltime),

Getting a pointer to something returned by a macro is a bit ugly... Perhaps
it'd be better to just pass the &queryDesc->totaltime? But ugh, that's not
easily possible given how pgss_planner() currently tracks things :(

Maybe it's worth refactoring this a bit in a precursor patch?


> @@ -1266,7 +1280,12 @@ InitResultRelInfo(ResultRelInfo *resultRelInfo,
>               resultRelInfo->ri_TrigWhenExprs = (ExprState **)
>                       palloc0(n * sizeof(ExprState *));
>               if (instrument_options)
> -                     resultRelInfo->ri_TrigInstrument = InstrAlloc(n, 
> instrument_options);
> +             {
> +                     if ((instrument_options & INSTRUMENT_TIMER) != 0)
> +                             resultRelInfo->ri_TrigInstrument = 
> InstrAlloc(n, INSTRUMENT_TIMER);
> +                     else
> +                             resultRelInfo->ri_TrigInstrument = 
> InstrAlloc(n, INSTRUMENT_ROWS);
> +             }
>       }
>       else
>       {

I'd not duplicate the InstrAlloc(), but compute the flags separately.


>  /* ------------------------------------------------------------------------
> @@ -828,6 +829,34 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>       return false;
>  }
>
> +/*
> + * ExecAccumNodeInstrumentation
> + *
> + * Accumulate instrumentation stats from all execution nodes to their 
> respective
> + * parents (or the original parent instrumentation stack).
> + */
> +void
> +ExecAccumNodeInstrumentation(PlanState *node)
> +{
> +     (void) ExecAccumNodeInstrumentation_walker(node, NULL);
> +}

I wonder if this is too narrow a name. There might be other uses of a pass
across the node tree at that point. OTOH, it's probably better to just rename
it at that later point.


> +static bool
> +ExecAccumNodeInstrumentation_walker(PlanState *node, void *context)
> +{
> +     if (node == NULL)
> +             return false;
> +
> +     check_stack_depth();
> +
> +     planstate_tree_walker(node, ExecAccumNodeInstrumentation_walker, 
> context);

There already is a check_stack_depth() in planstate_tree_walker().


> +     if (node->instrument && node->instrument->stack.previous)
> +             InstrStackAdd(node->instrument->stack.previous, 
> &node->instrument->stack);
> +
> +     return false;
> +}

E.g. in ExecShutdownNode_walker we use planstate_tree_walker(), but then also
have special handling for a few node types. Do we need something like that
here too?  It probably is ok, but it's worth explicitly checking and adding a
comment.



> +/*
> + * Use ResourceOwner mechanism to correctly reset pgInstrStack on abort.
> + */
> +static void ResOwnerReleaseInstrumentation(Datum res);
> +static const ResourceOwnerDesc instrumentation_resowner_desc =
> +{
> +     .name = "instrumentation",
> +     .release_phase = RESOURCE_RELEASE_BEFORE_LOCKS,
> +     .release_priority = RELEASE_PRIO_FIRST,
> +     .ReleaseResource = ResOwnerReleaseInstrumentation,
> +     .DebugPrint = NULL,                     /* default message is fine */
> +};

Is there a reason to do the release here before the lock release? And why
_FIRST?


> +static void
> +ResOwnerReleaseInstrumentation(Datum res)
> +{
> +     Instrumentation *instr = (Instrumentation *) DatumGetPointer(res);
> +
> +     /*
> +      * Because registered resources are *not* called in reverse order, we'll
> +      * get what was first registered first at shutdown. Thus, on any later
> +      * resources we need to not change the stack, which was already set to 
> the
> +      * correct previous entry.
> +      */

FWIW, the release order is not guaranteed to be in that order either,
e.g. once resowner switches to hashing, it'll essentially be random.


> +     if (pgInstrStack && !StackIsParent(pgInstrStack, &instr->stack))
> +             pgInstrStack = instr->stack.previous;

Hm - this is effectively O(stack-depth^2), right?  It's probably fine, given
that we have fairly limited nesting (explain + pg_stat_statements +
auto_explain is probably the current max), but seems worth noting in a
comment?


> +     /*
> +      * Always accumulate all collected stats before the abort, even if we
> +      * already walked up the stack with an earlier resource.
> +      */
> +     if (pgInstrStack)
> +             InstrStackAdd(pgInstrStack, &instr->stack);

Why are we accumulating stats in case of errors? It's probably fine, but doing
less as part of cleanup is pre ferrable...


>  /* General purpose instrumentation handling */
>  Instrumentation *
>  InstrAlloc(int n, int instrument_options)
>  {
> -     Instrumentation *instr;
> +     Instrumentation *instr = NULL;
> +     bool            need_buffers = (instrument_options & 
> INSTRUMENT_BUFFERS) != 0;
> +     bool            need_wal = (instrument_options & INSTRUMENT_WAL) != 0;
> +     bool            need_timer = (instrument_options & INSTRUMENT_TIMER) != 
> 0;
> +     int                     i;
> +
> +     /*
> +      * If resource owner will be used, we must allocate in the transaction
> +      * context (not the calling context, usually a lower context), because 
> the
> +      * memory might otherwise be freed too early in an abort situation.
> +      */
> +     if (need_buffers || need_wal)
> +             instr = MemoryContextAllocZero(CurTransactionContext, n * 
> sizeof(Instrumentation));
> +     else
> +             instr = palloc0(n * sizeof(Instrumentation));

Is this long-lived enough? I'm e.g. wondering about utility statements that
internally starting transactions, wouldn't that cause problems with a user
like pgss tracking something like CIC?


> -     /* initialize all fields to zeroes, then modify as needed */
> -     instr = palloc0(n * sizeof(Instrumentation));
> -     if (instrument_options & (INSTRUMENT_BUFFERS | INSTRUMENT_TIMER | 
> INSTRUMENT_WAL))
> +     for (i = 0; i < n; i++)
>       {
> -             bool            need_buffers = (instrument_options & 
> INSTRUMENT_BUFFERS) != 0;
> -             bool            need_wal = (instrument_options & 
> INSTRUMENT_WAL) != 0;
> -             bool            need_timer = (instrument_options & 
> INSTRUMENT_TIMER) != 0;
> -             int                     i;
> -
> -             for (i = 0; i < n; i++)
> -             {
> -                     instr[i].need_bufusage = need_buffers;
> -                     instr[i].need_walusage = need_wal;
> -                     instr[i].need_timer = need_timer;
> -             }
> +             instr[i].need_bufusage = need_buffers;
> +             instr[i].need_walusage = need_wal;
> +             instr[i].need_timer = need_timer;
>       }
>
>       return instr;
>  }
> +
>  void
>  InstrStart(Instrumentation *instr)
>  {
> +     Assert(!instr->finalized);
> +
>       if (instr->need_timer &&
>               !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
>               elog(ERROR, "InstrStart called twice in a row");
>
> -     if (instr->need_bufusage)
> -             instr->bufusage_start = pgBufferUsage;
> -
> -     if (instr->need_walusage)
> -             instr->walusage_start = pgWalUsage;
> +     if (instr->need_bufusage || instr->need_walusage)
> +             InstrPushStackResource(instr);
>  }
> +
>  void
> -InstrStop(Instrumentation *instr, double nTuples)
> +InstrStop(Instrumentation *instr, double nTuples, bool finalize)
>  {
>       instr_time      endtime;
>
> @@ -84,14 +178,15 @@ InstrStop(Instrumentation *instr, double nTuples)
>               INSTR_TIME_SET_ZERO(instr->starttime);
>       }
>
> -     /* Add delta of buffer usage since entry to node's totals */
> -     if (instr->need_bufusage)
> -             BufferUsageAccumDiff(&instr->bufusage,
> -                                                      &pgBufferUsage, 
> &instr->bufusage_start);
> +     if (instr->need_bufusage || instr->need_walusage)
> +             InstrPopStackResource(instr);
>
> -     if (instr->need_walusage)
> -             WalUsageAccumDiff(&instr->walusage,
> -                                               &pgWalUsage, 
> &instr->walusage_start);
> +     if (finalize)
> +     {
> +             instr->finalized = true;
> +             if (pgInstrStack)
> +                     InstrStackAdd(pgInstrStack, &instr->stack);
> +     }
>  }
>
>  /* Allocate new node instrumentation structure(s) */
> @@ -139,12 +234,14 @@ InstrStartNode(NodeInstrumentation * instr)
>               !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
>               elog(ERROR, "InstrStartNode called twice in a row");
>
> -     /* save buffer usage totals at node entry, if needed */
> -     if (instr->need_bufusage)
> -             instr->bufusage_start = pgBufferUsage;
> +     if (instr->need_bufusage || instr->need_walusage)
> +     {
> +             /* Ensure that we always have a parent, even at the top most 
> node */
> +             Assert(pgInstrStack != NULL);
>
> -     if (instr->need_walusage)
> -             instr->walusage_start = pgWalUsage;
> +             instr->stack.previous = pgInstrStack;
> +             pgInstrStack = &instr->stack;
> +     }
>  }
>
>  /* Exit from a plan node */
> @@ -169,14 +266,12 @@ InstrStopNode(NodeInstrumentation * instr, double 
> nTuples)
>               INSTR_TIME_SET_ZERO(instr->starttime);
>       }
>
> -     /* Add delta of buffer usage since entry to node's totals */
> -     if (instr->need_bufusage)
> -             BufferUsageAccumDiff(&instr->bufusage,
> -                                                      &pgBufferUsage, 
> &instr->bufusage_start);
> -
> -     if (instr->need_walusage)
> -             WalUsageAccumDiff(&instr->walusage,
> -                                               &pgWalUsage, 
> &instr->walusage_start);
> +     if (instr->need_bufusage || instr->need_walusage)
> +     {
> +             /* Ensure that we always have a parent, even at the top most 
> node */
> +             Assert(instr->stack.previous != NULL);
> +             pgInstrStack = instr->stack.previous;
> +     }
>
>       /* Is this the first tuple of this cycle? */
>       if (!instr->running)
> @@ -253,10 +348,20 @@ InstrAggNode(NodeInstrumentation * dst, 
> NodeInstrumentation * add)
>
>       /* Add delta of buffer usage since entry to node's totals */
>       if (dst->need_bufusage)
> -             BufferUsageAdd(&dst->bufusage, &add->bufusage);
> +             BufferUsageAdd(&dst->stack.bufusage, &add->stack.bufusage);
>
>       if (dst->need_walusage)
> -             WalUsageAdd(&dst->walusage, &add->walusage);
> +             WalUsageAdd(&dst->stack.walusage, &add->stack.walusage);
> +}
> +
> +void
> +InstrStackAdd(InstrStack * dst, InstrStack * add)
> +{
> +     Assert(dst != NULL);
> +     Assert(add != NULL);
> +
> +     BufferUsageAdd(&dst->bufusage, &add->bufusage);
> +     WalUsageAdd(&dst->walusage, &add->walusage);
>  }
>

Do we want to do BufferUsageAdd() etc even if we are not tracking buffer
usage? Those operations aren't cheap...


>  /* note current values during parallel executor startup */
> @@ -281,6 +386,14 @@ InstrEndParallelQuery(BufferUsage *bufusage, WalUsage 
> *walusage)
>  void
>  InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage *walusage)
>  {
> +     if (pgInstrStack != NULL)
> +     {
> +             InstrStack *dst = pgInstrStack;
> +
> +             BufferUsageAdd(&dst->bufusage, bufusage);
> +             WalUsageAdd(&dst->walusage, walusage);
> +     }
> +
>       BufferUsageAdd(&pgBufferUsage, bufusage);
>       WalUsageAdd(&pgWalUsage, walusage);
>  }

Is the pgInstrStack == NULL case actually reachable?


> diff --git a/src/include/executor/instrument.h 
> b/src/include/executor/instrument.h
> index 78d3653997b..d04607ce40c 100644
> --- a/src/include/executor/instrument.h
> +++ b/src/include/executor/instrument.h
> @@ -14,6 +14,7 @@
>  #define INSTRUMENT_H
>
>  #include "portability/instr_time.h"
> +#include "utils/resowner.h"

I'd probably not include resowner here but just forward declare the typedef.

> @@ -146,26 +161,46 @@ extern void InstrEndParallelQuery(BufferUsage 
> *bufusage, WalUsage *walusage);
>  extern void InstrAccumParallelQuery(BufferUsage *bufusage, WalUsage 
> *walusage);
>  extern void BufferUsageAccumDiff(BufferUsage *dst,
>                                                                const 
> BufferUsage *add, const BufferUsage *sub);
> +extern void InstrStackAdd(InstrStack * dst, InstrStack * add);
>  extern void WalUsageAccumDiff(WalUsage *dst, const WalUsage *add,
>                                                         const WalUsage *sub);
>
> +#define INSTR_GET_BUFUSAGE(instr) \
> +     instr->stack.bufusage
> +
> +#define INSTR_GET_WALUSAGE(instr) \
> +     instr->stack.walusage

Not convinced that having these macros is worthwhile.


At this point I reached return -ENEEDCOFFEE :)

Greetings,

Andres Freund


Reply via email to