Hi,

On 2025-08-31 16:57:01 -0700, Lukas Fittl wrote:
> Please find attached a patch series that introduces a new paradigm for how
> per-node WAL/buffer usage is tracked, with two primary goals: (1) reduce
> overhead of EXPLAIN ANALYZE, (2) enable future work like tracking estimated
> distinct buffer hits [0].

I like this for a third reason: To separate out buffer access statistics for
the index and the table in index scans. Right now it's very hard to figure out
if a query is slow because of the index lookups or finding the tuples in the
table.


> 0001: Separate node instrumentation from other use of Instrumentation struct
> 
>     Previously different places (e.g. query "total time") were repurposing
> the per-node Instrumentation struct. Instead, simplify the Instrumentation
> struct to only track time, WAL/buffer usage, and tuple counts. Similarly,
> drop the use of InstrEndLoop outside of per-node instrumentation. Introduce
> the NodeInstrumentation struct to carry forward the per-node
> instrumentation information.

It's mildly odd that the two types of instrumentation have a different
definition of 'total' (one double, one instr_time).


> 0003: Introduce stack for tracking per-node WAL/buffer usage


> From 4375fcb4141f18d6cd927659970518553aa3fe94 Mon Sep 17 00:00:00 2001
> From: Lukas Fittl <lu...@fittl.com>
> Date: Sun, 31 Aug 2025 16:37:05 -0700
> Subject: [PATCH v1 3/3] Introduce stack for tracking per-node WAL/buffer usage


> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index b83ced9a57a..1c2268bc608 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -312,6 +312,7 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>       DestReceiver *dest;
>       bool            sendTuples;
>       MemoryContext oldcontext;
> +     InstrStackResource *res;
>  
>       /* sanity checks */
>       Assert(queryDesc != NULL);
> @@ -333,6 +334,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>       if (queryDesc->totaltime)
>               InstrStart(queryDesc->totaltime);
>  
> +     /* Start up per-query node level instrumentation */
> +     res = InstrStartQuery();
> +
>       /*
>        * extract information from the query descriptor and the query feature.
>        */
> @@ -382,6 +386,9 @@ standard_ExecutorRun(QueryDesc *queryDesc,
>       if (sendTuples)
>               dest->rShutdown(dest);
>  
> +     /* Shut down per-query node level instrumentation */
> +     InstrShutdownQuery(res);
> +
>       if (queryDesc->totaltime)
>               InstrStop(queryDesc->totaltime, estate->es_processed);


Why are we doing Instr{Start,Stop}Query when not using instrumentation?
Resowner stuff ain't free, so I'd skip them when not necessary.


> diff --git a/src/backend/executor/execProcnode.c 
> b/src/backend/executor/execProcnode.c
> index d286471254b..7436f307994 100644
> --- a/src/backend/executor/execProcnode.c
> +++ b/src/backend/executor/execProcnode.c
> @@ -823,8 +823,17 @@ ExecShutdownNode_walker(PlanState *node, void *context)
>  
>       /* Stop the node if we started it above, reporting 0 tuples. */
>       if (node->instrument && node->instrument->running)
> +     {
>               InstrStopNode(node->instrument, 0);
>  
> +             /*
> +              * Propagate WAL/buffer stats to the parent node on the
> +              * instrumentation stack (which is where InstrStopNode returned 
> us
> +              * to).
> +              */
> +             InstrNodeAddToCurrent(&node->instrument->stack);
> +     }
> +
>       return false;
>  }

Can we rely on this being reached? Note that ExecutePlan() calls
ExecShutdownNode() conditionally:

        /*
         * If we know we won't need to back up, we can release resources at this
         * point.
         */
        if (!(estate->es_top_eflags & EXEC_FLAG_BACKWARD))
                ExecShutdownNode(planstate);


> +static void
> +ResOwnerReleaseInstrStack(Datum res)
> +{
> +     /*
> +      * XXX: Registered resources are *not* called in reverse order, i.e. 
> we'll
> +      * get what was first registered first at shutdown. To avoid handling
> +      * that, we are resetting the stack here on abort (instead of recovering
> +      * to previous).
> +      */
> +     pgInstrStack = NULL;
> +}

Hm, doesn't that mean we loose track of instrumentation if you e.g. do an
EXPLAIN ANALYZE of a query that executes a function, which internally triggers
an error and catches it?

I wonder if the solution could be to walk the stack and search for the
to-be-released element.

Greetings,

Andres Freund


Reply via email to