On Tue, Nov 14, 2017 at 6:09 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> EXPLAIN can show per-worker information and aggregated information,
> but doesn't show the leader's information individually. Should it?
>
> Example:
>
>  Partial Aggregate [...]
>    Output: PARTIAL count(*)
>    Buffers: shared hit=476
>    Worker 0: actual time=5.110..5.110 rows=1 loops=1
>      Buffers: shared hit=150
>    Worker 1: actual time=5.088..5.088 rows=1 loops=1
>      Buffers: shared hit=149
>
> Presumably in that case the leader hit 476 - 150 - 149 = 175 shared
> buffers, but why don't we just say so?
This mess is my fault, and I'm happy to have somebody fix it.

The problem I ran into is what to do about the fields where you can't
work backwards. When it says actual_time=${FIRST}..${TOTAL}, you can
work out the leader's value for ${TOTAL} by taking the overall total
and subtracting the totals for each worker, but that doesn't work for
${FIRST}.

The problem here is really because of the way
ExecParallelRetrieveInstrumentation works. That function normally gets
called once per parallel query, but it could get called multiple times
if workers are relaunched. Each time it's called, it folds the data
from the most recent set of workers into the leader's information by
calling InstrAggNode(). But because InstrAggNode() isn't fully
reversible, that loses data. I think what we probably need to do is
modify the definition of PlanState to have an additional
Instrumentation * pointer -- so that we know the real leader
information without having to try to work backward.

When I was initially working on this stuff, I was mostly aiming to get
done before I ran out of release cycle and to minimize the chances of
people asking for parallel query to be reverted entirely, so I tried
to perturb things as little as possible in this area. It's probably
time to shift the priority from non-reversion to quality.

> The above is about increasing visibility of the leader's
> contribution, but it would also be good to be able to see how the
> leader is hurting you by running the plan itself. Would it make sense
> to instrument the tuple queue interactions so we can show per-worker
> shm_mq_send_bytes wait count and wait time?

Wait count could probably be gathered pretty cheaply, but wait time
would have some significant overhead. Since we'd presumably only be
doing it in the EXPLAIN ANALYZE case, that wouldn't really be a
problem, but I guess the question is how accurate the results would
be.
If it makes a 50 second query take 100 seconds, then any percentages
you derive from it are probably mostly meaningless.

I think we should have an EXPLAIN option that makes the leader and
every worker run getrusage() before and after the query and include
the results in a separate section of the output. This would only be
usable with ANALYZE. Then you could see user/system/elapsed for every
backend involved in the query. That way, you can see that, for
example, the leader has user approximately equal to elapsed, but for
workers user/elapsed = 20% or whatever. Granted, it doesn't tell you
WHY processes were off-CPU, so it might not be the tuple queue, but I
think it would still be pretty useful -- and it'd be very cheap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company