On Tue, Nov 14, 2017 at 6:09 PM, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:
> EXPLAIN can show per-worker information and aggregated information,
> but doesn't show the leader's information individually. Should it?
>
> Example:
>
>  Partial Aggregate [...]
>    Output: PARTIAL count(*)
>    Buffers: shared hit=476
>    Worker 0: actual time=5.110..5.110 rows=1 loops=1
>      Buffers: shared hit=150
>    Worker 1: actual time=5.088..5.088 rows=1 loops=1
>      Buffers: shared hit=149
>
> Presumably in that case the leader hit 476 - 150 - 149 = 175 shared
> buffers, but why don't we just say so?
This mess is my fault, and I'm happy to have somebody fix it.

The problem I ran into is what to do about the fields where you can't
work backwards. When it says actual_time=${FIRST}..${TOTAL}, you can
work out the leader's value for ${TOTAL} by taking the overall total
and subtracting the totals for each worker, but that doesn't work for
${FIRST}.

The problem here is really because of the way
ExecParallelRetrieveInstrumentation works. That function normally gets
called once per parallel query, but it could get called multiple times
if workers are relaunched. Each time it's called, it folds the data
from the most recent set of workers into the leader's information by
calling InstrAggNode(). But because InstrAggNode() isn't fully
reversible, that loses data. I think what we probably need to do is
modify the definition of PlanState to have an additional
Instrumentation * pointer -- so that we know the real leader
information without having to try to work backward.

When I was initially working on this stuff, I was mostly aiming to get
done before I ran out of release cycle and to minimize the chances of
people asking for parallel query to be reverted entirely, so I tried
to perturb things as little as possible in this area. It's probably
time to shift the priority from non-reversion to quality.

> The above is about increasing visibility of the leader's
> contribution, but it would also be good to be able to see how the
> leader is hurting you by running the plan itself. Would it make sense
> to instrument the tuple queue interactions so we can show per-worker
> shm_mq_send_bytes wait count and wait time?

Wait count could probably be gathered pretty cheaply, but wait time
would have some significant overhead. Since we'd presumably only be
doing it in the EXPLAIN ANALYZE case, that wouldn't really be a
problem, but I guess the question is how accurate the results would
be.
If it makes a 50 second query take 100 seconds, then any percentages
you derive from it are probably mostly meaningless.

I think we should have an EXPLAIN option that makes the leader and
every worker run getrusage() before and after the query and include
the results in a separate section of the output. This would only be
usable with ANALYZE. Then you could see user/system/elapsed for every
backend involved in the query. That way, you can see that, for
example, the leader has user approximately equal to elapsed, but for
workers user/elapsed = 20% or whatever. Granted, it doesn't tell you
WHY processes were off-CPU, so it might not be the tuple queue, but I
think it would still be pretty useful -- and it'd be very cheap.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company