Re: Parallel grouping sets

Tomas Vondra Thu, 13 Jun 2019 17:44:58 -0700

On Fri, Jun 14, 2019 at 12:02:52PM +1200, David Rowley wrote:

On Fri, 14 Jun 2019 at 11:45, Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:


On Wed, Jun 12, 2019 at 10:58:44AM +0800, Richard Guo wrote:

># explain (costs off, verbose) select c1, c2, avg(c3) from t2 group by
>grouping sets((c1,c2), (c1));
>                          QUERY PLAN
>--------------------------------------------------------------
> Finalize GroupAggregate
>   Output: c1, c2, avg(c3), (gset_id)
>   Group Key: t2.c1, t2.c2, (gset_id)
>   ->  Gather Merge
>         Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>         Workers Planned: 2
>         ->  Sort
>               Output: c1, c2, (gset_id), (PARTIAL avg(c3))
>               Sort Key: t2.c1, t2.c2, (gset_id)
>               ->  Partial HashAggregate
>                     Output: c1, c2, gset_id, PARTIAL avg(c3)
>                     Hash Key: t2.c1, t2.c2
>                     Hash Key: t2.c1
>                     ->  Parallel Seq Scan on public.t2
>                           Output: c1, c2, c3
>(15 rows)
>

OK, I'm not sure I understand the point of this - can you give an
example which is supposed to benefit from this? Where does the speedup
came from?


I think this is a bad example since the first grouping set is a
superset of the 2nd. If those were independent and each grouping set
produced a reasonable number of groups then it may be better to do it
this way instead of grouping by all exprs in all grouping sets in the
first phase, as is done by #1.   To do #2 would require that we tag
the aggregate state with the grouping set that belong to, which seem
to be what gset_id is in Richard's output.


Aha! So if we have grouping sets (a,b) and (c,d), then with the first
approach we'd do partial aggregate on (a,b,c,d) - which may produce
quite a few distinct groups, making it inefficient. But with the second
approach, we'd do just (a,b) and (c,d) and mark the rows with gset_id.

Neat!

In my example upthread the first phase of aggregation produced a group
per input row. Method #2 would work better for that case since it
would only produce 2000 groups instead of 1 million.

Likely both methods would be good to consider, but since #1 seems much
easier than #2, then to me it seems to make sense to start there.


Yep. Thanks for the explanation.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com

PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Parallel grouping sets

Reply via email to