Hi Hackers,

Richard pointed out that he got incorrect results with the patch I attached; there
were bugs in it. I have fixed them now and attached the newest version; please
refer to [1] for the fix.
Thanks,
Pengzhou

References:
[1] https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3

On Mon, Sep 30, 2019 at 5:41 PM Pengzhou Tang <pt...@pivotal.io> wrote:

> Hi Richard & Tomas:
>
> I followed the idea of the second approach and added a gset_id to the
> targetlist of the first stage of grouping sets, using it to combine the
> aggregates in the final stage. The gset_id stuff is still kept because
> GROUPING() cannot uniquely identify a grouping set: grouping sets may
> contain duplicate sets, e.g. GROUP BY GROUPING SETS ((c1, c2), (c1, c2))
> (a minimal illustration follows below the quoted thread).
>
> There are some differences in how the second approach is implemented
> compared to Richard's original idea: gset_id is not used as an additional
> group key in the final stage; instead, we use it to dispatch each input
> tuple directly to the specified grouping set and then do the aggregation.
> One advantage of this is that we can handle multiple rollups with better
> performance, without an Append node.
>
> The plan now looks like:
>
> gpadmin=# explain select c1, c2 from gstest group by grouping sets(rollup(c1, c2), rollup(c3));
>                                          QUERY PLAN
> --------------------------------------------------------------------------------------------
>  Finalize MixedAggregate  (cost=1000.00..73108.57 rows=8842 width=12)
>    Dispatched by:  (GROUPINGSETID())
>    Hash Key: c1, c2
>    Hash Key: c1
>    Hash Key: c3
>    Group Key: ()
>    Group Key: ()
>    ->  Gather  (cost=1000.00..71551.48 rows=17684 width=16)
>          Workers Planned: 2
>          ->  Partial MixedAggregate  (cost=0.00..68783.08 rows=8842 width=16)
>                Hash Key: c1, c2
>                Hash Key: c1
>                Hash Key: c3
>                Group Key: ()
>                Group Key: ()
>                ->  Parallel Seq Scan on gstest  (cost=0.00..47861.33 rows=2083333 width=12)
> (16 rows)
>
> gpadmin=# set enable_hashagg to off;
> gpadmin=# explain select c1, c2 from gstest group by grouping sets(rollup(c1, c2), rollup(c3));
>                                                QUERY PLAN
> ----------------------------------------------------------------------------------------------------------
>  Finalize GroupAggregate  (cost=657730.66..663207.45 rows=8842 width=12)
>    Dispatched by:  (GROUPINGSETID())
>    Group Key: c1, c2
>    Sort Key: c1
>      Group Key: c1
>      Group Key: ()
>      Group Key: ()
>    Sort Key: c3
>      Group Key: c3
>    ->  Sort  (cost=657730.66..657774.87 rows=17684 width=16)
>          Sort Key: c1, c2
>          ->  Gather  (cost=338722.94..656483.04 rows=17684 width=16)
>                Workers Planned: 2
>                ->  Partial GroupAggregate  (cost=337722.94..653714.64 rows=8842 width=16)
>                      Group Key: c1, c2
>                      Group Key: c1
>                      Group Key: ()
>                      Group Key: ()
>                      Sort Key: c3
>                        Group Key: c3
>                      ->  Sort  (cost=337722.94..342931.28 rows=2083333 width=12)
>                            Sort Key: c1, c2
>                            ->  Parallel Seq Scan on gstest  (cost=0.00..47861.33 rows=2083333 width=12)
>
> References:
> [1] https://github.com/greenplum-db/postgres/tree/parallel_groupingsets_3
>
> On Wed, Jul 31, 2019 at 4:07 PM Richard Guo <ri...@pivotal.io> wrote:
>
>> On Tue, Jul 30, 2019 at 11:05 PM Tomas Vondra <tomas.von...@2ndquadrant.com> wrote:
>>
>>> On Tue, Jul 30, 2019 at 03:50:32PM +0800, Richard Guo wrote:
>>> >On Wed, Jun 12, 2019 at 10:58 AM Richard Guo <ri...@pivotal.io> wrote:
>>> >
>>> >> Hi all,
>>> >>
>>> >> Paul and I have been hacking recently to implement parallel grouping
>>> >> sets, and here we have two implementations.
>>> >>
>>> >> Implementation 1
>>> >> ================
>>> >>
>>> >> Attached is the patch and also there is a github branch [1] for this
>>> >> work.
>>> >>
>>> >
>>> >Rebased with the latest master.
>>> >
>>>
>>> Hi Richard,
>>>
>>> thanks for the rebased patch. I think the patch is mostly fine (at least
>>> I don't see any serious issues). A couple minor comments:
>>>
>>
>> Hi Tomas,
>>
>> Thank you for reviewing this patch.
>>
>>
>>> 1) I think get_number_of_groups() would deserve a short explanation why
>>> it's OK to handle (non-partial) grouping sets and regular GROUP BY in the
>>> same branch. Before, these cases were clearly separated; now it seems a
>>> bit mixed up and it may not be immediately obvious why it's OK.
>>>
>>
>> Added a short comment in get_number_of_groups() explaining the behavior
>> when doing partial aggregation for grouping sets.
>>
>>
>>> 2) There are new regression tests, but they are not added to any schedule
>>> (parallel or serial), and so are not executed as part of "make check". I
>>> suppose this is a mistake.
>>>
>>
>> Yes, thanks. Added the new regression tests to parallel_schedule and
>> serial_schedule.
>>
>>
>>> 3) The regression tests do check plan and results like this:
>>>
>>>    EXPLAIN (COSTS OFF, VERBOSE) SELECT ...;
>>>    SELECT ... ORDER BY 1, 2, 3;
>>>
>>> which however means that the query might easily use a different plan than
>>> what's verified in the explain (thanks to the additional ORDER BY clause).
>>> So I think this should explain and execute the same query.
>>>
>>> (In this case the plans seem to be the same, but that may easily change
>>> in the future, and we could miss it here, failing to verify the results.)
>>>
>>
>> Thank you for pointing this out. Fixed it in the V4 patch.
>>
>>
>>> 4) It might be a good idea to check the negative case too, i.e. a query
>>> on a data set that we should not parallelize (because the number of
>>> partial groups would be too high).
>>>
>>
>> Yes, agree. Added a negative case.
>>
>>
>>> Do you have any plans to hack on the second approach too? AFAICS those
>>> two approaches are complementary (address different data sets / queries),
>>> and it would be nice to have both. One of the things I've been wondering
>>> is if we need to invent gset_id as a new concept, or if we could simply
>>> use the existing GROUPING() function - that uniquely identifies the
>>> grouping set.
>>>
>>
>> Yes, I'm planning to hack on the second approach in the near future. I'm
>> also reconsidering the gset_id stuff since it brings a lot of complexity
>> for the second approach. I agree with you that we can try the GROUPING()
>> function to see if it can replace gset_id.
>>
>> Thanks
>> Richard
>>
>
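To make the GROUPING() question discussed above concrete, here is a minimal sketch,
assuming the gstest table from the plans above has integer columns c1, c2 and c3
(its actual definition is not shown in this thread). With duplicate grouping sets,
GROUPING() returns the same value for rows produced by either set, so it cannot tell
the final stage which set a partially aggregated row came from; that is what the
gset_id / GROUPINGSETID() dispatch is meant to solve:

    -- hypothetical table definition, for illustration only
    CREATE TABLE gstest (c1 int, c2 int, c3 int);

    -- both duplicate sets group by (c1, c2), so GROUPING(c1, c2) is 0 for
    -- every output row and cannot distinguish which of the two sets a row
    -- belongs to
    SELECT c1, c2, GROUPING(c1, c2) AS g
    FROM gstest
    GROUP BY GROUPING SETS ((c1, c2), (c1, c2));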
v2-0001-Support-for-parallel-grouping-sets.patch
Description: Binary data