On Wed, Jan 20, 2016 at 8:32 PM, David Rowley
<david.row...@2ndquadrant.com> wrote:
> At the moment I think everything which will use this is queued up behind
> the pathification of the grouping planner which Tom is working on. I
> think naturally Parallel Aggregate makes sense to work on first, given
> all the other parallel stuff in this release. I plan on working on that
> by either assisting Haribabu, or... whatever else it takes.
Great!

> The other two usages which I have thought of are:
>
> 1) Aggregating before UNION ALL, which might be fairly simple after the
> grouping planner changes, as it may just be a matter of considering
> another "grouping path" which partially aggregates before the UNION ALL
> and performs the final grouping stage after UNION ALL. At this stage
> it's hard to say how that will work, as I'm not sure how far changes to
> the grouping planner will go. Perhaps Tom can comment?

I hope he will, but in the meantime let me ask how this does us any
good. UNION ALL turns into an Append node, and pushing aggregation
through an Append doesn't make anything faster AFAICS *unless* you can
further optimize beginning at that point. For example, if one or both
sides of the Append node have a Gather under them and you can push the
partial aggregation down into the Gather, or if you hit a Foreign Scan
and can have it do the partial aggregation on the remote side, you win.
But if you just have:

Finalize Aggregate
-> Append
   -> Partial Aggregate
      -> Thing One
   -> Partial Aggregate
      -> Thing Two

instead of:

Aggregate
-> Append
   -> Thing One
   -> Thing Two

...then I don't see the point, at least not for a single-group
Aggregate or HashAggregate. For a GroupAggregate, Thing One and Thing
Two might need to be sorted, and sorting two or more smaller data sets
might be faster than sorting one larger data set, but I can't see us
winning anything very big here.

To be clear, I'm not saying we shouldn't do this. I just don't think it
does much *by itself*. If you combine it with other optimizations that
let the aggregation be pushed further down the plan tree, it could win
big.

> 2) Group before join. e.g. select p.description,sum(s.qty) from sale s
> inner join s.product_id = p.product_id group by s.product_id group by
> p.description; I have a partial patch which implements this, although I
> was a bit stuck on whether I should invent the concept of
> "GroupingPaths" or just inject alternative subquery relations which are
> already grouped by the correct clause. The problem with "GroupingPaths"
> was down to the fact that row estimates currently come from the
> RelOptInfo and are set in set_baserel_size_estimates(), which always
> assumes the ungrouped number of rows; that's not what's needed if the
> grouping has already been performed. I was holding off to see how Tom
> does this in the grouping planner changes.

Your sample query has two GROUP BY clauses; I think you meant to
include only the second one. This optimization seems like it can win
big, because joining first could mean joining against a much larger
relation, whereas aggregating first might reduce "sale" to a volume of
data that fits in work_mem, after which you can hash join. On the other
hand, if you add WHERE p.description LIKE '%snuffleupagus%', then the
aggregate-first strategy figures to lose, assuming things are indexed
properly.
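(So that we're picturing the same thing, I assume the query you
intended, with the join clause restored and only the second GROUP BY
kept, is roughly:

SELECT p.description, SUM(s.qty)
FROM sale s
INNER JOIN product p ON s.product_id = p.product_id
GROUP BY p.description;

and the aggregate-first strategy would produce a plan shaped something
like this sketch, which is not real EXPLAIN output:

Finalize HashAggregate
  Group Key: p.description
  -> Hash Join
       Hash Cond: (s.product_id = p.product_id)
       -> Seq Scan on product p
       -> Hash
            -> Partial HashAggregate
                 Group Key: s.product_id
                 -> Seq Scan on sale s

The Partial HashAggregate is what hopefully shrinks "sale" enough for
the hash table to fit in work_mem, and the Finalize step is still
needed above the join because several product_ids can share one
description.)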
On the substantive question you raise, I hope Tom will weigh in here;
in the meantime, I'll give you my take. It seems to me that you are
likely to run into problems not only with the row counts but also with
the target lists and width estimates: all of those things change once
you aggregate. It seems clear that you are going to need a GroupingPath
here, but it also seems really clear that you can't take a path where
some aggregation has been done and use add_path to toss it on the pile
for the same RelOptInfo as unaggregated paths to which it is not
comparable.

Right now, there's a RelOptInfo corresponding to each subset of the
joinrels for which we build paths; but in this new world, I think
you'll end up with multiple RelOptInfos for each such subset, one per
aggregation applied to it (including none at all). In your example
above, you'd end up with RelOptInfos for (1) {s} with no aggregation,
(2) {p} with no aggregation, (3) {s p} with no aggregation, (4) {s}
aggregated by s.product_id, and (5) {s p} aggregated by p.description.
You can generate paths for (5) either by joining (2) to (4) or by
aggregating (3), and the cheapest path for (5) is the overall winner.
It seems a bit tricky to determine which aggregations are worth
considering and how to represent that in the RelOptInfo, but overall
that feels like the right idea.

I'm not sure how much this is going to overlap with the work Tom is
already doing. I think his principal goal is to let the planning of a
subquery return multiple candidate paths. That's almost certain to
involve creating something like GroupingPath, though I can't predict
the details, and I suspect it's also likely that he'll create some new
RelOptInfo that represents the output of the grouping stage. However,
I'm not sure that RelOptInfo will have the aggregation that has been
performed represented in any principled way that would lend itself to
having more than one RelOptInfo of that kind; that's probably another
problem altogether. Then again, my Tom-predictive capabilities are far
from perfect.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company