Antonin Houska <a...@cybertec.at> writes: > [ agg_pushdown_v8.tgz ]
I spent a few hours looking at this today. It seems to me that at no point has there been a clear explanation of what the patch is trying to accomplish, so let me see whether I've got it straight or not. (I suggest that something like this ought to be included in optimizer/README; the patch's lack of internal documentation is a serious deficiency.) -- Conceptually, we'd like to be able to push aggregation down below joins when that yields a win, which it could do by reducing the number of rows that have to be processed at the join stage. Thus consider SELECT agg(a.x) FROM a, b WHERE a.y = b.z; We can't simply aggregate during the scan of "a", because some values of "x" might not appear at all in the input of the naively-computed aggregate (if their rows have no join partner in "b"), or might appear multiple times (if their rows have multiple join partners). So at first glance, aggregating before the join is impossible. The key insight of the patch is that we can make some progress by considering only grouped aggregation: if we can group "a" into sets of rows that must all have the same join partners, then we can do preliminary aggregation within each such group, and take care of the number-of-repetitions problem when we join. If the groups are large then this reduces the number of rows processed by the join, at the cost that we might spend time computing the aggregate for some row sets that will just be discarded by the join. So now we consider SELECT agg(a.x) FROM a, b WHERE a.y = b.z GROUP BY ?; and ask what properties the grouping column(s) "?" must have to make this work. I believe we can say that it's OK to push down through a join if its join clauses are all of the form "a.y = b.z", where either a.y or b.z is listed as a GROUP BY column *and* the join operator is equality for the btree opfamily specified by the SortGroupClause. (Note: actually, SortGroupClause per se mentions an equality operator, although I think the planner quickly reduces that to an opfamily specification.) The concerns Robert had about equality semantics are certainly vital, but I believe that the logic goes through correctly as long as the grouping equality operator and the join equality operator belong to the same opfamily. Case 1: the GROUP BY column is a.y. Two rows of "a" whose y values are equal per the grouping operator must join to exactly the same set of "b" rows, else transitivity is failing. Case 2: the GROUP BY column is b.z. It still works, because the set of "a" rows that are equal to a given z value is well-defined, and those rows must join to exactly the "b" rows whose z entries are equal to the given value, else transitivity is failing. Robert postulated cases like "citext = text", but that doesn't apply here because no cross-type citext = text operator could be part of a well-behaved opfamily. What we'd actually be looking at is either "citextvar::text texteq textvar" or "citextvar citexteq textvar::citext", and the casts prevent these expressions from matching GROUP BY entries that have the wrong semantics. However: what we have proven here is only that we can aggregate across a set of rows that must share the same join partners. We still have to be able to handle the case that the rowset has more than one join partner, which AFAICS means that we must use partial aggregation and then apply an "aggmultifn" (or else apply the aggcombinefn N times). We can avoid that and use plain aggregation when we can prove the "b" side of the join is unique, so that no sets of rows will have to be merged post-join; but ISTM that that reduces the set of use cases to be too small to be worth such a complex patch. So I'm really doubtful that we should proceed forward with only that case available. Also, Tomas complained in the earlier thread that he didn't think grouping on the join column was a very common use-case in the first place. I share that concern, but I think we could extend the logic to the case that Tomas posited as being useful: SELECT agg(a.x) FROM a, b WHERE a.y = b.id GROUP BY b.z; where the join column b.id is unique. If we group on a.y (using semantics compatible with the join operator and the uniqueness constraint), then all "a" rows in a given group will join to exactly one "b" row that necessarily has exactly one grouping value, so this group can safely be aggregated together. We might need to combine it post-join with other "b" rows that have equal "z" values, but we can do that as long as we're okay with partial aggregation. I think this example shows why the idea is far more powerful with partial aggregation than without. -- In short, then, I don't have much use for the patch as presented in this thread, without "aggmultifn". That might be OK as the first phase in a multi-step patch, but not as a production feature. I also have no use for the stuff depending on bitwise equality rather than the user-visible operators; that's just a hack, and an ugly one. As far as the patch details go, I've not looked through it in great detail, but it appears to me that you are still trying to cram the same stuff into base relations as before; shoving it into a subsidiary struct doesn't improve that. Multiple people have said that's the wrong thing, and I agree. Conceptually it's a disaster: a single RelOptInfo should represent one well-defined set of result rows, not more than one. This approach also cannot be extended to handle doing aggregations partway up the join tree, which is at least theoretically interesting (though I'm fine with not tackling it right away). I think the right representation is to create UPPERREL_GROUP_AGG RelOptInfos whose relids show the set of baserels joined before aggregating. Currently there's only one of those and its relids is equal to all_baserels (or should be, anyway). This patch would add instances of UPPERREL_GROUP_AGG RelOptInfos with singleton relids, and later we might put in the ability to consider aggregation across other relid subsets, and in any case we'd run join planning on those level-one rels and create new UPPERREL_GROUP_AGG RelOptInfos that represent the intermediate join steps leading up to the final join. The correct indexing structure for this collection of RelOptInfos is certainly not simple_rel_array; it should look more like the way we index the various regular join rels that we consider. (Maybe an appropriate preliminary patch is to refactor the support code associated with join_rel_list + join_rel_hash so that we can treat those fields as one instance of a structure we'll replicate later.) Note that I'm envisioning that we'd basically run join planning a second time on this tree of join rels, rather than try to merge it with the primary join planning. Since the numbers of rows to be processed will be completely different in this join tree, merging it with the standard one seems hopeless. BTW, I noticed some noise in the v8 patch: what's it doing touching src/interfaces/libpq/fe-auth.c or src/interfaces/libpq/fe-secure-common.c? regards, tom lane