On Fri, Jan 17, 2025 at 6:40 AM Tom Lane <t...@sss.pgh.pa.us> wrote:
> * The README addition, and the basically identical text in the
> commit message, don't even provide a reason to believe that the
> transformation is correct let alone that it will result in faster
> execution. I don't understand why it's so hard to provide a solid
> correctness argument. This work was supposedly based on an academic
> paper; surely that paper must have included a correctness proof?
> PG might need a few refinements, like being specific about what we
> expect from the equality operators. But an EXPLAIN plan is not an
> argument.
Thank you for taking a look at this patch!

In the README, I provided the justification for the correctness of this
transformation as follows:

    For the partial aggregation that is pushed down to a non-aggregated
    relation, we need to consider all expressions from this relation
    that are involved in upper join clauses and include them in the
    grouping keys, using compatible operators. This is essential to
    ensure that an aggregated row from the partial aggregation matches
    the other side of the join if and only if each row in the partial
    group does. This ensures that all rows within the same partial
    group share the same 'destiny', which is crucial for maintaining
    correctness.

I believed that this explanation would make it clear why this
transformation is correct.

Yeah, this work implements one of the transformations introduced in the
paper "Eager Aggregation and Lazy Aggregation". In the paper, Section 4
presents the formalism, Section 5 proves the main theorem, and Section 6
introduces corollaries related to this specific transformation. I'm
just not sure how to translate these theorems and corollaries into
natural language suitable for the README. I can give it another try if
you find the above justification not clear enough, but it would be
really helpful if I could get some assistance with this.

And I'd like to clarify that the EXPLAIN plan included in the README is
only meant to illustrate what this transformation looks like; it is not
intended to serve as an argument for its correctness.

> * As for the performance aspect, we're given
>
>  Finalize HashAggregate
>    Group Key: a.i
>    ->  Nested Loop
>          ->  Partial HashAggregate
>                Group Key: b.j
>                ->  Seq Scan on b
>          ->  Index Only Scan using a_pkey on a
>                Index Cond: (i = b.j)
>
> As far as I can see, this will require aggregation to be performed
> across every row of "b", whereas the naive way would have aggregated
> across only rows having join partners in "a".

Yes, that's correct.

> If most "b" rows lack
> a join partner then this will be far slower than the naive way.

No, this is not correct. The partial aggregation may reduce the number
of input rows to the join, and the resulting data reduction could
justify the cost of performing the partial aggregation. As an example,
please consider:

create table t1 (a int, b int, c int);
create table t2 (a int, b int, c int);

insert into t1 select i%3, i%3, i from generate_series(1,1000000)i;
insert into t2 select i%3+3, i%3+3, i from generate_series(1,1000000)i;

analyze t1, t2;

explain analyze
select sum(t2.c) from t1 join t2 on t1.b = t2.b group by t1.a;

So for this query, most (actually all) t2 rows lack a join partner.
Running it with and without eager aggregation, I got (best of 3):

-- with eager aggregation
Execution Time: 496.856 ms

-- without eager aggregation
Execution Time: 1723.844 ms

> I do see that it can be better if most "b" rows have multiple join
> partners, because we'll re-use partial aggregation results instead
> of (effectively) recalculating them.

Not only because we'll re-use partial aggregation results, but also
(and perhaps more importantly) because the number of input rows to the
join could be significantly reduced.
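To make the row-reduction effect concrete, here is a hand-written
rewrite that mimics what the eager-aggregation plan does for the query
above. This is only an illustrative sketch (the planner builds
Partial/Finalize Aggregate nodes internally rather than a subquery, and
"t2p"/"psum" are just names I made up for it):

-- partially aggregate t2 on the join key before the join, then combine
-- the per-group sums for each t1.a group
select sum(t2p.psum)
from t1
join (select b, sum(c) as psum from t2 group by b) t2p
  on t1.b = t2p.b
group by t1.a;

Because the subquery's grouping key is exactly the join key t2.b, every
row of a partial group matches (or fails to match) the same t1 rows, so
summing the partial sums gives the same result as summing the raw rows.
The join, however, now only sees 3 rows from the t2 side (one per
distinct value of t2.b) instead of 1,000,000.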
> But the README text makes it
> sound like this is an unconditional win, which is not the right
> mindset.

I'm sorry if the README text gives that impression. The README says:

    If the partial aggregation on table B significantly reduces the
    number of input rows, the join above will be much cheaper, leading
    to a more efficient final plan.

Perhaps I should use "could" or "might" instead of "will" to make it
less misleading. But as you can see from the implementation, the
decision is entirely based on cost, not on rules. There is no part of
the code that ever assumes this transformation is an unconditional win.

> (In fact, in this specific example where a.i is presumed
> unique, how's it a win at all?)

It seems to me that whether it's a win depends on whether b.j is a
column with low cardinality (i.e., relatively few unique values). I
don't really see how a.i being unique would change that. Please see
the example below:

create table a (i int primary key, x int);
create table b (j int, y int);

insert into a select i, i%3 from generate_series(1,10000)i;
insert into b select i%3, i from generate_series(1,10000)i;

analyze a, b;

set enable_eager_aggregate to off;

EXPLAIN (ANALYZE, COSTS OFF)
SELECT a.i, avg(b.y) FROM a JOIN b ON a.i > b.j GROUP BY a.i;
                                             QUERY PLAN
--------------------------------------------------------------------------------------------------
 HashAggregate (actual time=100257.254..100268.841 rows=10000 loops=1)
   Group Key: a.i
   Batches: 1  Memory Usage: 2193kB
   Buffers: shared hit=133
   ->  Nested Loop (actual time=2.629..40849.630 rows=99990000 loops=1)
         Buffers: shared hit=133
         ->  Seq Scan on b (actual time=0.450..10.066 rows=10000 loops=1)
               Buffers: shared hit=45
         ->  Memoize (actual time=0.002..0.752 rows=9999 loops=10000)
               Cache Key: b.j
               Cache Mode: binary
               Hits: 9997  Misses: 3  Evictions: 0  Overflows: 0  Memory Usage: 1055kB
               Buffers: shared hit=88
               ->  Index Only Scan using a_pkey on a (actual time=0.752..8.100 rows=9999 loops=3)
                     Index Cond: (i > b.j)
                     Heap Fetches: 0
                     Buffers: shared hit=88
 Planning Time: 1.681 ms
 Execution Time: 100273.011 ms
(19 rows)

set enable_eager_aggregate to on;

EXPLAIN (ANALYZE, COSTS OFF)
SELECT a.i, avg(b.y) FROM a JOIN b ON a.i > b.j GROUP BY a.i;
                                          QUERY PLAN
----------------------------------------------------------------------------------------------
 Finalize HashAggregate (actual time=77.701..90.680 rows=10000 loops=1)
   Group Key: a.i
   Batches: 1  Memory Usage: 2193kB
   Buffers: shared hit=133
   ->  Nested Loop (actual time=27.586..52.352 rows=29997 loops=1)
         Buffers: shared hit=133
         ->  Partial HashAggregate (actual time=27.408..27.419 rows=3 loops=1)
               Group Key: b.j
               Batches: 1  Memory Usage: 24kB
               Buffers: shared hit=45
               ->  Seq Scan on b (actual time=0.173..3.767 rows=10000 loops=1)
                     Buffers: shared hit=45
         ->  Index Only Scan using a_pkey on a (actual time=0.108..5.277 rows=9999 loops=3)
               Index Cond: (i > b.j)
               Heap Fetches: 0
               Buffers: shared hit=88
 Planning Time: 1.739 ms
 Execution Time: 93.003 ms
(18 rows)

There is a performance improvement of ~1000 times, even though a.i is
unique.

# select 100273.011/93.003;
       ?column?
-----------------------
 1078.1696396890422890
(1 row)

(I used 'a.i > b.j' instead of 'a.i = b.j' to make the performance
difference more noticeable. I believe this is fine, as it doesn't
undermine the fact that a.i is unique.)
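Just to spell out the "low cardinality" point for this data set (an
illustrative check on the example tables above, not something the patch
itself runs): b.j was populated as i%3, so the partial aggregation
collapses b to only 3 groups before the join, which is exactly the
rows=3 reported by the Partial HashAggregate node above.

-- b has 10000 rows but only 3 distinct join-key values
select count(*) as b_rows, count(distinct j) as distinct_j from b;
-- expected: b_rows = 10000, distinct_j = 3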
> * I'm also concerned about what happens with aggregates that can have
> large partial-aggregation values, such as string_agg(). With the
> existing usage of partial aggregation for parallel queries, it's
> possible to be confident that there are not many partial-aggregation
> values in existence at the same time. I don't think that holds for
> pushed-down aggregates: for example, I wouldn't be surprised if the
> planner chooses a join plan that requires stuffing all those values
> into a hash table, or "materializes" the output of the partial
> aggregation step. Do we have logic that will avoid blowing out
> memory during such queries?

Good point! Thank you for bringing this up. I hadn't considered it
before, and it seems no one else has raised this issue. I'll look into
it.

> * I am just as worried as Robert is about the notion of different
> paths for the same RelOptInfo having different rowcount estimates.
> That is an extremely fundamental violation of basic planner
> assumptions. We did bend it for parameterized paths by restating
> those assumptions as (from optimizer/README):
>
> To keep cost estimation rules relatively simple, we make an implementation
> restriction that all paths for a given relation of the same parameterization
> (i.e., the same set of outer relations supplying parameters) must have the
> same rowcount estimate. This is justified by insisting that each such path
> apply *all* join clauses that are available with the named outer relations.
>
> I don't see any corresponding statement here, and it's not clear
> to me that the point has been thought through adequately.
>
> Another aspect that bothers me is that a RelOptInfo is understood
> to contain a bunch of paths that all yield the same data (the same
> set of columns), and it seems like that might not be the case here.
> Certainly partially-aggregated paths will output something different
> than unaggregated ones, but mightn't different join orders mutate the
> column set even further?
>
> I think that we might be better off building a separate RelOptInfo for
> each way of pushing down the aggregates, in order to preserve the
> principle that all the paths in any one RelOptInfo have the same
> output. This'll mean more RelOptInfos, but not more paths, so
> I doubt it adds that much performance overhead.

Hmm, IIUC, this means that we would separate the grouped paths of the
same grouped relation into different RelOptInfos based on the location
of the partial aggregation within the path tree. Let's define the
"location" as the relids of the relation on top of which we place the
partial aggregation.

For grouped relation {A B C D}, if we perform some aggregation on C, we
would end up with 8 different grouped paths:

{A B D PartialAgg(C)}
{B D PartialAgg(A C)}
{A D PartialAgg(B C)}
{A B PartialAgg(D C)}
{D PartialAgg(A B C)}
{B PartialAgg(A D C)}
{A PartialAgg(B D C)}
{PartialAgg(A B D C)}

That means we would need to create 8 RelOptInfos for this grouped
relation. If my math doesn't fail me, for a relation containing n base
rels, we would need to create 2^(n-1) different RelOptInfos.

When building grouped relation {A B C D E} by joining {A B C D} with
{E}, we would need to call make_grouped_join_rel() 8 times, each time
joining {E} with one of the 8 RelOptInfos mentioned above. And in the
end, considering other join orders such as joining {A B C E} with {D},
this new grouped relation would end up with 16 new RelOptInfos. And
then we proceed with building grouped relation {A B C D E F}, and end
up with 32 new RelOptInfos, and this process continues...

It seems to me that this doesn't only result in more RelOptInfos; it
can also lead to more paths. Consider two grouped paths, say P1 and P2,
for the same grouped relation, but with different locations of the
partial aggregation. Suppose P1 is cheaper, at least as well ordered,
generates no more rows, requires no outer rels not required by P2, and
is no less parallel-safe. If these two paths are kept in the same
RelOptInfo, P2 will be discarded and not considered in further
planning.
However, if P1 and P2 are separated into different RelOptInfos, and P2
happens to have survived the add_path() tournament for the RelOptInfo
it is in, then it will be considered in subsequent planning steps. So
in any case, this doesn't seem like a feasible approach to me.

I also have some thoughts on grouped paths and parameterized paths, but
I've run out of time for today. I'll send a separate email.

I'm really glad you're taking a look at this patch. Thank you!

Thanks
Richard