skyzh commented on PR #14595: URL: https://github.com/apache/datafusion/pull/14595#issuecomment-2655639889
Okay, I just realized that the plan is not correct. I'll need to do more transformation to make it work :( So back to the semantics of the lateral join, ``` SELECT * FROM t0, LATERAL (SELECT sum(v1) FROM t1 WHERE t0.v0 = t1.v0); ``` is basically saying that "for every row in t0, run the lateral query", which is equivalent to the following program: ``` for row in t0: lateral_result = run_query(SELECT sum(v1) FROM t1 WHERE t0.v0 = t1.v0) if lateral_result is not empty output.append(row ++ lateral_result) ``` Note that sum will produce 1 row of null even when the right table doesn't contain any matching row. The produced plan with this patch is: ``` Inner Join t0.v0 = t1.v0 Scan t0 Aggregate: groupBy=[[t1.v0]], aggr=[[sum(t1.v1)]] Scan t1 ``` If t0 = (0, 0), (1, 1) and t1 = (0, 0), the output should have 2 rows, while this plan will only produce 1 row. The correct plan should be: ``` Left Outer Join t0.v0 = t1.v0 Scan t0 Aggregate: groupBy=[[t1.v0]], aggr=[[sum(t1.v1)]] Scan t1 ``` And aggregations like `count(*)` needs to be handled separately, which I already see some code on that in the codebase. Back to the DuckDB plan. They are using the Hyper subquery unnesting. The idea of Hyper is that it will first eliminate duplicates on the join column. Therefore, in the case where the left table has a lot of duplicated rows, it only needs to compute the subquery once for all such rows. So what they are doing is that: eliminate the duplicates from the left table, then compute the subquery result, and finally join it back to the original one, which will be planned like: ``` ^1: Left Outer Join t0.v0 = t1.v0 DistinctScan/DelimScan t0 Aggregate: groupBy=[[t1.v0]], aggr=[[sum(t1.v1)]] Scan t1 ``` which computes the right side row for every value in the left table. We will get (distinct t0.*, sum(t1.v1)) at this step. And then, deduplicate the result of the subquery (this step seems unnecessary?) ``` ^2: Dedup (HashJoin) ^1 ``` And finally, join the result back to the original table, ``` HashJoin Scan t0 ^2 ``` The result should be the same as the corrected plan. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org