skyzh commented on PR #14595:
URL: https://github.com/apache/datafusion/pull/14595#issuecomment-2655639889

   Okay, I just realized that the plan is not correct. I'll need to do more 
transformation to make it work :(
   
   So back to the semantics of the lateral join,
   
   ```
   SELECT *
     FROM t0,
     LATERAL (SELECT sum(v1) FROM t1 WHERE t0.v0 = t1.v0);
   ```
   
   is basically saying that "for every row in t0, run the lateral query", which 
is equivalent to the following program:
   
   ```
   for row in t0:
     lateral_result = run_query(SELECT sum(v1) FROM t1 WHERE t0.v0 = t1.v0)
     if lateral_result is not empty
       output.append(row ++ lateral_result) 
   ```
   
   Note that sum will produce 1 row of null even when the right table doesn't 
contain any matching row.
   
   The produced plan with this patch is:
   
   ```
   Inner Join t0.v0 = t1.v0
     Scan t0
     Aggregate: groupBy=[[t1.v0]], aggr=[[sum(t1.v1)]]
       Scan t1
   ```
   
   If t0 = (0, 0), (1, 1) and t1 = (0, 0), the output should have 2 rows, while 
this plan will only produce 1 row.
   
   The correct plan should be:
   
   ```
   Left Outer Join t0.v0 = t1.v0
     Scan t0
     Aggregate: groupBy=[[t1.v0]], aggr=[[sum(t1.v1)]]
       Scan t1
   ```
   
   And aggregations like `count(*)` needs to be handled separately, which I 
already see some code on that in the codebase.
   
   Back to the DuckDB plan. They are using the Hyper subquery unnesting. The 
idea of Hyper is that it will first eliminate duplicates on the join column. 
Therefore, in the case where the left table has a lot of duplicated rows, it 
only needs to compute the subquery once for all such rows. So what they are 
doing is that: eliminate the duplicates from the left table, then compute the 
subquery result, and finally join it back to the original one, which will be 
planned like:
   
   ```
   ^1:
   Left Outer Join t0.v0 = t1.v0
     DistinctScan/DelimScan t0
     Aggregate: groupBy=[[t1.v0]], aggr=[[sum(t1.v1)]]
       Scan t1
   ```
   
   which computes the right side row for every value in the left table. We will 
get (distinct t0.*, sum(t1.v1)) at this step. And then, deduplicate the result 
of the subquery (this step seems unnecessary?)
   
   ```
   ^2:
   Dedup (HashJoin)
     ^1
   ```
   
   And finally, join the result back to the original table,
   
   ```
   HashJoin
     Scan t0
     ^2
   ```
   
   The result should be the same as the corrected plan.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to