aditanase opened a new issue, #16106:
URL: https://github.com/apache/datafusion/issues/16106

   ### Describe the bug
   
   I have a simple use case for a star-schema join between a facts table and 2 
metadata tables (one of them small, another one larger but would still make 
sense to collect in memory - 200K rows, ~90 MB). Here is a simplified version 
of this query:
   ```sql
   SELECT
     fm.Timestamp,
     SUM(dm.BasePrice * fm.ConsumedUnits) AS BilledUnits
   FROM dr
     JOIN fm ON fm.ResourceId = dr.ResourceId
     JOIN dm ON fm.MeterId = dm.MeterId
   WHERE
     fm.Timestamp > (now() - interval '10' day) AND
     dm.MeterName = 'Usage' AND
     dr.Region != ''
   GROUP BY fm.Timestamp
   ```
   
   The join is executed on a string column. All 3 tables are deltalake tables 
loaded with `delta-rs`.
   
   My goal is to convince DF to execute the query as described here: ["right 
deep 
tree"](https://github.com/apache/datafusion/blob/f77579108d1dc0285636fbfb24507d2bfca66446/datafusion/physical-plan/src/joins/hash_join.rs#L276-L312).
   
   After tweaking the config a bit - mainly disabling repartition for joins - I 
managed to obtain 2 hash joins with mode `CollectLeft`, but the query will OOM 
trying to collect the facts table after the first join, with the 2nd metadata 
table remaining on the right.
   
   Drilling in the code, this seems to happen because on the 2nd join the 
optimizer will bail since there are no statistics pushed through the Join, and 
left and right are not swapped. The facts table remains in the middle of the 
tree.
   
   The reason appears to be that delta-rs does not generate columns stats for 
strings, [only for "primitive 
types"](https://github.com/delta-io/delta-rs/blob/main/crates/core/src/kernel/snapshot/log_data.rs#L620-L639).
   
   I have test a patch that will swap left and right as a last resort if we 
already have a join with a CollectLeft in the left node - basically trying to 
keep the pshysical shape of a "right deep tree":
   
https://github.com/hstack/arrow-datafusion/commit/eea1ff4344f2d36e134af8e29472137c10a0a5d4
   
   If you think this is a sensible approach, we may want to add a config flag 
to fence this, but otherwise I can raise a PR with this.
   I am wondering how you generally feel about other ways to guide the 
optimizer:
   - enforce min/max for string columns in delta-rs (does not seem like good 
value in general, you can't really compute cardinalities and such)
   - adding some sorf ot join hints in the SQL planner [like we have in 
spark](https://downloads.apache.org/spark/docs/3.0.0/sql-ref-syntax-qry-select-hints.html#join-hints)
   - other ideas?
   
   cc @alamb 
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   CollectLeft optimizations to work for more than 2 tables, even if the join 
key is string and column stats are missing - I thinks this is a common case for 
star schema joins.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to