andygrove opened a new pull request, #3554:
URL: https://github.com/apache/datafusion-comet/pull/3554

   ## Summary
   
   - Add per-partition size check and size ratio check to `RewriteJoin`, 
mirroring Spark's own `JoinSelection` logic (`canBuildLocalHashMapBySize()` and 
`muchSmaller()`)
   - Previously, enabling `spark.comet.exec.replaceSortMergeJoin` would 
unconditionally rewrite **all** SortMergeJoins to ShuffledHashJoins, risking 
OOM when the build side is too large
   - Now the rewrite only happens when the build side is estimated to fit in 
memory and is significantly smaller than the probe side
   - Add new config `spark.comet.exec.replaceSortMergeJoin.sizeRatio` (default: 
3) matching Spark's `SHUFFLE_HASH_JOIN_FACTOR`
   - Log reasons when rewrite is skipped (visible with 
`explainFallback.enabled=true`)
   
   ## Details
   
   **Per-partition size check:** `buildSize < autoBroadcastJoinThreshold * 
numShufflePartitions`
   - Reuses Spark's existing configs — no new threshold to configure
   - When `autoBroadcastJoinThreshold = -1` (broadcast disabled), this check is 
skipped and only the size ratio check applies
   
   **Size ratio check:** `buildSize * sizeRatio <= probeSize`
   - Default ratio of 3 matches Spark's `spark.sql.shuffledHashJoinFactor`
   - Ensures hash join is only used when the probe side is at least `sizeRatio` 
times the build side, where it has a clear advantage over sort-merge
   
   **Safe fallback:** When no logical plan statistics are available, the 
rewrite is skipped conservatively.
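
The checks above can be sketched as a single decision function. This is a minimal, self-contained illustration and not the actual `RewriteJoin` code: the object name `RewriteJoinSketch`, the method `skipReason`, its parameters, and the reason strings are all hypothetical, and treating any negative threshold (not only `-1`) as "disabled" is an assumption.

```scala
// Hypothetical sketch of the decision described above; not Comet's implementation.
object RewriteJoinSketch {

  /**
   * Returns None when the SortMergeJoin may be rewritten to a hash join,
   * or Some(reason) when the rewrite should be skipped.
   *
   * Sizes are estimated byte counts; None means no statistics are available.
   */
  def skipReason(
      buildSize: Option[BigInt],
      probeSize: Option[BigInt],
      autoBroadcastJoinThreshold: Long,
      numShufflePartitions: Int,
      sizeRatio: Int = 3): Option[String] = {
    (buildSize, probeSize) match {
      case (Some(build), Some(probe)) =>
        // Per-partition size check; skipped when broadcast is disabled.
        // Assumption: any negative threshold counts as disabled.
        val perPartitionOk =
          autoBroadcastJoinThreshold < 0 ||
            build < BigInt(autoBroadcastJoinThreshold) * numShufflePartitions

        if (!perPartitionOk) {
          Some("build side is too large for a per-partition hash map")
        } else if (build * sizeRatio > probe) {
          // Size ratio check: buildSize * sizeRatio <= probeSize must hold.
          Some("build side is not much smaller than the probe side")
        } else {
          None
        }
      case _ =>
        // Safe fallback: without statistics, keep the SortMergeJoin.
        Some("no statistics available")
    }
  }
}
```

For example, with a threshold of 10 bytes and 200 shuffle partitions the per-partition limit is 2000 bytes: a 100-byte build side against a 300-byte probe side passes both checks, while a 5000-byte build side is rejected unless the threshold is `-1`, in which case only the ratio check decides.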
   
   ## Test plan
   
   - [ ] Verify existing `CometJoinSuite` tests still pass
   - [ ] Test with TPC-H/TPC-DS to confirm joins are correctly selected
   - [ ] Test with `explainFallback.enabled=true` to verify skip reasons are 
logged
   - [ ] Test edge case: `autoBroadcastJoinThreshold=-1` still allows rewrite 
when size ratio is met
   
   🤖 Generated with [Claude Code](https://claude.com/claude-code)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

