andygrove opened a new pull request, #3564: URL: https://github.com/apache/datafusion-comet/pull/3564
## Summary

- Adds a new Grace Hash Join operator that partitions both build and probe sides into hash buckets, spilling to disk when memory is tight, then joining each partition independently
- Supports all join types (Inner, Left, Right, Full, LeftSemi, LeftAnti) with build-side selection
- Uses incremental Arrow IPC spill I/O via `SpillWriter` for efficient disk access
- Includes recursive repartitioning (up to 3 levels) for oversized partitions
- Adds `CometGraceHashJoinExec` Spark operator with production metrics (build/probe/join time, spill count/bytes)
- Controlled by `spark.comet.exec.graceHashJoin.enabled` (default false) and `spark.comet.exec.graceHashJoin.numPartitions` (default 16)
- Includes comprehensive deterministic tests, fuzz tests with ParquetGenerator, and join microbenchmarks comparing Spark SMJ, Comet SMJ, Comet Hash Join, and Comet Grace Hash Join

## Test plan

- [x] All existing CometJoinSuite tests pass
- [x] 8 new deterministic tests covering all join types, filters, data types, empty tables, self-join, build side, plan verification, multi-key
- [x] 2 fuzz tests using ParquetGenerator with and without spilling
- [x] Rust unit tests for basic join and no-overlap scenarios
- [x] `cargo clippy` passes with no warnings
- [x] Join microbenchmark added (`CometJoinBenchmark`)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
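For readers unfamiliar with the technique named in the summary, the partition-then-join idea with bounded recursive repartitioning can be sketched roughly as follows. This is a minimal in-memory illustration, not the code in this PR: it handles only an inner join on `i64` keys, keeps every partition in a `Vec` instead of spilling Arrow IPC batches to disk, and all names (`grace_hash_join`, `join_level`, `max_build_rows`, and so on) are hypothetical. The depth limit of 3 mirrors the figure quoted in the summary; the per-level hash seed is an assumption about how a repartitioning step might avoid reproducing the same split.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Hypothetical row shapes for this sketch: (join key, payload).
type Row = (i64, String);
type Joined = (i64, String, String);

// Recursion depth limit, taken from the "up to 3 levels" note in the summary.
const MAX_RECURSION: usize = 3;

/// Pick a bucket for a key. The level is mixed into the hash so that
/// repartitioning an oversized bucket does not reproduce the same split.
fn bucket(key: i64, level: usize, num_partitions: usize) -> usize {
    let mut h = DefaultHasher::new();
    (level as u64).hash(&mut h);
    key.hash(&mut h);
    (h.finish() as usize) % num_partitions
}

/// Split rows into `num_partitions` buckets by hashed key.
/// Assumes `num_partitions > 0`.
fn partition(rows: Vec<Row>, level: usize, num_partitions: usize) -> Vec<Vec<Row>> {
    let mut parts: Vec<Vec<Row>> = (0..num_partitions).map(|_| Vec::new()).collect();
    for row in rows {
        let b = bucket(row.0, level, num_partitions);
        parts[b].push(row);
    }
    parts
}

/// Plain in-memory hash join (inner) of one build/probe partition pair.
fn hash_join(build: &[Row], probe: &[Row], out: &mut Vec<Joined>) {
    let mut table: HashMap<i64, Vec<&str>> = HashMap::new();
    for (k, v) in build {
        table.entry(*k).or_default().push(v.as_str());
    }
    for (k, pv) in probe {
        if let Some(matches) = table.get(k) {
            for bv in matches {
                out.push((*k, (*bv).to_owned(), pv.clone()));
            }
        }
    }
}

/// Partition both sides at `level`, then join each bucket pair independently.
/// A bucket whose build side is still too large is repartitioned one level
/// deeper, until MAX_RECURSION is reached.
fn join_level(
    build: Vec<Row>,
    probe: Vec<Row>,
    num_partitions: usize,
    max_build_rows: usize,
    level: usize,
    out: &mut Vec<Joined>,
) {
    let build_parts = partition(build, level, num_partitions);
    let probe_parts = partition(probe, level, num_partitions);
    for (b, p) in build_parts.into_iter().zip(probe_parts) {
        if b.len() > max_build_rows && level < MAX_RECURSION {
            join_level(b, p, num_partitions, max_build_rows, level + 1, out);
        } else {
            hash_join(&b, &p, out);
        }
    }
}

/// Entry point of the sketch: inner join of `build` and `probe` on the key.
fn grace_hash_join(
    build: Vec<Row>,
    probe: Vec<Row>,
    num_partitions: usize,
    max_build_rows: usize,
) -> Vec<Joined> {
    let mut out = Vec::new();
    join_level(build, probe, num_partitions, max_build_rows, 0, &mut out);
    out
}
```

Because matching keys always hash to the same bucket at a given level, each bucket pair can be joined without looking at any other bucket, which is what would let a real operator keep only one partition's build side in memory at a time. Output order depends on bucket assignment, so callers comparing results should sort first.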
