jonathanc-n opened a new issue, #17635: URL: https://github.com/apache/datafusion/issues/17635
### Is your feature request related to a problem or challenge? A perfect hash join is a highly specialized form of a hash join that eliminates a lot of runtime overhead. It can be thought of as a hybrid between a hash join and a direct-lookup array join. Here are the conditions for this join and the reasons for why they are needed Single Join Key - Perfect hash join only works when there is exactly one join key per row. - Multiple keys would require building composite keys - cannot index into Integer Key - Keys must be integers (or convertible to contiguous integer ranges) - Allows for direct addressing: output[key] = value. Non-integer keys (strings, UUIDs) would require hashing, which introduces collisions and destroys the "perfect" mapping. Distinct Values on Build Side - The build side must have unique keys (no duplicates). - This ensures a 1:1 mapping from key → value, allowing direct indexing without needing to chain multiple rows in a bucket. Inner Join - Perfect hash join works best with inner joins because you only need to store matches. - Outer joins would require materializing nulls for missing keys, which reintroduces complexity With perfect hash join: - determine the minimum and maximum key on the build side. - Allocate a contiguous array of size max_key - min_key + 1. - Store each build row at index key - min_key in the array. during probe phase: - For each probe key, check if it falls in [min_key, max_key]. - If yes, directly index into the array to find the corresponding row. - If no, skip (no match). ### Describe the solution you'd like - Add hash join benchmarks for testing - Use existing accumulator to calculate min/max values - Pass flag to `HashJoinExec` to specify if perfect hash join execution is needed - Can check distinct values from the hash table or use existing statistics. ## Implement perfect hash join execution. - Add stream implementation ### Describe alternatives you've considered _No response_ ### Additional context DuckDB's implementation can be seen here: https://github.com/duckdb/duckdb/pull/1959 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org