comphead commented on issue #17267: URL: https://github.com/apache/datafusion/issues/17267#issuecomment-3290029337
HJ Spilling is complicated as mentioned above, however we can consider a new type of join which is Grace hash Join https://en.wikipedia.org/wiki/Hash_join#Grace_hash_join which designed specifically for processing tables if build hash table is too large for available memory pages. the approach is recursive using Divide and Conquer algorithm The algorithm ### Partition Phase (Hash Partitioning) Choose a hash function h1 on the join key. For each tuple r ∈ R, compute bucket h1(r.key) mod p and write tuple into that bucket on disk. Do the same for S. Now both R and S are split into p partitions: R1…Rp, S1…Sp. Important: corresponding partitions Ri and Si contain all possible join partners (since hashing guarantees collocation). If partitions are still too large to fit in memory, recursively re-partition them with a different hash function h2, h3, … until partitions are small enough. This recursive spilling is sometimes called hybrid grace hash join. ### Build Phase (In-Memory Hash Table) For each partition pair (Ri, Si): Load the smaller partition (say Ri) into memory and build a hash table on its join key. ### Probe Phase Stream tuples from the other partition (Si), probing the in-memory hash table to find matches. Output joined results. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
