comphead commented on issue #17267:
URL: https://github.com/apache/datafusion/issues/17267#issuecomment-3290029337

   HJ Spilling is complicated as mentioned above, however we can consider a new 
type of join which is Grace hash Join 
https://en.wikipedia.org/wiki/Hash_join#Grace_hash_join which designed 
specifically for processing tables if build hash table is too large for 
available memory pages. the approach is recursive using Divide and Conquer 
algorithm
   
   The algorithm 
   
   ### Partition Phase (Hash Partitioning)
   Choose a hash function h1 on the join key.
   For each tuple r ∈ R, compute bucket h1(r.key) mod p and write tuple into 
that bucket on disk.
   Do the same for S.
   Now both R and S are split into p partitions: R1…Rp, S1…Sp.
   Important: corresponding partitions Ri and Si contain all possible join 
partners (since hashing guarantees collocation).
   If partitions are still too large to fit in memory, recursively re-partition 
them with a different hash function h2, h3, … until partitions are small enough.
   This recursive spilling is sometimes called hybrid grace hash join.
   
   ### Build Phase (In-Memory Hash Table)
   For each partition pair (Ri, Si):
   Load the smaller partition (say Ri) into memory and build a hash table on 
its join key.
   
   ### Probe Phase
   Stream tuples from the other partition (Si), probing the in-memory hash 
table to find matches.
   Output joined results.
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to