alamb commented on PR #103:
URL: https://github.com/apache/datafusion-site/pull/103#issuecomment-3266424754

   > Morning bike ride thought: the goal of a hash join is to split up the work 
into multiple partitions so that we can do work in parallel. The hashing of the 
join keys is just one way of doing this. Hashing does not preserve order at 
all. Would it be possible to do this with a space filling curve instead? I'm 
getting into above my pay grade territory but in my mind what z-order and 
Hilbert curves do is map from a higher dimensional space of arbitrary types 
(the arbitrary types part being part of an implementation not the mathematical 
instrument) into a single u64. Could we map the join key values into a u64 then 
take `partition = u64 % num_partitions` on the build side, thus putting values 
that are close in the join keys into ~ the same partition? I imagine we could 
even compute partition cutoffs in the u64 space to "rebalance" lopsided build 
side partitions (impossible to do with hashing) -> when we run the probe side 
it's something like `partition = case when u64 < 12 then 1 else
  when u64 >= 12 and u64 <= 182 then 2 else when u64 >= 182 then 3 end`.
   > 
   > I've never heard of this before so it probably would not work / I'm 
missing something...
   
   This is an interesting idea, but I am also not sure (I have also never heard 
of it)
   
   It sounds like the idea is to "range partition" the join key space by range 
(though I get it is not strictly by range, it is by the transformed curve space)
   
   > Could we map the join key values into a u64 then take `partition = u64 % 
num_partitions` on the build side, thus putting values that are close in the 
join keys into ~ the same partition?
   
   It isn't clear to me that `%` would put values close in the join keys into 
the same partition. I think you would have to do something like
   
   | `Partition` | Header |
   |--------|--------|
   | 0 | `0 < u64 < u64::MAX/num_partitions` |
   | 1 | `u64::MAX/num_partitions < u64 < 2 * u64::MAX/num_partitions` |
   | ... | ... |
   | `num_partitions-1` | `(num_partitions - 2) * u64::MAX/num_partitions < u64 
< (num_partitions-1) * u64::MAX/num_partitions` | 
   
   And the downside of that would be you could end up with imbalanced 
partitions due to skew (e.g. if all your values ended up in a few of the 
partitions)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to