Hi,

On 21/07/2023 at 15:58, Yaron Gvili wrote:
> A first approach I found is using `Hashing32` and `Hashing64`. This approach
> seems to be useful for hashing the fields composing a key of multiple rows
> when joining. However, it has a couple of drawbacks. One drawback is that if
> the number of distinct keys is large (like in the scale of a million or so)
> then the probability of hash collision may no longer be acceptable for some
> applications, more so when using `Hashing32`. Another drawback that I noticed
> in my experiments is that the common `N/A` and `0` integer values both hash
> to 0 and thus collide.

Ouch... if N/A really does hash to the same value as a common non-null value (0), this should be fixed.
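
For illustration, here is a toy sketch of the problem and one common fix (giving nulls a dedicated sentinel hash instead of hashing them as a zeroed value). This is not the actual Arrow kernel code, just a stand-in using `std::hash`:

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <optional>

// Stand-in 64-bit hash; the real kernels use their own hash functions.
uint64_t HashValue(uint64_t v) { return std::hash<uint64_t>{}(v); }

// Naive scheme: a null slot is hashed as if it held the value 0,
// so null and a real 0 collide by construction.
uint64_t NaiveHash(std::optional<int64_t> slot) {
  return HashValue(static_cast<uint64_t>(slot.value_or(0)));
}

// Fixed scheme: nulls get a dedicated sentinel hash, so they can only
// collide with a real value by chance (1 in 2^64), not by construction.
constexpr uint64_t kNullHash = 0x9e3779b97f4a7c15ULL;  // arbitrary constant
uint64_t ValidityAwareHash(std::optional<int64_t> slot) {
  return slot ? HashValue(static_cast<uint64_t>(*slot)) : kNullHash;
}

int main() {
  std::cout << "naive collides: "
            << (NaiveHash(std::nullopt) == NaiveHash(int64_t{0}))
            << "\n";  // prints 1
  std::cout << "fixed collides: "
            << (ValidityAwareHash(std::nullopt) ==
                ValidityAwareHash(int64_t{0}))
            << "\n";  // prints 0
  return 0;
}
```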

Also, I don't understand why there are two hashing variants (`Hashing32` and `Hashing64`, apparently). What's the rationale? How is the user meant to choose between them? Say a Substrait plan is being executed: which hashing variant is chosen, and why?

I don't think 32-bit hashing is a good idea when operating on large data. Unless the hash function is exceptionally good, you may get lots of hash collisions. It's nice to have a SIMD-accelerated hash table, but less so if access times degenerate to O(n)...
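
To put numbers on that: by the birthday bound, even a perfectly uniform b-bit hash over n distinct keys produces about n(n-1)/2^(b+1) expected colliding pairs. A quick back-of-the-envelope sketch (my own, not Arrow code):

```cpp
#include <cmath>
#include <cstdio>

// Birthday bound: expected colliding pairs ~= n*(n-1) / 2^(b+1),
// and P(at least one collision) ~= 1 - exp(-expected).
void Report(double n, int bits) {
  const double pairs = n * (n - 1.0) / 2.0;
  const double expected = pairs / std::ldexp(1.0, bits);  // pairs / 2^bits
  const double p_any = 1.0 - std::exp(-expected);
  std::printf("n=%.0f, %d-bit hash: expected colliding pairs ~= %.3g, "
              "P(>=1 collision) ~= %.3g\n",
              n, bits, expected, p_any);
}

int main() {
  Report(1e6, 32);  // ~116 expected collisions: practically guaranteed
  Report(1e6, 64);  // ~2.7e-8: negligible
  return 0;
}
```

So at a million keys, 32-bit collisions are effectively guaranteed even with a perfect hash function, while 64-bit collisions remain negligible.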

So IMHO we should only have one hashing variant with a 64-bit output. And make sure it doesn't have trivial collisions on common data patterns (such as nulls and zeros, or clustered integer ranges).
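
Something like the following smoke test could be run against whatever hash we settle on; `CandidateHash` below is just a placeholder splitmix64-style mixer so the example runs, not the actual Arrow function:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Placeholder: splitmix64-style mixer; substitute the hash under test.
uint64_t CandidateHash(uint64_t v) {
  v += 0x9e3779b97f4a7c15ULL;
  v = (v ^ (v >> 30)) * 0xbf58476d1ce4e5b9ULL;
  v = (v ^ (v >> 27)) * 0x94d049bb133111ebULL;
  return v ^ (v >> 31);
}

int main() {
  // Common data patterns: clustered and strided integer ranges.
  // (A null sentinel, as discussed above, would be added here too.)
  std::vector<uint64_t> inputs;
  for (uint64_t i = 0; i < 100000; ++i) inputs.push_back(i);
  for (uint64_t i = 0; i < 1000; ++i) inputs.push_back(i * 4096);

  std::unordered_map<uint64_t, uint64_t> seen;  // hash -> first input
  int collisions = 0;
  for (uint64_t v : inputs) {
    auto [it, inserted] = seen.emplace(CandidateHash(v), v);
    if (!inserted && it->second != v) {  // same hash, different input
      std::printf("collision: %llu vs %llu\n", (unsigned long long)v,
                  (unsigned long long)it->second);
      ++collisions;
    }
  }
  std::printf("%d collisions over %zu inputs\n", collisions, inputs.size());
  return 0;
}
```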

> A second approach I found is by serializing the Arrow structures (possibly
> by streaming) and hashing using functions in `util/hashing.h`. I didn't yet
> look into what properties these hash functions have except for the
> documented high performance. In particular, I don't know whether they have
> unfortunate hash collisions and, more generally, what is the probability of
> hash collision. I also don't know whether they are designed for efficient
> use in the context of joining.

Those hash functions shouldn't have unfortunate hash collisions, but they had not been exercised on real-world data at the time. I have no idea whether they are efficient in the context of joining, as they were written much earlier than our join implementation.
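
For what it's worth, the serialize-then-hash approach could look roughly like the sketch below. I'm using an inline FNV-1a as a stand-in so the example is self-contained; the actual functions in `util/hashing.h` would take its place and would need the same vetting:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in 64-bit byte hash (FNV-1a); not the util/hashing.h functions.
uint64_t Fnv1a64(const uint8_t* data, size_t len) {
  uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
  for (size_t i = 0; i < len; ++i) {
    h ^= data[i];
    h *= 0x100000001b3ULL;             // FNV prime
  }
  return h;
}

// Append one fixed-width field; a leading validity byte keeps null
// distinct from any real value (cf. the N/A-vs-0 issue above).
void AppendField(std::vector<uint8_t>* buf, const int64_t* value) {
  buf->push_back(value != nullptr ? 1 : 0);
  const int64_t v = value != nullptr ? *value : 0;
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
  buf->insert(buf->end(), p, p + sizeof(v));
}

// Serialize a two-field key, then hash the resulting bytes.
uint64_t HashKey(const int64_t* a, const int64_t* b) {
  std::vector<uint8_t> buf;
  AppendField(&buf, a);
  AppendField(&buf, b);
  return Fnv1a64(buf.data(), buf.size());
}

int main() {
  const int64_t zero = 0;
  std::printf("hash(null, 0) = %llx\n",
              (unsigned long long)HashKey(nullptr, &zero));
  std::printf("hash(0, 0)    = %llx\n",
              (unsigned long long)HashKey(&zero, &zero));
  return 0;
}
```

Variable-length fields would additionally need a length prefix (or separator) in the serialization, so that distinct keys can't produce the same byte stream.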

Regards

Antoine.
