Hi,

What are the recommended ways to hash Arrow structures? What are the pros and 
cons of each approach?

Looking a bit through the code, I've so far found two different hashing 
approaches, which I describe below. Are there any others?

A first approach I found is using `Hashing32` and `Hashing64`. This approach 
seems to be useful for hashing the key fields of multiple rows when joining. 
However, it has a couple of drawbacks. One drawback is that if the number of 
distinct keys is large (on the order of a million or so), then the probability 
of a hash collision may no longer be acceptable for some applications, more so 
when using `Hashing32`. Another drawback that I noticed in my experiments is 
that the common `N/A` and `0` integer values both hash to 0 and thus collide.
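
To put a rough number on the first drawback, here is a minimal sketch of the 
standard birthday-bound approximation. It assumes an ideal hash that spreads 
keys uniformly over the output space, which I have not verified for 
`Hashing32` or `Hashing64`; the function name `collision_probability` is just 
for illustration and is not part of Arrow.

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate P(at least one collision) among num_keys distinct keys.

    Assumes an idealized hash that maps keys uniformly and independently
    into 2**hash_bits values (an assumption, not a property of Arrow).
    """
    space = 2.0 ** hash_bits
    # Birthday bound: 1 - exp(-n(n-1) / (2N)).
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-num_keys * (num_keys - 1) / (2.0 * space))


if __name__ == "__main__":
    for bits in (32, 64):
        p = collision_probability(1_000_000, bits)
        print(f"{bits}-bit hash, 1e6 distinct keys: P(collision) ~ {p:.3g}")
```

Under that idealized assumption, a million distinct keys make at least one 
collision practically certain with a 32-bit hash, while with a 64-bit hash the 
probability is on the order of 1e-8. A real hash function can only do as well 
as or worse than this bound.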

A second approach I found is serializing the Arrow structures (possibly by 
streaming) and hashing the result using functions in `util/hashing.h`. I 
haven't yet looked into what properties these hash functions have, except for 
the documented high performance. In particular, I don't know whether they have 
unfortunate hash collisions or, more generally, what the probability of a hash 
collision is. I also don't know whether they are designed for efficient use in 
the context of joining.


Cheers,
Yaron.
