Hi,

On 21/07/2023 at 15:58, Yaron Gvili wrote:
> A first approach I found is using `Hashing32` and `Hashing64`. This approach
> seems to be useful for hashing the fields composing a key of multiple rows
> when joining. However, it has a couple of drawbacks. One drawback is that if
> the number of distinct keys is large (like in the scale of a million or so)
> then the probability of hash collision may no longer be acceptable for some
> applications, more so when using `Hashing32`. Another drawback that I noticed
> in my experiments is that the common `N/A` and `0` integer values both hash
> to 0 and thus collide.

Ouch... if N/A really does hash to the same value as a common non-null value (0), this should be fixed.
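
For illustration, here is a toy sketch of the problem and one common fix (giving nulls a dedicated sentinel hash instead of hashing them as a zeroed value). This is not the actual Arrow kernel code, just a stand-in using `std::hash`:

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <optional>

// Stand-in 64-bit hash; the real kernels use their own hash functions.
uint64_t HashValue(uint64_t v) { return std::hash<uint64_t>{}(v); }

// Naive scheme: a null slot is hashed as if it held the value 0,
// so null and a real 0 collide by construction.
uint64_t NaiveHash(std::optional<int64_t> slot) {
  return HashValue(static_cast<uint64_t>(slot.value_or(0)));
}

// Fixed scheme: nulls get a dedicated sentinel hash, so they can only
// collide with a real value by chance (1 in 2^64), not by construction.
constexpr uint64_t kNullHash = 0x9e3779b97f4a7c15ULL;  // arbitrary constant
uint64_t ValidityAwareHash(std::optional<int64_t> slot) {
  return slot ? HashValue(static_cast<uint64_t>(*slot)) : kNullHash;
}

int main() {
  std::cout << "naive collides: "
            << (NaiveHash(std::nullopt) == NaiveHash(int64_t{0}))
            << "\n";  // prints 1
  std::cout << "fixed collides: "
            << (ValidityAwareHash(std::nullopt) ==
                ValidityAwareHash(int64_t{0}))
            << "\n";  // prints 0
  return 0;
}
```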

Also, I don't understand why there are two hashing variants (`Hashing32` and `Hashing64`, apparently). What's the rationale? How is the user meant to choose between them? Say a Substrait plan is being executed: which hashing variant is chosen, and why?

I don't think 32-bit hashing is a good idea when operating on large data. Unless the hash function is exceptionally good, you may get lots of hash collisions. It's nice to have a SIMD-accelerated hash table, but less so if access times degenerate to O(n)...
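
To put numbers on that: by the birthday bound, even a perfectly uniform b-bit hash over n distinct keys produces about n(n-1)/2^(b+1) expected colliding pairs. A quick back-of-the-envelope sketch (my own, not Arrow code):

```cpp
#include <cmath>
#include <cstdio>

// Birthday bound: expected colliding pairs ~= n*(n-1) / 2^(b+1),
// and P(at least one collision) ~= 1 - exp(-expected).
void Report(double n, int bits) {
  const double pairs = n * (n - 1.0) / 2.0;
  const double expected = pairs / std::ldexp(1.0, bits);  // pairs / 2^bits
  const double p_any = 1.0 - std::exp(-expected);
  std::printf("n=%.0f, %d-bit hash: expected colliding pairs ~= %.3g, "
              "P(>=1 collision) ~= %.3g\n",
              n, bits, expected, p_any);
}

int main() {
  Report(1e6, 32);  // ~116 expected collisions: practically guaranteed
  Report(1e6, 64);  // ~2.7e-8: negligible
  return 0;
}
```

So at a million keys, 32-bit collisions are effectively guaranteed even with a perfect hash function, while 64-bit collisions remain negligible.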

So IMHO we should only have one hashing variant with a 64-bit output. And make sure it doesn't have trivial collisions on common data patterns (such as nulls and zeros, or clustered integer ranges).
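
Something like the following smoke test could be run against whatever hash we settle on; `CandidateHash` below is just a placeholder splitmix64-style mixer so the example runs, not the actual Arrow function:

```cpp
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// Placeholder: splitmix64-style mixer; substitute the hash under test.
uint64_t CandidateHash(uint64_t v) {
  v += 0x9e3779b97f4a7c15ULL;
  v = (v ^ (v >> 30)) * 0xbf58476d1ce4e5b9ULL;
  v = (v ^ (v >> 27)) * 0x94d049bb133111ebULL;
  return v ^ (v >> 31);
}

int main() {
  // Common data patterns: clustered and strided integer ranges.
  // (A null sentinel, as discussed above, would be added here too.)
  std::vector<uint64_t> inputs;
  for (uint64_t i = 0; i < 100000; ++i) inputs.push_back(i);
  for (uint64_t i = 0; i < 1000; ++i) inputs.push_back(i * 4096);

  std::unordered_map<uint64_t, uint64_t> seen;  // hash -> first input
  int collisions = 0;
  for (uint64_t v : inputs) {
    auto [it, inserted] = seen.emplace(CandidateHash(v), v);
    if (!inserted && it->second != v) {  // same hash, different input
      std::printf("collision: %llu vs %llu\n", (unsigned long long)v,
                  (unsigned long long)it->second);
      ++collisions;
    }
  }
  std::printf("%d collisions over %zu inputs\n", collisions, inputs.size());
  return 0;
}
```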

> A second approach I found is by serializing the Arrow structures (possibly
> by streaming) and hashing using functions in `util/hashing.h`. I didn't yet
> look into what properties these hash functions have except for the
> documented high performance. In particular, I don't know whether they have
> unfortunate hash collisions and, more generally, what is the probability of
> hash collision. I also don't know whether they are designed for efficient
> use in the context of joining.

Those hash functions shouldn't have unfortunate hash collisions, but they had not been exercised on real-world data at the time. I have no idea whether they are efficient in the context of joining, as they were written much earlier than our join implementation.
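
For what it's worth, the serialize-then-hash approach could look roughly like the sketch below. I'm using an inline FNV-1a as a stand-in so the example is self-contained; the actual functions in `util/hashing.h` would take its place and would need the same vetting:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in 64-bit byte hash (FNV-1a); not the util/hashing.h functions.
uint64_t Fnv1a64(const uint8_t* data, size_t len) {
  uint64_t h = 0xcbf29ce484222325ULL;  // FNV offset basis
  for (size_t i = 0; i < len; ++i) {
    h ^= data[i];
    h *= 0x100000001b3ULL;             // FNV prime
  }
  return h;
}

// Append one fixed-width field; a leading validity byte keeps null
// distinct from any real value (cf. the N/A-vs-0 issue above).
void AppendField(std::vector<uint8_t>* buf, const int64_t* value) {
  buf->push_back(value != nullptr ? 1 : 0);
  const int64_t v = value != nullptr ? *value : 0;
  const uint8_t* p = reinterpret_cast<const uint8_t*>(&v);
  buf->insert(buf->end(), p, p + sizeof(v));
}

// Serialize a two-field key, then hash the resulting bytes.
uint64_t HashKey(const int64_t* a, const int64_t* b) {
  std::vector<uint8_t> buf;
  AppendField(&buf, a);
  AppendField(&buf, b);
  return Fnv1a64(buf.data(), buf.size());
}

int main() {
  const int64_t zero = 0;
  std::printf("hash(null, 0) = %llx\n",
              (unsigned long long)HashKey(nullptr, &zero));
  std::printf("hash(0, 0)    = %llx\n",
              (unsigned long long)HashKey(&zero, &zero));
  return 0;
}
```

Variable-length fields would additionally need a length prefix (or separator) in the serialization, so that distinct keys can't produce the same byte stream.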

Regards

Antoine.
