Hi, What are the recommended ways to hash Arrow structures? What are the pros and cons of each approach?
Looking through the code, I have so far found two different hashing approaches, which I describe below. Are there any others?

The first approach uses `Hashing32` and `Hashing64`. It seems useful for hashing the key fields of multiple rows when joining. However, it has a couple of drawbacks. One is that if the number of distinct keys is large (on the order of a million or so), the probability of a hash collision may no longer be acceptable for some applications, more so when using `Hashing32`. Another, which I noticed in my experiments, is that the common `N/A` and `0` integer values both hash to 0 and thus collide.

The second approach serializes the Arrow structures (possibly by streaming) and hashes the result using functions in `util/hashing.h`. I have not yet looked into what properties these hash functions have, apart from the documented high performance. In particular, I don't know whether they have unfortunate hash collisions or, more generally, what the probability of a hash collision is. I also don't know whether they are designed for efficient use in the context of joining.

Cheers,
Yaron.
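
P.S. To put a rough number on the collision concern for about a million distinct keys, here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes an ideal hash that spreads keys uniformly over all 32-bit or 64-bit values, which may not hold for `Hashing32` or `Hashing64`; the function name `collision_probability` is my own and not part of Arrow.

```python
import math


def collision_probability(num_keys: int, hash_bits: int) -> float:
    """Approximate probability that at least two of `num_keys` distinct keys
    share a hash value, assuming an ideal uniform hash of `hash_bits` bits.

    Uses the birthday approximation p ~= 1 - exp(-n * (n - 1) / (2 * m)),
    where m = 2 ** hash_bits is the number of possible hash values.
    """
    num_buckets = 2 ** hash_bits
    exponent = -num_keys * (num_keys - 1) / (2 * num_buckets)
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(exponent)


if __name__ == "__main__":
    for bits in (32, 64):
        p = collision_probability(1_000_000, bits)
        print(f"{bits}-bit hash, 1,000,000 keys: p(collision) ~= {p:.3g}")
```

Under that uniformity assumption, at least one collision among a million keys is practically certain with 32 bits and has a probability on the order of 1e-8 with 64 bits. A real hash function can only do worse than this idealized bound, which is why the `N/A` versus `0` collision worries me.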