Yes, those are the two main approaches to hashing in the code base that I am aware of as well. I haven't seen any concrete comparison or benchmarks between the two. If collisions between NA and 0 are a problem, it would probably be fine to tweak the hash value of NA to something unique. I suspect these collisions aren't an inevitable consequence of the design, just something that hasn't been considered yet.
There is a third way currently in use as well, by arrow::compute::GrouperImpl. In this class the key values from each row are converted into a single "row format" which is stored in a std::string. A std::unordered_map is then used for the hashing. The GrouperFastImpl class was created to (presumably) be faster than this. It uses the Hashing32 routines and stores the groups in an arrow::compute::SwissTable. I believe some benchmarking was done that showed GrouperFastImpl to indeed be faster than GrouperImpl.

On Fri, Jul 21, 2023 at 6:59 AM Yaron Gvili <rt...@hotmail.com> wrote:
> Hi,
>
> What are the recommended ways to hash Arrow structures? What are the pros
> and cons of each approach?
>
> Looking a bit through the code, I've so far found two different hashing
> approaches, which I describe below. Are there any others?
>
> A first approach I found is using `Hashing32` and `Hashing64`. This
> approach seems to be useful for hashing the fields composing a key of
> multiple rows when joining. However, it has a couple of drawbacks. One
> drawback is that if the number of distinct keys is large (like in the scale
> of a million or so) then the probability of hash collision may no longer be
> acceptable for some applications, more so when using `Hashing32`. Another
> drawback that I noticed in my experiments is that the common `N/A` and `0`
> integer values both hash to 0 and thus collide.
>
> A second approach I found is by serializing the Arrow structures (possibly
> by streaming) and hashing using functions in `util/hashing.h`. I didn't yet
> look into what properties these hash functions have except for the
> documented high performance. In particular, I don't know whether they have
> unfortunate hash collisions and, more generally, what is the probability of
> hash collision. I also don't know whether they are designed for efficient
> use in the context of joining.
>
> Cheers,
> Yaron.