Yes, those are the two main approaches to hashing in the code base that I
am aware of as well.  I haven't seen any real concrete comparison and
benchmarks between the two.  If collisions between NA and 0 are a problem
it would probably be ok to tweak the hash value of NA to something unique.
I suspect these collisions aren't an inevitable fact of the design but more
just something that has not been considered yet.

There is a third way currently in use as well by
arrow::compute::GrouperImpl.  In this class the key values from each row
are converted into a single "row format" which is stored in a std::string.
A std::unordered_map is then used for the hashing.  The GrouperFastImpl
class was created to (presumably) be faster than this.  It uses the
Hashing32 routines and stores the groups in an arrow::compute::SwissTable.
However, I think there was some benchmarking done that showed the
GrouperFastIimpl to be faster than the GrouperImpl.

On Fri, Jul 21, 2023 at 6:59 AM Yaron Gvili <rt...@hotmail.com> wrote:

> Hi,
>
> What are the recommended ways to hash Arrow structures? What are the pros
> and cons of each approach?
>
> Looking a bit through the code, I've so far found two different hashing
> approaches, which I describe below. Are there any others?
>
> A first approach I found is using `Hashing32` and `Hashing64`. This
> approach seems to be useful for hashing the fields composing a key of
> multiple rows when joining. However, it has a couple of drawbacks. One
> drawback is that if the number of distinct keys is large (like in the scale
> of a million or so) then the probability of hash collision may no longer be
> acceptable for some applications, more so when using `Hashing32`. Another
> drawback that I noticed in my experiments is that the common `N/A` and `0`
> integer values both hash to 0 and thus collide.
>
> A second approach I found is by serializing the Arrow structures (possibly
> by streaming) and hashing using functions in `util/hashing.h`. I didn't yet
> look into what properties these hash functions have except for the
> documented high performance. In particular, I don't know whether they have
> unfortunate hash collisions and, more generally, what is the probability of
> hash collision. I also don't know whether they are designed for efficient
> use in the context of joining.
>
>
> Cheers,
> Yaron.
>

Reply via email to