Re: hashing Arrow structures

2023-07-24 Thread Weston Pace
> Also, I don't understand why there are two versions of the hash table ("hashing32" and "hashing64" apparently). What's the rationale? How is the user meant to choose between them? Say a Substrait plan is being executed: which hashing variant is chosen and why?
It's not user-configurable.

Re: hashing Arrow structures

2023-07-24 Thread Antoine Pitrou
Hi,
On 21/07/2023 at 15:58, Yaron Gvili wrote:
> A first approach I found is using `Hashing32` and `Hashing64`. This approach seems to be useful for hashing the fields composing a key of multiple rows when joining. However, it has a couple of drawbacks. One drawback is that if the number of

Re: hashing Arrow structures

2023-07-21 Thread Weston Pace
Yes, those are the two main approaches to hashing in the code base that I am aware of as well. I haven't seen any real concrete comparison and benchmarks between the two. If collisions between NA and 0 are a problem it would probably be ok to tweak the hash value of NA to something unique. I susp
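
To illustrate the NA-versus-0 collision mentioned above, here is a minimal sketch in plain Python. It is not Arrow's implementation: `mix32`, `naive_hash`, `null_aware_hash`, and the `NULL_HASH` constant are all hypothetical names chosen for this example, and the sketch simply assumes a hash that treats a null as 0.

```python
# Hypothetical sketch; none of these names come from the Arrow code base.
MASK32 = 0xFFFFFFFF
NULL_HASH = 0x9E3779B9  # arbitrary non-zero sentinel, assumed for illustration


def mix32(x: int) -> int:
    """A simple 32-bit integer mixer (the murmur3 finalizer)."""
    x &= MASK32
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & MASK32
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & MASK32
    x ^= x >> 16
    return x


def naive_hash(value):
    # Treats a null as if it were 0, so None and 0 always collide.
    return mix32(0 if value is None else value)


def null_aware_hash(value):
    # Gives null its own hash value, so it no longer collides with 0.
    # Another non-null value may still happen to hash to NULL_HASH.
    return NULL_HASH if value is None else mix32(value)
```

With this sketch, `naive_hash(None)` equals `naive_hash(0)`, while `null_aware_hash(None)` and `null_aware_hash(0)` differ. An equality check on the actual keys is still needed after the hash lookup, since distinct values can share a hash.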

hashing Arrow structures

2023-07-21 Thread Yaron Gvili
Hi, What are the recommended ways to hash Arrow structures? What are the pros and cons of each approach? Looking a bit through the code, I've so far found two different hashing approaches, which I describe below. Are there any others? A first approach I found is using `Hashing32` and `Hashing6