Hi Micah,

We have run into some of these issues on Impala in various guises, including hash tables and min/max stats in Parquet. Treating +0/-0 as indistinguishable for purposes of equality and grouping makes the most sense and avoids most pitfalls.
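
To make the +0/-0 point concrete, here is a minimal sketch in Python. It is not Impala or Arrow code, and the helper names `raw_bits` and `canonical_bits` are made up for illustration. It assumes a hash table that hashes the raw 64-bit pattern of a double, which is where the two zeros would otherwise end up in different buckets even though they compare equal.

```python
import struct


def raw_bits(x: float) -> int:
    """Return the IEEE-754 binary64 bit pattern of x as an unsigned integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def canonical_bits(x: float) -> int:
    """Bit pattern to feed to the hash function, with -0.0 folded into +0.0.

    Hypothetical helper: the comparison x == 0.0 is true for both zeros
    (and false for NaN), so only the sign of zero is normalized here.
    """
    if x == 0.0:
        x = 0.0  # replaces -0.0 with +0.0; +0.0 is unchanged
    return raw_bits(x)


# The two zeros compare equal but have different bit patterns ...
zeros_equal = (0.0 == -0.0)
zeros_same_raw_bits = (raw_bits(0.0) == raw_bits(-0.0))
# ... so hashing the canonical pattern keeps equality and hashing consistent.
zeros_same_canonical_bits = (canonical_bits(0.0) == canonical_bits(-0.0))
```

With this kind of normalization, two keys that compare equal always hash the same, which is the property a hash table needs. NaN is deliberately left alone in this sketch.
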

NaN is messier. I don't think there's necessarily one right answer. One observation is that it's inconvenient or borderline useless to treat NaNs as distinct values when doing grouping operations, because you get a potentially huge number of distinct groups. But for the purposes of joins or, really, most cases where you're comparing equality, you want to treat all NaN values as not equal. So it can make sense to actually have two relations.

There's an analogy with SQL NULL, which behaves differently in grouping and other contexts. In Impala's hash table implementation, which supports hashing and comparing multiple columns, we determine per-column whether to use the "null-safe" or non-"null-safe" behaviour.

One corollary is that if you're using systems like this, it's best to avoid depending on floating point equality because it's unpredictable.

- Tim

On Mon, Feb 25, 2019 at 9:09 PM Micah Kornfield <emkornfi...@gmail.com> wrote:

> Implementing compute kernels that depend on hashing has raised a couple of
> edge cases that are worth discussing. In particular
> the following points need to be resolved (I opened a JIRA [1] to track the
> fixes). In particular:
>
> 1. How to handle -0.0 and 0.0?
> - Option 1: Collapse to a single value (this is more inline with ieee-754
> spec I believe)
> - Option 2: Keep them as separate values (I believe this is how java
> handles them)
> 2. How handle NaN?
> - Option 1: Do nothing with them (multiple values of NaN might occur in
> hashtables)
> - Option 2: Canonicalize to a single NaN (this is what java does)
>
> I haven't investigated how DB systems handle these (if anyone knows and can
> chime in I would appreciate it). As a default, I think it might be nice to
> align the C++ implementation with the way Java handles them, but I don't
> have any strong opinions.
>
> Thanks,
> Micah
>
> [1] https://issues.apache.org/jira/browse/ARROW-4497