If a floating-point column does not have NaN as a lower_bound or upper_bound, must it contain no NaNs?
This question came up in Parquet in https://issues.apache.org/jira/browse/PARQUET-1222.

One reasonable choice would be to specify the use of the IEEE-754 totalOrder predicate: https://en.wikipedia.org/wiki/IEEE_754#Total-ordering_predicate. Under totalOrder, if neither bound is NaN, then the column contains no NaNs. After that it gets more complex: if upper_bound is a NaN with a negative sign (NaNs are signed), then the column contains ONLY NaNs, since negative NaNs sort below every other value in totalOrder, including -inf. To add complexity, for the purpose of skipping files, I suppose the compute engines would have to be using totalOrder, not one of the usual comparators like <= (a rough sketch of what that could look like is in the P.S. below).

Another possibility is to count NaNs the way NULLs are counted, with a nan_value_counts field, and insist that lower and upper bounds must be numbers or an infinity. I'm not sure then how a lower_bound would be set in a column with no non-NaN values; maybe it would just be left out of the map.

One other thing I'll note about floating-point weirdness: 0 can be signed. -0 is equal to +0, but the two can be distinguished with some operations: 1.0/0.0 = inf, while 1.0/-0.0 = -inf. Also, -0 is less than +0 in totalOrder, so compute engines pruning files with totalOrder would need lower_bound to respect the distinction between -0 and +0.

Additionally, -0 and +0 have different bit patterns, which means the hashes of -0 and +0 are likely different, given the hash function defined in the spec as hashLong(doubleToRawLongBits(v)), even though the floating-point values are "equal" for some definition of equal. I'm not sure how important this last one is, since the spec says "floating point types are not valid source values for partitioning", and I'm still working on parsing the spec to understand why hashing needs to be defined at all for these values if they aren't valid source values for partitioning.

Thanks!

Jim
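P.S. A minimal sketch, in Java, of what a totalOrder comparator for doubles could look like; the class name, method names, and the bit-flipping trick here are mine, not from any spec:

    import java.util.Comparator;

    public class TotalOrderComparator {
      // Map a double's bit pattern to a long whose signed ordering matches
      // IEEE-754 totalOrder: non-negative bit patterns already sort correctly,
      // while negative patterns sort in reverse, so flip their magnitude bits.
      static long orderKey(double d) {
        long bits = Double.doubleToRawLongBits(d);
        return bits ^ ((bits >> 63) >>> 1);  // flips the low 63 bits iff the sign bit is set
      }

      public static final Comparator<Double> TOTAL_ORDER =
          (a, b) -> Long.compare(orderKey(a), orderKey(b));
    }

This yields -NaN < -inf < ... < -0.0 < +0.0 < ... < +inf < +NaN, which is exactly the ordering that makes the "upper_bound is a negative NaN implies only NaNs" inference work.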
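P.P.S. Under the nan_value_counts alternative, stats collection might look something like the following; the field names (nanValueCount, lowerBound, upperBound) are hypothetical, not proposed spec fields:

    // Count NaNs separately and keep bounds only over non-NaN values
    // (which may still include +/-infinity).
    class FloatColumnStats {
      long nanValueCount = 0;
      Double lowerBound = null;  // stays null if the column has no non-NaN values,
      Double upperBound = null;  // i.e. the bound is "left out of the map"

      void add(double v) {
        if (Double.isNaN(v)) {
          nanValueCount++;
          return;
        }
        // Double.compare puts -0.0 below +0.0, preserving the -0/+0
        // distinction in the bounds.
        if (lowerBound == null || Double.compare(v, lowerBound) < 0) {
          lowerBound = v;
        }
        if (upperBound == null || Double.compare(v, upperBound) > 0) {
          upperBound = v;
        }
      }
    }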
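P.P.P.S. A tiny demonstration of the signed-zero points, using Long.hashCode on the raw bits purely as a stand-in for whatever hashLong the spec defines:

    public class SignedZeroDemo {
      public static void main(String[] args) {
        System.out.println(-0.0 == 0.0);     // true: the values are "equal"
        System.out.println(1.0 / 0.0);       // Infinity
        System.out.println(1.0 / -0.0);      // -Infinity: the zeros are distinguishable

        long posBits = Double.doubleToRawLongBits(0.0);   // 0x0000000000000000
        long negBits = Double.doubleToRawLongBits(-0.0);  // 0x8000000000000000
        System.out.println(posBits == negBits);           // false: different bit patterns

        // Stand-in for the spec's hashLong(doubleToRawLongBits(v)):
        System.out.println(Long.hashCode(posBits));       // 0
        System.out.println(Long.hashCode(negBits));       // Integer.MIN_VALUE
      }
    }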