orlp commented on PR #221: URL: https://github.com/apache/parquet-format/pull/221#issuecomment-2937795348
> Are we willing to special case float to make filtering in the presence of NaNs more efficient, or do we go with a more streamlined implementation without special fields that only make sense for a single data type, at the cost of worse filtering performance when NaNs are present. I consider IEEE 754 total order as arbitrary and special-cased as the NaN-ignoring order. Either way we have to do something we're not doing right now for floats. > On the upside, what we gain by considering the sign bit for NaNs is: > - a very efficient, branch-free and straightforward software implementation for the comparison predicate Computing the min/max while ignoring NaNs is also efficient and branch-free if done correctly. And I hope we agree the computation of the statistic is far more important than the singular comparison to decide whether or not to skip an entire page. > - this gives a total ordering of every possible NaN bit pattern and thereby leaves no room for interpretation that could lead to different results between engines. My proposal is also fully defined without room for interpretation (apart from the exact NaN pattern used for the full-NaN page, we can figure that out). > The "NaN poisoning" you mention, which I also mentioned in my initial proposal, is somewhat arbitrary: Yes, NaN is considered an extreme value in this proposal and therefore overwrites other extreme values, making filtering less efficient, but is it really worth to make a case distinction here? Yes, absolutely. NaNs aren't as rare as you might think depending on the use-case, and need special handling *regardless*, because they aren't ordered naturally. The 'large value' conundrum you mention is a red herring, those are naturally handled the same by every data system. But not every data system makes the same choices when it comes to handling NaNs. By taking them out of the ordering and presented separately through a `nan_count` field every data system can process the information in a way which is compatible with their system. > Your own proposal suggests an ordering that would be different for min and max. While this might help with squeezing the most filtering possibilities out of NaNs, it adds even more special handling logic, and I guess a sloppily written writer implementation could get this wrong as well. Let's not forget that my proposal very closely matches *what Parquet already requires today*. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
