orlp commented on PR #221:
URL: https://github.com/apache/parquet-format/pull/221#issuecomment-2937795348

   > Are we willing to special case float to make filtering in the presence of 
NaNs more efficient, or do we go with a more streamlined implementation without 
special fields that only make sense for a single data type, at the cost of 
worse filtering performance when NaNs are present.
   
   I consider IEEE 754 total order as arbitrary and special-cased as the 
NaN-ignoring order. Either way we have to do something we're not doing right 
now for floats.
   
   > On the upside, what we gain by considering the sign bit for NaNs is:
   > - a very efficient, branch-free and straightforward software 
implementation for the comparison predicate
   
   Computing the min/max while ignoring NaNs is also efficient and branch-free 
if done correctly. And I hope we agree the computation of the statistic is far 
more important than the singular comparison to decide whether or not to skip an 
entire page.
   
   > - this gives a total ordering of every possible NaN bit pattern and 
thereby leaves no room for interpretation that could lead to different results 
between engines.
   
   My proposal is also fully defined without room for interpretation (apart 
from the exact NaN pattern used for the full-NaN page, we can figure that out).
   
   > The "NaN poisoning" you mention, which I also mentioned in my initial 
proposal, is somewhat arbitrary: Yes, NaN is considered an extreme value in 
this proposal and therefore overwrites other extreme values, making filtering 
less efficient, but is it really worth to make a case distinction here?
   
   Yes, absolutely. NaNs aren't as rare as you might think depending on the 
use-case, and need special handling *regardless*, because they aren't ordered 
naturally. The 'large value' conundrum you mention is a red herring, those are 
naturally handled the same by every data system.
   
   But not every data system makes the same choices when it comes to handling 
NaNs. By taking them out of the ordering and presented separately through a 
`nan_count` field every data system can process the information in a way which 
is compatible with their system. 
   
   > Your own proposal suggests an ordering that would be different for min and 
max. While this might help with squeezing the most filtering possibilities out 
of NaNs, it adds even more special handling logic, and I guess a sloppily 
written writer implementation could get this wrong as well.
   
   Let's not forget that my proposal very closely matches *what Parquet already 
requires today*.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to