etseidl commented on PR #221:
URL: https://github.com/apache/parquet-format/pull/221#issuecomment-2945050964

   > I don't think this is a fair voting procedure if the options are stated as 
"approve this or veto".
   
   I wasn't suggesting a formal vote on total order. Rather, I was thinking of 
an informal poll, as @JFinis had suggested, to fail fast on this PR if it lacked 
PMC support. But I see your point. What I want to avoid is a protracted 
discussion in which, after several months, the majority of those involved favor 
proceeding with total order, only for it to be shot down in an actual vote. The 
most vocal opinions recently have come from non-PMC members.
   
   
   >>There is actually a problem with the singular NaN count for data systems 
which use IEEE 754 total ordering (such as datafusion), they would need two 
counts for efficient page filtering in the face of NaNs: one for positive NaNs 
and one for negative NaNs.
   >
   >I don't think that's a big problem. It just means that if the system needs 
to include either -NaN or +NaN in a query, any page that has a non-zero 
nan_count has to be scanned. Yes, that might mean that you scan a page in vain, 
if you're only looking for, say, +NaN, but the page happens to only include 
-NaN, but this seems to be a rather small problem.
   
   I believe it's worse than that. Consider a page `[-NaN, -2.0, 0.0]`. With 
`nan_count`, the stats are `(-2.0, 0.0, nan_count=1)`. Not knowing which kind of 
NaN was seen, an engine like DataFusion has to treat the stats as `(-NaN, 
NaN)` rather than `(-NaN, 0.0)`. With a query predicate like `x > 0.0`, 
DataFusion could prune the page with total order stats, but would have to scan 
it with `nan_count`. It's the opposite of the problem @orlp raised: an 
engine that treats all NaNs as equal and greater than all real values would 
turn the total order stats `(-NaN, 0.0)` into `(-NaN, NaN)`, and thus would be 
unable to prune with a predicate of `x < -2.0`. Adding a separate count for 
-NaN, I think, _would_ satisfy both types of engines. It adds even more 
complexity, but if we're counting anyway it's not much of an added burden.
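
   To make the two pruning cases above concrete, here is a minimal Python 
sketch. It is not DataFusion or parquet code: `total_order_key`, 
`total_order_bounds`, and the two `can_prune_*` helpers are hypothetical names 
made up for illustration, and the stats are passed as plain values rather than 
read from a real page header.

```python
import math
import struct

NEG_NAN = math.copysign(math.nan, -1.0)
POS_NAN = math.copysign(math.nan, 1.0)


def total_order_key(x: float) -> int:
    """Integer key that sorts floats in IEEE 754 totalOrder."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    # For values with the sign bit set, flip the other 63 bits so that
    # larger magnitudes (including -NaN) sort lower.
    return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits


def total_order_bounds(min_real, max_real, maybe_neg_nan, maybe_pos_nan):
    """Bounds a total-order engine has to assume (hypothetical helper)."""
    lo = NEG_NAN if maybe_neg_nan else min_real
    hi = POS_NAN if maybe_pos_nan else max_real
    return lo, hi


def total_order_can_prune_gt(bounds, threshold):
    """True if nothing within the bounds can satisfy x > threshold."""
    _, hi = bounds
    return total_order_key(hi) <= total_order_key(threshold)


def nan_greatest_can_prune_lt(min_stat, threshold):
    """Engine that sorts every NaN above all real values, predicate x < threshold."""
    if math.isnan(min_stat):
        # A NaN min says nothing about the smallest real value in the page.
        return False
    return min_stat >= threshold


# Page [-NaN, -2.0, 0.0], total-order engine, predicate x > 0.0.
single_count = total_order_bounds(-2.0, 0.0, True, True)   # nan_count=1, sign unknown
split_counts = total_order_bounds(-2.0, 0.0, True, False)  # one -NaN, zero +NaN
print(total_order_can_prune_gt(single_count, 0.0))  # False: must scan
print(total_order_can_prune_gt(split_counts, 0.0))  # True: can prune

# Same page, NaN-greatest engine, predicate x < -2.0.
print(nan_greatest_can_prune_lt(NEG_NAN, -2.0))  # False: total order min is -NaN
print(nan_greatest_can_prune_lt(-2.0, -2.0))     # True: min from nan_count stats
```

   With separate counts, the total-order engine recovers `(-NaN, 0.0)` and the 
NaN-greatest engine still sees the real min of `-2.0`, so each can prune its 
respective predicate.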


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

