etseidl commented on PR #221: URL: https://github.com/apache/parquet-format/pull/221#issuecomment-2945050964
> I don't think this is a fair voting procedure if the options are stated as "approve this or veto".

I wasn't suggesting a formal vote on total order. Rather, I was thinking an informal poll as @JFinis had suggested to fail fast on this PR if it lacked PMC support. But I see your point. What I want to avoid is a protracted discussion that after several months ends with the majority opinion of those involved to proceed with total order, only to then be shot down in an actual vote. The most vocal opinions recently have been from non-PMC.

>>There is actually a problem with the singular NaN count for data systems which use IEEE 754 total ordering (such as datafusion), they would need two counts for efficient page filtering in the face of NaNs: one for positive NaNs and one for negative NaNs.
>
>I don't think that's a big problem. It just means that if the system needs to include either -NaN or +NaN in a query, any page that has a non-zero nan_count has to be scanned. Yes, that might mean that you scan a page in vain, if you're only looking for, say, +NaN, but the page happens to only include -NaN, but this seems to be a rather small problem.

I believe it's worse than that. Consider a page `[-NaN, -2.0, 0.0]`. With `nan_counts` the stats are `(-2.0, 0.0, nan_count=1)`. Not knowing what type of NaN was seen, an engine like Datafusion will have to treat the stats as `(-NaN, NaN)` rather than `(-NaN, 0.0)`. With a query predicate like `x > 0.0`, Datafusion could prune the page with total order stats, but would have to scan the page with `nan_count`.

It's the opposite of the problem @orlp raised. An engine that treats all NaNs as equal and greater than all real values would turn the total order stats `(-NaN, 0.0)` into `(-NaN, NaN)`, and thus would be unable to prune with a predicate of `x < -2.0`.

Adding a separate count for -NaN, I think, _would_ satisfy both types of engines, but adds even more complexity (but if we're counting anyway it's not much of an added burden).
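To make the two pruning cases concrete, here is a minimal Python sketch for the page `[-NaN, -2.0, 0.0]`. Everything in it is illustrative: the function names (`total_order_key`, `widen_for_total_order`, `can_prune_gt_total_order`, `can_prune_lt_nan_greatest`) are hypothetical and are not DataFusion or Parquet APIs, and the pruning rules are simplified to a single predicate each. It assumes a platform where Python floats are IEEE 754 doubles.

```python
import math
import struct

# Build NaNs with an explicit sign bit (quiet NaN bit patterns).
NEG_NAN = struct.unpack("<d", struct.pack("<Q", 0xFFF8000000000000))[0]
POS_NAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]


def total_order_key(x: float) -> int:
    """Map a double to an integer whose order matches IEEE 754 totalOrder."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    if bits >> 63:                      # sign bit set: flip every bit
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | (1 << 63)             # sign bit clear: set the sign bit


def widen_for_total_order(stat_min: float, stat_max: float, nan_count: int):
    """Total-order engine reading nan_count stats (hypothetical rule):
    the sign of the NaNs is unknown, so assume both signs may be present."""
    if nan_count > 0:
        return NEG_NAN, POS_NAN
    return stat_min, stat_max


def can_prune_gt_total_order(stat_max: float, literal: float) -> bool:
    """Total-order engine: skip the page for `x > literal` when max <= literal."""
    return total_order_key(stat_max) <= total_order_key(literal)


def can_prune_lt_nan_greatest(stat_min: float, literal: float) -> bool:
    """Engine where every NaN is greater than all real values: skip the page
    for `x < literal` when the real minimum is >= literal. A NaN minimum
    hides the real minimum, so the page must be scanned."""
    if math.isnan(stat_min):
        return False
    return stat_min >= literal


# Total-order engine, predicate x > 0.0
_, widened_max = widen_for_total_order(-2.0, 0.0, nan_count=1)
prune_total_with_total_stats = can_prune_gt_total_order(0.0, 0.0)
prune_total_with_nan_count = can_prune_gt_total_order(widened_max, 0.0)

# NaN-greatest engine, predicate x < -2.0
prune_greatest_with_nan_count = can_prune_lt_nan_greatest(-2.0, -2.0)
prune_greatest_with_total_stats = can_prune_lt_nan_greatest(NEG_NAN, -2.0)

print(prune_total_with_total_stats, prune_total_with_nan_count)
print(prune_greatest_with_nan_count, prune_greatest_with_total_stats)
```

Under these assumptions, each engine can prune only with the statistics that match its own NaN semantics, which is the asymmetry described above. A separate -NaN count would let the total-order engine narrow the widened bounds back to `(-NaN, 0.0)`.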
