orlp commented on PR #221: URL: https://github.com/apache/parquet-format/pull/221#issuecomment-3140841598
@alamb

> but I don't see any reason we can't introduce it later on

It would be incompatible with all existing Parquet readers that implemented this PR.

Suppose I write a query engine with a Parquet reader and want to do pruning. My query is "find all rows where column `x` is NaN". Naturally, I read this proposal and implement it by cleverly pruning anything where the min and max statistics aren't NaN. Now if we change the semantics of min and max so that they no longer include NaNs, which are instead reported in `nan_count`, this query engine would incorrectly miss rows if it doesn't understand `nan_count`.

So the only option would be to *completely replace* the ordering introduced in this PR with *another* ordering that includes this `nan_count` behavior. But then we go back to my original complaint, as the older engines don't support this new ordering:

> I do see it as a one-way door because no vendor is going to adopt a new ordering later if it means all readers which do not (yet) support the new ordering will downgrade from "suboptimal pruning" (the new status quo if this is adopted) to "no pruning" (the current status quo).
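
The failure mode described above can be sketched in Python. This is only an illustration under stated assumptions, not the Parquet API: `ColumnStats`, both `stats_*` writers, and both `*_reader_may_contain_nan` functions are hypothetical names, and the ordering is simplified so that every NaN sorts above all other values. It also assumes each row group holds at least one value.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for per-row-group column statistics."""
    min_value: Optional[float]
    max_value: Optional[float]
    nan_count: Optional[int] = None  # hypothetical later addition


def stats_nan_in_bounds(values: List[float]) -> ColumnStats:
    """Assumed semantics A: NaN takes part in the ordering and surfaces in max.

    Simplification: every NaN is treated as sorting above all other values.
    """
    non_nan = [v for v in values if not math.isnan(v)]
    has_nan = len(non_nan) != len(values)
    lo = min(non_nan) if non_nan else math.nan
    hi = math.nan if has_nan else max(non_nan)
    return ColumnStats(lo, hi)


def stats_nan_excluded(values: List[float]) -> ColumnStats:
    """Assumed semantics B: min/max ignore NaNs, which are counted separately."""
    non_nan = [v for v in values if not math.isnan(v)]
    return ColumnStats(
        min(non_nan) if non_nan else None,
        max(non_nan) if non_nan else None,
        nan_count=len(values) - len(non_nan),
    )


def _is_nan(v: Optional[float]) -> bool:
    return v is not None and math.isnan(v)


def old_reader_may_contain_nan(stats: ColumnStats) -> bool:
    """Reader written against semantics A: keep the row group only if a bound is NaN."""
    return _is_nan(stats.min_value) or _is_nan(stats.max_value)


def new_reader_may_contain_nan(stats: ColumnStats) -> bool:
    """Reader that understands nan_count and falls back to the bounds otherwise."""
    if stats.nan_count is not None:
        return stats.nan_count > 0
    return old_reader_may_contain_nan(stats)


row_group = [1.0, math.nan, 3.0]
print(old_reader_may_contain_nan(stats_nan_in_bounds(row_group)))  # True
print(old_reader_may_contain_nan(stats_nan_excluded(row_group)))   # False: row group wrongly pruned
print(new_reader_may_contain_nan(stats_nan_excluded(row_group)))   # True
```

The second call is the problematic case: a reader that only knows the first semantics skips a row group that does contain a NaN once the writer switches to the second semantics under the same ordering.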
