orlp commented on PR #221:
URL: https://github.com/apache/parquet-format/pull/221#issuecomment-3140841598

   @alamb
   
   > but I don't see any reason we can't introduce it later on
   
   It would be incompatible with all existing Parquet readers which implemented 
this PR.
   
   Suppose I write a query engine with a Parquet reader, and want to do 
pruning. My query is "find all rows where column `x` is NaN". Naturally, I 
read this proposal and implement it by cleverly pruning anything whose min 
and max statistics are both non-NaN.
   
   Now if we change the semantics of min and max to no longer include NaNs, 
with NaNs instead being reported in `nan_count`, this query engine would 
incorrectly miss rows if it doesn't understand `nan_count`. So the only option 
would be to *completely replace* the ordering introduced in this PR with 
*another* ordering that includes this `nan_count` behavior. But then we are 
back to my original complaint, as older engines don't support this new 
ordering:
   
   > I do see it as a one-way door because no vendor is going to adopt a new 
ordering later if it means all readers which do not (yet) support the new 
ordering will downgrade from "suboptimal pruning" (the new status quo if this 
is adopted) to "no pruning" (the current status quo).
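
   To make the failure mode concrete, here is a rough sketch of the scenario 
above. Everything in it is hypothetical: `ColumnStats`, the two writer 
functions and the two reader functions are illustrative names, not part of 
any Parquet library. It also simplifies the ordering by assuming any NaN 
ends up in `max`, and assumes each chunk holds at least one non-NaN value.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColumnStats:
    # Hypothetical stand-in for the statistics of one row group or page.
    min: float
    max: float
    nan_count: Optional[int] = None  # absent under the semantics of this PR


def write_stats_nans_in_minmax(values: List[float]) -> ColumnStats:
    # Assumed semantics of this PR (simplified): NaNs take part in min/max,
    # and a NaN sorts after every number, so it shows up in `max`.
    nums = [v for v in values if not math.isnan(v)]
    has_nan = len(nums) != len(values)
    return ColumnStats(min=min(nums), max=math.nan if has_nan else max(nums))


def write_stats_with_nan_count(values: List[float]) -> ColumnStats:
    # Hypothetical later semantics: min/max ignore NaNs, which are only
    # reported through `nan_count`.
    nums = [v for v in values if not math.isnan(v)]
    return ColumnStats(min=min(nums), max=max(nums),
                       nan_count=len(values) - len(nums))


def old_reader_may_contain_nan(stats: ColumnStats) -> bool:
    # Reader written against this PR: keep the chunk only if min or max is NaN.
    return math.isnan(stats.min) or math.isnan(stats.max)


def new_reader_may_contain_nan(stats: ColumnStats) -> bool:
    # Reader that understands `nan_count`; falls back to min/max otherwise.
    if stats.nan_count is not None:
        return stats.nan_count > 0
    return old_reader_may_contain_nan(stats)


values = [1.0, math.nan, 3.0]
stats_pr = write_stats_nans_in_minmax(values)
stats_later = write_stats_with_nan_count(values)

print(old_reader_may_contain_nan(stats_pr))     # chunk is kept
print(old_reader_may_contain_nan(stats_later))  # chunk is pruned, NaN row lost
print(new_reader_may_contain_nan(stats_later))  # chunk is kept
```

   The old reader is correct on files written with the first semantics and 
silently wrong on files written with the second, which is why the change 
cannot be layered onto the same ordering.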


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

