alamb commented on PR #221:
URL: https://github.com/apache/parquet-format/pull/221#issuecomment-3140071375

   > > I don't see adopting total ordering as a one way door, we can always add 
a nan count mechanism later.
   > 
   > I do see it as a one-way door because no vendor is going to adopt a new 
ordering later if it means all readers which do not (yet) support the new 
ordering will downgrade from "suboptimal pruning" (the new status quo if this 
is adopted) to "no pruning" (the current status quo). We realistically speaking 
only have one chance to go from the status quo to a good solution, adopting a 
suboptimal solution now will likely leave us in a local maximum for the 
remainder of Parquet's lifetime.
   
   I don't think the two proposals are mutually exclusive. Here is an existence 
proof of a way they could co-exist:
   1. Introduce IEEE 754 total order (this PR/proposal)
   2. In a follow-on change, adopt the `nan_count` statistic proposed by @orlp, 
with these semantics: if `nan_count` is specified, then NaNs should **not** be 
included in the statistics; if `nan_count` is not specified, then NaNs **are** 
included in the statistics, per this proposal
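
   As a rough illustration of how the two steps could fit together, here is a 
minimal Python sketch. The names `total_order_key` and `chunk_stats`, and the 
dict used for statistics, are hypothetical and not part of any Parquet API; the 
sketch only assumes float64 values and the semantics described above.

```python
import math
import struct

# A positive quiet NaN built from its bit pattern, so the sign bit is known.
POS_NAN = struct.unpack("<d", struct.pack("<Q", 0x7FF8000000000000))[0]


def total_order_key(x: float) -> int:
    """Map a float64 to an integer that sorts in IEEE 754 total order."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Negative floats: flip the 63 non-sign bits so a larger magnitude sorts lower.
    return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits


def chunk_stats(values, write_nan_count):
    """Hypothetical writer-side statistics for one column chunk.

    write_nan_count=False: step 1 only, NaNs take part in min/max.
    write_nan_count=True:  step 2, NaNs are counted and left out of min/max.
    """
    stats = {}
    candidates = list(values)
    if write_nan_count:
        non_nan = [v for v in candidates if not math.isnan(v)]
        stats["nan_count"] = len(candidates) - len(non_nan)
        candidates = non_nan
    if candidates:  # an all-NaN chunk with nan_count gets no min/max here
        stats["min"] = min(candidates, key=total_order_key)
        stats["max"] = max(candidates, key=total_order_key)
    return stats


values = [1.5, POS_NAN, -2.0, 0.0]
print(chunk_stats(values, write_nan_count=False))
print(chunk_stats(values, write_nan_count=True))
```

   In this sketch a reader that sees `nan_count` can treat min/max as NaN-free 
bounds, while a reader that does not see it falls back to the total-order 
interpretation, in which a positive NaN sorts above every other value.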
   
   Of course, the spec would be less complicated if we included `nan_count` to 
begin with, but I don't see any reason we can't introduce it later on.
   
   So my suggestion is: let's get this proposal up for a vote, and then write up 
a follow-on proposal for being more efficient in the presence of NaNs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

