orlp commented on PR #221:
URL: https://github.com/apache/parquet-format/pull/221#issuecomment-3140927834

   >  @orlp are you willing to to make such a proposal if it doesn't exist?
   
   I don't have the time or energy to make a fully formal proposal, no.
   
   That said, if I may informally propose something, I think the current order 
we have in `TypeDefinedOrder` is honestly fine:
   
   ```
      * (*) Because the sorting order is not specified properly for floating
      *     point values (relations vs. total ordering) the following
      *     compatibility rules should be applied when reading statistics:
      *     - If the min is a NaN, it should be ignored.
      *     - If the max is a NaN, it should be ignored.
      *     - If the min is +0, the row group may contain -0 values as well.
      *     - If the max is -0, the row group may contain +0 values as well.
      *     - When looking for NaN values, min and max should be ignored.
      * 
      *     When writing statistics the following rules should be followed:
      *     - NaNs should not be written to min or max statistics fields.
      *     - If the computed max value is zero (whether negative or positive),
      *       `+0.0` should be written into the max statistics field.
      *     - If the computed min value is zero (whether negative or positive),
      *       `-0.0` should be written into the min statistics field.
   ```
   
   I would honestly only clarify what should be done in the event a full-NaN 
page is encountered. I think in that case it makes the most sense to write 
`-inf` to the min and `+inf` to the max. That's honestly the only defect in the 
current spec, that we specify "NaNs should not be written to min or max 
statistics fields", but not what *should* be done in that case.
   
   ---
   
   Then if this ordering (**which we already have!** - can't get easier than 
that) is simply combined with the addition of two new statistics 
(`pos_nan_count` and `neg_nan_count`), **all** execution engines should be able 
to prune as they please, regardless of what semantics they have for comparison 
(both total ordering like in DataFusion and partial ordering like in Polars).
   
   I assume the addition of a completely new statistic is backwards compatible, 
if we don't change any semantics of the existing ordering (only clarifying what 
*should* be done for the full-NaN page)?
   
   Then the statistics can be implemented in an incredibly simple branchless 
(with `cmov`s) way:
   
   ```rust
   min = inf
   max = -inf
   pos_nan_count = 0
   neg_nan_count = 0
   for x in data {
       min = if x < min { x } else { min }
       max = if x > max { x } else { max }
       pos_nan_count += x.is_nan() & x.is_sign_positive()
       neg_nan_count += x.is_nan() & x.is_sign_negative()
   }
   
   // Post-processing normalization.
   if min == 0 { min = -0.0 }
   if max == 0 { max = 0.0 }
   if pos_nan_count + neg_nan_count == data.len() {
       min = -inf
       max = inf
   }
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to