This is an automated email from the ASF dual-hosted git repository.

wgtmac pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/parquet-format.git


The following commit(s) were added to refs/heads/master by this push:
     new 662cdac  PARQUET-2249: Introduce IEEE 754 total order & NaN-counts (#514)
662cdac is described below

commit 662cdac7ffb7e71e10e0e0c519b215791ff2d1aa
Author: Jan Finis <[email protected]>
AuthorDate: Tue May 26 04:09:31 2026 +0200

    PARQUET-2249: Introduce IEEE 754 total order & NaN-counts (#514)
    
    This commit is a combination of the following PRs:
    
    * Introduce IEEE 754 total order
      https://github.com/apache/parquet-format/pull/221
    * Add nan_count to handle NaNs in statistics
      https://github.com/apache/parquet-format/pull/196
    
    Both these PRs try to solve the same problems; read
    the description of the respective PRs for explanation.
    
    This PR is the result of an extended discussion in
    which it was repeatedly brought up that another possible
    solution to the problem could be the combination of
    the two approaches. Please refer to the discussion
    on the mailing list and in the two PRs for details.
    The mailing list discussion can be found here:
    https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s
    
    The contents of this PR are basically a straightforward
    combination of the two approaches:
    * IEEE total order is introduced as a new order for floating
      point types
    * nan_count and nan_counts fields are added
    
    Legacy writers may not write nan_count(s) fields,
    so readers have to handle them being absent. Also, legacy
    writers may have included NaNs into min/max bounds, so readers
    also have to handle that.
    
    As there are no legacy writers writing IEEE total order,
    nan_count(s) are defined to be mandatory if this order is used,
    so readers can assume their presence when this order is used.
    
    This commit removes `nan_pages` from the ColumnIndex,
    which the nan_counts PR mandated. They are no longer needed
    because IEEE total order solves the problem: since this is a
    new order, there are no legacy writers or readers, so we are
    free to define that, for only-NaN pages using this order,
    writers store NaNs in the min & max bounds and readers can
    assume that isNaN(min) signals an only-NaN page.
---
 LogicalTypes.md                |  5 ++-
 README.md                      |  4 +-
 src/main/thrift/parquet.thrift | 95 ++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 98 insertions(+), 6 deletions(-)
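
The reader-side rules described in the commit message can be sketched as
follows. This is a minimal Python sketch, not code from parquet-format or any
Parquet library: the function names (`may_contain_nan`, `all_non_null_are_nan`,
`page_is_only_nan`) are hypothetical, and the statistics are assumed to be
already decoded into plain Python values, with `None` standing for an absent
field.

```python
import math
from typing import Optional


def may_contain_nan(nan_count: Optional[int]) -> bool:
    """Return True unless the statistics prove that no NaN is present."""
    # An absent nan_count (e.g. written by a legacy writer) must be read as
    # "NaNs may be present"; only an explicit zero rules them out.
    return nan_count is None or nan_count > 0


def all_non_null_are_nan(
    nan_count: Optional[int], null_count: Optional[int], num_values: int
) -> Optional[bool]:
    """Apply the nan_count + null_count == num_values check.

    Returns None when a required count is absent, i.e. nothing can be deduced.
    Note that the check is also true when every value is null.
    """
    if nan_count is None or null_count is None:
        return None
    return nan_count + null_count == num_values


def page_is_only_nan(min_value: float) -> bool:
    """Column index check for a non-null page under IEEE 754 total order.

    Assumption: the caller has already verified that the column uses
    IEEE_754_TOTAL_ORDER; under TYPE_ORDER a NaN bound must be ignored instead.
    """
    return math.isnan(min_value)
```

For example, `all_non_null_are_nan(3, 2, 5)` is true because three NaNs and
two nulls account for all five values, while a missing `nan_count` yields
`None` and the reader has to fall back to scanning the data.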

diff --git a/LogicalTypes.md b/LogicalTypes.md
index f3375e5..795c223 100644
--- a/LogicalTypes.md
+++ b/LogicalTypes.md
@@ -254,7 +254,10 @@ Used in contexts where precision is traded off for smaller footprint and potenti
 
 The primitive type is a 2-byte `FIXED_LEN_BYTE_ARRAY`.
 
-The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`.
+Like `FLOAT` and `DOUBLE`, the sort order for `FLOAT16` is signed with special
+handling for NaNs and signed zeros. Writers should use IEEE754TotalOrder for
+consistent handling of these edge cases. See the `ColumnOrder` union in the
+[Thrift definition](src/main/thrift/parquet.thrift) for details.
 
 ## Temporal Types
 
diff --git a/README.md b/README.md
index d348209..d398ac4 100644
--- a/README.md
+++ b/README.md
@@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
 Parquet stores min/max statistics at several levels (such as Column Chunk,
 Column Index, and Data Page). These statistics are according to a sort order,
 which is defined for each column in the file footer. Parquet supports common
-sort orders for logical and primitive types. The details are documented in the
+sort orders for logical and primitive types and also special orders for types
+with potentially ambiguous semantics (e.g., NaN ordering for floating point
+types). The details are documented in the
 [Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.
 
 ## Nested Encoding
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index 883264c..225f85f 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -309,6 +309,13 @@ struct Statistics {
    7: optional bool is_max_value_exact;
    /** If true, min_value is the actual minimum value for a column */
    8: optional bool is_min_value_exact;
+   /**
+    * Count of NaN values in the column; only present if physical type is FLOAT
+    * or DOUBLE, or logical type is FLOAT16.
+    * If this field is not present, readers MUST assume NaNs may be present
+    * (i.e. MUST assume nan_count > 0 and MAY NOT assume nan_count == 0).
+    */
+   9: optional i64 nan_count;
 }
 
 /** Empty structs to use as logical type annotations */
@@ -1050,6 +1057,9 @@ struct RowGroup {
 /** Empty struct to signal the order defined by the physical or logical type */
 struct TypeDefinedOrder {}
 
+/** Empty struct to signal IEEE 754 total order for floating point types */
+struct IEEE754TotalOrder {}
+
 /**
  * Union to specify the order used for the min_value and max_value fields for a
  * column. This union takes the role of an enhanced enum that allows rich
@@ -1058,6 +1068,7 @@ struct TypeDefinedOrder {}
  * Possible values are:
  * * TypeDefinedOrder - the column uses the order defined by its logical or
  *                      physical type (if there is no logical type).
+ * * IEEE754TotalOrder - the floating point column uses IEEE 754 total order.
  *
  * If the reader does not support the value of this union, min and max stats
  * for this column should be ignored.
@@ -1111,23 +1122,78 @@ union ColumnOrder {
    *      64-bit signed integer (nanos)
    *    See https://github.com/apache/parquet-format/issues/502 for more details
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because TYPE_ORDER is ambiguous for floating point types due to
+   *     underspecified handling of NaN and -0/+0, it is recommended that writers
+   *     use IEEE_754_TOTAL_ORDER for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-null
+   *       values are NaN.
    *     - If the min is +0, the row group may contain -0 values as well.
    *     - If the max is -0, the row group may contain +0 values as well.
    *     - When looking for NaN values, min and max should be ignored.
+   *       If the nan_count field is set, it can be used to check whether
+   *       NaNs are present.
    *
-   *     When writing statistics the following rules should be followed:
-   *     - NaNs should not be written to min or max statistics fields.
+   *     When writing page or column chunk statistics for columns with
+   *     TYPE_ORDER order, the following rules must be followed:
+   *     - The nan_count field must be set for floating point types, even if
+   *       it is zero.
+   *     - If the nan_count field is set, min and max statistics fields, when
+   *       present, must not contain NaN values and must be computed from
+   *       non-NaN values only. This signals to readers that the min and max
+   *       statistics are reliable for non-NaN values.
+   *     - If all non-null values are NaN, min and max statistics must not be
+   *       written.
    *     - If the computed max value is zero (whether negative or positive),
    *       `+0.0` should be written into the max statistics field.
    *     - If the computed min value is zero (whether negative or positive),
    *       `-0.0` should be written into the min statistics field.
+   *
+   *     When writing column indexes for columns with TYPE_ORDER order, the
+   *     following rules must be followed:
+   *     - NaNs must not be written to min_values or max_values.
+   *     - If all non-null values of a page are NaN, a column index must not
+   *       be written for this column chunk because min_values and max_values
+   *       are required.
+   *     - If the computed max value is zero (whether negative or positive),
+   *       `+0.0` should be written into the corresponding max_values entry.
+   *     - If the computed min value is zero (whether negative or positive),
+   *       `-0.0` should be written into the corresponding min_values entry.
    */
   1: TypeDefinedOrder TYPE_ORDER;
+
+  /*
+   * The floating point type is ordered according to the totalOrder predicate,
+   * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+   * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this ordering.
+   *
+   * Intuitively, this orders floats mathematically, but defines -0 to be less
+   * than +0, -NaN to be less than anything else, and +NaN to be greater than
+   * anything else. It also defines an order between different bit representations
+   * of the same value.
+   *
+   * When writing statistics for columns with IEEE_754_TOTAL_ORDER order, the
+   * following rules must be followed:
+   * - Writing the nan_count field is mandatory when using this ordering.
+   * - Min and max statistics must contain the smallest and largest non-NaN
+   *   values respectively, or if all non-null values are NaN, the smallest and
+   *   largest NaN values as defined by IEEE 754 total order.
+   *
+   * When reading statistics for columns with this order, the following rules
+   * should be followed:
+   * - Readers should consult the nan_count field to determine whether NaNs
+   *   are present.
+   * - A reader can compute nan_count + null_count == num_values to deduce
+   *   whether all non-null values are NaN. In the page index, which does not
+   *   have a num_values field, the presence of a NaN value in min_values
+   *   or max_values indicates that all non-null values are NaN.
+   */
+  2: IEEE754TotalOrder IEEE_754_TOTAL_ORDER;
 }
 
 struct PageLocation {
@@ -1199,6 +1265,18 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated before
    * using them by inspecting null_pages.
+   *
+   * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+   * NaN values are not to be included in these bounds. If all non-null values
+   * of a page are NaN, then a writer must do the following:
+   * - If the order of this column is TYPE_ORDER, then a column index must
+   *   not be written for this column chunk. While this is unfortunate for
+   *   performance, it is necessary to avoid conflict with legacy files that
+   *   still included NaN in min_values and max_values even if the page had
+   *   non-NaN values. To mitigate this, IEEE_754_TOTAL_ORDER is recommended.
+   * - If the order of this column is IEEE_754_TOTAL_ORDER, then min_values[i]
+   *   and max_values[i] of that page must be set to the smallest and largest
+   *   NaN values as defined by IEEE 754 total order.
    */
   2: required list<binary> min_values
   3: required list<binary> max_values
@@ -1240,6 +1318,15 @@ struct ColumnIndex {
     * Same as repetition_level_histograms except for definitions levels.
     **/
    7: optional list<i64> definition_level_histograms;
+
+   /**
+    * A list containing the number of NaN values for each page. Only present
+    * for columns of physical type FLOAT or DOUBLE, or logical type FLOAT16.
+    * If this field is not present, readers MUST assume that there might be
+    * NaN values in any page.
+    */
+   8: optional list<i64> nan_counts
+
 }
 
 struct AesGcmV1 {
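
The IEEE 754 total order added by the diff above (-0 below +0, -NaN below
everything, +NaN above everything) can be illustrated with a sort key derived
from the bit pattern of a DOUBLE. This is a minimal Python sketch under stated
assumptions, not the reference implementation: the helper names
(`total_order_key_from_bits`, `total_order_key`, `min_max_total_order`) are
hypothetical, and only 64-bit values are handled.

```python
import struct
from typing import Iterable, Optional, Tuple

_MAGNITUDE_MASK = 0x7FFF_FFFF_FFFF_FFFF
_SIGN_BIT = 1 << 63


def total_order_key_from_bits(bits: int) -> int:
    """Map an unsigned 64-bit DOUBLE bit pattern to an integer sort key."""
    if bits & _SIGN_BIT:
        # Negative values: a larger magnitude must sort lower, and -0
        # (magnitude 0) must land just below +0 (key 0).
        return -(bits & _MAGNITUDE_MASK) - 1
    # Non-negative values already sort correctly by their bit pattern.
    return bits


def total_order_key(x: float) -> int:
    """Sort key for a Python float, via its IEEE 754 binary64 encoding."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return total_order_key_from_bits(bits)


def min_max_total_order(values: Iterable[float]) -> Optional[Tuple[float, float]]:
    """Bounds as described for IEEE_754_TOTAL_ORDER statistics.

    Uses the smallest and largest non-NaN values; if every value is NaN,
    falls back to the smallest and largest NaN in total order.
    Returns None for an empty input.
    """
    values = list(values)
    non_nan = [v for v in values if v == v]  # NaN is the only value != itself
    candidates = non_nan if non_nan else values
    if not candidates:
        return None
    return (
        min(candidates, key=total_order_key),
        max(candidates, key=total_order_key),
    )
```

Sorting `[0.0, 1.0, -0.0, -1.0]` with `key=total_order_key` places `-0.0`
before `0.0`, which a plain numeric comparison cannot distinguish.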
