wgtmac commented on code in PR #514:
URL: https://github.com/apache/parquet-format/pull/514#discussion_r2272517329


##########
src/main/thrift/parquet.thrift:
##########
@@ -1082,23 +1093,105 @@ union ColumnOrder {
    *   BYTE_ARRAY - unsigned byte-wise comparison
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating
+   *     point types due to underspecified handling of NaN and -0/+0,
+   *     it is recommended that writers use IEEE_754_TOTAL_ORDER
+   *     for these types.
+   *
+   *     If TYPE_ORDER is used for floating point types, then the following
    *     compatibility rules should be applied when reading statistics:
    *     - If the min is a NaN, it should be ignored.
    *     - If the max is a NaN, it should be ignored.
+   *     - If the nan_count field is set, a reader can compute
+   *       nan_count + null_count == num_values to deduce whether all non-null
+   *       values are NaN.
+   *     - When looking for NaN values, min and max should be ignored.

Review Comment:
   Moving these to line 1113 as it has duplicate content.



##########
src/main/thrift/parquet.thrift:
##########
@@ -1170,6 +1263,19 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated 
before
    * using them by inspecting null_pages.
+   * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+   * NaN values are not to be included in these bounds. If all non-null values
+   * of a page are NaN, then a writer must do the following:
+   * - If the order of this column is TypeDefinedOrder, then no column index
+   *   must be written for this column chunk. While this is unfortunate for
+   *   performance, it is necessary to avoid conflict with legacy files that
+   *   still included NaN in min_values and max_values even if the page had
+   *   non-NaN values. To mitigate this, IEEE754_TOTAL_ORDER is recommended.
+   * - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
+   * * If IEEE754_TOTAL_ORDER is used for the column and all non-null values
+   *   of a page are NaN, then min_values[i] and max_values[i] must be set to
+   *   the smallest and largest NaN value contained in the page, as defined
+   *   by the IEEE 754 total order.

Review Comment:
   ```suggestion
      * - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
      *   and max_values[i] must be set to the smallest and largest NaN value
      *   contained in the page, as defined by the IEEE 754 total order.
   ```



##########
src/main/thrift/parquet.thrift:
##########
@@ -1170,6 +1263,19 @@ struct ColumnIndex {
    * Such more compact values must still be valid values within the column's
    * logical type. Readers must make sure that list entries are populated 
before
    * using them by inspecting null_pages.
+   * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+   * NaN values are not to be included in these bounds. If all non-null values
+   * of a page are NaN, then a writer must do the following:
+   * - If the order of this column is TypeDefinedOrder, then no column index

Review Comment:
   ```suggestion
      * - If the order of this column is TYPE_ORDER, then no column index
   ```
   
   Just to be consistent



##########
src/main/thrift/parquet.thrift:
##########
@@ -1211,6 +1316,16 @@ struct ColumnIndex {
     * Same as repetition_level_histograms except for definitions levels.
     **/
    7: optional list<i64> definition_level_histograms;
+
+   /** 
+    * A list containing the number of NaN values for each page. Only present
+    * for columns of physical type FLOAT or DOUBLE, or logical type FLOAT16.
+    * If this field is not present, readers MUST assume that there might or
+    * might not be NaN values in any page, as NaNs should not be included
+    * in min_values or max_values.

Review Comment:
   This looks imprecise to me. NaNs are included in min_values/max_values when 
IEEE754TotalOrder is used. So perhaps we should simply say that `If this field 
is not present, readers MUST assume that there might be NaN values in any page`?



##########
src/main/thrift/parquet.thrift:
##########
@@ -1082,23 +1093,105 @@ union ColumnOrder {
    *   BYTE_ARRAY - unsigned byte-wise comparison
    *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
    *
-   * (*) Because the sorting order is not specified properly for floating
-   *     point values (relations vs. total ordering) the following
+   * (*) Because the precise sorting order is ambiguous for floating

Review Comment:
   ```suggestion
      * (*) Because the TypeDefinedOrder is ambiguous for floating
   ```
   
   Perhaps explicitly use `TypeDefinedOrder` or `TYPE_ORDER` here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to