alamb commented on code in PR #514:
URL: https://github.com/apache/parquet-format/pull/514#discussion_r2266615423
##########
src/main/thrift/parquet.thrift:
##########
@@ -309,6 +309,13 @@ struct Statistics {
7: optional bool is_max_value_exact;
/** If true, min_value is the actual minimum value for a column */
8: optional bool is_min_value_exact;
+ /**
+ * count of NaN values in the column; only present if physical type is FLOAT
+ * or DOUBLE, or logical type is FLOAT16.
+ * Readers MUST distinguish between nan_count not being present and
nan_count == 0.
+ * If nan_count is not present, readers MUST NOT assume nan_count == 0.
+ */
+ 9: optional i64 nan_count;
Review Comment:
I would find this easier to understand if it were more concise and didn't
have a double negative. Here is a suggestion:
```suggestion
/**
* count of NaN values in the column; only present if physical type is
FLOAT
* or DOUBLE, or logical type is FLOAT16.
* If this field is not present, readers must assume NaNs may be present
* (MUST NOT assume nan_count == 0).
*/
9: optional i64 nan_count;
```
##########
README.md:
##########
@@ -158,7 +158,9 @@ documented in [LogicalTypes.md][logical-types].
Parquet stores min/max statistics at several levels (such as Column Chunk,
Column Index, and Data Page). These statistics are according to a sort order,
which is defined for each column in the file footer. Parquet supports common
-sort orders for logical and primitve types. The details are documented in the
+sort orders for logical and primitve types and also special orders for types
+where the common sort order is not unambiguously defined (e.g., NaN ordering
+for floating point types). The details are documented in the
Review Comment:
There is a spelling mistake (`primitve`). Also here is a proposal to make
this more concise
```suggestion
sort orders for logical and primitive types and also special orders for
types with potentially ambiguous semantics (e.g., NaN ordering
for floating point types). The details are documented in the
```
##########
src/main/thrift/parquet.thrift:
##########
@@ -1170,6 +1263,19 @@ struct ColumnIndex {
* Such more compact values must still be valid values within the column's
* logical type. Readers must make sure that list entries are populated
before
* using them by inspecting null_pages.
+ * For columns of physical type FLOAT or DOUBLE, or logical type FLOAT16,
+ * NaN values are not to be included in these bounds. If all non-null values
+ * of a page are NaN, then a writer must do the following:
+ * - If the order of this column is TypeDefinedOrder, then no column index
+ * must be written for this column chunk. While this is unfortunate for
+ * performance, it is necessary to avoid conflict with legacy files that
+ * still included NaN in min_values and max_values even if the page had
+ * non-NaN values. To mitigate this, IEEE754_TOTAL_ORDER is recommended.
+ * - If the order of this column is IEEE754_TOTAL_ORDER, then min_values[i]
+ * * If IEEE754_TOTAL_ORDER is used for the column and all non-null values
Review Comment:
I think the heading `-` bullet is different:
```suggestion
* - If IEEE754_TOTAL_ORDER is used for the column and all non-null values
```
##########
src/main/thrift/parquet.thrift:
##########
@@ -1082,23 +1093,105 @@ union ColumnOrder {
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
- * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
+ * (*) Because the precise sorting order is ambiguous for floating
+ * point types due to underspecified handling of NaN and -0/+0,
+ * it is recommended that writers use IEEE_754_TOTAL_ORDER
+ * for these types.
+ *
+ * If TYPE_ORDER is used for floating point types, then the following
* compatibility rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-null
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
+ * NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.
*
* When writing statistics the following rules should be followed:
- * - NaNs should not be written to min or max statistics fields.
+ * - It is suggested to always set the nan_count field for floating
+ * point types, especially also if it is zero.
+ * - NaNs should not be written to min or max statistics fields except
+ * in the column index, where min_values and max_values are not
optional
+ * so a NaN value must be written if all non-null values in a page
+ * are NaN.
* - If the computed max value is zero (whether negative or positive),
* `+0.0` should be written into the max statistics field.
* - If the computed min value is zero (whether negative or positive),
* `-0.0` should be written into the min statistics field.
*/
1: TypeDefinedOrder TYPE_ORDER;
+
+ /*
+ * The floating point type is ordered according to the totalOrder predicate,
+ * as defined in section 5.10 of IEEE-754 (2008 revision). Only columns of
+ * physical type FLOAT or DOUBLE, or logical type FLOAT16 may use this
ordering.
+ *
+ * Intuitively, this orders floats mathematically, but defines -0 to be less
+ * than +0, -NaN to be less than anything else, and +NaN to be greater than
+ * anything else. It also defines an order between different bit
representations
+ * of the same value.
+ *
+ * The formal definition is as follows:
+ * a) If x<y, totalOrder(x, y) is true.
+ * b) If x>y, totalOrder(x, y) is false.
+ * c) If x=y:
+ * 1) totalOrder(−0, +0) is true.
+ * 2) totalOrder(+0, −0) is false.
+ * 3) If x and y represent the same floating-point datum:
+ * i) If x and y have negative sign, totalOrder(x, y) is true if and
+ * only if the exponent of x ≥ the exponent of y
+ * ii) otherwise totalOrder(x, y) is true if and only if the exponent
+ * of x ≤ the exponent of y.
+ * d) If x and y are unordered numerically because x or y is NaN:
+ * 1) totalOrder(−NaN, y) is true where −NaN represents a NaN with
+ * negative sign bit and y is a non-NaN floating-point number.
+ * 2) totalOrder(x, +NaN) is true where +NaN represents a NaN with
+ * positive sign bit and x is a non-NaN floating-point number.
+ * 3) If x and y are both NaNs, then totalOrder reflects a total ordering
+ * based on:
+ * i) negative sign orders below positive sign
+ * ii) signaling orders below quiet for +NaN, reverse for −NaN
+ * iii) lesser payload, when regarded as an integer, orders below
+ * greater payload for +NaN, reverse for −NaN.
+ *
+ * Note that this ordering can be implemented efficiently in software by
bit-wise
+ * operations on the integer representation of the floating point values.
+ * E.g., this is a possible implementation for DOUBLE in Rust:
+ *
+ * pub fn totalOrder(x: f64, y: f64) -> bool {
+ * let mut x_int = x.to_bits() as i64;
+ * let mut y_int = y.to_bits() as i64;
+ * x_int ^= (((x_int >> 63) as u64) >> 1) as i64;
+ * y_int ^= (((y_int >> 63) as u64) >> 1) as i64;
+ * return x_int <= y_int;
+ * }
+ *
+ * When writing statistics for columns with this order, the following rules
+ * must be followed:
+ * - Writing the nan_count field is mandatory when using this ordering,
+ * especialy also if it is zero.
Review Comment:
I if the field is mandatory, the "especially" part makes no sense
```suggestion
* - Writing the nan_count field is mandatory when using this ordering.
```
##########
src/main/thrift/parquet.thrift:
##########
@@ -1082,23 +1093,105 @@ union ColumnOrder {
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
- * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
+ * (*) Because the precise sorting order is ambiguous for floating
+ * point types due to underspecified handling of NaN and -0/+0,
+ * it is recommended that writers use IEEE_754_TOTAL_ORDER
+ * for these types.
+ *
+ * If TYPE_ORDER is used for floating point types, then the following
* compatibility rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-null
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
+ * NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.
*
* When writing statistics the following rules should be followed:
- * - NaNs should not be written to min or max statistics fields.
+ * - It is suggested to always set the nan_count field for floating
+ * point types, especially also if it is zero.
+ * - NaNs should not be written to min or max statistics fields except
+ * in the column index, where min_values and max_values are not
optional
+ * so a NaN value must be written if all non-null values in a page
+ * are NaN.
Review Comment:
```suggestion
* - NaNs should not be written to min or max statistics fields except
* in the column index when a page contains only NaN values. In this
* case, since min_values and max_values are required, a NaN value
* must be written.
```
##########
src/main/thrift/parquet.thrift:
##########
@@ -1082,23 +1093,105 @@ union ColumnOrder {
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*
- * (*) Because the sorting order is not specified properly for floating
- * point values (relations vs. total ordering) the following
+ * (*) Because the precise sorting order is ambiguous for floating
+ * point types due to underspecified handling of NaN and -0/+0,
+ * it is recommended that writers use IEEE_754_TOTAL_ORDER
+ * for these types.
+ *
+ * If TYPE_ORDER is used for floating point types, then the following
* compatibility rules should be applied when reading statistics:
* - If the min is a NaN, it should be ignored.
* - If the max is a NaN, it should be ignored.
+ * - If the nan_count field is set, a reader can compute
+ * nan_count + null_count == num_values to deduce whether all non-null
+ * values are NaN.
+ * - When looking for NaN values, min and max should be ignored.
+ * If the nan_count field is set, it can be used to check whether
+ * NaNs are present.
* - If the min is +0, the row group may contain -0 values as well.
* - If the max is -0, the row group may contain +0 values as well.
* - When looking for NaN values, min and max should be ignored.
*
* When writing statistics the following rules should be followed:
- * - NaNs should not be written to min or max statistics fields.
+ * - It is suggested to always set the nan_count field for floating
Review Comment:
The heading says "the following rules should be followed" but the first one
says "it is suggested" which seems to imply it is optional. I recommend
removing the "suggested"
##########
src/main/thrift/parquet.thrift:
##########
@@ -1211,6 +1316,16 @@ struct ColumnIndex {
* Same as repetition_level_histograms except for definitions levels.
**/
7: optional list<i64> definition_level_histograms;
+
+ /**
+ * A list containing the number of NaN values for each page. Only present
+ * for columns of physical type FLOAT or DOUBLE, or logical type FLOAT16.
+ * If this field is not present, readers MUST assume that there might or
+ * might not be NaN values in any page, as NaNs should not be included
+ * in min_values or max_values.
Review Comment:
```suggestion
* might not be NaN values in any page, as NaNs are not included
* in min_values or max_values.
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]