etseidl commented on code in PR #183: URL: https://github.com/apache/parquet-site/pull/183#discussion_r3374036386
########## content/en/blog/features/ieee754-order.md: ########## @@ -0,0 +1,88 @@ +--- +title: "Taming Floating-Point Statistics in Apache Parquet: IEEE 754 Total Order and NaN Counts" +date: 2026-05-29 +description: "How the Apache Parquet Community resolved potentially ambiguous floating-point statistics using IEEE 754 total order and explicit NaN counts" +author: "[Jan Finis](https://github.com/JFinis), [Gang Wu](https://github.com/wgtmac)" +categories: ["features"] +--- + +Column statistics are the secret to Apache Parquet's blazing fast performance. By storing compact summaries—like `min`, `max`, and null counts—for row groups, column chunks, and pages, readers can easily skip irrelevant data that doesn't match a query. + +However, floating-point values throw a wrench into this simple model. The IEEE 754 standard defines special values like `NaN` (Not a Number), signed zeros (`-0.0` and `+0.0`), and infinities. Their comparison rules don't play well with the simple "total order" (a strict smaller-to-larger ranking) expected by most data-pruning algorithms. To fix this, the Parquet community recently clarified the standard by combining IEEE 754 total order semantics with an explicit `nan_count` field in the statistics. + +The result is a much clearer contract between data writers and readers. Floating-point bounds can now be interpreted consistently, and readers can confidently determine if `NaN` values are present, without having to guess based solely on `min` and `max` bounds. + +## Why Floating-Point Statistics Need Special Handling + +For integers, strings, and many other straightforward types, Parquet statistics are simple: the writer records the absolute smallest and largest values, and the reader uses those bounds to decide if a query might find a match. + +Floating-point columns are trickier for two major reasons. First, `-0.0` and `+0.0` are considered equal in normal math operations, yet they possess distinct underlying bit patterns. A data format needs strict rules on how to order these values; otherwise, different libraries might generate conflicting statistics for the exact same underlying data. Parquet's approach to dealing with this ambiguity has been to mandate that a `min` of `0.0` always be written as `-0.0`, and a `max` of `0.0` always be written as `+0.0`, regardless of any sign bits that may be present in the actual data. Readers are advised that `-0.0` may be present even if the `min` is `+0.0`, and `+0.0` may be present even if the max is `-0.0`. + +Second, `NaN` is completely unordered under standard IEEE 754 comparisons. Expressions like `x < NaN`, `x > NaN`, and `x == NaN` always evaluate to false. If a writer blindly includes `NaN` in ordinary `min` or `max` calculations, the resulting bounds might be useless for skipping data. To date Parquet has followed the latter approach, forbidding the inclusion of `NaN` in the statistics. [PR #196](https://github.com/apache/parquet-format/pull/196) provides a detailed overview of the problems inherent in this approach. For instance, consider a page with a max statistic of `0.0` that also contains a `NaN`. A query engine that considers `NaN` to be greater than all values attempts a query with a predicate like `x > 1.0`. If the engine examines the statistics, it will see that the `max` is `0.0`, so it might improperly skip that page, even though it contains at least one row that satisfies the predicate. Without knowledge of the presence or absence of `NaN`, the engine cannot safely per form this type of page pruning for floating point columns. + +These aren't just theoretical edge cases. Query engines rely heavily on these statistics to safely skip large chunks of data. Ambiguous floating-point bounds degrade query performance and can lead to severe inconsistencies. [PARQUET-2249](https://issues.apache.org/jira/browse/PARQUET-2249) exposed another critical flaw in how `NaN` values interacted with Parquet's `ColumnIndex` metadata. + +The core problem in PARQUET-2249 arose when a data page contained *only* `NaN` values. Such a page isn't considered "null"—the data is physically there, and `NaN` is distinct from a missing `null` value. However, older guidelines stated that `NaN` values shouldn't be included when computing `min` and `max`. This put the `ColumnIndex` in an impossible situation: it strictly required valid `min` and `max` bounds for any non-null page, yet there were no non-`NaN` values available to use! + +## How the Community Reached Consensus + +The final solution didn't emerge overnight. The discussion began with [PR #196](https://github.com/apache/parquet-format/pull/196) in March 2023, which proposed adding a `nan_count` to floating-point statistics. This neatly solved one problem: making it explicitly clear whether a chunk of data contains `NaN`s. It also provided a safe migration path, as the new field could be added without breaking older readers. + +However, adding `nan_count` alone didn't completely solve the `ColumnIndex` dilemma. For page and column chunk statistics, a reader can combine `num_values`, `null_count`, and `nan_count` to deduce if all non-null values are `NaN`. But `ColumnIndex` doesn't store `num_values`, making that math impossible. The community brainstormed various workarounds, like writing `NaN` directly into the bounds or adding special markers (like `nan_pages`). Each idea solved part of the puzzle but introduced new complexities or metadata bloat. + +[PR #221](https://github.com/apache/parquet-format/pull/221) explored an alternative route: introducing a brand new `IEEE_754_TOTAL_ORDER` for floating-point columns. This gave physical `FLOAT`, `DOUBLE`, and logical `FLOAT16` values a strict, deterministic position, sorting out the signed zeros and different `NaN` bit patterns. However, this approach had a critical flaw on its own: because `NaN` values are placed at the absolute extremes of the total order, the presence of even a single `NaN` would pollute the `min` or `max` bounds, rendering normal numeric predicate pushdown completely ineffective. Furthermore, while it elegantly removed ordering ambiguity, it required an opt-in from readers and writers, and didn't fully answer how much readers should know about the presence of `NaN`s in older files. + +Ultimately, the community reached a consensus: why not combine the best of both worlds? [PR #514](https://github.com/apache/parquet-format/pull/514), opened in August 2025 and merged in May 2026, successfully merged `IEEE_754_TOTAL_ORDER` with `NaN` counts. The new order strictly defines how bounds are compared, while `nan_count` clearly flags the presence of `NaN`s. Because no legacy writer uses the new order, the spec now safely *requires* `nan_count` whenever `IEEE_754_TOTAL_ORDER` is used, while gracefully handling older files. + +## The Solution + +The resulting specification elegantly marries two concepts: an optional `nan_count` field for both `Statistics` and `ColumnIndex`, and the `IEEE_754_TOTAL_ORDER` column order. + +`nan_count` records the exact number of `NaN` values within a given scope. Because the field is optional (or completely missing from older files), readers must treat a *missing* `nan_count` differently than a `0`. If missing, readers must cautiously assume `NaN` values *might* be present. If a column is written using `IEEE_754_TOTAL_ORDER`, the writer is forced to provide the `nan_count`. + +This new total order applies exclusively to physical `FLOAT`, `DOUBLE`, or logical `FLOAT16` columns. It defines a strict, deterministic ordering for bit patterns with these key properties: + +1. `-0.0` is ordered strictly before `+0.0`. Review Comment: > So the implication to implementations is that special care should be taken when a literal in a predicate is 0. I think how engines treat -0 will dictate how they interpret the statistics. But at least now the statistics are well defined. In the past, a min of `0.0` was represented as `-0.0`, which would give incorrect information to engines that distinguish between positive and negative zero. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
