xudong963 commented on code in PR #15330:
URL: https://github.com/apache/datafusion/pull/15330#discussion_r2005013876


##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -797,10 +797,34 @@ pub async fn fetch_statistics(
     statistics_from_parquet_meta_calc(&metadata, table_schema)
 }
 
-/// Convert statistics in  [`ParquetMetaData`] into [`Statistics`] using ['StatisticsConverter`]
+/// Convert statistics in [`ParquetMetaData`] into [`Statistics`] using [`StatisticsConverter`]
 ///
 /// The statistics are calculated for each column in the table schema
 /// using the row group statistics in the parquet metadata.
+///
+/// # Key behaviors:
+///
+/// 1. Extracts row counts and byte sizes from all row groups
+/// 2. Applies schema type coercions to align file schema with table schema
+/// 3. Collects and aggregates statistics across row groups when available
+///
+/// # When there are no statistics:
+///
+/// If the Parquet file doesn't contain any statistics (has_statistics is false), the function returns a Statistics object with:
+/// - Exact row count
+/// - Exact byte size
+/// - All column statistics marked as unknown via Statistics::unknown_column(&table_schema)
+/// # When only some columns have statistics:
+///
+/// For columns with statistics:
+/// - Min/max values are properly extracted and represented as Precision::Exact
+/// - Null counts are calculated by summing across row groups
+///
+/// For columns without statistics,
+/// - For min/max, there are two situations:
+///     1. The column isn't in arrow schema, then min/max values are set to Precision::Absent
+///     2. The column is in arrow schema, but not in parquet schema due to schema evolution, min/max values are set to Precision::Exact(null)

Review Comment:
   I have a question about this behavior: shouldn't it be `Precision::Absent` 
rather than `Precision::Exact(null)`?
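
To make the distinction concrete, here is a minimal sketch. It uses a simplified stand-in for `Precision`, not DataFusion's actual type; `min_for_missing_column` and its `use_absent` flag are hypothetical, and `Option<i64>` stands in for a nullable scalar. It only illustrates that `Exact(null)` asserts a known value while `Absent` asserts nothing:

```rust
// Simplified stand-in for DataFusion's `Precision`; an assumption for
// illustration only, not the real definition.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

impl<T> Precision<T> {
    // True only when the statistic claims to be exactly known.
    fn is_exact(&self) -> bool {
        matches!(self, Precision::Exact(_))
    }
}

// Hypothetical helper: the min statistic for a column that is in the table
// schema but missing from the Parquet file. `None` models a null scalar.
fn min_for_missing_column(use_absent: bool) -> Precision<Option<i64>> {
    if use_absent {
        // The alternative raised in this comment: nothing is known.
        Precision::Absent
    } else {
        // The behavior described in the doc comment: an exact null.
        Precision::Exact(None)
    }
}

fn main() {
    let documented = min_for_missing_column(false);
    let suggested = min_for_missing_column(true);
    println!("documented: {:?}, exact = {}", documented, documented.is_exact());
    println!("suggested:  {:?}, exact = {}", suggested, suggested.is_exact());
}
```

A consumer that trusts `Exact` would treat the null as a real bound. Whether that is correct depends on whether the missing column is filled with nulls for every row, which is an assumption here rather than something the diff states.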



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

