alamb commented on code in PR #15330:
URL: https://github.com/apache/datafusion/pull/15330#discussion_r2006654113


##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -797,10 +797,34 @@ pub async fn fetch_statistics(
     statistics_from_parquet_meta_calc(&metadata, table_schema)
 }
 
-/// Convert statistics in  [`ParquetMetaData`] into [`Statistics`] using ['StatisticsConverter`]
+/// Convert statistics in [`ParquetMetaData`] into [`Statistics`] using [`StatisticsConverter`]
 ///
 /// The statistics are calculated for each column in the table schema
 /// using the row group statistics in the parquet metadata.
+///
+/// # Key behaviors:
+///
+/// 1. Extracts row counts and byte sizes from all row groups
+/// 2. Applies schema type coercions to align file schema with table schema
+/// 3. Collects and aggregates statistics across row groups when available
+///
+/// # When there are no statistics:
+///
+/// If the Parquet file doesn't contain any statistics (has_statistics is false), the function returns a Statistics object with:
+/// - Exact row count
+/// - Exact byte size
+/// - All column statistics marked as unknown via Statistics::unknown_column(&table_schema)
+/// # When only some columns have statistics:
+///
+/// For columns with statistics:
+/// - Min/max values are properly extracted and represented as Precision::Exact
+/// - Null counts are calculated by summing across row groups
+///
+/// For columns without statistics,
+/// - For min/max, there are two situations:
+///     1. The column isn't in arrow schema, then min/max values are set to Precision::Absent
+///     2. The column is in arrow schema, but not in parquet schema due to schema evolution, min/max values are set to Precision::Exact(null)

Review Comment:
   I think in this case the default schema adapter fills in a constant null 
   value for every such column, so Precision::Exact(null) is correct.
   
   However, as @adriangb found in 
   https://github.com/apache/datafusion/pull/15263 and elsewhere, when users 
   supply custom schema adapters, a value other than NULL can be filled in.
   
   Maybe this is another place where the schema adapter could/should be used 🤔 
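
To make the concern concrete, here is a rough sketch of the distinction. It is a simplified model and not DataFusion's actual code: `Precision` below is a cut-down stand-in for the real enum, and `ColumnSource`, `FilledWith`, and `min_max` are hypothetical names invented for illustration. The rule that min/max become `Absent` as soon as one row group lacks statistics is also an assumption of this sketch.

```rust
// Cut-down stand-in for DataFusion's `Precision`; NOT the real type.
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Absent,
}

/// Hypothetical description of how one table column relates to one Parquet file.
enum ColumnSource {
    /// Column exists in the file; one optional (min, max) pair per row group.
    InFile(Vec<Option<(i64, i64)>>),
    /// Column is missing from the file and an adapter fills in a constant.
    /// `None` models SQL NULL (what the default adapter is described as doing);
    /// `Some(v)` models a custom adapter that fills in `v` instead.
    FilledWith(Option<i64>),
}

/// Returns (min, max) for one column. `None` inside `Exact` models a NULL scalar.
fn min_max(source: &ColumnSource) -> (Precision<Option<i64>>, Precision<Option<i64>>) {
    match source {
        ColumnSource::InFile(groups) => {
            // Collecting into `Option<Vec<_>>` yields `None` if any row group
            // has no statistics (assumption of this sketch).
            let all: Option<Vec<(i64, i64)>> = groups.iter().copied().collect();
            match all {
                Some(stats) if !stats.is_empty() => {
                    // Smallest of the per-row-group minimums, largest of the maximums.
                    let lo = stats.iter().map(|s| s.0).min().unwrap();
                    let hi = stats.iter().map(|s| s.1).max().unwrap();
                    (Precision::Exact(Some(lo)), Precision::Exact(Some(hi)))
                }
                _ => (Precision::Absent, Precision::Absent),
            }
        }
        // Every row holds the same fill value, so it is both the min and the max.
        // Hard-coding `Exact(None)` here would only be right for a NULL fill.
        ColumnSource::FilledWith(fill) => (Precision::Exact(*fill), Precision::Exact(*fill)),
    }
}
```

In this model, `min_max(&ColumnSource::FilledWith(None))` gives `Exact(None)` for both bounds, matching the documented `Precision::Exact(null)`, while `min_max(&ColumnSource::FilledWith(Some(0)))` gives `Exact(Some(0))`, which is the case a hard-coded null would get wrong.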



##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -797,10 +797,34 @@ pub async fn fetch_statistics(
     statistics_from_parquet_meta_calc(&metadata, table_schema)
 }
 
-/// Convert statistics in  [`ParquetMetaData`] into [`Statistics`] using ['StatisticsConverter`]
+/// Convert statistics in [`ParquetMetaData`] into [`Statistics`] using [`StatisticsConverter`]
 ///
 /// The statistics are calculated for each column in the table schema
 /// using the row group statistics in the parquet metadata.
+///
+/// # Key behaviors:

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

