xudong963 commented on code in PR #15330: URL: https://github.com/apache/datafusion/pull/15330#discussion_r2005013876
##########
datafusion/datasource-parquet/src/file_format.rs:
##########
@@ -797,10 +797,34 @@ pub async fn fetch_statistics(
     statistics_from_parquet_meta_calc(&metadata, table_schema)
 }
 
-/// Convert statistics in [`ParquetMetaData`] into [`Statistics`] using ['StatisticsConverter`]
+/// Convert statistics in [`ParquetMetaData`] into [`Statistics`] using [`StatisticsConverter`]
 ///
 /// The statistics are calculated for each column in the table schema
 /// using the row group statistics in the parquet metadata.
+///
+/// # Key behaviors:
+///
+/// 1. Extracts row counts and byte sizes from all row groups
+/// 2. Applies schema type coercions to align file schema with table schema
+/// 3. Collects and aggregates statistics across row groups when available
+///
+/// # When there are no statistics:
+///
+/// If the Parquet file doesn't contain any statistics (has_statistics is false), the function returns a Statistics object with:
+/// - Exact row count
+/// - Exact byte size
+/// - All column statistics marked as unknown via Statistics::unknown_column(&table_schema)
+/// # When only some columns have statistics:
+///
+/// For columns with statistics:
+/// - Min/max values are properly extracted and represented as Precision::Exact
+/// - Null counts are calculated by summing across row groups
+///
+/// For columns without statistics,
+/// - For min/max, there are two situations:
+///   1. The column isn't in arrow schema, then min/max values are set to Precision::Absent
+///   2. The column is in arrow schema, but not in parquet schema due to schema revolution, min/max values are set to Precision::Exact(null)

Review Comment:
In fact, I have a question about this behavior: shouldn't it be `Precision::Absent`?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
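To make the question in the review comment concrete, here is a minimal sketch of the two candidate conventions for a column that is in the table schema but missing from the Parquet file. The `Precision` enum below is a simplified local stand-in with the same three variants as DataFusion's type, `NullableInt` is a hypothetical substitute for a nullable scalar value, and `usable_bound` and `min_for_missing_column` are illustrative helpers, not functions from the PR. One possible reading is that `Exact(None)` claims the minimum is known and is null, while `Absent` claims nothing at all.

```rust
/// Simplified local stand-in for DataFusion's `Precision` (assumption:
/// only the variant shapes matter for this illustration).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical substitute for a nullable scalar such as an Int64 value.
type NullableInt = Option<i64>;

/// Returns a bound only when it is known exactly and is not null.
fn usable_bound(p: &Precision<NullableInt>) -> Option<i64> {
    match p {
        Precision::Exact(Some(v)) => Some(*v),
        _ => None,
    }
}

/// Min statistic for a column missing from the file, under the documented
/// convention (`Exact(null)`) or the one the comment suggests (`Absent`).
fn min_for_missing_column(documented_behavior: bool) -> Precision<NullableInt> {
    if documented_behavior {
        Precision::Exact(None)
    } else {
        Precision::Absent
    }
}
```

In this sketch neither convention yields a usable bound, but the two values are not equal, so any consumer that distinguishes "known to be null" from "unknown" would treat them differently.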
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org