alamb commented on issue #10453: URL: https://github.com/apache/datafusion/issues/10453#issuecomment-2117733259
After working through an actual example in https://github.com/apache/datafusion/pull/10549 I have a new API proposal: https://github.com/NGA-TRAN/arrow-datafusion/pull/118

Here is what the API looks like:

```rust
/// What type of statistics should be extracted?
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum RequestedStatistics {
    /// Minimum Value
    Min,
    /// Maximum Value
    Max,
    /// Null Count, returned as a [`UInt64Array`]
    NullCount,
}

/// Extracts Parquet statistics as Arrow arrays
///
/// This is used to convert Parquet statistics to Arrow arrays, with proper type
/// conversions. This information can be used for pruning parquet files or row
/// groups based on the statistics embedded in parquet files.
///
/// # Schemas
///
/// The schema of the parquet file and the arrow schema are used to convert the
/// underlying statistics value (stored as a parquet value) into the
/// corresponding Arrow value. For example, Decimals are stored as binary in
/// parquet files.
///
/// The parquet_schema and arrow_schema do not have to be identical (for
/// example, the columns may be in different orders and one or the other schema
/// may have additional columns). The function [`parquet_column`] is used to
/// match the column in the parquet file to the column in the arrow schema.
///
/// # Multiple parquet files
///
/// This API is designed to support efficiently extracting statistics from
/// multiple parquet files (hence why the parquet schema is passed in as an
/// argument). This is useful when building an index for a directory of parquet
/// files.
#[derive(Debug)]
pub struct StatisticsConverter<'a> {
    /// The name of the column to extract statistics for
    column_name: &'a str,
    /// The type of statistics to extract
    statistics_type: RequestedStatistics,
    /// The arrow schema of the query
    arrow_schema: &'a Schema,
    /// The field (with data type) of the column in the arrow schema
    arrow_field: &'a Field,
}

impl<'a> StatisticsConverter<'a> {
    /// Returns a [`UInt64Array`] with row counts for each row group
    ///
    /// The returned array has no nulls, and has one value for each row group.
    /// Each value is the number of rows in the row group.
    pub fn row_counts(metadata: &ParquetMetaData) -> Result<UInt64Array> { ... }

    /// Create a new statistics converter
    pub fn try_new(
        column_name: &'a str,
        statistics_type: RequestedStatistics,
        arrow_schema: &'a Schema,
    ) -> Result<Self> { ... }

    /// Extract the statistics from a parquet file, given the parquet file's metadata
    ///
    /// The returned array contains 1 value for each row group in the parquet
    /// file, in order.
    ///
    /// Each value is either
    /// * the requested statistics type for the column
    /// * a null value, if the statistics can not be extracted
    ///
    /// Note that a null value does NOT mean the min or max value was actually
    /// `null`; it means the requested statistic is unknown.
    ///
    /// Reasons for not being able to extract the statistics include:
    /// * the column is not present in the parquet file
    /// * statistics for the column are not present in the row group
    /// * the stored statistic value can not be converted to the requested type
    pub fn extract(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> { ... }
}
```
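For illustration, here is a minimal sketch of how the proposed converter might be driven end to end against a single file. The `StatisticsConverter` / `RequestedStatistics` calls are taken from the proposal above; the metadata loading uses the existing `parquet` crate reader, while `min_values_per_row_group`, the `Box<dyn Error>` error handling, and the (not yet decided) import path for the converter are illustrative assumptions:

```rust
use std::fs::File;

use arrow::array::ArrayRef;
use parquet::arrow::parquet_to_arrow_schema;
use parquet::file::reader::{FileReader, SerializedFileReader};

// `min_values_per_row_group` is a hypothetical helper; the import path for
// `StatisticsConverter` / `RequestedStatistics` is not decided yet.
fn min_values_per_row_group(
    path: &str,
    column: &str,
) -> Result<ArrayRef, Box<dyn std::error::Error>> {
    // Open the file and read only its footer metadata (no row data)
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let metadata = reader.metadata();

    // Derive the Arrow schema from the file's embedded Parquet schema
    let arrow_schema = parquet_to_arrow_schema(
        metadata.file_metadata().schema_descr(),
        metadata.file_metadata().key_value_metadata(),
    )?;

    // One converter per (column, statistic) pair, per the proposal above
    let converter = StatisticsConverter::try_new(
        column,
        RequestedStatistics::Min,
        &arrow_schema,
    )?;

    // One element per row group; a null means "statistic unknown",
    // not that the column's minimum is null
    Ok(converter.extract(metadata)?)
}
```

Keeping construction (`try_new`) separate from extraction (`extract`) means the same converter can be applied to the metadata of many files, which is what motivates the multi-file extension below.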
I am envisioning this API could also easily support extracting from multiple files in one go:

```rust
impl<'a> StatisticsConverter<'a> {
    // ...

    /// Extract statistics from multiple parquet files into a single arrow
    /// array, one element per row group per file
    fn extract_multi(
        &self,
        metadata: impl IntoIterator<Item = &ParquetMetaData>,
    ) -> Result<ArrayRef> { ... }
}
```

as well as extracting information from the page index:

```rust
impl<'a> StatisticsConverter<'a> {
    // ...

    /// Extract statistics from the page indexes across all row groups. The
    /// returned array has one element per page across all row groups
    fn extract_page(
        &self,
        metadata: impl IntoIterator<Item = &ParquetMetaData>,
    ) -> Result<ArrayRef> { ... }
}
```
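To show the multi-file shape concretely, here is a sketch of building an index over a directory of files, under the assumption that `extract_multi` lands with roughly the signature above; `build_min_index`, the file discovery, and the simplified error handling are hypothetical:

```rust
use std::fs::File;

use arrow::array::ArrayRef;
use parquet::file::reader::{FileReader, SerializedFileReader};

// `build_min_index` is hypothetical; it assumes `extract_multi` exists
// with roughly the signature sketched above.
fn build_min_index(
    converter: &StatisticsConverter<'_>,
    paths: &[&str],
) -> Result<ArrayRef, Box<dyn std::error::Error>> {
    // Keep the readers alive so their metadata can be borrowed below
    let mut readers = Vec::with_capacity(paths.len());
    for path in paths {
        readers.push(SerializedFileReader::new(File::open(path)?)?);
    }

    // One element per row group per file, in file order; `extract_page`
    // would be driven the same way to get one element per page instead
    Ok(converter.extract_multi(readers.iter().map(|r| r.metadata()))?)
}
```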
