alamb commented on issue #10453: URL: https://github.com/apache/datafusion/issues/10453#issuecomment-2117733259
After working through an actual example in https://github.com/apache/datafusion/pull/10549 I have a new API proposal: https://github.com/NGA-TRAN/arrow-datafusion/pull/118

Here is what the API looks like:

```rust
/// What type of statistics should be extracted?
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub enum RequestedStatistics {
    /// Minimum Value
    Min,
    /// Maximum Value
    Max,
    /// Null Count, returned as a [`UInt64Array`]
    NullCount,
}

/// Extracts Parquet statistics as Arrow arrays
///
/// This is used to convert Parquet statistics to Arrow arrays, with proper type
/// conversions. This information can be used for pruning parquet files or row
/// groups based on the statistics embedded in parquet files.
///
/// # Schemas
///
/// The schema of the parquet file and the arrow schema are used to convert the
/// underlying statistics value (stored as a parquet value) into the
/// corresponding Arrow value. For example, Decimals are stored as binary in
/// parquet files.
///
/// The parquet_schema and arrow_schema do not have to be identical (for
/// example, the columns may be in different orders and one or the other schema
/// may have additional columns). The function [`parquet_column`] is used to
/// match the column in the parquet file to the column in the arrow schema.
///
/// # Multiple parquet files
///
/// This API is designed to support efficiently extracting statistics from
/// multiple parquet files (hence why the parquet schema is passed in as an
/// argument). This is useful when building an index for a directory of parquet
/// files.
#[derive(Debug)]
pub struct StatisticsConverter<'a> {
    /// The name of the column to extract statistics for
    column_name: &'a str,
    /// The type of statistics to extract
    statistics_type: RequestedStatistics,
    /// The arrow schema of the query
    arrow_schema: &'a Schema,
    /// The field (with data type) of the column in the arrow schema
    arrow_field: &'a Field,
}

impl<'a> StatisticsConverter<'a> {
    /// Returns a [`UInt64Array`] with row counts for each row group
    ///
    /// The returned array has no nulls, and has one value for each row group.
    /// Each value is the number of rows in the row group.
    pub fn row_counts(metadata: &ParquetMetaData) -> Result<UInt64Array> { ... }

    /// Create a new statistics converter
    pub fn try_new(
        column_name: &'a str,
        statistics_type: RequestedStatistics,
        arrow_schema: &'a Schema,
    ) -> Result<Self> { ... }

    /// Extract the statistics from a parquet file, given the parquet file's metadata
    ///
    /// The returned array contains 1 value for each row group in the parquet
    /// file, in order.
    ///
    /// Each value is either
    /// * the requested statistics type for the column
    /// * a null value, if the statistics can not be extracted
    ///
    /// Note that a null value does NOT mean the min or max value was actually
    /// `null`; it means the requested statistic is unknown.
    ///
    /// Reasons for not being able to extract the statistics include:
    /// * the column is not present in the parquet file
    /// * statistics for the column are not present in the row group
    /// * the stored statistic value can not be converted to the requested type
    pub fn extract(&self, metadata: &ParquetMetaData) -> Result<ArrayRef> { ... }
}
```
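For illustration, here is a minimal sketch of how the proposed converter might be driven end to end against a single file. The `StatisticsConverter` / `RequestedStatistics` calls are taken from the proposal above; the metadata loading uses the existing `parquet` crate reader, while `min_values_per_row_group`, the `Box<dyn Error>` error handling, and the (not yet decided) import path for the converter are illustrative assumptions:

```rust
use std::fs::File;

use arrow::array::ArrayRef;
use parquet::arrow::parquet_to_arrow_schema;
use parquet::file::reader::{FileReader, SerializedFileReader};

// `min_values_per_row_group` is a hypothetical helper; the import path for
// `StatisticsConverter` / `RequestedStatistics` is not decided yet.
fn min_values_per_row_group(
    path: &str,
    column: &str,
) -> Result<ArrayRef, Box<dyn std::error::Error>> {
    // Open the file and read only its footer metadata (no row data)
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let metadata = reader.metadata();

    // Derive the Arrow schema from the file's embedded Parquet schema
    let arrow_schema = parquet_to_arrow_schema(
        metadata.file_metadata().schema_descr(),
        metadata.file_metadata().key_value_metadata(),
    )?;

    // One converter per (column, statistic) pair, per the proposal above
    let converter = StatisticsConverter::try_new(
        column,
        RequestedStatistics::Min,
        &arrow_schema,
    )?;

    // One element per row group; a null means "statistic unknown",
    // not that the column's minimum is null
    Ok(converter.extract(metadata)?)
}
```

Keeping construction (`try_new`) separate from extraction (`extract`) means the same converter can be applied to the metadata of many files, which is what motivates the multi-file extension below.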
I am envisioning this API could also easily support extracting from multiple files in one go:

```rust
impl<'a> StatisticsConverter<'a> {
    // ...

    /// Extract statistics from multiple parquet files into a single arrow
    /// array, one element per row group per file
    fn extract_multi(
        &self,
        metadata: impl IntoIterator<Item = &ParquetMetaData>,
    ) -> Result<ArrayRef> { ... }
}
```

as well as extracting information from the page index:

```rust
impl<'a> StatisticsConverter<'a> {
    // ...

    /// Extract statistics from the page indexes across all row groups. The
    /// returned array has one element per page across all row groups
    fn extract_page(
        &self,
        metadata: impl IntoIterator<Item = &ParquetMetaData>,
    ) -> Result<ArrayRef> { ... }
}
```
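To show the multi-file shape concretely, here is a sketch of building an index over a directory of files, under the assumption that `extract_multi` lands with roughly the signature above; `build_min_index`, the file discovery, and the simplified error handling are hypothetical:

```rust
use std::fs::File;

use arrow::array::ArrayRef;
use parquet::file::reader::{FileReader, SerializedFileReader};

// `build_min_index` is hypothetical; it assumes `extract_multi` exists
// with roughly the signature sketched above.
fn build_min_index(
    converter: &StatisticsConverter<'_>,
    paths: &[&str],
) -> Result<ArrayRef, Box<dyn std::error::Error>> {
    // Keep the readers alive so their metadata can be borrowed below
    let mut readers = Vec::with_capacity(paths.len());
    for path in paths {
        readers.push(SerializedFileReader::new(File::open(path)?)?);
    }

    // One element per row group per file, in file order; `extract_page`
    // would be driven the same way to get one element per page instead
    Ok(converter.extract_multi(readers.iter().map(|r| r.metadata()))?)
}
```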
