alamb opened a new issue, #10453: URL: https://github.com/apache/datafusion/issues/10453
### Is your feature request related to a problem or challenge?

There are at least three places where parquet statistics are extracted into `ArrayRef`s today:

1. ParquetExec (pruning row groups): https://github.com/apache/datafusion/blob/465c89f7f16d48b030d4a384733567b91dab88fa/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L18-L17
2. ParquetExec (pruning pages): https://github.com/apache/datafusion/blob/671cef85c550969ab2c86d644968a048cb181c0c/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L393-L392
3. ListingTable (pruning files): https://github.com/apache/datafusion/blob/97148bd105fc2102b0444f2d67ef535937da5dfe/datafusion/core/src/datasource/file_format/parquet.rs#L295-L294

Not only are there three copies of the code, they are all subtly different (e.g. https://github.com/apache/datafusion/issues/8295) and have varying degrees of testing.

### Describe the solution you'd like

I would like one API with the following properties:

1. Extracts statistics from one or more parquet files as `ArrayRef`s suitable to pass to [PruningPredicate](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html)
2. Does so correctly (applies the appropriate schema coercion / conversion rules)
3. Does so quickly and efficiently (e.g. does not do this once per row group), and is suitable for 1000s of parquet files

### Describe alternatives you've considered

Some [ideas](https://github.com/apache/arrow-rs/issues/4328#issuecomment-2099247223) from https://github.com/apache/arrow-rs/issues/4328

Here is a proposed API:

```rust
/// Statistics extracted from `Statistics` as Arrow `ArrayRef`s
///
/// # Note:
/// If the corresponding `Statistics` is not present, or has no information for
/// a column, a NULL is present in the corresponding array entry
pub struct ArrowStatistics {
    /// min values
    min: ArrayRef,
    /// max values
    max: ArrayRef,
    /// Row counts (UInt64Array)
    row_count: ArrayRef,
    /// Null counts (UInt64Array)
    null_count: ArrayRef,
}

// (TODO accessors for min/max/row_count/null_count)

/// Extract `ArrowStatistics` from the parquet [`Statistics`]
pub fn parquet_stats_to_arrow(
    arrow_datatype: &DataType,
    statistics: impl IntoIterator<Item = Option<&Statistics>>,
) -> Result<ArrowStatistics> {
    todo!()
}
```

Maybe it would make sense to have something more builder style:

```rust
struct ParquetStatisticsExtractor {
    ...
}

// create an extractor that can extract data from parquet files
let extractor = ParquetStatisticsExtractor::new(arrow_schema, parquet_schema);

// get parquet statistics (one for each row group) somehow:
let parquet_stats: Vec<&Statistics> = ...;

// extract min/max values for columns "a" and "b"
let col_a_stats = extractor.extract("a", parquet_stats.iter());
let col_b_stats = extractor.extract("b", parquet_stats.iter());
```

(This is similar to the existing API [`parquet::arrow::parquet_to_arrow_schema`](https://docs.rs/parquet/latest/parquet/arrow/fn.parquet_to_arrow_schema.html).)

Note `Statistics` above is [`Statistics`](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html).

There is a version of this code in DataFusion that could perhaps be adapted: https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L179-L186

## Testing

I suggest we add a new module to the existing parquet tests in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet_exec.rs

The tests should look like:

```rust
let record_batch = make_batch_with_relevant_datatype();
// write batch/batches to file
// open file / extract stats from metadata
// compare stats
```

I can help write these tests.

I personally suggest:

1. Make a PR with the basic API and a single basic type (like Int/UInt or String) and figure out the test pattern (I can definitely help here)
2. Then fill out support for the rest of the types in a follow-on PR

cc @tustvold in case you have other ideas

### Additional context

This code would likely eventually be good to have in the parquet crate -- see https://github.com/apache/arrow-rs/issues/4328.
However, I think we should initially do it in DataFusion so we can iterate faster and figure out the API before moving it up there.

There are a bunch of related improvements that I think become much simpler with this feature:

1. https://github.com/apache/datafusion/issues/8229
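To illustrate the NULL-padding semantics described in the proposed `ArrowStatistics` doc comment, here is a minimal, std-only sketch: one output slot per row group, with a NULL (modeled as `None`) wherever a row group carries no statistics for the column. `RowGroupStats` and `stats_to_columns` are hypothetical stand-ins for parquet's `Statistics` and the proposed extraction API, not real names from either crate.

```rust
// Hypothetical stand-in for parquet's `Statistics` for one column of
// one row group (the real type is an enum over physical types).
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Simplified model of the proposed extraction: missing statistics
/// become `None` (a NULL array slot) rather than being skipped, so the
/// output columns line up one-to-one with the input row groups.
fn stats_to_columns(
    stats: &[Option<RowGroupStats>],
) -> (Vec<Option<i64>>, Vec<Option<i64>>) {
    let mins = stats.iter().map(|s| s.as_ref().map(|s| s.min)).collect();
    let maxes = stats.iter().map(|s| s.as_ref().map(|s| s.max)).collect();
    (mins, maxes)
}

fn main() {
    // Three row groups; the middle one has no statistics for the column.
    let stats = vec![
        Some(RowGroupStats { min: 1, max: 10 }),
        None,
        Some(RowGroupStats { min: 5, max: 7 }),
    ];
    let (mins, maxes) = stats_to_columns(&stats);
    assert_eq!(mins, vec![Some(1), None, Some(5)]);
    assert_eq!(maxes, vec![Some(10), None, Some(7)]);
    println!("mins={mins:?} maxes={maxes:?}");
}
```

In the real API these columns would be Arrow `ArrayRef`s (with validity bitmaps instead of `Option`), which is exactly the shape `PruningPredicate` consumes.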
