alamb opened a new issue, #10453: URL: https://github.com/apache/datafusion/issues/10453
### Is your feature request related to a problem or challenge?

There are at least three places where parquet statistics are extracted into `ArrayRef`s today:

1. ParquetExec (pruning row groups): https://github.com/apache/datafusion/blob/465c89f7f16d48b030d4a384733567b91dab88fa/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L18-L17
2. ParquetExec (pruning pages): https://github.com/apache/datafusion/blob/671cef85c550969ab2c86d644968a048cb181c0c/datafusion/core/src/datasource/physical_plan/parquet/page_filter.rs#L393-L392
3. ListingTable (pruning files): https://github.com/apache/datafusion/blob/97148bd105fc2102b0444f2d67ef535937da5dfe/datafusion/core/src/datasource/file_format/parquet.rs#L295-L294

Not only are there three copies of the code, they are all subtly different (e.g. https://github.com/apache/datafusion/issues/8295) and have varying degrees of testing.

### Describe the solution you'd like

I would like one API with the following properties:

1. Extracts statistics from one or more parquet files as `ArrayRef`s suitable to pass to [PruningPredicate](https://docs.rs/datafusion/latest/datafusion/physical_optimizer/pruning/struct.PruningPredicate.html)
2. Does so correctly (applies the appropriate schema coercion / conversion rules)
3. Does so quickly and efficiently (e.g. does not do this once per row group), and is suitable for 1000s of parquet files

### Describe alternatives you've considered

Some [ideas](https://github.com/apache/arrow-rs/issues/4328#issuecomment-2099247223) from https://github.com/apache/arrow-rs/issues/4328

Here is a proposed API:

```rust
/// Statistics extracted from `Statistics` as Arrow `ArrayRef`s
///
/// # Note:
/// If the corresponding `Statistics` is not present, or has no information for
/// a column, a NULL is present in the corresponding array entry
pub struct ArrowStatistics {
    /// min values
    min: ArrayRef,
    /// max values
    max: ArrayRef,
    /// Row counts (UInt64Array)
    row_count: ArrayRef,
    /// Null counts (UInt64Array)
    null_count: ArrayRef,
}

// (TODO accessors for min/max/row_count/null_count)

/// Extract `ArrowStatistics` from the parquet [`Statistics`]
pub fn parquet_stats_to_arrow(
    arrow_datatype: &DataType,
    statistics: impl IntoIterator<Item = Option<&Statistics>>,
) -> Result<ArrowStatistics> {
    todo!()
}
```

Maybe it would make sense to have something more builder style:

```rust
struct ParquetStatisticsExtractor {
    ...
}

// create an extractor that can extract data from parquet files
let extractor = ParquetStatisticsExtractor::new(arrow_schema, parquet_schema);

// get parquet statistics (one for each row group) somehow:
let parquet_stats: Vec<&Statistics> = ...;

// extract min/max values for columns "a" and "b"
let col_a_stats = extractor.extract("a", parquet_stats.iter());
let col_b_stats = extractor.extract("b", parquet_stats.iter());
```

(This is similar to the existing API [`parquet::arrow::parquet_to_arrow_schema`](https://docs.rs/parquet/latest/parquet/arrow/fn.parquet_to_arrow_schema.html).)

Note `Statistics` above is [`Statistics`](https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html).

There is a version of this code in DataFusion that could perhaps be adapted: https://github.com/apache/datafusion/blob/accce9732e26723cab2ffc521edbf5a3fe7460b3/datafusion/core/src/datasource/physical_plan/parquet/statistics.rs#L179-L186

## Testing

I suggest we add a new module to the existing parquet tests in https://github.com/apache/datafusion/blob/main/datafusion/core/tests/parquet_exec.rs

The tests should look like:

```rust
let record_batch = make_batch_with_relevant_datatype();
// write batch/batches to file
// open file / extract stats from metadata
// compare stats
```

I can help write these tests.

I personally suggest:

1. Make a PR with the basic API and a single basic type (like Int/UInt or String) and figure out the test pattern (I can definitely help here)
2. Then fill out support for the rest of the types in a follow-on PR

cc @tustvold in case you have other ideas

### Additional context

This code would likely eventually be good to have in the parquet crate -- see https://github.com/apache/arrow-rs/issues/4328.
However, I think we should initially do it in DataFusion so we can iterate faster and figure out the API before moving it up there.

There are a bunch of related improvements that I think become much simpler with this feature:

1. https://github.com/apache/datafusion/issues/8229
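To illustrate the NULL-padding semantics described in the proposed `ArrowStatistics` doc comment, here is a minimal, std-only sketch: one output slot per row group, with a NULL (modeled as `None`) wherever a row group carries no statistics for the column. `RowGroupStats` and `stats_to_columns` are hypothetical stand-ins for parquet's `Statistics` and the proposed extraction API, not real names from either crate.

```rust
// Hypothetical stand-in for parquet's `Statistics` for one column of
// one row group (the real type is an enum over physical types).
struct RowGroupStats {
    min: i64,
    max: i64,
}

/// Simplified model of the proposed extraction: missing statistics
/// become `None` (a NULL array slot) rather than being skipped, so the
/// output columns line up one-to-one with the input row groups.
fn stats_to_columns(
    stats: &[Option<RowGroupStats>],
) -> (Vec<Option<i64>>, Vec<Option<i64>>) {
    let mins = stats.iter().map(|s| s.as_ref().map(|s| s.min)).collect();
    let maxes = stats.iter().map(|s| s.as_ref().map(|s| s.max)).collect();
    (mins, maxes)
}

fn main() {
    // Three row groups; the middle one has no statistics for the column.
    let stats = vec![
        Some(RowGroupStats { min: 1, max: 10 }),
        None,
        Some(RowGroupStats { min: 5, max: 7 }),
    ];
    let (mins, maxes) = stats_to_columns(&stats);
    assert_eq!(mins, vec![Some(1), None, Some(5)]);
    assert_eq!(maxes, vec![Some(10), None, Some(7)]);
    println!("mins={mins:?} maxes={maxes:?}");
}
```

In the real API these columns would be Arrow `ArrayRef`s (with validity bitmaps instead of `Option`), which is exactly the shape `PruningPredicate` consumes.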
