marvinlanhenke commented on code in PR #10852:
URL: https://github.com/apache/datafusion/pull/10852#discussion_r1634245246
##########
datafusion/core/src/datasource/physical_plan/parquet/statistics.rs:
##########
@@ -517,6 +518,72 @@ macro_rules! get_statistics {
}}}
}
+macro_rules! make_data_page_stats_iterator {
+ ($iterator_type: ident, $func: ident, $index_type: path, $stat_value_type: ty) => {
+ struct $iterator_type<'a, I>
+ where
+ I: Iterator<Item = &'a Index>,
+ {
+ iter: I,
+ }
+
+ impl<'a, I> $iterator_type<'a, I>
+ where
+ I: Iterator<Item = &'a Index>,
+ {
+ fn new(iter: I) -> Self {
+ Self { iter }
+ }
+ }
+
+ impl<'a, I> Iterator for $iterator_type<'a, I>
+ where
+ I: Iterator<Item = &'a Index>,
+ {
+ type Item = Vec<Option<$stat_value_type>>;
+
+ fn next(&mut self) -> Option<Self::Item> {
+ let next = self.iter.next();
+ match next {
+ Some(index) => match index {
+ $index_type(native_index) => Some(
+ native_index
+ .indexes
+ .iter()
+ .map(|x| x.$func)
+ .collect::<Vec<_>>(),
+ ),
+ // No matching `Index` found.
+ // Thus no statistics that can be extracted.
+ // We return vec![None] to effectively
+ // create an arrow null-array.
+ _ => Some(vec![None]),
Review Comment:
Yes, you're right, and we definitely need test coverage here (I'm still
confused sometimes, actually all the time, about this logic 🤯).
However, I think we need a different approach, since we don't have access
to `native_index.indexes` when the index doesn't match the expected variant
or when we encounter the `Index::NONE` variant.
Based on the implementation in `page_filter.rs`
[here](https://github.com/apache/datafusion/blob/main/datafusion/core/src/physical_optimizer/pruning.rs#L885-L893)
we should probably construct a `vec![None; len]` where `len =
page_offset_index.len()` and `page_offset_index: Vec<PageLocation>`.
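Roughly what I have in mind, as a minimal sketch only. The `Index`, `PageIndex` and `PageLocation` types below are simplified stand-ins (not the real parquet types), and `page_mins` is a hypothetical helper name, not the API in this PR. The point is that the fallback arm yields one `None` per data page rather than a single `None`:

```rust
// Simplified stand-ins for the parquet types (assumption: the real
// `PageLocation` and `Index` carry more fields and variants).
#[allow(dead_code)]
#[derive(Debug, Clone, Copy)]
struct PageLocation {
    first_row_index: i64,
}

#[derive(Debug)]
struct PageIndex {
    min: Option<i32>,
}

#[allow(dead_code, non_camel_case_types)]
#[derive(Debug)]
enum Index {
    NONE,
    INT32(Vec<PageIndex>),
}

/// Hypothetical helper: returns one min value per data page.
/// If the column index is missing or of another type, it pads with
/// `None` so the result still has one entry per page.
fn page_mins(index: &Index, page_offset_index: &[PageLocation]) -> Vec<Option<i32>> {
    match index {
        // Matching index: take the per-page statistic as-is.
        Index::INT32(pages) => pages.iter().map(|p| p.min).collect(),
        // No usable index: `len` comes from the offset index, not from
        // `native_index.indexes`, which is unavailable in this arm.
        _ => vec![None; page_offset_index.len()],
    }
}
```

With three pages in `page_offset_index`, the fallback arm would return three nulls, so the resulting array lines up with the number of data pages.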
I'll try to make some changes here (the API also needs to change slightly),
and then we can discuss further?
I'll also try to write a test case with multiple data pages per row group
(I haven't found the setting yet...)
WDYT @alamb?