suremarc commented on issue #10316:
URL: https://github.com/apache/datafusion/issues/10316#issuecomment-2462945990
I went ahead and made an attempt at implementing this over in this draft PR:
#13296
I was able to reuse the `MinMaxStatistics` code with some small changes,
which was pretty cool.
I have no nontrivial way to test this code until we get statistics per
partition. In the PR I proposed the following API:
```rust
pub trait ExecutionPlan: [...] {
    // [...]
    fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
        // Return global statistics by default
        Ok(vec![
            self.statistics()?;
            self.properties().partitioning.partition_count()
        ])
    }
}
```
For the statistics to be useful we'll need non-default implementations, of
course. So I'm wondering if I should implement this method only for
`ParquetExec` in my PR, so I can get some tests in. Of course,
nothing will be finalized yet.
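To illustrate why the default alone is not enough, here is a minimal sketch using stand-in types. `Plan`, `Statistics`, `OpaquePlan` and `FileScan` below are hypothetical simplifications for illustration, not DataFusion's real definitions and not the code in the PR. The default repeats the global statistics once per partition, while an overriding implementation can report what each partition actually contains:

```rust
// Stand-in types for illustration only; NOT DataFusion's real definitions.
type Result<T> = std::result::Result<T, String>;

#[derive(Debug, Clone, PartialEq)]
struct Statistics {
    /// Number of rows, if known.
    num_rows: Option<usize>,
}

/// Hypothetical, heavily simplified analogue of `ExecutionPlan`.
trait Plan {
    /// Statistics for the whole plan output.
    fn statistics(&self) -> Result<Statistics>;

    /// Number of output partitions.
    fn partition_count(&self) -> usize;

    /// Default: repeat the global statistics once per partition.
    fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
        Ok(vec![self.statistics()?; self.partition_count()])
    }
}

/// A plan that only knows its global row count and keeps the default.
struct OpaquePlan {
    total_rows: usize,
    partitions: usize,
}

impl Plan for OpaquePlan {
    fn statistics(&self) -> Result<Statistics> {
        Ok(Statistics { num_rows: Some(self.total_rows) })
    }

    fn partition_count(&self) -> usize {
        self.partitions
    }
}

/// A scan that knows how many rows each partition holds (assumed to come
/// from file metadata) and overrides the default.
struct FileScan {
    rows_per_partition: Vec<usize>,
}

impl Plan for FileScan {
    fn statistics(&self) -> Result<Statistics> {
        let total = self.rows_per_partition.iter().sum();
        Ok(Statistics { num_rows: Some(total) })
    }

    fn partition_count(&self) -> usize {
        self.rows_per_partition.len()
    }

    fn statistics_by_partition(&self) -> Result<Vec<Statistics>> {
        Ok(self
            .rows_per_partition
            .iter()
            .map(|&n| Statistics { num_rows: Some(n) })
            .collect())
    }
}
```

With the default, every partition of `OpaquePlan` reports the full global row count, so nothing distinguishes one partition from another. `FileScan` reports a separate count per partition, which is the kind of information a test would need.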
As discussed in
[Epic: Statistics Improvements](https://github.com/apache/datafusion/issues/8227#issuecomment-2457565135),
we will need #8078 in order for this code to actually work properly in all
situations, but I believe @alamb is working on that.