paleolimbot commented on issue #8227: URL: https://github.com/apache/datafusion/issues/8227#issuecomment-4000263664
In today's community sync I volunteered to review the state of statistics issues. Given that this is already a summary issue and contains a lot of context, I'll do it here 🙂.

The first thing I see are ongoing efforts to expand the types of inexactness that can be represented in statistics. The `Statistics` object exposed to the `ExecutionPlan` doesn't make use of the `Distribution`, which is the basis for propagating statistics through expressions. This means there are two independent mechanisms by which ranges of values are propagated, with the `Distribution`-based mechanism able to more accurately represent things like half-open ranges. A relevant issue is https://github.com/apache/datafusion/issues/14896, although https://github.com/apache/datafusion/issues/8078 is similar and has some more concrete suggestions about how we can get `Statistics` to actually use the distribution/interval API.

Another "this is how statistics are represented" issue is that their current representation is expensive to copy. The PR to use an `Arc` should help ( https://github.com/apache/datafusion/pull/20570 ).

Something we discussed at the community call today is that statistics can't represent some data types. This applies to built-in types (structs, maps, and lists) and extension types (variant and geometry/geography). I see @adriangb has already taken a stab at struct support (https://github.com/apache/datafusion/pull/20589). I didn't see an issue created for that yet, so I made one: https://github.com/apache/datafusion/issues/20707 . A related item that has popped up at least once is other types of statistics (e.g., average byte size: https://github.com/apache/datafusion/pull/19963 ).

I also see the ongoing effort to implement `partition_statistics()` for the built-in operators. This seems like a huge milestone that is almost done! (One remaining PR: https://github.com/apache/datafusion/pull/16956 ).
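To make the half-open range point above concrete, here is a minimal sketch of why an inclusive min/max pair loses information that a bound-aware range keeps. Every name in it (`MinMax`, `Bound`, `Range`, `filter_less_than_minmax`, `filter_less_than_range`) is hypothetical and invented for illustration; none of it is the DataFusion `Statistics`, `Distribution`, or interval API, and it only models the predicate `col < limit` on `f64` values.

```rust
/// Hypothetical min/max summary: both endpoints are always inclusive.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MinMax {
    min: f64,
    max: f64,
}

impl MinMax {
    fn contains(&self, v: f64) -> bool {
        v >= self.min && v <= self.max
    }
}

/// Hypothetical endpoint that records whether the value itself is included.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Bound {
    Unbounded,
    Inclusive(f64),
    Exclusive(f64),
}

/// Hypothetical range built from two bound-aware endpoints.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Range {
    lower: Bound,
    upper: Bound,
}

impl Range {
    fn contains(&self, v: f64) -> bool {
        let above_lower = match self.lower {
            Bound::Unbounded => true,
            Bound::Inclusive(lo) => v >= lo,
            Bound::Exclusive(lo) => v > lo,
        };
        let below_upper = match self.upper {
            Bound::Unbounded => true,
            Bound::Inclusive(hi) => v <= hi,
            Bound::Exclusive(hi) => v < hi,
        };
        above_lower && below_upper
    }
}

/// Pick the more restrictive of two upper bounds.
fn tighter_upper(a: Bound, b: Bound) -> Bound {
    match (a, b) {
        (Bound::Unbounded, other) | (other, Bound::Unbounded) => other,
        (Bound::Inclusive(x), Bound::Inclusive(y)) => Bound::Inclusive(x.min(y)),
        (Bound::Exclusive(x), Bound::Exclusive(y)) => Bound::Exclusive(x.min(y)),
        (Bound::Inclusive(i), Bound::Exclusive(e))
        | (Bound::Exclusive(e), Bound::Inclusive(i)) => {
            // On a tie the exclusive bound wins because it also rules out `e` itself.
            if i < e {
                Bound::Inclusive(i)
            } else {
                Bound::Exclusive(e)
            }
        }
    }
}

/// Apply `col < limit` to an inclusive summary. The best it can do is clamp
/// `max` to `limit`, so `limit` itself still looks like a possible value.
fn filter_less_than_minmax(s: MinMax, limit: f64) -> MinMax {
    MinMax { min: s.min, max: s.max.min(limit) }
}

/// Apply `col < limit` to a bound-aware range. The upper bound can become
/// exclusive, so `limit` is correctly ruled out.
fn filter_less_than_range(r: Range, limit: f64) -> Range {
    Range {
        lower: r.lower,
        upper: tighter_upper(r.upper, Bound::Exclusive(limit)),
    }
}
```

With a column known to lie in `[0, 10]`, applying `col < 5` to the `MinMax` form yields `[0, 5]`, which still admits `5`, while the `Range` form yields `[0, 5)`. Having two propagation paths means the same plan can end up with both answers depending on which path computed them.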
Finally, a few statistics-related PRs that seem like they might be close to merging (some duplicates from above):

- https://github.com/apache/datafusion/pull/20292
- https://github.com/apache/datafusion/pull/16956
- https://github.com/apache/datafusion/pull/20570
- https://github.com/apache/datafusion/pull/20391

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
