paleolimbot commented on issue #8227:
URL: https://github.com/apache/datafusion/issues/8227#issuecomment-4000263664

   In today's community sync I volunteered to review the state of statistics 
issues. Given that this is already a summary issue and contains a lot of 
context I'll do it here 🙂 .
   
   The first thing I see is the ongoing effort to expand the types of inexactness 
that can be represented in statistics. The `Statistics` object exposed to the 
`ExecutionPlan` doesn't make use of the `Distribution`, which is the basis for 
propagating statistics through expressions. This means there are two 
independent means by which ranges of values are propagated, with the 
`Distribution`-based mechanism able to more accurately represent things like 
half-open ranges. A relevant issue is 
https://github.com/apache/datafusion/issues/14896 although 
https://github.com/apache/datafusion/issues/8078 is similar and has some more 
concrete suggestions about how we can get `Statistics` to actually use the 
distribution/interval API.
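   
   To make the half-open range point concrete, here is a minimal sketch. The types below (`MinMax`, `Bound`, `Interval`) are hypothetical stand-ins invented for illustration, not DataFusion's `Statistics` or `Distribution` APIs. It assumes a floating-point column known to lie in [0, 100] and the predicate `x < 10.0`.

```rust
// Hypothetical types for illustration only; these are NOT DataFusion's
// `Statistics` or `Distribution` APIs.

/// A min/max pair where both endpoints are always treated as inclusive.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MinMax {
    min: f64,
    max: f64,
}

impl MinMax {
    fn may_contain(&self, v: f64) -> bool {
        self.min <= v && v <= self.max
    }
}

/// One endpoint of an interval, either inclusive or exclusive.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Bound {
    Inclusive(f64),
    Exclusive(f64),
}

#[derive(Debug, Clone, Copy, PartialEq)]
struct Interval {
    lower: Bound,
    upper: Bound,
}

impl Interval {
    fn may_contain(&self, v: f64) -> bool {
        let above = match self.lower {
            Bound::Inclusive(lo) => v >= lo,
            Bound::Exclusive(lo) => v > lo,
        };
        let below = match self.upper {
            Bound::Inclusive(hi) => v <= hi,
            Bound::Exclusive(hi) => v < hi,
        };
        above && below
    }
}

/// Narrow a min/max pair by `x < limit`. The best it can do is clamp
/// `max` to `limit`, which still admits `limit` itself.
fn min_max_after_lt(input: MinMax, limit: f64) -> MinMax {
    MinMax { min: input.min, max: input.max.min(limit) }
}

/// Narrow an interval by `x < limit`. The upper bound can be marked
/// exclusive, so `limit` itself is ruled out.
fn interval_after_lt(input: Interval, limit: f64) -> Interval {
    let upper = match input.upper {
        // The existing bound is already tighter than the predicate.
        Bound::Inclusive(hi) | Bound::Exclusive(hi) if hi < limit => input.upper,
        _ => Bound::Exclusive(limit),
    };
    Interval { lower: input.lower, upper }
}
```

   With these toy types, the min/max pair still reports that 10.0 may be present after the filter, while the interval rules it out. That is the kind of precision lost when the two mechanisms are not connected.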
   
   Another "this is how statistics are represented" issue is that the current 
representation is expensive to copy. The PR to use an `Arc` should help ( 
https://github.com/apache/datafusion/pull/20570 ).
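   
   As a rough illustration of why an `Arc` helps with copy cost, here is a small sketch. `PlanStats` is a hypothetical stand-in, not DataFusion's `Statistics`, and this does not claim to reflect what the linked PR actually changes. It only shows the general trade-off between a deep clone and a reference-counted share.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a statistics object with per-column data;
// NOT DataFusion's `Statistics`.
#[derive(Debug, Clone, PartialEq)]
struct PlanStats {
    num_rows: Option<usize>,
    column_mins: Vec<Option<i64>>,
}

/// Deep copy: allocates a new `Vec` and copies every per-column entry.
fn deep_copy(stats: &PlanStats) -> PlanStats {
    stats.clone()
}

/// Shared copy: only bumps a reference count; the per-column data is
/// not duplicated.
fn shared_copy(stats: &Arc<PlanStats>) -> Arc<PlanStats> {
    Arc::clone(stats)
}
```

   The cost of `deep_copy` grows with the number of columns, whereas `shared_copy` is constant-time regardless of how wide the schema is.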
   
   Something we discussed at the community call today is that statistics can't 
represent some data types. This applies to built-in types (structs, maps, and 
lists) and extension types (variant and geometry/geography). I see @adriangb 
has already taken a stab at struct support 
(https://github.com/apache/datafusion/pull/20589). I didn't see an issue 
created for that yet, so I made one: 
https://github.com/apache/datafusion/issues/20707 . A related item that has 
popped up at least once is other types of statistics (e.g., average byte size 
https://github.com/apache/datafusion/pull/19963 ).
   
   I also see the ongoing effort to implement `partition_statistics()` for the 
built-in operators. This seems like a huge milestone that is almost done! (One 
remaining PR: https://github.com/apache/datafusion/pull/16956 ).
   
   Finally, a few statistics-related PRs that seem like they might be close to 
merging (some are duplicates from above):
   
   - https://github.com/apache/datafusion/pull/20292
   - https://github.com/apache/datafusion/pull/16956
   - https://github.com/apache/datafusion/pull/20570
   - https://github.com/apache/datafusion/pull/20391


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

