alamb commented on issue #8078: URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3387109890
> I gave it a shot in https://github.com/apache/datafusion/pull/17980. I think you mean - https://github.com/apache/datafusion/pull/17978 > Have we considered Precision<Distribution>? It seems to nicely encapsulate what I think we want. > Seems like we might need to do a bit of juggling if we want to access Distribution from the current location. I can't help but feel the current `Distribution` doesn't have many practical benefits -- specifically the idea of having mathemetical descriptions of value distributions is intellectually appealing, but I have never see actual query engines use it (because real data is never completely described by those theoretical distributions). Maybe I am missing something > I don't love the current reperesentation Vec<ColumnStatistics> for several reasons: I broadly agree with this concern > But that doesn't address (2). Maybe the statistics could include a `SchemaRef` pointer so projections could be converted back into names when needed > I tried to address (1) and (3) by changing the representation to Vec<Option<Arc<ColumnDistributionStatistics>> which makes it cheaper to initailize as empty and cheaper to project. I think that is basically the right internal representation, but in terms of software engineering, I personally recommend: Updating statistics initially to be a wrapper: ```rust /// Not sure of this name. I am open to new ones /// StatisticsV2 was proposed at some point /// but that seems to offer no additional context. pub struct RelationStatistics { inner: Vec<Option<Arc<ColumnStatistics>> } ``` And then there should be an easy way to convert between `Vec<ColumnStatistics>` to a RelationStatistics to ease migration Then we could change the signatgure of ExecutionPlan to retun an arc'd version ```rust trait ExecutionPlan { /// retun the statistics for this plan pub fn relation_statistics(&self) -> &Arc<RelationStatistics>; ... } ``` Then copying entire nodes would be fast (copy an Arc) as would projecting statistics 🤔 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
