xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743665328
> We can only merge two statistical objects in certain special circumstances. For example, if we have a statistical object that tracks sample averages along with counts, we can merge two instances of them. Our distributions are not merge-able quantities in this sense. They are _mixable_ (with a given weight), but not _merge-able_.

I had confused `merge` and `mix`. After reviewing the information, my understanding is that "merge" suggests combining datasets that maintain their original properties, whereas what's implemented is actually closer to a weighted mixture of probability distributions. Do I understand correctly?

> One of the follow-ups we previously discussed was adding a `HistogramDistribution` object that tracks bins and ranges. These objects will be merge-able. Therefore, we should start off by adding a `HistogramDistribution` object first. Then, we can add a `merge` API to that object.

Yes, I agree. `HistogramDistribution` is merge-able. Would it look like this?

```rust
pub struct HistogramDistribution {
    bins: Vec<Interval>, // The bin boundaries
    counts: Vec<u64>,    // Frequency in each bin
    total_count: u64,    // Sum of all bin counts
    range: Interval,     // Overall range covered by the histogram
}
```

> If you think we should have a `mix` API for the general `Distribution` object, we can add it too. Such an API will need to include a mixing weight in its signature.

This is my use case: https://github.com/apache/datafusion/pull/13296/files#diff-8d786f45bc2d5bf629754a119ed6fa7998dcff7faacd954c45945b7047b87fa1R498, which merges the file statistics across the whole file group. I'm still thinking about whether a `mix` API can satisfy my requirement.
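To make the merge/mix distinction concrete, here is a minimal sketch. It is only an illustration under stated assumptions: it uses plain `f64` bin edges instead of DataFusion's `Interval`, and the names `SimpleHistogram`, `merge`, and `mix_mean` are hypothetical, not existing DataFusion APIs.

```rust
/// Simplified stand-in for the proposed histogram. Bin edges are plain `f64`
/// values here (an assumption made so the sketch is self-contained).
#[derive(Debug, Clone, PartialEq)]
pub struct SimpleHistogram {
    edges: Vec<f64>,  // n + 1 sorted boundaries describing n bins
    counts: Vec<u64>, // frequency in each bin
}

impl SimpleHistogram {
    /// Returns `None` unless there is exactly one more edge than bins.
    pub fn new(edges: Vec<f64>, counts: Vec<u64>) -> Option<Self> {
        (edges.len() == counts.len() + 1).then(|| Self { edges, counts })
    }

    /// Total number of observations, derived from the per-bin counts.
    pub fn total_count(&self) -> u64 {
        self.counts.iter().sum()
    }

    /// Merge: exact, because the raw counts are retained.
    /// Returns `None` when the two histograms do not share bin edges.
    pub fn merge(&self, other: &Self) -> Option<Self> {
        if self.edges != other.edges {
            return None;
        }
        let counts = self
            .counts
            .iter()
            .zip(&other.counts)
            .map(|(a, b)| a + b)
            .collect();
        Some(Self { edges: self.edges.clone(), counts })
    }
}

/// Mix: the mean of a two-component mixture. Without counts, the caller
/// must supply the weight of the first component explicitly.
pub fn mix_mean(mean_a: f64, mean_b: f64, weight_a: f64) -> Option<f64> {
    if !(0.0..=1.0).contains(&weight_a) {
        return None;
    }
    Some(weight_a * mean_a + (1.0 - weight_a) * mean_b)
}
```

In this sketch, `merge` needs no extra input because the counts carry the relative sizes of the two inputs, while `mix_mean` cannot be computed without a weight. For the file-group use case, one plausible choice (an assumption, not something settled in this thread) would be to derive the weight from per-file row counts, for example `rows_a / (rows_a + rows_b)`.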
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org