xudong963 commented on PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743665328

   > We can only merge two statistical objects in certain special 
circumstances. For example, if we have a statistical object that tracks sample 
averages along with counts, we can merge two instances of them. Our 
distributions are not merge-able quantities in this sense. They are _mixable_ 
(with a given weight), but not _merge-able_.
   
   I had confused `merge` with `mix`. After reviewing the material: "merge" 
suggests combining datasets while preserving their original properties, whereas 
what is implemented is closer to a weighted mixture of probability 
distributions. Do I understand correctly?
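
   To make the distinction concrete, here is a minimal sketch. The names 
`MeanWithCount` and `mix_means` are hypothetical and are not part of 
DataFusion. Merging needs no extra input because the counts determine the 
result, while mixing needs an explicit weight.

   ```rust
   /// Hypothetical summary that tracks a sample mean together with its count.
   #[derive(Debug, Clone, Copy, PartialEq)]
   pub struct MeanWithCount {
       pub mean: f64,
       pub count: u64,
   }

   impl MeanWithCount {
       /// Merge: the stored counts fully determine the combined mean.
       pub fn merge(&self, other: &Self) -> Self {
           let count = self.count + other.count;
           if count == 0 {
               return Self { mean: 0.0, count: 0 };
           }
           let sum = self.mean * self.count as f64 + other.mean * other.count as f64;
           Self { mean: sum / count as f64, count }
       }
   }

   /// Mix: mean of the mixture `weight_a * A + (1 - weight_a) * B`.
   /// The caller must supply the weight; returns `None` if it is outside [0, 1].
   pub fn mix_means(mean_a: f64, mean_b: f64, weight_a: f64) -> Option<f64> {
       if !(0.0..=1.0).contains(&weight_a) {
           return None;
       }
       Some(weight_a * mean_a + (1.0 - weight_a) * mean_b)
   }
   ```

   With counts 2 and 6, `merge` behaves like `mix_means` with weight 2/8; 
without counts, that weight has to come from somewhere else.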
   
   
   
   > One of the follow-ups we previously discussed was adding a 
`HistogramDistribution` object that tracks bins and ranges. These objects will 
be merge-able. Therefore, we should start off by adding a 
`HistogramDistribution` object first. Then, we can add a `merge` API to that 
object.
   
   Yes, I agree. `HistogramDistribution` is merge-able. Does it look like this?
   ```rust
   pub struct HistogramDistribution {
       bins: Vec<Interval>,     // The bin boundaries
       counts: Vec<u64>,        // Frequency in each bin
       total_count: u64,        // Sum of all bin counts
       range: Interval,         // Overall range covered by the histogram
   }
   ```
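
   A `merge` on such an object could be sketched roughly as follows. This is 
only an illustration: `SimpleHistogram` is a hypothetical stand-in that uses 
plain `f64` bin edges instead of `Interval`, and it assumes both histograms 
share identical bin edges (re-binning is out of scope here).

   ```rust
   /// Hypothetical, simplified histogram: `edges` holds n + 1 boundaries for n bins.
   #[derive(Debug, Clone, PartialEq)]
   pub struct SimpleHistogram {
       edges: Vec<f64>,
       counts: Vec<u64>,
   }

   impl SimpleHistogram {
       /// Returns `None` unless there is at least one bin, the edges are
       /// strictly increasing, and there is exactly one more edge than bins.
       pub fn new(edges: Vec<f64>, counts: Vec<u64>) -> Option<Self> {
           let increasing = edges.windows(2).all(|w| w[0] < w[1]);
           if counts.is_empty() || edges.len() != counts.len() + 1 || !increasing {
               return None;
           }
           Some(Self { edges, counts })
       }

       pub fn counts(&self) -> &[u64] {
           &self.counts
       }

       /// Derived on demand rather than stored, so it cannot drift out of sync.
       pub fn total_count(&self) -> u64 {
           self.counts.iter().sum()
       }

       /// Merge by adding per-bin counts; no mixing weight is required.
       /// Returns `None` when the bin edges differ.
       pub fn merge(&self, other: &Self) -> Option<Self> {
           if self.edges != other.edges {
               return None;
           }
           let counts = self
               .counts
               .iter()
               .zip(&other.counts)
               .map(|(a, b)| a + b)
               .collect();
           Some(Self { edges: self.edges.clone(), counts })
       }
   }
   ```

   One design choice in this sketch: `total_count` and the overall range are 
derivable from the bins, so storing them separately may be redundant.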
   
   
   
   > If you think we should have a `mix` API for the general `Distribution` 
object, we can add it too. Such an API will need to include a mixing weight in 
its signature.
   
   This is my use case: 
https://github.com/apache/datafusion/pull/13296/files#diff-8d786f45bc2d5bf629754a119ed6fa7998dcff7faacd954c45945b7047b87fa1R498,
 which merges the file statistics across the whole file group. I'm still 
considering whether a `mix` API can satisfy my requirement.
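
   One way it might work, as a rough sketch: if each file carries a row count, 
the mixing weight can be derived from the row counts while folding over the 
group. `Summary`, `mix`, and `mix_file_group` below are hypothetical names, not 
existing DataFusion APIs, and the summary only tracks mean and (population) 
variance for illustration.

   ```rust
   /// Hypothetical per-column summary: mean and population variance.
   #[derive(Debug, Clone, Copy, PartialEq)]
   pub struct Summary {
       pub mean: f64,
       pub variance: f64,
   }

   impl Summary {
       /// Moments of the mixture `weight * self + (1 - weight) * other`.
       /// Returns `None` if `weight` is outside [0, 1].
       pub fn mix(&self, other: &Self, weight: f64) -> Option<Self> {
           if !(0.0..=1.0).contains(&weight) {
               return None;
           }
           let w2 = 1.0 - weight;
           let mean = weight * self.mean + w2 * other.mean;
           // E[X^2] of the mixture, built from each component's E[X^2].
           let second_moment = weight * (self.variance + self.mean * self.mean)
               + w2 * (other.variance + other.mean * other.mean);
           Some(Self { mean, variance: second_moment - mean * mean })
       }
   }

   /// Folds over `(row_count, summary)` pairs, using accumulated row counts
   /// as mixing weights. Files with zero rows are skipped.
   pub fn mix_file_group(files: &[(u64, Summary)]) -> Option<Summary> {
       let mut acc: Option<(u64, Summary)> = None;
       for &(rows, summary) in files {
           if rows == 0 {
               continue;
           }
           acc = Some(match acc {
               None => (rows, summary),
               Some((acc_rows, acc_summary)) => {
                   let total = acc_rows + rows;
                   let weight = acc_rows as f64 / total as f64;
                   (total, acc_summary.mix(&summary, weight)?)
               }
           });
       }
       acc.map(|(_, s)| s)
   }
   ```

   This only works when every file reports an exact row count; with missing or 
inexact counts the weights are unknown, which is the part I'm unsure about.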
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

