ozankabak commented on PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2744132824
The most likely way we will end up with `HistogramDistribution`s is via sampling. We can also leverage statistics in file metadata if a file format stores this information. AFAICT Parquet doesn't store histogram information.

If your use case is specific to Parquet files and you can't do sampling, what we can do is add an optional `num_samples` field to `GenericDistribution`. This way, you can `merge` two `GenericDistribution` objects **if** both have a value for `num_samples`. In such a scenario, you can update the `mean`, `variance` and `range` fields with [appropriate formulas](https://math.stackexchange.com/questions/2971315/how-do-i-combine-standard-deviations-of-two-groups) and sum the `num_samples` fields. The `median` value will always be set to `None`, since it is not mergeable.

In an expression tree, the `num_samples` information of child `GenericDistribution`s will combine multiplicatively (due to the independence assumption). When a `GenericDistribution` combines with another kind of `Distribution`, the information will be lost and set to `None` for the resulting `GenericDistribution`. For example, the resulting `GenericDistribution` for `2 * x` will preserve `num_samples` (i.e. keep that of `x`), but for `x + y` it will be `num_samples_x * num_samples_y`.
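To make the proposed `merge` concrete, here is a minimal sketch of how it could look. It is not DataFusion's actual API: `GenericSummary`, `merge` and `combine_num_samples` are hypothetical names, plain `f64` values stand in for whatever scalar types the real `GenericDistribution` uses, and `variance` is assumed to be the population variance (a sample variance would need the corresponding `n - 1` adjustments).

```rust
/// Hypothetical, simplified stand-in for a `GenericDistribution` carrying the
/// proposed optional `num_samples` field. Not the real DataFusion type.
#[derive(Debug, Clone, PartialEq)]
struct GenericSummary {
    mean: f64,
    /// Assumed to be the population variance.
    variance: f64,
    median: Option<f64>,
    /// (lower bound, upper bound)
    range: (f64, f64),
    num_samples: Option<u64>,
}

/// Merges two summaries describing disjoint sets of samples of the same quantity.
/// Returns `None` unless both sides know their `num_samples`.
fn merge(a: &GenericSummary, b: &GenericSummary) -> Option<GenericSummary> {
    let (n_a, n_b) = (a.num_samples?, b.num_samples?);
    let n = n_a.checked_add(n_b)?;
    if n == 0 {
        return None;
    }
    // Weights of each group in the combined sample.
    let (w_a, w_b) = (n_a as f64 / n as f64, n_b as f64 / n as f64);
    let mean = w_a * a.mean + w_b * b.mean;
    // Combined variance = weighted within-group variance
    // plus the weighted squared distance of each group mean from the combined mean.
    let variance = w_a * (a.variance + (a.mean - mean).powi(2))
        + w_b * (b.variance + (b.mean - mean).powi(2));
    Some(GenericSummary {
        mean,
        variance,
        // The median cannot be recovered from the two summaries.
        median: None,
        // The merged range is the hull of both ranges.
        range: (a.range.0.min(b.range.0), a.range.1.max(b.range.1)),
        num_samples: Some(n),
    })
}

/// Sample count for a binary expression such as `x + y` over two generic
/// operands: multiplicative under the independence assumption, and `None`
/// if either side is unknown (or the product overflows).
fn combine_num_samples(x: Option<u64>, y: Option<u64>) -> Option<u64> {
    x?.checked_mul(y?)
}
```

As a sanity check, merging the summaries of the samples {1, 2, 3} (mean 2, variance 2/3) and {4, 6} (mean 5, variance 1) yields mean 3.2, variance 2.96, range (1, 6) and `num_samples = 5`, which matches computing the statistics directly over {1, 2, 3, 4, 6}.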