Re: [PR] feat: support merge for `Distribution` [datafusion]

via GitHub Fri, 21 Mar 2025 11:24:15 -0700


ozankabak commented on PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2744132824

The most likely way we will end up with `HistogramDistribution`s will be via
sampling. We can also leverage statistics in file metadata if a file format
stores this information. AFAICT Parquet doesn't store histogram information.

If your use case is specific to Parquet files and you can't do sampling,
what we can do is add an optional `num_samples` field to
`GenericDistribution`. This way, you can `merge` two `GenericDistribution`
objects **if** both have a value for `num_samples`. In such a scenario, you can
update the `mean`, `variance` and `range` fields with [appropriate
formulas](https://math.stackexchange.com/questions/2971315/how-do-i-combine-standard-deviations-of-two-groups)
and sum the two `num_samples` values. The `median` value will always be set to
`None`, since it is not mergeable.
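
To make the merge rule concrete, here is a minimal sketch. The `Summary` struct and `merge` function are hypothetical stand-ins, not the actual `GenericDistribution` API. The sketch assumes plain `f64` values, a population (not sample) variance, and that the merged `range` is the union of the two input ranges.

```rust
/// Hypothetical stand-in for `GenericDistribution`; field types are simplified.
#[derive(Debug, Clone, PartialEq)]
struct Summary {
    mean: f64,
    /// Population variance (an assumption of this sketch).
    variance: f64,
    /// Inclusive (lower, upper) bounds.
    range: (f64, f64),
    median: Option<f64>,
    num_samples: Option<u64>,
}

/// Merges two summaries; returns `None` unless both carry `num_samples`.
fn merge(a: &Summary, b: &Summary) -> Option<Summary> {
    let (na, nb) = (a.num_samples?, b.num_samples?);
    let n = na.checked_add(nb)?;
    if n == 0 {
        return None; // nothing to weight by
    }
    // Weights proportional to the sample counts.
    let (wa, wb) = (na as f64 / n as f64, nb as f64 / n as f64);
    let mean = wa * a.mean + wb * b.mean;
    // Each group's variance plus the squared shift of its mean from the
    // combined mean, weighted by the group's share of the samples.
    let variance = wa * (a.variance + (a.mean - mean).powi(2))
        + wb * (b.variance + (b.mean - mean).powi(2));
    Some(Summary {
        mean,
        variance,
        range: (a.range.0.min(b.range.0), a.range.1.max(b.range.1)),
        median: None, // the median cannot be recovered from these statistics
        num_samples: Some(n),
    })
}
```

If either input lacks `num_samples`, the sketch refuses to merge, which matches the **if** condition above.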

In an expression tree, the `num_samples` values of child
`GenericDistribution`s will combine multiplicatively (due to the independence
assumption). When a `GenericDistribution` combines with another kind of
`Distribution`, the information will be lost and `num_samples` will be set to
`None` for the resulting `GenericDistribution`.

For example, the resulting `GenericDistribution` for `2 * x` will preserve
the `num_samples` of `x`, but for `x + y` it will be
`num_samples_x * num_samples_y`.
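
The propagation rule can be sketched over a toy expression tree. The `Expr` enum and `num_samples` function below are hypothetical and only model the rule described here. A leaf holding `None` stands for an operand with no sample count (for instance, a different kind of `Distribution`). Returning `None` on multiplication overflow is a choice made for this sketch.

```rust
/// Hypothetical toy expression tree; not DataFusion's expression types.
enum Expr {
    /// A leaf with an optional sample count.
    Column(Option<u64>),
    /// Multiplication by a constant, e.g. `2 * x`.
    Scale(f64, Box<Expr>),
    /// A binary combination of two independent operands, e.g. `x + y`.
    Add(Box<Expr>, Box<Expr>),
}

/// Propagates `num_samples` bottom-up through the tree.
fn num_samples(expr: &Expr) -> Option<u64> {
    match expr {
        Expr::Column(n) => *n,
        // Scaling by a constant keeps the child's count.
        Expr::Scale(_, inner) => num_samples(inner),
        // Independent operands multiply; a missing count on either side
        // (or an overflow) yields `None`.
        Expr::Add(lhs, rhs) => num_samples(lhs)?.checked_mul(num_samples(rhs)?),
    }
}
```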

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
