ozankabak commented on PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2744132824

   We will most likely end up with `HistogramDistribution`s via sampling. 
We can also leverage statistics in file metadata if a file format stores 
this information. AFAICT Parquet doesn't store histogram information.
   
   If your use case is specific to Parquet files and you can't do sampling, 
what we can do is add an optional `num_samples` field to 
`GenericDistribution`. This way, you can `merge` two `GenericDistribution` 
objects **if** both have a value for `num_samples`. In such a scenario, you can 
update the `mean`, `variance` and `range` fields with [appropriate 
formulas](https://math.stackexchange.com/questions/2971315/how-do-i-combine-standard-deviations-of-two-groups)
 and sum the two `num_samples` values. The `median` value will always be set 
to `None`, since it is not mergeable.
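
   As a rough sketch of such a merge (not DataFusion's actual API): the 
`SummaryStats` struct and the `merge` function below are hypothetical 
stand-ins for `GenericDistribution`, and the code assumes `variance` is a 
population variance and that both inputs summarize disjoint sample sets.

```rust
/// Hypothetical stand-in for the summary statistics discussed above;
/// this is NOT DataFusion's actual `GenericDistribution` type.
#[derive(Debug, Clone, PartialEq)]
struct SummaryStats {
    mean: f64,
    /// Assumed to be the population variance (divide by n, not n - 1).
    variance: f64,
    /// Inclusive (min, max) bounds.
    range: (f64, f64),
    median: Option<f64>,
    num_samples: Option<u64>,
}

/// Merges two summaries; returns `None` unless both carry `num_samples`.
fn merge(a: &SummaryStats, b: &SummaryStats) -> Option<SummaryStats> {
    let (na, nb) = (a.num_samples?, b.num_samples?);
    let n = na.checked_add(nb)?;
    if n == 0 {
        return None; // nothing to weight by
    }
    let (wa, wb) = (na as f64 / n as f64, nb as f64 / n as f64);

    // The combined mean is the sample-count-weighted average of the means.
    let mean = wa * a.mean + wb * b.mean;

    // Combined variance: weighted within-group variance plus the squared
    // offset of each group mean from the combined mean.
    let variance = wa * (a.variance + (a.mean - mean).powi(2))
        + wb * (b.variance + (b.mean - mean).powi(2));

    Some(SummaryStats {
        mean,
        variance,
        // The merged range is the union of the two ranges.
        range: (a.range.0.min(b.range.0), a.range.1.max(b.range.1)),
        // A median cannot be recovered from two medians.
        median: None,
        // Sample counts add up when merging.
        num_samples: Some(n),
    })
}
```

   For instance, merging the summaries of `{0, 2}` and `{4, 6}` (each with 
mean offsets of 1 and variance 1) gives the statistics of `{0, 2, 4, 6}`: 
mean 3, variance 5, range `(0, 6)` and `num_samples` 4.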
   
   In an expression tree, the `num_samples` values of child 
`GenericDistribution`s will combine multiplicatively (due to the independence 
assumption). When a `GenericDistribution` combines with another `Distribution`, 
this information will be lost and `num_samples` will be set to `None` in the 
resulting `GenericDistribution`.
   
   For example, the resulting `GenericDistribution` for `2 * x` will preserve 
the `num_samples` of `x`, whereas for `x + y` the resulting `num_samples` will 
be `num_samples_x * num_samples_y`.
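
   The propagation rule above could look roughly like this. Both function 
names are hypothetical, a missing count (`None`) models an operand that 
carries no `num_samples`, and returning `None` on overflow is an assumption 
of this sketch rather than something decided in the discussion.

```rust
/// Hypothetical helper: an expression over a single distribution and a
/// constant (e.g. `2 * x`) keeps the child's sample count unchanged.
fn num_samples_unary(child: Option<u64>) -> Option<u64> {
    child
}

/// Hypothetical helper: an expression over two independent distributions
/// (e.g. `x + y`) multiplies the sample counts. If either side has no
/// count, or the product overflows `u64`, the result is `None`.
fn num_samples_binary(left: Option<u64>, right: Option<u64>) -> Option<u64> {
    left?.checked_mul(right?)
}
```

   With `num_samples_x = 100` and `num_samples_y = 50`, `2 * x` keeps 100 
while `x + y` ends up with 5000.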


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

