cj-zhukov commented on PR #16614: URL: https://github.com/apache/datafusion/pull/16614#issuecomment-3027164610
> Thanks @cj-zhukov !
>
> I think it would be super helpful to make an example / test showing how to use this new distribution to estimate cardinality
>
> For example perhaps you could set up a SampledDistribution like
>
> ```
> [0, 10]: 100 samples
> [20,30]: 200 samples
> ```
>
> And then estimate the cardinality of a predicate like `x > 25`
>
> I would expect the estimate to be 1/3 (half of the 20-30 bucket and none of the 0-10 bucket)

Hi @alamb, I wanted to clarify one thing.

I implemented general-purpose methods like `mean()`, `median()`, and `variance()` for `SampledDistribution`, similar to the other `Distribution` variants. These are designed to summarize the entire distribution.

To answer your question about estimating cardinality for predicates like `x > 25`: I implemented a separate method, `estimate_selectivity_gt()`, specifically for that use case. It estimates the fraction of values that match the condition from the bin layout and counts.

Let me know if you think the general-purpose methods should be reused here, or if you'd prefer to keep predicate-based estimation separate. Happy to adjust based on your guidance.
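
For illustration, the bin-based estimate described above might look roughly like the sketch below. This is not the code from the PR: the `Bin` struct, its field names, and the free-function signature are assumptions made for the example, and it further assumes that values are spread uniformly within each bin.

```rust
/// Hypothetical histogram bin: `count` samples falling in the range [lo, hi].
#[derive(Debug, Clone, Copy)]
struct Bin {
    lo: f64,
    hi: f64,
    count: u64,
}

/// Estimates the fraction of samples satisfying `x > threshold`.
/// Assumes a uniform spread of values inside each bin (an assumption of
/// this sketch, not something confirmed by the PR).
fn estimate_selectivity_gt(bins: &[Bin], threshold: f64) -> f64 {
    let total: u64 = bins.iter().map(|b| b.count).sum();
    if total == 0 {
        // No samples: nothing can match.
        return 0.0;
    }

    let matching: f64 = bins
        .iter()
        .map(|b| {
            // Fraction of this bin that lies above the threshold.
            let fraction = if threshold <= b.lo {
                1.0 // whole bin is above the threshold
            } else if threshold >= b.hi {
                0.0 // whole bin is at or below the threshold
            } else {
                (b.hi - threshold) / (b.hi - b.lo) // partial overlap
            };
            fraction * b.count as f64
        })
        .sum();

    matching / total as f64
}
```

With the bins from the quoted example (`[0, 10]` holding 100 samples and `[20, 30]` holding 200), calling `estimate_selectivity_gt(&bins, 25.0)` counts half of the second bin and none of the first, i.e. 100 of 300 samples, which matches the expected 1/3.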