Re: [PR] Statistics: Implement SampledDistribution variant to Distribution to … [datafusion]

via GitHub Wed, 02 Jul 2025 02:37:26 -0700


cj-zhukov commented on PR #16614:
URL: https://github.com/apache/datafusion/pull/16614#issuecomment-3027164610


   > Thanks @cj-zhukov !
   > 
   > I think it would be super helpful to make an example / test showing how to 
   > use this new distribution to estimate cardinality
   > 
   > For example perhaps you could set up a SampledDistribution like
   > 
   > ```
   > [0, 10]: 100 samples
   > [20,30]: 200 samples
   > ```
   > 
   > And then estimate the cardinality of a predicate like `x > 25`
   > 
   > I would expect the estimate to be 1/3 (half of the 20-30 bucket and none 
   > of the 0-10 bucket)
   
   Hi @alamb, I wanted to clarify one thing. I implemented general-purpose 
   methods like `mean()`, `median()`, and `variance()` for `SampledDistribution`, 
   similar to the other `Distribution` variants. These summarize the entire 
   distribution. To answer your question about estimating cardinality for 
   predicates like `x > 25`, I implemented a separate method, 
   `estimate_selectivity_gt()`, for that specific use case: it calculates how 
   many values match the condition based on the bin layout and counts. Let me 
   know whether you think the general-purpose methods should be reused here, or 
   whether you'd prefer to keep predicate-based estimation separate. Happy to 
   adjust based on your guidance.
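
   For reference, the bin-based estimate for the `[0, 10]` / `[20, 30]` example 
   can be sketched as below. This is only an illustration, not the code in this 
   PR: the `Bin` struct, its fields, and the exact signature of 
   `estimate_selectivity_gt` are hypothetical, and values are assumed to be 
   spread uniformly within each bin.

```rust
/// Hypothetical stand-in for one bucket of a sampled distribution.
/// Not the type used in the PR.
#[derive(Debug, Clone)]
struct Bin {
    lower: f64,
    upper: f64,
    count: u64,
}

/// Estimate the fraction of samples with value > `threshold`,
/// assuming values are uniformly distributed inside each bin.
/// Returns `None` when there are no samples at all.
fn estimate_selectivity_gt(bins: &[Bin], threshold: f64) -> Option<f64> {
    let total: u64 = bins.iter().map(|b| b.count).sum();
    if total == 0 {
        return None;
    }
    let matching: f64 = bins
        .iter()
        .map(|b| {
            // Fraction of this bin that lies above the threshold.
            let fraction = if threshold <= b.lower {
                1.0
            } else if threshold >= b.upper {
                0.0
            } else {
                (b.upper - threshold) / (b.upper - b.lower)
            };
            fraction * b.count as f64
        })
        .sum();
    Some(matching / total as f64)
}

fn main() {
    let bins = vec![
        Bin { lower: 0.0, upper: 10.0, count: 100 },
        Bin { lower: 20.0, upper: 30.0, count: 200 },
    ];
    // Half of the [20, 30] bin (100 samples) and none of the [0, 10] bin,
    // out of 300 samples in total.
    let sel = estimate_selectivity_gt(&bins, 25.0).unwrap();
    assert!((sel - 1.0 / 3.0).abs() < 1e-12);
    println!("selectivity = {sel:.4}");
}
```

   Multiplying the returned fraction by the input row count would give the 
   cardinality estimate for the predicate.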


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


