Re: [I] Introduce a way to represent constrained statistics / bounds on values in Statistics [datafusion]

via GitHub Fri, 10 Oct 2025 05:33:52 -0700


adriangb commented on issue #8078:
URL: https://github.com/apache/datafusion/issues/8078#issuecomment-3389845185


   One use case for `Distribution` that I wanted to explore, and that is 
compatible with Parquet, is what I'll call a "footer table sample". I don't 
remember where I first heard of it or what the proper name is, but I discussed 
it with Hannes of DuckDB and it sounds like a really cool idea. TL;DR: randomly 
sampling compressed columnar storage like Parquet is expensive, but if you 
store a pre-sampled portion of the file (e.g. as the last row group), you can 
get very good estimates for all kinds of things (filter selectivity, 
cardinality of any column, etc.), and fetching that data is very efficient 
IO-wise because it is all packed into one read unit. My thought is that 
something like this could be used to cheaply get *estimated* distributions and 
cardinality from the data.
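
   As a rough illustration of how such a sample could feed an estimate, here is 
a minimal sketch. It assumes the sampled rows for one column have already been 
decoded into a slice; `estimate_selectivity` and `estimate_matching_rows` are 
hypothetical names, not existing DataFusion or Parquet APIs.

```rust
/// Hypothetical helper: fraction of sampled values that satisfy `predicate`.
/// Returns `None` for an empty sample, since no estimate can be made.
fn estimate_selectivity<T>(sample: &[T], predicate: impl Fn(&T) -> bool) -> Option<f64> {
    if sample.is_empty() {
        return None;
    }
    let matching = sample.iter().filter(|v| predicate(*v)).count();
    Some(matching as f64 / sample.len() as f64)
}

/// Hypothetical helper: scale the sample selectivity up to the whole file.
/// The result is an estimate only, never a guarantee.
fn estimate_matching_rows<T>(
    sample: &[T],
    total_rows: u64,
    predicate: impl Fn(&T) -> bool,
) -> Option<u64> {
    estimate_selectivity(sample, predicate)
        .map(|selectivity| (selectivity * total_rows as f64).round() as u64)
}
```

   The quality of the estimate depends on how the sample was drawn, which 
this sketch does not address.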
   
   > I also feel that there's a slight conflict of interest or at least two 
camps here:
   > 
   > * statistics always-correct optimizers: Some people use statistics for 
optimizers like join ordering. There, wrong statistics often only result in 
slower execution, but never wrong results. That is kinda reflected in a lot of 
statistics calculation in the DF code base.
   > * correctness: Some plan transformers (InfluxData for example has one) 
rely on the statistics that actually can make hard promises, i.e. "all values 
are FOR SURE in this range". In that case, you really wanna be picky about what 
the stats do.
   
   I agree with this. My biggest issue with the current statistics is that we 
only have `Exact` and `Inexact`, and `Inexact` isn't really what you want for 
the second case you list. There you want something like `Bounded`.
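
   To make the distinction concrete, here is a sketch of what a `Bounded` 
variant could look like. `ValueEstimate` and `guaranteed_within` are 
hypothetical names for illustration, not DataFusion's existing types; the 
point is only that a bound can back a hard promise while an inexact estimate 
cannot.

```rust
/// Hypothetical statistic value with an explicit hard-bound variant.
#[derive(Debug, Clone, PartialEq)]
enum ValueEstimate<T> {
    /// The value is known exactly.
    Exact(T),
    /// The true value is guaranteed to lie in `[lower, upper]`.
    Bounded { lower: T, upper: T },
    /// A best-effort guess with no guarantee attached.
    Inexact(T),
    /// Nothing is known.
    Absent,
}

impl<T: PartialOrd> ValueEstimate<T> {
    /// Returns true only when the statistic proves the true value is in `[lo, hi]`.
    /// `Inexact` and `Absent` never prove anything, so they return false.
    fn guaranteed_within(&self, lo: &T, hi: &T) -> bool {
        match self {
            ValueEstimate::Exact(v) => lo <= v && v <= hi,
            ValueEstimate::Bounded { lower, upper } => lo <= lower && upper <= hi,
            ValueEstimate::Inexact(_) | ValueEstimate::Absent => false,
        }
    }
}
```

   A correctness-sensitive plan transformer would only act when 
`guaranteed_within` returns true, while a cost-based optimizer could still use 
the `Inexact` value as a hint.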
   
   I also think the current statistics are missing information such as the 
size of each column, which is much more useful than the total file size in 
almost every use case (most queries are not `select *`).
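
   For example, with per-column sizes a scan cost could be estimated from the 
projection alone. This is a sketch under assumptions: `FileSizeStats` and 
`projected_bytes` are hypothetical names, and the sizes are taken to be the 
stored (compressed) bytes per column.

```rust
use std::collections::HashMap;

/// Hypothetical per-file size statistics.
struct FileSizeStats {
    /// Total size of the file in bytes.
    total_bytes: u64,
    /// Stored size of each column in bytes, keyed by column name.
    column_bytes: HashMap<String, u64>,
}

/// Hypothetical helper: bytes that a scan of `projection` would need to read.
/// Returns `None` if the size of any projected column is unknown.
fn projected_bytes(stats: &FileSizeStats, projection: &[&str]) -> Option<u64> {
    projection
        .iter()
        .map(|name| stats.column_bytes.get(*name).copied())
        .sum()
}
```

   For a narrow projection the result can be far smaller than `total_bytes`, 
which is why the per-column figure is the more useful input.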


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

