xudong963 commented on PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743831182

   > Attribute `total_count` is derivable from `counts`, so we may not want to
   > store it for normalization/consistency reasons. Same goes for `range`; it can
   > be constructed from `bins` in O(1) time.
   
   Yes, we don't need to store them.
   
   ```rust
   pub struct HistogramDistribution {
       bins: Vec<HistogramBin>, 
   }
   
   pub struct HistogramBin {
       upper: ScalarValue,
       count: u64,
       // Maybe other fields, such as ndv (number of distinct values)
   }
   ```
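
As a rough sketch of why neither value needs to be stored, both can be derived from the bins on demand. This is only an illustration under assumptions: `i64` stands in for `ScalarValue` so the snippet compiles on its own, the fields are public for brevity, and the `lower` field is hypothetical (not part of the proposal above), added because a bin that only records `upper` cannot give the lower end of the range.

```rust
/// Stand-in for the proposed bin; `i64` replaces `ScalarValue` here.
#[derive(Debug, Clone, PartialEq)]
pub struct HistogramBin {
    pub upper: i64,
    pub count: u64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct HistogramDistribution {
    /// Hypothetical: lower bound of the first bin.
    pub lower: i64,
    /// Assumed to be sorted by ascending `upper`.
    pub bins: Vec<HistogramBin>,
}

impl HistogramDistribution {
    /// Derived by summing the per-bin counts (linear in the number of bins).
    pub fn total_count(&self) -> u64 {
        self.bins.iter().map(|b| b.count).sum()
    }

    /// Derived in O(1) from the stored lower bound and the last bin's upper bound.
    /// Returns `None` when there are no bins.
    pub fn range(&self) -> Option<(i64, i64)> {
        self.bins.last().map(|b| (self.lower, b.upper))
    }
}
```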
   
   How do we plan to generate the `HistogramDistribution`?
   
   Let's assume we can get the exact min/max from the Parquet file:
   https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/file_format.rs#L827
   Should we generate the `HistogramDistribution` based on that min/max, or do we
   have alternative ways?
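
To make the min/max option concrete, here is one possible reading of it, not a settled design: split `[min, max]` into equal-width bins and spread the row count uniformly across them. The function name `equi_width_bins`, the `(upper, count)` tuple output, and the use of `i64` instead of `ScalarValue` are all assumptions made for this sketch.

```rust
/// Builds `num_bins` equal-width bins over `[min, max]`, assuming the values are
/// uniformly distributed. Returns `(upper_bound, estimated_count)` per bin.
pub fn equi_width_bins(min: i64, max: i64, row_count: u64, num_bins: u32) -> Vec<(i64, u64)> {
    assert!(num_bins > 0 && min <= max);
    // Work in i128 so that `max - min` and the products below cannot overflow.
    let n = num_bins as i128;
    let span = max as i128 - min as i128;
    let rows = row_count as i128;
    (0..n)
        .map(|i| {
            // Upper bound of bin i; the last bin ends exactly at `max`.
            let upper = min as i128 + span * (i + 1) / n;
            // Difference of cumulative counts, so the bins sum to `row_count`.
            let count = rows * (i + 1) / n - rows * i / n;
            (upper as i64, count as u64)
        })
        .collect()
}
```

With only min/max available, the uniformity assumption is the weak point: skewed columns would be misrepresented, which is part of why alternatives are worth asking about.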


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

