xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743831182
> Attribute `total_count` is derivable from `counts`, so we may not want to store it for normalization/consistency reasons. Same goes for `range`, it can be constructed from `bins` in O(1) time.

Yes, we don't need to store them.

```rust
pub struct HistogramDistribution {
    bins: Vec<HistogramBin>,
}

pub struct HistogramBin {
    upper: ScalarValue,
    count: u64,
    // Maybe other fields, such as ndv (number of distinct values)
}
```

How do we plan to generate the `HistogramDistribution`? Let's assume we can get the exact min/max from the parquet file: https://github.com/apache/datafusion/blob/main/datafusion/datasource-parquet/src/file_format.rs#L827. Should we generate the `HistogramDistribution` based on the min/max, or do we have alternative ways?
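To make the two points above concrete, here is a minimal sketch of deriving the totals instead of storing them, plus one possible way to build a histogram from only min/max. It rests on several assumptions that are not part of the PR: `upper` is an `f64` rather than a `ScalarValue` so the snippet compiles without external crates, the accessor names `total_count` and `max_upper` are hypothetical, and `from_min_max` assumes rows are spread uniformly across equal-width bins, which is just one candidate answer to the open question.

```rust
/// One histogram bin. `upper` is `f64` here only to keep the sketch
/// dependency-free; the proposal above uses `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
pub struct HistogramBin {
    pub upper: f64,
    pub count: u64,
}

/// Stores only the bins; everything else is derived on demand.
#[derive(Debug, Clone, PartialEq)]
pub struct HistogramDistribution {
    bins: Vec<HistogramBin>,
}

impl HistogramDistribution {
    pub fn new(bins: Vec<HistogramBin>) -> Self {
        Self { bins }
    }

    pub fn bins(&self) -> &[HistogramBin] {
        &self.bins
    }

    /// Derived from the per-bin counts (linear in the number of bins).
    pub fn total_count(&self) -> u64 {
        self.bins.iter().map(|b| b.count).sum()
    }

    /// Upper end of the covered range, read from the last bin in O(1).
    /// With only `upper` stored, the lower end of the first bin is not
    /// recoverable; that would need an extra field or a sentinel bin.
    pub fn max_upper(&self) -> Option<f64> {
        self.bins.last().map(|b| b.upper)
    }

    /// Hypothetical constructor: equal-width bins between `min` and `max`,
    /// with `num_rows` spread as evenly as possible (uniformity assumption).
    pub fn from_min_max(min: f64, max: f64, num_rows: u64, num_bins: usize) -> Option<Self> {
        if num_bins == 0 || min.is_nan() || max.is_nan() || min > max {
            return None;
        }
        let width = (max - min) / num_bins as f64;
        let base = num_rows / num_bins as u64;
        let remainder = num_rows % num_bins as u64;
        let bins = (0..num_bins)
            .map(|i| {
                // Pin the last bound to `max` to avoid floating-point drift.
                let upper = if i + 1 == num_bins {
                    max
                } else {
                    min + width * (i as f64 + 1.0)
                };
                // The first `remainder` bins absorb the leftover rows.
                let count = base + if (i as u64) < remainder { 1 } else { 0 };
                HistogramBin { upper, count }
            })
            .collect();
        Some(Self { bins })
    }
}
```

For example, `HistogramDistribution::from_min_max(0.0, 100.0, 10, 4)` would yield four bins with upper bounds 25, 50, 75 and 100. A histogram built this way carries no more information than the min/max and row count it started from, so it mainly serves as a fallback when no finer-grained statistics are available.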