xudong963 opened a new issue, #15265: URL: https://github.com/apache/datafusion/issues/15265
### Background

I've been exploring statistics collection in DataFusion, particularly for Parquet, in the `infer_stats` method of `datafusion/datasource-parquet/src/file_format.rs`. DataFusion collects statistics such as:

- Row counts
- Null counts
- Min/max values
- Total byte size

However, there doesn't appear to be any logic for computing **NDV (Number of Distinct Values)**. The `distinct_count` field is explicitly set to `Precision::Absent`.

### Is there existing NDV computation?

1. Is there another mechanism in DataFusion for computing NDV that I've missed?
2. Are there plans to implement NDV computation in the future?

### Impact on Query Optimization

Without NDV statistics, the query optimizer may struggle to choose an optimal join order, especially for queries with multiple joins. In traditional optimizers, NDV is crucial for estimating join cardinalities and selecting the best join ordering.

If NDV computation isn't currently available, how do we ensure accurate join ordering in TPC-H queries? Are there alternative statistics or hints we're using?
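
To make the join-cardinality concern concrete, here is a minimal sketch of the textbook equi-join estimate that relies on NDV. The `ColumnStats` struct and `estimate_join_rows` function are hypothetical names used for illustration only and are not DataFusion's actual types or API. The row counts are illustrative round numbers.

```rust
/// Hypothetical per-column statistics; NOT DataFusion's actual types.
#[derive(Debug, Clone, Copy)]
struct ColumnStats {
    num_rows: u64,
    /// `None` models an absent distinct count.
    distinct_count: Option<u64>,
}

/// Textbook equi-join estimate: |L| * |R| / max(ndv(L.key), ndv(R.key)).
/// Falls back to the cross-product upper bound when either NDV is missing.
fn estimate_join_rows(left: ColumnStats, right: ColumnStats) -> u64 {
    let cross = left.num_rows.saturating_mul(right.num_rows);
    match (left.distinct_count, right.distinct_count) {
        (Some(l), Some(r)) if l.max(r) > 0 => cross / l.max(r),
        _ => cross,
    }
}

fn main() {
    let orders = ColumnStats { num_rows: 1_500_000, distinct_count: Some(1_500_000) };
    let lineitem = ColumnStats { num_rows: 6_000_000, distinct_count: Some(1_500_000) };
    println!("with NDV:    {}", estimate_join_rows(orders, lineitem));

    // Same join key, but the distinct count is unavailable.
    let lineitem_no_ndv = ColumnStats { distinct_count: None, ..lineitem };
    println!("without NDV: {}", estimate_join_rows(orders, lineitem_no_ndv));
}
```

In this sketch, a missing NDV leaves only the cross-product bound, which is several orders of magnitude larger than the NDV-based estimate. That gap is what makes comparing candidate join orders unreliable.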