xudong963 opened a new issue, #15265:
URL: https://github.com/apache/datafusion/issues/15265
   ### Background
   I've been exploring statistics collection in DataFusion, particularly for 
Parquet, in the `infer_stats` method in 
`datafusion/datasource-parquet/src/file_format.rs`. DataFusion collects 
statistics such as:
   
   - Row counts
   - Null counts
   - Min/max values
   - Total byte size
   
   There doesn't appear to be any logic for computing **NDV (Number of Distinct 
Values)**, though. The `distinct_count` field is explicitly set to 
`Precision::Absent`.
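
To make the gap concrete, here is a minimal sketch of what an exact NDV computation over a single column could look like. It is not DataFusion code: the `Precision` enum below is a local stand-in that only mirrors the shape of DataFusion's type, and `exact_distinct_count` is a hypothetical helper. A real implementation would more likely read NDV from file metadata or use a sketch such as HyperLogLog rather than scan every value.

```rust
use std::collections::HashSet;

/// Local stand-in mirroring the shape of DataFusion's `Precision` type.
/// Defined here only so the example is self-contained.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical helper: count distinct non-null values in a column.
/// Nulls are skipped, as is conventional for NDV.
fn exact_distinct_count(values: &[Option<i64>]) -> Precision<usize> {
    let distinct: HashSet<i64> = values.iter().flatten().copied().collect();
    Precision::Exact(distinct.len())
}
```

With a helper like this, the `distinct_count` field could in principle carry `Precision::Exact(n)` instead of `Precision::Absent`, at the cost of scanning the column.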
   
   ### Is there existing NDV computation?
   1. Is there another mechanism in DataFusion for computing NDV that I've 
missed?
   2. Are there plans to implement NDV computation in the future?
   
   ### Impact on Query Optimization
   Without NDV statistics, the query optimizer may struggle to choose good 
join orders, especially for queries with multiple joins. In traditional 
cost-based optimizers, for example, NDV is crucial for estimating join 
cardinalities and selecting the best join order. If NDV computation isn't 
currently available, how do we ensure accurate join ordering in TPC-H queries? 
Are there alternative statistics or hints that we use instead?
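
For context on why NDV matters here, below is a minimal sketch of the textbook equi-join cardinality estimate, `|R| * |S| / max(NDV(R.k), NDV(S.k))`. It assumes uniformly distributed keys and containment of the smaller key set in the larger one. `estimate_join_rows` is a hypothetical function written for illustration, not part of DataFusion.

```rust
/// Hypothetical helper: estimate output rows of an equi-join on one key
/// using the textbook formula |R| * |S| / max(ndv_r, ndv_s).
/// Returns `None` when either NDV is unknown, which is the situation
/// described above when `distinct_count` is absent.
fn estimate_join_rows(
    left_rows: u64,
    right_rows: u64,
    left_ndv: Option<u64>,
    right_ndv: Option<u64>,
) -> Option<u64> {
    let max_ndv = left_ndv?.max(right_ndv?);
    if max_ndv == 0 {
        // No distinct key values on one side means no matches.
        return Some(0);
    }
    // Widen to u128 so the product cannot overflow.
    let product = (left_rows as u128) * (right_rows as u128);
    let estimate = product / (max_ndv as u128);
    // Saturate if the estimate does not fit in u64.
    Some(u64::try_from(estimate).unwrap_or(u64::MAX))
}
```

With illustrative inputs of 1,500,000 and 6,000,000 rows and 1,500,000 distinct keys on both sides, the estimate is 6,000,000 rows. With either NDV missing, the function has nothing to return, so an optimizer has to fall back on something else, such as row counts alone.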


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


