asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2910772163


##########
datafusion/datasource-parquet/src/metadata.rs:
##########
@@ -541,6 +541,36 @@ fn summarize_min_max_null_counts(
     )
     .map(|(idx, _)| idx);
 
+    // Extract distinct counts from row group column statistics
+    accumulators.distinct_counts_array[logical_schema_index] =
+        if let Some(parquet_idx) = parquet_index {
+            let distinct_counts: Vec<u64> = row_groups_metadata
+                .iter()
+                .filter_map(|rg| {
+                    rg.columns()
+                        .get(parquet_idx)
+                        .and_then(|col| col.statistics())
+                        .and_then(|stats| stats.distinct_count_opt())
+                })
+                .collect();
+
+            if distinct_counts.is_empty() {
+                Precision::Absent
+            } else if distinct_counts.len() == 1 {

Review Comment:
   Good point. As a first step I agree on a threshold + configuration for this.
   
   That said, since DataFusion is a framework, how to propagate statistics is 
really dependent on the system, data, etc. I'm working towards a fully 
customizable solution for statistics propagation, not just at the expression 
level (as shown with ExpressionAnalyzer in this PR), but also at the operator 
level.
   
   The WIP branch is here:
     
https://github.com/asolimando/datafusion/tree/asolimando/statistics-planner-prototype
 - it introduces a `StatisticsProvider`/`StatisticsRegistry` 
chain-of-responsibility pattern for operator-level stats, similar in spirit to 
`ExpressionAnalyzer` but for `ExecutionPlan` nodes. I will open PRs for it, but 
any feedback on the general approach is welcome. For instance, the shortcoming 
you highlighted for `ExpressionAnalyzer` (hard-coded delegation instead of 
going through the chain) does not affect the operator-level design, but there 
might be other limitations I'm not seeing.
     
   A few things are still pending as I'd like to make use of 
https://github.com/apache/datafusion/issues/20184 (unmerged) and 
https://github.com/apache/datafusion/pull/20570 (that just got merged).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to