asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2910772163
##########
datafusion/datasource-parquet/src/metadata.rs:
##########
@@ -541,6 +541,36 @@ fn summarize_min_max_null_counts(
)
.map(|(idx, _)| idx);
+ // Extract distinct counts from row group column statistics
+ accumulators.distinct_counts_array[logical_schema_index] =
+ if let Some(parquet_idx) = parquet_index {
+ let distinct_counts: Vec<u64> = row_groups_metadata
+ .iter()
+ .filter_map(|rg| {
+ rg.columns()
+ .get(parquet_idx)
+ .and_then(|col| col.statistics())
+ .and_then(|stats| stats.distinct_count_opt())
+ })
+ .collect();
+
+ if distinct_counts.is_empty() {
+ Precision::Absent
+ } else if distinct_counts.len() == 1 {
Review Comment:
Good point. As a first step I agree on a threshold + configuration for this.
That said, since DataFusion is a framework, how to propagate statistics is
really dependent on the system, data, etc. I'm working towards a fully
customizable solution for statistics propagation, not just at the expression
level (as shown with ExpressionAnalyzer in this PR), but also at the operator
level.
The WIP branch is here:
https://github.com/asolimando/datafusion/tree/asolimando/statistics-planner-prototype
- it introduces a `StatisticsProvider`/`StatisticsRegistry`
chain-of-responsibility pattern for operator-level stats, similar in spirit to
`ExpressionAnalyzer` but for `ExecutionPlan` nodes. I will open PRs for it, but
any feedback on the general approach is welcome. For instance, the shortcoming
you highlighted for `ExpressionAnalyzer` (hard-coded delegation instead of
going through the chain) does not affect the operator-level design, but there
might be other limitations I'm not seeing.
A few things are still pending as I'd like to make use of
https://github.com/apache/datafusion/issues/20184 (unmerged) and
https://github.com/apache/datafusion/pull/20570 (that just got merged).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]