asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2924005097


##########
datafusion/common/src/stats.rs:
##########
@@ -660,7 +637,14 @@ impl Statistics {
             col_stats.max_value = 
col_stats.max_value.max(&item_col_stats.max_value);
             col_stats.min_value = 
col_stats.min_value.min(&item_col_stats.min_value);
             col_stats.sum_value = 
col_stats.sum_value.add(&item_col_stats.sum_value);
-            col_stats.distinct_count = Precision::Absent;
+            // Use max as a conservative lower bound for distinct count
+            // (can't accurately merge NDV since duplicates may exist across 
partitions)

Review Comment:
   Addressed in 72d3be325. Thanks for the heads up on #20846!
   
   I have finally replaced `max(left, right)` with the overlap-based formula 
from that PR, now extracted into a shared `estimate_ndv_with_overlap` in 
`stats.rs` used by both `try_merge` and Union, as agreed with @xudong963.
   
   I have added tests covering disjoint, identical, partial overlap, missing 
min/max, non-numeric, and constant column cases. No Union tests changed, as 
expected from a pure extraction/refactoring.
   
   Is this matching your expectations, @gene-bordegaray?
   
   Apologies again for the force push but building on #20846 seemed really 
worth it, apart from SHAs, the only changes are the two topmost, addressing 
pending points.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to