asolimando commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2924005097
##########
datafusion/common/src/stats.rs:
##########
@@ -660,7 +637,14 @@ impl Statistics {
col_stats.max_value =
col_stats.max_value.max(&item_col_stats.max_value);
col_stats.min_value =
col_stats.min_value.min(&item_col_stats.min_value);
col_stats.sum_value =
col_stats.sum_value.add(&item_col_stats.sum_value);
- col_stats.distinct_count = Precision::Absent;
+ // Use max as a conservative lower bound for distinct count
+ // (can't accurately merge NDV since duplicates may exist across
partitions)
Review Comment:
Addressed in 72d3be325. Thanks for the heads up on #20846!
I have finally replaced `max(left, right)` with the overlap-based formula
from that PR, now extracted into a shared `estimate_ndv_with_overlap` in
`stats.rs` used by both `try_merge` and Union, as agreed with @xudong963.
I have added tests covering disjoint, identical, partial overlap, missing
min/max, non-numeric, and constant column cases. No Union tests changed, as
expected from a pure extraction/refactoring.
Is this matching your expectations, @gene-bordegaray?
Apologies again for the force push but building on #20846 seemed really
worth it, apart from SHAs, the only changes are the two topmost, addressing
pending points.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]