gene-bordegaray commented on code in PR #19957:
URL: https://github.com/apache/datafusion/pull/19957#discussion_r2936704265


##########
datafusion/common/src/stats.rs:
##########
@@ -672,6 +684,96 @@ impl Statistics {
     }
 }
 
+/// Estimates the combined number of distinct values (NDV) when merging two
+/// column statistics, using range overlap to avoid double-counting shared values.
+///
+/// Assumes values are distributed uniformly within each input's
+/// `[min, max]` range (the standard assumption when only summary
+/// statistics are available). Under uniformity the fraction of an input's
+/// distinct values that land in a sub-range equals the fraction of
+/// the range that sub-range covers.
+///
+/// The combined value space is split into three disjoint regions:
+///
+/// ```text
+///   |-- only A --|-- overlap --|-- only B --|
+/// ```
+///
+/// * **Only in A/B** - values outside the other input's range
+///   contribute `(1 - overlap_a) * NDV_a` and `(1 - overlap_b) * NDV_b`.
+/// * **Overlap** - both inputs may produce values here. We take
+///   `max(overlap_a * NDV_a, overlap_b * NDV_b)` rather than the
+///   sum because values in the same sub-range are likely shared
+///   (the smaller set is assumed to be a subset of the larger).
+///
+/// The formula ranges between `[max(NDV_a, NDV_b), NDV_a + NDV_b]`,
+/// from full overlap to no overlap.
+///
+/// ```text
+/// NDV = max(overlap_a * NDV_a, overlap_b * NDV_b)   [intersection]
+///     + (1 - overlap_a) * NDV_a                      [only in A]
+///     + (1 - overlap_b) * NDV_b                      [only in B]
+/// ```
+///
+/// Returns `None` when min/max are absent or distance is unsupported
+/// (e.g. strings), in which case the caller should fall back to a simpler
+/// estimate.
+pub fn estimate_ndv_with_overlap(

Review Comment:
   Now that this has moved out of union.rs, I wonder whether it should be
generalized to all operators.
   
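   Say that rather than merging two columns we merge three: A, B, and C. The
result then depends on the merge order, because the column stats are smeared
after the first merge: the merged column keeps only a combined [min, max] and
NDV, and the next merge assumes those values are spread uniformly over the
wider range.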
   
   Example:
   ```
   - A = [0,100], NDV=80
   - B = [50,150], NDV=60
   - C = [100,200], NDV=50
   
   Scenarios:
   - (A+B)+C = 135
   - A+(B+C) = 137
   ```
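   
   To make the order dependence concrete, here is a minimal sketch of the 
formula from the doc comment applied to the example above. `Col`, `merge`, and 
`example_orders` are hypothetical stand-ins for illustration, not the PR's 
actual types or API, and the sketch assumes numeric, non-degenerate ranges. 
The intermediate results are A+B = 110 over [0,150] and B+C = 85 over [50,200].
   
   ```rust
   /// Summary statistics for one column: closed range plus distinct count.
   /// Hypothetical stand-in for illustration, not DataFusion's `ColumnStatistics`.
   #[derive(Clone, Copy, Debug)]
   struct Col {
       min: f64,
       max: f64,
       ndv: f64,
   }

   /// Applies the overlap formula from the doc comment and widens the range.
   /// Assumes non-degenerate numeric ranges (max > min) for both inputs.
   fn merge(a: Col, b: Col) -> Col {
       // Width of the shared part of the two ranges (0 if they are disjoint).
       let overlap = (a.max.min(b.max) - a.min.max(b.min)).max(0.0);
       // Fraction of each input's range that falls inside the overlap.
       let frac_a = overlap / (a.max - a.min);
       let frac_b = overlap / (b.max - b.min);
       let ndv = (frac_a * a.ndv).max(frac_b * b.ndv) // intersection
           + (1.0 - frac_a) * a.ndv                   // only in A
           + (1.0 - frac_b) * b.ndv;                  // only in B
       // The merged column only remembers the widened range and the total NDV.
       Col { min: a.min.min(b.min), max: a.max.max(b.max), ndv }
   }

   /// Returns the NDV estimates for (A+B)+C and A+(B+C) from the example.
   fn example_orders() -> (f64, f64) {
       let a = Col { min: 0.0, max: 100.0, ndv: 80.0 };
       let b = Col { min: 50.0, max: 150.0, ndv: 60.0 };
       let c = Col { min: 100.0, max: 200.0, ndv: 50.0 };
       (merge(merge(a, b), c).ndv, merge(a, merge(b, c)).ndv)
   }
   ```
   
   Calling `example_orders()` gives roughly 135 and 136.67, which match the 
rounded figures above. The gap comes from the second merge treating the 
already-merged column as uniform over its widened range.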
   
   I know this problem already existed while the function lived in union.rs, 
but there it was more tightly scoped.
   
   Could we make this limitation explicit in the documentation, or plan 
follow-up work to handle merging?
   
   I would think we want to preserve the shape of the distinct-value 
distribution in some way after merging. That might require more plumbing, so 
we could come back to it later.
   
   Let me know your thoughts.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

