alamb commented on code in PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002037023


##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -857,6 +857,143 @@ pub fn compute_variance(
     ScalarValue::try_from(target_type)
 }
 
+/// Merges two distributions into a single distribution that represents their combined statistics.
+/// This creates a more general distribution that approximates the mixture of the input distributions.
+pub fn merge_distributions(a: &Distribution, b: &Distribution) -> Result<Distribution> {
+    let range_a = a.range()?;
+    let range_b = b.range()?;
+
+    // Determine data type and create combined range
+    let combined_range = if range_a.is_unbounded() || range_b.is_unbounded() {

Review Comment:
   I think we could use `Interval::union` here: 
https://docs.rs/datafusion/latest/datafusion/logical_expr/interval_arithmetic/struct.Interval.html#method.union
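
To make the suggestion concrete, here is a minimal stand-alone sketch of what a union of two ranges computes. `Bounds` is a hypothetical stand-in type invented for this example, not DataFusion's `Interval`, and modelling an unbounded endpoint as `None` is an assumption of the sketch:

```rust
/// Hypothetical stand-in for an interval over `i64`.
/// `None` means that endpoint is unbounded. This is NOT DataFusion's
/// `Interval`; it only illustrates the union semantics.
#[derive(Debug, Clone, PartialEq)]
struct Bounds {
    lower: Option<i64>,
    upper: Option<i64>,
}

impl Bounds {
    /// Smallest single interval that contains both `self` and `other`.
    fn union(&self, other: &Bounds) -> Bounds {
        // If either lower endpoint is unbounded, the result's lower endpoint is too.
        let lower = match (self.lower, other.lower) {
            (Some(a), Some(b)) => Some(a.min(b)),
            _ => None,
        };
        // Same reasoning for the upper endpoint.
        let upper = match (self.upper, other.upper) {
            (Some(a), Some(b)) => Some(a.max(b)),
            _ => None,
        };
        Bounds { lower, upper }
    }
}
```

In this sketch, the union of `[0, 10]` and `[5, unbounded)` is `[0, unbounded)`: each endpoint is handled on its own, so a bounded lower endpoint survives even when one upper endpoint is unbounded.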



##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -857,6 +857,143 @@ pub fn compute_variance(
     ScalarValue::try_from(target_type)
 }
 
+/// Merges two distributions into a single distribution that represents their combined statistics.
+/// This creates a more general distribution that approximates the mixture of the input distributions.
+pub fn merge_distributions(a: &Distribution, b: &Distribution) -> Result<Distribution> {

Review Comment:
   I wonder if this should be a method on `Distribution` rather than a free 
function 🤔 



##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -857,6 +857,143 @@ pub fn compute_variance(
     ScalarValue::try_from(target_type)
 }
 
+/// Merges two distributions into a single distribution that represents their combined statistics.
+/// This creates a more general distribution that approximates the mixture of the input distributions.

Review Comment:
   I think it would help to explain in comments what assumptions can be made 
about the combined distribution.
   
   For example, is it guaranteed that the `range` is conservative (that is, is it 
known that no values lie outside the range)?
   
   Though now that I ask, it seems like maybe we need to clarify whether the range 
of `GenericDistribution` is conservative 🤔 
   
   
https://github.com/apache/datafusion/blob/8a2e83eb74f89a4e3387817943749f3894e7141a/datafusion/expr-common/src/statistics.rs#L273-L272



##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -857,6 +857,143 @@ pub fn compute_variance(
     ScalarValue::try_from(target_type)
 }
 
+/// Merges two distributions into a single distribution that represents their combined statistics.
+/// This creates a more general distribution that approximates the mixture of the input distributions.
+pub fn merge_distributions(a: &Distribution, b: &Distribution) -> Result<Distribution> {
+    let range_a = a.range()?;
+    let range_b = b.range()?;
+
+    // Determine data type and create combined range
+    let combined_range = if range_a.is_unbounded() || range_b.is_unbounded() {
+        Interval::make_unbounded(&range_a.data_type())?
+    } else {
+        // Take the widest possible range conservatively
+        let lower_a = range_a.lower();
+        let lower_b = range_b.lower();
+        let upper_a = range_a.upper();
+        let upper_b = range_b.upper();
+
+        let combined_lower = if lower_a.lt(lower_b) {
+            lower_a.clone()
+        } else {
+            lower_b.clone()
+        };
+
+        let combined_upper = if upper_a.gt(upper_b) {
+            upper_a.clone()
+        } else {
+            upper_b.clone()
+        };
+
+        Interval::try_new(combined_lower, combined_upper)?
+    };
+
+    // Calculate weights for the mixture distribution

Review Comment:
   What does "mixture distribution" mean in this context?
   
   It seems like this code weights the input distributions by their number of 
distinct values (cardinality), which does not seem right. For example, if we have 
two inputs:
   1. 1M rows, 3 distinct values
   2. 10 rows, 10 distinct values
   
   I think this code is going to assume the mean is close to that of the second 
input, even though it has only 10 values.
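
A rough stand-alone sketch of the concern follows. `InputStats`, its fields, and the helper functions are hypothetical names made up for this example and do not mirror DataFusion's `Distribution`. The sketch compares weighting the input means by distinct count with weighting them by row count:

```rust
/// Hypothetical per-input summary; the fields are assumptions for illustration.
struct InputStats {
    mean: f64,
    row_count: u64,
    distinct_count: u64,
}

/// Weighted average of the input means, with per-input weights given by `weight`.
/// Assumes the weights do not sum to zero.
fn weighted_mean(inputs: &[InputStats], weight: fn(&InputStats) -> f64) -> f64 {
    let total: f64 = inputs.iter().map(weight).sum();
    inputs.iter().map(|s| s.mean * weight(s)).sum::<f64>() / total
}

/// Weight each input by its number of distinct values.
fn by_distinct(s: &InputStats) -> f64 {
    s.distinct_count as f64
}

/// Weight each input by its number of rows.
fn by_rows(s: &InputStats) -> f64 {
    s.row_count as f64
}
```

With made-up means of 2.0 for the first input and 100.0 for the second, `weighted_mean(&inputs, by_distinct)` lands near the second input's mean (1006 / 13), while `weighted_mean(&inputs, by_rows)` stays very close to 2.0, which is where nearly all of the rows are.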



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

