xudong963 commented on code in PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002335941


##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -857,6 +857,143 @@ pub fn compute_variance(
     ScalarValue::try_from(target_type)
 }
 
+/// Merges two distributions into a single distribution that represents their combined statistics.
+/// This creates a more general distribution that approximates the mixture of the input distributions.
+pub fn merge_distributions(a: &Distribution, b: &Distribution) -> Result<Distribution> {
+    let range_a = a.range()?;
+    let range_b = b.range()?;
+
+    // Determine data type and create combined range
+    let combined_range = if range_a.is_unbounded() || range_b.is_unbounded() {
+        Interval::make_unbounded(&range_a.data_type())?
+    } else {
+        // Take the widest possible range conservatively
+        let lower_a = range_a.lower();
+        let lower_b = range_b.lower();
+        let upper_a = range_a.upper();
+        let upper_b = range_b.upper();
+
+        let combined_lower = if lower_a.lt(lower_b) {
+            lower_a.clone()
+        } else {
+            lower_b.clone()
+        };
+
+        let combined_upper = if upper_a.gt(upper_b) {
+            upper_a.clone()
+        } else {
+            upper_b.clone()
+        };
+
+        Interval::try_new(combined_lower, combined_upper)?
+    };
+
+    // Calculate weights for the mixture distribution

Review Comment:
   Your point is correct.
   
   IMO, the best way to compute the weights is from the count of each 
interval, but those counts are unknown here.
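   
   For reference, a minimal sketch of what count-based weighting would look 
like if the counts were available. `count_weights` and `mixture_mean` are 
hypothetical helpers for illustration only, not part of DataFusion:
   
   ```rust
   /// Hypothetical helper: mixture weights proportional to sample counts.
   /// Falls back to equal weights when both counts are zero.
   fn count_weights(count_a: u64, count_b: u64) -> (f64, f64) {
       let total = count_a as f64 + count_b as f64;
       if total == 0.0 {
           return (0.5, 0.5);
       }
       (count_a as f64 / total, count_b as f64 / total)
   }
   
   /// Mean of a two-component mixture under the given weights.
   fn mixture_mean(mean_a: f64, mean_b: f64, weights: (f64, f64)) -> f64 {
       weights.0 * mean_a + weights.1 * mean_b
   }
   ```
   
   With counts 300 and 100 the weights are 0.75 and 0.25, so means of 10 and 
20 combine to 12.5.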
   
   After thinking it over, I have a new idea: maybe we can use the variance to 
approximate the weight, on the assumption that **lower variance generally 
indicates more samples**:
   
   ```rust
   let (weight_a, weight_b) = {
       // Lower variance generally indicates more samples
       let var_a = a.variance()?.cast_to(&DataType::Float64)?;
       let var_b = b.variance()?.cast_to(&DataType::Float64)?;
       
       match (var_a, var_b) {
           (ScalarValue::Float64(Some(va)), ScalarValue::Float64(Some(vb))) => {
               // Weight inversely by variance (with safeguards against division by zero)
               let va_safe = va.max(f64::EPSILON);
               let vb_safe = vb.max(f64::EPSILON);
               let wa = 1.0 / va_safe;
               let wb = 1.0 / vb_safe;
               let total = wa + wb;
               (wa / total, wb / total)
           }
           _ => (0.5, 0.5)  // Fall back to equal weights
       }
   };
   ```
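   
   To make the effect of the weighting easier to see, here is a stand-alone 
sketch of the same idea on plain `f64` values, without `ScalarValue`. The 
function name `inverse_variance_weights` is hypothetical, and the extra 
finiteness check is my own assumption (it avoids a `0.0 / 0.0` when both 
variances are infinite); it is not part of the snippet above:
   
   ```rust
   /// Hypothetical stand-alone version of the weighting on plain `f64`s.
   /// `None` models a variance that is unavailable or not convertible.
   fn inverse_variance_weights(var_a: Option<f64>, var_b: Option<f64>) -> (f64, f64) {
       match (var_a, var_b) {
           // Assumption: only finite variances are weighted; anything else
           // falls through to the equal-weight arm below.
           (Some(va), Some(vb)) if va.is_finite() && vb.is_finite() => {
               // Clamp to EPSILON so a zero variance cannot divide by zero.
               let wa = 1.0 / va.max(f64::EPSILON);
               let wb = 1.0 / vb.max(f64::EPSILON);
               let total = wa + wb;
               (wa / total, wb / total)
           }
           // Fall back to equal weights.
           _ => (0.5, 0.5),
       }
   }
   ```
   
   For variances 1.0 and 4.0 this gives weights of 0.8 and 0.2. Note that a 
zero variance (for example, a single-point distribution) takes almost all of 
the weight, which may or may not be what we want.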
   
   


