Re: [PR] feat: support merge for `Distribution` [datafusion]

via GitHub Wed, 19 Mar 2025 02:10:44 -0700


kosiew commented on code in PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002828639



##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -203,6 +203,121 @@ impl Distribution {
         };
         Ok(dt)
     }
+
+    /// Merges two distributions into a single distribution that represents their combined statistics.
+    /// This creates a more general distribution that approximates the mixture of the input distributions.
+    pub fn merge(&self, other: &Self) -> Result<Self> {
+        let range_a = self.range()?;
+        let range_b = other.range()?;
+
+        // Determine data type and create combined range
+        let combined_range = range_a.union(&range_b)?;
+
+        // Calculate weights for the mixture distribution
+        let (weight_a, weight_b) = match (range_a.cardinality(), range_b.cardinality()) {
+            (Some(ca), Some(cb)) => {
+                let total = (ca + cb) as f64;
+                ((ca as f64) / total, (cb as f64) / total)
+            }
+            _ => (0.5, 0.5), // Equal weights if cardinalities not available
+        };
+
+        // Get the original statistics
+        let mean_a = self.mean()?;
+        let mean_b = other.mean()?;
+        let median_a = self.median()?;
+        let median_b = other.median()?;
+        let var_a = self.variance()?;
+        let var_b = other.variance()?;
+
+        // Always use Float64 for intermediate calculations to avoid truncation
+        // I assume that the target type is always numeric
+        // Todo: maybe we can keep all `ScalarValue` as `Float64` in `Distribution`?
+        let calc_type = DataType::Float64;
+
+        // Create weight scalars using Float64 to avoid truncation
+        let weight_a_scalar = ScalarValue::from(weight_a);
+        let weight_b_scalar = ScalarValue::from(weight_b);
+
+        // Calculate combined mean
+        let combined_mean = if mean_a.is_null() || mean_b.is_null() {
+            if mean_a.is_null() {
+                mean_b.clone()
+            } else {
+                mean_a.clone()
+            }
+        } else {
+            // Cast to Float64 for calculation
+            let mean_a_f64 = mean_a.cast_to(&calc_type)?;
+            let mean_b_f64 = mean_b.cast_to(&calc_type)?;
+
+            // Calculate weighted mean
+            mean_a_f64
+                .mul_checked(&weight_a_scalar)?
+                .add_checked(&mean_b_f64.mul_checked(&weight_b_scalar)?)?
+        };
+
+        // Calculate combined median
+        let combined_median = if median_a.is_null() || median_b.is_null() {
+            if median_a.is_null() {
+                median_b
+            } else {
+                median_a
+            }
+        } else {
+            // Cast to Float64 for calculation
+            let median_a_f64 = median_a.cast_to(&calc_type)?;
+            let median_b_f64 = median_b.cast_to(&calc_type)?;
+
+            // Calculate weighted median
+            median_a_f64
+                .mul_checked(&weight_a_scalar)?
+                .add_checked(&median_b_f64.mul_checked(&weight_b_scalar)?)?

Review Comment:
   Without access to the full data, there isn’t a universally “better” method than the weighted-average approach you adopted here.
   The key, as you mentioned, is to document these assumptions clearly, so that downstream users of the code understand that the computed median is an approximation and may not capture the true central tendency when the underlying distributions differ significantly.
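
   To make the caveat concrete, here is a minimal standalone sketch in plain Rust. It does not use DataFusion's `Distribution` or `ScalarValue`; the helper names (`weighted_median_estimate`, `exact_median`, `compare_medians`) are hypothetical, and it weights by sample count as a stand-in for the range cardinality used in the diff, which is an assumption made only for illustration.

```rust
/// Weighted average of two medians, mirroring the approximation in the diff.
/// (Hypothetical helper; not part of DataFusion.)
fn weighted_median_estimate(median_a: f64, median_b: f64, weight_a: f64, weight_b: f64) -> f64 {
    median_a * weight_a + median_b * weight_b
}

/// Exact median of a non-empty sample, computed by sorting a copy.
fn exact_median(values: &[f64]) -> f64 {
    assert!(!values.is_empty(), "median of an empty sample is undefined");
    let mut sorted = values.to_vec();
    sorted.sort_by(|x, y| x.total_cmp(y));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 0 {
        // Even length: average the two middle elements.
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    }
}

/// Returns `(estimate, exact)` for two samples.
/// Assumption: weights are proportional to sample size, standing in for
/// the range cardinalities used in the PR.
fn compare_medians(a: &[f64], b: &[f64]) -> (f64, f64) {
    let total = (a.len() + b.len()) as f64;
    let weight_a = a.len() as f64 / total;
    let weight_b = b.len() as f64 / total;
    let estimate =
        weighted_median_estimate(exact_median(a), exact_median(b), weight_a, weight_b);
    // Exact median of the pooled data, which `merge` cannot see.
    let combined: Vec<f64> = a.iter().chain(b.iter()).copied().collect();
    (estimate, exact_median(&combined))
}
```

   With `a = [1, 2, 3, 4, 5]` and `b = [100, 200, 300]`, `compare_medians` yields an estimate of 76.875, while the exact median of the eight pooled values is 4.5. When the two inputs overlap heavily, the two numbers land much closer together, which is the case the doc comment should call out as the intended use.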



