berkaysynnada commented on code in PR #15296:
URL: https://github.com/apache/datafusion/pull/15296#discussion_r2007079688


##########
datafusion/expr-common/src/statistics.rs:
##########
@@ -203,6 +203,138 @@ impl Distribution {
         };
         Ok(dt)
     }
+
+    /// Merges two distributions into a single distribution that represents 
their combined statistics.
+    /// This creates a more general distribution that approximates the mixture 
of the input distributions.
+    ///
+    /// # Important Notes
+    ///
+    /// - The resulting mean, median, and variance are approximations of the 
mixture
+    ///   distribution parameters. They are calculated using weighted averages 
based on
+    ///   the input distributions. Users should not make definitive 
assumptions based on these values.
+    ///
+    /// - The range of the merged distribution is computed as the union of the 
input ranges
+    ///   and its accuracy directly depends on the accuracy of the input 
ranges.
+    ///
+    /// - The result is always a [`Generic`] distribution, which may lose some 
specific
+    ///   properties of the original distribution types.
+    ///
+    /// # Returns
+    ///
+    /// Returns a new [`Distribution`] that approximates the combined 
statistics of the
+    /// input distributions.
+    pub fn merge(&self, other: &Self) -> Result<Self> {
+        let range_a = self.range()?;
+        let range_b = other.range()?;
+
+        // Determine data type and create combined range
+        let combined_range = range_a.union(&range_b)?;
+
+        // Calculate weights for the mixture distribution
+        let (weight_a, weight_b) = match (range_a.cardinality(), 
range_b.cardinality()) {
+            (Some(ca), Some(cb)) => {
+                let total = (ca + cb) as f64;
+                ((ca as f64) / total, (cb as f64) / total)
+            }
+            _ => (0.5, 0.5), // Equal weights if cardinalities not available
+        };
+
+        // Get the original statistics
+        let mean_a = self.mean()?;
+        let mean_b = other.mean()?;
+        let median_a = self.median()?;
+        let median_b = other.median()?;
+        let var_a = self.variance()?;
+        let var_b = other.variance()?;
+
+        // Always use Float64 for intermediate calculations to avoid truncation
+        // I assume that the target type is always numeric
+        // Todo: maybe we can keep all `ScalarValue` as `Float64` in 
`Distribution`?
+        let calc_type = DataType::Float64;

Review Comment:
   Why float? decimals have higher precisions? We've thought on that a lot, and 
relaxing the datatype is not a good way during computations and representing 
intermediate or final results. Rather than presuming a target type, we need to 
rely on the data type of the original quantity and standard coercions of it. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Reply via email to