berkaysynnada commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2007079688
########## datafusion/expr-common/src/statistics.rs: ########## @@ -203,6 +203,138 @@ impl Distribution { }; Ok(dt) } + + /// Merges two distributions into a single distribution that represents their combined statistics. + /// This creates a more general distribution that approximates the mixture of the input distributions. + /// + /// # Important Notes + /// + /// - The resulting mean, median, and variance are approximations of the mixture + /// distribution parameters. They are calculated using weighted averages based on + /// the input distributions. Users should not make definitive assumptions based on these values. + /// + /// - The range of the merged distribution is computed as the union of the input ranges + /// and its accuracy directly depends on the accuracy of the input ranges. + /// + /// - The result is always a [`Generic`] distribution, which may lose some specific + /// properties of the original distribution types. + /// + /// # Returns + /// + /// Returns a new [`Distribution`] that approximates the combined statistics of the + /// input distributions. + pub fn merge(&self, other: &Self) -> Result<Self> { + let range_a = self.range()?; + let range_b = other.range()?; + + // Determine data type and create combined range + let combined_range = range_a.union(&range_b)?; + + // Calculate weights for the mixture distribution + let (weight_a, weight_b) = match (range_a.cardinality(), range_b.cardinality()) { + (Some(ca), Some(cb)) => { + let total = (ca + cb) as f64; + ((ca as f64) / total, (cb as f64) / total) + } + _ => (0.5, 0.5), // Equal weights if cardinalities not available + }; + + // Get the original statistics + let mean_a = self.mean()?; + let mean_b = other.mean()?; + let median_a = self.median()?; + let median_b = other.median()?; + let var_a = self.variance()?; + let var_b = other.variance()?; + + // Always use Float64 for intermediate calculations to avoid truncation + // I assume that the target type is always numeric + // Todo: maybe we can keep all `ScalarValue` as `Float64` in `Distribution`? + let calc_type = DataType::Float64; Review Comment: Why float? decimals have higher precisions? We've thought on that a lot, and relaxing the datatype is not a good way during computations and representing intermediate or final results. Rather than presuming a target type, we need to rely on the data type of the original quantity and standard coercions of it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org