Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-04-05 Thread via GitHub
berkaysynnada commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2007079688 ## datafusion/expr-common/src/statistics.rs: ## @@ -203,6 +203,138 @@ impl Distribution { }; Ok(dt) } + +/// Merges two distributi

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-04-04 Thread via GitHub
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743831182 > Attribute `total_count` is derivable from `counts`, so we may not want to store it for normalization/consistency reasons. Same goes for `range`, it can be constructed from `bins` in
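
The quoted remark about derivable attributes can be illustrated with a minimal sketch. The `Histogram` type below is hypothetical and is not the struct from the PR; it assumes `bins` holds ascending bin edges and `counts` holds one entry per bin. Under those assumptions neither `total_count` nor `range` needs to be stored, since both can be computed on demand.

```rust
/// Hypothetical histogram for illustration only (not DataFusion's type).
/// Assumption: `bins` holds ascending bin edges, so it has `counts.len() + 1` entries.
#[derive(Debug, Clone, PartialEq)]
struct Histogram {
    bins: Vec<f64>,
    counts: Vec<u64>,
}

impl Histogram {
    /// Derived from `counts` instead of being stored as a separate field.
    fn total_count(&self) -> u64 {
        self.counts.iter().sum()
    }

    /// Derived from the first and last bin edges; `None` when there are no edges.
    fn range(&self) -> Option<(f64, f64)> {
        Some((*self.bins.first()?, *self.bins.last()?))
    }
}

fn main() {
    let h = Histogram {
        bins: vec![0.0, 1.0, 2.0],
        counts: vec![3, 5],
    };
    println!("total_count = {}, range = {:?}", h.total_count(), h.range());
}
```

Computing these values on demand means they cannot drift out of sync with `counts` and `bins`, which is the consistency concern raised in the quote.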

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-04-04 Thread via GitHub
ozankabak commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743175661 This API, as it currently stands, does not seem to make sense. It seems to make the assumption that outcomes (i.e. individual items in the range) of the `Distribution`s are equally

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-24 Thread via GitHub
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2742833876 FYI @berkaysynnada @ozankabak -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the s

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-22 Thread via GitHub
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743593315 > Do you know any use cases where this method would be especially useful? If so, maybe we can study one of those cases in more detail. That could help us understand the real need a

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-22 Thread via GitHub
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2745345742 Thanks for your suggestions!! @alamb @ozankabak @berkaysynnada and @kosiew I'll continue to do such work after the `Migrate to Distribution from Precision` work is done. I t

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-22 Thread via GitHub
xudong963 closed pull request #15296: feat: support merge for `Distribution` URL: https://github.com/apache/datafusion/pull/15296

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-21 Thread via GitHub
ozankabak commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2744132824 The most likely way we will end up with `HistogramDistribution`s will be via sampling. We can also leverage statistics in file metadata if a file format stores this information. AF

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-21 Thread via GitHub
ozankabak commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743724228 > I confused merge and mix. After reviewing the information: "Merge" suggests combining datasets that maintain their original properties, but what's implemented is actually clo

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-21 Thread via GitHub
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2743665328 > We can only merge two statistical objects in certain special circumstances. For example, if we have a statistical object that tracks sample averages along with counts, we can mer
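
The special case mentioned in the quote, a statistical object that tracks a sample average along with a count, can be sketched as follows. `MeanWithCount` is a hypothetical type introduced here for illustration and does not appear in the PR; the sketch only shows why carrying the count makes the merge exact.

```rust
/// Hypothetical summary for illustration only: a sample mean plus the
/// number of samples it was computed from.
#[derive(Debug, Clone, Copy, PartialEq)]
struct MeanWithCount {
    mean: f64,
    count: u64,
}

impl MeanWithCount {
    /// Returns the summary of the pooled sample: the means are weighted by
    /// their counts, which is only possible because the counts are tracked.
    fn merge(self, other: Self) -> Self {
        let count = self.count + other.count;
        if count == 0 {
            // Two empty summaries: there is no mean to weight, so return an empty summary.
            return Self { mean: 0.0, count: 0 };
        }
        let weighted_sum = self.mean * self.count as f64 + other.mean * other.count as f64;
        Self {
            mean: weighted_sum / count as f64,
            count,
        }
    }
}

fn main() {
    let a = MeanWithCount { mean: 2.0, count: 3 };
    let b = MeanWithCount { mean: 6.0, count: 1 };
    // Pooled mean is (2.0 * 3 + 6.0 * 1) / 4 = 3.0 over 4 samples.
    println!("{:?}", a.merge(b));
}
```

Without the counts, the two means could not be weighted correctly, which is why such a merge is only well defined in special circumstances like this one.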

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-21 Thread via GitHub
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2007748334 ## datafusion/expr-common/src/statistics.rs: ## @@ -203,6 +203,138 @@ impl Distribution { }; Ok(dt) } + +/// Merges two distributions

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-21 Thread via GitHub
berkaysynnada commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2007079688 ## datafusion/expr-common/src/statistics.rs: ## @@ -203,6 +203,138 @@ impl Distribution { }; Ok(dt) } + +/// Merges two distributi

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-19 Thread via GitHub
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2004911809 ## datafusion/expr-common/src/statistics.rs: ## @@ -857,6 +857,143 @@ pub fn compute_variance( ScalarValue::try_from(target_type) } +/// Merges two distr

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-19 Thread via GitHub
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002959262 ## datafusion/expr-common/src/statistics.rs: ## @@ -203,6 +203,121 @@ impl Distribution { }; Ok(dt) } + +/// Merges two distributions

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-19 Thread via GitHub
kosiew commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002828639 ## datafusion/expr-common/src/statistics.rs: ## @@ -203,6 +203,121 @@ impl Distribution { }; Ok(dt) } + +/// Merges two distributions int

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002526236 ## datafusion/expr-common/src/statistics.rs: ## @@ -203,6 +203,121 @@ impl Distribution { }; Ok(dt) } + +/// Merges two distributions

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002503571 ## datafusion/expr-common/src/statistics.rs: ## @@ -203,6 +203,121 @@ impl Distribution { }; Ok(dt) } + +/// Merges two distributions

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
kosiew commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002421418 ## datafusion/expr-common/src/statistics.rs: ## @@ -203,6 +203,121 @@ impl Distribution { }; Ok(dt) } + +/// Merges two distributions int

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
xudong963 commented on PR #15296: URL: https://github.com/apache/datafusion/pull/15296#issuecomment-2735210417 > I think eventually it would be nice to add some tests for this code Yes, as the ticket description said: I'll do it after we are consistent.

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002299377 ## datafusion/expr-common/src/statistics.rs: ## @@ -857,6 +857,143 @@ pub fn compute_variance( ScalarValue::try_from(target_type) } +/// Merges two distr

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002335941 ## datafusion/expr-common/src/statistics.rs: ## @@ -857,6 +857,143 @@ pub fn compute_variance( ScalarValue::try_from(target_type) } +/// Merges two distr

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002309255 ## datafusion/expr-common/src/statistics.rs: ## @@ -857,6 +857,143 @@ pub fn compute_variance( ScalarValue::try_from(target_type) } +/// Merges two distr

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
xudong963 commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002296888 ## datafusion/expr-common/src/statistics.rs: ## @@ -857,6 +857,143 @@ pub fn compute_variance( ScalarValue::try_from(target_type) } +/// Merges two distr

Re: [PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
alamb commented on code in PR #15296: URL: https://github.com/apache/datafusion/pull/15296#discussion_r2002037023 ## datafusion/expr-common/src/statistics.rs: ## @@ -857,6 +857,143 @@ pub fn compute_variance( ScalarValue::try_from(target_type) } +/// Merges two distribut

[PR] feat: support merge for `Distribution` [datafusion]

2025-03-18 Thread via GitHub
xudong963 opened a new pull request, #15296: URL: https://github.com/apache/datafusion/pull/15296 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/15290 ## Rationale for this change See issue #15290 ## What change