[PR] Fix intermittent SQL logic test failure in limit.slt by adding ORDER BY clause [datafusion]

2025-06-05 Thread via GitHub
kosiew opened a new pull request, #16257: URL: https://github.com/apache/datafusion/pull/16257 ## Which issue does this PR close? - Closes #16180. ## Rationale for this change Intermittent test failures were observed in the `limit.slt` SQL logic test file on the `main` b

Re: [I] Release DataFusion `48.0.0` (June 2025) [datafusion]

2025-06-05 Thread via GitHub
xudong963 commented on issue #15771: URL: https://github.com/apache/datafusion/issues/15771#issuecomment-2943445579 Test with datafusion-materialized-views ✅ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abov

Re: [I] June 2025 ASF Board Report [datafusion]

2025-06-05 Thread via GitHub
alamb commented on issue #15182: URL: https://github.com/apache/datafusion/issues/15182#issuecomment-2943751488 FYI @andygrove / @mbutrovich in case you would like to add anything for comet

Re: [I] June 2025 ASF Board Report [datafusion]

2025-06-05 Thread via GitHub
alamb commented on issue #15182: URL: https://github.com/apache/datafusion/issues/15182#issuecomment-2943751090 Fyi @iffyio in case you would like to add anything for sqlparser

Re: [I] June 2025 ASF Board Report [datafusion]

2025-06-05 Thread via GitHub
alamb commented on issue #15182: URL: https://github.com/apache/datafusion/issues/15182#issuecomment-2943751927 FYI @milenkovicm in case you would like to add anything for Ballista

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2943759107 > @ozankabak I still don't think you need all this API work since there's a zero API change way to deal with cancellation already. Tests all pass with no API changes in the 'all St

Re: [I] [EPIC] Complete Span (source location) information / feature [datafusion-sqlparser-rs]

2025-06-05 Thread via GitHub
agis commented on issue #1548: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1548#issuecomment-2943739678 https://github.com/apache/datafusion-sqlparser-rs/issues/1858 is also related to this

Re: [PR] feat: Support defining custom MetricValues in PhysicalPlans [datafusion]

2025-06-05 Thread via GitHub
sfluor commented on PR #16195: URL: https://github.com/apache/datafusion/pull/16195#issuecomment-2943143826 There was one remaining test failing, I fixed it and updated the PR to the latest branch. Should it target `main` or the `47` branch ?

Re: [I] [EPIC] Complete Span (source location) information / feature [datafusion-sqlparser-rs]

2025-06-05 Thread via GitHub
eliaperantoni commented on issue #1548: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1548#issuecomment-2943240795 I noticed that `CASE .. END` expressions don't have correct spans because the locations of the heading and trailing keywords are not considered. ie. the follow

Re: [PR] feat(small): Add `BaselineMetrics` to `generate_series()` table function [datafusion]

2025-06-05 Thread via GitHub
2010YOUY01 commented on PR #16255: URL: https://github.com/apache/datafusion/pull/16255#issuecomment-2943542788 > LGTM Thank you @2010YOUY01 , i have minor question: > > Do we test before, does those BaselineMetrics will affect performance when large data set? Thanks for the re

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2943553646 @zhuqi-lucas FYI, I've reworked the 'infinite stream' to 'range stream' in my tests. It simply emits a `Range` now. I've added an additional evil test case for sort-merge join where

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
zhuqi-lucas commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2128523292 ## datafusion/sqllogictest/test_files/joins.slt: ## @@ -4702,7 +4702,8 @@ physical_plan 01)CrossJoinExec 02)--DataSourceExec: partitions=1, partition_sizes=[

[PR] Fix distinct count for DictionaryArray to correctly account for nulls in values array [datafusion]

2025-06-05 Thread via GitHub
kosiew opened a new pull request, #16258: URL: https://github.com/apache/datafusion/pull/16258 ## Which issue does this PR close? Closes #16228 ## Rationale for this change `Array::is_null` does not correctly identify nulls for `DictionaryArray` when the indices poi

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
zhuqi-lucas commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2943654190 > @zhuqi-lucas regarding corner cases, I've added two additional tests on my branch for filter and join. > > Filter can refuse to cancel if the filter rejects many full bat

Re: [I] Support columns having the same alias [datafusion]

2025-06-05 Thread via GitHub
osipovartem commented on issue #6543: URL: https://github.com/apache/datafusion/issues/6543#issuecomment-2943653445 Here is the same issue ```sql CREATE TABLE test (col1 DATE, col2 DATE); INSERT INTO test SELECT TO_DATE('2013-05-08T23:39:20.123'), TO_DATE('2013-05-08T23:39

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
zhuqi-lucas commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2128526165 ## datafusion/common/src/config.rs: ## @@ -722,6 +722,19 @@ config_namespace! { /// then the output will be coerced to a non-view. /// Coerce

[I] September 2025 ASF Board Report [datafusion]

2025-06-05 Thread via GitHub
alamb opened a new issue, #16259: URL: https://github.com/apache/datafusion/issues/16259 Related Items: - Coordination Google Doc: TBD - Mailing List Thread: TBD ### Is your feature request related to a problem or challenge? Per https://whimsy.apache.org/roster/committee/d

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
zhuqi-lucas commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2128549342 ## datafusion/physical-plan/src/execution_plan.rs: ## @@ -75,7 +75,19 @@ use futures::stream::{StreamExt, TryStreamExt}; /// [`execute`]: ExecutionPlan::execu

Re: [D] DISCUSSION: May 27, 2025 DataFusion Meetup in Amsterdam [datafusion]

2025-06-05 Thread via GitHub
GitHub user alamb added a comment to the discussion: DISCUSSION: May 27, 2025 DataFusion Meetup in Amsterdam I think this event was postponed, FWIW GitHub link: https://github.com/apache/datafusion/discussions/16038#discussioncomment-13378346 This is an automatically sent email for gith

Re: [D] Expose intermediary states in aggregation functions [datafusion]

2025-06-05 Thread via GitHub
GitHub user alamb added a comment to the discussion: Expose intermediary states in aggregation functions 👋 -- perhaps https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.Accumulator.html#tymethod.state is relevant GitHub link: https://github.com/apache/datafusion/discussions/16

Re: [PR] minor: Replace many instances of `checkSparkAnswer` with `checkSparkAnswerAndOperator` [datafusion-comet]

2025-06-05 Thread via GitHub
codecov-commenter commented on PR #1851: URL: https://github.com/apache/datafusion-comet/pull/1851#issuecomment-2944850645 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1851?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2945139509 Changing hats to DataFusion user mode where I need to make sure that the end users of our system can press 'cancel' at any time and that works as expected. From that perspecti

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2945161909 I will be traveling tomorrow, but myself and @berkaysynnada will help drive this to completion early next week. I made some progress on sketching out a good API and will circle bac

Re: [PR] feat: Support defining custom MetricValues in PhysicalPlans [datafusion]

2025-06-05 Thread via GitHub
alamb commented on PR #16195: URL: https://github.com/apache/datafusion/pull/16195#issuecomment-2945581697 Let's wait to merge this PR until we ship DataFusion 48 to limit the breaking changes - #15771 I think we'll be able to merge this in the next few days -- This is an autom

Re: [D] DISCUSSION: DataFusion Meetup in New York, NY, USA [datafusion]

2025-06-05 Thread via GitHub
GitHub user timsaucer added a comment to the discussion: DISCUSSION: DataFusion Meetup in New York, NY, USA I am interested in attending and there are a few topics I could present on, depending on what time we have available. GitHub link: https://github.com/apache/datafusion/discussions/1626

Re: [PR] Add `--substrait-round-trip` option in sqllogictests [datafusion]

2025-06-05 Thread via GitHub
alamb commented on code in PR #16183: URL: https://github.com/apache/datafusion/pull/16183#discussion_r2129704959 ## .github/workflows/rust.yml: ## @@ -476,6 +476,28 @@ jobs: POSTGRES_HOST: postgres POSTGRES_PORT: ${{ job.services.postgres.ports[5432] }}

Re: [PR] Add `--substrait-round-trip` option in sqllogictests [datafusion]

2025-06-05 Thread via GitHub
alamb commented on PR #16183: URL: https://github.com/apache/datafusion/pull/16183#issuecomment-2945612836 Thank you for the review @2010YOUY01

Re: [PR] Add `--substrait-round-trip` option in sqllogictests [datafusion]

2025-06-05 Thread via GitHub
alamb merged PR #16183: URL: https://github.com/apache/datafusion/pull/16183

Re: [I] Release DataFusion `48.0.0` (June 2025) [datafusion]

2025-06-05 Thread via GitHub
alamb commented on issue #15771: URL: https://github.com/apache/datafusion/issues/15771#issuecomment-2945615819 > I would also like to include https://github.com/apache/datafusion/pull/16256 Merged!

Re: [PR] Feat: Support Spark 4.0.0 part1 [datafusion-comet]

2025-06-05 Thread via GitHub
andygrove commented on PR #1830: URL: https://github.com/apache/datafusion-comet/pull/1830#issuecomment-2945620138 The Spark version will also need to be updated in `.github/workflows/spark_sql_test_ansi.yml`

Re: [PR] Minor: fix upgrade papercut `pub use PruningStatistics` [datafusion]

2025-06-05 Thread via GitHub
adriangb commented on PR #16264: URL: https://github.com/apache/datafusion/pull/16264#issuecomment-2945616545 Ah I was waiting for the email but in reality I just missed it 😄

Re: [PR] Minor: fix upgrade papercut `pub use PruningStatistics` [datafusion]

2025-06-05 Thread via GitHub
adriangb merged PR #16264: URL: https://github.com/apache/datafusion/pull/16264

Re: [PR] Minor: fix upgrade papercut `pub use PruningStatistics` [datafusion]

2025-06-05 Thread via GitHub
adriangb commented on PR #16264: URL: https://github.com/apache/datafusion/pull/16264#issuecomment-2945617720 Worked!! Sweet. Thank you Andrew (for the fix and nudge).

Re: [I] Release DataFusion `48.0.0` (June 2025) [datafusion]

2025-06-05 Thread via GitHub
xudong963 commented on issue #15771: URL: https://github.com/apache/datafusion/issues/15771#issuecomment-2948015656 > It looks like these changes all went into the `main` branch. I've been testing off of the `48.0.0-rc1` tag, should I switch to test off of `main`? I'll update branch46

[I] Update Fuzz tests to include Dict with null values [datafusion]

2025-06-05 Thread via GitHub
kosiew opened a new issue, #16266: URL: https://github.com/apache/datafusion/issues/16266 #16258 closes #16228 We should [update the fuzz tests to include Dict with null values](https://github.com/apache/datafusion/pull/16258#issuecomment-2944973123).

Re: [PR] Fix intermittent SQL logic test failure in limit.slt by adding ORDER BY clause [datafusion]

2025-06-05 Thread via GitHub
kosiew commented on code in PR #16257: URL: https://github.com/apache/datafusion/pull/16257#discussion_r2131457267 ## datafusion/sqllogictest/test_files/limit.slt: ## @@ -860,6 +860,7 @@ query I with selection as ( select * from test_limit_with_partitions +order b

Re: [I] Release DataFusion `48.0.0` (June 2025) [datafusion]

2025-06-05 Thread via GitHub
xudong963 commented on issue #15771: URL: https://github.com/apache/datafusion/issues/15771#issuecomment-2948082010 https://github.com/apache/datafusion/pull/16267 After it's merged, I'll push the 48.0.0-rc2 and start vote

[PR] [branch-48] update changelog [datafusion]

2025-06-05 Thread via GitHub
xudong963 opened a new pull request, #16267: URL: https://github.com/apache/datafusion/pull/16267 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes teste

Re: [PR] Add support for mysql's drop index (`ALTER TABLE table_a DROP INDEX idx_a`) [datafusion-sqlparser-rs]

2025-06-05 Thread via GitHub
iffyio commented on code in PR #1865: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1865#discussion_r2131627748 ## tests/sqlparser_common.rs: ## @@ -9132,7 +9132,9 @@ fn test_create_index_with_with_clause() { #[test] fn parse_drop_index() { let sql = "DROP

[I] Update or ignore tests in Spark SQL WholeStageCodegenSuite [datafusion-comet]

2025-06-05 Thread via GitHub
andygrove opened a new issue, #1852: URL: https://github.com/apache/datafusion-comet/issues/1852 ### What is the problem the feature request solves? The following tests in WholeStageCodegenSuite currently pass because we are falling back to Spark, but they fail when they run natively

[PR] Extend benchmark comparison script with more detailed statistics [datafusion]

2025-06-05 Thread via GitHub
pepijnve opened a new pull request, #16262: URL: https://github.com/apache/datafusion/pull/16262 ## Which issue does this PR close? - No issue created yet, related to PR #16196. ## Rationale for this change The current benchmark comparison script compares

Re: [PR] chore: Update documentation and ignore Spark SQL tests for known issue with count distinct on NaN in aggregate [datafusion-comet]

2025-06-05 Thread via GitHub
andygrove commented on PR #1847: URL: https://github.com/apache/datafusion-comet/pull/1847#issuecomment-2944297917 I reported the bug in DataFusion yesterday and there is already a fix https://github.com/apache/datafusion/pull/16256 Perhaps this will be included in the 48.0.0 release

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
ozankabak commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2944338401 Great, let's add tests for all cases we currently fail to cover (interleave, join, filter). We will then use them as litmus tests as we iterate on the API and the rule. 🚀

Re: [PR] fix: NaN semantics in GROUP BY [datafusion]

2025-06-05 Thread via GitHub
andygrove merged PR #16256: URL: https://github.com/apache/datafusion/pull/16256

Re: [I] Support columns having the same alias [datafusion]

2025-06-05 Thread via GitHub
alamb commented on issue #6543: URL: https://github.com/apache/datafusion/issues/6543#issuecomment-2944658229 > would it be correct to use statement visitors here to add unique aliases? I think that is actually a pretty neat idea -- specifically add the aliases in the SQL planner

Re: [PR] feat: Add Aggregate UDF to FFI crate [datafusion]

2025-06-05 Thread via GitHub
alamb commented on PR #14775: URL: https://github.com/apache/datafusion/pull/14775#issuecomment-2945964478 gogogo

Re: [PR] feat: Add Aggregate UDF to FFI crate [datafusion]

2025-06-05 Thread via GitHub
alamb merged PR #14775: URL: https://github.com/apache/datafusion/pull/14775

Re: [PR] feat: Add Window UDFs to FFI Crate [datafusion]

2025-06-05 Thread via GitHub
alamb commented on PR #16261: URL: https://github.com/apache/datafusion/pull/16261#issuecomment-2945974005 I rebased this PR on main so it was ready to go

Re: [PR] feat: Add Window UDFs to FFI Crate [datafusion]

2025-06-05 Thread via GitHub
alamb commented on PR #16261: URL: https://github.com/apache/datafusion/pull/16261#issuecomment-2945974874 If the tests pass I'll merge it in

Re: [D] DISCUSSION: DataFusion Meetup in New York, NY, USA [datafusion]

2025-06-05 Thread via GitHub
GitHub user leoDYL added a comment to the discussion: DISCUSSION: DataFusion Meetup in New York, NY, USA We're looking for a variety of topics with the theme of reflecting on the 50 releases of DataFusion! Seems [VegaFusion](https://vegafusion.io/) has been using DataFusion for a while so it

Re: [I] Make it easier to run TPCH queries with datafusion-cli [datafusion]

2025-06-05 Thread via GitHub
clflushopt commented on issue #14608: URL: https://github.com/apache/datafusion/issues/14608#issuecomment-2948048431 Hey @alamb following suggestions from @kevinjqliu I am happy to say that https://github.com/clflushopt/datafusion-tpch provides a ux on par with duckdb and what we discussed

Re: [PR] Extend benchmark comparison script with more detailed statistics [datafusion]

2025-06-05 Thread via GitHub
pepijnve commented on code in PR #16262: URL: https://github.com/apache/datafusion/pull/16262#discussion_r2131585926 ## benchmarks/bench.sh: ## @@ -66,10 +67,11 @@ DATAFUSION_DIR=/source/datafusion ./bench.sh run tpch ** * Commands ** -data: Generates

Re: [PR] Extend benchmark comparison script with more detailed statistics [datafusion]

2025-06-05 Thread via GitHub
pepijnve commented on code in PR #16262: URL: https://github.com/apache/datafusion/pull/16262#discussion_r2131600952 ## benchmarks/compare.py: ## @@ -148,10 +174,23 @@ def compare( ) continue -total_baseline_time += baseline_result.execution_t

Re: [PR] Fix distinct count for DictionaryArray to correctly account for nulls in values array [datafusion]

2025-06-05 Thread via GitHub
kosiew commented on code in PR #16258: URL: https://github.com/apache/datafusion/pull/16258#discussion_r2131413138 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -5030,6 +5030,20 @@ select count(distinct column1), count(distinct column2) from dict_test group by sta

Re: [PR] [branch-48] update changelog [datafusion]

2025-06-05 Thread via GitHub
xudong963 merged PR #16267: URL: https://github.com/apache/datafusion/pull/16267

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2948268910 @zhuqi-lucas @alamb I wanted to work on measuring the performance impact of this PR today, but looking at https://github.com/apache/datafusion/pull/16262#pullrequestreview-290313953

Re: [I] Unaligned memory access in `SparkUnsafeRow` [datafusion-comet]

2025-06-05 Thread via GitHub
parthchandra commented on issue #1849: URL: https://github.com/apache/datafusion-comet/issues/1849#issuecomment-2947748623 I wonder how likely is it that a comet user would be on an architecture that does not support unaligned memory access.

[PR] Add compression option to SpillManager [datafusion]

2025-06-05 Thread via GitHub
ding-young opened a new pull request, #16268: URL: https://github.com/apache/datafusion/pull/16268 ## Which issue does this PR close? - Closes #16130 . ## TODO - [ ] add test for compression in spill file - [ ] refine arg names - [ ] check config docs

Re: [PR] Adjust slttest to pass without RUST_BACKTRACE enabled [datafusion]

2025-06-05 Thread via GitHub
alamb commented on PR #16251: URL: https://github.com/apache/datafusion/pull/16251#issuecomment-2944591691 > I am confused why CI will not fail for this case, i remember some cases i run locally to -- --complete, but the CI failed, so i add RUST_BACKTRACE to generate. I think sqllogi

[PR] minor: Replace many instances of `checkSparkAnswer` with `checkSparkAnswerAndOperator` [datafusion-comet]

2025-06-05 Thread via GitHub
andygrove opened a new pull request, #1851: URL: https://github.com/apache/datafusion-comet/pull/1851 ## Which issue does this PR close? N/A ## Rationale for this change Improve testing and help prevent regressions ## What changes are included in th

Re: [PR] minor: Replace many instances of `checkSparkAnswer` with `checkSparkAnswerAndOperator` [datafusion-comet]

2025-06-05 Thread via GitHub
andygrove commented on code in PR #1851: URL: https://github.com/apache/datafusion-comet/pull/1851#discussion_r2128955528 ## spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala: ## @@ -399,12 +399,7 @@ class CometArrayExpressionSuite extends CometTestBase with

Re: [PR] feat: Support defining custom MetricValues in PhysicalPlans [datafusion]

2025-06-05 Thread via GitHub
alamb commented on PR #16195: URL: https://github.com/apache/datafusion/pull/16195#issuecomment-2944619110 > > Should it target main or the 47 branch ? > > The `main` branch is the good one (I don't think the branch-47 is the most recent release branch anyway) yeah, let's targe

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
zhuqi-lucas commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2128830214 ## datafusion/physical-plan/src/execution_plan.rs: ## @@ -546,6 +558,23 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync { child_pushdo

Re: [I] Release DataFusion `48.0.0` (June 2025) [datafusion]

2025-06-05 Thread via GitHub
andygrove commented on issue #15771: URL: https://github.com/apache/datafusion/issues/15771#issuecomment-2944294392 I would also like to include https://github.com/apache/datafusion/pull/16256

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
zhuqi-lucas commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2128845907 ## datafusion/physical-plan/src/execution_plan.rs: ## @@ -75,7 +75,19 @@ use futures::stream::{StreamExt, TryStreamExt}; /// [`execute`]: ExecutionPlan::execu

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
zhuqi-lucas commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2128846243 ## datafusion/sqllogictest/test_files/group_by.slt: ## @@ -4113,7 +4113,7 @@ EXPLAIN SELECT lhs.c, rhs.c, lhs.sum1, rhs.sum1 logical_plan 01)Projection

Re: [PR] fix: NaN semantics in GROUP BY [datafusion]

2025-06-05 Thread via GitHub
andygrove commented on code in PR #16256: URL: https://github.com/apache/datafusion/pull/16256#discussion_r2128850266 ## datafusion/physical-plan/src/aggregates/group_values/multi_group_by/primitive.rs: ## @@ -121,7 +122,7 @@ impl GroupColumn // Otherwise, we n

Re: [PR] Adjust slttest to pass without RUST_BACKTRACE enabled [datafusion]

2025-06-05 Thread via GitHub
zhuqi-lucas commented on PR #16251: URL: https://github.com/apache/datafusion/pull/16251#issuecomment-2944937313 Thank you @alamb , is it possible for --complete also generate substring which matches in CI?

Re: [I] Enable more DPP Spark SQL tests [datafusion-comet]

2025-06-05 Thread via GitHub
andygrove commented on issue #1739: URL: https://github.com/apache/datafusion-comet/issues/1739#issuecomment-2945837570 Fixed in https://github.com/apache/datafusion-comet/issues/1831 and https://github.com/apache/datafusion-comet/pull/1838

Re: [I] Enable more DPP Spark SQL tests [datafusion-comet]

2025-06-05 Thread via GitHub
andygrove closed issue #1739: Enable more DPP Spark SQL tests URL: https://github.com/apache/datafusion-comet/issues/1739

Re: [PR] Fix distinct count for DictionaryArray to correctly account for nulls in values array [datafusion]

2025-06-05 Thread via GitHub
alamb commented on code in PR #16258: URL: https://github.com/apache/datafusion/pull/16258#discussion_r2130020509 ## datafusion/functions-aggregate/src/count.rs: ## @@ -711,8 +711,8 @@ impl Accumulator for DistinctCountAccumulator { } (0..arr.len()).try_for_e

Re: [PR] feat: Support null aware + equijoins for `NestedLoopJoin` [datafusion]

2025-06-05 Thread via GitHub
jonathanc-n commented on PR #16210: URL: https://github.com/apache/datafusion/pull/16210#issuecomment-2945916745 There's an interesting implementation of index joins [here](https://github.com/duckdb/duckdb/pull/1008), but this would involve creating indexes. What are the thoughts on support

Re: [PR] Minor: fix upgrade papercut `pub use PruningStatistics` [datafusion]

2025-06-05 Thread via GitHub
alamb commented on PR #16264: URL: https://github.com/apache/datafusion/pull/16264#issuecomment-2945947663 WOOHOO 🚀

Re: [I] Question on: `visit_expressions_mut` for alias expr [datafusion-sqlparser-rs]

2025-06-05 Thread via GitHub
HuyNguyen7994 commented on issue #1475: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1475#issuecomment-2944295472 @cisaacson I'm having the same problem. Turn out you want to visit `SelectItem`: https://github.com/HuyNguyen7994/datafusion-sqlparser-rs/tree/add-select-item-

Re: [PR] minor: Replace many instances of `checkSparkAnswer` with `checkSparkAnswerAndOperator` [datafusion-comet]

2025-06-05 Thread via GitHub
andygrove commented on code in PR #1851: URL: https://github.com/apache/datafusion-comet/pull/1851#discussion_r2129233751 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1394,10 +1394,10 @@ class CometExpressionSuite extends CometTestBase with Adapti

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-05 Thread via GitHub
adriangb commented on code in PR #16014: URL: https://github.com/apache/datafusion/pull/16014#discussion_r2129798435 ## datafusion/proto/src/physical_plan/to_proto.rs: ## @@ -506,7 +506,7 @@ pub fn serialize_file_scan_config( .iter() .cloned() .collect

Re: [PR] Add late pruning of file based on file level statistics [datafusion]

2025-06-05 Thread via GitHub
adriangb commented on PR #16014: URL: https://github.com/apache/datafusion/pull/16014#issuecomment-2945656786 I've rebased this and it's looking nice now. I think the main open question is the concern about performance / overhead: https://github.com/apache/datafusion/pull/16014/file

Re: [D] DISCUSSION: DataFusion Meetup in New York, NY, USA [datafusion]

2025-06-05 Thread via GitHub
GitHub user jonmmease added a comment to the discussion: DISCUSSION: DataFusion Meetup in New York, NY, USA What kinds of talks are you looking for? I may be available to give one on how [VegaFusion](https://vegafusion.io/) uses DataFusion. GitHub link: https://github.com/apache/datafusion/d

Re: [D] DISCUSSION: DataFusion Meetup in New York, NY, USA [datafusion]

2025-06-05 Thread via GitHub
GitHub user lwwmanning added a comment to the discussion: DISCUSSION: DataFusion Meetup in New York, NY, USA I’d love to give a talk on DataFusion/Vortex stuff! Specifically, how DataFusion’s extensibility was hugely useful for bootstrapping, building/testing, & benchmarking a new file format

Re: [PR] Extend benchmark comparison script with more detailed statistics [datafusion]

2025-06-05 Thread via GitHub
2010YOUY01 commented on code in PR #16262: URL: https://github.com/apache/datafusion/pull/16262#discussion_r2131020155 ## benchmarks/bench.sh: ## @@ -66,10 +67,11 @@ DATAFUSION_DIR=/source/datafusion ./bench.sh run tpch ** * Commands ** -data: Generate

Re: [I] Release DataFusion `48.0.0` (June 2025) [datafusion]

2025-06-05 Thread via GitHub
shehabgamin commented on issue #15771: URL: https://github.com/apache/datafusion/issues/15771#issuecomment-2947035398 > I think we have merged all the desired PRs now: > > * [fix: NaN semantics in GROUP BY #16256](https://github.com/apache/datafusion/pull/16256) > > * [

Re: [PR] Extend benchmark comparison script with more detailed statistics [datafusion]

2025-06-05 Thread via GitHub
Copilot commented on code in PR #16262: URL: https://github.com/apache/datafusion/pull/16262#discussion_r2131026859 ## benchmarks/compare.py: ## @@ -148,10 +174,23 @@ def compare( ) continue -total_baseline_time += baseline_result.execution_ti

Re: [I] Spark Test fails `vectorized reader: missing all struct fields` [datafusion-comet]

2025-06-05 Thread via GitHub
parthchandra commented on issue #1843: URL: https://github.com/apache/datafusion-comet/issues/1843#issuecomment-2947362362 Why would the expected result be `[null], [null], [null]` ? This means that all the structs are null but that is not the actual data. Only in the third row, is the str

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
zhuqi-lucas commented on code in PR #16196: URL: https://github.com/apache/datafusion/pull/16196#discussion_r2128845315 ## datafusion/common/src/config.rs: ## @@ -722,6 +722,19 @@ config_namespace! { /// then the output will be coerced to a non-view. /// Coerce

Re: [I] Inconsistency with count distinct on NaN values [datafusion]

2025-06-05 Thread via GitHub
andygrove closed issue #16254: Inconsistency with count distinct on NaN values URL: https://github.com/apache/datafusion/issues/16254

Re: [PR] feat: Allow cancelling of grouping operations which are CPU bound [datafusion]

2025-06-05 Thread via GitHub
pepijnve commented on PR #16196: URL: https://github.com/apache/datafusion/pull/16196#issuecomment-2944505070 > I feel like we are getting close to a point where we start having not-so-fruitful discussions. I think I have made a good effort to make my arguments and reasoning clear. @

Re: [PR] Handle dicts for distinct count [datafusion]

2025-06-05 Thread via GitHub
blaginin merged PR #15871: URL: https://github.com/apache/datafusion/pull/15871

Re: [I] Improve performance of COUNT (distinct x) for dictionary columns [datafusion]

2025-06-05 Thread via GitHub
blaginin closed issue #258: Improve performance of COUNT (distinct x) for dictionary columns URL: https://github.com/apache/datafusion/issues/258

Re: [PR] chore(deps): bump substrait from 0.56.0 to 0.57.0 [datafusion]

2025-06-05 Thread via GitHub
dependabot[bot] commented on PR #16143: URL: https://github.com/apache/datafusion/pull/16143#issuecomment-2945761029 Dependabot tried to update this pull request, but something went wrong. We're looking into it, but in the meantime you can retry the update by commenting `@dependabot rebase`

[PR] feat: Add Aggregate UDF to FFI crate [datafusion]

2025-06-05 Thread via GitHub
timsaucer opened a new pull request, #14775: URL: https://github.com/apache/datafusion/pull/14775 ## Which issue does this PR close? This PR addresses part of #14562 ## Rationale for this change This change allows for using user defined **aggregate** functions across FFI

Re: [PR] feat: Add Aggregate UDF to FFI crate [datafusion]

2025-06-05 Thread via GitHub
alamb closed pull request #14775: feat: Add Aggregate UDF to FFI crate URL: https://github.com/apache/datafusion/pull/14775

Re: [PR] feat: Add Aggregate UDF to FFI crate [datafusion]

2025-06-05 Thread via GitHub
alamb commented on PR #14775: URL: https://github.com/apache/datafusion/pull/14775#issuecomment-2945762124 It seems github is experiencing issues. I will close/reopen this PR to restart the checks https://www.githubstatus.com/ https://github.com/user-attachments/assets/2bc627ef

Re: [PR] Fix distinct count for DictionaryArray to correctly account for nulls in values array [datafusion]

2025-06-05 Thread via GitHub
jonathanc-n commented on code in PR #16258: URL: https://github.com/apache/datafusion/pull/16258#discussion_r2130695219 ## datafusion/sqllogictest/test_files/aggregate.slt: ## @@ -5030,6 +5030,20 @@ select count(distinct column1), count(distinct column2) from dict_test group by

Re: [I] Iceberg integration - parquet-column version conflicts [datafusion-comet]

2025-06-05 Thread via GitHub
huaxingao commented on issue #1833: URL: https://github.com/apache/datafusion-comet/issues/1833#issuecomment-2946713655 To work around the shading issues, I'm working on higher-level abstractions so that we don't need to pass any Parquet objects.

Re: [I] I would like to be able to use PyDataFrame from other projects [datafusion-python]

2025-06-05 Thread via GitHub
mara-schulke commented on issue #581: URL: https://github.com/apache/datafusion-python/issues/581#issuecomment-2944785439 Hi @andygrove, we are currently using `PyDataFrame` and would like to use it to convert back to a `datafusion::DataFrame` do you have any information / guidance on how

Re: [I] I would like to be able to use PyDataFrame from other projects [datafusion-python]

2025-06-05 Thread via GitHub
andygrove commented on issue #581: URL: https://github.com/apache/datafusion-python/issues/581#issuecomment-2944796949 Perhaps @timsaucer can provide some guidance

Re: [PR] minor: Replace many instances of `checkSparkAnswer` with `checkSparkAnswerAndOperator` [datafusion-comet]

2025-06-05 Thread via GitHub
comphead commented on code in PR #1851: URL: https://github.com/apache/datafusion-comet/pull/1851#discussion_r2129022559 ## spark/src/test/scala/org/apache/comet/CometExpressionSuite.scala: ## @@ -1394,10 +1394,10 @@ class CometExpressionSuite extends CometTestBase with Adaptiv

Re: [PR] Intermediate result blocked approach to aggregation memory management [datafusion]

2025-06-05 Thread via GitHub
Dandandan commented on PR #15591: URL: https://github.com/apache/datafusion/pull/15591#issuecomment-2945391975 thanks @Rachelint and congratulations!

Re: [PR] [WIP] Remove `COMET_SHUFFLE_FALLBACK_TO_COLUMNAR` config [datafusion-comet]

2025-06-05 Thread via GitHub
andygrove commented on PR #1736: URL: https://github.com/apache/datafusion-comet/pull/1736#issuecomment-2945396822 Current failures: core1: ``` 2025-06-05T16:40:13.0938574Z [info] - avoid reordering broadcast join keys to match input hash partitioning *** FAILED *** (2 seco
