Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2021921381 ## native/core/src/execution/shuffle/shuffle_writer.rs: ## @@ -667,175 +740,322 @@ impl Debug for ShuffleRepartitioner { } } -/// The status of appen

Re: [PR] Add query to extended clickbench suite for "complex filter" [datafusion]

2025-03-31 Thread via GitHub
alamb commented on PR #15500: URL: https://github.com/apache/datafusion/pull/15500#issuecomment-2767682615 > I think we may need to augment the github extended tests to run all the benchmarks, not to validate the results (though we can do that later) but to just make sure the benchmarks sti

Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15499: URL: https://github.com/apache/datafusion/pull/15499#discussion_r2021931938 ## datafusion/sql/tests/sql_integration.rs: ## @@ -157,219 +157,282 @@ fn parse_ident_normalization() { #[test] fn select_no_relation() { -quick_test( -

Re: [PR] Extract tokio runtime creation from hot loop in benchmarks [datafusion]

2025-03-31 Thread via GitHub
alamb merged PR #15508: URL: https://github.com/apache/datafusion/pull/15508 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]

2025-03-31 Thread via GitHub
alamb commented on PR #15499: URL: https://github.com/apache/datafusion/pull/15499#issuecomment-2767687638 FYI @blaginin

Re: [I] Improve `String` to `&str` conversions [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #15498: URL: https://github.com/apache/datafusion/issues/15498#issuecomment-2767680544 > @alamb would you mind giving an example on Arc ? Here is one example https://github.com/apache/datafusion/blob/923bfb7fc7cf1718522f572b74f3756d02652933/datafusion/c

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-31 Thread via GitHub
parthchandra commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2767719365 > [@andygrove](https://github.com/andygrove) sorry for the confusion here, my cluster is on Spark 3.5.0, and Comet 0.7.0 prebuilt JAR did not work on it > > [@pa

Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-03-31 Thread via GitHub
jayzhan211 commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2767724669 > There have been a number of issues where benchmarks stopped working and no one noticed until someone happened to try and run them Instead of running the benchmark, how

Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]

2025-03-31 Thread via GitHub
qstommyshu commented on code in PR #15499: URL: https://github.com/apache/datafusion/pull/15499#discussion_r2021991253 ## datafusion/sql/tests/sql_integration.rs: ## @@ -157,219 +157,282 @@ fn parse_ident_normalization() { #[test] fn select_no_relation() { -quick_test( -

Re: [PR] Update user guide to note decimal is not experimental anymore [datafusion]

2025-03-31 Thread via GitHub
Jiashu-Hu closed pull request #15514: Update user guide to note decimal is not experimental anymore URL: https://github.com/apache/datafusion/pull/15514

Re: [I] Improve `String` to `&str` conversions [datafusion]

2025-03-31 Thread via GitHub
matthewmturner commented on issue #15498: URL: https://github.com/apache/datafusion/issues/15498#issuecomment-2767914670 I have a vague recollection that `Arc` was only faster than cloning a `String` when there was very little contention due to locking used by `Arc` internally. If there was a

[PR] feat: Add config max_temp_directory_size to limit max disk usage for spilling queries [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 opened a new pull request, #15520: URL: https://github.com/apache/datafusion/pull/15520 ## Which issue does this PR close? - Closes #15358 ## Rationale for this change See the rationale part of the first attempt PR https://github.com/apache/datafusion

Re: [PR] feat: Add config `max_temp_directory_size` to limit max disk usage for spilling queries [datafusion]

2025-03-31 Thread via GitHub
Copilot commented on code in PR #15520: URL: https://github.com/apache/datafusion/pull/15520#discussion_r2022259546 ## datafusion/execution/src/disk_manager.rs: ## @@ -164,6 +216,50 @@ impl RefCountedTempFile { pub fn inner(&self) -> &NamedTempFile { &self.tempfile

Re: [PR] feat(sql): add diagnostic for wrong number of function arguments [datafusion]

2025-03-31 Thread via GitHub
eliaperantoni commented on code in PR #15490: URL: https://github.com/apache/datafusion/pull/15490#discussion_r2022232702 ## datafusion/common/src/utils/mod.rs: ## @@ -979,16 +981,19 @@ pub fn take_function_args( ) -> Result<[T; N]> { let args = args.into_iter().collect::>

Re: [PR] feat(sql): add diagnostic for wrong number of function arguments [datafusion]

2025-03-31 Thread via GitHub
eliaperantoni commented on PR #15490: URL: https://github.com/apache/datafusion/pull/15490#issuecomment-2768341633 Also, you made the PR ready for review again after @alamb made it a draft, without first committing any new changes?

Re: [PR] chore: Run Comet tests for more Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
codecov-commenter commented on PR #1582: URL: https://github.com/apache/datafusion-comet/pull/1582#issuecomment-2767137360 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1582?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

[I] Add documentation for benchmarking Comet in AWS with S3 data source [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove opened a new issue, #1583: URL: https://github.com/apache/datafusion-comet/issues/1583 ### What is the problem the feature request solves? We have focused much of our benchmarking effort on single-node benchmarks with local data. This does not reflect how Spark is generally

Re: [I] Update ClickBench benchmarks with DataFusion `46.0.0` (When Published) [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #14587: URL: https://github.com/apache/datafusion/issues/14587#issuecomment-2767092217 We will probably also need to remove the call to `to_timestamp_seconds` - https://github.com/apache/datafusion/issues/15465 As well as - https://github.com/apache/dataf

[PR] WIP: Aggregate UDF FFI [datafusion]

2025-03-31 Thread via GitHub
CrystalZhou0529 opened a new pull request, #15510: URL: https://github.com/apache/datafusion/pull/15510 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021619914 ## datafusion/datasource-parquet/src/source.rs: ## @@ -349,11 +337,13 @@ impl ParquetSource { } /// Optional reference to this parquet scan's pruning pre

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
xudong963 commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2767850478 Thank you, @leoyvens! I plan to add tests for the PR in the next two days, and then we can continue to move it forward. Thanks for all your review!

Re: [I] common_sub_expression_eliminate internal error [datafusion]

2025-03-31 Thread via GitHub
andygrove commented on issue #3418: URL: https://github.com/apache/datafusion/issues/3418#issuecomment-2767628869 > Is this still an issue [@andygrove](https://github.com/andygrove) ? No, I will go ahead and close this. Thanks.

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021674359 ## datafusion/datasource-parquet/src/source.rs: ## @@ -349,11 +337,13 @@ impl ParquetSource { } /// Optional reference to this parquet scan's pruning

Re: [PR] Add query to extended clickbench suite for "complex filter" [datafusion]

2025-03-31 Thread via GitHub
Omega359 commented on PR #15500: URL: https://github.com/apache/datafusion/pull/15500#issuecomment-2767216080 I am seeing the sql_planner benchmark now failing after this was merged. ``` Benchmarking physical_plan_clickbench_q50 Benchmarking physical_plan_clickbench_q50: Warming up f

Re: [PR] Add query to extended clickbench suite for "complex filter" [datafusion]

2025-03-31 Thread via GitHub
Omega359 commented on PR #15500: URL: https://github.com/apache/datafusion/pull/15500#issuecomment-2767230210 I think we may need to augment the github extended tests to run all the benchmarks, not to validate the results (though we can do that later) but to just make sure the benchmarks st

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021713556 ## datafusion/physical-plan/src/filter.rs: ## @@ -433,6 +433,22 @@ impl ExecutionPlan for FilterExec { } try_embed_projection(projection, self)

Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2767310181 I would personally suggest that if we are going to run the benchmarks, we should also be actively tracking them / making sure they are measuring something useful.

Re: [PR] docs: Update supported Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove commented on code in PR #1580: URL: https://github.com/apache/datafusion-comet/pull/1580#discussion_r2021819064 ## docs/source/user-guide/installation.md: ## @@ -30,22 +30,32 @@ Make sure the following requirements are met and software installed on your mach ### Su

Re: [I] datafusion-cli: document reading partitioned parquet [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on issue #15309: URL: https://github.com/apache/datafusion/issues/15309#issuecomment-2766488794 Hmm are you saying that they're using wildcards which we don't support, but they happen to work for local filesystem so they're getting away with it? If so yes I think it's wor

Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2021913440 ## native/core/src/execution/shuffle/shuffle_writer.rs: ## @@ -667,175 +740,322 @@ impl Debug for ShuffleRepartitioner { } } -/// The status of appen

Re: [PR] fix: improve CSV path handling and error handling in substrait example [datafusion-python]

2025-03-31 Thread via GitHub
timsaucer commented on PR #1073: URL: https://github.com/apache/datafusion-python/pull/1073#issuecomment-2766787598 I don't think these changes make a substantive difference to the example. One of the main goals of the examples is to be as simple and clear as possible for new users to underst

[I] Follow up #15432 [datafusion]

2025-03-31 Thread via GitHub
xudong963 opened a new issue, #15519: URL: https://github.com/apache/datafusion/issues/15519 I suggest writing this logic with a state machine, which polls files at one point only. I believe FSMs would increase the code quality and understandability _Originally posted

[PR] update note [datafusion]

2025-03-31 Thread via GitHub
Jiashu-Hu opened a new pull request, #15515: URL: https://github.com/apache/datafusion/pull/15515 ## Which issue does this PR close? - [Closes #15464](https://github.com/apache/datafusion/issues/15464). ## Rationale for this change The Decimal support in DataF

[I] Run all benchmarks on merge to main branch [datafusion]

2025-03-31 Thread via GitHub
Omega359 opened a new issue, #15511: URL: https://github.com/apache/datafusion/issues/15511 ### Is your feature request related to a problem or challenge? There have been a number of issues where benchmarks stopped working and no one noticed until someone happened to try and run them.

Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-03-31 Thread via GitHub
berkaysynnada commented on PR #15504: URL: https://github.com/apache/datafusion/pull/15504#issuecomment-2767270135 Very nice example, thank you @Shreyaskr1409. Perhaps we can make this in datafusion-examples as well?

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021737787 ## datafusion/physical-plan/src/filter.rs: ## @@ -433,6 +433,22 @@ impl ExecutionPlan for FilterExec { } try_embed_projection(projection, self)

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021746142 ## datafusion/core/src/datasource/physical_plan/parquet.rs: ## @@ -1769,13 +1775,13 @@ mod tests { let sql = "select * from base_table where name='test02'";

Re: [I] Push down data filter for native_iceberg_compat [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove closed issue #1562: Push down data filter for native_iceberg_compat URL: https://github.com/apache/datafusion-comet/issues/1562

Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove merged PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566

Re: [I] Add some DataFrame method(s) to combine two inputs where the schema can be different [datafusion]

2025-03-31 Thread via GitHub
alamb closed issue #12650: Add some DataFrame method(s) to combine two inputs where the schema can be different URL: https://github.com/apache/datafusion/issues/12650

Re: [PR] Support computing statistics for FileGroup [datafusion]

2025-03-31 Thread via GitHub
xudong963 merged PR #15432: URL: https://github.com/apache/datafusion/pull/15432

Re: [PR] WIP: Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-03-31 Thread via GitHub
berkaysynnada commented on PR #15503: URL: https://github.com/apache/datafusion/pull/15503#issuecomment-2768177113 > > I suggest modifying the existing API, not a new one. > > Which one do you think is suitable to modify, `statistics()`? yes

Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15501: URL: https://github.com/apache/datafusion/pull/15501#discussion_r2021936887 ## datafusion/core/tests/fuzz_cases/sort_query_fuzz.rs: ## @@ -0,0 +1,635 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor l

Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 commented on code in PR #15501: URL: https://github.com/apache/datafusion/pull/15501#discussion_r2022130464 ## datafusion/core/tests/fuzz_cases/sort_query_fuzz.rs: ## @@ -0,0 +1,635 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contribu

Re: [PR] chore(deps): bump blake3 from 1.7.0 to 1.8.0 [datafusion]

2025-03-31 Thread via GitHub
xudong963 merged PR #15502: URL: https://github.com/apache/datafusion/pull/15502

[PR] Add utf8view benchmark for aggregate topk [datafusion]

2025-03-31 Thread via GitHub
zhuqi-lucas opened a new pull request, #15518: URL: https://github.com/apache/datafusion/pull/15518 ## Which issue does this PR close? Since we have merged the utf8view support for aggregate topk, this PR adds the corresponding benchmark code. https://github.com/apach

[I] Enable sort query fuzzing with limited memory [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 opened a new issue, #15517: URL: https://github.com/apache/datafusion/issues/15517 ### Is your feature request related to a problem or challenge? A new sort query fuzzer for out-of-core sorting has been added in https://github.com/apache/datafusion/pull/15501. However,

Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 commented on PR #15501: URL: https://github.com/apache/datafusion/pull/15501#issuecomment-2768105998 Thank you for the review! > I think we should consider the random seed along with the runtime of this test, but otherwise it looks really nice It's using a random sta

Re: [PR] fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura merged PR #1578: URL: https://github.com/apache/datafusion-comet/pull/1578

Re: [PR] fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura commented on code in PR #1578: URL: https://github.com/apache/datafusion-comet/pull/1578#discussion_r2021779171 ## spark/src/main/scala/org/apache/spark/sql/comet/CometNativeScanExec.scala: ## @@ -53,29 +55,50 @@ case class CometNativeScanExec( disableBucke

Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-03-31 Thread via GitHub
Omega359 commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2767313655 Sure. https://github.com/apache/datafusion/issues/5504

[PR] Update user guide to note decimal is not experimental anymore [datafusion]

2025-03-31 Thread via GitHub
Jiashu-Hu opened a new pull request, #15514: URL: https://github.com/apache/datafusion/pull/15514 ## Which issue does this PR close? - [Closes #15464](https://github.com/apache/datafusion/issues/15464). ## Rationale for this change The Decimal support in DataF

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2022103675 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -186,6 +235,90 @@ impl TopK { Ok(()) } +fn calculate_dynamic_filters( +thresholds:

[I] Improve performance of native scan [datafusion-comet]

2025-03-31 Thread via GitHub
wForget opened a new issue, #1586: URL: https://github.com/apache/datafusion-comet/issues/1586 ### What is the problem the feature request solves? I found several possible optimizations for native scan by profiling CometReadBenchmark. 1. Snappy decompression performance is poor

[PR] feat: respect `batchSize/workerThreads/blockingThreads` configurations for native_iceberg_compat scan [datafusion-comet]

2025-03-31 Thread via GitHub
wForget opened a new pull request, #1587: URL: https://github.com/apache/datafusion-comet/pull/1587 ## Which issue does this PR close? Closes #1571. ## Rationale for this change Respect `batchSize/workerThreads/blockingThreads` configurations for native_iceberg_c

Re: [PR] fix: make register_object_store use same session_env as file scan [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura merged PR #1555: URL: https://github.com/apache/datafusion-comet/pull/1555

Re: [PR] docs: Update supported Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura commented on PR #1580: URL: https://github.com/apache/datafusion-comet/pull/1580#issuecomment-2767522455 Thanks @andygrove

Re: [PR] docs: Update supported Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura merged PR #1580: URL: https://github.com/apache/datafusion-comet/pull/1580

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
suibianwanwank commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2767929587 >And follow the model of evaluation for [Literal](https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/expressions/literal.rs) > >Except you would

[PR] Add GreptimeDB to the "Users" in README [datafusion-sqlparser-rs]

2025-03-31 Thread via GitHub
MichaelScofield opened a new pull request, #1788: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1788 GreptimeDB uses sqlparser in its "RANGE" syntax, and various other places.

[PR] Minor: clone and debug for FileSinkConfig [datafusion]

2025-03-31 Thread via GitHub
jayzhan211 opened a new pull request, #15516: URL: https://github.com/apache/datafusion/pull/15516 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes test

Re: [I] Perf: Dataframe with_column and with_column_renamed are slow [datafusion]

2025-03-31 Thread via GitHub
Omega359 closed issue #14563: Perf: Dataframe with_column and with_column_renamed are slow URL: https://github.com/apache/datafusion/issues/14563

[I] [Epic] A collection of dynamic filtering related items [datafusion]

2025-03-31 Thread via GitHub
alamb opened a new issue, #15512: URL: https://github.com/apache/datafusion/issues/15512 ### Is your feature request related to a problem or challenge? This is a collection of various items related to "dynamic filtering". Roughly speaking, dynamic filters are filters whose values ar

Re: [PR] fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait [datafusion-comet]

2025-03-31 Thread via GitHub
mbutrovich commented on code in PR #1578: URL: https://github.com/apache/datafusion-comet/pull/1578#discussion_r2021782172 ## spark/src/main/scala/org/apache/spark/sql/comet/CometNativeScanExec.scala: ## @@ -53,29 +55,50 @@ case class CometNativeScanExec( disableBucketedSca

[PR] chore: Run Comet tests for more Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove opened a new pull request, #1582: URL: https://github.com/apache/datafusion-comet/pull/1582 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes

Re: [PR] WIP: Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-03-31 Thread via GitHub
xudong963 commented on PR #15503: URL: https://github.com/apache/datafusion/pull/15503#issuecomment-2767823416 > I suggest modifying the existing API, not a new one. Which one do you think is suitable to modify, `statistics()`?

Re: [I] Allow UDFs to return custom `Diagnostic` [datafusion]

2025-03-31 Thread via GitHub
jsai28 commented on issue #15276: URL: https://github.com/apache/datafusion/issues/15276#issuecomment-2765086427 @eliaperantoni I had a follow up question to this regarding the `diagnose` trait function. We want to call the diagnose function during logical planning and pass in `ReturnTypeAr

Re: [PR] Improve spill performance: Disable re-validation of spilled files [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 commented on PR #15454: URL: https://github.com/apache/datafusion/pull/15454#issuecomment-2765960567 Thank you! I think it's almost ready. The only remaining task is to upgrade the other Arrow dependencies to 54.3.0, similar to what https://github.com/apache/datafusion/pull/14153 did.

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-31 Thread via GitHub
acking-you commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020615756 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -805,6 +811,47 @@ impl BinaryExpr { } } +/// Check if it meets the short-circuit condition

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
Dandandan commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020805433 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

Re: [PR] Fix typos [datafusion-sqlparser-rs]

2025-03-31 Thread via GitHub
iffyio merged PR #1785: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1785

Re: [I] Decorrelate scalar subqueries with more complex filter expressions [datafusion]

2025-03-31 Thread via GitHub
ctsk commented on issue #14554: URL: https://github.com/apache/datafusion/issues/14554#issuecomment-2765704387 Hey @duongcongtoai, I want to draw your attention to a follow-up paper on "Unnesting Arbitrary Queries": https://15799.courses.cs.cmu.edu/spring2025/papers/11-unnesting/neum

Re: [I] Improve `String` to `&str` conversions [datafusion]

2025-03-31 Thread via GitHub
Omega359 commented on issue #15498: URL: https://github.com/apache/datafusion/issues/15498#issuecomment-2766282665 Is there some particular hotspot or hotspots where conversion from String -> &str is happening?

[PR] fix: adjust doCanonicalize() and hashCode() for AQE, CometNativeScan uses DataSourceScanExec trait [datafusion-comet]

2025-03-31 Thread via GitHub
mbutrovich opened a new pull request, #1578: URL: https://github.com/apache/datafusion-comet/pull/1578 ## Which issue does this PR close? Addresses another failure in #1441. ## Rationale for this change `CometExecSuite.explain native plan` fails with `nati

Re: [PR] Allow type coersion of zero input arrays to nullary [datafusion]

2025-03-31 Thread via GitHub
timsaucer merged PR #15487: URL: https://github.com/apache/datafusion/pull/15487

[PR] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-03-31 Thread via GitHub
Shreyaskr1409 opened a new pull request, #15504: URL: https://github.com/apache/datafusion/pull/15504 ## Which issue does this PR close? - Closes #15369. ## Rationale for this change Adding a documentation example for `AggregateExprBuilder` making it easie

Re: [PR] feat: Add Aggregate UDF to FFI crate [datafusion]

2025-03-31 Thread via GitHub
timsaucer commented on PR #14775: URL: https://github.com/apache/datafusion/pull/14775#issuecomment-2766341477 Now that https://github.com/apache/datafusion/pull/15487 is resolved, this is unblocked.

Re: [PR] Change default `EXPLAIN` format in `datafusion-cli` to `tree` format [datafusion]

2025-03-31 Thread via GitHub
alamb merged PR #15427: URL: https://github.com/apache/datafusion/pull/15427

Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-03-31 Thread via GitHub
Shreyaskr1409 commented on PR #15504: URL: https://github.com/apache/datafusion/pull/15504#issuecomment-2766368569 The problems I am facing are: 1. I cannot use the datafusion crate inside the examples. I could not test the portion where I am building from AggregateExprBuilder. ![image](

Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-03-31 Thread via GitHub
Shreyaskr1409 commented on code in PR #15504: URL: https://github.com/apache/datafusion/pull/15504#discussion_r2021132605 ## datafusion/physical-expr/src/aggregate.rs: ## @@ -97,6 +97,167 @@ impl AggregateExprBuilder { /// Constructs an `AggregateFunctionExpr` from the buil

Re: [I] datafusion-cli: document reading partitioned parquet [datafusion]

2025-03-31 Thread via GitHub
marvelshan commented on issue #15309: URL: https://github.com/apache/datafusion/issues/15309#issuecomment-2766405116 I see that both `copy.slt` and `parquet.slt` contain the `*.parquet` wildcard. Do any changes need to be made here? -- This is an automated message from the Apache Git Serv

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
leoyvens commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766451717 I took some time to play with this, so I can provide an anecdotal report. I compared three setups: - This branch with `split_file_groups_by_statistics = true` - main with `spli

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2020712161 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

Re: [I] Dynamic pruning filters from TopK state (optimize `ORDER BY LIMIT` queries) [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2766489676 I plan to spend a nontrivial amount of time working on this with @adriangb this week -- This is an automated message from the Apache Git Service. To respond to the message, pl

[PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 opened a new pull request, #15501: URL: https://github.com/apache/datafusion/pull/15501 ## Which issue does this PR close? - Closes #. ## Rationale for this change Recently we have detected multiple bugs in out-of-core sorting, and there are m

Re: [PR] added fallback using reflection for backward-compatibility [datafusion-comet]

2025-03-31 Thread via GitHub
YanivKunda commented on code in PR #1573: URL: https://github.com/apache/datafusion-comet/pull/1573#discussion_r2020709855 ## spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimCometScanExec.scala: ## @@ -55,15 +55,48 @@ trait ShimCometScanExec { protected def isNe

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-31 Thread via GitHub
Dandandan commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020695065 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -805,6 +811,47 @@ impl BinaryExpr { } } +/// Check if it meets the short-circuit condition +

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-31 Thread via GitHub
acking-you commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2020726805 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -805,6 +811,47 @@ impl BinaryExpr { } } +/// Check if it meets the short-circuit condition

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-03-31 Thread via GitHub
goldmedal commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2766551662 Based on https://github.com/goldmedal/datafusion/pull/3, I ran some benchmarks (`clieckbench_1`, `h2o_medium`) for it. `feat_zero-copy-hash-agg-false` is the branch that

Re: [PR] added fallback using reflection for backward-compatibility [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove merged PR #1573: URL: https://github.com/apache/datafusion-comet/pull/1573 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@

Re: [I] NoSuchMethodError: PartitionedFileUtil$.splitFiles when running with Spark 3.5.4 [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove closed issue #1572: NoSuchMethodError: PartitionedFileUtil$.splitFiles when running with Spark 3.5.4 URL: https://github.com/apache/datafusion-comet/issues/1572 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use th

Re: [PR] datafusion-cli: document reading partitioned parquet [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on code in PR #15505: URL: https://github.com/apache/datafusion/pull/15505#discussion_r2021191846 ## docs/source/user-guide/cli/datasources.md: ## @@ -95,8 +95,7 @@ additional configuration options. # `CREATE EXTERNAL TABLE` It is also possible to create a

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
leoyvens commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766568954 I took some time to play with this, so I can provide an anecdotal report. **Conclusion** In my setup, this PR is a clear win for execution times. **Configurations**

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-31 Thread via GitHub
mkgada commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2766575025 @andygrove sorry for the confusion here, my cluster is on Spark 3.5.0, and Comet 0.7.0 prebuilt JAR did not work on it -- This is an automated message from the Ap

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
leoyvens commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2766577490 In terms of default behaviour, I see there are planning time concerns relative to this PR. For cases like mine, where files are lexicographically sorted, just changing the way the d
