Re: [PR] feat: Add config `max_temp_directory_size` to limit max disk usage for spilling queries [datafusion]

2025-03-31 Thread via GitHub
Copilot commented on code in PR #15520: URL: https://github.com/apache/datafusion/pull/15520#discussion_r2022259546 ## datafusion/execution/src/disk_manager.rs: ## @@ -164,6 +216,50 @@ impl RefCountedTempFile { pub fn inner(&self) -> &NamedTempFile { &self.tempfile

[PR] feat: Add config max_temp_directory_size to limit max disk usage for spilling queries [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 opened a new pull request, #15520: URL: https://github.com/apache/datafusion/pull/15520 ## Which issue does this PR close? - Closes #15358 ## Rationale for this change See the rationale part of the first attempt PR https://github.com/apache/datafusion

Re: [PR] feat(sql): add diagnostic for wrong number of function arguments [datafusion]

2025-03-31 Thread via GitHub
eliaperantoni commented on PR #15490: URL: https://github.com/apache/datafusion/pull/15490#issuecomment-2768341633 Also, you made the PR ready for review again after @alamb made it a draft, without first committing any new changes? -- This is an automated message from the Apache Git Serv

Re: [PR] feat(sql): add diagnostic for wrong number of function arguments [datafusion]

2025-03-31 Thread via GitHub
eliaperantoni commented on code in PR #15490: URL: https://github.com/apache/datafusion/pull/15490#discussion_r2022232702 ## datafusion/common/src/utils/mod.rs: ## @@ -979,16 +981,19 @@ pub fn take_function_args( ) -> Result<[T; N]> { let args = args.into_iter().collect::>

Re: [PR] WIP: Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-03-31 Thread via GitHub
berkaysynnada commented on PR #15503: URL: https://github.com/apache/datafusion/pull/15503#issuecomment-2768177113 > > I suggest modifying the existing API, not a new one. > > Which one do you think is suitable to modify, `statistics()`? yes

Re: [PR] Support computing statistics for FileGroup [datafusion]

2025-03-31 Thread via GitHub
xudong963 merged PR #15432: URL: https://github.com/apache/datafusion/pull/15432

[I] Follow up #15432 [datafusion]

2025-03-31 Thread via GitHub
xudong963 opened a new issue, #15519: URL: https://github.com/apache/datafusion/issues/15519 I suggest writing this logic with a state machine, which polls files at one point only. I believe FSMs would increase the code quality and understandability _Originally posted

Re: [PR] fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura merged PR #1578: URL: https://github.com/apache/datafusion-comet/pull/1578

Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 commented on PR #15501: URL: https://github.com/apache/datafusion/pull/15501#issuecomment-2768105998 Thank you for the review! > I think we should consider the random seed along with the runtime of this test, but otherwise it looks really nice It's using a random sta

[I] Enable sort query fuzzing with limited memory [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 opened a new issue, #15517: URL: https://github.com/apache/datafusion/issues/15517 ### Is your feature request related to a problem or challenge? A new sort query fuzzer for out-of-core sorting has been added in https://github.com/apache/datafusion/pull/15501. However,

[PR] Add utf8view benchmark for aggregate topk [datafusion]

2025-03-31 Thread via GitHub
zhuqi-lucas opened a new pull request, #15518: URL: https://github.com/apache/datafusion/pull/15518 ## Which issue does this PR close? Since we have merged the utf8view support for aggregate topk, this PR adds the corresponding benchmark code. https://github.com/apach

Re: [PR] chore(deps): bump blake3 from 1.7.0 to 1.8.0 [datafusion]

2025-03-31 Thread via GitHub
xudong963 merged PR #15502: URL: https://github.com/apache/datafusion/pull/15502

Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-03-31 Thread via GitHub
2010YOUY01 commented on code in PR #15501: URL: https://github.com/apache/datafusion/pull/15501#discussion_r2022130464 ## datafusion/core/tests/fuzz_cases/sort_query_fuzz.rs: ## @@ -0,0 +1,635 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contribu

Re: [PR] Test: configuration fuzzer for (external) sort queries [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15501: URL: https://github.com/apache/datafusion/pull/15501#discussion_r2021936887 ## datafusion/core/tests/fuzz_cases/sort_query_fuzz.rs: ## @@ -0,0 +1,635 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor l

[I] Improve performance of native scan [datafusion-comet]

2025-03-31 Thread via GitHub
wForget opened a new issue, #1586: URL: https://github.com/apache/datafusion-comet/issues/1586 ### What is the problem the feature request solves? I found several possible optimizations for native scan by profiling CometReadBenchmark. 1. Snappy decompression performance is poor

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2022103675 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -186,6 +235,90 @@ impl TopK { Ok(()) } +fn calculate_dynamic_filters( +thresholds:

[PR] feat: respect `batchSize/workerThreads/blockingThreads` configurations for native_iceberg_compat scan [datafusion-comet]

2025-03-31 Thread via GitHub
wForget opened a new pull request, #1587: URL: https://github.com/apache/datafusion-comet/pull/1587 ## Which issue does this PR close? Closes #1571. ## Rationale for this change Respect `batchSize/workerThreads/blockingThreads` configurations for native_iceberg_c

Re: [I] Perf: Dataframe with_column and with_column_renamed are slow [datafusion]

2025-03-31 Thread via GitHub
Omega359 closed issue #14563: Perf: Dataframe with_column and with_column_renamed are slow URL: https://github.com/apache/datafusion/issues/14563

[PR] Minor: clone and debug for FileSinkConfig [datafusion]

2025-03-31 Thread via GitHub
jayzhan211 opened a new pull request, #15516: URL: https://github.com/apache/datafusion/pull/15516 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes test

[PR] Add GreptimeDB to the "Users" in README [datafusion-sqlparser-rs]

2025-03-31 Thread via GitHub
MichaelScofield opened a new pull request, #1788: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1788 GreptimeDB uses sqlparser in its "RANGE" syntax, and various other places.

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
suibianwanwank commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2767929587 >And follow the model of evaluation for [Literal](https://github.com/apache/datafusion/blob/main/datafusion/physical-expr/src/expressions/literal.rs) > >Except you would

Re: [I] Improve `String` to `&str` conversions [datafusion]

2025-03-31 Thread via GitHub
matthewmturner commented on issue #15498: URL: https://github.com/apache/datafusion/issues/15498#issuecomment-2767914670 I have a vague recollection that `Arc` was only faster than cloning a `String` when there was very little contention due to locking used by `Arc` internally. If there was a

Re: [PR] Update user guide to note decimal is not experimental anymore [datafusion]

2025-03-31 Thread via GitHub
Jiashu-Hu closed pull request #15514: Update user guide to note decimal is not experimental anymore URL: https://github.com/apache/datafusion/pull/15514

Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]

2025-03-31 Thread via GitHub
qstommyshu commented on code in PR #15499: URL: https://github.com/apache/datafusion/pull/15499#discussion_r2021991253 ## datafusion/sql/tests/sql_integration.rs: ## @@ -157,219 +157,282 @@ fn parse_ident_normalization() { #[test] fn select_no_relation() { -quick_test( -

Re: [I] common_sub_expression_eliminate internal error [datafusion]

2025-03-31 Thread via GitHub
andygrove commented on issue #3418: URL: https://github.com/apache/datafusion/issues/3418#issuecomment-2767628869 > Is this still an issue [@andygrove](https://github.com/andygrove) ? No, I will go ahead and close this. Thanks.

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-31 Thread via GitHub
xudong963 commented on PR #15473: URL: https://github.com/apache/datafusion/pull/15473#issuecomment-2767850478 Thank you, @leoyvens! I plan to add tests for the PR in the next two days, and then we can continue to move it forward. Thanks for all your review!

Re: [PR] WIP: Add `statistics_by_partition API` to ExecutionPlan [datafusion]

2025-03-31 Thread via GitHub
xudong963 commented on PR #15503: URL: https://github.com/apache/datafusion/pull/15503#issuecomment-2767823416 > I suggest modifying the existing API, not a new one. Which one do you think is suitable to modify, `statistics()`?

Re: [I] Add some DataFrame method(s) to combine two inputs where the schema can be different [datafusion]

2025-03-31 Thread via GitHub
alamb closed issue #12650: Add some DataFrame method(s) to combine two inputs where the schema can be different URL: https://github.com/apache/datafusion/issues/12650

Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-03-31 Thread via GitHub
jayzhan211 commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2767724669 > There have been a number of issues where benchmarks stopped working and no one noticed until someone happened to try and run them Instead of running the benchmark, how

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-31 Thread via GitHub
parthchandra commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2767719365 > [@andygrove](https://github.com/andygrove) sorry for the confusion here, my cluster is on Spark 3.5.0, and Comet 0.7.0 prebuilt JAR did not work on it > > [@pa

Re: [I] Improve `String` to `&str` conversions [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #15498: URL: https://github.com/apache/datafusion/issues/15498#issuecomment-2767680544 > @alamb would you mind giving an example on Arc ? Here is one example https://github.com/apache/datafusion/blob/923bfb7fc7cf1718522f572b74f3756d02652933/datafusion/c

Re: [PR] Extract tokio runtime creation from hot loop in benchmarks [datafusion]

2025-03-31 Thread via GitHub
alamb merged PR #15508: URL: https://github.com/apache/datafusion/pull/15508

Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]

2025-03-31 Thread via GitHub
alamb commented on PR #15499: URL: https://github.com/apache/datafusion/pull/15499#issuecomment-2767687638 FYI @blaginin

Re: [PR] Migrate `datafusion/sql` tests to insta, part2 [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15499: URL: https://github.com/apache/datafusion/pull/15499#discussion_r2021931938 ## datafusion/sql/tests/sql_integration.rs: ## @@ -157,219 +157,282 @@ fn parse_ident_normalization() { #[test] fn select_no_relation() { -quick_test( -

Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2021921381 ## native/core/src/execution/shuffle/shuffle_writer.rs: ## @@ -667,175 +740,322 @@ impl Debug for ShuffleRepartitioner { } } -/// The status of appen

Re: [PR] Add query to extended clickbench suite for "complex filter" [datafusion]

2025-03-31 Thread via GitHub
alamb commented on PR #15500: URL: https://github.com/apache/datafusion/pull/15500#issuecomment-2767682615 > I think we may need to augment the github extended tests to run all the benchmarks, not to validate the results (though we can do that later) but to just make sure the benchmarks sti

Re: [PR] fix: improve CSV path handling and error handling in substrait example [datafusion-python]

2025-03-31 Thread via GitHub
timsaucer commented on PR #1073: URL: https://github.com/apache/datafusion-python/pull/1073#issuecomment-2766787598 I don't think these changes make a substantive difference to the example. One of the main goals of the examples is to be as simple and clear as possible for new users to underst

Re: [I] datafusion-cli: document reading partitioned parquet [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on issue #15309: URL: https://github.com/apache/datafusion/issues/15309#issuecomment-2766488794 Hmm are you saying that they're using wildcards which we don't support, but they happen to work for local filesystem so they're getting away with it? If so yes I think it's wor

Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove commented on code in PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#discussion_r2021913440 ## native/core/src/execution/shuffle/shuffle_writer.rs: ## @@ -667,175 +740,322 @@ impl Debug for ShuffleRepartitioner { } } -/// The status of appen

[PR] update note [datafusion]

2025-03-31 Thread via GitHub
Jiashu-Hu opened a new pull request, #15515: URL: https://github.com/apache/datafusion/pull/15515 ## Which issue does this PR close? - [Closes #15464](https://github.com/apache/datafusion/issues/15464). ## Rationale for this change The Decimal support in DataF

[PR] Update user guide to note decimal is not experimental anymore [datafusion]

2025-03-31 Thread via GitHub
Jiashu-Hu opened a new pull request, #15514: URL: https://github.com/apache/datafusion/pull/15514 ## Which issue does this PR close? - [Closes #15464](https://github.com/apache/datafusion/issues/15464). ## Rationale for this change The Decimal support in DataF

[PR] chore: Run Comet tests for more Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove opened a new pull request, #1582: URL: https://github.com/apache/datafusion-comet/pull/1582 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes

Re: [PR] docs: Update supported Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura merged PR #1580: URL: https://github.com/apache/datafusion-comet/pull/1580

Re: [PR] docs: Update supported Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura commented on PR #1580: URL: https://github.com/apache/datafusion-comet/pull/1580#issuecomment-2767522455 Thanks @andygrove

Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove merged PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566

Re: [I] Push down data filter for native_iceberg_compat [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove closed issue #1562: Push down data filter for native_iceberg_compat URL: https://github.com/apache/datafusion-comet/issues/1562

Re: [PR] docs: Update supported Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove commented on code in PR #1580: URL: https://github.com/apache/datafusion-comet/pull/1580#discussion_r2021819064 ## docs/source/user-guide/installation.md: ## @@ -30,22 +30,32 @@ Make sure the following requirements are met and software installed on your mach ### Su

Re: [PR] fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura commented on code in PR #1578: URL: https://github.com/apache/datafusion-comet/pull/1578#discussion_r2021779171 ## spark/src/main/scala/org/apache/spark/sql/comet/CometNativeScanExec.scala: ## @@ -53,29 +55,50 @@ case class CometNativeScanExec( disableBucke

Re: [PR] fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait [datafusion-comet]

2025-03-31 Thread via GitHub
mbutrovich commented on code in PR #1578: URL: https://github.com/apache/datafusion-comet/pull/1578#discussion_r2021782172 ## spark/src/main/scala/org/apache/spark/sql/comet/CometNativeScanExec.scala: ## @@ -53,29 +55,50 @@ case class CometNativeScanExec( disableBucketedSca

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021746142 ## datafusion/core/src/datasource/physical_plan/parquet.rs: ## @@ -1769,13 +1775,13 @@ mod tests { let sql = "select * from base_table where name='test02'";

Re: [PR] fix: make register_object_store use same session_env as file scan [datafusion-comet]

2025-03-31 Thread via GitHub
kazuyukitanimura merged PR #1555: URL: https://github.com/apache/datafusion-comet/pull/1555

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021737787 ## datafusion/physical-plan/src/filter.rs: ## @@ -433,6 +433,22 @@ impl ExecutionPlan for FilterExec { } try_embed_projection(projection, self)

Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-03-31 Thread via GitHub
Omega359 commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2767313655 Sure. https://github.com/apache/datafusion/issues/5504

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021713556 ## datafusion/physical-plan/src/filter.rs: ## @@ -433,6 +433,22 @@ impl ExecutionPlan for FilterExec { } try_embed_projection(projection, self)

Re: [I] Run all benchmarks on merge to main branch [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #15511: URL: https://github.com/apache/datafusion/issues/15511#issuecomment-2767310181 I would personally suggest that if we are going to run the benchmarks, we should also be actively tracking them / making sure they are measuring something useful.

[I] [Epic] A collection of dynamic filtering related items [datafusion]

2025-03-31 Thread via GitHub
alamb opened a new issue, #15512: URL: https://github.com/apache/datafusion/issues/15512 ### Is your feature request related to a problem or challenge? This is a collection of various items related to "dynamic filtering" Roughly speaking, dynamic filters are filters whose values ar

Re: [PR] Add documentation example for `AggregateExprBuilder` [datafusion]

2025-03-31 Thread via GitHub
berkaysynnada commented on PR #15504: URL: https://github.com/apache/datafusion/pull/15504#issuecomment-2767270135 Very nice example, thank you @Shreyaskr1409. Perhaps we can make this in datafusion-examples as well?

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021674359 ## datafusion/datasource-parquet/src/source.rs: ## @@ -349,11 +337,13 @@ impl ParquetSource { } /// Optional reference to this parquet scan's pruning

[I] Run all benchmarks on merge to main branch [datafusion]

2025-03-31 Thread via GitHub
Omega359 opened a new issue, #15511: URL: https://github.com/apache/datafusion/issues/15511 ### Is your feature request related to a problem or challenge? There have been a number of issues where benchmarks stopped working and no one noticed until someone happened to try and run them.

Re: [PR] Add query to extended clickbench suite for "complex filter" [datafusion]

2025-03-31 Thread via GitHub
Omega359 commented on PR #15500: URL: https://github.com/apache/datafusion/pull/15500#issuecomment-2767230210 I think we may need to augment the github extended tests to run all the benchmarks, not to validate the results (though we can do that later) but to just make sure the benchmarks st

Re: [PR] Add query to extended clickbench suite for "complex filter" [datafusion]

2025-03-31 Thread via GitHub
Omega359 commented on PR #15500: URL: https://github.com/apache/datafusion/pull/15500#issuecomment-2767216080 I am seeing the sql_planner benchmark now failing after this was merged. ``` Benchmarking physical_plan_clickbench_q50 Benchmarking physical_plan_clickbench_q50: Warming up f

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2021619914 ## datafusion/datasource-parquet/src/source.rs: ## @@ -349,11 +337,13 @@ impl ParquetSource { } /// Optional reference to this parquet scan's pruning pre

[PR] WIP: Aggregate UDF FFI [datafusion]

2025-03-31 Thread via GitHub
CrystalZhou0529 opened a new pull request, #15510: URL: https://github.com/apache/datafusion/pull/15510 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes

Re: [I] Update ClickBench benchmarks with DataFusion `46.0.0` (When Published) [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #14587: URL: https://github.com/apache/datafusion/issues/14587#issuecomment-2767092217 We will probably also need to remove the call to `to_timestamp_seconds` - https://github.com/apache/datafusion/issues/15465 As well as - https://github.com/apache/dataf

Re: [PR] chore: Run Comet tests for more Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
codecov-commenter commented on PR #1582: URL: https://github.com/apache/datafusion-comet/pull/1582#issuecomment-2767137360 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1582?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

[I] Add documentation for benchmarking Comet in AWS with S3 data source [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove opened a new issue, #1583: URL: https://github.com/apache/datafusion-comet/issues/1583 ### What is the problem the feature request solves? We have focused much of our benchmarking effort on single-node benchmarks with local data. This does not reflect how Spark is generally

Re: [PR] MySQL create table options: optional DEFAULT keyword, CHARACTER SET longhand [datafusion-sqlparser-rs]

2025-03-31 Thread via GitHub
MohamedAbdeen21 closed pull request #1783: MySQL create table options: optional DEFAULT keyword, CHARACTER SET longhand URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1783

Re: [I] Update ClickBench queries to avoid ::INT::DATE casting [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #15509: URL: https://github.com/apache/datafusion/issues/15509#issuecomment-2767089209 It is likely that `::INT::DATE` is left over from some old version of DataFusion that didn't support the correct casting. I think we could just remove them directly.

Re: [PR] Decimal type support for `to_timestamp` [datafusion]

2025-03-31 Thread via GitHub
alamb merged PR #15486: URL: https://github.com/apache/datafusion/pull/15486

Re: [I] Update ClickBench queries to avoid `to_timestamp_seconds` [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #15465: URL: https://github.com/apache/datafusion/issues/15465#issuecomment-2767087855 > Another discrepancy I found in the queries is the "EventDate"::INT::DATE" casting. Is this something we could remove as well? Maybe would be good to look at all further that are

Re: [PR] fix: adjust CometNativeScan's doCanonicalize and hashCode for AQE, use DataSourceScanExec trait [datafusion-comet]

2025-03-31 Thread via GitHub
codecov-commenter commented on PR #1578: URL: https://github.com/apache/datafusion-comet/pull/1578#issuecomment-2767086468 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1578?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Extract tokio runtime creation from hot loop in benchmarks [datafusion]

2025-03-31 Thread via GitHub
Omega359 commented on PR #15508: URL: https://github.com/apache/datafusion/pull/15508#issuecomment-2767068312 I'll fix the conflicts in a bit, running benchmarks atm.

Re: [I] Improve `String` to `&str` conversions [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #15498: URL: https://github.com/apache/datafusion/issues/15498#issuecomment-2767053406 I think a better / safer strategy would be to avoid `String` at all in the first place (we can use `Arc` perhaps in places we are copying Strings > (Actually a is never used

Re: [PR] Add query to extended clickbench suite for "complex filter" [datafusion]

2025-03-31 Thread via GitHub
alamb merged PR #15500: URL: https://github.com/apache/datafusion/pull/15500

Re: [PR] FIX : some benchmarks are failing [datafusion]

2025-03-31 Thread via GitHub
alamb merged PR #15367: URL: https://github.com/apache/datafusion/pull/15367

Re: [I] distinct_query_sql benchmark is failing [datafusion]

2025-03-31 Thread via GitHub
alamb closed issue #15213: distinct_query_sql benchmark is failing URL: https://github.com/apache/datafusion/issues/15213

Re: [I] Building project takes a *long* time (esp compilation time for `datafusion` core crate) [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #13814: URL: https://github.com/apache/datafusion/issues/13814#issuecomment-2767041075 I ran some profiling recently and it seems like catalog is also getting pretty long to compile ![Image](https://github.com/user-attachments/assets/57e5e6da-88a6-431a-9810-5

Re: [I] Migrate optimizer tests to `insta` [datafusion]

2025-03-31 Thread via GitHub
alamb closed issue #15396: Migrate optimizer tests to `insta` URL: https://github.com/apache/datafusion/issues/15396

Re: [PR] refactor: Move `Memtable` to catalog [datafusion]

2025-03-31 Thread via GitHub
alamb commented on PR #15459: URL: https://github.com/apache/datafusion/pull/15459#issuecomment-2767041406 Maybe as a follow on PR we can make a new crate `catalog-memory` to hold the in memory catalog implementations from the catalog traits. Something I noticed recently while buildin

Re: [PR] Support computing statistics for FileGroup [datafusion]

2025-03-31 Thread via GitHub
alamb commented on code in PR #15432: URL: https://github.com/apache/datafusion/pull/15432#discussion_r2021528103 ## datafusion/core/src/datasource/statistics.rs: ## @@ -217,3 +354,183 @@ fn set_min_if_lesser( _ => {} } } + +#[cfg(test)] +mod tests { +use supe

Re: [PR] refactor: Move `Memtable` to catalog [datafusion]

2025-03-31 Thread via GitHub
alamb merged PR #15459: URL: https://github.com/apache/datafusion/pull/15459

Re: [I] [EPIC] More Subquery support [datafusion]

2025-03-31 Thread via GitHub
alamb commented on issue #5483: URL: https://github.com/apache/datafusion/issues/5483#issuecomment-2766954716 > Where to start in understanding DataFusion’s query planning and optimization workflow. I think starting at https://datafusion.apache.org/ and https://datafusion.ap

Re: [PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove merged PR #1577: URL: https://github.com/apache/datafusion-comet/pull/1577

Re: [I] Improve `String` to `&str` conversions [datafusion]

2025-03-31 Thread via GitHub
comphead commented on issue #15498: URL: https://github.com/apache/datafusion/issues/15498#issuecomment-2766908135 > Is there some particular hotspot or hotspots where conversion from String -> &str is happening? Having applied this we likely can get a performance benefit in planning,

[I] bench: Extract tokio Runtime creation from bench functions [datafusion]

2025-03-31 Thread via GitHub
Omega359 opened a new issue, #15507: URL: https://github.com/apache/datafusion/issues/15507 ### Is your feature request related to a problem or challenge? Noticed in https://github.com/apache/datafusion/pull/15367#discussion_r2012933341, a lot of benchmarks are inadvertently includin

Re: [I] [EPIC] More Subquery support [datafusion]

2025-03-31 Thread via GitHub
PhaniKulkarni commented on issue #5483: URL: https://github.com/apache/datafusion/issues/5483#issuecomment-2766774770 Hi, I am interested in contributing to Apache DataFusion, specifically in implementing correlated subquery support. I have experience in SQL Engines, Data Migration a

Re: [PR] chore: Making comet native operators write spill files to spark local dir [datafusion-comet]

2025-03-31 Thread via GitHub
codecov-commenter commented on PR #1581: URL: https://github.com/apache/datafusion-comet/pull/1581#issuecomment-2766806984 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1581?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-03-31 Thread via GitHub
goldmedal commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2766751553 I'm considering another approach. Maybe I shouldn't use `filter_record_batch` 🤔. It filters all the columns iteratively. I should filter the row when the accumulator `update_batch

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-03-31 Thread via GitHub
zebsme commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2766784298 > I'm considering another approach. Maybe I shouldn't use `filter_record_batch` 🤔. It filters all the columns iteratively. I should filter the row when the accumulator `merge_batch`

Re: [PR] Fix: Snowflake ALTER SESSION cannot be followed by other statements. [datafusion-sqlparser-rs]

2025-03-31 Thread via GitHub
iffyio merged PR #1786: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1786

Re: [PR] feat: Add union_by_name, union_by_name_distinct to DataFrame api [datafusion]

2025-03-31 Thread via GitHub
alamb commented on PR #15489: URL: https://github.com/apache/datafusion/pull/15489#issuecomment-2766766045 Thanks again @Omega359 and @berkaysynnada

Re: [I] Snowflake ALTER SESSION cannot be followed by other statements. [datafusion-sqlparser-rs]

2025-03-31 Thread via GitHub
iffyio closed issue #1775: Snowflake ALTER SESSION cannot be followed by other statements. URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1775

[PR] WIP: Test enabling Parquet filter pushdown with parquet caching page cache reader [datafusion]

2025-03-31 Thread via GitHub
alamb opened a new pull request, #15506: URL: https://github.com/apache/datafusion/pull/15506 ## Which issue does this PR close? - part of https://github.com/apache/arrow-rs/issues/7363 - related to https://github.com/apache/datafusion/issues/3463 ## Rationale for this change

Re: [PR] datafusion-cli: document reading partitioned parquet [datafusion]

2025-03-31 Thread via GitHub
adriangb commented on code in PR #15505: URL: https://github.com/apache/datafusion/pull/15505#discussion_r2021191846 ## docs/source/user-guide/cli/datasources.md: ## @@ -95,8 +95,7 @@ additional configuration options. # `CREATE EXTERNAL TABLE` It is also possible to create a

Re: [PR] Enable double-dot-notation for mssql. [datafusion-sqlparser-rs]

2025-03-31 Thread via GitHub
iffyio merged PR #1787: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1787

[PR] chore: Making comet native operators write spill files to spark local dir [datafusion-comet]

2025-03-31 Thread via GitHub
Kontinuation opened a new pull request, #1581: URL: https://github.com/apache/datafusion-comet/pull/1581 ## Which issue does this PR close? Closes #. ## Rationale for this change Comet native operators always write spill files into the default tmp directory, this

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-31 Thread via GitHub
goldmedal commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2766646972 I did the benchmarks for HashAggregate https://github.com/apache/datafusion/issues/15383#issuecomment-2766551662 It seems that HashAggregate is slower in the selection vector mod

[PR] docs: Update supported Spark versions [datafusion-comet]

2025-03-31 Thread via GitHub
andygrove opened a new pull request, #1580: URL: https://github.com/apache/datafusion-comet/pull/1580 ## Which issue does this PR close? Part of https://github.com/apache/datafusion-comet/issues/1579 ## Rationale for this change ## What changes are include
