Re: [PR] chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-09 Thread via GitHub
tglanz commented on code in PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#discussion_r2194231874 ## native/spark-expr/src/bloom_filter/bloom_filter_might_contain.rs: ## @@ -0,0 +1,196 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-09 Thread via GitHub
tglanz commented on code in PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#discussion_r2194237180 ## native/spark-expr/src/bloom_filter/bloom_filter_might_contain.rs: ## @@ -0,0 +1,209 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] Fix sqllogictests test running compatibility (ignore `--test-threads`) [datafusion]

2025-07-09 Thread via GitHub
mjgarton commented on code in PR #16694: URL: https://github.com/apache/datafusion/pull/16694#discussion_r2194256906 ## datafusion/sqllogictest/bin/sqllogictests.rs: ## @@ -689,6 +689,12 @@ struct Options { help = "IGNORED (for compatibility with built-in rust test runn

[I] Support memory profiling in benchmarks [datafusion]

2025-07-09 Thread via GitHub
2010YOUY01 opened a new issue, #16720: URL: https://github.com/apache/datafusion/issues/16720 ### Is your feature request related to a problem or challenge? Currently in DataFusion's benchmark: it only measures execution time. It would be helpful to also measure the total memory used.

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-09 Thread via GitHub
UBarney commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2194337073 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter:

Re: [I] Support memory profiling in benchmarks [datafusion]

2025-07-09 Thread via GitHub
ding-young commented on issue #16720: URL: https://github.com/apache/datafusion/issues/16720#issuecomment-3051587282 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To

Re: [I] Support memory profiling in benchmarks [datafusion]

2025-07-09 Thread via GitHub
ding-young commented on issue #16720: URL: https://github.com/apache/datafusion/issues/16720#issuecomment-3051614458 I think it would be super helpful to include max RSS or other memory metrics in the benchmark results. Recently, I've been manually profiling memory usage with tools like hea

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
xudong963 commented on code in PR #16716: URL: https://github.com/apache/datafusion/pull/16716#discussion_r2194413897 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -928,6 +929,55 @@ pub(crate) fn build_batch_from_indices( Ok(RecordBatch::try_new(Arc::new(schema.clon

Re: [I] Optimized spill file format [datafusion]

2025-07-09 Thread via GitHub
ding-young commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3051639268 It seems like comet has removed their custom BatchReader/Writer (see [PR#1703](https://github.com/apache/datafusion-comet/pull/1703/files)).

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
Dandandan commented on code in PR #16716: URL: https://github.com/apache/datafusion/pull/16716#discussion_r2194855680 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -928,6 +929,55 @@ pub(crate) fn build_batch_from_indices( Ok(RecordBatch::try_new(Arc::new(schema.clon

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
Dandandan commented on code in PR #16716: URL: https://github.com/apache/datafusion/pull/16716#discussion_r2194867505 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -1498,6 +1498,23 @@ impl HashJoinStream { let timer = self.join_metrics.join_time.timer();

Re: [PR] Use the `test-threads` option in sqllogictests [datafusion]

2025-07-09 Thread via GitHub
mjgarton commented on PR #16722: URL: https://github.com/apache/datafusion/pull/16722#issuecomment-3052434316 Closes #16723

Re: [I] [Discussion]: show more info for `OutputRequirementExec` display [datafusion]

2025-07-09 Thread via GitHub
Loaki07 commented on issue #16725: URL: https://github.com/apache/datafusion/issues/16725#issuecomment-3052433970 take

[PR] Sd reserverd kws tables alias [datafusion-sqlparser-rs]

2025-07-09 Thread via GitHub
yoavcloud opened a new pull request, #1934: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1934 Following the same pattern of `Dialect::is_select_item_alias` for table aliases. Added handling for Snowflake-specific keywords when used as an implicit table alias (i.e. without `AS

[PR] Improved window and aggregate function signature [datafusion-python]

2025-07-09 Thread via GitHub
timsaucer opened a new pull request, #1187: URL: https://github.com/apache/datafusion-python/pull/1187 # Which issue does this PR close? None # Rationale for this change This is a relatively minor change that allows users to pass a single expression for `order_by` and `

[PR] feat: Optimize `collect_left_input` processing [datafusion]

2025-07-09 Thread via GitHub
jonathanc-n opened a new pull request, #16727: URL: https://github.com/apache/datafusion/pull/16727 ## Which issue does this PR close? - Closes #. ## Rationale for this change I tried some optimizations in #16719 but there is a large slow down with concatenating arra

Re: [PR] feat: Optimize `collect_left_input` processing [datafusion]

2025-07-09 Thread via GitHub
jonathanc-n commented on PR #16727: URL: https://github.com/apache/datafusion/pull/16727#issuecomment-3052923076 @alamb Would you be able to run some benchmarks on this please?

Re: [PR] feat: Optimize `collect_left_input` processing [datafusion]

2025-07-09 Thread via GitHub
jonathanc-n commented on code in PR #16727: URL: https://github.com/apache/datafusion/pull/16727#discussion_r2195211529 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -980,6 +980,17 @@ async fn collect_left_input( }) .await?; +if batches.len()

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052934882 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_test_pushdown Benchmark parquet.json ┏━

[PR] [ISSUE-1277] fix: devcontainer protoc:1 feature url [datafusion-ballista]

2025-07-09 Thread via GitHub
Almaz-KG opened a new pull request, #1278: URL: https://github.com/apache/datafusion-ballista/pull/1278 # Which issue does this PR close? Closes #1277. # Rationale for this change # What changes are included in this PR? # Are there any user-facing

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-09 Thread via GitHub
2010YOUY01 commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2194486005 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -689,6 +674,8 @@ enum NestedLoopJoinStreamState { ProcessProbeBatch(RecordBatch), ///

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052072928 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3052072768 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_update_arrow_56.0.0 Benchmark sort_tpch1.json ┏

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
jonathanc-n commented on code in PR #16716: URL: https://github.com/apache/datafusion/pull/16716#discussion_r2194796765 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -928,6 +929,55 @@ pub(crate) fn build_batch_from_indices( Ok(RecordBatch::try_new(Arc::new(schema.cl

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
zhuqi-lucas commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052815314 > One thing that I think has caused us problems is judging any improvements to pushdown based on not regressing performance when pushdown is enabled vs not. > > However, th

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052818844 > One thing that I think has caused us problems is judging any improvements to pushdown based on not regressing performance when pushdown is enabled vs not. > > However, this mak

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052814944 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
zhuqi-lucas commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052805272 > > The adaptive selection will help Q30 and Q31 from previous PR result: > > [apache/arrow-rs#7454 (comment)](https://github.com/apache/arrow-rs/pull/7454#issuecomment-2840770

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052856712 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_test_pushdown Benchmark clickbench_extended.json

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052856826 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3052003404 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

[I] Discussion: public some aggregate related function and struct [datafusion]

2025-07-09 Thread via GitHub
haohuaijin opened a new issue, #16724: URL: https://github.com/apache/datafusion/issues/16724 ### Is your feature request related to a problem or challenge? DataFusion employs a two-phase aggregation process. In the first phase, it produces intermediate results, and in the second phas
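
The two-phase aggregation mentioned in this issue can be illustrated with a minimal, self-contained sketch. This is not DataFusion's API: `AvgState`, `partial_avg`, and `final_avg` are hypothetical names chosen for illustration, and the sketch only assumes that the first phase emits a mergeable intermediate state per partition and the second phase combines those states into the final value.

```rust
/// Hypothetical intermediate state for an AVG aggregate (illustration only,
/// not a DataFusion type). Each partition emits one of these in phase one.
#[derive(Debug, Clone, Copy, PartialEq)]
struct AvgState {
    sum: f64,
    count: u64,
}

/// Phase one: reduce a single partition's values to an intermediate state.
fn partial_avg(values: &[f64]) -> AvgState {
    AvgState {
        sum: values.iter().sum(),
        count: values.len() as u64,
    }
}

/// Phase two: merge the intermediate states from all partitions and
/// produce the final average, or `None` if no rows were seen.
fn final_avg(states: &[AvgState]) -> Option<f64> {
    let sum: f64 = states.iter().map(|s| s.sum).sum();
    let count: u64 = states.iter().map(|s| s.count).sum();
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}
```

Carrying `(sum, count)` rather than a per-partition average is what makes the second phase correct when partitions have different sizes.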

[I] Implement --test-threads CLI argument for sqllogictest runner [datafusion]

2025-07-09 Thread via GitHub
alamb opened a new issue, #16723: URL: https://github.com/apache/datafusion/issues/16723 Another thing we could do (as a follow on PR) is to actually thread this through to the runner: https://github.com/apache/datafusion/blob/6b00a0a0cf1bb449631b57535c06bbf99583/datafusion/sqllo

Re: [I] Improve Display format of `BoundedWindowAggExec` [datafusion]

2025-07-09 Thread via GitHub
alamb closed issue #15758: Improve Display format of `BoundedWindowAggExec` URL: https://github.com/apache/datafusion/issues/15758

Re: [PR] Improve display format of BoundedWindowAggExec [datafusion]

2025-07-09 Thread via GitHub
alamb merged PR #16645: URL: https://github.com/apache/datafusion/pull/16645

Re: [PR] Improve display format of BoundedWindowAggExec [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16645: URL: https://github.com/apache/datafusion/pull/16645#issuecomment-3051989553 🚀

Re: [PR] Fix sqllogictests test running compatibility (ignore `--test-threads`) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on code in PR #16694: URL: https://github.com/apache/datafusion/pull/16694#discussion_r2194612244 ## datafusion/sqllogictest/bin/sqllogictests.rs: ## @@ -689,6 +689,12 @@ struct Options { help = "IGNORED (for compatibility with built-in rust test runner)

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
nuno-faria commented on PR #16716: URL: https://github.com/apache/datafusion/pull/16716#issuecomment-3052238363 @Dandandan I've added one more test where both tables are empty. Do you have suggestions for more?

[I] [Discussion]: show more info for `OutputRequirementExec` display [datafusion]

2025-07-09 Thread via GitHub
xudong963 opened a new issue, #16725: URL: https://github.com/apache/datafusion/issues/16725 Currently, the display of `OutputRequirementExec` only contains the name ``` 01)DataSinkExec: sink=ParquetSink(file_groups=[]) 02)--CoalescePartitionsExec 03)ProjectionExec: expr=[CA

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-09 Thread via GitHub
alamb merged PR #86: URL: https://github.com/apache/datafusion-site/pull/86

Re: [PR] DataFusion 47.0.0 blog post [datafusion-site]

2025-07-09 Thread via GitHub
alamb commented on PR #83: URL: https://github.com/apache/datafusion-site/pull/83#issuecomment-3052252011 Starting to check this out

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
nuno-faria commented on code in PR #16716: URL: https://github.com/apache/datafusion/pull/16716#discussion_r2194739991 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -928,6 +929,55 @@ pub(crate) fn build_batch_from_indices( Ok(RecordBatch::try_new(Arc::new(schema.clo

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
nuno-faria commented on PR #16716: URL: https://github.com/apache/datafusion/pull/16716#issuecomment-3052253266 @jonathanc-n Since after #16434 the hash map is not directly accessible, I've added an `is_empty` method to `JoinHashMapType`. Please check if this is the preferred approach.

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052758619 > The adaptive selection will help Q30 and Q31 from previous PR result: > > [apache/arrow-rs#7454 (comment)](https://github.com/apache/arrow-rs/pull/7454#issuecomment-2840770864)

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052776447 One thing that I think has caused us problems is judging any improvements to pushdown based on not regressing performance when pushdown is enabled vs not. However, this makes ma

[I] Comet 0.9.1 Release (July/August 2025) [datafusion-comet]

2025-07-09 Thread via GitHub
andygrove opened a new issue, #2002: URL: https://github.com/apache/datafusion-comet/issues/2002 ### What is the problem the feature request solves? We will likely want to create a patch release to contain bug fixes soon, particularly for the Iceberg integration work. Let's use this

[I] Could not resolve Feature 'ghcr.io/devcontainers-contrib/features/protoc:1' [datafusion-ballista]

2025-07-09 Thread via GitHub
Almaz-KG opened a new issue, #1277: URL: https://github.com/apache/datafusion-ballista/issues/1277 **Describe the bug** The `.devcontainer` configuration uses a deprecated or inaccessible URL for a protoc:1 feature. The feature `ghcr.io/devcontainers-contrib/features/protoc:1` canno

[I] Nested type modifiers/complex type [datafusion-sqlparser-rs]

2025-07-09 Thread via GitHub
ct-badger opened a new issue, #1932: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1932 Example Trino SQL: ```sql SELECT row_id, array_agg( CAST( ROW( total_amount ) AS

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052204040 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_test_pushdown Benchmark clickbench_1.json ┏

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052174459 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052174328 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_test_pushdown Benchmark clickbench_extended.json

Re: [PR] Add support for Snowflake identifier function [datafusion-sqlparser-rs]

2025-07-09 Thread via GitHub
iffyio commented on code in PR #1929: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1929#discussion_r2194728896 ## tests/sqlparser_snowflake.rs: ## @@ -4232,3 +4232,122 @@ fn test_snowflake_create_view_with_composite_policy_name() { r#"CREATE VIEW X (COL

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052692116 My analysis of these results is very consistent with my last attempt at caching filter results. The biggest slowdowns are in Q30, Q31 ``` │ QQuery 30│ 758.48 ms │

Re: [I] Optimized spill file format [datafusion]

2025-07-09 Thread via GitHub
alamb commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-3052679388 > It seems like comet has removed their custom BatchReader/Writer and switched back to arrow IPC reader/writer (see [PR#1703](https://github.com/apache/datafusion-comet/pull/170

Re: [PR] DataFusion 47.0.0 blog post [datafusion-site]

2025-07-09 Thread via GitHub
Omega359 commented on PR #83: URL: https://github.com/apache/datafusion-site/pull/83#issuecomment-3052697169 Thanks. Odd that RustRover rendered it differently but the wording is definitely better :)

Re: [PR] POC: Test DataFusion with experimental Parquet Filter Pushdown (try 4) [datafusion]

2025-07-09 Thread via GitHub
zhuqi-lucas commented on PR #16711: URL: https://github.com/apache/datafusion/pull/16711#issuecomment-3052710705 The adaptive selection will help Q30 and Q31 from previous PR result: https://github.com/apache/arrow-rs/pull/7454#issuecomment-2840770864

Re: [PR] Blog: Embedding User-Defined Indexes in Apache Parquet Files [datafusion-site]

2025-07-09 Thread via GitHub
zhuqi-lucas commented on PR #79: URL: https://github.com/apache/datafusion-site/pull/79#issuecomment-3052718193 > > I made some changes based on the latest comments from folks. > > FYI @alamb , please correct me if i made some wrong changes, thanks a lot! > > Thank you -- it is looking gr

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-09 Thread via GitHub
zhuqi-lucas commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3052731512 Thank you @alamb @Dandandan, we could also try sort_tpch10, but it may not show much improvement either; the ported PR is already 1.4x faster for sort_tpch Q11 (inlined string view

[PR] Use the `test-threads` option in sqllogictests [datafusion]

2025-07-09 Thread via GitHub
mjgarton opened a new pull request, #16722: URL: https://github.com/apache/datafusion/pull/16722 Use the `test-threads` option in sqllogictests if it is passed in, instead of ignoring it. Default to using get_available_parallelism as before otherwise. Previously, this option was only
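
The behaviour this PR describes (use the requested thread count when one is passed, otherwise fall back to the available parallelism) might look roughly like the sketch below. It is an assumption-laden illustration, not the PR's code: `resolve_test_threads` is a hypothetical helper, and the standard library's `std::thread::available_parallelism` stands in for the `get_available_parallelism` function named in the description.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Hypothetical helper (not taken from the PR): choose how many worker
/// threads a test runner should use.
///
/// * `Some(n)` with `n > 0` is honoured as given.
/// * `None` or `Some(0)` falls back to the parallelism reported by the
///   standard library, or to 1 if that cannot be determined.
fn resolve_test_threads(requested: Option<usize>) -> usize {
    match requested {
        Some(n) if n > 0 => n,
        _ => thread::available_parallelism()
            .map(NonZeroUsize::get)
            .unwrap_or(1),
    }
}
```

Treating `Some(0)` like an absent value is a design choice made for this sketch; a real runner could equally reject zero as an invalid argument.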

[I] Use `--test-threads` option properly in sqllogictests.rs [datafusion]

2025-07-09 Thread via GitHub
mjgarton opened a new issue, #16721: URL: https://github.com/apache/datafusion/issues/16721 ### Is your feature request related to a problem or challenge? The test threads option in `sqllogictests.rs` is currently ignored, but it could be used to control parallelism in the running of

Re: [I] How to write csv file to disk from a empty dataframe? [datafusion]

2025-07-09 Thread via GitHub
vadimpiven commented on issue #16240: URL: https://github.com/apache/datafusion/issues/16240#issuecomment-3052130805 I can add that for Arrow format the empty DataFrame also does not create a file.

[PR] fix: add `order_requirement` & `dist_requirement` to `OutputRequirementExec` plan display [datafusion]

2025-07-09 Thread via GitHub
Loaki07 opened a new pull request, #16726: URL: https://github.com/apache/datafusion/pull/16726 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested?

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
jonathanc-n commented on PR #16716: URL: https://github.com/apache/datafusion/pull/16716#issuecomment-3052562671 @nuno-faria We can return early from collect_left_input after intaking batches and checking the number of batches ```rust if batches.len() == 0 { return Ok(J

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
jonathanc-n commented on PR #16716: URL: https://github.com/apache/datafusion/pull/16716#issuecomment-3052564940 I can probably add this in a new PR; I'm currently trying the batch coalescer

Re: [I] [EPIC] Complete `datafusion-spark` Spark Compatible Functions [datafusion]

2025-07-09 Thread via GitHub
alamb commented on issue #15914: URL: https://github.com/apache/datafusion/issues/15914#issuecomment-3052568506 An update here is that we are waiting for one or two more good example PRs and then we'll turn the community on porting If anyone wants to take a look / help out with http

Re: [I] update the changelog generator script to include a link to the upgrade guide 🤔 [datafusion]

2025-07-09 Thread via GitHub
xudong963 closed issue #16626: update the changelog generator script to include a link to the upgrade guide 🤔 URL: https://github.com/apache/datafusion/issues/16626

Re: [PR] Add link to upgrade guide in changelog script [datafusion]

2025-07-09 Thread via GitHub
xudong963 merged PR #16680: URL: https://github.com/apache/datafusion/pull/16680

[PR] Added unquoted identifiers unicode support for mySql, postgreSqp, als… [datafusion-sqlparser-rs]

2025-07-09 Thread via GitHub
etgarperets opened a new pull request, #1933: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1933 …o added a test for that

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
jonathanc-n commented on code in PR #16716: URL: https://github.com/apache/datafusion/pull/16716#discussion_r2194904410 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -928,6 +929,55 @@ pub(crate) fn build_batch_from_indices( Ok(RecordBatch::try_new(Arc::new(schema.cl

Re: [PR] perf: Optimize hash joins with an empty build side [datafusion]

2025-07-09 Thread via GitHub
jonathanc-n commented on PR #16716: URL: https://github.com/apache/datafusion/pull/16716#issuecomment-3052501889 > @jonathanc-n Since after #16434 the hash map is not directly accessible, I've added an `is_empty` method to `JoinHashMapType`. Please check if this is the preferred approach.

Re: [PR] DataFusion 47.0.0 blog post [datafusion-site]

2025-07-09 Thread via GitHub
alamb commented on PR #83: URL: https://github.com/apache/datafusion-site/pull/83#issuecomment-3052979265 > Thanks. Odd that RustRover rendered it differently but the wording is definitely better :) Yeah, the pelicanasf rendered is pretty wonky and non standard (also doesn't like mar

[PR] Bump the MSRV due to transitive dependencies [datafusion]

2025-07-09 Thread via GitHub
rtyler opened a new pull request, #16728: URL: https://github.com/apache/datafusion/pull/16728 datafusion-cli depends on aws-config which has some transitive dependencies which require 1.85. This was a dependabot upgrade in 8366d6e155 16:18:43 error: rustc 1.83.0 is not supported by

Re: [PR] make: split git clone and checkout commit [datafusion-site]

2025-07-09 Thread via GitHub
kevinjqliu commented on PR #87: URL: https://github.com/apache/datafusion-site/pull/87#issuecomment-3053023934 cc @alamb

[PR] make: split git clone and checkout commit [datafusion-site]

2025-07-09 Thread via GitHub
kevinjqliu opened a new pull request, #87: URL: https://github.com/apache/datafusion-site/pull/87 Follow up to https://github.com/apache/datafusion-site/pull/86 This takes care of the case where `infrastructure-actions` folder already exists but not yet in the specific commit. Previou

Re: [I] Comet shuffle read size is larger than Spark shuffle [datafusion-comet]

2025-07-09 Thread via GitHub
andygrove commented on issue #1268: URL: https://github.com/apache/datafusion-comet/issues/1268#issuecomment-3053741323 I am investigating this

[I] Update supportedSortType to remove some of the complex type fallbacks [datafusion-comet]

2025-07-09 Thread via GitHub
andygrove opened a new issue, #2003: URL: https://github.com/apache/datafusion-comet/issues/2003 ### What is the problem the feature request solves? `QueryPlanSerde.supportedSortType` is overly restrictive and prevents some sort operations from being accelerated. We should upda

Re: [I] Update supportedSortType to remove some of the complex type fallbacks [datafusion-comet]

2025-07-09 Thread via GitHub
andygrove commented on issue #2003: URL: https://github.com/apache/datafusion-comet/issues/2003#issuecomment-3053771570 Oops this was a duplicate of https://github.com/apache/datafusion-comet/issues/1854

Re: [I] Update supportedSortType to remove some of the complex type fallbacks [datafusion-comet]

2025-07-09 Thread via GitHub
andygrove closed issue #2003: Update supportedSortType to remove some of the complex type fallbacks URL: https://github.com/apache/datafusion-comet/issues/2003

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-09 Thread via GitHub
comphead commented on PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3053785263 > @comphead @Kontinuation > > Thank you for the comments. That all makes sense to me. > > Here’s the plan I propose: > > 1. Extend fs-hdfs to support passing

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-09 Thread via GitHub
comphead commented on PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3053787063 > I have a couple of use cases in mind that I'm hoping this will cover - > > 1. Custom credentials providers - Say a credentials provider extending `org.apache.hadoop.fs.

[PR] share staging infrastructure [datafusion-site]

2025-07-09 Thread via GitHub
kevinjqliu opened a new pull request, #88: URL: https://github.com/apache/datafusion-site/pull/88 try to share `https://datafusion.staged.apache.org/blog/` amongst multiple PRs goal is to access each pr/branch as `https://datafusion.staged.apache.org//blog/`

[PR] Remove parquet_filter and parquet_sort benchmarks [datafusion]

2025-07-09 Thread via GitHub
alamb opened a new pull request, #16730: URL: https://github.com/apache/datafusion/pull/16730 ## Which issue does this PR close? - Part of https://github.com/apache/datafusion/issues/16729 ## Rationale for this change These benchmarks are not used, and are done agains

Re: [PR] chore: Remove obsolete supportedSortType function after Arrow updates [datafusion-comet]

2025-07-09 Thread via GitHub
andygrove commented on PR #1946: URL: https://github.com/apache/datafusion-comet/pull/1946#issuecomment-3053956235 @Ruchir28 I suspect that we can't completely remove `supportedSortType` yet, but it could be updated to remove many of the current restrictions.

Re: [PR] share staging infrastructure [datafusion-site]

2025-07-09 Thread via GitHub
kevinjqliu commented on PR #88: URL: https://github.com/apache/datafusion-site/pull/88#issuecomment-3053975872 ok CI now fails with ``` remote: Permission to apache/datafusion-site.git denied to github-actions[bot]. fatal: unable to access 'https://github.com/apache/datafusion-site/

[PR] Add `clickbench_pushdown` benchmark [datafusion]

2025-07-09 Thread via GitHub
alamb opened a new pull request, #16731: URL: https://github.com/apache/datafusion/pull/16731 ## Which issue does this PR close? - Related to https://github.com/apache/datafusion/issues/3463 - Closes https://github.com/apache/datafusion/issues/16729 ## Rationale fo

Re: [PR] Perform type coercion for corr aggregate function [datafusion]

2025-07-09 Thread via GitHub
kumarlokesh commented on code in PR #15776: URL: https://github.com/apache/datafusion/pull/15776#discussion_r2195970579 ## datafusion/sqllogictest/test_files/corr_type_coercion.slt: ## @@ -0,0 +1,248 @@ +# Licensed to the Apache Software Foundation (ASF) under one Review Commen

Re: [PR] fix: Support `auto` scan mode with Spark 4.0.0 [datafusion-comet]

2025-07-09 Thread via GitHub
andygrove commented on PR #1975: URL: https://github.com/apache/datafusion-comet/pull/1975#issuecomment-3054014387 hive-1 failure to investigate: ``` 2025-07-09T20:18:09.485Z [info] Cause: org.apache.comet.CometNativeException: External error: Arrow: Parquet argument error:

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-09 Thread via GitHub
parthchandra commented on PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3054017619 > @parthchandra I'm not sure if `libhdfs` packaged > > with `hdfs` feature and without the `libcomet.dylib` has the same size of 69M, so static linking is unlikely. A

Re: [I] Implement --test-threads CLI argument for sqllogictest runner [datafusion]

2025-07-09 Thread via GitHub
alamb commented on issue #16723: URL: https://github.com/apache/datafusion/issues/16723#issuecomment-3054019438 Dupe of https://github.com/apache/datafusion/issues/16721

Re: [PR] feat: expose intersect distinct/except distinct in dataframe api [datafusion]

2025-07-09 Thread via GitHub
alamb commented on code in PR #16578: URL: https://github.com/apache/datafusion/pull/16578#discussion_r2195957972 ## datafusion/core/src/dataframe/mod.rs: ## @@ -1681,6 +1681,40 @@ impl DataFrame { }) } +/// Calculate the distinct intersection of two [`DataFr

Re: [I] Implement --test-threads CLI argument for sqllogictest runner [datafusion]

2025-07-09 Thread via GitHub
alamb closed issue #16723: Implement --test-threads CLI argument for sqllogictest runner URL: https://github.com/apache/datafusion/issues/16723

Re: [PR] share staging infrastructure [datafusion-site]

2025-07-09 Thread via GitHub
kevinjqliu commented on PR #88: URL: https://github.com/apache/datafusion-site/pull/88#issuecomment-3054029996 Looking at [past successful runs](https://github.com/apache/datafusion-site/actions/workflows/stage-site.yml?query=is%3Asuccess), they are all from either andygrove, alamb, or timsaucer

Re: [PR] share staging infrastructure [datafusion-site]

2025-07-09 Thread via GitHub
kevinjqliu commented on PR #88: URL: https://github.com/apache/datafusion-site/pull/88#issuecomment-3054048496 @alamb could you maybe push an empty commit to this branch? Or, if it's easier, add a GitHub suggestion and apply it. That should trigger CI using your GitHub user.

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-09 Thread via GitHub
Dandandan commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3054095304 I wonder if we should focus on more parallelisation in `SortPreservingMerge` (besides improving the hot paths like `gc`, `interleave`, comparisons, etc.). I think there might b

Re: [PR] Bump the MSRV due to transitive dependencies [datafusion]

2025-07-09 Thread via GitHub
alamb merged PR #16728: URL: https://github.com/apache/datafusion/pull/16728

Re: [I] Unnest logical plan lacks decent projection push down [datafusion]

2025-07-09 Thread via GitHub
alamb closed issue #16623: Unnest logical plan lacks decent projection push down URL: https://github.com/apache/datafusion/issues/16623

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-09 Thread via GitHub
alamb merged PR #16632: URL: https://github.com/apache/datafusion/pull/16632

Re: [PR] Bump the MSRV due to transitive dependencies [datafusion]

2025-07-09 Thread via GitHub
alamb commented on PR #16728: URL: https://github.com/apache/datafusion/pull/16728#issuecomment-3054126138 Thanks @comphead and @rtyler
