Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-18 Thread via GitHub
ding-young commented on code in PR #16268: URL: https://github.com/apache/datafusion/pull/16268#discussion_r2156147905 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -258,6 +259,8 @@ impl ExternalSorter { batch_size: usize, sort_spill_reservation_bytes: u

Re: [PR] Add compression option to SpillManager [datafusion]

2025-06-18 Thread via GitHub
ding-young commented on code in PR #16268: URL: https://github.com/apache/datafusion/pull/16268#discussion_r2156147684 ## datafusion/physical-plan/src/joins/sort_merge_join.rs: ## @@ -1324,6 +1326,8 @@ impl Stream for SortMergeJoinStream { impl SortMergeJoinStream { #[allo

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16433: URL: https://github.com/apache/datafusion/pull/16433#issuecomment-2986650188 Hm my earlier benchmarks didn't seem correct. not sure where the earlier run came from 🤔 -- This is an automated message from the Apache Git Service. To respond to the message,

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on code in PR #16433: URL: https://github.com/apache/datafusion/pull/16433#discussion_r2156139567 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -319,13 +341,87 @@ impl TopK { /// (a > 2 OR (a = 2 AND b < 3)) /// ``` fn update_filter(&mut s

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on code in PR #16433: URL: https://github.com/apache/datafusion/pull/16433#discussion_r2156133514 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -319,13 +341,87 @@ impl TopK { /// (a > 2 OR (a = 2 AND b < 3)) /// ``` fn update_filter(&mut s

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on code in PR #16433: URL: https://github.com/apache/datafusion/pull/16433#discussion_r2156106525 ## datafusion/physical-plan/src/topk/mod.rs: ## @@ -214,41 +238,39 @@ impl TopK { let mut selected_rows = None; -if let Some(filter) = self.

Re: [PR] Fix constant window for evaluate stateful [datafusion]

2025-06-18 Thread via GitHub
suibianwanwank commented on PR #16430: URL: https://github.com/apache/datafusion/pull/16430#issuecomment-2986591856 > > > @andygrove how can we test this with Comet? Can I just pin to a datafusion version? > > > > > > Yes, assuming that there are no breaking API changes in DataFus

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-18 Thread via GitHub
zhuqi-lucas commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2986468374 Yeah, the clickbench benchmark is a little slower, and it seems to be reproducible: total time is about 1000ms slower. > hmm there seems to be some regressions there...

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-18 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2986400295 # benchmark I use this [script](https://gist.github.com/UBarney/9dcbf304e65f061d3352b34abd0f0e05#file-sql_bench-py) to do benchmark | ID | SQL | join_base Time(s) | join_li

Re: [PR] feat: Add ConfigOptions to ScalarFunctionArgs [datafusion]

2025-06-18 Thread via GitHub
github-actions[bot] commented on PR #13527: URL: https://github.com/apache/datafusion/pull/13527#issuecomment-2986358525 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] feat: implement GroupsAccumulator for `count(DISTINCT)` aggr [datafusion]

2025-06-18 Thread via GitHub
github-actions[bot] closed pull request #15324: feat: implement GroupsAccumulator for `count(DISTINCT)` aggr URL: https://github.com/apache/datafusion/pull/15324

Re: [PR] Draft: Use take-in kernel in repartitioning [datafusion]

2025-06-18 Thread via GitHub
github-actions[bot] closed pull request #15392: Draft: Use take-in kernel in repartitioning URL: https://github.com/apache/datafusion/pull/15392

Re: [PR] doc: Add comments to clarify algorithm for `MarkJoin`s [datafusion]

2025-06-18 Thread via GitHub
jonathanc-n commented on code in PR #16436: URL: https://github.com/apache/datafusion/pull/16436#discussion_r2155875377 ## datafusion/physical-plan/src/joins/symmetric_hash_join.rs: ## @@ -810,6 +810,21 @@ where { // Store the result in a tuple let result = match (bui

Re: [PR] feat: Support equijoins in `NestedLoopJoin` [datafusion]

2025-06-18 Thread via GitHub
jonathanc-n commented on PR #16450: URL: https://github.com/apache/datafusion/pull/16450#issuecomment-2986314949 The upside is that it performs well when both tables are extremely small < 50 rows 😆

Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub
codecov-commenter commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2986286369 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1901?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] feat: Add support for native hash join with BuildRight + LeftAnti [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove closed pull request #1904: feat: Add support for native hash join with BuildRight + LeftAnti URL: https://github.com/apache/datafusion-comet/pull/1904

Re: [PR] feat: Add support for native hash join with BuildRight + LeftAnti [datafusion-comet]

2025-06-18 Thread via GitHub
comphead commented on PR #1904: URL: https://github.com/apache/datafusion-comet/pull/1904#issuecomment-2986186984 > @comphead The root cause is [apache/datafusion#10583](https://github.com/apache/datafusion/issues/10583) Got it so it is a HJ issue, I'll try to check DF issue

Re: [PR] chore: Enable `native_iceberg_compat` Spark SQL tests (for real, this time) [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove merged PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910

Re: [PR] Fix constant window for evaluate stateful [datafusion]

2025-06-18 Thread via GitHub
andygrove commented on PR #16430: URL: https://github.com/apache/datafusion/pull/16430#issuecomment-2986182319 > > @andygrove how can we test this with Comet? Can I just pin to a datafusion version? > > Yes, assuming that there are no breaking API changes in DataFusion since 48 ... I

Re: [PR] [ignore] test DataFusion PR: Fix constant window for evaluate stateful [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove closed pull request #1913: [ignore] test DataFusion PR: Fix constant window for evaluate stateful URL: https://github.com/apache/datafusion-comet/pull/1913

[PR] [ignore] test DataFusion PR: Fix constant window for evaluate stateful [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove opened a new pull request, #1913: URL: https://github.com/apache/datafusion-comet/pull/1913 ## Which issue does this PR close? N/A ## Rationale for this change We would like to see if https://github.com/apache/datafusion/pull/16430 fixes issues

Re: [PR] chore: Enable `native_iceberg_compat` Spark SQL tests (for real, this time) [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra commented on code in PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#discussion_r2155783253 ## dev/diffs/3.5.6.diff: ## @@ -1938,7 +1938,17 @@ index 8e88049f51e..d3c0737d52e 100644 import testImplicits._ // keep() should take effect o

Re: [PR] Fix constant window for evaluate stateful [datafusion]

2025-06-18 Thread via GitHub
andygrove commented on PR #16430: URL: https://github.com/apache/datafusion/pull/16430#issuecomment-2986119176 > @andygrove how can we test this with Comet? Can I just pin to a datafusion version? Yes, assuming that there are no breaking API changes in DataFusion since 48 ... I will

Re: [PR] chore: Enable `native_iceberg_compat` Spark SQL tests (for real, this time) [datafusion-comet]

2025-06-18 Thread via GitHub
kazuyukitanimura commented on code in PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#discussion_r2155756956 ## dev/diffs/3.5.6.diff: ## @@ -1938,7 +1938,17 @@ index 8e88049f51e..d3c0737d52e 100644 import testImplicits._ // keep() should take effe

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-18 Thread via GitHub
kazuyukitanimura commented on code in PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911#discussion_r2155755872 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1946,6 +1946,52 @@ class ParquetReadV1Suite extends ParquetReadSuite w

Re: [PR] chore: Enable `native_iceberg_compat` Spark SQL tests (for real, this time) [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#discussion_r2155751838 ## dev/diffs/3.5.6.diff: ## @@ -1938,7 +1938,17 @@ index 8e88049f51e..d3c0737d52e 100644 import testImplicits._ // keep() should take effect on S

Re: [PR] Set the default value of `datafusion.execution.collect_statistics` to `true` [datafusion]

2025-06-18 Thread via GitHub
AdamGS commented on PR #16447: URL: https://github.com/apache/datafusion/pull/16447#issuecomment-2986090162 Our benchmarks show this change fixes the performance regression we saw - https://github.com/vortex-data/vortex/pull/3567

Re: [PR] chore: Introduce `exprHandlers` map in QueryPlanSerde [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1903: URL: https://github.com/apache/datafusion-comet/pull/1903#discussion_r2155747372 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -61,6 +61,39 @@ import org.apache.comet.shims.CometExprShim * An utility object

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra commented on code in PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911#discussion_r2155736023 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1946,6 +1946,52 @@ class ParquetReadV1Suite extends ParquetReadSuite with

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-18 Thread via GitHub
codecov-commenter commented on PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911#issuecomment-2986043661 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1911?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] test: Trigger Spark 3.4.3 SQL tests for iceberg-compat [datafusion-comet]

2025-06-18 Thread via GitHub
codecov-commenter commented on PR #1912: URL: https://github.com/apache/datafusion-comet/pull/1912#issuecomment-2986069982 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1912?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] chore: Enable `native_iceberg_compat` Spark SQL tests (for real, this time) [datafusion-comet]

2025-06-18 Thread via GitHub
kazuyukitanimura commented on code in PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#discussion_r2155727836 ## dev/diffs/3.5.6.diff: ## @@ -1938,7 +1938,17 @@ index 8e88049f51e..d3c0737d52e 100644 import testImplicits._ // keep() should take effe

Re: [PR] chore: Enable `native_iceberg_compat` Spark SQL tests (for real, this time) [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra commented on code in PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#discussion_r2155726443 ## .github/workflows/spark_sql_test.yml: ## @@ -114,6 +114,6 @@ jobs: run: | cd apache-spark rm -rf /root/.m2/repository/or

Re: [PR] test: Trigger Spark 3.4.3 SQL tests for iceberg-compat [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on PR #1912: URL: https://github.com/apache/datafusion-comet/pull/1912#issuecomment-2986048912 @kazuyukitanimura This PR will not actually test iceberg-compat until it includes the fix from https://github.com/apache/datafusion-comet/pull/1910

Re: [PR] chore: Enable Spark SQL tests for auto scan mode [WIP] [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on PR #1885: URL: https://github.com/apache/datafusion-comet/pull/1885#issuecomment-2986043502 The fix for the test failure is in https://github.com/apache/datafusion-comet/pull/1910

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-18 Thread via GitHub
kazuyukitanimura commented on code in PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911#discussion_r2155719528 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1946,6 +1946,52 @@ class ParquetReadV1Suite extends ParquetReadSuite w

[PR] test: Trigger Spark 3.4.3 SQL tests for iceberg-compat [datafusion-comet]

2025-06-18 Thread via GitHub
kazuyukitanimura opened a new pull request, #1912: URL: https://github.com/apache/datafusion-comet/pull/1912 ## Which issue does this PR close? ## Rationale for this change To trigger Spark 3.4.3 SQL tests for iceberg-compat on PRs ## What changes are included in this PR?

Re: [PR] chore: Fix typo in workflow [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#issuecomment-2986017103 One test failure, as expected: ``` 2025-06-18T22:31:07.6082754Z [info] - SPARK-17091: Convert IN predicate to Parquet filter push-down *** FAILED *** (297 millisecond

Re: [PR] Implementation for regex_instr [datafusion]

2025-06-18 Thread via GitHub
blaginin commented on code in PR #15928: URL: https://github.com/apache/datafusion/pull/15928#discussion_r2155683469 ## datafusion/functions/src/regex/regexpinstr.rs: ## @@ -0,0 +1,804 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lice

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2985988638 > I think it makes sense to only filter on the shared hashmap and not bothering with the min/max values - creating hashes and doing a single table lookup is quite fast, so I think w

Re: [PR] Implementation for regex_instr [datafusion]

2025-06-18 Thread via GitHub
blaginin commented on code in PR #15928: URL: https://github.com/apache/datafusion/pull/15928#discussion_r2155673833 ## datafusion/functions/src/regex/regexpcount.rs: ## @@ -29,10 +30,10 @@ use datafusion_expr::{ use datafusion_macros::user_doc; use itertools::izip; use regex

[PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra opened a new pull request, #1911: URL: https://github.com/apache/datafusion-comet/pull/1911 ## Which issue does this PR close? Adds a new unit test. Also adds a method to generate a complex type parquet file that can be used to test various complex type cases.

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on PR #16371: URL: https://github.com/apache/datafusion/pull/16371#issuecomment-2985997261 I'll try to review tomorrow. I took a look the other day and my thought was that while it's complex code that is a bit hard for me to fully wrap my head around it's well teste

Re: [PR] feat: support array_max [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra commented on code in PR #1892: URL: https://github.com/apache/datafusion-comet/pull/1892#discussion_r2155694477 ## spark/src/test/scala/org/apache/comet/CometArrayExpressionSuite.scala: ## @@ -232,6 +232,21 @@ class CometArrayExpressionSuite extends CometTestBase wi

Re: [I] Was GCS support removed? [datafusion-ballista]

2025-06-18 Thread via GitHub
milenkovicm commented on issue #1274: URL: https://github.com/apache/datafusion-ballista/issues/1274#issuecomment-2985986246 In short, users should extend Ballista to support the object store they need. S3 is a bit of a special case. You can find more details on how to do that in the examples.

Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-18 Thread via GitHub
parthchandra commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2985979040 > Thanks for the contribution, @SKY-ALIN! Could we add a test case with timestamps as the join key? The test should have the left side and the right side timestamps b

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-18 Thread via GitHub
adriangb commented on PR #16433: URL: https://github.com/apache/datafusion/pull/16433#issuecomment-2985967718 Seems like a bug in my implementation right? I'd be surprised if the update checks I added are that heavy compared to other work...

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16433: URL: https://github.com/apache/datafusion/pull/16433#issuecomment-2985955904 It seems in some cases it's faster: ``` ┏━━┳━┳━━┳━━━┓ ┃ Query┃ topk-dynamic-filter ┃ topk-filters ┃

[I] Was GCS support removed? [datafusion-ballista]

2025-06-18 Thread via GitHub
dfinninger opened a new issue, #1274: URL: https://github.com/apache/datafusion-ballista/issues/1274 Hi, we're trying to make Ballista read parquet files in Google Cloud Storage. It looks like support for GCS was added in 2023: https://github.com/apache/datafusion-ballista/pull/805. However

Re: [PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-06-18 Thread via GitHub
jonathanc-n commented on PR #16434: URL: https://github.com/apache/datafusion/pull/16434#issuecomment-2985914954 Those benchmarks make sense, just saves memory.

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2985881381 > > I think doing only the lookup is preferable above also computing / checking the bounds, I think the latter might create more overhead > > My thought was that for some cas

Re: [PR] feat: Support equijoins in `NestedLoopJoin` [datafusion]

2025-06-18 Thread via GitHub
jonathanc-n commented on code in PR #16450: URL: https://github.com/apache/datafusion/pull/16450#discussion_r2155615159 ## datafusion/core/src/physical_planner.rs: ## @@ -1009,95 +1012,99 @@ impl DefaultPhysicalPlanner { let left_df_schema = left.schema();

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on code in PR #16445: URL: https://github.com/apache/datafusion/pull/16445#discussion_r2155602331 ## datafusion/physical-plan/src/joins/hash_join.rs: ## @@ -943,10 +978,71 @@ impl ExecutionPlan for HashJoinExec { try_embed_projection(projection,

Re: [PR] chore: Fix typo in workflow [datafusion-comet]

2025-06-18 Thread via GitHub
codecov-commenter commented on PR #1910: URL: https://github.com/apache/datafusion-comet/pull/1910#issuecomment-2985831551 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1910?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-06-18 Thread via GitHub
alamb commented on PR #16434: URL: https://github.com/apache/datafusion/pull/16434#issuecomment-2985835006 🤖: Benchmark completed Details ``` Comparing HEAD and support-u32-hashmap Benchmark clickbench_extended.json

Re: [PR] minor: Avoid rewriting join to unsupported join [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on code in PR #1888: URL: https://github.com/apache/datafusion-comet/pull/1888#discussion_r2155523377 ## spark/src/main/scala/org/apache/comet/rules/RewriteJoin.scala: ## @@ -65,9 +65,8 @@ object RewriteJoin extends JoinSelectionHelper { def rewrite(plan:

[I] Add support for native hash join with LeftAnti + BuildRight [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove opened a new issue, #1909: URL: https://github.com/apache/datafusion-comet/issues/1909 ### What is the problem the feature request solves? We currently fall back to Spark for hash join with LeftAnti + BuildRight because of correctness issues. We should file an issue in DataF

Re: [PR] fix: set RangePartitioning for native shuffle default to false [datafusion-comet]

2025-06-18 Thread via GitHub
mbutrovich merged PR #1907: URL: https://github.com/apache/datafusion-comet/pull/1907

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-18 Thread via GitHub
Dandandan commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2985797508 hmm there seems to be some regressions there...

Re: [I] Add support for native hash join with LeftAnti + BuildRight [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on issue #1909: URL: https://github.com/apache/datafusion-comet/issues/1909#issuecomment-2985709966 There is already an issue in DataFusion https://github.com/apache/datafusion/issues/10583

Re: [PR] Set the default value of `datafusion.execution.collect_statistics` to `true` [datafusion]

2025-06-18 Thread via GitHub
AdamGS commented on PR #16447: URL: https://github.com/apache/datafusion/pull/16447#issuecomment-2985792094 Got a similar test failure to #16448 (issue filed in #16452). I have to conclude it's personal at this point, I'll try and find some time to dig into it.

Re: [PR] feat: Add support for native hash join with BuildRight + LeftAnti [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on PR #1904: URL: https://github.com/apache/datafusion-comet/pull/1904#issuecomment-2985785274 @comphead The root cause is https://github.com/apache/datafusion/issues/10583

Re: [PR] feat: Add support for native hash join with BuildRight + LeftAnti [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on PR #1904: URL: https://github.com/apache/datafusion-comet/pull/1904#issuecomment-2985784145 > > @comphead we seem to have a correctness issue when enabling LeftAnti + BuildRight: > > ``` > > [info] +- == Initial Plan == > > [info] CometBroadcastHashJoi

[PR] chore: Fix typo in workflow [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove opened a new pull request, #1910: URL: https://github.com/apache/datafusion-comet/pull/1910 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-18 Thread via GitHub
alamb commented on PR #16371: URL: https://github.com/apache/datafusion/pull/16371#issuecomment-2985770479 Sorry I have seen this one but haven't found time to review it yet cc @adriangb and @timsaucer

Re: [PR] doc: Add comments to clarify algorithm for `MarkJoin`s [datafusion]

2025-06-18 Thread via GitHub
alamb commented on code in PR #16436: URL: https://github.com/apache/datafusion/pull/16436#discussion_r211968 ## datafusion/physical-plan/src/joins/symmetric_hash_join.rs: ## @@ -810,6 +810,21 @@ where { // Store the result in a tuple let result = match (build_sid

Re: [PR] Set the default value of `datafusion.execution.collect_statistics` to `true` [datafusion]

2025-06-18 Thread via GitHub
AdamGS commented on PR #16447: URL: https://github.com/apache/datafusion/pull/16447#issuecomment-2985767325 Added a short upgrade note

Re: [PR] Remove unused feature in `physical-plan` and fix compilation error in benchmark [datafusion]

2025-06-18 Thread via GitHub
alamb commented on code in PR #16449: URL: https://github.com/apache/datafusion/pull/16449#discussion_r2155540756 ## datafusion/physical-plan/Cargo.toml: ## @@ -36,7 +36,6 @@ workspace = true [features] force_hash_collisions = [] -bench = [] Review Comment: 👍 ###

Re: [PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-06-18 Thread via GitHub
alamb commented on PR #16434: URL: https://github.com/apache/datafusion/pull/16434#issuecomment-2985750963 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubun

Re: [PR] Set the default value of `datafusion.execution.collect_statistics` to `true` [datafusion]

2025-06-18 Thread via GitHub
alamb commented on code in PR #16447: URL: https://github.com/apache/datafusion/pull/16447#discussion_r2155537108 ## datafusion/sqllogictest/test_files/parquet_statistics.slt: ## @@ -59,18 +59,18 @@ query TT EXPLAIN SELECT * FROM test_table WHERE column1 = 1; physical_pl

[I] SortQueryFuzzer found a failing case [datafusion]

2025-06-18 Thread via GitHub
AdamGS opened a new issue, #16452: URL: https://github.com/apache/datafusion/issues/16452 ### Describe the bug Fuzzer failed during an unrelated change - https://github.com/apache/datafusion/actions/runs/15741542523/job/44367876525?pr=16449. Not sure how long GitHub retains log

Re: [PR] Remove unused feature in `physical-plan` and fix compilation error in benchmark [datafusion]

2025-06-18 Thread via GitHub
alamb commented on PR #16449: URL: https://github.com/apache/datafusion/pull/16449#issuecomment-2985730673 > The test failure is a fuzzer failure, is there a accepted way to open tickets generated by fuzzing? I'll restart the test. Maybe you can just create a ticket with a link to the

Re: [PR] feat: Add support for native hash join with BuildRight + LeftAnti [datafusion-comet]

2025-06-18 Thread via GitHub
comphead commented on PR #1904: URL: https://github.com/apache/datafusion-comet/pull/1904#issuecomment-2985725340 > @comphead we seem to have a correctness issue when enabling LeftAnti + BuildRight: > > ``` > [info] +- == Initial Plan == > [info] CometBroadcastHashJoin [

Re: [PR] Fix constant window for evaluate stateful [datafusion]

2025-06-18 Thread via GitHub
alamb commented on PR #16430: URL: https://github.com/apache/datafusion/pull/16430#issuecomment-2985661485 I tried making a reproducer but I could not reproduce the wrong results or panic reported in @andygrove 's comment https://github.com/apache/datafusion/issues/16308#issuecomment-294951

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-18 Thread via GitHub
alamb commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2985718371 I took the liberty of merging up from main to resolve a logical conflict

Re: [PR] minor: Avoid rewriting join to unsupported join [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove merged PR #1888: URL: https://github.com/apache/datafusion-comet/pull/1888

Re: [I] Add support for native hash join with LeftAnti + BuildRight [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on issue #1909: URL: https://github.com/apache/datafusion-comet/issues/1909#issuecomment-2985711731 This issue may be a duplicate of https://github.com/apache/datafusion-comet/issues/457

Re: [PR] minor: Avoid rewriting join to unsupported join [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on PR #1888: URL: https://github.com/apache/datafusion-comet/pull/1888#issuecomment-2985701990 > btw is it still an issue? I think LeftAnti with SMJ has been fixed a while ago in DF It looks like there are still issues. I see a correctness issue when trying to en

Re: [PR] fix: create file for empty stream [datafusion]

2025-06-18 Thread via GitHub
alamb merged PR #16342: URL: https://github.com/apache/datafusion/pull/16342

Re: [I] How to write csv file to disk from a empty dataframe? [datafusion]

2025-06-18 Thread via GitHub
alamb closed issue #16240: How to write csv file to disk from a empty dataframe? URL: https://github.com/apache/datafusion/issues/16240

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-18 Thread via GitHub
alamb commented on PR #16401: URL: https://github.com/apache/datafusion/pull/16401#issuecomment-2985697551 Thanks again @epgif and @comphead

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-18 Thread via GitHub
alamb merged PR #16401: URL: https://github.com/apache/datafusion/pull/16401

Re: [PR] feat: Add support for native hash join with BuildRight + LeftAnti [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on PR #1904: URL: https://github.com/apache/datafusion-comet/pull/1904#issuecomment-2985695312 @comphead we seem to have a correctness issue when enabling LeftAnti + BuildRight: ``` [info] +- == Initial Plan == [info] CometBroadcastHashJoin [c1#253],

Re: [PR] Fix constant window for evaluate stateful [datafusion]

2025-06-18 Thread via GitHub
alamb commented on PR #16430: URL: https://github.com/apache/datafusion/pull/16430#issuecomment-2985662279 @andygrove how can we test this with Comet? Can I just pin to a datafusion version?

Re: [PR] feat: Add support for native hash join with BuildRight + LeftAnti [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove commented on PR #1904: URL: https://github.com/apache/datafusion-comet/pull/1904#issuecomment-2985629164 One test failure: ``` - SPARK-38132: Not IN subquery correctness checks *** FAILED *** ```

Re: [I] Blog post about TopK filter pushdown [datafusion]

2025-06-18 Thread via GitHub
alamb commented on issue #15513: URL: https://github.com/apache/datafusion/issues/15513#issuecomment-2985614903 Now all we need to do is find time to write one

Re: [I] RangePartitioning does not yield correct results [datafusion-comet]

2025-06-18 Thread via GitHub
mbutrovich commented on issue #1906: URL: https://github.com/apache/datafusion-comet/issues/1906#issuecomment-2985599126 So the big challenge here seems to be mapping Comet to Spark's execution. Each generated Comet plan samples from its input stream, which itself is only a single partitio

Re: [PR] Remove unused feature in `physical-plan` and fix compilation error in benchmark [datafusion]

2025-06-18 Thread via GitHub
AdamGS commented on PR #16449: URL: https://github.com/apache/datafusion/pull/16449#issuecomment-2985590878 The test failure is a fuzzer failure. Is there an accepted way to open tickets generated by fuzzing?

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-18 Thread via GitHub
alamb commented on code in PR #16401: URL: https://github.com/apache/datafusion/pull/16401#discussion_r2155411540 ## datafusion/catalog/src/information_schema/tests.rs: ## @@ -0,0 +1,88 @@ +use std::sync::Arc; Review Comment: The CI is failing because this file doesn't have

Re: [PR] Support types other than String and Int for partition columns [datafusion-python]

2025-06-18 Thread via GitHub
miclegr commented on code in PR #1154: URL: https://github.com/apache/datafusion-python/pull/1154#discussion_r2155423158 ## python/datafusion/context.py: ## @@ -535,7 +535,7 @@ def register_listing_table( self, name: str, path: str | pathlib.Path, -

Re: [PR] Set the default value of `datafusion.execution.collect_statistics` to `true` [datafusion]

2025-06-18 Thread via GitHub
AdamGS commented on PR #16447: URL: https://github.com/apache/datafusion/pull/16447#issuecomment-2985571652 I'm getting a lot of sqllogictest failures. Is there a reason to think there's something weird going on? I was somewhat open to the idea that it's all fine until I ran into the last test in `

Re: [PR] Remove redundant license-header-check CI job [datafusion]

2025-06-18 Thread via GitHub
alamb commented on code in PR #16451: URL: https://github.com/apache/datafusion/pull/16451#discussion_r2155406105 ## .github/workflows/rust.yml: ## @@ -39,14 +39,6 @@ on: workflow_dispatch: jobs: - # Check license header Review Comment: The other copy is here: The oth

[PR] Remove redundant license-header-check CI job [datafusion]

2025-06-18 Thread via GitHub
alamb opened a new pull request, #16451: URL: https://github.com/apache/datafusion/pull/16451 ## Which issue does this PR close? ## Rationale for this change - While working on https://github.com/apache/datafusion/pull/16401 I noticed that the license header check ran t

Re: [PR] Support data source sampling with TABLESAMPLE [datafusion]

2025-06-18 Thread via GitHub
theirix commented on PR #16325: URL: https://github.com/apache/datafusion/pull/16325#issuecomment-2985522134 > According to PostgreSQL's reference: https://wiki.postgresql.org/wiki/TABLESAMPLE_Implementation#SYSTEM_Option I believe `SYSTEM` option is equivalent to keep the entire `RecordBat

[I] auto scan mode should take file location into account [datafusion-comet]

2025-06-18 Thread via GitHub
andygrove opened a new issue, #1908: URL: https://github.com/apache/datafusion-comet/issues/1908 ### What is the problem the feature request solves? When using the `auto` mode for choosing the best Parquet scan implementation, we do not currently take the file source into account. Thi

Re: [PR] feat: Support Equijoin Expressions in `NestedLoopJoin` [datafusion]

2025-06-18 Thread via GitHub
jonathanc-n commented on PR #16450: URL: https://github.com/apache/datafusion/pull/16450#issuecomment-2985477515 I will try to run a benchmark on a table with smaller rows and return the result when finished.

[PR] feat: Support Equijoin Expressions in `NestedLoopJoin` [datafusion]

2025-06-18 Thread via GitHub
jonathanc-n opened a new pull request, #16450: URL: https://github.com/apache/datafusion/pull/16450 ## Which issue does this PR close? - Closes #. ## Rationale for this change We want to support equijoins in `NestedLoopJoin` in the case where one of the tables in the

Re: [PR] Set the default value of `datafusion.execution.collect_statistics` to `true` [datafusion]

2025-06-18 Thread via GitHub
AdamGS commented on PR #16447: URL: https://github.com/apache/datafusion/pull/16447#issuecomment-2985425260 Definitely, I'll run [our benchmarks](https://github.com/vortex-data/vortex/pull/3560) once I get all tests passing here.

Re: [I] Default to collecting statistics when creating LIstingTables [datafusion]

2025-06-18 Thread via GitHub
alamb commented on issue #16158: URL: https://github.com/apache/datafusion/issues/16158#issuecomment-2985421552 > Is there anything to this issue besides changing the default `datafusion.execution.collect_statistics`, fixing any tests that rely on the default value being `false` and