Re: [PR] feat: Avoid duplicate `PhyscialExpr` evaluation on hash table [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n closed pull request #16719: feat: Avoid duplicate `PhyscialExpr` evaluation on hash table URL: https://github.com/apache/datafusion/pull/16719 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
xudong963 merged PR #16718: URL: https://github.com/apache/datafusion/pull/16718

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
adriangb commented on PR #16718: URL: https://github.com/apache/datafusion/pull/16718#issuecomment-3051014271 Thanks for merging @xudong963

Re: [PR] fix: Add LogicalTypeAnnotation in ParquetColumnSpec [datafusion-comet]

2025-07-08 Thread via GitHub
hsiang-c commented on PR #2000: URL: https://github.com/apache/datafusion-comet/pull/2000#issuecomment-3051010409 @kazuyukitanimura yes. We can consider merging https://github.com/apache/datafusion-comet/pull/1987 to speed up CI, but need to fix some tests due to configuration.

Re: [I] Simplify Joins on Shared Column Name [datafusion-python]

2025-07-08 Thread via GitHub
kosiew commented on issue #1173: URL: https://github.com/apache/datafusion-python/issues/1173#issuecomment-3051181319 @timsaucer Sorry for crossing lanes. Feel free to close my PR.

Re: [I] [EPIC] Tracking issue of support substrait logical plan [datafusion]

2025-07-08 Thread via GitHub
ViggoC commented on issue #8149: URL: https://github.com/apache/datafusion/issues/8149#issuecomment-3051149803 @waynexia Why do you think that we don't need to support OuterReferenceColumn? IMHO, It is the key path to implementing subquery.

Re: [PR] Add support for automatic join column deduplication in DataFrame joins [datafusion-python]

2025-07-08 Thread via GitHub
kosiew closed pull request #1185: Add support for automatic join column deduplication in DataFrame joins URL: https://github.com/apache/datafusion-python/pull/1185

Re: [PR] Add support for automatic join column deduplication in DataFrame joins [datafusion-python]

2025-07-08 Thread via GitHub
kosiew commented on PR #1185: URL: https://github.com/apache/datafusion-python/pull/1185#issuecomment-3051209201 Closed because of #1184

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
zhuqi-lucas commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3048588922 > So one thing we could potentially do here is make RepartitionExec smarter for large batches, and if the input is really large split it into smaller batch sizes or somethin

[PR] feat: reduce duplicate fields on join [datafusion-python]

2025-07-08 Thread via GitHub
timsaucer opened a new pull request, #1184: URL: https://github.com/apache/datafusion-python/pull/1184 # Which issue does this PR close? Closes #1173 # Rationale for this change In the current version of the code when you do a join and there is a common `on` column name,

Re: [PR] Per file filter evaluation [datafusion]

2025-07-08 Thread via GitHub
alamb commented on code in PR #15057: URL: https://github.com/apache/datafusion/pull/15057#discussion_r2192262877 ## datafusion-examples/examples/variant_shredding.rs: ## @@ -0,0 +1,408 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor lic

Re: [I] Filter cache based on the paper "Predicate Caching: Query-Driven Secondary Indexing for Cloud Data" [datafusion]

2025-07-08 Thread via GitHub
adriangb commented on issue #15585: URL: https://github.com/apache/datafusion/issues/15585#issuecomment-3048634993 ClickHouse has this: https://clickhouse.com/docs/operations/query-condition-cache

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3048785975 🤖: Benchmark completed Details ``` Comparing HEAD and alamb_update_arrow_56.0.0 Benchmark sort_tpch1.json ┏

Re: [I] Filter cache based on the paper "Predicate Caching: Query-Driven Secondary Indexing for Cloud Data" [datafusion]

2025-07-08 Thread via GitHub
adriangb commented on issue #15585: URL: https://github.com/apache/datafusion/issues/15585#issuecomment-3048933537 Yes I think DataFusion should only provide the right hook points to enable this sort of cache.

Re: [I] Filter cache based on the paper "Predicate Caching: Query-Driven Secondary Indexing for Cloud Data" [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #15585: URL: https://github.com/apache/datafusion/issues/15585#issuecomment-304879 I think this is a good building block for a system, but likely needs to be carefully considered as not all datafusion installations will want to do this (as it will tradeoff memor

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3048779730 > Thank you @alamb, it seems we have some improvement for clickbench. Not too much because we gain for sort string view mostly which is not in clickbench but in sort_tpch. I wil

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3048782227 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

Re: [I] Filter cache based on the paper "Predicate Caching: Query-Driven Secondary Indexing for Cloud Data" [datafusion]

2025-07-08 Thread via GitHub
acking-you commented on issue #15585: URL: https://github.com/apache/datafusion/issues/15585#issuecomment-3048962546 > but likely needs to be carefully considered as not all datafusion installations will want to do this Perhaps, similar to ClickHouse, a controllable setting thres

Re: [PR] docs: Documentation updates for 0.9.0 release [datafusion-comet]

2025-07-08 Thread via GitHub
andygrove merged PR #1981: URL: https://github.com/apache/datafusion-comet/pull/1981

Re: [PR] docs: Documentation updates for 0.9.0 release [datafusion-comet]

2025-07-08 Thread via GitHub
andygrove commented on PR #1981: URL: https://github.com/apache/datafusion-comet/pull/1981#issuecomment-3049242019 Thanks for the review @comphead

Re: [PR] Perform type coercion for corr aggregate function [datafusion]

2025-07-08 Thread via GitHub
alamb commented on code in PR #15776: URL: https://github.com/apache/datafusion/pull/15776#discussion_r2192719901 ## datafusion/sqllogictest/test_files/corr_type_coercion.slt: ## @@ -0,0 +1,248 @@ +# Licensed to the Apache Software Foundation (ASF) under one Review Comment:

Re: [I] Simplify Joins on Shared Column Name [datafusion-python]

2025-07-08 Thread via GitHub
timsaucer commented on issue #1173: URL: https://github.com/apache/datafusion-python/issues/1173#issuecomment-3048558886 @kosiew Nick is a coworker of mine and we went over this yesterday. I'm going to put up a PR in a few minutes for it.

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-08 Thread via GitHub
zhuqi-lucas commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3048816186 > 🤖: Benchmark completed > > Details > > ``` > Comparing HEAD and alamb_update_arrow_56.0.0 > > Benchmark sort_tpch1.json > -

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-08 Thread via GitHub
alamb commented on code in PR #16632: URL: https://github.com/apache/datafusion/pull/16632#discussion_r2192561119 ## datafusion/proto/src/logical_plan/mod.rs: ## @@ -916,40 +915,13 @@ impl AsLogicalPlan for LogicalPlanNode { LogicalPlanType::Unnest(unnest) => {

Re: [I] NoSuchMethodError with Spark 3.5.3 (EMR 7.6) [datafusion-comet]

2025-07-08 Thread via GitHub
dirrao commented on issue #1451: URL: https://github.com/apache/datafusion-comet/issues/1451#issuecomment-3048723212 We're experiencing the same issue with Spark 3.5.5 on EMR. It appears that EMR might be using a customized build of Spark, which could be contributing to this behavior.

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3048541094 I tried changing the code to use ChunkedArray rather than a single array ```diff -names_array = pa.concat_array([pa.array(names)] * batches) +names_array = pa.chunked_array

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3048543973 So one thing we could potentially do here is make RepartitionExec smarter for large batches, and if the input is really large split it into smaller batch sizes or something 🤔
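
Editor's sketch of the idea in the comment above (splitting one very large input batch into slices no larger than a target batch size). This is only an illustration on plain Python lists: `split_into_batches` and `target_batch_size` are hypothetical names chosen here, and nothing below is DataFusion's `RepartitionExec` or `DataSourceExec` code.

```python
def split_into_batches(rows, target_batch_size):
    """Split one oversized batch into consecutive slices of at most
    target_batch_size rows, preserving row order."""
    if target_batch_size <= 0:
        raise ValueError("target_batch_size must be positive")
    return [
        rows[start:start + target_batch_size]
        for start in range(0, len(rows), target_batch_size)
    ]


# A single "large" batch of 10 rows, split with a target size of 4.
big_batch = list(range(10))
small_batches = split_into_batches(big_batch, 4)
# Only the final slice may be shorter than the target size.
```

Downstream operators would then see several bounded batches instead of one huge one, which is the effect the comment is after; whether that belongs in the repartitioning step or in the data source is the open question in the thread.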

Re: [PR] chore: use DF scalar functions for StartsWith, EndsWith, Contains, DF LikeExpr [datafusion-comet]

2025-07-08 Thread via GitHub
comphead commented on code in PR #1887: URL: https://github.com/apache/datafusion-comet/pull/1887#discussion_r2192767215 ## spark/src/test/scala/org/apache/comet/CometStringExpressionSuite.scala: ## @@ -185,4 +186,67 @@ class CometStringExpressionSuite extends CometTestBase {

Re: [PR] Fix for Postgres regex and like binary operators [datafusion-sqlparser-rs]

2025-07-08 Thread via GitHub
iffyio commented on code in PR #1928: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1928#discussion_r2192773155 ## tests/sqlparser_postgres.rs: ## @@ -2207,19 +2223,31 @@ fn parse_pg_like_match_ops() { ]; for (str_op, op) in pg_like_match_ops { -

Re: [PR] Revert "fix: create file for empty stream" [datafusion]

2025-07-08 Thread via GitHub
chenkovsky commented on PR #16682: URL: https://github.com/apache/datafusion/pull/16682#issuecomment-3049753756 @alamb @brunal Sorry for the inconvenience; reverting it is OK. I'm trying to find a new way to implement this feature.

Re: [PR] Chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-08 Thread via GitHub
tglanz commented on code in PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#discussion_r2193074384 ## native/spark-expr/src/bloom_filter/bloom_filter_might_contain.rs: ## @@ -0,0 +1,196 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] Chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-08 Thread via GitHub
mbutrovich commented on code in PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#discussion_r219304 ## native/spark-expr/src/bloom_filter/bloom_filter_might_contain.rs: ## @@ -0,0 +1,196 @@ +// Licensed to the Apache Software Foundation (ASF) under one +//

Re: [PR] Chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-08 Thread via GitHub
mbutrovich commented on PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#issuecomment-3049764358 Just pending CI, will likely merge later today.

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #86: URL: https://github.com/apache/datafusion-site/pull/86#issuecomment-3049790410 Not sure why the pelican Dockerfile build fails... ``` 6.961 ERROR: Could not find a version that satisfies the requirement urllib3==2.5.0 (from versions: 0.3, 1.0, 1.0.1, 1.

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #86: URL: https://github.com/apache/datafusion-site/pull/86#issuecomment-3049794080 changing to `urllib3==2.2.3` works https://github.com/apache/infrastructure-actions/blob/1115490227e7aaf7ccee5b06bb3b5955e7cf8493/pelican/requirements.txt#L11

[PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu opened a new pull request, #86: URL: https://github.com/apache/datafusion-site/pull/86 (no comment)

Re: [I] Support u32 indices in HashJoinExec [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #16179: Support u32 indices in HashJoinExec URL: https://github.com/apache/datafusion/issues/16179

Re: [PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16434: URL: https://github.com/apache/datafusion/pull/16434

Re: [PR] chore: Implement BloomFilterMightContain as a ScalarUDFImpl [datafusion-comet]

2025-07-08 Thread via GitHub
mbutrovich commented on code in PR #1954: URL: https://github.com/apache/datafusion-comet/pull/1954#discussion_r2193310051 ## native/spark-expr/src/bloom_filter/bloom_filter_might_contain.rs: ## @@ -0,0 +1,196 @@ +// Licensed to the Apache Software Foundation (ASF) under one +//

Re: [PR] cast_operands_to_double_type_to_fix_arithmetic_overflow [datafusion-comet]

2025-07-08 Thread via GitHub
codecov-commenter commented on PR #1996: URL: https://github.com/apache/datafusion-comet/pull/1996#issuecomment-3050078977 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1996?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Add support for automatic join column deduplication in DataFrame joins [datafusion-python]

2025-07-08 Thread via GitHub
timsaucer commented on PR #1185: URL: https://github.com/apache/datafusion-python/pull/1185#issuecomment-3050083502 Would you mind taking a look at https://github.com/apache/datafusion-python/pull/1184 ? It's an alternate approach which basically reuses the logic of `drop_columns` on the r

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3050196330 > Isn't the easiest solution is to update `DataSourceExec` to adhere to target batch size? Maybe -- I will file a ticket to explain the problem with a DataFusion only repro

Re: [D] DISCUSSION: DataFusion Meetup in Boston, USA [datafusion]

2025-07-08 Thread via GitHub
GitHub user NGA-TRAN edited a discussion: DISCUSSION: DataFusion Meetup in Boston, USA With the upcoming New York meetup on the horizon, the DataDog Boston team is excited to plan a local DataFusion-themed gathering this fall! **Date:** Wednesday, November 12 📍 Location: DataDog, 225 Frankl

Re: [PR] feat: Support `u32` indices for `HashJoinExec` [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16434: URL: https://github.com/apache/datafusion/pull/16434#issuecomment-3050097130 TLDR this branch looks good from my performance perspective. Thank you @jonathanc-n and @Dandandan

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16632: URL: https://github.com/apache/datafusion/pull/16632#issuecomment-3050201621 > branch has been rebased. Should I squash commit or is this done during merge ? commits are squashed on merge so no need to do it on the branch Pushing commits rather tha

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #86: URL: https://github.com/apache/datafusion-site/pull/86#issuecomment-3050505258 Opened https://github.com/apache/infrastructure-actions/issues/218

Re: [PR] add Makefile and local setup instruction in README [datafusion-site]

2025-07-08 Thread via GitHub
kevinjqliu commented on PR #86: URL: https://github.com/apache/datafusion-site/pull/86#issuecomment-3050539883 ok updated the PR to include the workaround. lets pin to use the working commit for `apache/infrastructure-actions`, which is `8aee7a080268198548d8d1b4f1315a4fb94bffea`. Added

Re: [PR] feat: reduce duplicate fields on join [datafusion-python]

2025-07-08 Thread via GitHub
kosiew commented on code in PR #1184: URL: https://github.com/apache/datafusion-python/pull/1184#discussion_r2194093566 ## python/tests/test_dataframe.py: ## @@ -400,7 +400,6 @@ def test_unnest_without_nulls(nested_df): assert result.column(1) == pa.array([7, 8, 8, 9, 9, 9

[PR] feat: Upgrade to the official DataFusion 49.0.0 release [datafusion-comet]

2025-07-08 Thread via GitHub
dharanad opened a new pull request, #1997: URL: https://github.com/apache/datafusion-comet/pull/1997 ## Which issue does this PR close? Closes #1993 ## Rationale for this change ## What changes are included in this PR? ## How are these chan

Re: [I] [DISCUSSION] Show `predicates` in `DataSourceExec` explain (indent) [datafusion]

2025-07-08 Thread via GitHub
xudong963 closed issue #16561: [DISCUSSION] Show `predicates` in `DataSourceExec` explain (indent) URL: https://github.com/apache/datafusion/issues/16561

Re: [I] [DISCUSSION] Show `predicates` in `DataSourceExec` explain (indent) [datafusion]

2025-07-08 Thread via GitHub
xudong963 commented on issue #16561: URL: https://github.com/apache/datafusion/issues/16561#issuecomment-3048029215 Thank you all, I believe I ran with an old version before

Re: [PR] cast_operands_to_double_type_to_fix_arithmetic_overflow [datafusion-comet]

2025-07-08 Thread via GitHub
andygrove commented on code in PR #1996: URL: https://github.com/apache/datafusion-comet/pull/1996#discussion_r2193256587 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -677,7 +677,13 @@ object QueryPlanSerde extends Logging with CometExprShim {

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
ozankabak commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3050173385 Isn't the easiest solution is to update `DataSourceExec` to adhere to target batch size?

Re: [PR] github: turn on discussion [datafusion-site]

2025-07-08 Thread via GitHub
alamb merged PR #85: URL: https://github.com/apache/datafusion-site/pull/85

Re: [PR] fix: try to lower plain reserved functions to columns as well [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16669: URL: https://github.com/apache/datafusion/pull/16669

Re: [PR] fix: try to lower plain reserved functions to columns as well [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16669: URL: https://github.com/apache/datafusion/pull/16669#issuecomment-3050174594 Thank you for the review @jonahgao ❤️

Re: [I] Error when use `user` field in where clause [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #14141: Error when use `user` field in where clause URL: https://github.com/apache/datafusion/issues/14141

Re: [PR] Improve display format of BoundedWindowAggExec [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16645: URL: https://github.com/apache/datafusion/pull/16645#issuecomment-3050178802 > > I suspect updating the results was faster than it might have previously been thanks to the work @blaginin @Chen-Yuan-Lai have done to migrate most of our plan tests to `insta` >

Re: [I] Bug: the new filter pushdown optimizer rule in physical layer will miss the equivalence info in filter [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #16563: Bug: the new filter pushdown optimizer rule in physical layer will miss the equivalence info in filter URL: https://github.com/apache/datafusion/issues/16563

Re: [I] Running tests with `--test-threads` option fails. [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #16693: Running tests with `--test-threads` option fails. URL: https://github.com/apache/datafusion/issues/16693

Re: [PR] Fix sqllogictests test running compatibility (ignore `--test-threads`) [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16694: URL: https://github.com/apache/datafusion/pull/16694#issuecomment-3050382157 Thanks again @mjgarton

Re: [PR] Add the missing equivalence info for filter pushdown [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16686: URL: https://github.com/apache/datafusion/pull/16686

Re: [PR] Fix: Make `CopyTo` logical plan output schema consistent with physical schema [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16705: URL: https://github.com/apache/datafusion/pull/16705

Re: [I] Output schema of the CopyTo logical plan is not correct. [datafusion]

2025-07-08 Thread via GitHub
alamb closed issue #16704: Output schema of the CopyTo logical plan is not correct. URL: https://github.com/apache/datafusion/issues/16704

Re: [PR] Fix: Make `CopyTo` logical plan output schema consistent with physical schema [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16705: URL: https://github.com/apache/datafusion/pull/16705#issuecomment-3050382520 Thanks again @bert-beyondloops

Re: [PR] chore(devcontainer): use debian's `protobuf-compiler` package [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16687: URL: https://github.com/apache/datafusion/pull/16687#issuecomment-3050383546 Thanks again @fvj

Re: [PR] chore(devcontainer): use debian's `protobuf-compiler` package [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16687: URL: https://github.com/apache/datafusion/pull/16687#issuecomment-3050383391 🚀

Re: [PR] chore(devcontainer): use debian's `protobuf-compiler` package [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16687: URL: https://github.com/apache/datafusion/pull/16687

Re: [PR] Fix sqllogictests test running compatibility (ignore `--test-threads`) [datafusion]

2025-07-08 Thread via GitHub
alamb merged PR #16694: URL: https://github.com/apache/datafusion/pull/16694

Re: [PR] Add the missing equivalence info for filter pushdown [datafusion]

2025-07-08 Thread via GitHub
alamb commented on code in PR #16686: URL: https://github.com/apache/datafusion/pull/16686#discussion_r2193493034 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -289,7 +289,7 @@ fn test_no_pushdown_through_aggregates() { Ok: - Filte

Re: [D] DISCUSSION: DataFusion Meetup in Boston, USA [datafusion]

2025-07-08 Thread via GitHub
GitHub user NGA-TRAN added a comment to the discussion: DISCUSSION: DataFusion Meetup in Boston, USA I have updated the description. Let us go with Wednesday, November 12th. GitHub link: https://github.com/apache/datafusion/discussions/16703#discussioncomment-13700744

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3050260270 I would expect the largest difference to be in sorting benchmarks (`sort_tpch` etc.)

[PR] Sf create table as [datafusion-sqlparser-rs]

2025-07-08 Thread via GitHub
yoavcloud opened a new pull request, #1931: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1931 The code that parses `CREATE TABLE` in the Snowflake dialect assumed that if the `AS`, `LIKE` or `CLONE` options are used, then no other options can be specified. That is not true, s

[PR] feat: add CopyExec and move CopyExec handling to Spark [datafusion-comet]

2025-07-08 Thread via GitHub
dharanad opened a new pull request, #2001: URL: https://github.com/apache/datafusion-comet/pull/2001 ## Which issue does this PR close? Closes #1995 ## Rationale for this change ## What changes are included in this PR? ## How are these chang

[PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n opened a new pull request, #16718: URL: https://github.com/apache/datafusion/pull/16718 ## Which issue does this PR close? - Closes #. ## Rationale for this change Fix ci ## What changes are included in this PR? #16686 seems to have been merg

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-08 Thread via GitHub
parthchandra commented on code in PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#discussion_r2193617615 ## native/core/src/parquet/objectstore/jni_hdfs.rs: ## @@ -0,0 +1,332 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contri

Re: [PR] feat: Add JNI-based Hadoop FileSystem support for S3 and other Hadoop-compatible stores [datafusion-comet]

2025-07-08 Thread via GitHub
parthchandra commented on PR #1992: URL: https://github.com/apache/datafusion-comet/pull/1992#issuecomment-3050630944 @comphead @Kontinuation you might be interested in looking at this PR.

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n commented on PR #16718: URL: https://github.com/apache/datafusion/pull/16718#issuecomment-3050635513 Actually instead of adding back the code, i think i'll bring this function back into `PredicateSupport` since it now makes sense to.

Re: [PR] fix: Fix CI failing due to #16686 [datafusion]

2025-07-08 Thread via GitHub
jonathanc-n commented on PR #16718: URL: https://github.com/apache/datafusion/pull/16718#issuecomment-3050634024 @adriangb Just need a quick merge here 😆

Re: [PR] feat: add multi level merge sort that will always fit in memory [datafusion]

2025-07-08 Thread via GitHub
rluvaton commented on code in PR #15700: URL: https://github.com/apache/datafusion/pull/15700#discussion_r2191760246 ## datafusion/physical-plan/src/aggregates/row_hash.rs: ## @@ -1067,14 +1074,13 @@ impl GroupedHashAggregateStream { sort_batch(&batch, &expr, No

Re: [D] DISCUSSION: DataFusion Meetup in Boston, USA [datafusion]

2025-07-08 Thread via GitHub
GitHub user alamb added a comment to the discussion: DISCUSSION: DataFusion Meetup in Boston, USA I can attend -- thank you for organizing @NGA-TRAN How about Nov 12 or Nov 13? GitHub link: https://github.com/apache/datafusion/discussions/16703#discussioncomment-13694117 This is an

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3048361744 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1016-gcp #16~24.04.1-Ubun

[I] Replace π-related bound constants with `next_up` / `next_down` [datafusion]

2025-07-08 Thread via GitHub
ding-young opened a new issue, #16712: URL: https://github.com/apache/datafusion/issues/16712 ### Background Rust 1.86 stabilized [f64::{next_up, next_down}](https://doc.rust-lang.org/std/primitive.f64.html#method.next_up) and [f32::{next_down, next_up}](https://doc.rust-lang.org/std/pr

Re: [PR] DRAFT: Update arrow/parquet to 56.0.0 [datafusion]

2025-07-08 Thread via GitHub
alamb commented on PR #16690: URL: https://github.com/apache/datafusion/pull/16690#issuecomment-3048363279 > Could we trigger the benchmark for this PR, thanks! Done!

Re: [I] 1000x slowdown opening parquet file due to partitions [datafusion]

2025-07-08 Thread via GitHub
asayers commented on issue #16676: URL: https://github.com/apache/datafusion/issues/16676#issuecomment-3048396188 I can't share the data I was hitting this case with. I could try to make a synthetic reproducer, but work is very busy right now so I might not get to it for a few months.

Re: [I] Some group by query is 6~7x slower than DuckDB [datafusion]

2025-07-08 Thread via GitHub
alamb commented on issue #16707: URL: https://github.com/apache/datafusion/issues/16707#issuecomment-3048476870 ```shell python3 -m venv test_venv source test_venv/bin/activate pip install duckdb datafusion numpy python3 repro.py ``` I see the output ``` du

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2191939442 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter

Re: [PR] Fix: optimize projections for unnest logical plan. [datafusion]

2025-07-08 Thread via GitHub
bert-beyondloops commented on PR #16632: URL: https://github.com/apache/datafusion/pull/16632#issuecomment-3048052607 I moved the logic around the unnest column detection into a new Unnest::try_new method. In my opinion, this did not belong in the LogicalPlanBuilder. The projecti

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2191954803 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter

Re: [I] Feature Request: Implement `MATCH_RECOGNIZE` for Advanced Pattern Matching [datafusion]

2025-07-08 Thread via GitHub
geoffreyclaude commented on issue #13583: URL: https://github.com/apache/datafusion/issues/13583#issuecomment-3048085132 I'm pretty much AFK until September, so sharing what I've got so far: https://github.com/apache/datafusion/pull/16685 As written in the PR description, this is a pa

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2191937168 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
Dandandan commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2191935939 ## datafusion/physical-plan/src/joins/utils.rs: ## @@ -843,24 +844,56 @@ pub(crate) fn apply_join_filter_to_indices( probe_indices: UInt32Array, filter

[PR] cast_operands_to_double_type_to_fix_arithmetic_overflow [datafusion-comet]

2025-07-08 Thread via GitHub
coderfender opened a new pull request, #1996: URL: https://github.com/apache/datafusion-comet/pull/1996 ## Which issue does this PR close? Closes #1477 ## Rationale for this change This change fixes an overflow exception which occurs when we divide ## What changes

Re: [PR] refactor filter pushdown APIs [datafusion]

2025-07-08 Thread via GitHub
kosiew commented on PR #16642: URL: https://github.com/apache/datafusion/pull/16642#issuecomment-3047833471 yep

Re: [I] Simplify Filter Pushdown APIs for Better Maintainability and Developer Experience [datafusion]

2025-07-08 Thread via GitHub
kosiew closed issue #16188: Simplify Filter Pushdown APIs for Better Maintainability and Developer Experience URL: https://github.com/apache/datafusion/issues/16188

Re: [PR] refactor filter pushdown APIs [datafusion]

2025-07-08 Thread via GitHub
kosiew merged PR #16642: URL: https://github.com/apache/datafusion/pull/16642

Re: [PR] feat: add multi level merge sort that will always fit in memory [datafusion]

2025-07-08 Thread via GitHub
2010YOUY01 commented on code in PR #15700: URL: https://github.com/apache/datafusion/pull/15700#discussion_r2191875936 ## datafusion/physical-plan/src/aggregates/row_hash.rs: ## @@ -1067,14 +1074,13 @@ impl GroupedHashAggregateStream { sort_batch(&batch, &expr,

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-07-08 Thread via GitHub
2010YOUY01 commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-3047980169 > @korowa @2010YOUY01 Are you able to take a quick look? Thanks! Thank you so much for this optimization. It's on my list, but due to the complexity of the join operator, I
