Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-29 Thread via GitHub
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2764305700 > Hey again, thanks for working on this 🙏 > > > can you merge main into this branch please? to remove extra diff > > Just to explain, the current PR diff is quite larg

[I] Add missing error macro [datafusion]

2025-03-29 Thread via GitHub
jayzhan211 opened a new issue, #15491: URL: https://github.com/apache/datafusion/issues/15491 ### Is your feature request related to a problem or challenge? We have error macro like `internal_err` and `exec_err` supported, but not all of them are supported Example * External
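
For readers unfamiliar with the pattern the issue refers to, here is a minimal sketch of how an error-constructing macro of this kind can work. `DemoError`, `demo_internal_err!`, `demo_exec_err!`, `checked_div`, and `nth_column` are hypothetical names made up for illustration; this is not DataFusion's actual error type or macro implementation.

```rust
// Hypothetical stand-in for a project error type; NOT DataFusion's real enum.
#[derive(Debug, PartialEq)]
enum DemoError {
    Internal(String),
    Execution(String),
}

// Hypothetical macro: expands to `Err(DemoError::Internal(..))` with a
// `format!`-style message.
macro_rules! demo_internal_err {
    ($($arg:tt)*) => {
        Err(DemoError::Internal(format!($($arg)*)))
    };
}

// Hypothetical macro: same shape, but for the `Execution` variant.
macro_rules! demo_exec_err {
    ($($arg:tt)*) => {
        Err(DemoError::Execution(format!($($arg)*)))
    };
}

// Runtime failure caused by user input: reported as an execution error.
fn checked_div(a: i64, b: i64) -> Result<i64, DemoError> {
    if b == 0 {
        return demo_exec_err!("division by zero: {} / {}", a, b);
    }
    Ok(a / b)
}

// A broken invariant inside the engine: reported as an internal error.
fn nth_column<'a>(cols: &[&'a str], i: usize) -> Result<&'a str, DemoError> {
    match cols.get(i) {
        Some(c) => Ok(*c),
        None => demo_internal_err!("column index {} out of bounds (len {})", i, cols.len()),
    }
}
```

Under this sketch, every error variant that lacks such a macro has to be built by hand at each call site, which is presumably the gap the issue asks to close.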

Re: [PR] fix: aggregation corner case [datafusion]

2025-03-29 Thread via GitHub
jayzhan211 commented on PR #15457: URL: https://github.com/apache/datafusion/pull/15457#issuecomment-2764347664 > > > count(*) actually doesnt depend on any column on input logically > > > > > > count(*) need to know the row number of the column, and it doesn't make sense to count

Re: [PR] fix: Assertion fail in external sort [datafusion]

2025-03-29 Thread via GitHub
2010YOUY01 commented on code in PR #15469: URL: https://github.com/apache/datafusion/pull/15469#discussion_r2020058583 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -416,21 +409,23 @@ impl ExternalSorter { Some(self.spill_manager.create_in_progress_file("

Re: [PR] perf: Reuse row converter during sort [datafusion]

2025-03-29 Thread via GitHub
2010YOUY01 merged PR #15302: URL: https://github.com/apache/datafusion/pull/15302 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dat

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-29 Thread via GitHub
blaginin commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2764289869 Hey again, thanks for working on this 🙏 > can you merge main into this branch please? to remove extra diff Just to explain, the current PR diff is quite large because it

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-29 Thread via GitHub
blaginin commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2764291364 > Hi, @blaginin I'm not sure what exactly do you mean by merge it to main? I see there is no conflicts with base branch so it probably means GitHub can fast forward it? that'

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-29 Thread via GitHub
acking-you commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019944906 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr { fn evaluate(&self, batch: &RecordBatch) -

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-29 Thread via GitHub
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2764295647 Got it, thanks for pointing that out. Just cleared up the diff tree.

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-29 Thread via GitHub
qstommyshu commented on code in PR #15480: URL: https://github.com/apache/datafusion/pull/15480#discussion_r2020033789 ## datafusion/substrait/tests/cases/roundtrip_logical_plan.rs: ## @@ -1374,30 +1464,32 @@ async fn assert_read_filter_count( Ok(()) } -async fn assert_e

Re: [PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

2025-03-29 Thread via GitHub
berkaysynnada commented on PR #15479: URL: https://github.com/apache/datafusion/pull/15479#issuecomment-2764267974 I think you can generalize this logic by tracking the `ExecutionPlanProperties::pipeline_behavior()` of operators in the plan.

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-29 Thread via GitHub
ctsk closed pull request #15418: Clean up hash_join's ExecutionPlan::execute URL: https://github.com/apache/datafusion/pull/15418

[I] Support macro `ensore_or_internal_err` to clean up repeated sanity checks [datafusion]

2025-03-29 Thread via GitHub
2010YOUY01 opened a new issue, #15492: URL: https://github.com/apache/datafusion/issues/15492 ### Is your feature request related to a problem or challenge? When implementing complex logic, it's common to include assertions as sanity checks to catch potential errors. The Rust `asse
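
As a rough illustration of the idea in the issue title, the sketch below shows a sanity-check macro that returns an internal error instead of panicking the way Rust's `assert!` does. The names `DemoError`, `demo_ensure_or_internal_err!`, and `pairwise_sum` are hypothetical and chosen only for this example; the macro proposed in the issue may differ in name, signature, and behavior.

```rust
// Hypothetical stand-in for a project error type; NOT DataFusion's real enum.
#[derive(Debug, PartialEq)]
enum DemoError {
    Internal(String),
}

// Hypothetical macro: if the condition is false, return early from the
// enclosing function with an internal error built from a `format!`-style
// message. Unlike `assert!`, a failed check does not panic.
macro_rules! demo_ensure_or_internal_err {
    ($cond:expr, $($arg:tt)+) => {
        if !($cond) {
            return Err(DemoError::Internal(format!($($arg)+)));
        }
    };
}

// Adds two slices element by element after checking that the lengths match.
fn pairwise_sum(left: &[i32], right: &[i32]) -> Result<Vec<i32>, DemoError> {
    demo_ensure_or_internal_err!(
        left.len() == right.len(),
        "length mismatch: {} vs {}",
        left.len(),
        right.len()
    );
    Ok(left.iter().zip(right.iter()).map(|(l, r)| l + r).collect())
}
```

The check collapses to a single statement at the call site, and the caller receives a `Result` it can propagate with `?` rather than a process-level panic.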

[I] `custom_datasource` example panicked during `RepartitionExec` planning [datafusion]

2025-03-29 Thread via GitHub
2010YOUY01 opened a new issue, #15493: URL: https://github.com/apache/datafusion/issues/15493 ### Describe the bug I have a PR that didn't change the repartition code, but caused one assertion failure inside `RepartitionExec`'s `execute()` method, during `custom_datasource.rs` exampl

Re: [PR] refactor: Move `Memtable` to catalog [datafusion]

2025-03-29 Thread via GitHub
berkaysynnada commented on code in PR #15459: URL: https://github.com/apache/datafusion/pull/15459#discussion_r2019995058 ## datafusion/catalog/src/memory/table.rs: ## @@ -0,0 +1,377 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor licens

Re: [I] IDEA: Use one of the examples from Datafusion Blog 45 to complete custom logical plans/execution plans page [datafusion]

2025-03-29 Thread via GitHub
the0ninjas commented on issue #15422: URL: https://github.com/apache/datafusion/issues/15422#issuecomment-2764416585 I'd love to work on this! @alamb Could you share the link to the blog examples please?

Re: [PR] Parse Postgres's LOCK TABLE statement [datafusion-sqlparser-rs]

2025-03-29 Thread via GitHub
github-actions[bot] closed pull request #1614: Parse Postgres's LOCK TABLE statement URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1614

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-29 Thread via GitHub
2010YOUY01 commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2764361563 > Isn't it always better partitioning on this selection vectors in case of hash-rep 🤔 What is the reason of keeping the old strategy ? I think to support this selection vect

Re: [PR] Remove redundant statistics from FileScanConfig [datafusion]

2025-03-29 Thread via GitHub
Standing-Man commented on PR #14955: URL: https://github.com/apache/datafusion/pull/14955#issuecomment-2764329379 Thanks for your valuable contributions! I’ll continue working on fixing this issue.

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-29 Thread via GitHub
Dandandan commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2763850029 > I'm working on HashAggregate [goldmedal#3](https://github.com/goldmedal/datafusion/pull/3) based on this PR. I found we shouldn't use only one config, `prefer_hash_selection_vec

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-29 Thread via GitHub
berkaysynnada commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2764263842 Isn't it always better partitioning on this selection vectors in case of hash-rep 🤔 What is the reason of keeping the old strategy ?

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-29 Thread via GitHub
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2764290781 > Hey again, thanks for working on this 🙏 > > > can you merge main into this branch please? to remove extra diff > > Just to explain, the current PR diff is quite larg

Re: [PR] Support computing statistics for FileGroup [datafusion]

2025-03-29 Thread via GitHub
alamb commented on code in PR #15432: URL: https://github.com/apache/datafusion/pull/15432#discussion_r2019764664 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -1181,6 +1175,92 @@ impl ListingTable { } } +/// Processes a stream of partitioned files and return

Re: [I] Duplicate unqualified field names error on queries with multiple JOIN [datafusion]

2025-03-29 Thread via GitHub
LiaCastaneda commented on issue #15439: URL: https://github.com/apache/datafusion/issues/15439#issuecomment-2763259160 take

Re: [I] Add most functions to the Expr class so that they're chainable. [datafusion-python]

2025-03-29 Thread via GitHub
ion-elgreco commented on issue #1064: URL: https://github.com/apache/datafusion-python/issues/1064#issuecomment-2763203660 Actually a duplicate of https://github.com/apache/datafusion-python/issues/876

Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]

2025-03-29 Thread via GitHub
alan910127 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019692309 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -230,19 +230,19 @@ logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)]

Re: [PR] Draft: Use take-in kernel in repartitioning [datafusion]

2025-03-29 Thread via GitHub
ctsk commented on PR #15392: URL: https://github.com/apache/datafusion/pull/15392#issuecomment-2754815322 @alamb This PR should be able to run benchmarks now. I've added overrides to use the modified version of arrow in the PR and a lockfile to avoid chrono issues. At least it can run tpch

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-29 Thread via GitHub
acking-you commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019945905 ## datafusion/physical-expr/src/expressions/binary.rs: ## @@ -358,7 +358,50 @@ impl PhysicalExpr for BinaryExpr { fn evaluate(&self, batch: &RecordBatch) -
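
The PR title names short-circuit evaluation for `AND` and `OR`. A minimal sketch of the general idea follows, assuming plain `Vec<bool>` batches with no nulls; `and_short_circuit` and `or_short_circuit` are hypothetical helpers written for this example and are not the code under review in `binary.rs`.

```rust
// Hypothetical helper: evaluates `lhs AND rhs` row by row, but only calls the
// (possibly expensive) `rhs` closure when at least one lhs row is true.
// Nulls are ignored in this simplified model.
fn and_short_circuit<F>(lhs: Vec<bool>, rhs: F) -> Vec<bool>
where
    F: FnOnce() -> Vec<bool>,
{
    // false AND x == false for every row, so rhs cannot change the result.
    if lhs.iter().all(|v| !*v) {
        return lhs;
    }
    let rhs = rhs();
    lhs.iter().zip(rhs.iter()).map(|(l, r)| *l && *r).collect()
}

// Hypothetical helper: the mirror image for `OR`. If every lhs row is already
// true, rhs is never evaluated.
fn or_short_circuit<F>(lhs: Vec<bool>, rhs: F) -> Vec<bool>
where
    F: FnOnce() -> Vec<bool>,
{
    // true OR x == true for every row, so rhs cannot change the result.
    if lhs.iter().all(|v| *v) {
        return lhs;
    }
    let rhs = rhs();
    lhs.iter().zip(rhs.iter()).map(|(l, r)| *l || *r).collect()
}
```

Passing the right-hand side as a closure makes the saving explicit: when the left-hand side decides the whole batch, the right-hand expression is never computed.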

[PR] Add support for Databricks TIMESTAMP_NTZ. [datafusion-sqlparser-rs]

2025-03-29 Thread via GitHub
romanb opened a new pull request, #1781: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1781 This PR adds support for parsing Databricks' [TIMESTAMP_NTZ](https://docs.databricks.com/aws/en/sql/language-manual/data-types/timestamp-ntz-type) data type.

Re: [PR] Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… [datafusion]

2025-03-29 Thread via GitHub
comphead merged PR #15447: URL: https://github.com/apache/datafusion/pull/15447

Re: [I] Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving merging on a single `Utf8View` ) [datafusion]

2025-03-29 Thread via GitHub
comphead closed issue #15403: Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving merging on a single `Utf8View` ) URL: https://github.com/apache/datafusion/issues/15403

Re: [I] `"UTMMedium"` field in the hits dataset causes a panic when performing SIMILAR TO. [datafusion]

2025-03-29 Thread via GitHub
acking-you commented on issue #15461: URL: https://github.com/apache/datafusion/issues/15461#issuecomment-2763406114 > Trying with this data I am not getting any error The dataset is 14gb > > https://github.com/user-attachments/assets/b777bcc1-198c-425b-9433-8299016d232c >

[PR] feat: Add union_by_name, union_by_name_distinct to DataFrame api [datafusion]

2025-03-29 Thread via GitHub
Omega359 opened a new pull request, #15489: URL: https://github.com/apache/datafusion/pull/15489 ## Which issue does this PR close? - Closes #12650 ## Rationale for this change Expose union_by_name/union_by_name_distinct logical plan ops as DataFrame operations

Re: [I] Support zero copy hash repartitioning for Hash Aggregate [datafusion]

2025-03-29 Thread via GitHub
goldmedal commented on issue #15383: URL: https://github.com/apache/datafusion/issues/15383#issuecomment-2763293892 @Dandandan I have a draft https://github.com/goldmedal/datafusion/pull/3 based on #15423 for `HashAggregate`. Could you check if it's heading in the right direction?

[PR] wip: decimal type support for to_timestamp [datafusion]

2025-03-29 Thread via GitHub
jatin510 opened a new pull request, #15486: URL: https://github.com/apache/datafusion/pull/15486 ## Which issue does this PR close? while working on this: https://github.com/apache/datafusion/issues/14612 i found out that, when we enable `parse_float_as_decimal` as `t

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-29 Thread via GitHub
goldmedal commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2763356410 I'm working on HashAggregate https://github.com/goldmedal/datafusion/pull/3 based on this PR. I found we shouldn't use only one config, `prefer_hash_selection_vector_partitionin

Re: [I] Duplicate Unqualified Field Name [datafusion]

2025-03-29 Thread via GitHub
LiaCastaneda commented on issue #14799: URL: https://github.com/apache/datafusion/issues/14799#issuecomment-2763312742 Thanks, this is actually an issue that happens specifically when using the substrait consumer, I'm closing in favour of #15439

Re: [I] Duplicate Unqualified Field Name [datafusion]

2025-03-29 Thread via GitHub
LiaCastaneda closed issue #14799: Duplicate Unqualified Field Name URL: https://github.com/apache/datafusion/issues/14799

Re: [I] `"UTMMedium"` field in the hits dataset causes a panic when performing SIMILAR TO. [datafusion]

2025-03-29 Thread via GitHub
acking-you commented on issue #15461: URL: https://github.com/apache/datafusion/issues/15461#issuecomment-2763365258 > I am using this as the dataset: [https://datasets.clickhouse.com/hits_compatible/athena_partitioned/[hits_1.parquet](https://datasets.clickhouse.com/hits_compatible/athena_p

Re: [PR] added fallback using reflection for backward-compatibility [datafusion-comet]

2025-03-29 Thread via GitHub
andygrove commented on code in PR #1573: URL: https://github.com/apache/datafusion-comet/pull/1573#discussion_r2019828133 ## spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimCometScanExec.scala: ## @@ -55,15 +55,48 @@ trait ShimCometScanExec { protected def isNee

Re: [I] `"UTMMedium"` field in the hits dataset causes a panic when performing SIMILAR TO. [datafusion]

2025-03-29 Thread via GitHub
psiayn commented on issue #15461: URL: https://github.com/apache/datafusion/issues/15461#issuecomment-2763338588 Hi, I was trying to reproduce this issue but I get a different error. I am using this as the dataset: https://datasets.clickhouse.com/hits_compatible/athena_partitioned/[hi

Re: [I] Dynamic pruning filters from TopK state [datafusion]

2025-03-29 Thread via GitHub
adriangb commented on issue #15037: URL: https://github.com/apache/datafusion/issues/15037#issuecomment-2763348634 wrt waiting for filter pushdown to be enabled by default, I think we're just making our lives harder by coupling them, especially since we can already test them together under

Re: [PR] fix: aggregation corner case [datafusion]

2025-03-29 Thread via GitHub
chenkovsky commented on PR #15457: URL: https://github.com/apache/datafusion/pull/15457#issuecomment-2763209390 > > count(*) actually doesnt depend on any column on input logically > > count(*) need to know the row number of the column, and it doesn't make sense to count all on "empty

Re: [I] Support for generating JSON formatted substrait plan [datafusion-python]

2025-03-29 Thread via GitHub
Vabs-108 commented on issue #508: URL: https://github.com/apache/datafusion-python/issues/508#issuecomment-2763377457 Is this issue still occurring? Does anyone want me to resolve it?

Re: [I] `"UTMMedium"` field in the hits dataset causes a panic when performing SIMILAR TO. [datafusion]

2025-03-29 Thread via GitHub
psiayn commented on issue #15461: URL: https://github.com/apache/datafusion/issues/15461#issuecomment-2763392522 Trying with this data I am not getting any error The dataset is 14gb https://github.com/user-attachments/assets/b777bcc1-198c-425b-9433-8299016d232c https:/

Re: [I] `"UTMMedium"` field in the hits dataset causes a panic when performing SIMILAR TO. [datafusion]

2025-03-29 Thread via GitHub
psiayn commented on issue #15461: URL: https://github.com/apache/datafusion/issues/15461#issuecomment-2763381508 Thank you, I will try with this

Re: [PR] Migrate datafusion/sql tests to insta [datafusion]

2025-03-29 Thread via GitHub
qstommyshu commented on PR #15484: URL: https://github.com/apache/datafusion/pull/15484#issuecomment-2763402403 Hi @alamb, @blaginin, @xudong963, The code changes are now done; please review carefully, as the changes are **LARGE**. There are several things to note when I

Re: [PR] added fallback using reflection for backward-compatibility [datafusion-comet]

2025-03-29 Thread via GitHub
andygrove commented on code in PR #1573: URL: https://github.com/apache/datafusion-comet/pull/1573#discussion_r2019823038 ## spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimCometScanExec.scala: ## @@ -55,15 +55,48 @@ trait ShimCometScanExec { protected def isNee

Re: [PR] added fallback using reflection for backward-compatibility [datafusion-comet]

2025-03-29 Thread via GitHub
codecov-commenter commented on PR #1573: URL: https://github.com/apache/datafusion-comet/pull/1573#issuecomment-2763618618 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1573?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-29 Thread via GitHub
zebsme commented on code in PR #15423: URL: https://github.com/apache/datafusion/pull/15423#discussion_r2019843158 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -316,6 +334,70 @@ impl BatchPartitioner { Ok((partition, batch))

Re: [PR] doc: fix quick-start executor command [datafusion-ballista]

2025-03-29 Thread via GitHub
milenkovicm merged PR #1217: URL: https://github.com/apache/datafusion-ballista/pull/1217

Re: [PR] doc: fix quick-start executor command [datafusion-ballista]

2025-03-29 Thread via GitHub
milenkovicm commented on PR #1217: URL: https://github.com/apache/datafusion-ballista/pull/1217#issuecomment-2763769459 Thanks @westhide

Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]

2025-03-29 Thread via GitHub
Dandandan commented on PR #15476: URL: https://github.com/apache/datafusion/pull/15476#issuecomment-2763855444 Thanks @ctsk

Re: [PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]

2025-03-29 Thread via GitHub
Dandandan merged PR #15476: URL: https://github.com/apache/datafusion/pull/15476

Re: [PR] Add short circuit evaluation for `AND` and `OR` [datafusion]

2025-03-29 Thread via GitHub
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2764093707 > Also, could you please add the new Q6 benchmark in a separate PR so I can more easily run my benchmark scripts before/after your code change? Okay, I got it. Do you mean tha

Re: [PR] Allow type coersion of zero input arrays to nullary [datafusion]

2025-03-29 Thread via GitHub
timsaucer commented on PR #15487: URL: https://github.com/apache/datafusion/pull/15487#issuecomment-2764095858 @jayzhan211 Would you mind reviewing, specifically the part in `datafusion/expr/src/type_coercion/functions.rs` since you were the prior author?

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-29 Thread via GitHub
goldmedal commented on code in PR #15423: URL: https://github.com/apache/datafusion/pull/15423#discussion_r2019935247 ## datafusion/sqllogictest/test_files/join.slt.part: ## @@ -1389,6 +1389,112 @@ physical_plan 14)--FilterExec: y@1 = x@0 15)---

Re: [I] `physical-expr`: Nullability of `Literal` is not determined by surrounding context [datafusion]

2025-03-29 Thread via GitHub
Omega359 commented on issue #15394: URL: https://github.com/apache/datafusion/issues/15394#issuecomment-2764244381 I've been trying to find where to resolve this issue but my understanding of the core of DF is currently too limited to uncover a solution. I've created a test though that exhi

[PR] feat(sql): add diagnostic for wrong number of function arguments [datafusion]

2025-03-29 Thread via GitHub
prowang01 opened a new pull request, #15490: URL: https://github.com/apache/datafusion/pull/15490 ## Which issue does this PR close? Closes #14432 Relates to #14429 ## Rationale for this change This PR adds a user-facing diagnostic when a SQL function is called with