Re: [I] Simplify Filter Pushdown APIs for Better Maintainability and Developer Experience [datafusion]

2025-06-22 Thread via GitHub
xudong963 commented on issue #16188: URL: https://github.com/apache/datafusion/issues/16188#issuecomment-2995081666 FYI, I added more doc for the related code based on my understanding: https://github.com/apache/datafusion/pull/16504/ -- This is an automated message from the Apache Git Se

Re: [PR] feat: supports array_distinct [datafusion-comet]

2025-06-22 Thread via GitHub
codecov-commenter commented on PR #1923: URL: https://github.com/apache/datafusion-comet/pull/1923#issuecomment-2995032965 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1923?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [I] Simplify Filter Pushdown APIs for Better Maintainability and Developer Experience [datafusion]

2025-06-22 Thread via GitHub
xudong963 commented on issue #16188: URL: https://github.com/apache/datafusion/issues/16188#issuecomment-2994977304 @adriangb @alamb Thank you, now I understand the background about the filter pushdown in logical & physical phases. > The main con of having TableProvider attach the

[PR] feat: supports array_distinct [datafusion-comet]

2025-06-22 Thread via GitHub
drexler-sky opened a new pull request, #1923: URL: https://github.com/apache/datafusion-comet/pull/1923 ## Which issue does this PR close? Closes #. ## Rationale for this change adds support for array_distinct ## What changes are included in this PR?

Re: [PR] fix: SortMergeJoin for timestamp keys [datafusion-comet]

2025-06-22 Thread via GitHub
SKY-ALIN commented on PR #1901: URL: https://github.com/apache/datafusion-comet/pull/1901#issuecomment-2994779058 @mbutrovich done. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific c

Re: [PR] Support Null aware anti join by HashJoin [datafusion]

2025-06-22 Thread via GitHub
github-actions[bot] commented on PR #10584: URL: https://github.com/apache/datafusion/pull/10584#issuecomment-2994692091 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] Perform type coercion for corr aggregate function during physical planning [datafusion]

2025-06-22 Thread via GitHub
alamb commented on code in PR #15776: URL: https://github.com/apache/datafusion/pull/15776#discussion_r2160311322 ## datafusion/core/src/physical_planner.rs: ## @@ -1598,8 +1600,16 @@ pub fn create_aggregate_expr_with_name_and_maybe_filter( physical_name(e)?

Re: [PR] Temporarily fix bug in dynamic top-k optimization [datafusion]

2025-06-22 Thread via GitHub
alamb commented on PR #16465: URL: https://github.com/apache/datafusion/pull/16465#issuecomment-2994205944 We think we have fixed the real cause in - https://github.com/apache/datafusion/pull/16491 Here is a PR to restore the related code in - https://github.com/apache/datafusio

Re: [PR] re-enable `sort_query_fuzzer_runner` [datafusion]

2025-06-22 Thread via GitHub
alamb commented on PR #16491: URL: https://github.com/apache/datafusion/pull/16491#issuecomment-2994130421 > LGTM. IDK if there's a precedence to formatting the error case as an empty string elsewhere in Datafusion. It seems like `format_option!` uses `"NULL"` as sort of a display sentinel

Re: [PR] Restore topk filtering tests [datafusion]

2025-06-22 Thread via GitHub
alamb commented on PR #16501: URL: https://github.com/apache/datafusion/pull/16501#issuecomment-2994232687 Thanks @adriangb ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] Chore: implement predicate exprs as ScalarUDFImpl [datafusion-comet]

2025-06-22 Thread via GitHub
kazantsev-maksim commented on code in PR #1864: URL: https://github.com/apache/datafusion-comet/pull/1864#discussion_r2160393687 ## spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: ## @@ -952,32 +947,23 @@ object QueryPlanSerde extends Logging with CometExprShim

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-22 Thread via GitHub
ctsk commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2994387124 @Dandandan @2010YOUY01 Both of those snippets are nor covering for ByteViewArrays. Is that a bug? -- This is an automated message from the Apache Git Service. To respond to the messag

[I] Dataframe doesn't properly implement ArrowStream export interface [datafusion-python]

2025-06-22 Thread via GitHub
johnnyg opened a new issue, #1166: URL: https://github.com/apache/datafusion-python/issues/1166 **Describe the bug** When trying to pass a dataframe to another library that expects an [ArrowStream export interface](https://arrow.apache.org/docs/format/CDataInterface/PyCapsuleInterface.ht

Re: [PR] fix: Add overflow check for SumDecimalGroupsAccumulator::evaluate [datafusion-comet]

2025-06-22 Thread via GitHub
codecov-commenter commented on PR #1922: URL: https://github.com/apache/datafusion-comet/pull/1922#issuecomment-2994535649 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1922?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-22 Thread via GitHub
ctsk commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2994386331 @Dandandan I believe that that heuristic does not make sense in this context. The reason why the gc is introduced here is mainly to reduce the size of the data buffer vector of StringVi

[PR] Add support for Arrow Duration type in Substrait [datafusion]

2025-06-22 Thread via GitHub
jkosh44 opened a new pull request, #16503: URL: https://github.com/apache/datafusion/pull/16503 ## Which issue does this PR close? - Closes #16285. ## Rationale for this change This commit adds support for Arrow Duration types in Substrait plans. Substrait has no equival

Re: [PR] Restore topk filtering tests [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16501: URL: https://github.com/apache/datafusion/pull/16501#issuecomment-2994331327 So from my investigation what I *think* is happening is that https://github.com/apache/datafusion/pull/15770 fundamentally converted the TopK operation from being isolated per parti

Re: [I] Add a case for reading map_entries in the `read basic complex types` test [datafusion-comet]

2025-06-22 Thread via GitHub
comphead commented on issue #1916: URL: https://github.com/apache/datafusion-comet/issues/1916#issuecomment-2994364892 I'll take it @parthchandra as already working on other MAP functions -- This is an automated message from the Apache Git Service. To respond to the message, please log on

[PR] datafusion-cli: Use correct S3 region if it is not specified [datafusion]

2025-06-22 Thread via GitHub
liamzwbao opened a new pull request, #16502: URL: https://github.com/apache/datafusion/pull/16502 ## Which issue does this PR close? - Closes #16306. ## Rationale for this change ## What changes are included in this PR? ## Are these changes

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994216351 @kosiew I'm curious what you think about this. Would it be possible to implement the nested struct imputation work you're doing with this approach? -- This is an automated message

Re: [PR] Simplify predicates in `PushDownFilter` optimizer rule [datafusion]

2025-06-22 Thread via GitHub
alamb commented on code in PR #16362: URL: https://github.com/apache/datafusion/pull/16362#discussion_r2160308484 ## datafusion/optimizer/src/simplify_predicates.rs: ## @@ -0,0 +1,194 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor licen

Re: [PR] use 'lit' as the field name for literal values [datafusion]

2025-06-22 Thread via GitHub
alamb commented on code in PR #16498: URL: https://github.com/apache/datafusion/pull/16498#discussion_r2160359663 ## datafusion/physical-expr/src/expressions/literal.rs: ## @@ -66,8 +66,7 @@ impl Literal { value: ScalarValue, metadata: Option, ) -> Self {

Re: [PR] Perform type coercion for corr aggregate function during physical planning [datafusion]

2025-06-22 Thread via GitHub
alamb commented on code in PR #15776: URL: https://github.com/apache/datafusion/pull/15776#discussion_r2160311996 ## datafusion/functions-aggregate/src/correlation.rs: ## @@ -372,10 +377,8 @@ impl GroupsAccumulator for CorrelationGroupsAccumulator { self.sum_xx.resize(t

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994242282 Just for fun I opened https://github.com/pydantic/datafusion/pull/31 to see how hard it would be to incorporate https://github.com/apache/datafusion/issues/15780#issuecomment-282471

Re: [PR] Fix `limit` in subqueries [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
alamb commented on PR #1899: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1899#issuecomment-2994195905 > @iffyio Thanks for merging. @alamb Could this fix be backported to 0.56, or will DataFusion move straight to 0.57? DataFusion hasn't updated to 0.56 yet - https:

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16461: URL: https://github.com/apache/datafusion/pull/16461#issuecomment-2994214075 My hopes with this work is that we can: - Make it easier to do further optimizations like https://github.com/apache/datafusion/issues/15780#issuecomment-2824716928 - Replace th

Re: [PR] Fix `limit` in subqueries [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
iffyio commented on PR #1899: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1899#issuecomment-2994227284 Ah yeah a new release candidate sounds reasonable given 0.57 isn't out yet -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [PR] Restore topk filtering tests [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16501: URL: https://github.com/apache/datafusion/pull/16501#issuecomment-2994226988 Sadly the 1200 run still reports failures 😭 I feel like @AdamGS 's original intuition that it's something about sort stabilit with nulls is correct. I'll see if I can find a f

Re: [PR] Restore topk filtering tests [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16501: URL: https://github.com/apache/datafusion/pull/16501#issuecomment-2994222512 We have the test in https://github.com/apache/datafusion/pull/16465/files#diff-f38cac7a9ac55c93d71632c96d6d2afa219cfb07351125a349099c86df859446 which seems to be passing. I'm runnin

Re: [PR] use 'lit' as the field name for literal values [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16498: URL: https://github.com/apache/datafusion/pull/16498#issuecomment-2994218318 Thanks folks. One more thought: I wonder if this improves performance by some small fraction given that I could see `lit(..)` being called in hot loops in some place... Anywa

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-22 Thread via GitHub
jonathanc-n commented on code in PR #16443: URL: https://github.com/apache/datafusion/pull/16443#discussion_r2160347506 ## datafusion/physical-plan/src/joins/nested_loop_join.rs: ## @@ -828,13 +833,127 @@ impl NestedLoopJoinStream { handle_state!(self.proces

Re: [I] SortQueryFuzzer found a failing case on main [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on issue #16452: URL: https://github.com/apache/datafusion/issues/16452#issuecomment-2994208242 > We revert the code changes in Temporarily fix bug in dynamic top-k optimizationĀ #16465 that were unrelated. I'll make a PR to do that (update Restore topk filtering testsĀ #16

Re: [PR] refactor reassign_predicate_columns to accept an &Schema instead of &Arc [datafusion]

2025-06-22 Thread via GitHub
adriangb merged PR #16499: URL: https://github.com/apache/datafusion/pull/16499 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [PR] re-enable `sort_query_fuzzer_runner` [datafusion]

2025-06-22 Thread via GitHub
adriangb commented on PR #16491: URL: https://github.com/apache/datafusion/pull/16491#issuecomment-2994196260 Should we go ahead and merge and get the test running again (or find out quickly with the many CI runs it's still broken)? Or do you want to wait for your local testing? -- This

Re: [PR] re-enable `sort_query_fuzzer_runner` [datafusion]

2025-06-22 Thread via GitHub
alamb commented on PR #16491: URL: https://github.com/apache/datafusion/pull/16491#issuecomment-2994201911 > Should we go ahead and merge and get the test running again (or find out quickly with the many CI runs it's still broken)? Or do you want to wait for your local testing? I thi

[PR] Alamb/revert fix [datafusion]

2025-06-22 Thread via GitHub
alamb opened a new pull request, #16501: URL: https://github.com/apache/datafusion/pull/16501 ## Which issue does this PR close? - Closes #. ## Rationale for this change - @AdamGS removed some of the code here https://github.com/apache/datafusion/pull/16465 -

Re: [PR] re-enable `sort_query_fuzzer_runner` [datafusion]

2025-06-22 Thread via GitHub
alamb merged PR #16491: URL: https://github.com/apache/datafusion/pull/16491 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] re-enable `sort_query_fuzzer_runner` [datafusion]

2025-06-22 Thread via GitHub
alamb commented on PR #16491: URL: https://github.com/apache/datafusion/pull/16491#issuecomment-2994201964 Thank you @adriangb -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific commen

Re: [PR] [datafusion-spark] Implement `factorical` function [datafusion]

2025-06-22 Thread via GitHub
alamb commented on code in PR #16125: URL: https://github.com/apache/datafusion/pull/16125#discussion_r2160337528 ## datafusion/spark/src/function/math/factorial.rs: ## @@ -0,0 +1,195 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor licen

Re: [PR] Fix `limit` in subqueries [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
iffyio merged PR #1899: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1899 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr

Re: [PR] Perform type coercion for corr aggregate function during physical planning [datafusion]

2025-06-22 Thread via GitHub
alamb commented on PR #15776: URL: https://github.com/apache/datafusion/pull/15776#issuecomment-2994144320 I am sorry for the delayed review -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the sp

[PR] Add support for LANGUAGE clause in CREATE PROCEDURE [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
ZacJW opened a new pull request, #1903: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1903 sqlparser-rs doesn't currently support the `LANGUAGE` clause in `CREATE PROCEDURE` ([see grammar](https://jakewheat.github.io/sql-overview/sql-2016-foundation-grammar.html#routine-charac

Re: [PR] Support for Map values in ClickHouse settings [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
solontsev commented on code in PR #1896: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1896#discussion_r2160273062 ## src/parser/mod.rs: ## @@ -3156,18 +3156,21 @@ impl<'a> Parser<'a> { /// ``` /// /// [dictionary]: https://duckdb.org/docs/sql/data

Re: [PR] Make `SessionContext::register_parquet` obey `collect_statistics` config [datafusion]

2025-06-22 Thread via GitHub
alamb commented on PR #16080: URL: https://github.com/apache/datafusion/pull/16080#issuecomment-2994126025 > Yep that's what I meant. Personally I think either default is justifiable, but before this PR there were 2 different defaults depending on your usage. But having a single default is

Re: [PR] use 'lit' as the field name for literal values [datafusion]

2025-06-22 Thread via GitHub
alamb commented on PR #16498: URL: https://github.com/apache/datafusion/pull/16498#issuecomment-2994123402 FYI @timsaucer -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. T

Re: [PR] limit intermediate batch size in nested_loop_join [datafusion]

2025-06-22 Thread via GitHub
UBarney commented on PR #16443: URL: https://github.com/apache/datafusion/pull/16443#issuecomment-2994063902 > * `apply_join_filter_to_indices` Showed a reduction in execution time (sample count reduced from 528million to 241million). The benchmark results indicate that restricting th

Re: [PR] Support for Map values in ClickHouse settings [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
solontsev commented on code in PR #1896: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1896#discussion_r2160273062 ## src/parser/mod.rs: ## @@ -3156,18 +3156,21 @@ impl<'a> Parser<'a> { /// ``` /// /// [dictionary]: https://duckdb.org/docs/sql/data

Re: [PR] Fix `limit` in subqueries [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
Dimchikkk commented on PR #1899: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1899#issuecomment-2994075201 @iffyio Thanks for merging. @alamb Could this fix be backported to 0.56, or will DataFusion move straight to 0.57? -- This is an automated message from the Apache

[I] Improving support for `CREATE PROCEDURE` [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
ZacJW opened a new issue, #1902: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1902 sqlparser-rs currently doesn't have great support for `CREATE PROCEDURE` statements, leading to parser errors when I pass in valid SQL. I'd like to work on improving this (I've already starte

Re: [PR] Support procedure argmode [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
ZacJW commented on PR #1901: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1901#issuecomment-2994066217 Just found the link to in the README, and this implementation is still valid. -- This is an a

[PR] Support procedure argmode [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
ZacJW opened a new pull request, #1901: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1901 Currently there's no support for argmodes (`IN`, `OUT`, `INOUT`) in `CREATE PROCEDURE` statements despite the standard (I'm reading a [BNF for ISO/IEC 9075-2:2003](https://ronsavage.gith

Re: [PR] Use `IndexColumn` in all index definitions [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
iffyio commented on code in PR #1900: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1900#discussion_r2160246517 ## src/parser/mod.rs: ## @@ -10595,6 +10593,16 @@ impl<'a> Parser<'a> { self.parse_parenthesized_column_list_inner(optional, allow_empty, |p|

Re: [I] Parse error for `limit` in subqueries [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
iffyio closed issue #1898: Parse error for `limit` in subqueries URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1898 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific commen

[PR] chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics` [datafusion]

2025-06-22 Thread via GitHub
Samyak2 opened a new pull request, #16500: URL: https://github.com/apache/datafusion/pull/16500 ## Which issue does this PR close? Closes #16495 ## Rationale for this change ## What changes are included in this PR? Add a `BaselineMetrics` an

Re: [PR] Support for Map values in ClickHouse settings [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
iffyio commented on code in PR #1896: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1896#discussion_r2160244131 ## src/parser/mod.rs: ## @@ -3156,18 +3156,21 @@ impl<'a> Parser<'a> { /// ``` /// /// [dictionary]: https://duckdb.org/docs/sql/data_ty

Re: [PR] Postgres: Add support for text search types [datafusion-sqlparser-rs]

2025-06-22 Thread via GitHub
iffyio merged PR #1889: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1889 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr