Re: [PR] Prepare for `0.57.0` release [datafusion-sqlparser-rs]

2025-06-13 Thread via GitHub
alamb merged PR #1885: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1885 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr.

Re: [PR] Reuse alias if possible [datafusion]

2025-06-13 Thread via GitHub
github-actions[bot] closed pull request #14781: Reuse alias if possible URL: https://github.com/apache/datafusion/pull/14781

Re: [PR] feat: Fix multi-lines printing issue for datafusion-cli and add the streaming printing feature back [datafusion]

2025-06-13 Thread via GitHub
github-actions[bot] closed pull request #14954: feat: Fix multi-lines printing issue for datafusion-cli and add the streaming printing feature back URL: https://github.com/apache/datafusion/pull/14954

Re: [PR] chore(deps): bump rand_distr from 0.4.3 to 0.5.1 [datafusion]

2025-06-13 Thread via GitHub
dependabot[bot] commented on PR #14807: URL: https://github.com/apache/datafusion/pull/14807#issuecomment-2972121901 OK, I won't notify you again about this release, but will get in touch when a new version is available. If you'd rather skip all updates until the next major or minor version

Re: [PR] Add hook for sharing join state in distributed execution [datafusion]

2025-06-13 Thread via GitHub
github-actions[bot] commented on PR #12523: URL: https://github.com/apache/datafusion/pull/12523#issuecomment-2972121947 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] refactor: use TypeSignature::Coercible for math functions [datafusion]

2025-06-13 Thread via GitHub
github-actions[bot] commented on PR #14872: URL: https://github.com/apache/datafusion/pull/14872#issuecomment-2972121717 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] Draft: Parse literal to different types [datafusion]

2025-06-13 Thread via GitHub
github-actions[bot] closed pull request #15202: Draft: Parse literal to different types URL: https://github.com/apache/datafusion/pull/15202

Re: [PR] Fix logo in rust API docs [datafusion]

2025-06-13 Thread via GitHub
github-actions[bot] closed pull request #14989: Fix logo in rust API docs URL: https://github.com/apache/datafusion/pull/14989

Re: [PR] chore(deps): bump rand_distr from 0.4.3 to 0.5.1 [datafusion]

2025-06-13 Thread via GitHub
github-actions[bot] closed pull request #14807: chore(deps): bump rand_distr from 0.4.3 to 0.5.1 URL: https://github.com/apache/datafusion/pull/14807

Re: [PR] WIP: User defined sorting [datafusion]

2025-06-13 Thread via GitHub
github-actions[bot] closed pull request #15106: WIP: User defined sorting URL: https://github.com/apache/datafusion/pull/15106

[I] decimal calculate overflow but not throw error [datafusion]

2025-06-13 Thread via GitHub
mmooyyii opened a new issue, #16406: URL: https://github.com/apache/datafusion/issues/16406 ### Describe the bug 1. make test csv ``` import csv import random import decimal random.seed(42) def make_big_random_decimal(): n = random.randint(1, 1 << 5
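The reproduction in this snippet is cut off, so the following is only a minimal sketch of the kind of check the issue title asks for, not the code from the issue and not DataFusion's implementation. It assumes a 128-bit decimal type limited to 38 digits of precision; `MAX_PRECISION`, `fits_precision`, and `checked_mul` are hypothetical names used for illustration.

```python
# Assumption: a 128-bit decimal stores an unscaled integer of at most 38 digits.
MAX_PRECISION = 38


def fits_precision(unscaled: int, precision: int = MAX_PRECISION) -> bool:
    """Return True if the unscaled integer has at most `precision` digits."""
    return abs(unscaled) < 10 ** precision


def checked_mul(a: int, b: int, precision: int = MAX_PRECISION) -> int:
    """Multiply two unscaled decimal values, raising instead of silently overflowing."""
    result = a * b  # Python integers are arbitrary precision, so this never wraps
    if not fits_precision(result, precision):
        raise OverflowError(
            f"result needs more than {precision} digits of precision"
        )
    return result
```

Working on the unscaled integers keeps the check independent of Python's `decimal` context settings; a real engine would also have to account for the result scale.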

Re: [I] Improved experience when remote object store URL does not end in `/` [datafusion]

2025-06-13 Thread via GitHub
xiedeyantu commented on issue #16302: URL: https://github.com/apache/datafusion/issues/16302#issuecomment-2971964565 > Thanks [@xiedeyantu](https://github.com/xiedeyantu) -- I'll try and review it shortly Thanks a lot! @alamb

Re: [PR] chore: Implement date_trunc as ScalarUDFImpl [datafusion-comet]

2025-06-13 Thread via GitHub
codecov-commenter commented on PR #1880: URL: https://github.com/apache/datafusion-comet/pull/1880#issuecomment-2971927127 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1880?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] chore: refactor planner read schema tests [datafusion-comet]

2025-06-13 Thread via GitHub
comphead merged PR #1886: URL: https://github.com/apache/datafusion-comet/pull/1886

Re: [PR] chore: refactor planner read schema tests [datafusion-comet]

2025-06-13 Thread via GitHub
comphead commented on PR #1886: URL: https://github.com/apache/datafusion-comet/pull/1886#issuecomment-2971915136 Thanks for the review

Re: [PR] feat: support RangePartitioning with native shuffle [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra commented on code in PR #1862: URL: https://github.com/apache/datafusion-comet/pull/1862#discussion_r2146243533 ## common/src/main/scala/org/apache/comet/CometConf.scala: ## @@ -307,6 +307,18 @@ object CometConf extends ShimCometConf { .booleanConf .cr

Re: [PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra merged PR #1883: URL: https://github.com/apache/datafusion-comet/pull/1883

Re: [PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra commented on code in PR #1883: URL: https://github.com/apache/datafusion-comet/pull/1883#discussion_r2146241641 ## common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java: ## @@ -533,13 +533,20 @@ private StructType getSparkSchemaByFieldId( return

Re: [PR] feat: support RangePartitioning with native shuffle [datafusion-comet]

2025-06-13 Thread via GitHub
mbutrovich commented on code in PR #1862: URL: https://github.com/apache/datafusion-comet/pull/1862#discussion_r2146241932 ## common/src/main/scala/org/apache/comet/CometConf.scala: ## @@ -307,6 +307,18 @@ object CometConf extends ShimCometConf { .booleanConf .crea

Re: [PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra commented on PR #1883: URL: https://github.com/apache/datafusion-comet/pull/1883#issuecomment-2971907817 Merged. Thanks for the reviews @andygrove @comphead

Re: [PR] feat: support RangePartitioning with native shuffle [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra commented on code in PR #1862: URL: https://github.com/apache/datafusion-comet/pull/1862#discussion_r2146236444 ## spark/src/test/scala/org/apache/comet/exec/CometNativeShuffleSuite.scala: ## @@ -120,29 +120,51 @@ class CometNativeShuffleSuite extends CometTestBase

Re: [I] Evaluate filter pushdown against the physical schema for performance and correctness [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on issue #15780: URL: https://github.com/apache/datafusion/issues/15780#issuecomment-2971805394 @alamb I tried to put together an example of schema evolution where the file had an Int32 column at the file schema level and the table has it as Int64, I can see the extra conv

Re: [PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
comphead commented on code in PR #1883: URL: https://github.com/apache/datafusion-comet/pull/1883#discussion_r2146172373 ## common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java: ## @@ -533,13 +533,20 @@ private StructType getSparkSchemaByFieldId( return newS

Re: [PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra commented on code in PR #1883: URL: https://github.com/apache/datafusion-comet/pull/1883#discussion_r2146162601 ## common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java: ## @@ -533,13 +533,20 @@ private StructType getSparkSchemaByFieldId( return

Re: [PR] Chore: implement predicate exprs as ScalarUDFImpl [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra commented on PR #1864: URL: https://github.com/apache/datafusion-comet/pull/1864#issuecomment-2971784129 CI is failing because in (`iceberg_compat`)`initRecordBatchReader` we call `planner.createExpr` for predicates that are pushed down and the expressions are no longer there

Re: [PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
comphead commented on code in PR #1883: URL: https://github.com/apache/datafusion-comet/pull/1883#discussion_r2146142132 ## common/src/main/java/org/apache/comet/parquet/NativeBatchReader.java: ## @@ -533,13 +533,20 @@ private StructType getSparkSchemaByFieldId( return newS

Re: [PR] chore: refactor planner read schema tests [datafusion-comet]

2025-06-13 Thread via GitHub
codecov-commenter commented on PR #1886: URL: https://github.com/apache/datafusion-comet/pull/1886#issuecomment-2971750960 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1886?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] feat: pass ignore_nulls flag to first and last [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra commented on PR #1866: URL: https://github.com/apache/datafusion-comet/pull/1866#issuecomment-2971722751 @andygrove perhaps we can merge this while we wait for the tests to be made more accurate?

Re: [PR] fix: Fixed error handling for `generate_series/range` [datafusion]

2025-06-13 Thread via GitHub
jonathanc-n commented on PR #16391: URL: https://github.com/apache/datafusion/pull/16391#issuecomment-2971722156 Fixed @alamb! Should be good now

[PR] chore: refactor planner read schema tests [datafusion-comet]

2025-06-13 Thread via GitHub
comphead opened a new pull request, #1886: URL: https://github.com/apache/datafusion-comet/pull/1886 ## Which issue does this PR close? Closes #. ## Rationale for this change Refactor planner read tests to improve readability and factor out reusable code #

Re: [PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra commented on code in PR #1883: URL: https://github.com/apache/datafusion-comet/pull/1883#discussion_r2146079262 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1745,6 +1746,77 @@ abstract class ParquetReadSuite extends CometTestBase {

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971653072 @alamb I opened https://github.com/pydantic/datafusion/pull/30 to explore the idea of having two pushdown phases. It's not complete (some failing tests, some TODOs) but I think it c

Re: [I] Release sqlparser-rs version `0.57.0` around 2024-06-15 [datafusion-sqlparser-rs]

2025-06-13 Thread via GitHub
alamb commented on issue #1837: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1837#issuecomment-2971597878 Ok, I have a PR up with the changelog and version bump: https://github.com/apache/datafusion-sqlparser-rs/pull/1885

[I] Upgrade to sqlparser 0.56.0 [datafusion]

2025-06-13 Thread via GitHub
alamb opened a new issue, #16405: URL: https://github.com/apache/datafusion/issues/16405 ### Is your feature request related to a problem or challenge? _No response_ ### Describe the solution you'd like _No response_ ### Describe alternatives you've considered

[PR] Remove some clones [datafusion]

2025-06-13 Thread via GitHub
simonvandel opened a new pull request, #16404: URL: https://github.com/apache/datafusion/pull/16404 ## Which issue does this PR close? - Closes #. ## Rationale for this change Mostly drive-by changes. I don't think they have much impact on performance.

Re: [PR] fix typo in test file name [datafusion]

2025-06-13 Thread via GitHub
alamb merged PR #16403: URL: https://github.com/apache/datafusion/pull/16403

Re: [PR] chore: Enable Spark SQL tests for auto scan mode [WIP] [datafusion-comet]

2025-06-13 Thread via GitHub
codecov-commenter commented on PR #1885: URL: https://github.com/apache/datafusion-comet/pull/1885#issuecomment-2971547836 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1885?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

[I] Release sqlparser-rs version `0.58.0` around 2024-08-15 [datafusion-sqlparser-rs]

2025-06-13 Thread via GitHub
alamb opened a new issue, #1886: URL: https://github.com/apache/datafusion-sqlparser-rs/issues/1886 Follow on to - https://github.com/apache/datafusion-sqlparser-rs/issues/1837 This ticket tracks creating the next sqlparser release (mostly so others can follow along) **Targ

[PR] Prepare for `0.57.0` release [datafusion-sqlparser-rs]

2025-06-13 Thread via GitHub
alamb opened a new pull request, #1885: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1885 - Part of https://github.com/apache/datafusion-sqlparser-rs/issues/1837 Changes: 1. Generate CHANGELOG 2. Update version

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971539472 > It could (and does currently) handle push down of dynamic filters, the issue is that it cannot be run after EnforceSorting because EnforceSorting and EnforceDistribution need to be r

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971532504 Okay then I'll file a ticket for the multi-column sort and the display. But I do think we should hash out https://github.com/apache/datafusion/pull/15770#issuecomment-2971441638 in

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971531318 > I don't understand why FilterPushdown can't also push down DynamicFilters if it was run after EnforceSorting but I vaguely remember it being discussed and rejected before It

Re: [PR] feat: support RangePartitioning with native shuffle [datafusion-comet]

2025-06-13 Thread via GitHub
andygrove commented on code in PR #1862: URL: https://github.com/apache/datafusion-comet/pull/1862#discussion_r2146006687 ## dev/diffs/3.4.3.diff: ## @@ -2404,7 +2411,31 @@ index 266bb343526..c3e3d155813 100644 checkAnswer(aggDF, df1.groupBy("j").agg(max("k"))) }

Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub
alamb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971519773 > Could it be that in that test we don't have file statistics (`datafusion.execution.collect_statistics = false`) -> the pruning is happening at the row group level? ```she

Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub
alamb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971520273 (basically I want to be able to see from statistics when the dynamic filters are helping / not helping)

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971510953 > 1. Multi-column order by not working. This seems like a bug / oversight. Fundamentally I don't see any reason it shouldn't work, I'll have to investigate why. I suggest we fil

[PR] chore: Enable Spark SQL tests for auto scan mode [WIP] [datafusion-comet]

2025-06-13 Thread via GitHub
andygrove opened a new pull request, #1885: URL: https://github.com/apache/datafusion-comet/pull/1885 ## Which issue does this PR close? Closes #. ## Rationale for this change ## What changes are included in this PR? ## How are these changes

Re: [PR] Blog: Optimizing SQL and DataFrames [datafusion-site]

2025-06-13 Thread via GitHub
alamb commented on code in PR #74: URL: https://github.com/apache/datafusion-site/pull/74#discussion_r2145976359 ## content/blog/2025-06-15-optimizing-sql-dataframes-part-one.md: ## @@ -0,0 +1,250 @@ +--- +layout: post +title: Optimizing SQL (and DataFrames) in DataFusion, Part

Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971481867 Could it be that in that test we don't have file statistics (`datafusion.execution.collect_statistics = false`) -> the pruning is happening at the row group level?

Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub
alamb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971469382 > Hmm maybe we aren't including that statistic in the output? I think everything that is non zero is included. I'll have to look into it some more

Re: [PR] feat: Add experimental auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-06-13 Thread via GitHub
andygrove merged PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747

Re: [PR] Add design process section to the docs [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #16397: URL: https://github.com/apache/datafusion/pull/16397#issuecomment-2971466770 > Relevant question to this text and the project is what the project's stance is wrt API stability? Merging fast means you're likely to ship something a little bit too quickly every no

Re: [PR] feat: Add experimental auto mode for `COMET_PARQUET_SCAN_IMPL` [datafusion-comet]

2025-06-13 Thread via GitHub
andygrove commented on PR #1747: URL: https://github.com/apache/datafusion-comet/pull/1747#issuecomment-2971468514 Thanks for the review @parthchandra. I added the logging.

Re: [PR] Add design process section to the docs [datafusion]

2025-06-13 Thread via GitHub
pepijnve commented on PR #16397: URL: https://github.com/apache/datafusion/pull/16397#issuecomment-2971445102 Sorry to go a bit off topic for a sec, but there's some context I would like to add. I worked on API design of a commercial software library with tons of extension points for 10+ ye

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971441638 > We might be able to split the filter pushdown into two steps: static (cannot make assumptions about reference links but can modify the plan tree, e.g. for FilterExec) and dynamic

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145924185 ## datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs: ## @@ -114,6 +118,18 @@ fn pushdown_sorts_helper( sort_push_down.data.fetch =

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145917211 ## datafusion/core/tests/physical_optimizer/filter_pushdown/mod.rs: ## @@ -346,6 +359,137 @@ fn test_node_handles_child_pushdown_result() { ); } +#[tokio:

[PR] fix typo in test file name [datafusion]

2025-06-13 Thread via GitHub
adriangb opened a new pull request, #16403: URL: https://github.com/apache/datafusion/pull/16403 (no comment)

Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971392142 > > cover that? > > Yes 🤦 > > For some reason it doesn't show up for me in the explain analyze I have: [q25-analyze-topk-dynamic-filter.txt](https://github.com/use

Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub
alamb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971390765 > cover that? Yes 🤦 For some reason it doesn't show up for me in the explain analyze I have: [q25-analyze-topk-dynamic-filter.txt](https://github.com/user-attachment

Re: [PR] Add design process section to the docs [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #16397: URL: https://github.com/apache/datafusion/pull/16397#issuecomment-2971392100 > This is really nice, thanks @alamb! Thanks -- I was just channeling @ozankabak :)

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971391046 Thanks for the review @alamb! I'll try to summarize the high level issues: 1. Multi-column order by not working. This seems like a bug / oversight. Fundamentally I don't se

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
alamb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145903411 ## datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs: ## @@ -70,6 +71,8 @@ pub fn assign_initial_requirements(sort_push_down: &mut SortPushDown) {

[I] Support map lookup by key operation [datafusion-comet]

2025-06-13 Thread via GitHub
comphead opened a new issue, #1884: URL: https://github.com/apache/datafusion-comet/issues/1884 ### What is the problem the feature request solves? Currently test cannot be run in native mode ``` test("test lookup map by a key") { withSQLConf( CometConf.COM

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145900061 ## datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs: ## @@ -114,6 +118,18 @@ fn pushdown_sorts_helper( sort_push_down.data.fetch =

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145895957 ## datafusion/sqllogictest/test_files/parquet_filter_pushdown.slt: ## @@ -246,38 +246,3 @@ physical_plan 02)--FilterExec: val@0 != part@1 03)RepartitionExec

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145894014 ## datafusion/physical-optimizer/src/enforce_sorting/sort_pushdown.rs: ## @@ -70,6 +71,8 @@ pub fn assign_initial_requirements(sort_push_down: &mut SortPushDown)

Re: [PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
codecov-commenter commented on PR #1883: URL: https://github.com/apache/datafusion-comet/pull/1883#issuecomment-2971338111 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1883?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
alamb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145790872 ## datafusion/core/tests/fuzz_cases/topk_filter_pushdown.rs: ## @@ -0,0 +1,387 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contribu

Re: [PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
andygrove commented on code in PR #1883: URL: https://github.com/apache/datafusion-comet/pull/1883#discussion_r2145878327 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1745,6 +1746,77 @@ abstract class ParquetReadSuite extends CometTestBase {

Re: [I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on issue #16402: URL: https://github.com/apache/datafusion/issues/16402#issuecomment-2971321730 Doesn't https://github.com/apache/datafusion/blob/4dd6923787084548c9ecc6d90c630c2c28ee9259/datafusion/datasource-parquet/src/metrics.rs#L30-L33 cover that?

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971297789 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking/blob/main/gh_compare_branch.sh) Running Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubun

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971305067 🤖: Benchmark completed Details ``` Comparing HEAD and topk-dynamic-filters Benchmark sort_tpch.json ┏━━

Re: [PR] Add fast paths for try_process_unnest [datafusion]

2025-06-13 Thread via GitHub
simonvandel commented on code in PR #16389: URL: https://github.com/apache/datafusion/pull/16389#discussion_r2145802029 ## datafusion/sql/src/select.rs: ## @@ -374,6 +383,14 @@ impl SqlToRel<'_, S> { fn try_process_aggregate_unnest(&self, input: LogicalPlan) -> Result {

[I] Add statistics to ParquetExec for *files* pruned [datafusion]

2025-06-13 Thread via GitHub
alamb opened a new issue, #16402: URL: https://github.com/apache/datafusion/issues/16402 ### Is your feature request related to a problem or challenge? - This is a follow on to the feature added by @adriangb in https://github.com/apache/datafusion/pull/16014 @adriangb added th

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971198838 I also filed a ticket to add a metric that we can use to see when file pruning is working: - https://github.com/apache/datafusion/issues/16402

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971173365 > > QQuery 25│ 380.03 ms │279.23 ms │ +1.36x faster > > ```sql > SELECT SearchPhrase FROM hits WHERE SearchPhrase <> '' ORDER BY SearchPhrase LIMIT 10; >
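For readers following the TopK thread, here is a rough sketch of why an `ORDER BY ... LIMIT k` query like the one quoted above can benefit from a dynamic filter. It is an illustration under stated assumptions, not the PR's implementation: `top_k_with_dynamic_filter` is a hypothetical helper, and in a real engine the threshold would presumably be pushed into the scan rather than checked in a loop.

```python
import bisect
from typing import Iterable, List, Tuple


def top_k_with_dynamic_filter(values: Iterable[str], k: int) -> Tuple[List[str], int]:
    """Return the k smallest values in ascending order plus the number of rows skipped.

    Once k values are held, the largest of them acts as a threshold: any later
    value that is not strictly smaller cannot enter the result and is skipped.
    """
    if k <= 0:
        return [], 0
    top: List[str] = []  # kept sorted ascending, at most k entries
    skipped = 0
    for value in values:
        if len(top) == k and value >= top[-1]:
            skipped += 1  # the dynamic filter rejects this row cheaply
            continue
        bisect.insort(top, value)
        if len(top) > k:
            top.pop()  # drop the current largest; the threshold tightens
    return top, skipped
```

The threshold only tightens as more rows are seen, which is what makes it usable as a filter that is updated while the query runs. The `WHERE SearchPhrase <> ''` predicate from the quoted query would be applied to the input before calling this helper.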

[PR] fix: correctly handle schemas with nested array of struct (native_iceberg_compat) [datafusion-comet]

2025-06-13 Thread via GitHub
parthchandra opened a new pull request, #1883: URL: https://github.com/apache/datafusion-comet/pull/1883 ## Which issue does this PR close? The mapping between Spark and Parquet for schemas with field ids did not correctly handle the schemas with nested arrays of structs. ## R

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145547651 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -843,6 +846,8 @@ pub struct SortExec { common_sort_prefix: Vec, /// Cache holding plan properties l

Re: [PR] WIP: scalar UDFs with metadata [datafusion-python]

2025-06-13 Thread via GitHub
timsaucer closed pull request #1110: WIP: scalar UDFs with metadata URL: https://github.com/apache/datafusion-python/pull/1110

Re: [PR] WIP: scalar UDFs with metadata [datafusion-python]

2025-06-13 Thread via GitHub
timsaucer commented on PR #1110: URL: https://github.com/apache/datafusion-python/pull/1110#issuecomment-2971126021 Superseded by https://github.com/apache/datafusion-python/pull/1145

Re: [I] [EPIC] Spark SQL test failures when Comet JVM shuffle is used [datafusion-comet]

2025-06-13 Thread via GitHub
andygrove commented on issue #1254: URL: https://github.com/apache/datafusion-comet/issues/1254#issuecomment-2971117509 I removed the following from the scope of this issue since they turned out not to be bugs or correctness issues, but valid failures because Comet does not support DPP nat

Re: [I] [EPIC] Spark SQL test failures when Comet JVM shuffle is used [datafusion-comet]

2025-06-13 Thread via GitHub
andygrove closed issue #1254: [EPIC] Spark SQL test failures when Comet JVM shuffle is used URL: https://github.com/apache/datafusion-comet/issues/1254

Re: [I] SparkSha2 is not compliant with Spark and does not support Int32 type [datafusion]

2025-06-13 Thread via GitHub
alamb closed issue #16336: SparkSha2 is not compliant with Spark and does not support Int32 type URL: https://github.com/apache/datafusion/issues/16336

Re: [PR] fix: Fix SparkSha2 to be compliant with Spark response and add support for Int32 [datafusion]

2025-06-13 Thread via GitHub
alamb merged PR #16350: URL: https://github.com/apache/datafusion/pull/16350

Re: [PR] chore(deps): bump prost-build from 0.13.5 to 0.14.0 in the proto group [datafusion]

2025-06-13 Thread via GitHub
dependabot[bot] commented on PR #16392: URL: https://github.com/apache/datafusion/pull/16392#issuecomment-2971066993 This pull request was built based on a group rule. Closing it will not ignore any of these versions in future pull requests. To ignore these dependencies, configure [ig

Re: [I] Request to update crates.io ownership [datafusion]

2025-06-13 Thread via GitHub
alamb commented on issue #16323: URL: https://github.com/apache/datafusion/issues/16323#issuecomment-2971079530 I think we are now done with this

Re: [I] Request to update crates.io ownership [datafusion]

2025-06-13 Thread via GitHub
alamb closed issue #16323: Request to update crates.io ownership URL: https://github.com/apache/datafusion/issues/16323

Re: [PR] chore(deps): bump prost-build from 0.13.5 to 0.14.0 in the proto group [datafusion]

2025-06-13 Thread via GitHub
alamb closed pull request #16392: chore(deps): bump prost-build from 0.13.5 to 0.14.0 in the proto group URL: https://github.com/apache/datafusion/pull/16392

Re: [PR] Optimize Hex Function [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #16077: URL: https://github.com/apache/datafusion/pull/16077#issuecomment-2971062970 Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is read

Re: [PR] chore(deps): bump prost-build from 0.13.5 to 0.14.0 in the proto group [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #16392: URL: https://github.com/apache/datafusion/pull/16392#issuecomment-2971066685 This needs to wait for an arrow-rs update

Re: [PR] Simplify expressions passed to table functions [datafusion]

2025-06-13 Thread via GitHub
alamb commented on code in PR #16388: URL: https://github.com/apache/datafusion/pull/16388#discussion_r2145600999 ## datafusion/core/src/execution/session_state.rs: ## @@ -1675,6 +1675,13 @@ impl ContextProvider for SessionContextProvider<'_> { .get(name)

Re: [PR] Add fast paths for try_process_unnest [datafusion]

2025-06-13 Thread via GitHub
alamb commented on code in PR #16389: URL: https://github.com/apache/datafusion/pull/16389#discussion_r2145582029 ## datafusion/sql/src/select.rs: ## @@ -374,6 +383,14 @@ impl SqlToRel<'_, S> { fn try_process_aggregate_unnest(&self, input: LogicalPlan) -> Result {

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
Dandandan commented on PR #15770: URL: https://github.com/apache/datafusion/pull/15770#issuecomment-2971038192 I wonder if we see any improvements on the "sort tpch with limit" benchmark? ```cargo run --release --bin dfbench -- sort-tpch --iterations 5 --path "${TPCH_DIR}" -o "${RESUL

Re: [PR] Add fast paths for try_process_unnest [datafusion]

2025-06-13 Thread via GitHub
alamb commented on PR #16389: URL: https://github.com/apache/datafusion/pull/16389#issuecomment-2971026040 > logical_select_all_from_1000 10.80 120.4±0.22ms ? ?/sec 1.00 11.1±0.06ms ? ?/sec 🚀 The other planning benchmarks look like prett

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
Dandandan commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145572928 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -843,6 +846,8 @@ pub struct SortExec { common_sort_prefix: Vec, /// Cache holding plan properties

Re: [PR] feat: add SchemaProvider::table_type(table_name: &str) [datafusion]

2025-06-13 Thread via GitHub
epgif commented on PR #16401: URL: https://github.com/apache/datafusion/pull/16401#issuecomment-2971013443 @alamb > I wonder if there is some way we can write a test for it (mostly to prevent it from being accidentally broken/changed in the future) I looked around for some tes

Re: [PR] Chore: implement hour func as ScalarUDFImpl [datafusion-comet]

2025-06-13 Thread via GitHub
trompa commented on PR #1874: URL: https://github.com/apache/datafusion-comet/pull/1874#issuecomment-2971008063 val df = spark.sql("select hour('1969-12-31 16:00:00.0') AS folded_hour") == Physical Plan == *(1) Project [16 AS folded_hour#0] +- *(1) Scan OneRowRelation[]

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
adriangb commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145548000 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -843,6 +846,8 @@ pub struct SortExec { common_sort_prefix: Vec, /// Cache holding plan properties l

Re: [PR] TopK dynamic filter pushdown attempt 2 [datafusion]

2025-06-13 Thread via GitHub
Dandandan commented on code in PR #15770: URL: https://github.com/apache/datafusion/pull/15770#discussion_r2145545013 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -843,6 +846,8 @@ pub struct SortExec { common_sort_prefix: Vec, /// Cache holding plan properties
