Re: [PR] Make clickbench query IDs 1-based [datafusion]

2025-06-19 Thread via GitHub
AdamGS commented on PR #16455: URL: https://github.com/apache/datafusion/pull/16455#issuecomment-2988832498 But Clickbench treats/displays them as 0-based https://github.com/user-attachments/assets/64742df1-dd95-4004-b1ed-a8218e68cdc7 -- This is an automated message from the A

Re: [PR] chore: release datafusion 47.0.0 [datafusion-ballista]

2025-06-19 Thread via GitHub
andygrove merged PR #1269: URL: https://github.com/apache/datafusion-ballista/pull/1269

Re: [PR] Make clickbench query IDs 1-based [datafusion]

2025-06-19 Thread via GitHub
pepijnve commented on PR #16455: URL: https://github.com/apache/datafusion/pull/16455#issuecomment-2988864943 Ok, that’s probably a stronger argument for keeping things as is

Re: [PR] Make clickbench query IDs 1-based [datafusion]

2025-06-19 Thread via GitHub
pepijnve closed pull request #16455: Make clickbench query IDs 1-based URL: https://github.com/apache/datafusion/pull/16455

Re: [I] SortQueryFuzzer found a failing case on main [datafusion]

2025-06-19 Thread via GitHub
AdamGS commented on issue #16452: URL: https://github.com/apache/datafusion/issues/16452#issuecomment-2988819800 Some more findings: 1. `datafusion.optimizer.enable_dynamic_filter_pushdown` doesn't seem to make a difference 2. Played around with the seeds, seems like the only one that'

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2157528156 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -879,4 +972,107 @@ mod test { assert_eq!(num_batches, 0); assert_eq!(num_rows, 0); }

Re: [I] SortQueryFuzzer found a failing case on main [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on issue #16452: URL: https://github.com/apache/datafusion/issues/16452#issuecomment-2988929483 I think this is a reproducer? https://github.com/apache/datafusion/actions/runs/15764484967/job/44438090077?pr=16461

Re: [PR] Temporarily fix bug in dynamic top-k optimization [datafusion]

2025-06-19 Thread via GitHub
AdamGS commented on code in PR #16465: URL: https://github.com/apache/datafusion/pull/16465#discussion_r2157698764 ## datafusion/core/tests/fuzz_cases/sort_query_fuzz.rs: ## @@ -579,6 +618,7 @@ impl SortFuzzerTestGenerator { let with_mem_limit = !query_str.contains("LIM

Re: [PR] Simplify predicates in filter [datafusion]

2025-06-19 Thread via GitHub
xudong963 commented on code in PR #16362: URL: https://github.com/apache/datafusion/pull/16362#discussion_r2157945137 ## datafusion/optimizer/src/push_down_filter.rs: ## @@ -778,6 +779,16 @@ impl OptimizerRule for PushDownFilter { return Ok(Transformed::no(plan));

Re: [PR] Perf: Optimize in memory sort [datafusion]

2025-06-19 Thread via GitHub
zhuqi-lucas commented on PR #15380: URL: https://github.com/apache/datafusion/pull/15380#issuecomment-2989672223 Thank you @alamb, the result shows no regression now, but also no obvious performance improvement. Let me try to increase the memory internal sort size to see the result.

Re: [PR] Improve dictionary null handling in hashing and expand aggregate test coverage for nulls [datafusion]

2025-06-19 Thread via GitHub
kosiew closed pull request #16458: Improve dictionary null handling in hashing and expand aggregate test coverage for nulls URL: https://github.com/apache/datafusion/pull/16458

Re: [PR] Improve dictionary null handling in hashing and expand aggregate test coverage for nulls [datafusion]

2025-06-19 Thread via GitHub
kosiew commented on PR #16458: URL: https://github.com/apache/datafusion/pull/16458#issuecomment-2989731599 Removing the large file test_data.txt messed up this branch. Closing

[PR] Improve dictionary null handling in hashing and expand aggregate test coverage for nulls [datafusion]

2025-06-19 Thread via GitHub
kosiew opened a new pull request, #16466: URL: https://github.com/apache/datafusion/pull/16466 ## Which issue does this PR close? - Closes #16266 ## Rationale for this change This change addresses a bug where `combine_hashes` was applied even if a dictionary value was nu
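The rationale above concerns `combine_hashes` being applied to a dictionary entry that should not contribute to the hash (the excerpt is cut off, but it appears to refer to null dictionary values). A minimal sketch of that guard follows, assuming a simplified data model: `hash_dictionary` is a hypothetical helper, the key and value slices stand in for an Arrow dictionary array, and the body of `combine_hashes` here is an illustrative mixing function, not DataFusion's.

```rust
/// Stand-in mixing function for illustration only; not DataFusion's implementation.
fn combine_hashes(l: u64, r: u64) -> u64 {
    l.wrapping_mul(37).wrapping_add(r)
}

/// Hypothetical helper: `keys[i]` indexes into `values`, which holds the
/// precomputed hash of each dictionary value (`None` marks a null value).
/// A row's hash is left unchanged when its key is null or when the key
/// points at a null dictionary value.
pub fn hash_dictionary(keys: &[Option<usize>], values: &[Option<u64>], hashes: &mut [u64]) {
    for (hash, key) in hashes.iter_mut().zip(keys.iter().copied()) {
        // Only combine when both the key and the value it refers to are non-null.
        if let Some(value_hash) = key.and_then(|k| values[k]) {
            *hash = combine_hashes(*hash, value_hash);
        }
    }
}
```

With this guard, a null key and a key pointing at a null value both leave the row's hash untouched, so the two kinds of null rows hash alike in this sketch.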

Re: [PR] Improve dictionary null handling in hashing and expand aggregate test coverage for nulls [datafusion]

2025-06-19 Thread via GitHub
kosiew commented on PR #16466: URL: https://github.com/apache/datafusion/pull/16466#issuecomment-2989750769 https://github.com/user-attachments/assets/4782f3ea-4e60-4f92-83e1-87fca4b57770 datafusion/common/src/hash_utils.rs| 50 +- .../tests/fuzz_cases/aggre

Re: [PR] Improve dictionary null handling in hashing and expand aggregate test coverage for nulls [datafusion]

2025-06-19 Thread via GitHub
kosiew commented on PR #16458: URL: https://github.com/apache/datafusion/pull/16458#issuecomment-2989753584 Replaced with #16466

[I] UDTFs in logical plan have same name 'tmp_table' [datafusion]

2025-06-19 Thread via GitHub
Jeadie opened a new issue, #16467: URL: https://github.com/apache/datafusion/issues/16467 ### Describe the bug The logical plan for all UDTFs is a table scan with a fixed table name `tmp_table` ([source](https://github.com/apache/datafusion/blob/5ca4ff02932eecdd203b1b90acaf4381c0d5cb

[PR] use UDTF name in logical plan table scan [datafusion]

2025-06-19 Thread via GitHub
Jeadie opened a new pull request, #16468: URL: https://github.com/apache/datafusion/pull/16468 ## Which issue does this PR close? - Closes #16467. ## What changes are included in this PR? Use UDTF name in logical plan table scan ## Are these changes tested? Yes, exist `

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-19 Thread via GitHub
Dandandan commented on PR #16433: URL: https://github.com/apache/datafusion/pull/16433#issuecomment-2989336247 > We could try a shared heap. It might work? I guess it will be a sort of balance between lock contention and better selectivity. Maybe we can balance it by having distinct heaps f

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on PR #16433: URL: https://github.com/apache/datafusion/pull/16433#issuecomment-2989360115 > how would you compute the shared heap on the fly? I was thinking you'd compute the top K of the top K * partitions on the fly. But maybe your proposal makes more sense
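The idea quoted above, computing the top K of the K * partitions candidates on the fly, can be sketched as follows. This is a minimal illustration rather than DataFusion's TopK implementation; `global_top_k` and the use of plain `i64` sort keys (ascending order, so "top" means smallest) are assumptions made for the example.

```rust
use std::collections::BinaryHeap;

/// Hypothetical helper: each inner vector is one partition's current top K
/// (its K smallest values, in any order). Returns the K smallest values
/// across all partitions, in ascending order.
pub fn global_top_k(partition_top_ks: &[Vec<i64>], k: usize) -> Vec<i64> {
    // Max-heap that never holds more than K values after each step:
    // the largest retained value sits at the top and is evicted first.
    let mut heap: BinaryHeap<i64> = BinaryHeap::with_capacity(k + 1);
    for partition in partition_top_ks {
        for &value in partition {
            heap.push(value);
            if heap.len() > k {
                heap.pop(); // drop the current largest candidate
            }
        }
    }
    heap.into_sorted_vec()
}
```

For example, merging `[5, 1, 9]` and `[2, 8, 3]` with `k = 3` yields `[1, 2, 3]`. The largest value of the merged result is the threshold a shared dynamic filter could use, at the cost of touching every partition's heap.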

[I] Add a case for reading map_entries in the `read basic complex types` test [datafusion-comet]

2025-06-19 Thread via GitHub
parthchandra opened a new issue, #1916: URL: https://github.com/apache/datafusion-comet/issues/1916 ### Describe the bug The test has a query - ``` sql( "select optional_array[0], " + "array_of_struct[0].field1, " + "array_of_struct[0].optional_nested_array, " +

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-19 Thread via GitHub
parthchandra commented on code in PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911#discussion_r2157793226 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1946,6 +1946,52 @@ class ParquetReadV1Suite extends ParquetReadSuite with

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-19 Thread via GitHub
parthchandra commented on PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911#issuecomment-2989423116 Merged. Thank you for the review @kazuyukitanimura

Re: [PR] chore: add a test case to read from an arbitrarily complex type schema [datafusion-comet]

2025-06-19 Thread via GitHub
parthchandra merged PR #1911: URL: https://github.com/apache/datafusion-comet/pull/1911

Re: [PR] chore: Introduce `exprHandlers` map in QueryPlanSerde [datafusion-comet]

2025-06-19 Thread via GitHub
parthchandra commented on PR #1903: URL: https://github.com/apache/datafusion-comet/pull/1903#issuecomment-2989443037 I assume this is not the end of it and we would be enhancing this as we go? Some initial thoughts on that - Add one or more annotations that can include additional i

Re: [PR] feat: add FFI support for user defined functions [datafusion-python]

2025-06-19 Thread via GitHub
davisp commented on code in PR #1145: URL: https://github.com/apache/datafusion-python/pull/1145#discussion_r2157817285 ## examples/datafusion-ffi-example/python/tests/_test_aggregate_udf.py: ## @@ -0,0 +1,77 @@ +# Licensed to the Apache Software Foundation (ASF) under one +# o

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-19 Thread via GitHub
kosiew commented on code in PR #16371: URL: https://github.com/apache/datafusion/pull/16371#discussion_r2157873541 ## datafusion/common/src/nested_struct.rs: ## @@ -0,0 +1,150 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agree

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-19 Thread via GitHub
kosiew commented on code in PR #16371: URL: https://github.com/apache/datafusion/pull/16371#discussion_r2157884676 ## datafusion/datasource/src/nested_schema_adapter/adapter.rs: ## @@ -0,0 +1,93 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contri

Re: [PR] WIP: Aggregate UDF FFI [datafusion]

2025-06-19 Thread via GitHub
github-actions[bot] closed pull request #15510: WIP: Aggregate UDF FFI URL: https://github.com/apache/datafusion/pull/15510

Re: [PR] Update parser recursion limit from 50 to 100 [datafusion]

2025-06-19 Thread via GitHub
github-actions[bot] closed pull request #15622: Update parser recursion limit from 50 to 100 URL: https://github.com/apache/datafusion/pull/15622

Re: [PR] WIP: Test enabling Parquet filter pushdown with parquet caching page cache reader [datafusion]

2025-06-19 Thread via GitHub
github-actions[bot] closed pull request #15506: WIP: Test enabling Parquet filter pushdown with parquet caching page cache reader URL: https://github.com/apache/datafusion/pull/15506

Re: [I] [Epic] Add snapshot tests (migrate to `insta` for tests) [datafusion]

2025-06-19 Thread via GitHub
xudong963 commented on issue #15178: URL: https://github.com/apache/datafusion/issues/15178#issuecomment-2989610772 https://github.com/apache/datafusion/blob/5ca4ff02932eecdd203b1b90acaf4381c0d5cb5c/datafusion/proto/tests/cases/roundtrip_physical_plan.rs#L143 I think the roundtrip tests als

Re: [PR] docs: add conventional commit guide and PR title examples [datafusion]

2025-06-19 Thread via GitHub
github-actions[bot] closed pull request #15638: docs: add conventional commit guide and PR title examples URL: https://github.com/apache/datafusion/pull/15638

Re: [PR] Perform type coercion for corr aggregate function during physical planning [datafusion]

2025-06-19 Thread via GitHub
github-actions[bot] commented on PR #15776: URL: https://github.com/apache/datafusion/pull/15776#issuecomment-2989604583 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] feat(sql): add diagnostic for wrong number of function arguments [datafusion]

2025-06-19 Thread via GitHub
github-actions[bot] closed pull request #15490: feat(sql): add diagnostic for wrong number of function arguments URL: https://github.com/apache/datafusion/pull/15490

Re: [PR] fix: union all by name [datafusion]

2025-06-19 Thread via GitHub
github-actions[bot] closed pull request #15603: fix: union all by name URL: https://github.com/apache/datafusion/pull/15603

Re: [PR] Fix constant window for evaluate stateful [datafusion]

2025-06-19 Thread via GitHub
suibianwanwank commented on PR #16430: URL: https://github.com/apache/datafusion/pull/16430#issuecomment-2989644426 https://github.com/apache/datafusion-comet/pull/1913 @andygrove CI seems to have passed. 🎉

Re: [I] Make `datafusion-cli` read parquet folders [datafusion]

2025-06-19 Thread via GitHub
comphead commented on issue #16460: URL: https://github.com/apache/datafusion/issues/16460#issuecomment-2988809219 Surprisingly, if I try the same tests in DF itself using unit or `slt` tests, it works. Looks like the issue is in datafusion-cli only

Re: [I] Make `datafusion-cli` read parquet folders [datafusion]

2025-06-19 Thread via GitHub
hendrikmakait commented on issue #16460: URL: https://github.com/apache/datafusion/issues/16460#issuecomment-2988812320 take

Re: [I] Add `output_bytes` metrics to Explain Analyze [datafusion]

2025-06-19 Thread via GitHub
hendrikmakait commented on issue #16244: URL: https://github.com/apache/datafusion/issues/16244#issuecomment-2988800750 > A quick reminder for someone who is willing to implement it: It's possible that multiple `Array`s share the same underlying buffer -- those `Array`s can be within the sa

Re: [PR] [ignore] test DataFusion PR: Fix constant window for evaluate stateful [datafusion-comet]

2025-06-19 Thread via GitHub
andygrove commented on PR #1913: URL: https://github.com/apache/datafusion-comet/pull/1913#issuecomment-2988814308 > @andygrove If you have time, could you try the branch at https://github.com/suibianwanwank/datafusion/tree/48_fix? This branch is based on DF48 and includes the revert commi

Re: [I] Support types other than String and Int for partition columns [datafusion-python]

2025-06-19 Thread via GitHub
timsaucer closed issue #1155: Support types other than String and Int for partition columns URL: https://github.com/apache/datafusion-python/issues/1155

Re: [PR] Support types other than String and Int for partition columns [datafusion-python]

2025-06-19 Thread via GitHub
timsaucer merged PR #1154: URL: https://github.com/apache/datafusion-python/pull/1154

[PR] Ignore `sort_query_fuzzer_runner` [datafusion]

2025-06-19 Thread via GitHub
blaginin opened a new pull request, #16462: URL: https://github.com/apache/datafusion/pull/16462 ## Rationale for this change See https://github.com/apache/datafusion/issues/16452 The test keeps failing. I think it's important to keep CI green, so let's skip it now

Re: [I] Default to collecting statistics when creating LIstingTables [datafusion]

2025-06-19 Thread via GitHub
blaginin commented on issue #16158: URL: https://github.com/apache/datafusion/issues/16158#issuecomment-295187 https://github.com/apache/datafusion/pull/16447 is merged so closing this

Re: [I] Default to collecting statistics when creating LIstingTables [datafusion]

2025-06-19 Thread via GitHub
blaginin closed issue #16158: Default to collecting statistics when creating LIstingTables URL: https://github.com/apache/datafusion/issues/16158

Re: [I] Release DataFusion `48.0.0` (June 2025) [datafusion]

2025-06-19 Thread via GitHub
blaginin commented on issue #15771: URL: https://github.com/apache/datafusion/issues/15771#issuecomment-298367 Do you think we need to do `48.0.1` to include the fix for https://github.com/apache/datafusion/issues/16444?

Re: [I] Datafusion 48 Clickbench Q6 and Q0 regression [datafusion]

2025-06-19 Thread via GitHub
blaginin closed issue #16444: Datafusion 48 Clickbench Q6 and Q0 regression URL: https://github.com/apache/datafusion/issues/16444

Re: [I] Datafusion 48 Clickbench Q6 and Q0 regression [datafusion]

2025-06-19 Thread via GitHub
blaginin commented on issue #16444: URL: https://github.com/apache/datafusion/issues/16444#issuecomment-295594 https://github.com/apache/datafusion/pull/16447 is merged so closing this

Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-06-19 Thread via GitHub
Dandandan commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2988836122 Thanks for sharing the insight @ctsk I think there is some overlap here to the work @alamb to improve (and maybe remove over time) `CoalesceBatchesExec` and gc of views

Re: [I] Use sha2 implementation from datafusion-spark crate [datafusion-comet]

2025-06-19 Thread via GitHub
rishvin commented on issue #1820: URL: https://github.com/apache/datafusion-comet/issues/1820#issuecomment-2988838360 @andygrove can I backport [SHA2-fix](https://github.com/apache/datafusion/pull/16350) to branch-48 of datafusion ? I tried updating with datafusion-main branch until my com

Re: [PR] Set the default value of `datafusion.execution.collect_statistics` to `true` [datafusion]

2025-06-19 Thread via GitHub
blaginin merged PR #16447: URL: https://github.com/apache/datafusion/pull/16447

[PR] GC string views on hash join build side [datafusion]

2025-06-19 Thread via GitHub
ctsk opened a new pull request, #16463: URL: https://github.com/apache/datafusion/pull/16463 ## Which issue does this PR close? - Closes #16206. ## What changes are included in this PR? - A utility function to garbage collect (gc) all view-type columns of a batch

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on code in PR #16371: URL: https://github.com/apache/datafusion/pull/16371#discussion_r2157585906 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -83,17 +85,16 @@ pub struct ListingTableConfig { pub options: Option, /// Tracks the source of

[I] Test TopK Dynamic Filter Optimization w/ ORDER BY on multiple columns [datafusion]

2025-06-19 Thread via GitHub
adriangb opened a new issue, #16464: URL: https://github.com/apache/datafusion/issues/16464 ### Describe the bug @alamb had reported it didn't work, need to investigate and make sure there's a test ### To Reproduce _No response_ ### Expected behavior _No re

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-19 Thread via GitHub
ctsk commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2989059712 Benchmark results: ``` Comparing main and fix_build-side-gc Benchmark tpch_sf10.json ┏━━┳┳━

[I] Merge dataframe documentation prior to release of DF48 [datafusion-python]

2025-06-19 Thread via GitHub
timsaucer opened a new issue, #1158: URL: https://github.com/apache/datafusion-python/issues/1158 **Describe the bug** Right now in the documentation on main we have two entries for API and two entries called DataFrame. We should merge the dataframe documentation, and probably make t

[I] Add Iceberg documentation to website [datafusion-python]

2025-06-19 Thread via GitHub
timsaucer opened a new issue, #1159: URL: https://github.com/apache/datafusion-python/issues/1159 **Is your feature request related to a problem or challenge? Please describe what you are trying to do.** I believe we can now access Iceberg tables via ffi according to @kevinjqliu 's r

Re: [PR] GC string views on hash join build side [datafusion]

2025-06-19 Thread via GitHub
ctsk commented on PR #16463: URL: https://github.com/apache/datafusion/pull/16463#issuecomment-2989084548 At SF=100, this PR is 10% faster: ``` Benchmark tpch_sf100.json ┏━━┳┳━┳━━━┓ ┃ Q

Re: [PR] Use Tokio's task budget consistently, better APIs to support task cancellation [datafusion]

2025-06-19 Thread via GitHub
pepijnve commented on PR #16398: URL: https://github.com/apache/datafusion/pull/16398#issuecomment-2989086637 @alamb @Dandandan I'm starting to get the feeling this is a wild goose chase. I adapted `bench.sh` a bit so that I can pass in `--query`. I then ran `clickbench_1` query 4 multiple

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-19 Thread via GitHub
kosiew commented on code in PR #16371: URL: https://github.com/apache/datafusion/pull/16371#discussion_r2157862933 ## datafusion/common/src/nested_struct.rs: ## @@ -0,0 +1,150 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agree

[PR] fix: Make cast from float/double to decimal compatible with Spark [datafusion-comet]

2025-06-19 Thread via GitHub
leung-ming opened a new pull request, #1915: URL: https://github.com/apache/datafusion-comet/pull/1915 ## Which issue does this PR close? Closes #1371 Built on #1914 ## Rationale for this change ## What changes are included in this PR?

Re: [I] SortQueryFuzzer found a failing case on main [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on issue #16452: URL: https://github.com/apache/datafusion/issues/16452#issuecomment-2988934802 > datafusion.optimizer.enable_dynamic_filter_pushdown doesn't seem to make a difference @Dandandan I wonder if it's the filtering being done inside of the TopK? @AdamG

[PR] fix: Add continue after append_null when casting float to decimal [datafusion-comet]

2025-06-19 Thread via GitHub
leung-ming opened a new pull request, #1914: URL: https://github.com/apache/datafusion-comet/pull/1914 ## Which issue does this PR close? ## Rationale for this change The `continue` after `append_null` was forgotten ## What changes are included in this PR?
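The class of bug described above (a missing `continue` after appending a null) can be illustrated with a simplified loop. This is a sketch under stated assumptions, not the Comet code: `cast_floats_to_decimal` is a hypothetical function, a `Vec<Option<i128>>` stands in for an Arrow decimal builder, and `out.push(None)` plays the role of `append_null`.

```rust
/// Hypothetical sketch: scale each float by 10^scale and round to an integer
/// representation. Values that cannot be represented (NaN, infinity) become null.
pub fn cast_floats_to_decimal(input: &[Option<f64>], scale: i32) -> Vec<Option<i128>> {
    let factor = 10f64.powi(scale);
    let mut out = Vec::with_capacity(input.len());
    for item in input.iter().copied() {
        let scaled = match item {
            Some(v) => (v * factor).round(),
            None => {
                out.push(None);
                continue;
            }
        };
        if !scaled.is_finite() {
            out.push(None); // stand-in for `append_null`
            // Without this `continue`, the push below would also run, so one
            // input row would produce two output entries.
            continue;
        }
        out.push(Some(scaled as i128));
    }
    out
}
```

The invariant worth checking in a test is that the output has exactly one entry per input row, which is what the forgotten `continue` would violate.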

Re: [I] Was GCS support removed? [datafusion-ballista]

2025-06-19 Thread via GitHub
milenkovicm commented on issue #1274: URL: https://github.com/apache/datafusion-ballista/issues/1274#issuecomment-2988967946 I'll close this issue if ok with you @dfinninger

Re: [I] Was GCS support removed? [datafusion-ballista]

2025-06-19 Thread via GitHub
milenkovicm closed issue #1274: Was GCS support removed? URL: https://github.com/apache/datafusion-ballista/issues/1274

Re: [I] SortQueryFuzzer found a failing case on main [datafusion]

2025-06-19 Thread via GitHub
AdamGS commented on issue #16452: URL: https://github.com/apache/datafusion/issues/16452#issuecomment-2989118706 that indeed makes the failure go away

Re: [PR] feat: add FFI support for user defined functions [datafusion-python]

2025-06-19 Thread via GitHub
timsaucer commented on code in PR #1145: URL: https://github.com/apache/datafusion-python/pull/1145#discussion_r2157653630 ## examples/datafusion-ffi-example/python/tests/_test_aggregate_udf.py: ## @@ -0,0 +1,77 @@ +# Licensed to the Apache Software Foundation (ASF) under one +

Re: [I] SortQueryFuzzer found a failing case on main [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on issue #16452: URL: https://github.com/apache/datafusion/issues/16452#issuecomment-2989145334 Let's merge that ASAP. I'm AFK for the next two hours? Could you prepare a PR by chance?

[PR] Fix bug in dynamic topk optimization [datafusion]

2025-06-19 Thread via GitHub
AdamGS opened a new pull request, #16465: URL: https://github.com/apache/datafusion/pull/16465 ## Which issue does this PR close? See #16452. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested?

Re: [I] SortQueryFuzzer found a failing case on main [datafusion]

2025-06-19 Thread via GitHub
AdamGS commented on issue #16452: URL: https://github.com/apache/datafusion/issues/16452#issuecomment-2989157877 done - https://github.com/apache/datafusion/pull/16465/files

Re: [PR] Temporarily fix bug in dynamic top-k optimization [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on code in PR #16465: URL: https://github.com/apache/datafusion/pull/16465#discussion_r2157671900 ## datafusion/core/tests/fuzz_cases/sort_query_fuzz.rs: ## @@ -579,6 +618,7 @@ impl SortFuzzerTestGenerator { let with_mem_limit = !query_str.contains("L

Re: [PR] remove unused methods in SortExec [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on PR #16457: URL: https://github.com/apache/datafusion/pull/16457#issuecomment-2988748949 We just added them in #15770 2 days ago. I'm 99% certain no downstream projects are using them. I feel it would be less painful to just rip them out right now rather than having the

Re: [PR] remove unused methods in SortExec [datafusion]

2025-06-19 Thread via GitHub
comphead commented on PR #16457: URL: https://github.com/apache/datafusion/pull/16457#issuecomment-2988800023 > We just added them in #15770 2 days ago. I'm 99% certain no downstream projects are using them. I feel it would be less painful to just rip them out right now rather than having t

Re: [I] Excessive Arc-clone in HashJoinStream with StringView on build-side [datafusion]

2025-06-19 Thread via GitHub
ctsk commented on issue #16206: URL: https://github.com/apache/datafusion/issues/16206#issuecomment-2988799373 I've re-encountered this issue and it (obviously, duh) gets amplified with larger scale factors (>100). For example, the top 3 items by cycles when running tpch query 18 @ sf 300

Re: [PR] [ignore] test DataFusion PR: Fix constant window for evaluate stateful [datafusion-comet]

2025-06-19 Thread via GitHub
suibianwanwank commented on PR #1913: URL: https://github.com/apache/datafusion-comet/pull/1913#issuecomment-2988797853 @andygrove If you have time, could you try the branch at https://github.com/suibianwanwank/datafusion/tree/48_fix? This branch is based on DF48 and includes the revert co

Re: [PR] remove unused methods in SortExec [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on PR #16457: URL: https://github.com/apache/datafusion/pull/16457#issuecomment-2988806515 Thanks for the review @comphead !

Re: [PR] remove unused methods in SortExec [datafusion]

2025-06-19 Thread via GitHub
adriangb merged PR #16457: URL: https://github.com/apache/datafusion/pull/16457

[I] Make `datafusion-cli` read parquet folders [datafusion]

2025-06-19 Thread via GitHub
comphead opened a new issue, #16460: URL: https://github.com/apache/datafusion/issues/16460 > Create test data > ``` > ls -la /tmp/t1 > > -rw-r--r--@ 1 xxx wheel 12 Jun 6 08:35 .part-0-e248d995-5eac-404e-a2ed-0eb16e27c005-c000.snappy.parquet.crc > -rw-r--r--@ 1 xxx

Re: [PR] [ignore] test DataFusion PR: Fix constant window for evaluate stateful [datafusion-comet]

2025-06-19 Thread via GitHub
codecov-commenter commented on PR #1913: URL: https://github.com/apache/datafusion-comet/pull/1913#issuecomment-2988850106 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1913?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on code in PR #16461: URL: https://github.com/apache/datafusion/pull/16461#discussion_r2157496405 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -524,6 +532,62 @@ fn should_enable_page_index( .unwrap_or(false) } +use datafusion_physical_

[PR] adapt filter expressions to file schema during parquet scan [datafusion]

2025-06-19 Thread via GitHub
adriangb opened a new pull request, #16461: URL: https://github.com/apache/datafusion/pull/16461 The idea here is to move us one step closer to https://github.com/apache/datafusion/issues/15780 although there is still work to do (e.g. https://github.com/apache/datafusion/issues/15780#issue

Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on PR #16424: URL: https://github.com/apache/datafusion/pull/16424#issuecomment-2989096200 @alamb I reverted the filtering during the stream so this should now do strictly less work 😄 -- This is an automated message from the Apache Git Service. To respond to the messag

Re: [PR] Add dynamic filter (bounds) pushdown to HashJoinExec [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on PR #16445: URL: https://github.com/apache/datafusion/pull/16445#issuecomment-2989099023 > I think it also makes sense to think about a heuristic we want to use to use this pushdown only when we think it might be useful - e.g. the left side is much smaller than the

Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on code in PR #16424: URL: https://github.com/apache/datafusion/pull/16424#discussion_r2157629376 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -524,6 +512,91 @@ fn should_enable_page_index( .unwrap_or(false) } +/// Prune based on parti

Re: [PR] Skip re-pruning based on partition values and file level stats if there are no dynamic filters [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on code in PR #16424: URL: https://github.com/apache/datafusion/pull/16424#discussion_r2157631212 ## datafusion/datasource-parquet/src/opener.rs: ## @@ -524,6 +498,99 @@ fn should_enable_page_index( .unwrap_or(false) } +/// Prune based on parti

[I] Release DataFusion 48.0.0 [datafusion-python]

2025-06-19 Thread via GitHub
timsaucer opened a new issue, #1160: URL: https://github.com/apache/datafusion-python/issues/1160 This Issue is to track the release for DataFusion 48.0.0 now that the upstream repository has been published. Before release I would like to merge the following PRs: - https://gith

Re: [PR] Temporarily fix bug in dynamic top-k optimization [datafusion]

2025-06-19 Thread via GitHub
adriangb merged PR #16465: URL: https://github.com/apache/datafusion/pull/16465 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

Re: [PR] Temporarily fix bug in dynamic top-k optimization [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on PR #16465: URL: https://github.com/apache/datafusion/pull/16465#issuecomment-2989282614 Sad to have to do this but I agree keeping CI green is the top priority cc @Dandandan -- This is an automated message from the Apache Git Service. To respond to the message, plea

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-19 Thread via GitHub
kosiew commented on code in PR #16371: URL: https://github.com/apache/datafusion/pull/16371#discussion_r2157862933 ## datafusion/common/src/nested_struct.rs: ## @@ -0,0 +1,150 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agree

Re: [PR] fix: miss output ordering during projection [datafusion]

2025-06-19 Thread via GitHub
xudong963 closed pull request #15683: fix: miss output ordering during projection URL: https://github.com/apache/datafusion/pull/15683 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific co

Re: [PR] Perform type coercion for corr aggregate function during physical planning [datafusion]

2025-06-19 Thread via GitHub
kumarlokesh commented on PR #15776: URL: https://github.com/apache/datafusion/pull/15776#issuecomment-2989918505 PR is active. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-19 Thread via GitHub
kosiew commented on code in PR #16371: URL: https://github.com/apache/datafusion/pull/16371#discussion_r2158136476 ## datafusion/core/src/datasource/listing/table.rs: ## @@ -83,17 +85,16 @@ pub struct ListingTableConfig { pub options: Option, /// Tracks the source of t

Re: [PR] chore(deps): Update sqlparser to 0.56.0 [datafusion]

2025-06-19 Thread via GitHub
Dimchikkk commented on PR #16456: URL: https://github.com/apache/datafusion/pull/16456#issuecomment-2988028529 (converted to draft while I'm investigating CI failures) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use th

[PR] chore(deps): Update sqlparser to 0.56.0 [datafusion]

2025-06-19 Thread via GitHub
Dimchikkk opened a new pull request, #16456: URL: https://github.com/apache/datafusion/pull/16456 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/16405 ## Rationale for this change ## What changes are included in thi

Re: [PR] chore(deps): Update sqlparser to 0.56.0 [datafusion]

2025-06-19 Thread via GitHub
Dimchikkk commented on code in PR #16456: URL: https://github.com/apache/datafusion/pull/16456#discussion_r2156911331 ## datafusion/sql/src/expr/substring.rs: ## @@ -77,8 +78,16 @@ impl SqlToRel<'_, S> { } } -not_impl_err!( -"Substring

Re: [PR] Introduce Async User Defined Functions [datafusion]

2025-06-19 Thread via GitHub
samuelcolvin commented on PR #14837: URL: https://github.com/apache/datafusion/pull/14837#issuecomment-2987279730 In case anyone is interested, I used async UDFs to implement SQL function support for datafusion, demo here - https://github.com/samuelcolvin/datafusion-sql-udfs. -- This is

Re: [PR] Enable schema evolution for nested structs via adapt_column and custom adapter support in ListingTable [datafusion]

2025-06-19 Thread via GitHub
timsaucer commented on code in PR #16371: URL: https://github.com/apache/datafusion/pull/16371#discussion_r2156903806 ## datafusion/common/src/nested_struct.rs: ## @@ -0,0 +1,150 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license ag

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-19 Thread via GitHub
Dandandan commented on PR #16433: URL: https://github.com/apache/datafusion/pull/16433#issuecomment-2988289302 > > Hm my earlier benchmarks didn't seem correct. not sure where the earlier run came from 🤔 > > What do the current ones show? Not much improvement? Yes about the sam

Re: [PR] Only update TopK dynamic filters if the new ones are more selective [datafusion]

2025-06-19 Thread via GitHub
adriangb commented on PR #16433: URL: https://github.com/apache/datafusion/pull/16433#issuecomment-2988328335 We could try a shared heap. It might work? I guess it will be a sort of balance between lock contention and better selectivity. Maybe we can balance it by having distinct heaps for
