Re: [PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]

2025-03-28 Thread via GitHub
l0kr commented on PR #1577: URL: https://github.com/apache/datafusion-comet/pull/1577#issuecomment-2761641254 Previous PR: #958 I noticed that collecting dataframe directly from parquet causes an error. From what I see the problem lies in converting Columnar back to Row. If that's some

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018831008 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761603385 Should we update the same query in the clickbench repo as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2760349060 I created https://github.com/pydantic/datafusion/pull/13/files to discuss that idea further. It's promising in some ways but also has some issues, I left TODOs and comments. -- T

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
Copilot commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018163460 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups ba

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2761504336 > Things that serialize a PhysicalExpr across the wire, e.g. https://github.com/XiangpengHao/liquid-cache does it via [serialize_physical_expr](https://github.com/apache/datafusion/blo

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2761489660 Thank you very much @adriangb -- given the new (warranted) complexity this feature is likely to add to DataFusion, and the fact if done right it can serve as the foundation for many a

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2761509767 > I would still keep the methods on ExecutionPlan to do the pushdown instead of the optimizer rule unless I'm wrong about optimizer rules not being able to deal with LiquidCacheClientE

Re: [PR] Remove redundant statistics from FileScanConfig [datafusion]

2025-03-28 Thread via GitHub
blaginin commented on PR #14955: URL: https://github.com/apache/datafusion/pull/14955#issuecomment-2762094667 hey @Standing-Man https://github.com/apache/datafusion/pull/15352 just got merged so this pr may be easier to finish fyi 🌻 -- This is an automated message from the Apache Git Serv

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-28 Thread via GitHub
andygrove commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2762091836 Comet 0.7.0 supported 3.5.0 through 3.5.4 but not 3.5.5 which had breaking changes to internal apis. I am on vacation this week but it looks like 3.5.5 support was added

Re: [I] Change mapping of SQL `VARCHAR` from `Utf8` to `Utf8View` [datafusion]

2025-03-28 Thread via GitHub
alamb commented on issue #15096: URL: https://github.com/apache/datafusion/issues/15096#issuecomment-2762085363 > Small improvement, i think becasue it's parquet format, mostly we already load it as the Utf8View for benchmark: Yes I would expect no change for the clickbench be

Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2762100250 @Dandandan do you think we can merge this here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abov

Re: [I] Migrate datasource tests to `insta` [datafusion]

2025-03-28 Thread via GitHub
alamb closed issue #15246: Migrate datasource tests to `insta` URL: https://github.com/apache/datafusion/issues/15246 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubsc

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2762327907 I also wrote up some notes here: https://github.com/apache/datafusion/issues/15037#issuecomment-2762326990 -- This is an automated message from the Apache Git Service. To respond to

Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019547635 ## datafusion/functions/src/datetime/to_char.rs: ## @@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result { let result = formatter.va

[PR] Add insta as a dependency in Cargo.toml and Cargo.lock [datafusion]

2025-03-28 Thread via GitHub
qstommyshu opened a new pull request, #15484: URL: https://github.com/apache/datafusion/pull/15484 ## Which issue does this PR close? - Closes #15397 . ## Rationale for this change ## What changes are included in this PR? Migrated tests in `data

Re: [PR] Support Avg distinct for `float64` type [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2762810733 > Run extended tests I see it did trigger but I somehow was expecting feedback in the comments -- This is an automated message from the Apache Git Service. To respond to the

Re: [PR] feat: make parquet native scan schema case insensitive [datafusion-comet]

2025-03-28 Thread via GitHub
parthchandra commented on code in PR #1575: URL: https://github.com/apache/datafusion-comet/pull/1575#discussion_r2019608751 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1460,6 +1460,25 @@ class ParquetReadV1Suite extends ParquetReadSuite with

Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]

2025-03-28 Thread via GitHub
alan910127 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019692309 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -230,19 +230,19 @@ logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)]

[I] QUALIFY syntax [datafusion]

2025-03-28 Thread via GitHub
jayzhan211 opened a new issue, #15485: URL: https://github.com/apache/datafusion/issues/15485 ### Is your feature request related to a problem or challenge? The QUALIFY clause is used to filter the results of [WINDOW functions](https://duckdb.org/docs/stable/sql/functions/window_funct

Re: [PR] Fuse `CASE(a > 0, b / a)` [datafusion]

2025-03-28 Thread via GitHub
github-actions[bot] closed pull request #14200: Fuse `CASE(a > 0, b / a)` URL: https://github.com/apache/datafusion/pull/14200 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. T

Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]

2025-03-28 Thread via GitHub
github-actions[bot] commented on PR #14028: URL: https://github.com/apache/datafusion/pull/14028#issuecomment-2763014013 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2763036538 cc @blaginin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] added fallback using reflection for backward-compatibility [datafusion-comet]

2025-03-28 Thread via GitHub
Kontinuation commented on code in PR #1573: URL: https://github.com/apache/datafusion-comet/pull/1573#discussion_r2019693023 ## spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimCometScanExec.scala: ## @@ -55,15 +55,48 @@ trait ShimCometScanExec { protected def is

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
chenkovsky commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760736404 ``` grep -n "p_partkey" part.csv ``` why there are two head rows ``` 1:p_partkey,p_name,p_mfgr,p_brand,p_type,p_size,p_container,p_retailprice,p_commen

[I] Pub and export stats_union and col_stats_union [datafusion]

2025-03-28 Thread via GitHub
niebayes opened a new issue, #15474: URL: https://github.com/apache/datafusion/issues/15474 The `UnionExec` operator has two helper functions for merging statistics from multiple input operators. Such a functionality is also needed by my project. Can we provide a public helper func

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-28 Thread via GitHub
berkaysynnada commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2760833915 > I think coalesce added before `execute`? > > However just noticed the execute currently happens in main thread instead of `future::once`. But it shouldn't be an issue a

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2760774554 > #15339 It looks like the join plan is being changed. You should be able to get the test back by also setting `hash_join_single_partition_threshold` to `0` / a low value.

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-28 Thread via GitHub
berkaysynnada commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2760838692 > This should not trigger for physical plans generated by datafusion, since the EnforceDistribution pass already adds that CoalescePartitionsExec. You mean `coalesce_part

Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

2025-03-28 Thread via GitHub
2010YOUY01 commented on PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#issuecomment-2760428379 > While reading this I also wonder if we would be able to hook into [DF's new SpillManager](https://github.com/apache/datafusion/pull/15355). That's a task for another PR, bu

Re: [PR] Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on PR #15447: URL: https://github.com/apache/datafusion/pull/15447#issuecomment-2760854661 First of all, i create the reproducer benchmark PR for arrow-rs: https://github.com/apache/arrow-rs/pull/7351 | Benchmark | Utf8 Time (µs)| Ut

Re: [PR] Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on PR #15447: URL: https://github.com/apache/datafusion/pull/15447#issuecomment-2760778885 After thinking more, i think the better way is to improve it in arrow-rs, so we will benefit more about the utf8view regression cases, created the ticket for arrow-rs: htt

Re: [PR] Docs: Formatting and Added Extra resources [datafusion]

2025-03-28 Thread via GitHub
berkaysynnada commented on PR #15450: URL: https://github.com/apache/datafusion/pull/15450#issuecomment-2760926069 > BTW @oznur-synnada I wonder if you have time to update the page with other recent blog content 🤔 You mean this? https://github.com/apache/datafusion/pull/15440 -- Th

Re: [PR] fix: the average time for clickbench query compute should use new vec to make it compute for each query [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on PR #15472: URL: https://github.com/apache/datafusion/pull/15472#issuecomment-2760531078 cc @2010YOUY01 @xudong963 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

Re: [I] `BinaryExpr` evaluate lacks optimization for `Or` and `And` scenarios [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on issue #11212: URL: https://github.com/apache/datafusion/issues/11212#issuecomment-2761086840 Thank you very much for your reply. These are some updates on this issue. @alamb: 1. I have added the extended SQL in this PR #15462, you can check the details there:[SQL

[PR] fix: Assertion fail in external sort [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 opened a new pull request, #15469: URL: https://github.com/apache/datafusion/pull/15469 ## Which issue does this PR close? related to https://github.com/apache/datafusion/issues/15372 ## Rationale for this change One external sort query will panic d

Re: [PR] fix: Assertion fail in external sort [datafusion]

2025-03-28 Thread via GitHub
Copilot commented on code in PR #15469: URL: https://github.com/apache/datafusion/pull/15469#discussion_r2018081822 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -416,21 +409,23 @@ impl ExternalSorter { Some(self.spill_manager.create_in_progress_file("Sor

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018207072 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2760648004 > Modifying the plan post-execute() feels a bit off to me. Does it seem like a smell to you as well? This should not trigger for physical plans generated by datafusion, since the

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018226169 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2018202146 ## datafusion/datasource/src/memory.rs: ## @@ -440,6 +443,35 @@ impl DataSource for MemorySourceConfig { } } +fn repartitioned( Review Comm

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2018210999 ## datafusion/datasource/src/memory.rs: ## @@ -902,4 +1130,319 @@ mod tests { Ok(()) } + +fn batch(row_size: usize) -> RecordBatch { +

Re: [PR] chore: move `optimize_subquery_sort` into optimizer [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on code in PR #15441: URL: https://github.com/apache/datafusion/pull/15441#discussion_r2018254293 ## datafusion/optimizer/src/eliminate_sort.rs: ## @@ -0,0 +1,78 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license

Re: [I] [DISCUSS] Consider Vendoring Certain Dependencies [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on issue #15360: URL: https://github.com/apache/datafusion/issues/15360#issuecomment-2759069227 I can't say I agree with you there but I'm only one voice. I'd rather see the effort put into other areas tbh. -- This is an automated message from the Apache Git Service. To

Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2007105841 ## datafusion/functions-aggregate/src/first_last.rs: ## @@ -179,6 +292,420 @@ impl AggregateUDFImpl for FirstValue { } } +struct FirstPrimitiveGroupsAccu

Re: [PR] minor: Allow to run TPCH bench for a specific query [datafusion]

2025-03-28 Thread via GitHub
comphead merged PR #15467: URL: https://github.com/apache/datafusion/pull/15467 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

[I] Support Min/Max accumulator for type List [datafusion]

2025-03-28 Thread via GitHub
LiaCastaneda opened a new issue, #15477: URL: https://github.com/apache/datafusion/issues/15477 ### Is your feature request related to a problem or challenge? 👋 Min/Max accumulator for List type is missing on datafusion. ### Describe the solution you'd like The logic is

[PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

2025-03-28 Thread via GitHub
ctsk opened a new pull request, #15479: URL: https://github.com/apache/datafusion/pull/15479 Relates to Issue: #15478 ## Rationale for this change The blocking operators (HJ buid side, Aggregation) are often planned on top of a RepartitionExec with a CoalesceBatchesExec in-betw

[PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]

2025-03-28 Thread via GitHub
ctsk opened a new pull request, #15476: URL: https://github.com/apache/datafusion/pull/15476 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested?

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2762008525 You're absolutely right, I got my logic wrong there. Embarrasing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and us

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2762014991 > You're absolutely right, I got my logic wrong there. Embarrasing! It's okay. You've also taught me a lot. When I first started writing this, I really didn't consider the c

Re: [PR] feat: make parquet native scan schema case insensitive [datafusion-comet]

2025-03-28 Thread via GitHub
kazuyukitanimura merged PR #1575: URL: https://github.com/apache/datafusion-comet/pull/1575 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsub

Re: [I] Support Min/Max accumulator for type List [datafusion]

2025-03-28 Thread via GitHub
zuston commented on issue #15477: URL: https://github.com/apache/datafusion/issues/15477#issuecomment-2761897910 If you don’t want to fix, I’m glad to fix this as my first pr. @LiaCastaneda -- This is an automated message from the Apache Git Service. To respond to the message, please log

Re: [PR] Migrate subtrait tests to insta, part1 [datafusion]

2025-03-28 Thread via GitHub
alamb merged PR #15444: URL: https://github.com/apache/datafusion/pull/15444 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] feat: make parquet native scan schema case insensitive [datafusion-comet]

2025-03-28 Thread via GitHub
kazuyukitanimura commented on PR #1575: URL: https://github.com/apache/datafusion-comet/pull/1575#issuecomment-2762019121 Merged, thanks @wForget -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2762023923 Hello @alamb, the optimization SQL and documentation related to this PR have been completed, and all tests have passed. We may need to formally verify the performance, but I'm not

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
alamb commented on code in PR #15480: URL: https://github.com/apache/datafusion/pull/15480#discussion_r2019086527 ## datafusion/substrait/tests/cases/emit_kind_tests.rs: ## @@ -53,15 +54,15 @@ mod tests { let ctx = add_plan_schemas_to_ctx(SessionContext::new(), &proto_p

<    1   2