Re: [I] Idea: Avoid planning CoalesceBatches in front of blocking operators. [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on issue #15478: URL: https://github.com/apache/datafusion/issues/15478#issuecomment-2762997345 > One downside: Increased memory usage. > > The hash join build side stores the RecordBatches in a vector before building the hash table. This vector will grow larger. I

Re: [PR] Support Avg distinct for `float64` type [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2762816764 Looks like it failed? https://github.com/apache/datafusion/actions/runs/14139465370/job/39618247236 -- This is an automated message from the Apache Git Service. To respond to the

Re: [I] Update ClickBench queries to avoid `to_timestamp_seconds` [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on issue #15465: URL: https://github.com/apache/datafusion/issues/15465#issuecomment-2762989314 Another discrepancy I found in the queries is the "EventDate"::INT::DATE" casting. Is this something we could remove as well? Maybe would be good to look at all further that a

Re: [I] Support for user defined FFI functions [datafusion-ballista]

2025-03-28 Thread via GitHub
unknown-no commented on issue #1215: URL: https://github.com/apache/datafusion-ballista/issues/1215#issuecomment-2763011354 Related [WASM UDFs](https://github.com/apache/datafusion/issues/9326) -- This is an automated message from the Apache Git Service. To respond to the message, please

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-28 Thread via GitHub
parthchandra commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2762922710 This particular API is not a public API and we use it to so we can verify the metrics in tests. Maybe we can disable its use in non test environments? -- This is an

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2763043386 @alamb I've achieved 2/3 goals: - I added wrapping of a `DynamicFilterSource` in a `PhysicalExpr` such that it can dynamically update itself to prune rows using filter pushdown _e

Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]

2025-03-28 Thread via GitHub
jayzhan211 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019668440 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -230,19 +230,19 @@ logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)]

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2763074215 > For your planning purposes I will be away the week of April 21 -- so perhaps we can start testing a week earlier (week of April 7 so we have time to complete / fix issues pr

Re: [PR] Support computing statistics for FileGroup [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15432: URL: https://github.com/apache/datafusion/pull/15432#discussion_r2019681642 ## datafusion/core/src/datasource/statistics.rs: ## @@ -145,7 +147,142 @@ pub async fn get_statistics_with_limit( Ok((result_files, statistics)) } -fn ad

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-03-28 Thread via GitHub
shehabgamin commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2762517221 Happy to test whenever! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the s

Re: [PR] feat: Add regexp_substr function [datafusion]

2025-03-28 Thread via GitHub
github-actions[bot] commented on PR #14323: URL: https://github.com/apache/datafusion/pull/14323#issuecomment-2763013966 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] fix: aggregation corner case [datafusion]

2025-03-28 Thread via GitHub
jayzhan211 commented on PR #15457: URL: https://github.com/apache/datafusion/pull/15457#issuecomment-2763109194 > count(*) actually doesnt depend on any column on input logically count(*) need to know the row number of the column -- This is an automated message from the Apache Git S

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
alamb commented on code in PR #15462: URL: https://github.com/apache/datafusion/pull/15462#discussion_r2019123741 ## benchmarks/queries/clickbench/README.md: ## @@ -120,13 +122,42 @@ LIMIT 10; ``` Results look like - +``` +-+-+---+--+

Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
Dandandan merged PR #15475: URL: https://github.com/apache/datafusion/pull/15475 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@data

Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]

2025-03-28 Thread via GitHub
andygrove commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2762962087 I discussed this feature with @mbutrovich recently and he may have additional thoughts on this topic. -- This is an automated message from the Apache Git Service. To resp

Re: [I] Support RangePartitioning with native shuffle [datafusion-comet]

2025-03-28 Thread via GitHub
viirya commented on issue #458: URL: https://github.com/apache/datafusion-comet/issues/458#issuecomment-2763002946 The implementation issue or difference for RangePartitioning other than other partitioning like HashPartitioning, is that it involves some sampling operations that perform wit

Re: [PR] feat: pushdown filter for native_iceberg_compat [datafusion-comet]

2025-03-28 Thread via GitHub
parthchandra commented on code in PR #1566: URL: https://github.com/apache/datafusion-comet/pull/1566#discussion_r2019634486 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1460,6 +1460,59 @@ class ParquetReadV1Suite extends ParquetReadSuite with

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2761412282 > You mean coalesce_partitions_if_needed() call is redundant in datafusion? I don't think that's the case, but if it is so, why don't we remove that line? I wanted to keep the PR

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
alamb commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2761675489 Nice find @chenkovsky -- so looks like there is some bug in the data generator after all. -- This is an automated message from the Apache Git Service. To respond to the message

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018823379 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-28 Thread via GitHub
mkgada commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2761689234 @jinwenjie123 appreciate your response! I am using one of the pre-built JARs, I will not be able to switch to Spark 3.4.x since our cluster was recently upgraded to 3.5.x an

[PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
qstommyshu opened a new pull request, #15480: URL: https://github.com/apache/datafusion/pull/15480 ## Which issue does this PR close? - Closes #15398. Related #15444 ## Rationale for this change ## What changes are included in this PR? Migr

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
qstommyshu commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2761715713 Hi @alamb and @blaginin Part2 of the substrait tests migration is done as well. Please take a look when you have time :) The only tests that cannot be changed to `in

[PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]

2025-03-28 Thread via GitHub
l0kr opened a new pull request, #1577: URL: https://github.com/apache/datafusion-comet/pull/1577 ## Which issue does this PR close? Closes #936. ## Rationale for this change Previous PR went stale so I wanted to move it forward. ## What changes are included

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on code in PR #15301: URL: https://github.com/apache/datafusion/pull/15301#discussion_r2004247883 ## datafusion/common/src/config.rs: ## @@ -590,6 +590,9 @@ config_namespace! { /// during aggregations, if possible pub enable_topk_aggregation:

Re: [PR] chore: Override node name for CometSparkToColumnar [datafusion-comet]

2025-03-28 Thread via GitHub
l0kr commented on PR #1577: URL: https://github.com/apache/datafusion-comet/pull/1577#issuecomment-2761641254 Previous PR: #958 I noticed that collecting dataframe directly from parquet causes an error. From what I see the problem lies in converting Columnar back to Row. If that's some

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018831008 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2761603385 Should we update the same query in the clickbench repo as well? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2760349060 I created https://github.com/pydantic/datafusion/pull/13/files to discuss that idea further. It's promising in some ways but also has some issues, I left TODOs and comments. -- T

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
Copilot commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018163460 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups ba

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2761504336 > Things that serialize a PhysicalExpr across the wire, e.g. https://github.com/XiangpengHao/liquid-cache does it via [serialize_physical_expr](https://github.com/apache/datafusion/blo

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2761489660 Thank you very much @adriangb -- given the new (warranted) complexity this feature is likely to add to DataFusion, and the fact if done right it can serve as the foundation for many a

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2761509767 > I would still keep the methods on ExecutionPlan to do the pushdown instead of the optimizer rule unless I'm wrong about optimizer rules not being able to deal with LiquidCacheClientE

Re: [PR] Remove redundant statistics from FileScanConfig [datafusion]

2025-03-28 Thread via GitHub
blaginin commented on PR #14955: URL: https://github.com/apache/datafusion/pull/14955#issuecomment-2762094667 hey @Standing-Man https://github.com/apache/datafusion/pull/15352 just got merged so this pr may be easier to finish fyi 🌻 -- This is an automated message from the Apache Git Serv

Re: [I] NoSuchMethodError: java.lang.Object org.apache.spark.executor.TaskMetrics.withExternalAccums(scala.Function1) [datafusion-comet]

2025-03-28 Thread via GitHub
andygrove commented on issue #1576: URL: https://github.com/apache/datafusion-comet/issues/1576#issuecomment-2762091836 Comet 0.7.0 supported 3.5.0 through 3.5.4 but not 3.5.5 which had breaking changes to internal apis. I am on vacation this week but it looks like 3.5.5 support was added

Re: [I] Change mapping of SQL `VARCHAR` from `Utf8` to `Utf8View` [datafusion]

2025-03-28 Thread via GitHub
alamb commented on issue #15096: URL: https://github.com/apache/datafusion/issues/15096#issuecomment-2762085363 > Small improvement, i think becasue it's parquet format, mostly we already load it as the Utf8View for benchmark: Yes I would expect no change for the clickbench be

Re: [PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
adriangb commented on PR #15475: URL: https://github.com/apache/datafusion/pull/15475#issuecomment-2762100250 @Dandandan do you think we can merge this here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abov

Re: [I] Migrate datasource tests to `insta` [datafusion]

2025-03-28 Thread via GitHub
alamb closed issue #15246: Migrate datasource tests to `insta` URL: https://github.com/apache/datafusion/issues/15246 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubsc

Re: [PR] Add dynamic pruning filters from TopK state [datafusion]

2025-03-28 Thread via GitHub
alamb commented on PR #15301: URL: https://github.com/apache/datafusion/pull/15301#issuecomment-2762327907 I also wrote up some notes here: https://github.com/apache/datafusion/issues/15037#issuecomment-2762326990 -- This is an automated message from the Apache Git Service. To respond to

Re: [PR] Format `Date32` to string given timestamp specifiers [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on code in PR #15361: URL: https://github.com/apache/datafusion/pull/15361#discussion_r2019547635 ## datafusion/functions/src/datetime/to_char.rs: ## @@ -277,7 +282,25 @@ fn _to_char_array(args: &[ColumnarValue]) -> Result { let result = formatter.va

[PR] Add insta as a dependency in Cargo.toml and Cargo.lock [datafusion]

2025-03-28 Thread via GitHub
qstommyshu opened a new pull request, #15484: URL: https://github.com/apache/datafusion/pull/15484 ## Which issue does this PR close? - Closes #15397 . ## Rationale for this change ## What changes are included in this PR? Migrated tests in `data

Re: [PR] Support Avg distinct for `float64` type [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2762810733 > Run extended tests I see it did trigger but I somehow was expecting feedback in the comments -- This is an automated message from the Apache Git Service. To respond to the

Re: [PR] feat: make parquet native scan schema case insensitive [datafusion-comet]

2025-03-28 Thread via GitHub
parthchandra commented on code in PR #1575: URL: https://github.com/apache/datafusion-comet/pull/1575#discussion_r2019608751 ## spark/src/test/scala/org/apache/comet/parquet/ParquetReadSuite.scala: ## @@ -1460,6 +1460,25 @@ class ParquetReadV1Suite extends ParquetReadSuite with

Re: [PR] fix!: incorrect coercion when comparing with string literals [datafusion]

2025-03-28 Thread via GitHub
alan910127 commented on code in PR #15482: URL: https://github.com/apache/datafusion/pull/15482#discussion_r2019692309 ## datafusion/sqllogictest/test_files/push_down_filter.slt: ## @@ -230,19 +230,19 @@ logical_plan TableScan: t projection=[a], full_filters=[t.a != Int32(100)]

[I] QUALIFY syntax [datafusion]

2025-03-28 Thread via GitHub
jayzhan211 opened a new issue, #15485: URL: https://github.com/apache/datafusion/issues/15485 ### Is your feature request related to a problem or challenge? The QUALIFY clause is used to filter the results of [WINDOW functions](https://duckdb.org/docs/stable/sql/functions/window_funct

Re: [PR] Fuse `CASE(a > 0, b / a)` [datafusion]

2025-03-28 Thread via GitHub
github-actions[bot] closed pull request #14200: Fuse `CASE(a > 0, b / a)` URL: https://github.com/apache/datafusion/pull/14200 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. T

Re: [PR] fix: yield when the next file is ready to open to prevent CPU starvation [datafusion]

2025-03-28 Thread via GitHub
github-actions[bot] commented on PR #14028: URL: https://github.com/apache/datafusion/pull/14028#issuecomment-2763014013 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on PR #15480: URL: https://github.com/apache/datafusion/pull/15480#issuecomment-2763036538 cc @blaginin -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] added fallback using reflection for backward-compatibility [datafusion-comet]

2025-03-28 Thread via GitHub
Kontinuation commented on code in PR #1573: URL: https://github.com/apache/datafusion-comet/pull/1573#discussion_r2019693023 ## spark/src/main/spark-3.5/org/apache/spark/sql/comet/shims/ShimCometScanExec.scala: ## @@ -55,15 +55,48 @@ trait ShimCometScanExec { protected def is

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
chenkovsky commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760736404 ``` grep -n "p_partkey" part.csv ``` why there are two head rows ``` 1:p_partkey,p_name,p_mfgr,p_brand,p_type,p_size,p_container,p_retailprice,p_commen

[I] Pub and export stats_union and col_stats_union [datafusion]

2025-03-28 Thread via GitHub
niebayes opened a new issue, #15474: URL: https://github.com/apache/datafusion/issues/15474 The `UnionExec` operator has two helper functions for merging statistics from multiple input operators. Such a functionality is also needed by my project. Can we provide a public helper func

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-28 Thread via GitHub
berkaysynnada commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2760833915 > I think coalesce added before `execute`? > > However just noticed the execute currently happens in main thread instead of `future::once`. But it shouldn't be an issue a

Re: [PR] Introduce selection vector repartitioning [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on PR #15423: URL: https://github.com/apache/datafusion/pull/15423#issuecomment-2760774554 > #15339 It looks like the join plan is being changed. You should be able to get the test back by also setting `hash_join_single_partition_threshold` to `0` / a low value.

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-28 Thread via GitHub
berkaysynnada commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2760838692 > This should not trigger for physical plans generated by datafusion, since the EnforceDistribution pass already adds that CoalescePartitionsExec. You mean `coalesce_part

Re: [PR] chore: Reimplement ShuffleWriterExec using interleave_record_batch [datafusion-comet]

2025-03-28 Thread via GitHub
2010YOUY01 commented on PR #1511: URL: https://github.com/apache/datafusion-comet/pull/1511#issuecomment-2760428379 > While reading this I also wonder if we would be able to hook into [DF's new SpillManager](https://github.com/apache/datafusion/pull/15355). That's a task for another PR, bu

Re: [PR] Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on PR #15447: URL: https://github.com/apache/datafusion/pull/15447#issuecomment-2760854661 First of all, i create the reproducer benchmark PR for arrow-rs: https://github.com/apache/arrow-rs/pull/7351 | Benchmark | Utf8 Time (µs)| Ut

Re: [PR] Improve performance sort TPCH q3 with Utf8Vew ( Sort-preserving mergi… [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on PR #15447: URL: https://github.com/apache/datafusion/pull/15447#issuecomment-2760778885 After thinking more, i think the better way is to improve it in arrow-rs, so we will benefit more about the utf8view regression cases, created the ticket for arrow-rs: htt

Re: [PR] Docs: Formatting and Added Extra resources [datafusion]

2025-03-28 Thread via GitHub
berkaysynnada commented on PR #15450: URL: https://github.com/apache/datafusion/pull/15450#issuecomment-2760926069 > BTW @oznur-synnada I wonder if you have time to update the page with other recent blog content 🤔 You mean this? https://github.com/apache/datafusion/pull/15440 -- Th

Re: [PR] fix: the average time for clickbench query compute should use new vec to make it compute for each query [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on PR #15472: URL: https://github.com/apache/datafusion/pull/15472#issuecomment-2760531078 cc @2010YOUY01 @xudong963 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

Re: [I] `BinaryExpr` evaluate lacks optimization for `Or` and `And` scenarios [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on issue #11212: URL: https://github.com/apache/datafusion/issues/11212#issuecomment-2761086840 Thank you very much for your reply. These are some updates on this issue. @alamb: 1. I have added the extended SQL in this PR #15462, you can check the details there:[SQL

[PR] fix: Assertion fail in external sort [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 opened a new pull request, #15469: URL: https://github.com/apache/datafusion/pull/15469 ## Which issue does this PR close? related to https://github.com/apache/datafusion/issues/15372 ## Rationale for this change One external sort query will panic d

Re: [PR] fix: Assertion fail in external sort [datafusion]

2025-03-28 Thread via GitHub
Copilot commented on code in PR #15469: URL: https://github.com/apache/datafusion/pull/15469#discussion_r2018081822 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -416,21 +409,23 @@ impl ExternalSorter { Some(self.spill_manager.create_in_progress_file("Sor

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018207072 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Clean up hash_join's ExecutionPlan::execute [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15418: URL: https://github.com/apache/datafusion/pull/15418#issuecomment-2760648004 > Modifying the plan post-execute() feels a bit off to me. Does it seem like a smell to you as well? This should not trigger for physical plans generated by datafusion, since the

Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018226169 ## datafusion/datasource/benches/split_groups_by_statistics.rs: ## @@ -0,0 +1,178 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more c

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2018202146 ## datafusion/datasource/src/memory.rs: ## @@ -440,6 +443,35 @@ impl DataSource for MemorySourceConfig { } } +fn repartitioned( Review Comm

Re: [PR] Enable repartitioning on MemTable. [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 commented on code in PR #15409: URL: https://github.com/apache/datafusion/pull/15409#discussion_r2018210999 ## datafusion/datasource/src/memory.rs: ## @@ -902,4 +1130,319 @@ mod tests { Ok(()) } + +fn batch(row_size: usize) -> RecordBatch { +

Re: [PR] chore: move `optimize_subquery_sort` into optimizer [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on code in PR #15441: URL: https://github.com/apache/datafusion/pull/15441#discussion_r2018254293 ## datafusion/optimizer/src/eliminate_sort.rs: ## @@ -0,0 +1,78 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license

Re: [I] [DISCUSS] Consider Vendoring Certain Dependencies [datafusion]

2025-03-28 Thread via GitHub
Omega359 commented on issue #15360: URL: https://github.com/apache/datafusion/issues/15360#issuecomment-2759069227 I can't say I agree with you there but I'm only one voice. I'd rather see the effort put into other areas tbh. -- This is an automated message from the Apache Git Service. To

Re: [PR] Improve performance of `first_value` by implementing special `GroupsAccumulator` [datafusion]

2025-03-28 Thread via GitHub
Dandandan commented on code in PR #15266: URL: https://github.com/apache/datafusion/pull/15266#discussion_r2007105841 ## datafusion/functions-aggregate/src/first_last.rs: ## @@ -179,6 +292,420 @@ impl AggregateUDFImpl for FirstValue { } } +struct FirstPrimitiveGroupsAccu

Re: [PR] minor: Allow to run TPCH bench for a specific query [datafusion]

2025-03-28 Thread via GitHub
comphead merged PR #15467: URL: https://github.com/apache/datafusion/pull/15467 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dataf

[I] Support Min/Max accumulator for type List [datafusion]

2025-03-28 Thread via GitHub
LiaCastaneda opened a new issue, #15477: URL: https://github.com/apache/datafusion/issues/15477 ### Is your feature request related to a problem or challenge? 👋 Min/Max accumulator for List type is missing on datafusion. ### Describe the solution you'd like The logic is

[PR] experiment: Selectively remove CoalesceBatchesExec [datafusion]

2025-03-28 Thread via GitHub
ctsk opened a new pull request, #15479: URL: https://github.com/apache/datafusion/pull/15479 Relates to Issue: #15478 ## Rationale for this change The blocking operators (HJ buid side, Aggregation) are often planned on top of a RepartitionExec with a CoalesceBatchesExec in-betw

[PR] Remove CoalescePartitions insertion from HashJoinExec [datafusion]

2025-03-28 Thread via GitHub
ctsk opened a new pull request, #15476: URL: https://github.com/apache/datafusion/pull/15476 ## Which issue does this PR close? - Closes #. ## Rationale for this change ## What changes are included in this PR? ## Are these changes tested?

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2762008525 You're absolutely right, I got my logic wrong there. Embarrasing! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and us

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2762014991 > You're absolutely right, I got my logic wrong there. Embarrasing! It's okay. You've also taught me a lot. When I first started writing this, I really didn't consider the c

Re: [PR] feat: make parquet native scan schema case insensitive [datafusion-comet]

2025-03-28 Thread via GitHub
kazuyukitanimura merged PR #1575: URL: https://github.com/apache/datafusion-comet/pull/1575 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsub

Re: [I] Support Min/Max accumulator for type List [datafusion]

2025-03-28 Thread via GitHub
zuston commented on issue #15477: URL: https://github.com/apache/datafusion/issues/15477#issuecomment-2761897910 If you don’t want to fix, I’m glad to fix this as my first pr. @LiaCastaneda -- This is an automated message from the Apache Git Service. To respond to the message, please log

Re: [PR] Migrate subtrait tests to insta, part1 [datafusion]

2025-03-28 Thread via GitHub
alamb merged PR #15444: URL: https://github.com/apache/datafusion/pull/15444 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusi

Re: [PR] feat: make parquet native scan schema case insensitive [datafusion-comet]

2025-03-28 Thread via GitHub
kazuyukitanimura commented on PR #1575: URL: https://github.com/apache/datafusion-comet/pull/1575#issuecomment-2762019121 Merged, thanks @wForget -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2762023923 Hello @alamb, the optimization SQL and documentation related to this PR have been completed, and all tests have passed. We may need to formally verify the performance, but I'm not

Re: [PR] Migrate-substrait-tests-to-insta, part2 [datafusion]

2025-03-28 Thread via GitHub
alamb commented on code in PR #15480: URL: https://github.com/apache/datafusion/pull/15480#discussion_r2019086527 ## datafusion/substrait/tests/cases/emit_kind_tests.rs: ## @@ -53,15 +54,15 @@ mod tests { let ctx = add_plan_schemas_to_ctx(SessionContext::new(), &proto_p

[PR] chore(deps): bump aws-config from 1.6.0 to 1.6.1 [datafusion]

2025-03-28 Thread via GitHub
dependabot[bot] opened a new pull request, #15470: URL: https://github.com/apache/datafusion/pull/15470 Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.6.0 to 1.6.1. Commits See full diff in https://github.com/smithy-lang/smithy-rs/commits";>compare view

[I] The average time compute for clickbench query should not inside the query iterator [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas opened a new issue, #15471: URL: https://github.com/apache/datafusion/issues/15471 ### Describe the bug The average time compute for clickbench query should not inside the query iterator. I was mistakenly added inside the iterator. ### To Reproduce _N

Re: [I] Enable `split_file_groups_by_statistics` by default [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on issue #10336: URL: https://github.com/apache/datafusion/issues/10336#issuecomment-2760506147 Fyi, I'm working on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

Re: [PR] fix: the average time for clickbench query compute should outside the iterator loop [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on PR #15472: URL: https://github.com/apache/datafusion/pull/15472#issuecomment-2760512944 cc @xudong963 @2010YOUY01 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spe

Re: [I] The average time compute for clickbench query should not inside the query iterator [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas commented on issue #15471: URL: https://github.com/apache/datafusion/issues/15471#issuecomment-2760505849 take -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. T

[PR] Improve `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 opened a new pull request, #15473: URL: https://github.com/apache/datafusion/pull/15473 ## Which issue does this PR close? - Closes https://github.com/apache/datafusion/issues/10336#issuecomment-2758082825 ## Rationale for this change As @surema

Re: [PR] Improve `split_groups_by_statistics` method [datafusion]

2025-03-28 Thread via GitHub
xudong963 commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2018151341 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -575,6 +575,95 @@ impl FileScanConfig { }) } +/// Splits file groups into new groups

[PR] Partial fix for #1078 — [Add Dataframe display config] [datafusion-python]

2025-03-28 Thread via GitHub
kosiew opened a new pull request, #1086: URL: https://github.com/apache/datafusion-python/pull/1086 ## Which issue does this PR close? Partial fix for #1078 ## Rationale for this change This PR adds configurable display settings for `DataFrame` representations in the Pyt

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
ctsk commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2760574284 I think one issue is that the short-circuit logic is not handling cases where the the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL` -- This is an automated me

[PR] fix: the average time for clickbench query compute should outside the iterator loop [datafusion]

2025-03-28 Thread via GitHub
zhuqi-lucas opened a new pull request, #15472: URL: https://github.com/apache/datafusion/pull/15472 ## Which issue does this PR close? - Closes [#15471](https://github.com/apache/datafusion/issues/15471) ## Rationale for this change the average time for clickbench query c

Re: [PR] chore(deps): bump aws-config from 1.6.0 to 1.6.1 [datafusion]

2025-03-28 Thread via GitHub
xudong963 merged PR #15470: URL: https://github.com/apache/datafusion/pull/15470 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@data

Re: [PR] fix: the average time for clickbench query compute should use new vec to make it compute for each query [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 merged PR #15472: URL: https://github.com/apache/datafusion/pull/15472 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@dat

Re: [I] The average time compute for clickbench query is wrong [datafusion]

2025-03-28 Thread via GitHub
2010YOUY01 closed issue #15471: The average time compute for clickbench query is wrong URL: https://github.com/apache/datafusion/issues/15471 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the spec

[PR] Partial fix for 1078: [refactor: Simplify HTML generation in PyDataFrame by extracting helper functions] [datafusion-python]

2025-03-28 Thread via GitHub
kosiew opened a new pull request, #1087: URL: https://github.com/apache/datafusion-python/pull/1087 # Which issue does this PR close? Partial fix for #1078 # Rationale for this change > Split up some of the html generation into a set of helper functions. The render

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
niebayes commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760761409 The line number in the error message is the row index of a certain record batch, not the line number in the csv file. I have filed an issue to arrow-rs for making this error me

Re: [I] [Bug] datafusion-cli may fail to read csv files [datafusion]

2025-03-28 Thread via GitHub
niebayes commented on issue #15456: URL: https://github.com/apache/datafusion/issues/15456#issuecomment-2760766577 > why there are two head rows I didn't find this. You might find the cause. -- This is an automated message from the Apache Git Service. To respond to the message, plea

[PR] Update ClickBench queries to avoid to_timestamp_seconds [datafusion]

2025-03-28 Thread via GitHub
acking-you opened a new pull request, #15475: URL: https://github.com/apache/datafusion/pull/15475 ## Which issue does this PR close? - Closes #15465 . ## Rationale for this change ## What changes are included in this PR? ## Are these change

Re: [PR] Add short circuit [datafusion]

2025-03-28 Thread via GitHub
acking-you commented on PR #15462: URL: https://github.com/apache/datafusion/pull/15462#issuecomment-2761137454 > I think one issue is that the short-circuit logic is not handling cases where the the `rhs` contains NULLs. E.g. `true OR NULL` needs to evaluate to `NULL` Thank you very

  1   2   >