Re: [PR] Introduce load-balanced `split_groups_by_statistics` method [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on code in PR #15473: URL: https://github.com/apache/datafusion/pull/15473#discussion_r2030077255 ## datafusion/datasource/src/file_scan_config.rs: ## @@ -858,6 +858,96 @@ impl FileScanConfig { }) } +/// Splits file groups into new gro

Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781429711 I expect this to make a large performance difference when x is a string type (as string comparisons are fairly expensive) Thank you for this PR @ding-young and the great reviews

Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781429451 We could also use a CASE ```sql CASE WHEN x IS NOT NULL THEN true ELSE null END ``` -- This is an automated message from the Apache Git Service. To respond to the message, pl

Re: [PR] Remove CoalescePartitions insertion from Joins [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15570: URL: https://github.com/apache/datafusion/pull/15570#issuecomment-2781430463 I see -- this is code simplification 👍

Re: [PR] FIX: Remove redundant repartition [datafusion]

2025-04-06 Thread via GitHub
getChan commented on PR #15604: URL: https://github.com/apache/datafusion/pull/15604#issuecomment-2781414127 close. it isn't really redundant. see issue comments

Re: [PR] FIX: Remove redundant repartition [datafusion]

2025-04-06 Thread via GitHub
getChan closed pull request #15604: FIX: Remove redundant repartition URL: https://github.com/apache/datafusion/pull/15604

Re: [I] Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on issue #15524: URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2781418213 > 1. just following the Duck and make a benchmark specific optimization (and don't try to handle any other cases) > 2. take the high road and say we aren't going to benc

Re: [PR] Remove CoalescePartitions insertion from Joins [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15570: URL: https://github.com/apache/datafusion/pull/15570#issuecomment-2781420170 I wonder if there is any way to write some tests for this (perhaps via `EXPLAIN` in .slt tests to demonstrate that the unnecessary exec is removed)

Re: [PR] Remove CoalescePartitions insertion from Joins [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15570: URL: https://github.com/apache/datafusion/pull/15570#issuecomment-2781421263 🤖 `./gh_compare_branch.sh` [Benchmark Script](https://github.com/alamb/datafusion-benchmarking) Running Linux aal-dev 6.8.0-1016-gcp #18-Ubuntu SMP Fri Oct 4 22:16:29 UTC 2024 x86_

Re: [I] Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on issue #15601: URL: https://github.com/apache/datafusion/issues/15601#issuecomment-2781412638 > The round robin repartitioning is added to increase parallelism (by increasing number of partitions). Hash repartitioning does also increase the number of partitions, bu

Re: [PR] Use pager and allow configuration via `\pset` [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15597: URL: https://github.com/apache/datafusion/pull/15597#issuecomment-2781418624 This PR has several CI failures so marking as a draft while they are addressed. (I do this to make it easier to see what PRs are waiting on review)

Re: [PR] Remove CoalescePartitions insertion from Joins [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on PR #15570: URL: https://github.com/apache/datafusion/pull/15570#issuecomment-2781422765 > > I wonder if there is any way to write some tests for this (perhaps via `EXPLAIN` in .slt tests to demonstrate that the unnecessary exec is removed) > > I don't think a

Re: [PR] Remove CoalescePartitions insertion from Joins [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on PR #15570: URL: https://github.com/apache/datafusion/pull/15570#issuecomment-2781422523 > I wonder if there is any way to write some tests for this (perhaps via `EXPLAIN` in .slt tests to demonstrate that the unnecessary exec is removed) I don't think a redun

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on PR #15566: URL: https://github.com/apache/datafusion/pull/15566#issuecomment-2781418886 I'll take a look at this as well asap

Re: [I] Make Clickbench Q29 5x faster for datafusion [datafusion]

2025-04-06 Thread via GitHub
zhuqi-lucas commented on issue #15524: URL: https://github.com/apache/datafusion/issues/15524#issuecomment-2781423856 Thank you @alamb, @berkaysynnada for the suggested next steps. I will create a follow-up ticket for the 2nd option once I have completed option 1.

Re: [PR] ExecutionPlan: add APIs for filter pushdown & optimizer rule to apply them [datafusion]

2025-04-06 Thread via GitHub
adriangb commented on code in PR #15566: URL: https://github.com/apache/datafusion/pull/15566#discussion_r2029972339 ## datafusion/physical-plan/src/execution_plan.rs: ## @@ -467,6 +468,353 @@ pub trait ExecutionPlan: Debug + DisplayAs + Send + Sync { ) -> Result>> {

Re: [PR] Migrate datafusion/sql tests to insta, part6 [datafusion]

2025-04-06 Thread via GitHub
alamb commented on code in PR #15578: URL: https://github.com/apache/datafusion/pull/15578#discussion_r2030152111 ## datafusion/sql/tests/cases/diagnostic.rs: ## @@ -136,7 +137,7 @@ fn test_table_not_found() -> Result<()> { let query = "SELECT * FROM /*a*/personx/*a*/";

Re: [PR] fix: recursion protection for physical plan node [datafusion]

2025-04-06 Thread via GitHub
chenkovsky commented on PR #15600: URL: https://github.com/apache/datafusion/pull/15600#issuecomment-2781426396 > @chenkovsky do you have any idea about the root cause of the problem? I think this PR shouldn't close the issue until fixing/understanding the underlying problem @berkays

Re: [PR] fix: union all by name [datafusion]

2025-04-06 Thread via GitHub
alamb commented on code in PR #15603: URL: https://github.com/apache/datafusion/pull/15603#discussion_r2030153718 ## datafusion/physical-plan/src/stream.rs: ## @@ -362,6 +362,8 @@ pin_project! { #[pin] stream: S, + +transform_schema: bool, Review Com

Re: [I] Remove `ParquetSource::pruning_predicate` [datafusion]

2025-04-06 Thread via GitHub
alamb closed issue #15534: Remove `ParquetSource::pruning_predicate` URL: https://github.com/apache/datafusion/issues/15534

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-06 Thread via GitHub
alamb merged PR #15561: URL: https://github.com/apache/datafusion/pull/15561

Re: [PR] parquet reader: move pruning predicate creation from ParquetSource to ParquetOpener [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15561: URL: https://github.com/apache/datafusion/pull/15561#issuecomment-2781427143 Onwards towards topk pushdown

Re: [I] Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans [datafusion]

2025-04-06 Thread via GitHub
UBarney closed issue #15601: Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans URL: https://github.com/apache/datafusion/issues/15601

Re: [I] Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans [datafusion]

2025-04-06 Thread via GitHub
UBarney commented on issue #15601: URL: https://github.com/apache/datafusion/issues/15601#issuecomment-2781435121 Thanks for your explanation, @Dandandan @berkaysynnada. I now understand the benefit of adding round-robin even when it is followed by Hash

Re: [PR] Remove CoalescePartitions insertion from Joins [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15570: URL: https://github.com/apache/datafusion/pull/15570#issuecomment-2781439879 🤖: Benchmark completed Details ``` Comparing HEAD and remove-hj-coalesce Benchmark clickbench_extended.json

Re: [PR] Migrate datafusion/sql tests to insta, part6 [datafusion]

2025-04-06 Thread via GitHub
qstommyshu commented on PR #15578: URL: https://github.com/apache/datafusion/pull/15578#issuecomment-2781439769 > Hi @alamb and @blaginin > > I found a way to migrate `roundtrip_statement_with_dialect()` now, just want to confirm if you like to proceed with it. > > What I did b

[PR] chore: Repartitionexec display tree [datafusion]

2025-04-06 Thread via GitHub
getChan opened a new pull request, #15606: URL: https://github.com/apache/datafusion/pull/15606 ## Which issue does this PR close? - Closes #. ## Rationale for this change It is easy to understand because it means the number of `RepartitionExec` input's output partit

Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on issue #15323: URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2781454412 I have a working version locally and will create a PR soon, just one problem, I don't think I can know the number of blocking threads tokio is configured with. this is i

Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-06 Thread via GitHub
andygrove commented on issue #15323: URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2781458966 > I have a working version locally and will create a PR soon, just one problem, I don't think we can know the number of blocking threads tokio is configured with. > > t

Re: [PR] fix: union all by name [datafusion]

2025-04-06 Thread via GitHub
chenkovsky commented on code in PR #15603: URL: https://github.com/apache/datafusion/pull/15603#discussion_r2030172787 ## datafusion/physical-plan/src/stream.rs: ## @@ -362,6 +362,8 @@ pin_project! { #[pin] stream: S, + +transform_schema: bool, Revie

Re: [PR] Introduce DynamicFilterSource and DynamicPhysicalExpr [datafusion]

2025-04-06 Thread via GitHub
adriangb commented on code in PR #15568: URL: https://github.com/apache/datafusion/pull/15568#discussion_r2030171299 ## datafusion/physical-expr-common/src/physical_expr.rs: ## @@ -283,6 +284,51 @@ pub trait PhysicalExpr: Send + Sync + Display + Debug + DynEq + DynHash { /

Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on issue #15323: URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2781460121 > > Comet currently creates a new tokio runtime per plan but there is a proposal to move to a global tokio runtime (per executor) instead. > > [apache/datafusion-com

Re: [PR] fix decimal precision issue in simplify expression optimize rule [datafusion]

2025-04-06 Thread via GitHub
shehabgamin commented on PR #15588: URL: https://github.com/apache/datafusion/pull/15588#issuecomment-2781460484 > FYI @shehabgamin > > > > Do you have some time to review this PR Yes! Will carve out some time on Monday. Thanks @jayzhan211 !!

Re: [I] Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans [datafusion]

2025-04-06 Thread via GitHub
getChan commented on issue #15601: URL: https://github.com/apache/datafusion/issues/15601#issuecomment-2781463692 FYI: TPC-H benchmark, main vs. removing the round-robin repartition ```sh Benchmark tpch_sf1.json ┏━━┳

Re: [I] Cache Parquet Metadata [datafusion]

2025-04-06 Thread via GitHub
alamb commented on issue #15582: URL: https://github.com/apache/datafusion/issues/15582#issuecomment-2781382025 > I would be happy to share / upstream any work I do on this if there is interest. Thanks @matthewmturner -- what I think would be really valuable is if you could prov

Re: [I] Blog post about TopK filter pushdown [datafusion]

2025-04-06 Thread via GitHub
alamb commented on issue #15513: URL: https://github.com/apache/datafusion/issues/15513#issuecomment-2781382737 Thanks @aaryyya -- note we'll need to actually finish the work before we can publish a blog Of course, blog driven development does work pretty well -- it is basically we

Re: [I] Getting started guide for new users (who want to use DataFusion in their project) [datafusion]

2025-04-06 Thread via GitHub
alamb commented on issue #7014: URL: https://github.com/apache/datafusion/issues/7014#issuecomment-2781383363 Hi @aaryyya -- I think thanks to @tshauck and others with the library user guide, this is mostly done now (so closing it) - https://datafusion.apache.org/library-user-guide/

Re: [I] Getting started guide for new users (who want to use DataFusion in their project) [datafusion]

2025-04-06 Thread via GitHub
alamb closed issue #7014: Getting started guide for new users (who want to use DataFusion in their project) URL: https://github.com/apache/datafusion/issues/7014

Re: [PR] FIX: Remove redundant repartition [datafusion]

2025-04-06 Thread via GitHub
getChan commented on code in PR #15604: URL: https://github.com/apache/datafusion/pull/15604#discussion_r2030128874 ## datafusion/physical-plan/src/repartition/mod.rs: ## @@ -510,7 +510,7 @@ impl DisplayAs for RepartitionExec { writeln!(f, "partitioning_scheme={

[I] Unrelated to the current fix, we should compare them using normalized names to support [datafusion]

2025-04-06 Thread via GitHub
alamb opened a new issue, #15605: URL: https://github.com/apache/datafusion/issues/15605 Unrelated to the current fix, we should compare them using normalized names to support ```sql SELECT t1.v1, SUM(t1.v1) OVER W + 1 FROM generate_series(1, 5) AS t1(v1

Re: [PR] fix: nested window function [datafusion]

2025-04-06 Thread via GitHub
alamb commented on code in PR #15033: URL: https://github.com/apache/datafusion/pull/15033#discussion_r2030129209 ## datafusion/sql/src/select.rs: ## @@ -891,29 +892,42 @@ fn match_window_definitions( named_windows: &[NamedWindowDefinition], ) -> Result<()> { for proj

[PR] Support nested join without parentheses [datafusion-sqlparser-rs]

2025-04-06 Thread via GitHub
barsela1 opened a new pull request, #1798: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1798 support for parsing nested JOINs without parentheses

Re: [PR] Support nested join without parentheses [datafusion-sqlparser-rs]

2025-04-06 Thread via GitHub
barsela1 closed pull request #1798: Support nested join without parentheses URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1798

[PR] FIX: Remove redundant repartition [datafusion]

2025-04-06 Thread via GitHub
getChan opened a new pull request, #15604: URL: https://github.com/apache/datafusion/pull/15604 ## Which issue does this PR close? - Closes #15601 ## Rationale for this change - It seems like the purpose of `add_roundrobin` in `EnforceDistribution` optimizer rule is

Re: [I] Extend TopK early termination to partially sorted inputs [datafusion]

2025-04-06 Thread via GitHub
alamb commented on issue #15529: URL: https://github.com/apache/datafusion/issues/15529#issuecomment-2781385443 @NGA-TRAN and @gabotechs can you please help review this PR?

Re: [PR] Actually run wasm test in ci [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15595: URL: https://github.com/apache/datafusion/pull/15595#issuecomment-2781386141 Thank you @XiangpengHao -- I also find CI debugging very long and painful

[PR] add support to nested join_without parentheses snowflake [datafusion-sqlparser-rs]

2025-04-06 Thread via GitHub
barsela1 opened a new pull request, #1799: URL: https://github.com/apache/datafusion-sqlparser-rs/pull/1799 (no comment)

Re: [PR] Use pager and allow configuration via `\pset` [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on PR #15597: URL: https://github.com/apache/datafusion/pull/15597#issuecomment-2781391914 @djellemah thank you for working on this! Can we also add some tests so that these features don't break in the future? There are also some failures in CI.

Re: [PR] Chore: Call arrow's methods `row_count` and `skipped_row_count` [datafusion]

2025-04-06 Thread via GitHub
alamb merged PR #15587: URL: https://github.com/apache/datafusion/pull/15587

Re: [PR] Actually run wasm test in ci [datafusion]

2025-04-06 Thread via GitHub
alamb commented on code in PR #15595: URL: https://github.com/apache/datafusion/pull/15595#discussion_r2030130653 ## .github/workflows/rust.yml: ## @@ -385,24 +385,24 @@ jobs: linux-wasm-pack: name: build with wasm-pack -runs-on: ubuntu-latest -container: -

Re: [PR] chore: rm duplicated `JoinOn` type [datafusion]

2025-04-06 Thread via GitHub
alamb merged PR #15590: URL: https://github.com/apache/datafusion/pull/15590

Re: [PR] chore: rm duplicated `JoinOn` type [datafusion]

2025-04-06 Thread via GitHub
alamb commented on code in PR #15590: URL: https://github.com/apache/datafusion/pull/15590#discussion_r2030131422 ## datafusion/physical-plan/src/joins/mod.rs: ## @@ -39,6 +40,11 @@ mod join_hash_map; #[cfg(test)] pub mod test_utils; +/// The on clause of the join, as vector

Re: [PR] Migrate datafusion/sql tests to insta, part6 [datafusion]

2025-04-06 Thread via GitHub
qstommyshu commented on PR #15578: URL: https://github.com/apache/datafusion/pull/15578#issuecomment-2781397695 Hi @alamb and @blaginin I found a way to migrate `roundtrip_statement_with_dialect()` now, just want to confirm if you like to proceed with it. What I did below is to incor

Re: [PR] Set DataFusion runtime configurations through SQL interface [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on PR #15594: URL: https://github.com/apache/datafusion/pull/15594#issuecomment-2781398502 Hello @kumarlokesh. Thank you for working on this. I have 2 questions/concerns. Let's discuss them a bit to get a future-proof design. 1) There are also `runtime_env: A

Re: [PR] Remove CoalescePartitions insertion from Joins [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on code in PR #15570: URL: https://github.com/apache/datafusion/pull/15570#discussion_r2028923212 ## datafusion/physical-plan/src/joins/cross_join.rs: ## @@ -189,19 +188,12 @@ impl CrossJoinExec { /// Asynchronously collect the result of the left child

[PR] fix doc and broken api [datafusion]

2025-04-06 Thread via GitHub
logan-keede opened a new pull request, #15602: URL: https://github.com/apache/datafusion/pull/15602 ## Which issue does this PR close? - Closes #15443 ## Rationale for this change - doc is advising user to use a private function, that used to be public. #

Re: [PR] fix: recursion protection for physical plan node [datafusion]

2025-04-06 Thread via GitHub
milenkovicm commented on PR #15600: URL: https://github.com/apache/datafusion/pull/15600#issuecomment-2781350515 I'm not an expert, but I don't think this issue is due to unbounded recursion

Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781400999 > About the performance, I'm not 100% sure whether this rule is worth the change, but I made this change because `IS NOT NULL OR NULL` was a bit better than the actual comparison.

Re: [I] Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans [datafusion]

2025-04-06 Thread via GitHub
Dandandan commented on issue #15601: URL: https://github.com/apache/datafusion/issues/15601#issuecomment-2781405588 I don't think it is really redundant. The round robin repartitioning is added to increase parallelism (by increasing number of partitions). Hash repartitioning does a
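
For readers unfamiliar with the two schemes named in this issue, here is a toy sketch in Python of how round-robin and hash partitioning assign data. It is not DataFusion's `RepartitionExec`: `round_robin` and `hash_partition` are hypothetical helpers, batches and rows are plain integers, and Python's built-in `hash` stands in for whatever hash function a real engine would use.

```python
from typing import Callable, List


def round_robin(batches: List[int], n: int) -> List[List[int]]:
    """Assign batch i to partition i % n, without looking at its contents."""
    parts: List[List[int]] = [[] for _ in range(n)]
    for i, batch in enumerate(batches):
        parts[i % n].append(batch)
    return parts


def hash_partition(rows: List[int], n: int, key: Callable[[int], int]) -> List[List[int]]:
    """Assign each row to partition hash(key(row)) % n, so equal keys stay together."""
    parts: List[List[int]] = [[] for _ in range(n)]
    for row in rows:
        parts[hash(key(row)) % n].append(row)
    return parts


# Round-robin spreads six batches evenly over three partitions.
even = round_robin(list(range(6)), 3)

# Hash partitioning keeps every occurrence of a key in one partition.
grouped = hash_partition([1, 2, 1, 3, 2, 1], 2, key=lambda row: row)
```

Round-robin balances the input without inspecting values, while hash partitioning guarantees that equal keys land in the same output partition; the two serve different purposes, which is the distinction the comment draws.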

Re: [PR] Actually run wasm test in ci [datafusion]

2025-04-06 Thread via GitHub
qstommyshu commented on PR #15595: URL: https://github.com/apache/datafusion/pull/15595#issuecomment-2781406092 This could be very helpful for implementing CI for `datafusion-wasm-bindings` as well!

Re: [PR] FIX: Remove redundant repartition [datafusion]

2025-04-06 Thread via GitHub
getChan commented on code in PR #15604: URL: https://github.com/apache/datafusion/pull/15604#discussion_r2030142110 ## datafusion/physical-optimizer/src/enforce_distribution.rs: ## @@ -1258,19 +1259,14 @@ pub fn ensure_distribution( child = add_spm_on_top(ch

Re: [PR] fix: recursion protection for physical plan node [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on PR #15600: URL: https://github.com/apache/datafusion/pull/15600#issuecomment-2781407718 @chenkovsky do you have any idea about the root cause of the problem? I think this PR shouldn't close the issue until fixing/understanding the underlying problem

Re: [I] Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans [datafusion]

2025-04-06 Thread via GitHub
getChan commented on issue #15601: URL: https://github.com/apache/datafusion/issues/15601#issuecomment-2781316139 take

Re: [PR] Enhance: simplify x=x [datafusion]

2025-04-06 Thread via GitHub
ding-young commented on PR #15589: URL: https://github.com/apache/datafusion/pull/15589#issuecomment-2781281225 @2010YOUY01 Instead of applying transformation on filter expression, I adjusted the rule to transform x=x into `x IS NOT NULL OR NULL`. This preserves the behavior where NULL = NU
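
A note on the rewrite discussed in this thread: under SQL three-valued logic, `x = x` is NULL when `x` is NULL and true otherwise, which is also what `x IS NOT NULL OR NULL` yields. The sketch below models that in Python, with `None` standing in for SQL NULL. The helper names (`sql_eq`, `sql_or`, `rewritten`, `case_form`) are illustrative and are not DataFusion APIs, and the sample values are limited to ones for which Python equality is reflexive (no floating-point NaN).

```python
from typing import Any, Optional


def sql_eq(a: Any, b: Any) -> Optional[bool]:
    """Models SQL `a = b`: NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_or(a: Optional[bool], b: Optional[bool]) -> Optional[bool]:
    """Models SQL three-valued OR: true wins, then NULL, then false."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def rewritten(x: Any) -> Optional[bool]:
    """Models `x IS NOT NULL OR NULL`."""
    return sql_or(x is not None, None)


def case_form(x: Any) -> Optional[bool]:
    """Models `CASE WHEN x IS NOT NULL THEN true ELSE NULL END`."""
    return True if x is not None else None


# All three forms agree for NULL and non-NULL inputs alike.
for value in (None, 0, 42, "", "abc"):
    assert sql_eq(value, value) == rewritten(value) == case_form(value)
```

The rewritten forms only test for NULL and never compare values, which is where any saving on expensive comparisons (such as strings) would come from.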

Re: [I] Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans [datafusion]

2025-04-06 Thread via GitHub
getChan commented on issue #15601: URL: https://github.com/apache/datafusion/issues/15601#issuecomment-2781294932 I agree with the suggestion. It seems like the purpose of add_roundrobin in ensure_distribution optimizer is only to increase parallelism. it can achieve that with just add_hash

Re: [PR] fix doc and broken api [datafusion]

2025-04-06 Thread via GitHub
logan-keede commented on PR #15602: URL: https://github.com/apache/datafusion/pull/15602#issuecomment-2781360799 cc @findepi

[PR] fix: union all by name [datafusion]

2025-04-06 Thread via GitHub
chenkovsky opened a new pull request, #15603: URL: https://github.com/apache/datafusion/pull/15603 ## Which issue does this PR close? - Closes #15394. ## Rationale for this change schema from inner physical plan is returned. ## What changes are included in this PR?

Re: [PR] fix: recursion protection for physical plan node [datafusion]

2025-04-06 Thread via GitHub
chenkovsky commented on PR #15600: URL: https://github.com/apache/datafusion/pull/15600#issuecomment-2781362304 > I'm not an expert, but I don't think this issue is due to unbounded recursion yes, it's not due to unbounded recursion.

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-06 Thread via GitHub
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2030116177 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,614 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, wri

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-06 Thread via GitHub
kevinjqliu commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2030181076 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,613 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator

[PR] perf: Use a global tokio runtime [datafusion-comet]

2025-04-06 Thread via GitHub
andygrove opened a new pull request, #1614: URL: https://github.com/apache/datafusion-comet/pull/1614 ## Which issue does this PR close? Closes https://github.com/apache/datafusion-comet/issues/1590 Maybe helps with https://github.com/apache/datafusion-comet/issues/1523

[PR] add object store support for executor and scheduler [datafusion-ballista]

2025-04-06 Thread via GitHub
milenkovicm opened a new pull request, #1230: URL: https://github.com/apache/datafusion-ballista/pull/1230 # Which issue does this PR close? Closes #1205 . # Rationale for this change # What changes are included in this PR? # Are there any user-facing changes?

Re: [PR] Fix: after repartitioning, the `PartitionedFile` and `FileGroup` statistics should be inexact [datafusion]

2025-04-06 Thread via GitHub
berkaysynnada commented on code in PR #15539: URL: https://github.com/apache/datafusion/pull/15539#discussion_r2030265464 ## datafusion/datasource/src/file_groups.rs: ## @@ -263,7 +264,21 @@ impl FileGroupPartitioner { .flatten() .chunk_by(|(partition_i

Re: [PR] datafusion-python 46.0.0 announcement [datafusion-site]

2025-04-06 Thread via GitHub
timsaucer commented on PR #65: URL: https://github.com/apache/datafusion-site/pull/65#issuecomment-2781663438 I was just playing around with disabling `codehilite` and just allowing the `highlight.js` tool to do the code formatting. That does give better python rendering. It seems like the

[PR] feat: add multi level merge for sorting [datafusion]

2025-04-06 Thread via GitHub
rluvaton opened a new pull request, #15608: URL: https://github.com/apache/datafusion/pull/15608 ## Which issue does this PR close? - Closes #15323. ## Rationale for this change To be able to sort any number of spill files without exceeding the tokio blocking thr
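
One way to picture a multi-level merge: when there are more sorted runs than can be merged at once, merge them in groups of bounded size, then repeat on the results until a single pass suffices. The Python sketch below uses in-memory lists in place of spill files and a hypothetical `max_fan_in` limit. It illustrates the general technique only and is not the code in this PR.

```python
import heapq
from typing import Iterable, List


def multi_level_merge(runs: Iterable[Iterable[int]], max_fan_in: int) -> List[int]:
    """Merge sorted runs while never merging more than max_fan_in runs at once."""
    if max_fan_in < 2:
        raise ValueError("max_fan_in must be at least 2")
    current = [list(run) for run in runs]
    # Each pass merges groups of at most max_fan_in runs into longer runs,
    # so the number of runs shrinks until one final merge is enough.
    while len(current) > max_fan_in:
        current = [
            list(heapq.merge(*current[i:i + max_fan_in]))
            for i in range(0, len(current), max_fan_in)
        ]
    return list(heapq.merge(*current))


# Five sorted runs, but only two may be merged at a time.
merged = multi_level_merge([[1, 4], [2, 5], [3, 6], [0, 7], [8]], max_fan_in=2)
```

The trade-off is extra passes over the data in exchange for a fixed upper bound on how many inputs are open at the same time.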

Re: [PR] Migrate datafusion/sql tests to insta, part6 [datafusion]

2025-04-06 Thread via GitHub
qstommyshu commented on PR #15578: URL: https://github.com/apache/datafusion/pull/15578#issuecomment-2781841733 Got it, Thanks @alamb for your thoughts. I will update `roundtrip_statement_with_dialect()`. I will probably go with the macro approach because the macro approach is essent

Re: [PR] Blog post about user defined window functions [datafusion-site]

2025-04-06 Thread via GitHub
Adez017 commented on PR #66: URL: https://github.com/apache/datafusion-site/pull/66#issuecomment-2782070224 any updates?

Re: [I] Redundant Repartition: `RoundRobinBatch` Followed by `Hash` in Physical Plans [datafusion]

2025-04-06 Thread via GitHub
UBarney commented on issue #15601: URL: https://github.com/apache/datafusion/issues/15601#issuecomment-2782098214 > `tpch_sf1` And `tpch_sf10` by default already partition the input data, so AFAIK the plans should not be any different (they don't introduce round-robin repartition) Th

Re: [I] Bug in 'cum_dist()' example in the docs [datafusion]

2025-04-06 Thread via GitHub
Adez017 commented on issue #15611: URL: https://github.com/apache/datafusion/issues/15611#issuecomment-2782129401 I know that there might be something left out from the merge request, and I want to correct it. @alamb

Re: [PR] POC: Cascaded spill merge and re-spill [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2782032011 Also, to have a fully working sort, you need to spill in https://github.com/apache/datafusion/blob/362fcdfc7b9e00cb6126a0cbc41c9abb2637c563/datafusion/physical-plan/src/sorts/bui

[I] Bug in 'cum_dist()' example in the docs [datafusion]

2025-04-06 Thread via GitHub
Adez017 opened a new issue, #15611: URL: https://github.com/apache/datafusion/issues/15611 ### Describe the bug I was reviewing the changes made in the docs for the examples in window functions and noticed that the example was not organised. ![Image](https://gith

Re: [I] Release DataFusion `47.0.0` (April 2025) [datafusion]

2025-04-06 Thread via GitHub
xudong963 commented on issue #15072: URL: https://github.com/apache/datafusion/issues/15072#issuecomment-2781924081 Hey guys, happy new week, let's start testing the incoming DF47 this week! 🚀

[PR] POC: Cascaded spill merge and re-spill [datafusion]

2025-04-06 Thread via GitHub
2010YOUY01 opened a new pull request, #15610: URL: https://github.com/apache/datafusion/pull/15610 ## Which issue does this PR close? - Closes #14692 ## Rationale for this change ### Background for memory-limited sort execution See figures in https://githu

[I] Q23 fails when running TPC-DS SF=1 because of invalid offset buffer being exported for empty StringArray. [datafusion-comet]

2025-04-06 Thread via GitHub
Kontinuation opened a new issue, #1615: URL: https://github.com/apache/datafusion-comet/issues/1615 ### Describe the bug Running TPC-DS SF=1 using [queries-spark/q23.sql in datafusion-benchmarks](https://github.com/apache/datafusion-benchmarks/blob/main/tpcds/queries-spark/q23.sql) f

Re: [PR] feat: add multi level merge for sorting [datafusion]

2025-04-06 Thread via GitHub
2010YOUY01 commented on PR #15608: URL: https://github.com/apache/datafusion/pull/15608#issuecomment-2781983752 I think this is a similar problem as https://github.com/apache/datafusion/issues/14692, will check this out soon -- This is an automated message from the Apache Git Service.

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-06 Thread via GitHub
alamb commented on PR #67: URL: https://github.com/apache/datafusion-site/pull/67#issuecomment-2781547765 Thanks @kevinjqliu -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

Re: [PR] [BLOG] tpchgen-rs: World’s fastest open source TPCH data generator, written in Rust [datafusion-site]

2025-04-06 Thread via GitHub
alamb commented on code in PR #67: URL: https://github.com/apache/datafusion-site/pull/67#discussion_r2030224671 ## content/blog/2025-04-10-fastest-tpch-generator.md: ## @@ -0,0 +1,613 @@ +--- +layout: post +title: tpchgen-rs World’s fastest open source TPC-H data generator, wri

Re: [PR] perf: Use a global tokio runtime [datafusion-comet]

2025-04-06 Thread via GitHub
codecov-commenter commented on PR #1614: URL: https://github.com/apache/datafusion-comet/pull/1614#issuecomment-2781519025 ## [Codecov](https://app.codecov.io/gh/apache/datafusion-comet/pull/1614?dropdown=coverage&src=pr&el=h1&utm_medium=referral&utm_source=github&utm_content=comment&utm_ca

Re: [PR] perf: Use a global tokio runtime [datafusion-comet]

2025-04-06 Thread via GitHub
andygrove commented on PR #1614: URL: https://github.com/apache/datafusion-comet/pull/1614#issuecomment-2781532455 @Kontinuation @wForget could you review? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above

Re: [PR] Support Avg distinct for `float64` type [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15413: URL: https://github.com/apache/datafusion/pull/15413#issuecomment-2781566886 > sadly I'm working on my undergrad thesis project at this time and do not have time to investigate this either 😢 , might be back around mid april Good luck with your project / t

Re: [PR] Remove CoalescePartitions insertion from Joins [datafusion]

2025-04-06 Thread via GitHub
alamb commented on PR #15570: URL: https://github.com/apache/datafusion/pull/15570#issuecomment-2781567330 TLDR is benchmark results look good to me -- thanks @ctsk ! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

Re: [I] Reduce number of tokio blocking threads in SortExec spill [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on issue #15323: URL: https://github.com/apache/datafusion/issues/15323#issuecomment-2781673294 I created a draft PR with a solution, would appreciate your opinion: - #15608 -- This is an automated message from the Apache Git Service. To respond to the message, pleas

Re: [PR] feat: add multi level merge for sorting [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on code in PR #15608: URL: https://github.com/apache/datafusion/pull/15608#discussion_r2030275003 ## datafusion/physical-plan/src/sorts/multi_level_sort_preserving_merge_stream.rs: ## @@ -0,0 +1,244 @@ +// Licensed to the Apache Software Foundation (ASF) under

Re: [PR] feat: add multi level merge for sorting [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on code in PR #15608: URL: https://github.com/apache/datafusion/pull/15608#discussion_r2030275339 ## datafusion/physical-plan/src/sorts/streaming_merge.rs: ## @@ -133,6 +142,24 @@ impl<'a> StreamingMergeBuilder<'a> { self } +pub fn with_spi

Re: [PR] feat: add multi level merge for sorting [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on code in PR #15608: URL: https://github.com/apache/datafusion/pull/15608#discussion_r2030274817 ## datafusion/physical-plan/src/sorts/multi_level_sort_preserving_merge_stream.rs: ## @@ -0,0 +1,244 @@ +// Licensed to the Apache Software Foundation (ASF) under

Re: [PR] feat: add multi level merge for sorting [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on code in PR #15608: URL: https://github.com/apache/datafusion/pull/15608#discussion_r2030275648 ## datafusion/physical-plan/src/sorts/streaming_merge.rs: ## @@ -143,8 +170,27 @@ impl<'a> StreamingMergeBuilder<'a> { fetch, expression

[PR] Add test case for new casting feature [datafusion]

2025-04-06 Thread via GitHub
friendlymatthew opened a new pull request, #15609: URL: https://github.com/apache/datafusion/pull/15609 ## Which issue does this PR close? - Closes #14638 ## Rationale for this change https://github.com/apache/arrow-rs/pull/7141 enables casting from date to time zone-aw

Re: [PR] POC: Cascaded spill merge and re-spill [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on PR #15610: URL: https://github.com/apache/datafusion/pull/15610#issuecomment-2782018619 BTW, row_hash uses the sort preserving merge stream as well and has a similar problem, I think this should be a solution outside the sort exec -- This is an automated message from t

Re: [PR] POC: Cascaded spill merge and re-spill [datafusion]

2025-04-06 Thread via GitHub
rluvaton commented on code in PR #15610: URL: https://github.com/apache/datafusion/pull/15610#discussion_r2030454805 ## datafusion/physical-plan/src/sorts/sort.rs: ## @@ -535,56 +457,262 @@ impl ExternalSorter { // reserved again for the next spill. self.merge_
