Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-08-02 Thread via GitHub
AdamGS commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-3146689387 Ready here - https://github.com/apache/datafusion/pull/17020, I think I didn't mess up the conflict resolution too much, I'll probably do another pass to make sure. I don't ful

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-08-02 Thread via GitHub
adriangb commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-3146547049 I think so! I haven't fully grokked what this PR means but I will say one of the pieces of DataFusion we customize is partitioning / concurrency so I'm interested to see what impact

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-08-02 Thread via GitHub
AdamGS commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-3146537884 I can take the work to rebase this PR and fix the conflicts, is there interest? -- This is an automated message from the Apache Git Service. To respond to the message, please log on

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-08-01 Thread via GitHub
github-actions[bot] closed pull request #14411: feat: Support On-Demand Repartition URL: https://github.com/apache/datafusion/pull/14411

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-07-23 Thread via GitHub
github-actions[bot] commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-3111723405 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-05-23 Thread via GitHub
Dandandan commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2904301788 The conflicts seem very minimal @adriangb so if someone can fix the conflicts, it should be in reviewable state again

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-05-17 Thread via GitHub
adriangb commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2888717383 I'm sad to see this go stale 😢, sadly I also don't have the bandwidth to push it forward

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-05-17 Thread via GitHub
github-actions[bot] commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2888711945 Thank you for your contribution. Unfortunately, this pull request is stale because it has been open 60 days with no activity. Please remove the stale label or comment or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-03-17 Thread via GitHub
adriangb commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2729232834 This is exciting!

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-03-01 Thread via GitHub
berkaysynnada commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2692276290 I got a bit off-topic, but we will focus on this promising work to drive it to completion in the coming week, together with @mertak-synnada.

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-03-01 Thread via GitHub
Weijun-H commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2692083297 I actually anticipate higher memory consumption, particularly in systems where the upstream portion of RepartitionExec generates results faster than the downstream component process

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-28 Thread via GitHub
Dandandan commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2690915249 I still see higher memory usage compared to main. Is this something you plan to improve, or do you want it reviewed as is, @Weijun-H?

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-16 Thread via GitHub
Dandandan commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2661425398 > > I ran some tests yesterday and I can confirm the runtime improvements. I do get some high memory usage however especially with some queries (TPC-H Query 18 I believe) than when

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-14 Thread via GitHub
Weijun-H commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2660797386 > I ran some tests yesterday and I can confirm the runtime improvements. I do get some high memory usage however especially with some queries (TPC-H Query 18 I believe) than when us

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-14 Thread via GitHub
Weijun-H commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1957051847 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1589 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-14 Thread via GitHub
Dandandan commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2659343979 I ran some tests yesterday and I can confirm the runtime improvements. I do get some high memory usage however especially with some queries (TPC-H Query 18 I believe) than with t

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-14 Thread via GitHub
Dandandan commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1956072347 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1589 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-13 Thread via GitHub
Weijun-H commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2658410453 > > > > I wonder why tpch_mem_sf10 is slower for some queries? Might it be possible the created memtable is not created evenly because of the new round robin (that might be fixable

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-12 Thread via GitHub
Dandandan commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2654601057 Specifically, I think we can try [this approach](https://github.com/apache/datafusion/pull/13707) together with on-demand repartition 🤔

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-12 Thread via GitHub
Dandandan commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2654136409 > > > I wonder why tpch_mem_sf10 is slower for some queries? Might it be possible the created memtable is not created evenly because of the new round robin (that might be fixable e

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-12 Thread via GitHub
Weijun-H commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2654051392 > > I wonder why tpch_mem_sf10 is slower for some queries? Might it be possible the created memtable is not created evenly because of the new round robin (that might be fixable e.g.

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-11 Thread via GitHub
berkaysynnada commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2652041558 > I wonder why tpch_mem_sf10 is slower for some queries? Might it be possible the created memtable is not created evenly because of the new round robin (that might be fixable e

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-11 Thread via GitHub
Dandandan commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2651627436 I wonder why tpch_mem_sf10 is slower for some queries? Might it be possible the created memtable is not created evenly because of the new round robin (that might be fixable).

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-09 Thread via GitHub
Weijun-H commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2646358808 I believe we cannot use the customized channel `DistributionReceiver` currently, as `OnDemandRepartitionExec` can prevent channels from filling up endlessly. Additionally, I noticed

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-09 Thread via GitHub
berkaysynnada commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2646294361 > The point is: I think there should be no memory-bloat issue in TPCH/clickbench queries caused by `RepartitionExec`, just wondering do you have any bad query can reproduce the

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-09 Thread via GitHub
2010YOUY01 commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2646218873 > Hi @2010YOUY01. I'd like to thank you firstly for this investigation. I actually expect higher memory consumption—especially in systems where the upstream part of the Repartitio

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-09 Thread via GitHub
berkaysynnada commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2646154037 > * No performance regression (benchmarks already showed) > * Reduce memory footprint, for queries which batch can accumulate in `RepartitionExec` (as the origin issue said)

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-09 Thread via GitHub
Weijun-H commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2646145904 > Impressive work! I got a suggestion and a high-level question: > > ### Suggestion > I think to justify this change, we have to make sure: > > * No performance regre

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-09 Thread via GitHub
2010YOUY01 commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2646119135 Impressive work! I got a suggestion and a high-level question: ### Suggestion I think to justify this change, we have to make sure: - No performance regression (benchma

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-07 Thread via GitHub
mertak-synnada commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1946078413 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1362 @@ +// Licensed to the Apache Software Foundation (ASF) under one +

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-06 Thread via GitHub
Weijun-H commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2641916193 I updated the latest benchmark results. It seems the `OnDemandRepartition` improved performance on `clickbench_partitioned` and large datasets like `tpch_sf50`. For `tpch_sf1` and `

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-06 Thread via GitHub
berkaysynnada commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2640985434 > Maybe I am missing something, but the benchmark numbers reported above don't really show much of an improvement this might be a silly question but, did you set the conf

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-06 Thread via GitHub
ozankabak commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2640838240 @Weijun-H did [some benchmarks](https://github.com/synnada-ai/datafusion-upstream/pull/60) a while back and the approach seemed promising in TPCH/SF50. @mertak-synnada will

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-06 Thread via GitHub
alamb commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2640829028 > This is still in somewhat early stages, and there is work to do. But it might be good to get feedback early on from the community as the performance of this code is somewhat sensitiv

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-06 Thread via GitHub
mertak-synnada commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1944579206 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1320 @@ +// Licensed to the Apache Software Foundation (ASF) under one +

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-05 Thread via GitHub
Weijun-H commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1942878131 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1320 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-04 Thread via GitHub
Weijun-H commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1941260256 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1320 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-04 Thread via GitHub
Weijun-H commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1941261097 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1320 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-04 Thread via GitHub
Weijun-H commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1941257144 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1320 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-04 Thread via GitHub
Weijun-H commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1941255907 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1320 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-04 Thread via GitHub
Weijun-H commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1941254980 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1320 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-03 Thread via GitHub
mertak-synnada commented on code in PR #14411: URL: https://github.com/apache/datafusion/pull/14411#discussion_r1938964242 ## datafusion/physical-plan/src/repartition/on_demand_repartition.rs: ## @@ -0,0 +1,1320 @@ +// Licensed to the Apache Software Foundation (ASF) under one +

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-03 Thread via GitHub
ozankabak commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2630289724 This is still in somewhat early stages, and there is work to do. But it might be good to get feedback early on from the community as the performance of this code is somewhat sensit

Re: [PR] feat: Support On-Demand Repartition [datafusion]

2025-02-02 Thread via GitHub
ozankabak commented on PR #14411: URL: https://github.com/apache/datafusion/pull/14411#issuecomment-2630036258 @Weijun-H has been working on this with the Synnada team for a while. The initial benchmark results were promising, so we decided to continue development while receiving community