Tushar7012 commented on PR #20023:
URL: https://github.com/apache/datafusion/pull/20023#issuecomment-3813239229
Hi @2010YOUY01,
### Key Difference: *Where* Parallelization Happens
The main distinction here is **where the parallelism is introduced**.
#### Before (using `try_join_all`)
```rust
future::try_join_all(self.table_paths.iter().map(|table_path| {
    pruned_partition_list(ctx, store.as_ref(), table_path, ...)
}))
```
- This creates one future per table path, all borrowing the same context (`ctx`, `store`, etc.).
- `try_join_all` polls these futures concurrently, but always from inside the single task that awaits it. Because the futures hold borrows, they are not `'static` and cannot be handed to `tokio::spawn`.
- As a result, concurrency is limited to cooperative yielding within a single task: waits can overlap, but the work between `.await` points never runs in parallel.
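
To make the "single task" point concrete, here is a minimal std-only sketch. It is not the PR's code: `ListingContext`, `list_path_async`, `join_all_in_place`, and `list_all_in_place` are hypothetical names, and the hand-rolled polling loop is only a simplified stand-in for what `try_join_all` does (it busy-polls with a no-op waker, which a real executor would not do). It shows that futures borrowing a shared context are all driven by whichever thread polls the join.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread::{self, ThreadId};

// Hypothetical stand-in for the borrowed `ctx` / `store` arguments.
pub struct ListingContext {
    pub prefix: String,
}

// Hypothetical stand-in for `pruned_partition_list`: returns the listed
// path together with the id of the thread that polled the future.
pub async fn list_path_async(ctx: &ListingContext, table_path: &str) -> (String, ThreadId) {
    (format!("{}/{}", ctx.prefix, table_path), thread::current().id())
}

// A waker that does nothing; sufficient here because the loop below
// simply re-polls every pending future.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

type BoxedFuture<'a, T> = Pin<Box<dyn Future<Output = T> + 'a>>;

// Simplified stand-in for a join combinator: polls each unfinished future
// in turn, on the calling thread, until all of them have completed.
pub fn join_all_in_place<'a, T>(mut futures: Vec<BoxedFuture<'a, T>>) -> Vec<T> {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut results: Vec<Option<T>> = futures.iter().map(|_| None).collect();
    while results.iter().any(|r| r.is_none()) {
        for (slot, future) in results.iter_mut().zip(futures.iter_mut()) {
            if slot.is_none() {
                if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
                    *slot = Some(value);
                }
            }
        }
    }
    results.into_iter().map(|r| r.unwrap()).collect()
}

// Builds one future per path, each borrowing `ctx`, and joins them in place.
pub fn list_all_in_place(ctx: &ListingContext, table_paths: &[&str]) -> Vec<(String, ThreadId)> {
    let futures: Vec<BoxedFuture<'_, (String, ThreadId)>> = table_paths
        .iter()
        .map(|p| Box::pin(list_path_async(ctx, p)) as BoxedFuture<'_, _>)
        .collect();
    join_all_in_place(futures)
}
```

Calling `list_all_in_place(&ctx, &["a", "b", "c"])` returns one entry per path, and every recorded `ThreadId` is the caller's own, because nothing was ever handed to another thread.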
#### After (using `JoinSet`)
```rust
join_set.spawn(async move {
    let stream = pruned_partition_list(
        &config,
        &runtime_env,
        store.as_ref(),
        ...
    )
    .await;
    stream.try_collect::<Vec<_>>().await
});
```
- Each `table_path` is processed in a separately spawned task that owns its inputs (`async move`).
- On Tokio's multi-threaded runtime, these tasks can be scheduled on different worker threads.
- This enables true parallelism, not just concurrency.
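
The ownership pattern that spawning requires can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's implementation: `Store`, `list_path_blocking`, and `list_all_spawned` are hypothetical, and `std::thread::spawn` stands in for `JoinSet::spawn`. Both require the spawned work to be `'static`, which is why shared state is cloned behind an `Arc` and moved in. Unlike this sketch, Tokio multiplexes tasks onto a worker pool rather than creating one thread per task.

```rust
use std::sync::Arc;
use std::thread::{self, ThreadId};

// Hypothetical stand-in for the shared object store / session state.
pub struct Store {
    pub prefix: String,
}

// Hypothetical stand-in for listing the files under one table path.
fn list_path_blocking(store: &Store, table_path: &str) -> Result<Vec<String>, String> {
    if table_path.is_empty() {
        return Err("empty table path".to_string());
    }
    Ok(vec![format!("{}/{}/part-0", store.prefix, table_path)])
}

// Spawns one unit of work per table path. Each closure owns an `Arc` clone
// of the store and its own copy of the path, so it borrows nothing from
// the caller and satisfies the `'static` bound on `thread::spawn`.
pub fn list_all_spawned(
    store: Arc<Store>,
    table_paths: &[String],
) -> Result<Vec<(Vec<String>, ThreadId)>, String> {
    let handles: Vec<_> = table_paths
        .iter()
        .map(|path| {
            let store = Arc::clone(&store);
            let path = path.clone();
            thread::spawn(move || {
                let files = list_path_blocking(&store, &path)?;
                Ok::<_, String>((files, thread::current().id()))
            })
        })
        .collect();

    // Join in spawn order so results line up with `table_paths`;
    // the first failure (listing error or panic) is returned.
    handles
        .into_iter()
        .map(|handle| {
            handle
                .join()
                .map_err(|_| "listing task panicked".to_string())
                .and_then(|inner| inner)
        })
        .collect()
}
```

Here each result carries the `ThreadId` of the thread that produced it, and none of them match the caller's thread. The trade-off is the extra `Arc` clones and owned copies of each path, which the borrowed `try_join_all` version did not need.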
### Expected Impact
The benefits of this change should be most noticeable when:
- There are multiple `table_paths` to scan
- Storage operations involve I/O latency (e.g., S3, GCS)
- Running on multi-core systems
I don’t have an end-to-end benchmark yet, but I can add one if a performance
comparison would be helpful for reviewing this PR.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]