Tushar7012 commented on PR #20023:
URL: https://github.com/apache/datafusion/pull/20023#issuecomment-3813164303
Great question! The key difference is **where** the parallelization happens:
**Before (with `try_join_all`):**
```rust
future::try_join_all(self.table_paths.iter().map(|table_path| {
pruned_partition_list(ctx, store.as_ref(), table_path, ...)
}))
```
This creates futures, but they all share the same borrowed context (`ctx`, `store`, etc.). While `try_join_all` can poll the futures concurrently, they all run on a single task (and therefore on one thread at any given moment) because they hold borrowed references and cannot be `'static`. The concurrency is limited to cooperative yielding within that one async task.
**After (with `JoinSet`):**
```rust
join_set.spawn(async move {
    let stream = pruned_partition_list(&config, &runtime_env, store.as_ref(), ...)
        .await?;
    stream.try_collect::<Vec<_>>().await
});
```
Each `table_path` is processed in a separate spawned task, and those tasks can run on different worker threads in the Tokio runtime's thread pool. This gives true parallelism rather than single-task concurrency.
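The reason the `JoinSet` version needs `async move` (owning or cloning `config`, `runtime_env`, etc.) is that spawned tasks must be `'static`: they cannot hold borrows into the caller's stack, which is exactly what the `try_join_all` futures were doing. The same bound applies to `std::thread::spawn`, so here is a minimal stdlib-only sketch of that ownership requirement (the paths and closure here are hypothetical stand-ins, not DataFusion code):

```rust
use std::thread;

fn main() {
    // Hypothetical stand-ins for per-table work; not DataFusion types.
    let paths = vec!["t1".to_string(), "t2".to_string(), "t3".to_string()];

    let mut handles = Vec::new();
    for path in paths {
        // Like tokio's JoinSet::spawn, thread::spawn requires a 'static
        // closure: each task must *own* its data (hence `move`), which is
        // why the spawned version clones the context instead of borrowing.
        handles.push(thread::spawn(move || path.len()));
    }

    // Join all tasks and aggregate their results.
    let total: usize = handles.into_iter().map(|h| h.join().unwrap()).sum();
    println!("{total}"); // each path is 2 bytes, so this prints 6
}
```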
Regarding benchmarks: I don't have an end-to-end benchmark yet. The benefit would be most visible when:

- there are multiple `table_paths` to scan,
- object store operations have I/O latency (e.g., S3, GCS), and
- the process runs on a multi-core system.
Would a benchmark demonstrating the speedup be helpful for reviewing this
PR? I can add one if needed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]