geoffreyclaude opened a new issue, #14916:
URL: https://github.com/apache/datafusion/issues/14916

   ### Describe the bug
   
   **Issue Summary**:
   When a logical plan involves a `TableScan` over a large, explicit list of 
remote Parquet files (as opposed to scanning an entire directory), the creation 
of the physical plan can take a minute or more. This high latency is 
particularly noticeable when the files reside in a remote blob store.
   
   **Root Cause**:
   Investigation shows that `ListingTable::list_files_for_scan` fetches 
file metadata sequentially rather than concurrently.
   
   Although the code runs concurrently before this step (using 
`future::try_join_all`) and after it (with 
`.buffered(ctx.config_options().execution.meta_fetch_concurrency)`), 
`stream::iter(file_list).flatten()` drains the file list sequentially: 
`flatten` polls one inner stream to completion before it starts the next. 
Each individual metadata fetch (a `head` call on the `object_store`) takes 
tens to hundreds of milliseconds, so many such calls performed in sequence 
add up to a significant delay.
   
   **Proposed Fix**:
   A simple one-line change can resolve the issue by merging the streams 
concurrently. Replace the sequential flattening with concurrent stream 
selection:
   
   ```diff
   -        let file_list = stream::iter(file_list).flatten();
   +        let file_list = futures::stream::select_all(file_list);
   ```
   
   This change allows the `head` requests to be executed in parallel (up to the 
configured `meta_fetch_concurrency` limit), greatly reducing the latency of 
physical plan creation.
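
To illustrate the difference in polling order, here is a minimal std-only sketch. It is not DataFusion code and does not use the `futures` crate: `FakeHead`, `rounds_sequential`, `rounds_concurrent` and `make_heads` are hypothetical names, and one poll round stands in for one unit of network latency. The sequential driver mimics how `flatten` finishes one item before touching the next, while the concurrent driver mimics how `select_all` polls every pending item.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for one `head` request: it becomes ready on its
/// `rounds_left`-th poll, so each poll round models one unit of latency.
struct FakeHead {
    rounds_left: u32,
}

impl Future for FakeHead {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        self.rounds_left = self.rounds_left.saturating_sub(1);
        if self.rounds_left == 0 {
            Poll::Ready(())
        } else {
            // Ask to be polled again, as a pending I/O future eventually would.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing; the drivers below simply poll in a loop.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Flatten-like order: only the current request is polled, and the next one
/// is not started until the current one has completed.
fn rounds_sequential(mut heads: Vec<FakeHead>) -> u32 {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut rounds = 0u32;
    for head in heads.iter_mut() {
        loop {
            rounds += 1;
            if Pin::new(&mut *head).poll(&mut cx).is_ready() {
                break;
            }
        }
    }
    rounds
}

/// Select-all-like order: every pending request is polled in each round, so
/// their latencies overlap instead of adding up.
fn rounds_concurrent(mut heads: Vec<FakeHead>) -> u32 {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut rounds = 0u32;
    while !heads.is_empty() {
        rounds += 1;
        // Keep only the requests that are still pending after this round.
        heads.retain_mut(|head| Pin::new(head).poll(&mut cx).is_pending());
    }
    rounds
}

/// Build `n` fake requests that each need `latency` poll rounds.
fn make_heads(n: usize, latency: u32) -> Vec<FakeHead> {
    (0..n).map(|_| FakeHead { rounds_left: latency }).collect()
}

fn main() {
    let sequential = rounds_sequential(make_heads(100, 5));
    let concurrent = rounds_concurrent(make_heads(100, 5));
    println!("sequential: {sequential} rounds, concurrent: {concurrent} rounds");
}
```

In this model, 100 requests of 5 rounds each take 500 rounds when driven sequentially and 5 rounds when driven concurrently. The model ignores any concurrency cap, so it shows only the ordering effect, not real timings.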
   
   **Additional Context**:
   This issue was introduced in [PR 
#2775](https://github.com/apache/datafusion/pull/2775), where the intent to use 
concurrency was present both before and after this stage. However, the 
sequential flattening inadvertently negated the benefits of parallel execution 
during metadata fetching.
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

