geoffreyclaude opened a new issue, #14916:
URL: https://github.com/apache/datafusion/issues/14916
### Describe the bug

**Issue Summary**: When a logical plan involves a `TableScan` over a large, explicit list of remote Parquet files (as opposed to scanning an entire directory), creating the physical plan can take a minute or more. The latency is particularly noticeable when the files reside in a remote blob store.

**Root Cause**: Investigation shows that `ListingTable::list_files_for_scan` fetches file metadata sequentially rather than concurrently. Although the code correctly employs concurrent operations before this stage (using `future::try_join_all`) and after it (with `.buffered(ctx.config_options().execution.meta_fetch_concurrency)`), the use of `stream::iter(file_list).flatten()` forces sequential iteration over the file list: each inner stream is polled only after the previous one is exhausted. Each individual metadata fetch (a `head` call on the `object_store`) takes tens to hundreds of milliseconds, and when many such calls run in sequence, the delays add up significantly.

**Proposed Fix**: A simple one-line change can resolve the issue by merging the streams concurrently. Replace the sequential flattening with concurrent stream selection:

```diff
- let file_list = stream::iter(file_list).flatten();
+ let file_list = futures::stream::select_all(file_list);
```

This change allows the `head` requests to be executed in parallel (up to the configured `meta_fetch_concurrency` limit), greatly reducing the latency of physical plan creation.

**Additional Context**: This issue was introduced in [PR #2775](https://github.com/apache/datafusion/pull/2775), where the intent to use concurrency was present both before and after this stage. However, the sequential flattening inadvertently negated the benefits of parallel execution during metadata fetching.

### To Reproduce

_No response_

### Expected behavior

_No response_

### Additional context

_No response_