Dandandan commented on code in PR #20481:
URL: https://github.com/apache/datafusion/pull/20481#discussion_r2854051331
##########
datafusion/datasource/src/file_stream.rs:
##########
@@ -130,9 +130,16 @@ impl FileStream {
///
/// Since file opening is mostly IO (and may involve a
/// bunch of sequential IO), it can be parallelized with decoding.
+ ///
+ /// In morsel-driven mode this prefetches the next already-morselized item
+ /// from the shared queue (leaf morsels only — items that still need
+ /// async morselization are left in the queue for the normal Idle →
+ /// Morselizing path).
fn start_next_file(&mut self) -> Option<Result<FileOpenFuture>> {
if self.morsel_driven {
- return None;
+ let queue = Arc::clone(self.shared_queue.as_ref()?);
Review Comment:
I reverted this, as `start_next_file` wasn't even used in the new path.
I think we should probably avoid "prefetching" like implemented now as much
as possible or only do it when we certainly know it doesn't slow down things:
* As the opening phase also does pruning we will have less selective dynamic
filters if we "prefetch" aggressively using `open`
* "Opening" files / row groups can consume considerable CPU time
* When workstealing is effective, it probably reduces the benefits of
prefetching (as the threads will be less in idle state)
* Perhaps if added it should do IO only (e.g. fetch ranges and cache it but
don't do any processing) and make it configurable / respect some limits
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]