berkaysynnada commented on code in PR #14028:
URL: https://github.com/apache/datafusion/pull/14028#discussion_r1906813655
##########
datafusion/core/src/datasource/physical_plan/file_stream.rs:
##########
@@ -478,7 +478,12 @@ impl<F: FileOpener> FileStream<F> {
reader,
)),
partition_values,
- }
+ };
+ // Return control to the runtime
when we're ready to open the next file
+ // to prevent uncancellable
queries in scenarios with many large files.
+ // This functions similarly to a
`tokio::task::yield_now()`.
+ cx.waker().wake_by_ref();
+ return Poll::Pending;
Review Comment:
> I think this may not be correct. The `wake_by_ref` should be called AFTER
returning `Pending`. I think you may be lucky that this works, but it's
undefined behavior. If I understand the `yield_now` impl. correctly, it tries
to delay the "wake" call.
I was thinking similarly, but those have changed my mind:
https://github.com/rust-lang/rust/blob/1f81f906893d166d05fb4839f169983f2b564cc7/library/core/src/task/wake.rs#L423-L426
and this example:
https://github.com/rust-lang/rust/blob/1f81f906893d166d05fb4839f169983f2b564cc7/library/core/src/task/wake.rs#L703-L704
I believe the usage is not a problem, but yielding after being ready to open
the next file seems not the correct solution to me. Does this also start to
degrade the throughput for small files? Do we have any example of this in the
codebase (returning pending not because of an IO)?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]