tustvold commented on PR #14286: URL: https://github.com/apache/datafusion/pull/14286#issuecomment-2624213017
I took a look at your example @JanKaul, thank you for this, as I think it very nicely demonstrates the challenge of shimming IO at the object store interface. Unfortunately, I think this may be a different issue from the runtime stall issue.

At a high level the code is doing this:

```
object_store.get().into_stream().map(|x| { sleep(Duration::from_secs(2)); }).try_collect()
```

The problem is that this introduces 2-second pauses between polls of the streaming request from object storage. Regardless of runtime setup, holding a request open across a CPU-bound task will result in timeouts, as backpressure will eventually cause the sender to time out.

There are two broad strategies to avoid this:

1. Add an unbounded queue on the stream output
2. Make separate requests each time for the work that can be performed, e.g. fetch the first 1MB, perform work, then fetch the next 1MB

Option 1 will potentially buffer the entire input stream if the consumer is running behind, but should ensure the request runs to completion.

Option 2 is, I think, the better approach, and is what the code should probably be doing (FWIW this is what the parquet reader does).

_As an aside I also took the time to port over the custom executor approach, to show what that looks like - https://github.com/JanKaul/cpu-io-executor/pull/1. I still think this is a cleaner approach, although much like the DedicatedExecutor in this PR, it doesn't help with the issue this example is running into_
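To make the second strategy concrete, here is a minimal sketch of the "fetch a range, do the work, fetch the next range" loop. It is deliberately synchronous and uses only the standard library: `RangedSource`, `InMemory`, `chunk_ranges` and `process_in_chunks` are hypothetical names invented for illustration, not part of object_store or DataFusion. In real code the `get_range` call would presumably be an async ranged read such as `ObjectStore::get_range`, and the chunk size would be tuned to the workload.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a store that supports ranged reads.
trait RangedSource {
    /// Total object size in bytes.
    fn size(&self) -> usize;
    /// Fetch `range` as one short, self-contained request.
    fn get_range(&self, range: Range<usize>) -> Vec<u8>;
}

/// In-memory source, used here only so the sketch is runnable.
struct InMemory(Vec<u8>);

impl RangedSource for InMemory {
    fn size(&self) -> usize {
        self.0.len()
    }

    fn get_range(&self, range: Range<usize>) -> Vec<u8> {
        self.0[range].to_vec()
    }
}

/// Split `0..len` into consecutive ranges of at most `chunk` bytes.
fn chunk_ranges(len: usize, chunk: usize) -> Vec<Range<usize>> {
    assert!(chunk > 0, "chunk size must be non-zero");
    (0..len)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(len))
        .collect()
}

/// Fetch one range, let that request finish, then run the CPU-bound work.
/// No request is held open while `work` runs, so slow work cannot stall
/// an in-flight transfer.
fn process_in_chunks<S, F>(source: &S, chunk: usize, mut work: F)
where
    S: RangedSource,
    F: FnMut(&[u8]),
{
    for range in chunk_ranges(source.size(), chunk) {
        // The request for this range has fully completed once this returns.
        let bytes = source.get_range(range);
        // CPU-bound work happens between requests, not in the middle of one.
        work(bytes.as_slice());
    }
}
```

Calling `process_in_chunks(&InMemory(data), 1024 * 1024, |bytes| { /* decode */ })` would issue one request per 1MB and run the closure between requests. The trade-off is more round trips in exchange for never pausing in the middle of a streaming response.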