tustvold commented on PR #14286:
URL: https://github.com/apache/datafusion/pull/14286#issuecomment-2624213017

   I took a look at your example @JanKaul, thank you for this as I think it 
very nicely demonstrates the challenge of shimming IO at the object store 
interface. Unfortunately I think this may be a different issue from the runtime 
stall issue.
   
   At a high level the code is doing this
   
   ```
   object_store.get().into_stream().map(|x| {
       // stand-in for ~2 seconds of CPU-bound work per chunk
       sleep(Duration::from_secs(2));
   }).try_collect()
   ```
   
   The problem is that this introduces 2 second pauses between polls of the 
streaming request from object storage. Regardless of runtime setup, holding a 
request open across a CPU bound task will result in timeouts, as backpressure 
will eventually cause the sender to time out.
   
   There are two broad strategies to avoid this:
   
   1. Add an unbounded queue on the stream output
   2. Make separate requests each time for the work that can be performed, e.g. 
first 1MB, perform work, then fetch next MB
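
   As a rough illustration of strategy 1, here is a minimal std-only sketch. 
`fake_stream` and `process_buffered` are hypothetical names, not part of 
object_store or DataFusion, and a plain thread plus `std::sync::mpsc::channel` 
(which is unbounded) stands in for an async task draining the response stream.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Hypothetical stand-in for a streaming object store response:
/// four chunks of eight bytes each.
fn fake_stream() -> impl Iterator<Item = Vec<u8>> {
    (0u8..4).map(|i| vec![i; 8])
}

/// Drains the stream on its own thread into an unbounded channel, so the
/// source is read at full speed however slow the consumer is.
/// Returns the total number of bytes processed.
fn process_buffered(work: Duration) -> usize {
    // `mpsc::channel` is unbounded: `send` never blocks on a slow receiver.
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    let producer = thread::spawn(move || {
        for chunk in fake_stream() {
            if tx.send(chunk).is_err() {
                break; // receiver went away, stop reading
            }
        }
        // `tx` is dropped here, which ends the receiver's iteration.
    });

    let mut total = 0;
    for chunk in rx {
        thread::sleep(work); // stand-in for CPU-bound work on this chunk
        total += chunk.len();
    }
    producer.join().expect("producer thread panicked");
    total
}
```

   Calling `process_buffered(Duration::from_secs(2))` would still take about 
8 seconds in total, but the producer finishes reading almost immediately, so 
the request itself is never left idle.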
   
   Option 1. will potentially buffer the entire input stream if the consumer is 
running behind, but should ensure the request runs to completion. Option 2. is 
I think the better approach, and is what code should probably be doing (FWIW 
this is what the parquet reader does).
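
   A minimal sketch of option 2, under stated assumptions: `fetch_range` and 
`process_in_chunks` are hypothetical helpers, and an in-memory byte slice 
stands in for the remote object. With the real object_store crate a ranged 
read (for example `get_range`) would presumably take the place of 
`fetch_range`. The point is only that each request completes before the slow 
work starts, so no request is held open across it.

```rust
use std::ops::Range;

/// Hypothetical stand-in for a ranged GET against object storage.
/// Each call models one short, independent request.
fn fetch_range(object: &[u8], range: Range<usize>) -> Vec<u8> {
    object[range].to_vec()
}

/// Fetches `chunk_size` bytes, runs `work` on them, then fetches the next
/// range. Returns the number of requests issued.
fn process_in_chunks(
    object: &[u8],
    chunk_size: usize,
    mut work: impl FnMut(&[u8]),
) -> usize {
    assert!(chunk_size > 0, "chunk_size must be non-zero");
    let mut offset = 0;
    let mut requests = 0;
    while offset < object.len() {
        let end = (offset + chunk_size).min(object.len());
        // The request finishes here, before any CPU-bound work begins.
        let bytes = fetch_range(object, offset..end);
        requests += 1;
        work(&bytes); // may take arbitrarily long without stalling a request
        offset = end;
    }
    requests
}
```

   The trade-off is more round trips in exchange for bounded memory use and 
no long-lived connection waiting on the consumer.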
   
   _As an aside I also took the time to port over the custom executor approach, 
to show what that looks like - 
https://github.com/JanKaul/cpu-io-executor/pull/1. I still think this is a 
cleaner approach, although much like the DedicatedExecutor in this PR, it 
doesn't help with the issue this example is running into_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

