Re: [PR] Add example for using a separate threadpool for CPU bound work [datafusion]

via GitHub Fri, 22 Nov 2024 16:12:36 -0800


tustvold commented on PR #13424:
URL: https://github.com/apache/datafusion/pull/13424#issuecomment-2495131293


   > but I don't know how to translate your suggestions into actual code
   
   The basic idea is rather than shoehorning the runtime dispatch into the 
ObjectStore trait, instead make the components within DataFusion that perform 
IO themselves spawn the relevant work to a separate runtime. Or in other words, 
**make DataFusion draw a distinction between IO bound and CPU bound 
operators**. Not only does this avoid the issues above, but it is also much 
easier to reason about. I, at least, am not confident I grasp the full 
implications of things like CrossRtStream w.r.t. backpressure, task wakeups, 
etc. If I stream a 10GB CSV file, will it end up buffering the entire 10GB 
CSV in memory whilst waiting for the DF runtime to have capacity? I honestly 
don't know. 😅
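   
   To illustrate what is at stake in the buffering question, here is a minimal 
sketch of backpressure using only the Rust standard library. It says nothing 
about how CrossRtStream actually behaves; the function name, chunk size, and 
thread roles are all assumptions made for illustration. A bounded channel makes 
the producer (standing in for the IO side) wait when the consumer (standing in 
for the CPU side) falls behind, so the number of chunks held in memory stays 
near the channel capacity instead of growing with the size of the file.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::mpsc::sync_channel;
use std::sync::Arc;
use std::thread;
use std::time::Duration;

/// Hypothetical helper: streams `total_chunks` 1 KiB chunks from a producer
/// thread to a deliberately slow consumer over a channel that buffers at most
/// `capacity` chunks. Returns (chunks consumed, peak chunks in flight).
fn stream_with_backpressure(total_chunks: usize, capacity: usize) -> (usize, usize) {
    let (tx, rx) = sync_channel::<Vec<u8>>(capacity);
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let producer = {
        let (in_flight, peak) = (Arc::clone(&in_flight), Arc::clone(&peak));
        thread::spawn(move || {
            for _ in 0..total_chunks {
                let chunk = vec![0u8; 1024];
                // Count the chunk as in flight before handing it over.
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                // Blocks while the channel is full: this is the backpressure.
                tx.send(chunk).expect("consumer hung up");
            }
            // `tx` is dropped here, which ends the consumer loop below.
        })
    };

    let mut consumed = 0;
    for chunk in rx {
        in_flight.fetch_sub(1, Ordering::SeqCst);
        // Simulate CPU-bound work that is slower than the producer.
        thread::sleep(Duration::from_millis(1));
        consumed += chunk.len() / 1024;
    }
    producer.join().expect("producer panicked");
    (consumed, peak.load(Ordering::SeqCst))
}
```

   Calling `stream_with_backpressure(50, 4)` consumes all 50 chunks, while the 
peak in flight never exceeds the capacity plus the two chunks held by the 
producer and the consumer themselves. With an unbounded channel the peak could 
instead approach the total number of chunks.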
   
   To give a concrete example of this, rather than bridging ObjectStore::list 
across runtimes, and incurring that penalty for every wakeup of every stream, 
instead dispatch 
[list_partitions](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/listing/helpers.rs#L180)
 or possibly one of the higher level methods in ListingTable to the IO runtime.
   
   A similar approach could be taken for AsyncFileReader in parquet, or 
FileStream, or any of the other IO components.
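   
   As a rough sketch of this kind of coarse-grained dispatch: the whole listing 
operation is sent to the IO side as one job, and only the final result crosses 
back, instead of every item of a stream being bridged. The sketch uses std 
threads and channels as a stand-in for a second tokio runtime; `IoRuntime`, 
`run_listing`, the local `list_partitions` function, and the paths are all 
hypothetical and are not DataFusion APIs.

```rust
use std::sync::mpsc;
use std::thread;

/// A unit of work to run on the IO thread.
type Job = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical stand-in for a dedicated IO runtime: a single named worker
/// thread that runs whole jobs sent to it.
struct IoRuntime {
    sender: Option<mpsc::Sender<Job>>,
    worker: Option<thread::JoinHandle<()>>,
}

impl IoRuntime {
    fn new() -> Self {
        let (sender, receiver) = mpsc::channel::<Job>();
        let worker = thread::Builder::new()
            .name("io-runtime".to_string())
            .spawn(move || {
                // Runs until every sender has been dropped.
                for job in receiver {
                    job();
                }
            })
            .expect("failed to spawn IO thread");
        Self { sender: Some(sender), worker: Some(worker) }
    }

    /// Runs `f` on the IO thread and returns a receiver for its single result.
    fn spawn<T, F>(&self, f: F) -> mpsc::Receiver<T>
    where
        T: Send + 'static,
        F: FnOnce() -> T + Send + 'static,
    {
        let (tx, rx) = mpsc::channel();
        let job: Job = Box::new(move || {
            // Ignore the error if the caller stopped waiting for the result.
            let _ = tx.send(f());
        });
        self.sender
            .as_ref()
            .expect("runtime shut down")
            .send(job)
            .expect("IO thread exited");
        rx
    }
}

impl Drop for IoRuntime {
    fn drop(&mut self) {
        // Dropping the sender ends the worker loop; then wait for it to exit.
        drop(self.sender.take());
        if let Some(worker) = self.worker.take() {
            let _ = worker.join();
        }
    }
}

/// Hypothetical IO-bound operation; a real one would talk to object storage.
/// Also reports the name of the thread it ran on.
fn list_partitions(prefix: &str) -> (Vec<String>, String) {
    let paths: Vec<String> = (0..3)
        .map(|i| format!("{prefix}/part-{i}.parquet"))
        .collect();
    let thread_name = thread::current().name().unwrap_or("unnamed").to_string();
    (paths, thread_name)
}

/// The CPU side makes one hop to the IO side for the entire listing.
fn run_listing() -> (Vec<String>, String) {
    let io = IoRuntime::new();
    let pending = io.spawn(|| list_partitions("s3://bucket/table"));
    pending.recv().expect("IO job was dropped")
}
```

   Calling `run_listing()` from the CPU side returns the three listed paths 
together with the name of the thread that did the work, which is the IO thread 
rather than the caller's. The cost of crossing between the two sides is paid 
once per operation, not once per wakeup.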
   
   I appreciate this is a more intrusive approach, but I don't really think 
DataFusion can continue to leave this sort of thing as an exercise for the 
reader, especially given the issues only start to become obvious as load 
increases. Having the DataFusion operators designed to accommodate and 
coordinate this IO separation will lead to the best outcomes, especially when 
it comes to resource constrained systems where it really matters when/how IO is 
interleaved with the corresponding CPU work.
   
   That all being said, what is being proposed in this PR **is** an improvement 
over the current state of play, and I don't want to detract from that. However, 
I had hoped that we might be able to take this opportunity to define a better 
story for this rather than simply "blessing" the somewhat arcane hackery we 
worked into InfluxDB to get around this.
   
   _I personally would be interested in @crepererum's take on this, as someone 
who wrote much of this code for InfluxDB. I am aware my judgement may be 
slightly clouded by my long-held desire for a more formal separation of IO and 
CPU bound tasks within DF (https://github.com/apache/datafusion/issues/2199), 
combined with a deep distaste for using async with CPU-bound work, but I don't 
think I am being unreasonable here._


