tustvold commented on PR #13424: URL: https://github.com/apache/datafusion/pull/13424#issuecomment-2495131293
> but I don't know how to translate your suggestions into actual code The basic idea is rather than shoehorning the runtime dispatch into the ObjectStore trait, instead make the components within DataFusion that perform IO themselves spawn the relevant work to a separate runtime. Or in other words, **make DataFusion draw a distinction between IO bound and CPU bound operators**. Not only does this avoid the issues above, but is also much easier to reason about. I at least am not confident I fully grasp the full implications of things like CrossRtStream w.r.t backpressure, task wakeups, etc... If I stream a 10GB CSV file, will it end up buffering the entire 10GB CSV in memory whilst waiting for the DF runtime to have capacity, I don't honestly know? 😅 To give a concrete example of this, rather than bridging ObjectStore::list across runtimes, and incurring that penalty for every wakeup of every stream, instead dispatch [list_partitions](https://github.com/apache/datafusion/blob/main/datafusion/core/src/datasource/listing/helpers.rs#L180) or possibly one of the higher level methods in ListingTable to the IO runtime. A similar approach could be taken for AsyncFileReader in parquet, or FileStream, or any of the other IO components. I appreciate this is a more intrusive approach, but I don't really think DataFusion can continue to leave this sort of thing as an exercise for the reader, especially given the issues only start to become obvious as load increases. Having the DataFusion operators designed to accommodate and coordinate this IO separation will lead to the best outcomes, especially when it comes to resource constrained systems where it really matters when/how IO is interleaved with the corresponding CPU work. That all being said what is being proposed in this PR **is** an improvement over the current state of play, and I don't want to detract from that, however, I had hoped that we might be able to take this opportunity to define a better story for this rather than simply "blessing" the somewhat arcane hackery we worked into InfluxDB to get around this. _I personally would be interested in @crepererum's take on this, as someone who wrote much of this code for InfluxDB. I am aware my judgement may be slightly clouded by my long-held desire for a more formal separation of IO and CPU bound tasks within DF (https://github.com/apache/datafusion/issues/2199), combined with a deep distaste for using async with CPU-bound work, but I don't think I am being unreasonable here_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org