My understanding of tokio is that there is exactly one global Runtime <https://docs.rs/tokio/0.3.3/tokio/runtime/struct.Runtime.html> which has two thread pools: one for synchronous (blocking) tasks and one for async tasks.
I am fairly sure there can be only one global Runtime (because when I tried to explicitly create one while an existing one was present, tokio panic!'ed on me).

The complexity created is definitely a concern. My personal feeling is that getting async pushed all the way down will be the least complex solution, though it still won't be "simple".

On Fri, Nov 13, 2020 at 5:16 AM Rémi Dettai <[email protected]> wrote:

> Hi Andrew!
>
> Thanks for your quick response and sorry it took me so long to answer back.
>
> `spawn_blocking` solves the issue:
> https://gist.github.com/rdettai/d2f9bc59b31785c35dce792878976a19
>
> I am still worried by the number of thread pools and the complexity it
> creates (1 pool for the outer runtime, 1 pool for spawn_blocking, 1 pool
> for the inner runtime). As you said, the best thing would be to push async
> all the way down but it's pretty hard as it propagates through the entire
> codebase :). For now I settled for adding async fetchers that download the
> data then sync read from the in-memory buffers. I'll come back to this
> issue a bit later because it still needs some adjustments.
>
> Remi
>
>
> On Fri, Oct 30, 2020 at 11:27 AM Andrew Lamb <[email protected]> wrote:
>
> > Tokio has a function `spawn_blocking`
> > <https://docs.rs/tokio/0.3.2/tokio/task/fn.spawn_blocking.html> that
> > allows running synchronous / blocking code as a future on the current
> > runtime. You can finagle pretty much any combination of sync / async
> > using spawn_blocking and channels, though the resulting code may not be
> > the most beautiful.
> >
> > Once you introduce `async` into a project or use an `async` library like
> > Rusoto, it feels to me like Rust leads you towards pushing async all the
> > way down, and indeed the easiest thing for you, given your described
> > use case, would be async all the way down.
> >
> > I personally think having an async implementation of parquet would be
> > very valuable, as more and more Rust uses tokio / async IO. Maybe we
> > could implement an optional async interface on top of the blocking
> > implementation.
> >
> > Likewise, having a sync api and an async api for DataFusion also seems
> > valuable to me.
> >
> > In my opinion, the biggest benefit from having DataFusion use tokio/async
> > is a single unified thread pool and execution model for both CPU and IO
> > work. Prior to being async-ized with the tokio thread pool, DataFusion
> > spawned / managed threads on its own; adding additional parallelism
> > without oversubscribing the CPU was likely going to be a significant
> > effort. There is a thread
> > <https://lists.apache.org/thread.html/rbc4535613cb9af3467255234b49222bb8d3e57ef91790ebeff66aa74%40%3Cdev.arrow.apache.org%3E>
> > on this mailing list about a similar challenge in the C++ implementation,
> > to give you a sense of the kinds of issues we are hoping to avoid in
> > DataFusion by using async.
> >
> > Andrew
> >
> >
> > On Fri, Oct 30, 2020 at 4:28 AM Rémi Dettai <[email protected]> wrote:
> >
> > > Hi everyone!
> > >
> > > If you are reading this, it means that you fell into the trap of my
> > > catchy (but meaningless) title!
> > >
> > > This discussion somewhat relates to [1].
> > >
> > > DataFusion has recently made its top level "actions" (collect,
> > > write...) async. The problem is that most of the codebase is not async
> > > (in particular Parquet [2]), which means that you have to make an async
> > > context work together with a sync one.
> > >
> > > This works okay... until it doesn't!
> > >
> > > I am trying to read into DataFusion from S3, using the AWS Rust SDK
> > > Rusoto. The problem is that this SDK is itself async. This means that
> > > you end up with the following layers:
> > > DataFusion (async) -> Parquet (sync) -> Rusoto (async)
> > > As you might know, Tokio does not support blocking on a runtime from
> > > within a runtime.
> > >
> > > This triggers a set of questions:
> > > - Does anybody know a way to make such a setup work?
> > > - Making Parquet async is extremely difficult and breaking, should we
> > > try to do it [2] ?
> > > - Is the benefit of having DataFusion async really big? Should we maybe
> > > have both a sync and an async API ?
> > >
> > > Thanks everybody and have a wonderful day.
> > >
> > > Regards,
> > >
> > > Remi
> > >
> > > [1] https://issues.apache.org/jira/browse/ARROW-9464
> > > [2] https://issues.apache.org/jira/browse/ARROW-10307
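
For anyone following along, here is a minimal sketch of the "run the blocking work on its own thread and send the result back over a channel" idea that the thread above attributes to `spawn_blocking` and channels. It deliberately uses only the standard library (`std::thread` and `std::sync::mpsc`) instead of tokio, so it is an analogy and not tokio's actual implementation. `offload_blocking` and `count_nonzero_bytes` are hypothetical names made up for illustration; the second one only stands in for a synchronous read from an already-downloaded in-memory buffer.

```rust
use std::io::{Cursor, Read};
use std::sync::mpsc;
use std::thread;

/// Hypothetical helper: runs a blocking closure on a separate thread and
/// returns a receiver that will yield its result. This only mimics the
/// spirit of tokio's `spawn_blocking`; it spawns a fresh thread per call
/// instead of reusing a pool.
fn offload_blocking<T, F>(work: F) -> mpsc::Receiver<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may have been dropped already; ignore that error.
        let _ = tx.send(work());
    });
    rx
}

/// Hypothetical stand-in for a synchronous reader (e.g. a sync Parquet
/// read) that consumes data already fetched into memory.
fn count_nonzero_bytes(buffer: Vec<u8>) -> std::io::Result<usize> {
    // `Cursor` gives the in-memory buffer a blocking `Read` interface.
    let mut reader = Cursor::new(buffer);
    let mut bytes = Vec::new();
    reader.read_to_end(&mut bytes)?;
    Ok(bytes.iter().filter(|b| **b != 0).count())
}
```

The caller can keep doing other work and call `recv()` on the returned receiver only when it needs the result; in tokio, awaiting the `JoinHandle` returned by `spawn_blocking` plays that role.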
