There are two main things that have been important to us in Dremio around threading:
The first is separating the threading model from the algorithms. We chose to do parallelization at the engine level instead of the operation level. This allows us to substantially increase parallelization while still maintaining a strong thread prioritization model. This contrasts with systems like Apache Impala, which chose to implement threading at the operation level, a choice that has ultimately hurt the ability of individual workloads to scale out within a node. See the experimental MT_DOP feature, where they tried to retreat from this model and struggled to do so; it serves as an example of the challenges you face if you don't separate data algorithms from threading early in the design [1]. This intention was core to how we designed Gandiva, where an external driver makes decisions around threading and the actual algorithm only does small amounts of work before yielding to the driver. This allows a driver to make parallelization and scheduling decisions without having to know the internals of the algorithm. (In Dremio, these are all covered under the interfaces described in Operator [2] and its subclasses, which together provide a very simple set of operator states for the driver to understand; a minimal sketch of the pattern appears below.)

The second is that the majority of the data we work with these days lives in high-latency cloud storage. While we may stage data locally, a huge amount of reads are impacted by the performance of cloud stores. To cover these performance behaviors we did two things: we introduced a very simple-to-use async reading interface for data, seen at [3], and we introduced a collaborative way for individual tasks to declare their blocking state to a central coordinator [4] (both sketched below). Happy to cover these in more detail if people are interested. In general, using these techniques has allowed us to tune many systems to the point where the (highly) variable latency of cloud stores like S3 and ADLS can be mostly cloaked by aggressive read-ahead and what we call predictive pipelining, where reading is guided by latency performance characteristics along with knowledge of columnar formats like Parquet.
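To make the driver/operator split concrete, here is a minimal sketch in Java. All the names here (State, Batch, Driver) are invented for illustration; the real interfaces live in Operator [2]:

    // Hypothetical sketch of driver-controlled execution, not the actual
    // Dremio interfaces (see Operator [2] for those). The key property:
    // an operator never loops or blocks internally; it does a small unit
    // of work, returns, and reports a state the driver can act on.
    interface Batch {}

    enum State { CAN_CONSUME, CAN_PRODUCE, BLOCKED, DONE }

    interface Operator {
      State getState();
      void consumeData(Batch batch); // one small unit of work, then return
      Batch produceData();           // likewise; never spin waiting for I/O
    }

    // The driver owns all parallelization and scheduling decisions and
    // needs no knowledge of operator internals, only the declared state.
    final class Driver {
      void run(Operator op, java.util.Iterator<Batch> input,
               java.util.function.Consumer<Batch> output) {
        while (op.getState() != State.DONE) {
          switch (op.getState()) {
            case CAN_CONSUME: op.consumeData(input.next()); break;
            case CAN_PRODUCE: output.accept(op.produceData()); break;
            case BLOCKED:     Thread.yield(); break; // real driver: requeue the task
            default:          break;
          }
        }
      }
    }

Because the operator yields after each small unit of work, the driver can interleave many operators on few threads and apply whatever prioritization policy it wants.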
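For the async reading side, the essence of [3] is that a read returns a future immediately instead of blocking the calling thread. A stripped-down sketch, with an illustrative method name and signature rather than the exact interface:

    import java.nio.ByteBuffer;
    import java.util.concurrent.CompletableFuture;

    // Approximation of the idea behind AsyncByteReader [3]; reads return
    // futures, so many requests against S3/ADLS can be in flight at once
    // while CPU work continues on the calling thread.
    interface AsyncByteReader {
      CompletableFuture<ByteBuffer> readAsync(long offset, int length);
    }

    // Usage: kick off reads for several Parquet column chunks at once so
    // their cloud-store round trips overlap instead of serializing.
    final class ReadAhead {
      static CompletableFuture<Void> fetchAll(AsyncByteReader reader,
                                              long[] offsets, int[] lengths) {
        CompletableFuture<?>[] pending = new CompletableFuture<?>[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
          pending[i] = reader.readAsync(offsets[i], lengths[i]); // all in flight together
        }
        return CompletableFuture.allOf(pending);
      }
    }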
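And for the blocking-state coordination in [4], the rough shape (again simplified, not the real SharedResourceManager API) is that each task owns a set of shared resources; an async operation marks its resource blocked when it issues I/O and available when the future completes, so the scheduler can tell whether a task is runnable without ever parking a thread:

    import java.util.concurrent.atomic.AtomicInteger;

    // Simplified sketch of cooperative blocking-state tracking; the real
    // implementation is SharedResourceManager [4]. A task is runnable
    // only when none of its resources are marked blocked.
    final class SharedResource {
      private final AtomicInteger taskBlockedCount;
      private boolean blocked;

      SharedResource(AtomicInteger taskBlockedCount) {
        this.taskBlockedCount = taskBlockedCount;
      }
      synchronized void markBlocked() {
        if (!blocked) { blocked = true; taskBlockedCount.incrementAndGet(); }
      }
      synchronized void markAvailable() {
        if (blocked) { blocked = false; taskBlockedCount.decrementAndGet(); }
      }
    }

    final class TaskResources {
      private final AtomicInteger blockedCount = new AtomicInteger();
      SharedResource newResource() { return new SharedResource(blockedCount); }
      boolean isRunnable() { return blockedCount.get() == 0; } // scheduler's check
    }

When a read's future completes, its callback calls markAvailable() and the task becomes eligible to run again; no thread ever sits blocked inside an operator.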
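Finally, a caricature of the latency-driven part of predictive pipelining (purely illustrative, not our implementation): size the read-ahead budget by the bandwidth-delay product, so there are always enough bytes in flight to cover a cloud-store round trip, and spend that budget on the byte ranges the Parquet footer says will be needed next.

    // Illustrative only: keep roughly readAheadBytes() worth of requests
    // outstanding; with per-request latency L and throughput B, about
    // L * B bytes must be in flight to hide the latency entirely.
    final class ReadAheadBudget {
      static long readAheadBytes(double observedLatencySeconds,
                                 double observedBytesPerSecond) {
        return (long) (observedLatencySeconds * observedBytesPerSecond);
      }
    }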
[1] https://www.cloudera.com/documentation/enterprise/latest/topics/impala_mt_dop.html#mt_dop
[2] https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/op/spi/Operator.java
[3] https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/exec/store/dfs/async/AsyncByteReader.java
[4] https://github.com/dremio/dremio-oss/blob/master/sabot/kernel/src/main/java/com/dremio/sabot/threads/sharedres/SharedResourceManager.java

On Mon, Jul 22, 2019 at 9:56 AM Antoine Pitrou <anto...@python.org> wrote:
>
> Le 22/07/2019 à 18:52, Wes McKinney a écrit :
> >
> > Probably the way is to introduce async-capable read APIs into the file
> > interfaces. For example:
> >
> > file->ReadAsyncBlock(thread_ctx, ...);
> >
> > That way the file implementation can decide whether asynchronous logic
> > is actually needed. I doubt very much that a one-size-fits-all
> > concurrency solution can be developed -- in some applications
> > coarse-grained IO and CPU task scheduling may be warranted, but we
> > need to have a solution for finer-grained scenarios where
> >
> > * In the memory-mapped case, there is no overhead and
> > * The programming model is not too burdensome to the library developer
>
> Well, the asynchronous I/O programming model *will* be burdensome at
> least until C++ gets coroutines (which may happen in C++20, and
> therefore be usable somewhere around 2024 for Arrow?).
>
> Regards
>
> Antoine.