I took a pass through this; thank you for a good discussion of the alternative. One thing I don't quite understand about this proposal is its scope. Is the intention that most APIs will eventually work with Futures instead of raw return values (i.e., returning a Table or RecordBatch directly will never be a thing, and you get a Future<Table> instead)?
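To make the question concrete, here is a minimal sketch of the two API styles I am contrasting. It uses Python's standard `concurrent.futures` purely for illustration; `Table`, `read_table`, and `read_table_async` are hypothetical names invented for this example and are not Arrow's actual API.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Table:
    """Hypothetical stand-in for an Arrow Table; not the real class."""
    num_rows: int


def read_table(path: str) -> Table:
    """Raw-return style: the caller blocks until the Table is ready."""
    # A real reader would perform I/O here; this stub just makes up a result.
    return Table(num_rows=len(path))


def read_table_async(path: str, executor: ThreadPoolExecutor) -> "Future[Table]":
    """Future-returning style: the caller gets a handle immediately."""
    return executor.submit(read_table, path)


with ThreadPoolExecutor(max_workers=2) as pool:
    direct = read_table("data.csv")               # a Table, available now
    pending = read_table_async("data.csv", pool)  # a Future, resolved later
    resolved = pending.result()                   # block only when the value is needed
```

My question is whether the second style is expected to become the norm for most public entry points, or whether both would continue to exist side by side.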
Thanks,
Micah

On Mon, Feb 15, 2021 at 2:15 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Weston,
>
> Thanks for putting this comprehensive and informative document together.
>
> There are several layers of problems to consider, just thinking out loud:
>
> * I hypothesize that the bottom of the stack is a thread pool with a
> queue-per-thread that implements work stealing. Some code paths might
> use this low-level task API directly, for example a workload putting
> all of its tasks into one particular queue and letting the other
> threads take work if they are idle.
>
> * I've brought this up in the past, but if we are comfortable with
> more threads than CPU cores, we may allow for the base level thread
> pool to be expanded dynamically. The tradeoff here is coarse
> granularity context switching between tasks only at time of task
> completion vs. the OS context-switching mid-task between threads. For
> example, if there is a code path which wishes to guarantee that a
> thread is being put to work right away to execute its tasks, even if
> all of the other queues are full of other tasks, then this could
> partially address the task prioritization problem discussed in the
> document. If there is a notion of a "task producer" or a "workload"
> and the number of task producers exceeds the size of the thread
> pool, then an additional thread + dedicated task queue for that thread
> could be created to handle tasks submitted by the producer. Maybe this
> is a bad idea (I'm not an expert in this domain after all), let me
> know if it doesn't make sense.
>
> * I agree that we should encourage as much code as possible to use the
> asynchronous model. Per above, if there is a mechanism for async task
> producers to coexist alongside code that manually manages the
> execution order of tasks generated by its task graph (thinking of
> query engine code here a la Quickstep), then that might be good.
>
> Lots to do here but excited to see things evolve here and see the
> project grow faster and more scalable on systems with a lot of cores
> that do a lot of mixed IO/CPU work!
>
> - Wes
>
> On Tue, Feb 2, 2021 at 9:02 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > This is a follow up to a discussion from last September [3]. I've
> > been investigating Arrow's use of threading and I/O and I believe
> > there are some improvements that could be made. Arrow is currently
> > supporting two threading options (single thread and "per-core" thread
> > pool). Both of these approaches are hindered if blocking I/O is
> > performed on a CPU worker thread.
> >
> > It is somewhat alleviated by using background threads for I/O (in the
> > readahead iterator) but this implementation is not complete and does
> > not allow for nested parallelism. I would like to convert Arrow's I/O
> > operations to an asynchronous model (expanding on the existing futures
> > API). I have already converted the CSV reader in this fashion [2] as
> > a proof of concept.
> >
> > I have written a more detailed proposal here [1]. Please feel free to
> > suggest improvements or alternate approaches. Also, please let me
> > know if I missed any goals or considerations I should keep in mind.
> >
> > Also, hello, this email is a bit of an introduction. I have
> > previously made one or two small comments/changes but I am hoping to
> > be more involved going forwards. I've mostly worked on proprietary
> > test and measurement software but have recently joined Ursa Computing
> > which will allow me more time to work on Arrow.
> >
> > Thanks,
> >
> > Weston Pace
> >
> > [1] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
> > [2] https://github.com/apache/arrow/pull/9095
> > [3] https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3CCAJPUwMDmU3rFt6Upyis%3DyXB%3DECkmrjdncgR9xj%3DDFapJt9FfUg%40mail.gmail.com%3E