Re: Threading Improvements Proposal

Micah Kornfield Mon, 15 Feb 2021 20:49:52 -0800

I took a pass through this; thank you for a good discussion of the
alternatives.  One thing that I don't quite understand about this proposal
is its scope.  Is the intention that most APIs will eventually work with
Futures instead of raw return values (i.e. returning a Table or Record
batch will never be a thing, but instead you get references to
Future<Table>)?
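
To make the question concrete, here is a minimal Python sketch of the two
API shapes in question.  Everything in it is hypothetical: `Table` is a
plain list standing in for an Arrow table, and `read_table` /
`read_table_async` are made-up names, not existing Arrow functions.  It
only illustrates that a blocking call can be a thin wrapper over a
future-returning one; it says nothing about what the proposal intends.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-in for an Arrow Table; not a real Arrow type.
Table = list

# Assumed background pool for the sketch; the size is arbitrary.
_pool = ThreadPoolExecutor(max_workers=2)


def _load(rows: int) -> Table:
    # Placeholder for the blocking read-and-parse work.
    return [{"id": i} for i in range(rows)]


def read_table_async(rows: int) -> "Future[Table]":
    """Future-returning variant: the caller gets a handle immediately."""
    return _pool.submit(_load, rows)


def read_table(rows: int) -> Table:
    """Blocking variant: a thin wrapper that waits on the future."""
    return read_table_async(rows).result()
```

Under this shape a caller that wants a raw Table still gets one from
`read_table`, while async-aware callers use `read_table_async` directly.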


Thanks,
Micah

On Mon, Feb 15, 2021 at 2:15 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Weston,
>
> Thanks for putting this comprehensive and informative document together.
>
> There are several layers of problems to consider, just thinking out loud:
>
> * I hypothesize that the bottom of the stack is a thread pool with a
> queue-per-thread that implements work stealing. Some code paths might
> use this low-level task API directly, for example a workload putting
> all of its tasks into one particular queue and letting the other
> threads take work if they are idle.
>
> * I've brought this up in the past, but if we are comfortable with
> more threads than CPU cores, we may allow for the base level thread
> pool to be expanded dynamically. The tradeoff here is coarse
> granularity context switching between tasks only at time of task
> completion vs. the OS context-switching mid-task between threads. For
> example, if there is a code path which wishes to guarantee that a
> thread is being put to work right away to execute its tasks, even if
> all of the other queues are full of other tasks, then this could
> partially address the task prioritization problem discussed in the
> document. If there is a notion of a "task producer" or a "workload"
> and the number of task producers exceeds the size of the thread
> pool, then an additional thread plus a dedicated task queue for that
> thread could be created to handle tasks submitted by the producer. Maybe this
> is a bad idea (I'm not an expert in this domain after all), let me
> know if it doesn't make sense.
>
> * I agree that we should encourage as much code as possible to use the
> asynchronous model. Per above, if there is a mechanism for async task
> producers to coexist alongside code that manually manages the
> execution order of tasks generated by its task graph (thinking of
> query engine code here a la Quickstep), then that might be good.
>
> Lots to do here, but I'm excited to see things evolve and to see the
> project become faster and more scalable on systems with a lot of cores
> that do a lot of mixed IO/CPU work!
>
> - Wes
>
> On Tue, Feb 2, 2021 at 9:02 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > This is a follow up to a discussion from last September [3].  I've
> > been investigating Arrow's use of threading and I/O and I believe
> > there are some improvements that could be made.  Arrow currently
> > supports two threading options (single thread and "per-core" thread
> > pool).  Both of these approaches are hindered if blocking I/O is
> > performed on a CPU worker thread.
> >
> > This is somewhat alleviated by using background threads for I/O (in the
> > readahead iterator) but this implementation is not complete and does
> > not allow for nested parallelism.  I would like to convert Arrow's I/O
> > operations to an asynchronous model (expanding on the existing futures
> > API).  I have already converted the CSV reader in this fashion [2] as
> > a proof of concept.
> >
> > I have written a more detailed proposal here [1].  Please feel free to
> > suggest improvements or alternate approaches.  Also, please let me
> > know if I missed any goals or considerations I should keep in mind.
> >
> > Also, hello, this email is a bit of an introduction.  I have
> > previously made one or two small comments/changes but I am hoping to
> > be more involved going forwards.  I've mostly worked on proprietary
> > test and measurement software but have recently joined Ursa Computing
> > which will allow me more time to work on Arrow.
> >
> > Thanks,
> >
> > Weston Pace
> >
> > [1] https://docs.google.com/document/d/1tO2WwYL-G2cB_MCPqYguKjKkRT7mZ8C2Gc9ONvspfgo/edit?usp=sharing
> > [2] https://github.com/apache/arrow/pull/9095
> > [3] https://mail-archives.apache.org/mod_mbox/arrow-dev/202009.mbox/%3CCAJPUwMDmU3rFt6Upyis%3DyXB%3DECkmrjdncgR9xj%3DDFapJt9FfUg%40mail.gmail.com%3E
>
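
As an illustration of the first bullet in Wes's reply (a thread pool with a
queue per thread and work stealing), here is a toy Python sketch.  It is a
model built on assumptions, not Arrow's thread pool: the class name
`WorkStealingPool` and its methods are hypothetical, a single coarse lock
replaces the lock-free deques a real implementation would likely use, and
workers exit once every queue is empty, so tasks must not submit further
tasks.

```python
import threading
from collections import deque


class WorkStealingPool:
    """Toy pool: one deque per worker; idle workers steal from the others."""

    def __init__(self, num_workers):
        self.queues = [deque() for _ in range(num_workers)]
        self.lock = threading.Lock()  # one coarse lock keeps the sketch simple
        self.executed_by = {}         # task name -> index of the worker that ran it

    def submit(self, queue_index, name, fn):
        """Place a task on one particular worker's queue."""
        with self.lock:
            self.queues[queue_index].append((name, fn))

    def _next_task(self, me):
        with self.lock:
            if self.queues[me]:
                return self.queues[me].popleft()  # own queue: take from the front
            for other, queue in enumerate(self.queues):
                if other != me and queue:
                    return queue.pop()            # steal from the back of another queue
        return None

    def _worker(self, me):
        while True:
            task = self._next_task(me)
            if task is None:
                return  # every queue is empty; tasks here never spawn more work
            name, fn = task
            fn()
            with self.lock:
                self.executed_by[name] = me

    def run(self):
        """Start one thread per queue and wait until all tasks have run."""
        threads = [threading.Thread(target=self._worker, args=(i,))
                   for i in range(len(self.queues))]
        for t in threads:
            t.start()
        for t in threads:
            t.join()


# A workload that puts all of its tasks on queue 0; the other workers
# only get work by stealing it.
pool = WorkStealingPool(4)
results = []
for i in range(20):
    pool.submit(0, f"task-{i}", lambda i=i: results.append(i * i))
pool.run()
```

Which worker ends up running each task depends on scheduling, so
`pool.executed_by` will differ from run to run; only the set of completed
tasks is deterministic.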

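The concern in the original message, that blocking I/O on a CPU worker
thread hinders both threading options, can be illustrated with a small
Python sketch that keeps reads on a separate pool and hands only the
parsing to the CPU pool.  This uses Python's asyncio and
concurrent.futures rather than Arrow's futures API; `fetch_block` and
`parse_block` are hypothetical stand-ins, and the pool sizes are arbitrary.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# Assumed sizes for the sketch: the CPU pool would normally match the core
# count, while the I/O pool can be larger because its threads mostly wait.
cpu_pool = ThreadPoolExecutor(max_workers=2)
io_pool = ThreadPoolExecutor(max_workers=8)


def fetch_block(i):
    time.sleep(0.01)        # stand-in for a blocking read
    return bytes([i]) * 4   # four bytes, each with value i


def parse_block(data):
    return sum(data)        # stand-in for CPU-bound decoding


async def read_and_parse(i):
    loop = asyncio.get_running_loop()
    # The blocking read occupies an I/O thread, never a CPU worker.
    data = await loop.run_in_executor(io_pool, fetch_block, i)
    # Only the parsing step is scheduled on the CPU pool.
    return await loop.run_in_executor(cpu_pool, parse_block, data)


async def main():
    # gather() returns results in the order the coroutines were passed in.
    return await asyncio.gather(*(read_and_parse(i) for i in range(5)))


results = asyncio.run(main())
```

With this split, a slow read delays only its own continuation instead of
stalling a CPU worker that other tasks are waiting on.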