Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

Weston Pace Tue, 15 Sep 2020 17:54:47 -0700

Yes.  Thank you.  I am in agreement with you and futures/callbacks are
one such "richer programming model for
hierarchical work scheduling".


A scan task with a naive approach is:

    workers = partition_files_list(files_list)
    for worker in workers:
        start_thread(worker)
    for worker in workers:
        join_thread(worker)
    return aggregate_results()

You have N+1 threads because you have N worker threads and 1 scan
thread.  There is the potential for deadlock if your thread pool only
has one remaining spot and it is given to the scan thread.

On the other hand, with a futures based approach you have:

futures = partition_files_list(files_list)
return when_all(futures).do(aggregate_results)

There are only N threads.  The scan thread goes away.  In fact, if all
of your underlying OS/FS libraries are non-blocking then you can
completely eliminate threads in the waiting state and an entire
category of deadlocks are no longer a possibility.

-Weston

On Tue, Sep 15, 2020 at 1:21 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Weston,
>
> We've discussed some of these problems in the past -- I was
> enumerating some of these issues to highlight the problems that are
> resulting from an absence of a richer programming model for
> hierarchical work scheduling. Parallel tasks originating in each
> workload are submitted to a global thread pool where they are
> commingled with the tasks coming from other workloads.
>
> As an example of how this can go wrong, suppose we have a static
> thread pool with 4 executors. If we submit 4 long-running tasks to the
> pool, and then each of these tasks spawn additional tasks that go into
> the thread pool, a deadlock can occur, because the thread pool thinks
> that it's executing tasks when in fact those tasks are waiting on
> their dependent tasks to complete.
>
> A similar resource underutilization occurs when we do
> pool->Submit(ReadFile), where ReadFile needs to do some IO -- from the
> thread pool's perspective, the task is "working" even though it may
> wait for one or more IO calls to complete.
>
> In the Datasets API in C++ we have both of these problems: file scan
> tasks are being pushed onto the global thread pool, and so to prevent
> deadlocks multithreaded file parsing has been disabled. Additionally,
> the scan tasks do IO, resulting in suboptimal performance (the
> problems caused by this will be especially exacerbated when running
> against slower filesystems like Amazon S3)
>
> Hopefully the issues are more clear.
>
> Thanks
> Wes
>
> On Tue, Sep 15, 2020 at 2:57 PM Weston Pace <weston.p...@gmail.com> wrote:
> >
> > It sounds like you are describing two problems.
> >
> > 1) Idleness - Tasks are holding threads in the thread pool while they
> > wait for IO or some long running non-CPU task to complete.  These
> > threads are often in a "wait" state or something similar.
> > 2) Fairness - The ordering of tasks is causing short tasks that could
> > be completed quickly from being stuck behind longer term tasks.
> > Fairness can be an issue even if all tasks are always in the active
> > state consuming CPU time.
> >
> > Are both of these issues a problem?  Are you looking to address both of 
> > them?
> >
> > I doubt it's much help as it is probably a more substantial change
> > than what you were looking for but the popular solution to #1 these
> > days seems to be moving toward non blocking IO with
> > promises/callbacks/async.  That way threads are never in the waiting
> > state (unless sitting idle in the pool).
> >
> > -Weston
> >
> > On Tue, Sep 15, 2020 at 7:00 AM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > In light of ARROW-9924, I wanted to rekindle the discussion about our
> > > approach to multithreading (especially the _programming model_) in
> > > C++. We had some discussions about this about 6 months ago and there
> > > were more discussions as I recall in summer 2019.
> > >
> > > Realistically, we are going to be consistently dealing with
> > > independent concurrent in-process workloads that each respectively can
> > > go faster by multithreading. These could be things like:
> > >
> > > * Reading file formats (CSV, Parquet, etc.) that benefit from
> > > multithreaded parsing/decoding
> > > * Reading one or more files in parallel using the Datasets API
> > > * Executing any number of multithreaded analytical workloads
> > >
> > > One obvious issue with our thread scheduling is the FIFO nature of the
> > > global thread pool. If a new independent multithreaded workload shows
> > > up, it has to wait for other workloads to complete before the new work
> > > will be scheduled. Think about a Flight server serving queries to
> > > users -- is it fair for one query to "hog" the thread pool and force
> > > other requests to wait until they can get access to some CPU
> > > resources? You could imagine a workload that spawns 10 minutes worth
> > > of CPU work, where a new workload has to wait for all of that work to
> > > complete before having any tasks scheduled for execution.
> > >
> > > The approach that's been taken in the Datasets API to avoid problems
> > > with nested parallelism (file-specific operations spawning multiple
> > > tasks onto the global thread pool) is simply to disable multithreading
> > > at the level of a single file. This is clearly suboptimal.
> > >
> > > We have additional problems in that some file-loading related tasks do
> > > a mixture of CPU work and IO work, and once a thread has been
> > > dispatched to execute one of these tasks, when IO takes place, a CPU
> > > core may sit underutilized while the IO is waiting.
> > >
> > > There's more aspects we can discuss, but in general I think we need to
> > > come up with a programming model for building our C++ system
> > > components with the following requirements:
> > >
> > > * Deadlocks not possible by design
> > > * Any component can safely use "nested parallelism" without the
> > > programmer having to worry about deadlocks or one task "hogging" the
> > > thread pool. So in other words, if there's only a single
> > > multithreading-capable workload running, we "let it rip"
> > > * Resources can be reasonably fairly allocated amongst concurrent
> > > workloads (think: independent requests coming in through Flight, or
> > > scan tasks on different Parquet files in the Datasets API). Limit
> > > scenarios where a new workload is blocked altogether on the completion
> > > of other workloads
> > > * A well-defined programming pattern for tasks that do a mixture of
> > > CPU work and IO work that allows CPU cores to be used when a task is
> > > waiting on IO
> > >
> > > We can't be the only project that has these problems, so I'm
> > > interested to see what solutions have been successfully employed by
> > > others. For example, it strikes me as similar to concurrency issues
> > > inside an analytic database. How are they preventing concurrent
> > > workload starvation problems or handling CPU/IO task scheduling to
> > > avoid CPU underutilization?
> > >
> > > Choices of which threading libraries we might use to implement a
> > > viable solution (e.g. TBB) seem secondary to the programming model
> > > that we use to implement our components.
> > >
> > > Thanks,
> > > Wes

Re: [DISCUSS] Rethinking our approach to scheduling CPU and IO work in C++?

Reply via email to