Jed,

> From: Jed Brown [mailto:j...@jedbrown.org]
> Sent: Friday, May 3, 2019 12:41

> You linked to a NumPy discussion
> (https://github.com/numpy/numpy/issues/11826) that is encountering the same
> issues, but proposing solutions based on the global environment.
> That is perhaps acceptable for typical Python callers due to the GIL, but C++
> callers may be using threads themselves.  A typical example:
> 
> App:
>   calls libB sequentially:
>     calls Arrow sequentially (wants to use threads)
>   calls libC sequentially:
>     omp parallel (creates threads somehow):
>       calls Arrow from threads (Arrow should not create more)
>   omp parallel:
>     calls libD from threads:
>       calls Arrow (Arrow should not create more)

That's not a correct assumption about Python. The GIL synchronizes the Python 
interpreter's own state, i.e. its C-API data structures. When Python calls a C 
extension like NumPy, nothing restricts the extension from doing its own 
internal parallelism (which is what OpenBLAS and MKL do). Moreover, NumPy and 
other libraries usually release the GIL before entering a long compute region, 
which allows a concurrent thread to start another compute region in parallel. 
So there is not much difference between Python and C++ in what you can get in 
terms of nested parallelism (the difference is in overheads and scalability). 
If there is app-level parallelism (as with your libD) and/or other nesting (as 
in your libC), which can be implemented e.g. with Dask, NumPy will still create 
a parallel region inside each call from the outermost thread or process (Python 
and Dask support both). And this is exactly the problem I'm solving, and the 
reason I started this discussion, so thanks for sharing my concerns. For more 
information, please refer to my SciPy 2017 talk and the later paper where we 
introduced three approaches to the problem (TBB, settings orchestration, an 
OpenMP extension): 
http://conference.scipy.org/proceedings/scipy2018/pdfs/anton_malakhov.pdf
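To make the GIL point concrete, here is a minimal sketch using only the 
standard library. hashlib stands in for NumPy: CPython's hashlib releases the 
GIL while hashing large buffers, so plain Python threads then overlap their 
C-level work on multiple cores instead of serializing on the interpreter lock.

```python
import hashlib
import threading

# Four large buffers to hash; hashlib releases the GIL for buffers
# bigger than a couple of KiB, like NumPy does around long compute regions.
DATA = [bytes([i]) * 1_000_000 for i in range(4)]
results = [None] * len(DATA)

def digest(i):
    # The C hashing loop runs with the GIL released, so these threads
    # can execute concurrently rather than taking turns on the GIL.
    results[i] = hashlib.sha256(DATA[i]).hexdigest()

threads = [threading.Thread(target=digest, args=(i,)) for i in range(len(DATA))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The same structure is what Dask-style app-level parallelism produces: several 
outer threads, each entering a GIL-free compute region at once.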

> Arrow doesn't need to know the difference between the libC and libD cases,
> but
> it may make a difference to the implementation of those libraries.  In both of
> these cases, the user may desire that Arrow create tasks for load balancing
> reasons (but no new threads) so long as they can run on the specified thread
> team.

Exactly, tasks are one way to solve it. This is what TBB does as a good first 
approximation of the solution: a global task scheduler, no mandatory 
threads/parallel regions, and wide adoption in numeric libraries (MKL, DAAL, 
Numba, soon PyTorch and others). And that's the first step I'm proposing.
Though we know from past experience that it is still not sufficient, because 
NUMA effects are not accounted for: tasks are distributed randomly. That's 
where other threading-layer implementations can work better in some cases, and 
where a more elaborate NUMA-aware TBB-based implementation is needed.
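A hypothetical sketch of the task-based idea, in Python for brevity (TBB's 
global scheduler plays this role in C++): one process-wide pool is shared by 
the app and any libraries, and a library call that finds itself already on a 
pool thread runs inline instead of spawning more parallelism. All names here 
(`library_map`, the pool size) are illustrative, not a real Arrow or TBB API.

```python
import concurrent.futures
import threading

# One shared, process-wide pool: the analogue of TBB's global task scheduler.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)
_IN_POOL = threading.local()

def _worker(fn, arg):
    _IN_POOL.flag = True        # mark: this code is running on a pool thread
    try:
        return fn(arg)
    finally:
        _IN_POOL.flag = False

def library_map(fn, items):
    """Library entry point: submits tasks when called from the outermost
    level, but runs inline (no new threads) when already inside a parallel
    region -- the libC/libD cases from the example above."""
    if getattr(_IN_POOL, "flag", False):
        return [fn(x) for x in items]            # nested call: stay sequential
    futures = [_POOL.submit(_worker, fn, x) for x in items]
    return [f.result() for f in futures]

def square(x):
    return x * x

# Outermost call: tasks go to the shared pool.
outer = library_map(square, range(5))

# App-level parallel region calling the library from worker threads:
# the inner calls detect the nesting and do not create more parallelism.
nested = library_map(lambda x: sum(library_map(square, range(x + 1))), range(4))
```

This only shows the no-oversubscription property; it does not address the NUMA 
placement problem mentioned above, where random task distribution hurts.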

> Global solutions like this one (linked by Antoine)
> 
>   https://github.com/apache/arrow/blob/master/cpp/src/arrow/util/thread-
> pool.cc#L268
> 
> imply that threading mode is global and set via an environment variable, 
> neither
> of which are true in cases such as the above (and many simpler cases).
Right. I wrote about the problems with this implementation in the proposal. 
First of all, we should not mimic OpenMP's environment variable for something 
completely unrelated: it causes confusion and is hard to control in more 
complex cases.

Regards,
// Anton
