I would caution against committing to the MKL/BLAS model, in which the
library creates threads internally. It's a disaster for managing
oversubscription and affinity among groups of threads and/or
multiple processes (e.g., MPI). For example, a composable OpenMP
technique is for the caller who wants threaded execution to create the
thread team and hand it to the library explicitly:
    #pragma omp parallel
    #pragma omp single
    {
        library_calls();
    }
The library is then free to use constructs like omp taskgroup/taskloop
as granularity warrants; it will never utilize threads that the
application didn't explicitly give it.
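
For the library side, here is a minimal sketch of what that could look
like (the names, sizes, and grain size are illustrative placeholders,
not a real API):

    #include <stdio.h>

    /* Placeholder output buffer standing in for real library state. */
    static double result[1024];

    void library_calls(void)
    {
        /* Runs on the single thread that entered the construct above;
           the taskloop fans iterations out to the team the caller
           created and never touches threads the application didn't
           provide. */
        #pragma omp taskloop grainsize(64)
        for (int i = 0; i < 1024; ++i)
            result[i] = i * 0.5;
        printf("last element: %f\n", result[1023]);
    }
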
I know Intel has put work into making the TBB and OpenMP thread pools
interoperate, but this historically hasn't worked cleanly on other
platforms and thus may be a liability for Arrow.
"Malakhov, Anton" <[email protected]> writes:
> Thanks Wes!
>
> Sounds like a good way to go! We'll create a demo, as you suggested,
> implementing a parallel execution model for a simple analytics pipeline that
> reads and processes the files. My only concern is about adding more
> pipeline-breaker nodes and compute-intensive operations to this demo, because
> min/max are effectively no-ops fused into the I/O scan node. What do you
> think about adding group-by to this picture, effectively implementing the NY
> taxi and/or mortgage benchmarks? Ideally, I'd like to go even further and add
> scikit-learn-like processing of that data in order to demonstrate the
> co-existence side of the story. What do you think?
> So, the idea of the prototype will be to validate the parallel execution
> model as the first step. After that, it'll help to shape the API for both the
> execution nodes and the threading backend. Does that sound right to you?
>
> P.S. I can well understand your hesitation about using TBB directly and as a
> non-optional dependency, which is why I'm suggesting the threading-layers
> approach here. Let me clarify: using TBB and nested parallelism is not a goal
> in itself. The goal is to build the components of an efficient execution
> model that coexist well with each other and with all the other components of
> an application external to Arrow. However, without a rich, composable, and
> mature parallel toolkit, that goal is hard to achieve and to stay focused on.
> Thus, I wanted to check with the community whether this is an acceptable
> direction at all, and what the roadmap would be.
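>
> A minimal sketch of what such a threading layer could look like (all names
> here are hypothetical, not an actual Arrow proposal):
>
>     // Backend-neutral interface: Arrow code would program against
>     // this, while a TBB-based, OpenMP-based, or plain std::thread
>     // layer implements it underneath.
>     #include <functional>
>
>     class ThreadingLayer {
>      public:
>       virtual ~ThreadingLayer() = default;
>       // Run fn(i) for each i in [0, n), possibly in parallel.
>       virtual void ParallelFor(int n, const std::function<void(int)>& fn) = 0;
>       // Schedule an independent fire-and-forget task.
>       virtual void Spawn(std::function<void()> task) = 0;
>     };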
>
> Thanks,
> // Anton
>
>
> -----Original Message-----
> From: Wes McKinney [mailto:[email protected]]
> Sent: Thursday, May 2, 2019 13:52
> To: [email protected]
> Subject: Re: [DISCUSS][C++][Proposal] Threading engine for Arrow
>
> hi Anton,
>
> Thank you for bringing your expertise to the project -- this is a very useful
> discussion to have.
>
> Part of the reason the threading capabilities in the project are not further
> developed is that there is not yet much that needs to be parallelized. It
> would be like designing a supercharger when you don't have a car yet.
> That being said, it is worthwhile to plan ahead so we aren't trying to
> retrofit significant pieces of software to be able to take advantage of a
> more advanced task scheduler.
>
> From my perspective, we have a few key practical areas of consideration:
>
> * Computational tasks that may offer nested parallelism (e.g. an Aggregation
> or Projection task may be able to execute in multiple
> threads)
> * IO operations performed from within tasks that appear to be computational
> in nature (example: in the course of reading a Parquet file, both computation
> -- decoding, decompression -- and IO -- local or remote filesystem operations
> -- must be performed). The status quo right now is that IO performed inside a
> thread-pool task does not release any resources to other tasks, as sketched
> below.
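>
> To make the second point concrete, here is a minimal sketch of that status
> quo (the names are illustrative, not real Arrow APIs):
>
>     #include <fstream>
>     #include <iterator>
>     #include <string>
>
>     // A "computational" task that also performs IO, as submitted to
>     // the CPU thread pool today.
>     std::string ReadAndDecode(const std::string& path) {
>       std::ifstream in(path, std::ios::binary);
>       // Blocking read: while the filesystem is busy, the pool worker
>       // running this task sits idle yet cannot pick up other tasks.
>       std::string bytes((std::istreambuf_iterator<char>(in)),
>                         std::istreambuf_iterator<char>());
>       // ... CPU-bound decoding/decompression would follow here ...
>       return bytes;
>     }
>
>     int main() { return ReadAndDecode("data.parquet").empty() ? 1 : 0; }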
>
> I believe that we should design and develop a sane programming model / API
> for implementing our software in the presence of these challenges.
> If the backend / implementation of this API uses TBB and that makes things
> more efficient than other approaches, then that sounds great to me. I would
> be hesitant to use TBB APIs directly in Arrow application code unless it can
> be clearly demonstrated that it is a superior option to the alternatives.
>
> It seems useful to validate the implementation approach by starting with some
> practical problems. Suppose, for the sake of argument, you want to read 10
> Parquet files (constituting a single logical dataset) as fast as possible and
> perform some simple analytics on them -- let's take something very simple
> like computing the maximum and minimum values of each column in the dataset.
> This exercise features both of the problems listed above:
>
> * Reading a single Parquet file can be parallelized (by columns -- since
> columns can be decoded in parallel) on the global thread pool, so reading
> multiple files in parallel would cause nested parallelism (see the sketch
> after this list)
> * Within the context of reading a single Parquet file column, IO calls are
> performed. CPU threads sit idle while this IO is taking place, particularly
> if the file system has high latency (e.g. HDFS)
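>
> A minimal sketch of the shape of that nesting (the reader functions are
> stand-ins, not real Arrow or Parquet APIs):
>
>     #include <future>
>     #include <string>
>     #include <vector>
>
>     // Stand-in for decoding one column of one file (CPU-bound).
>     double DecodeColumn(const std::string& path, int col) {
>       return static_cast<double>(col);  // placeholder result
>     }
>
>     // Inner level: the columns of one file are decoded in parallel.
>     std::vector<double> ReadFile(const std::string& path, int ncols) {
>       std::vector<std::future<double>> cols;
>       for (int c = 0; c < ncols; ++c)
>         cols.push_back(std::async(std::launch::async, DecodeColumn, path, c));
>       std::vector<double> out;
>       for (auto& f : cols) out.push_back(f.get());
>       return out;
>     }
>
>     int main() {
>       // Outer level: the files themselves are read in parallel; each
>       // spawns its own column-level parallelism, which is the nesting
>       // that a fixed-size pool handles poorly (idle or oversubscribed).
>       std::vector<std::future<std::vector<double>>> files;
>       for (const char* p : {"f1.parquet", "f2.parquet"})
>         files.push_back(std::async(std::launch::async, ReadFile,
>                                    std::string(p), 8));
>       for (auto& f : files) f.get();
>       return 0;
>     }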
>
> What do you think about -- as a way of moving this project forward --
> developing a prototype threading backend and developer API (for people like
> me to use to develop libraries like the Parquet library) that addresses these
> issues? I think it could be difficult to build consensus around a threading
> backend developed in the abstract.
>
> Thanks
> Wes
>
> On Tue, Apr 30, 2019 at 9:28 PM Malakhov, Anton <[email protected]>
> wrote:
>>
>> Hi dear Arrow developers, Antoine,
>>
>> I'd like to kick off the discussion of the threading engine that Arrow could
>> use underneath to implement multicore parallelism for execution nodes,
>> kernels, and/or any other functions that can be optimized this way.
>> I've documented some ideas on Arrow's Confluence Wiki:
>> https://cwiki.apache.org/confluence/display/ARROW/Parallel+Execution+Engine
>> The bottom line is that while Arrow is moving in the right direction by
>> introducing a shared thread pool, there are some questions and concerns
>> about the current implementation and how it is supposed to coexist with
>> other threaded libraries ("threading composability") while providing
>> efficient, nestable, NUMA- and cache-aware data and data-flow parallelism.
>> I suggest introducing threading layers, as in other libraries such as MKL
>> and Numba, starting with a TBB-based layer. Or maybe even using TBB
>> directly. In short, the arguments for it are the following:
>>
>> 1. Designed for composability from day zero. It avoids mandatory
>> parallelism, provides work stealing and FIFO scheduling, and is compatible
>> with parallel depth-first scheduling (research into better composability).
>>
>> 2. TBB Flow Graph. It fits nicely into the data-flow and execution-node
>> model of SQL databases. Besides the basic nodes needed to implement an
>> execution engine, it also provides a foundation for heterogeneous and
>> distributed computing (async_node, opencl_node, distributed_node); a toy
>> sketch follows after this list.
>>
>> 3. Arrow's ThreadPool, TaskGroup, and ParallelFor have direct
>> equivalents in TBB (task_arena, task_group, and parallel_for) with a mature
>> and performant implementation that addresses many if not all of the XXX/TODO
>> notes in the comments, such as exceptions, singletons and initialization
>> time, and lock-free behavior.
>>
>> 4. Concurrent hash tables, queues, vectors, and other concurrent
>> containers. Hash tables are required for implementing parallel versions of
>> join, group-by, unique, and dictionary operations. There is a contribution
>> that integrates libcuckoo under the TBB interface.
>>
>> 5. TBB scalable malloc and memory pools, which can use any
>> user-provided memory chunk for scalable allocation. Arrow uses jemalloc,
>> which is slower in some cases than tbbmalloc or tcmalloc.
>>
>> 6. OpenMP is good for NUMA with a static schedule; however, it has no
>> good answer for dynamic tasks and graphs. TBB provides the tools for
>> implementing NUMA support (task_arena, task_scheduler_observer, task
>> affinity & priorities) and is committed to improving NUMA support for its
>> other customers in 2019.
>>
>> 7. TBB is licensed under Apache 2.0, has a conda-forge feedstock,
>> supports CMake, is adopted for CPU scheduling by other industry players, and
>> has multiple ports to other OSes and CPU architectures.
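>>
>> As a toy illustration of the Flow Graph point (2) above, a two-node
>> scan/aggregate pipeline (the node bodies are placeholders, not Arrow
>> kernels):
>>
>>     #include <tbb/flow_graph.h>
>>     #include <cstdio>
>>
>>     int main() {
>>       using namespace tbb::flow;
>>       graph g;
>>       // "Scan" node: concurrent, one invocation per input batch
>>       // (stands in for an IO scan plus decode).
>>       function_node<int, double> scan(g, unlimited, [](int batch) {
>>         return batch * 1.5;  // placeholder decode work
>>       });
>>       // "Aggregate" node: serial, so it can fold into shared state
>>       // safely (stands in for a pipeline breaker such as min/max).
>>       double total = 0.0;
>>       function_node<double, continue_msg> agg(
>>           g, serial, [&](double v) { total += v; return continue_msg(); });
>>       make_edge(scan, agg);
>>       for (int b = 0; b < 8; ++b) scan.try_put(b);
>>       g.wait_for_all();
>>       std::printf("total = %f\n", total);
>>       return 0;
>>     }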
>>
>> Full disclosure: I was a TBB developer before its 1.0 version, responsible
>> for multiple core components such as hash tables, adaptive partitioning, and
>> the interfaces of memory pools and task_arena, all of which are very
>> relevant to Arrow. I have a background in scalability and NUMA-aware
>> performance optimization, like the work we did for the OpenCL runtime for
>> CPU (TBB-based). I was also behind optimizations for the Intel Distribution
>> for Python and its threading composability
>> story<https://software.intel.com/en-us/blogs/2016/04/04/unleash-parallel-performance-of-python-programs>.
>> Thus, I sincerely hope to reuse all of this in order to deliver the best
>> performance for Arrow.
>>
>>
>> Best regards,
>> Anton Malakhov<http://www.linkedin.com/in/antonmalakhov>
>> IAGS Scripting Analyzers & Tools
>>
>> O: +1-512-3620-512
>> 1300 S. MoPac Expy
>> Office: AN4-C1-D4
>> Austin, TX 78746
>> Intel Corporation | www.intel.com
>>