On Mon, Jul 15, 2019 at 11:38 AM Antoine Pitrou <anto...@python.org> wrote:
>
>
> Hi Anton,
>
> Le 12/07/2019 à 23:21, Malakhov, Anton a écrit :
> >
> > The result is that all these execution nodes scale well enough and run
> > under 100 milliseconds on my 2 x Xeon E5-2650 v4 @ 2.20GHz with 128GB RAM,
> > while the CSV reader takes several seconds to complete even when reading
> > from an in-memory file (8GB), so it is not yet IO-bound even with good
> > consumer-grade SSDs. My focus recently has therefore been on optimizing
> > the CSV parser, where I have achieved a 50% improvement by replacing all
> > the small object allocations with the TBB scalable allocator and by using
> > a TBB-based memory pool, backed by pre-allocated huge (2MB) memory pages
> > (echo 30000 > /proc/sys/vm/nr_hugepages), instead of the default one. I
> > have not yet found a way to do both of these tricks with jemalloc, so
> > please try to meet or beat my times without the TBB allocator.
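
For anyone who wants to experiment with the same idea, here is a minimal
sketch of routing Arrow allocations through the TBB scalable allocator.
This is an illustration, not the code from Anton's branch: the
TbbMemoryPool name is made up, and the exact set of MemoryPool virtuals
to override depends on the Arrow version in use.

    #include <atomic>
    #include <cstdint>

    #include <tbb/scalable_allocator.h>

    #include "arrow/memory_pool.h"
    #include "arrow/status.h"

    // Sketch: an arrow::MemoryPool backed by the TBB scalable allocator.
    // 64-byte alignment matches Arrow's usual buffer alignment.
    class TbbMemoryPool : public arrow::MemoryPool {
     public:
      arrow::Status Allocate(int64_t size, uint8_t** out) override {
        *out = static_cast<uint8_t*>(scalable_aligned_malloc(size, kAlignment));
        if (*out == nullptr && size > 0) {
          return arrow::Status::OutOfMemory("TBB allocation failed");
        }
        bytes_allocated_ += size;
        return arrow::Status::OK();
      }

      arrow::Status Reallocate(int64_t old_size, int64_t new_size,
                               uint8_t** ptr) override {
        uint8_t* out = static_cast<uint8_t*>(
            scalable_aligned_realloc(*ptr, new_size, kAlignment));
        if (out == nullptr && new_size > 0) {
          return arrow::Status::OutOfMemory("TBB reallocation failed");
        }
        *ptr = out;
        bytes_allocated_ += new_size - old_size;
        return arrow::Status::OK();
      }

      void Free(uint8_t* buffer, int64_t size) override {
        scalable_aligned_free(buffer);
        bytes_allocated_ -= size;
      }

      int64_t bytes_allocated() const override { return bytes_allocated_.load(); }

     private:
      static constexpr std::size_t kAlignment = 64;
      std::atomic<int64_t> bytes_allocated_{0};
    };
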
>
> That sounds interesting, though optimizing memory allocations is
> probably not the most enticing use case for TBB.  Memory allocators can
> fare differently on different workloads, and just because TBB is better
> in some situation doesn't mean it'll always be better.  Similarly,
> jemalloc is not the best for every use case.
>
> Note that, as Arrow is a library, we don't want to impose a memory
> allocator on the user, which is why jemalloc is merely optional.
>
> (one reason we added the jemalloc option is that jemalloc has
> non-standard APIs for aligned allocation and reallocation, btw)
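
For reference, the non-standard calls in question are jemalloc's *allocx
family, which lets you request alignment at both allocation and
reallocation time, something plain malloc/realloc cannot express. A small
sketch (error handling omitted; link with -ljemalloc):

    #include <jemalloc/jemalloc.h>

    int main() {
      // MALLOCX_ALIGN carries the requested alignment through allocation,
      // reallocation, and deallocation.
      void* p = mallocx(4096, MALLOCX_ALIGN(64));  // 64-byte-aligned alloc
      p = rallocx(p, 8192, MALLOCX_ALIGN(64));     // grow, preserving alignment
      dallocx(p, MALLOCX_ALIGN(64));               // matching free
      return 0;
    }
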
>
> > I also see other hotspots and opportunities for optimization. For
> > example, memset is heavily used while resizing buffers (why, and why so
> > much?), and the column builder thrashes caches by not using streaming
> > stores.
>
> Could you open JIRA issues with your investigations?  I'd be interested
> to know what the actual execution bottlenecks are in the CSV reader.
>
> > I used TBB directly to make the execution nodes parallel; however, I have
> > also implemented a simple TBB-based ThreadPool and TaskGroup, as you can
> > see in this PR: https://github.com/aregm/arrow/pull/6
> > I see a consistent improvement (up to 1200%!) on the BM_ThreadedTaskGroup
> > and BM_ThreadPoolSpawn microbenchmarks; however, when applying it to the
> > real-world task of CSV reading, I don't see any improvements yet.
>
> One thing you could try is shrinking the block size in the CSV reader
> and seeing when performance starts to fall significantly.  With the current
> TaskGroup overhead, small block sizes will suffer a lot.  I expect TBB
> to fare better.
>
> (and / or try a CSV file with a hundred columns or so)
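
A quick way to run that experiment, sketched against csv::ReadOptions (the
SweepBlockSizes wrapper is hypothetical, and the reader construction is
elided since the exact factory signature varies across Arrow versions):

    #include <cstdint>

    #include "arrow/csv/options.h"

    // Sketch: sweep the reader's block size to see where per-task
    // overhead starts to dominate.
    void SweepBlockSizes() {
      auto read_options = arrow::csv::ReadOptions::Defaults();
      read_options.use_threads = true;
      for (int32_t block_size : {1 << 16, 1 << 18, 1 << 20, 1 << 22}) {
        read_options.block_size = block_size;
        // ... build a csv::TableReader with read_options and time Read() ...
      }
    }
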
>
> > Or even worse, while reading the file, TBB wastes some cycles spinning.
>
> That doesn't sound good (but is a separate issue from the main TaskGroup
> usage, IMHO).  Perhaps TBB doesn't provide a facility for background IO
> threads?
>

I think we need to spend some design effort on a programming model /
API for these code paths that do a mix of IO and deserialization. This
is also a problem with Parquet files -- a CPU thread that is
deserializing a column will sit idle while it waits for IO. IMHO such
IO calls need to be able to signal to the concurrency manager that
another task can be started.

For example, suppose we had a thread pool with a limit of 8 concurrent
tasks. Now 4 of them perform IO calls. Hypothetically this should
happen:

* Thread pool increments a "soft limit" to allow 4 more tasks to
spawn, so at this point technically we have 12 active tasks
* When each IO call returns, the soft limit is decremented
* The soft limit can be constrained to be some multiple of the hard
limit. So if we have a hard limit of 8 CPU-bound threads, then we
might allow an additional 8 tasks to be spawned if a CPU-bound thread
indicates that it's waiting for IO

I think that any code in the codebase that does a mix of CPU and IO
should be retrofitted with some kind of object to allow code to signal
that it's about to wait for IO.
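
To sketch what such an object might look like (the ConcurrencyManager and
IoWaitScope names are hypothetical, and the actual task queueing and
spawning machinery is elided; the point is just the signaling protocol):

    #include <mutex>

    // Sketch of the signaling idea: the pool has a hard limit of CPU-bound
    // tasks, plus a soft limit that is temporarily raised while tasks wait
    // on IO, capped at a multiple (here 2x) of the hard limit.
    class ConcurrencyManager {
     public:
      explicit ConcurrencyManager(int hard_limit)
          : hard_limit_(hard_limit), soft_limit_(hard_limit) {}

      // Called just before a blocking IO call. Returns true if an extra
      // task slot was granted; a real pool would also notify its scheduler
      // here so a queued task can start running.
      bool EnterIoWait() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (soft_limit_ >= 2 * hard_limit_) return false;  // cap reached
        ++soft_limit_;
        return true;
      }

      // Called when the IO call returns: give the extra slot back.
      void ExitIoWait() {
        std::lock_guard<std::mutex> lock(mutex_);
        --soft_limit_;
      }

      int current_limit() {
        std::lock_guard<std::mutex> lock(mutex_);
        return soft_limit_;
      }

     private:
      const int hard_limit_;
      int soft_limit_;  // hard_limit_ <= soft_limit_ <= 2 * hard_limit_
      std::mutex mutex_;
    };

    // RAII guard so mixed CPU/IO code only has to mark the blocking region.
    class IoWaitScope {
     public:
      explicit IoWaitScope(ConcurrencyManager* mgr)
          : mgr_(mgr), granted_(mgr->EnterIoWait()) {}
      ~IoWaitScope() {
        if (granted_) mgr_->ExitIoWait();
      }

     private:
      ConcurrencyManager* mgr_;
      bool granted_;
    };

A task deserializing a CSV or Parquet chunk would then look roughly like:

    {
      IoWaitScope scope(&manager);
      // blocking read; another task may be spawned while we wait
      buffer = input->Read(nbytes);
    }
    DeserializeColumn(buffer);  // back under the hard CPU limit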

> > I'll be looking into applying more sophisticated NUMA and locality-aware
> > tricks as I clean up the paths for the data streams in the parser.
>
> Hmm, as a first approach, I don't think we should waste time trying such
> sophisticated optimizations (well, of course, you are free to do so :-)).
>
> Regards
>
> Antoine.
