Hi Anton,

Ideally PRs like https://github.com/aregm/arrow/pull/6 would be made
into apache/arrow where the community can see them. I had no idea this
PR existed.

I have looked at the demo repository a little bit, but I'm not sure
what conclusions it will help reach. What we are missing at the moment
is a richer programming model / API for multi-threaded workflows that
do a mix of CPU and IO. If that programming model can make use of TBB
for better performance (but without exposing a lot of TBB-specific
details to the Arrow library developer), then that is great.
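
As a rough illustration of the "without exposing TBB-specific details"
point, here is a minimal sketch. It is not code from Arrow or from the PR:
the names Executor, AsyncExecutor and ParallelSum are hypothetical, and the
backend uses std::async purely as a stand-in for whatever scheduler (TBB or
otherwise) would sit behind the interface. It also only covers CPU tasks,
not the mixed CPU/IO case.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical interface: library code submits tasks here and never sees
// which scheduler (std::async, TBB, a custom pool) actually runs them.
class Executor {
 public:
  virtual ~Executor() = default;
  virtual std::future<void> Submit(std::function<void()> task) = 0;
};

// Stand-in backend built only on the standard library. A TBB-based
// backend would implement the same interface (assumption, not shown).
class AsyncExecutor : public Executor {
 public:
  std::future<void> Submit(std::function<void()> task) override {
    return std::async(std::launch::async, std::move(task));
  }
};

// Example "library" routine: sums `values` in `num_chunks` pieces while
// depending only on the Executor interface.
long long ParallelSum(Executor& executor, const std::vector<int>& values,
                      std::size_t num_chunks) {
  if (num_chunks == 0) num_chunks = 1;
  std::vector<long long> partial(num_chunks, 0);
  std::vector<std::future<void>> pending;
  const std::size_t chunk = (values.size() + num_chunks - 1) / num_chunks;
  for (std::size_t i = 0; i < num_chunks; ++i) {
    const std::size_t begin = std::min(i * chunk, values.size());
    const std::size_t end = std::min(begin + chunk, values.size());
    // Each task writes only its own slot, so no locking is needed.
    pending.push_back(executor.Submit([&values, &partial, i, begin, end] {
      partial[i] = std::accumulate(values.begin() + begin,
                                   values.begin() + end, 0LL);
    }));
  }
  for (auto& f : pending) f.get();  // wait for all chunks to finish
  return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

The point of the design is that ParallelSum could be benchmarked against
different backends without being rewritten.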

In any case, I'd prefer to collaborate in the context of the Arrow C++
library and the specific problems we are solving there. I haven't had
much time lately to write new code, but one of the things I'm most
interested in working on in the near future is the C++ Data Frame
library: 
https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing.
That might be a good place to experiment with a better threading /
work-scheduling API to make things simpler for implementers.

- Wes

On Fri, Jul 12, 2019 at 4:21 PM Malakhov, Anton
<anton.malak...@intel.com> wrote:
>
> Hi, folks
>
> We were discussing improvements for the threading engine back in May and 
> agreed to implement benchmarks (sorry, I've lost the original mail thread, 
> here is the link: 
> https://lists.apache.org/thread.html/c690253d0bde643a5b644af70ec1511c6e510ebc86cc970aa8d5252e@%3Cdev.arrow.apache.org%3E
>  )
>
> Here is an update on what's going on with this effort.
> We've implemented a rough prototype for group_by, aggregate, and transform 
> execution nodes on top of Arrow (along with studying the whole data analytics 
> domain along the way :-) ) and made them parallel, as you can see in this 
> repository: https://github.com/anton-malakhov/nyc_taxi
>
> The result is that all these execution nodes scale well enough and run in
> under 100 milliseconds on my 2 x Xeon E5-2650 v4 @ 2.20GHz with 128 GB RAM,
> while the CSV reader takes several seconds to complete even when reading
> from an in-memory file (8 GB), so it is not IO bound yet, even with good
> consumer-grade SSDs. My focus recently has therefore been on optimizing the
> CSV parser, where I have achieved a 50% improvement by routing all the small
> object allocations through the TBB scalable allocator and by using a
> TBB-based memory pool instead of the default one, with pre-allocated huge
> (2 MB) memory pages (echo 30000 > /proc/sys/vm/nr_hugepages). I have not yet
> found a way to do both of these tricks with jemalloc, so please try to beat
> or meet my times without the TBB allocator. I also see other hotspots and
> opportunities for optimization. For example, memset is heavily used while
> resizing buffers (why?), and the column builder thrashes caches by not using
> streaming stores.
>
> I used TBB directly to make the execution nodes parallel. However, I have
> also implemented a simple TBB-based ThreadPool and TaskGroup, as you can see
> in this PR: https://github.com/aregm/arrow/pull/6
> I see consistent improvement (up to 1200%!) on the BM_ThreadedTaskGroup and
> BM_ThreadPoolSpawn microbenchmarks. However, when applying it to the
> real-world task of the CSV reader, I don't see any improvement yet. Even
> worse, while reading the file, TBB wastes some cycles spinning, probably
> because of the read-ahead thread, which oversubscribes the machine. Arrow's
> threading interacts better with the OS scheduler and thus shows better
> performance. So this simple approach to TBB, without a deeper redesign,
> didn't help. I'll be looking into applying more sophisticated NUMA- and
> locality-aware tricks as I clean up the paths for the data streams in the
> parser. However, I'll take some time off before returning to this effort.
> See you in September!
>
>
> Regards,
> // Anton
>
