Hi, folks

We were discussing improvements for the threading engine back in May and agreed 
to implement benchmarks (sorry, I've lost the original mail thread, here is the 
link: 
https://lists.apache.org/thread.html/c690253d0bde643a5b644af70ec1511c6e510ebc86cc970aa8d5252e@%3Cdev.arrow.apache.org%3E
 )

Here is an update on what's going on with this effort.
We've implemented a rough prototype for group_by, aggregate, and transform 
execution nodes on top of Arrow (along with studying the whole data analytics 
domain along the way :-) ) and made them parallel, as you can see in this 
repository: https://github.com/anton-malakhov/nyc_taxi

The result is that all these execution nodes scale well enough and run in under 
100 milliseconds on my machine (2 x Xeon E5-2650 v4 @ 2.20GHz, 128 GB RAM), 
while the CSV reader takes several seconds to complete even when reading from 
an in-memory file (8 GB). So the reader is not IO bound yet, even with good 
consumer-grade SSDs. My focus recently has therefore been on optimizing the CSV 
parser, where I have achieved a 50% improvement by routing all the small object 
allocations through the TBB scalable allocator and by replacing the default 
memory pool with a TBB-based one backed by pre-allocated huge (2 MB) memory 
pages (echo 30000 > /proc/sys/vm/nr_hugepages). 
I have not yet found a way to do both of these tricks with jemalloc, so please 
try to beat or meet my times without the TBB allocator. I also see other 
hotspots and opportunities for optimization. For example, memset is heavily 
used while resizing buffers (why?), and the column builder thrashes caches by 
not using streaming stores.
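
For a sense of scale, here is a rough sketch of the arithmetic behind the 
nr_hugepages setting mentioned above. It assumes the huge pages are exactly 
2 MiB each; the helper names ReservedBytes and PagesFor are made up for 
illustration and are not part of Arrow or TBB.

```cpp
#include <cstdint>

// Assumption: one huge page is 2 MiB (the "2 MB" pages from the mail).
constexpr std::uint64_t kHugePageBytes = 2ull * 1024 * 1024;

// Bytes set aside when `pages` is written to /proc/sys/vm/nr_hugepages.
// Hypothetical helper, for illustration only.
constexpr std::uint64_t ReservedBytes(std::uint64_t pages) {
  return pages * kHugePageBytes;
}

// Number of huge pages needed to back a buffer of `bytes`, rounded up.
// Hypothetical helper, for illustration only.
constexpr std::uint64_t PagesFor(std::uint64_t bytes) {
  return (bytes + kHugePageBytes - 1) / kHugePageBytes;
}

// 30000 pages * 2 MiB = 60000 MiB, i.e. a bit under 59 GiB.
static_assert(ReservedBytes(30000) == 60000ull * 1024 * 1024,
              "30000 huge pages reserve 60000 MiB");
```

Under that assumption, 30000 pages reserve 60000 MiB (about 58.6 GiB) of the 
machine's RAM, and a buffer of 8 GiB would span 4096 huge pages.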

I used TBB directly to make the execution nodes parallel, however I have also 
implemented a simple TBB-based ThreadPool and TaskGroup as you can see in this 
PR: https://github.com/aregm/arrow/pull/6
I see a consistent improvement (up to 1200%!) on the BM_ThreadedTaskGroup and 
BM_ThreadPoolSpawn microbenchmarks. However, when applying it to the real-world 
task of the CSV reader, I don't see any improvement yet. Even worse, while 
reading the file, TBB wastes some cycles spinning, probably because the 
read-ahead thread oversubscribes the machine. Arrow's threading interacts 
better with the OS scheduler and thus shows better performance. So this simple 
approach to TBB, without a deeper redesign, didn't help. I'll be looking into 
applying more sophisticated NUMA and locality-aware tricks as I clean up the 
paths for the data streams in the parser. I'll take some time off before 
returning to this effort, though. See you in September!
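
If oversubscription by the read-ahead thread is indeed the cause (this is only 
a guess so far), one cheap experiment is to size the compute pool so that it 
leaves a hardware thread free for IO. A minimal sketch using only the C++ 
standard library; ComputeWorkers and DefaultComputeWorkers are hypothetical 
helpers, not Arrow or TBB APIs.

```cpp
#include <thread>

// Hypothetical helper: number of compute workers to start when `reserved`
// threads (for example one read-ahead IO thread) are already running.
// Always returns at least 1 so that some compute work can proceed.
inline unsigned ComputeWorkers(unsigned hardware_threads, unsigned reserved) {
  if (hardware_threads <= reserved) return 1;
  return hardware_threads - reserved;
}

// Same, based on what the standard library reports for this machine.
inline unsigned DefaultComputeWorkers(unsigned reserved = 1) {
  // hardware_concurrency() may return 0 when the value is not known.
  unsigned hw = std::thread::hardware_concurrency();
  return ComputeWorkers(hw == 0 ? 1 : hw, reserved);
}
```

The resulting count would then be passed to whatever knob limits the pool 
size. Whether this actually removes the spinning has not been measured here.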


Regards,
// Anton
