Hi, folks We were discussing improvements for the threading engine back in May and agreed to implement benchmarks (sorry, I've lost the original mail thread, here is the link: https://lists.apache.org/thread.html/c690253d0bde643a5b644af70ec1511c6e510ebc86cc970aa8d5252e@%3Cdev.arrow.apache.org%3E )
Here is update of what's going on with this effort. We've implemented a rough prototype for group_by, aggregate, and transform execution nodes on top of Arrow (along with studying the whole data analytics domain along the way :-) ) and made them parallel, as you can see in this repository: https://github.com/anton-malakhov/nyc_taxi The result is that all these execution nodes scale well enough and run under 100 milliseconds on my 2 x Xeon E5-2650 v4 @ 2.20GHz, 128Gb RAM while CSV reader takes several seconds to complete even reading from in-memory file (8Gb), thus it is not IO bound yet even with good consumer-grade SSDs. Thus my focus recently has been around optimization of CSV parser where I have achieved 50% improvement substituting all the small object allocations via TBB scalable allocator and using TBB-based memory pool instead of default one with pre-allocated huge (2Mb) memory pages (echo 30000 > /proc/sys/vm/nr_hugepages). I found no way yet how to do both of these tricks with jemalloc, so please try to beat or meet my times without TBB allocator. I also see other hotspots and opportunities for optimizations, some examples are memset is being heavily used while resizing buffers (why and why?) and the column builder trashes caches by not using of streaming stores. I used TBB directly to make the execution nodes parallel, however I have also implemented a simple TBB-based ThreadPool and TaskGroup as you can see in this PR: https://github.com/aregm/arrow/pull/6 I see consistent improvement (up to 1200%!) on BM_ThreadedTaskGroup and BM_ThreadPoolSpawn microbenchmarks, however applying it to the real world task of CSV reader, I don't see any improvements yet. Or even worse, while reading the file, TBB wastes some cycles spinning.. probably because of read-ahead thread, which oversubscribes the machine. Arrow's threading better interacts with OS scheduler thus shows better performance. So, this simple approach to TBB without a deeper redesign didn't help. I'll be looking into applying more sophisticated NUMA and locality-aware tricks as I'll be cleaning paths for the data streams in the parser. Though, I'll take some time off before returning to this effort. See you in September! Regards, // Anton