Hi Anton,

Ideally, PRs like https://github.com/aregm/arrow/pull/6 would be made
against apache/arrow, where the community can see them. I had no idea
this PR existed.

I have looked at the demo repository a little bit, but I'm not sure
what conclusions it will help reach. What we are missing at the moment
is a richer programming model / API for multi-threaded workflows that
do a mix of CPU and IO. If that programming model can make use of TBB
for better performance (but without exposing a lot of TBB-specific
details to the Arrow library developer), then that is great.

In any case, I'd prefer to collaborate in the context of the Arrow C++
library and the specific problems we are solving there. I haven't had
much time lately to write new code, but one of the things I'm most
interested in working on in the near future is the C++ Data Frame
library:

https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit?usp=sharing

That might be a good place to experiment with a better threading /
work-scheduling API to make things simpler for implementers.

- Wes

On Fri, Jul 12, 2019 at 4:21 PM Malakhov, Anton <anton.malak...@intel.com> wrote:
>
> Hi, folks
>
> We were discussing improvements for the threading engine back in May and
> agreed to implement benchmarks (sorry, I've lost the original mail thread,
> here is the link:
> https://lists.apache.org/thread.html/c690253d0bde643a5b644af70ec1511c6e510ebc86cc970aa8d5252e@%3Cdev.arrow.apache.org%3E
> )
>
> Here is an update on what's going on with this effort.
> We've implemented a rough prototype for group_by, aggregate, and transform
> execution nodes on top of Arrow (along with studying the whole data
> analytics domain along the way :-) ) and made them parallel, as you can see
> in this repository: https://github.com/anton-malakhov/nyc_taxi
>
> The result is that all these execution nodes scale well enough and run in
> under 100 milliseconds on my 2 x Xeon E5-2650 v4 @ 2.20GHz, 128 GB RAM,
> while the CSV reader takes several seconds to complete even when reading
> from an in-memory file (8 GB), thus it is not IO bound yet even with good
> consumer-grade SSDs. Thus my focus recently has been on optimization of the
> CSV parser, where I have achieved a 50% improvement by substituting all the
> small object allocations with the TBB scalable allocator and using a
> TBB-based memory pool instead of the default one, with pre-allocated huge
> (2 MB) memory pages
> (echo 30000 > /proc/sys/vm/nr_hugepages).
> I have not yet found a way to do both of these tricks with jemalloc, so
> please try to beat or meet my times without the TBB allocator.
> I also see other hotspots and opportunities for optimization; for example,
> memset is heavily used while resizing buffers (why and why?), and the
> column builder trashes caches by not using streaming stores.
>
> I used TBB directly to make the execution nodes parallel; however, I have
> also implemented a simple TBB-based ThreadPool and TaskGroup, as you can see
> in this PR: https://github.com/aregm/arrow/pull/6
> I see consistent improvement (up to 1200%!) on the BM_ThreadedTaskGroup and
> BM_ThreadPoolSpawn microbenchmarks; however, applying it to the real-world
> task of the CSV reader, I don't see any improvements yet. Or even worse:
> while reading the file, TBB wastes some cycles spinning, probably because of
> the read-ahead thread, which oversubscribes the machine. Arrow's threading
> interacts better with the OS scheduler and thus shows better performance.
> So, this simple approach to TBB without a deeper redesign didn't help. I'll
> be looking into applying more sophisticated NUMA and locality-aware tricks
> as I clean up the paths for the data streams in the parser. Though, I'll
> take some time off before returning to this effort. See you in September!
>
>
> Regards,
> // Anton