Can't some (most) of it be added to APR <https://apr.apache.org/>?
On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> hi Henry,
>
> Thank you for these comments.
>
> I think having a kind of "Apache Commons for [Modern] C++" would be an
> ideal (though perhaps initially more labor intensive) solution.
> There's code in Arrow that I would move into this project if it
> existed. I am happy to help make this happen if there is interest from
> the Kudu and Impala communities. I am not sure logistically what would
> be the most expedient way to establish the project, whether as an ASF
> Incubator project or possibly as a new TLP that could be created by
> spinning IP out of Apache Kudu.
>
> I'm interested to hear the opinions of others, and possible next steps.
>
> Thanks
> Wes
>
> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> > Thanks for bringing this up, Wes.
> >
> > On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>
> >> (I'm not sure the best way to have a cross-list discussion, so I
> >> apologize if this does not work well)
> >>
> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> >> between the codebases in Apache Arrow and Apache Parquet, and
> >> opportunities for more code sharing with Kudu and Impala as well.
> >>
> >> As context
> >>
> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >> first C++ release within Apache Parquet. I got involved with this
> >> project a little over a year ago and was faced with the unpleasant
> >> decision to copy and paste a significant amount of code out of
> >> Impala's codebase to bootstrap the project.
> >>
> >> * In parallel, we began the Apache Arrow project, which is designed to
> >> be a complementary library for file formats (like Parquet), storage
> >> engines (like Kudu), and compute engines (like Impala and pandas).
> >>
> >> * As Arrow and parquet-cpp matured, an increasing amount of code
> >> overlap crept up surrounding buffer memory management and IO
> >> interface. We recently decided in PARQUET-818
> >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
> >> to remove some of the obvious code overlap in Parquet and make
> >> libarrow.a/so a hard compile and link-time dependency for
> >> libparquet.a/so.
> >>
> >> * There is still quite a bit of code in parquet-cpp that would better
> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> >> compression, bit utilities, and so forth. Much of this code originated
> >> from Impala
> >>
> >> This brings me to a next set of points:
> >>
> >> * parquet-cpp contains quite a bit of code that was extracted from
> >> Impala. This is mostly self-contained in
> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>
> >> * My understanding is that Kudu extracted certain computational
> >> utilities from Impala in its early days, but these tools have likely
> >> diverged as the needs of the projects have evolved.
> >>
> >> Since all of these projects are quite different in their end goals
> >> (runtime systems vs. libraries), touching code that is tightly coupled
> >> to either Kudu or Impala's runtimes is probably not worth discussing.
> >> However, I think there is a strong basis for collaboration on
> >> computational utilities and vectorized array processing. Some obvious
> >> areas that come to mind:
> >>
> >> * SIMD utilities (for hashing or processing of preallocated contiguous
> >> memory)
> >> * Array encoding utilities: RLE / Dictionary, etc.
> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >> contributed a patch to parquet-cpp around this)
> >> * Date and time utilities
> >> * Compression utilities
> >>
> >
> > Between Kudu and Impala (at least) there are many more opportunities for
> > sharing.
> > Threads, logging, metrics, concurrent primitives - the list is
> > quite long.
> >
> >>
> >> I hope the benefits are obvious: consolidating efforts on unit
> >> testing, benchmarking, performance optimizations, continuous
> >> integration, and platform compatibility.
> >>
> >> Logistically speaking, one possible avenue might be to use Apache
> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> >> small, and it builds and installs fast. It is intended as a library to
> >> have its headers used and linked against other applications. (As an
> >> aside, I'm very interested in building optional support for Arrow
> >> columnar messages into the kudu client).
> >>
> >
> > In principle I'm in favour of code sharing, and it seems very much in
> > keeping with the Apache way. However, practically speaking I'm of the
> > opinion that it only makes sense to house shared support code in a
> > separate, dedicated project.
> >
> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
> > of sharing to utilities that Arrow is interested in. It would make no sense
> > to add a threading library to Arrow if it was never used natively. Muddying
> > the waters of the project's charter seems likely to lead to user, and
> > developer, confusion. Similarly, we should not necessarily couple Arrow's
> > design goals to those it inherits from Kudu and Impala's source code.
> >
> > I think I'd rather see a new Apache project than re-use a current one for
> > two independent purposes.
> >
> >>
> >> The downside of code sharing, which may have prevented it so far, are
> >> the logistics of coordinating ASF release cycles and keeping build
> >> toolchains in sync.
> >> It's taken us the past year to stabilize the
> >> design of Arrow for its intended use cases, so at this point if we
> >> went down this road I would be OK with helping the community commit to
> >> a regular release cadence that would be faster than Impala, Kudu, and
> >> Parquet's respective release cadences. Since members of the Kudu and
> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >> collaborate to each other's mutual benefit and success.
> >>
> >> Note that Arrow does not throw C++ exceptions and similarly follows
> >> Google C++ style guide to the same extent as Kudu and Impala.
> >>
> >> If this is something that either the Kudu or Impala communities would
> >> like to pursue in earnest, I would be happy to work with you on next
> >> steps. I would suggest that we start with something small so that we
> >> could address the necessary build toolchain changes, and develop a
> >> workflow for moving around code and tests, a protocol for code reviews
> >> (e.g. Gerrit), and coordinating ASF releases.
> >>
> >
> > I think, if I'm reading this correctly, that you're assuming integration
> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > their toolchains. For something as fast moving as utility code - and
> > critical, where you want the latency between adding a fix and including it
> > in your build to be ~0 - that's a non-starter to me, at least with how the
> > toolchains are currently realised.
> >
> > I'd rather have the source code directly imported into Impala's tree -
> > whether by git submodule or other mechanism. That way the coupling is
> > looser, and we can move more quickly. I think that's important to other
> > projects as well.
> >
> > Henry
> >
> >>
> >> Let me know what you think.
> >>
> >> best
> >> Wes
> >>