I also support the idea of creating an "Apache Commons for modern C++"-style library, maybe tailored toward the needs of columnar data processing tools. I think APR is the wrong project, but that *style* of project is the right direction to aim for.
I agree this adds test and release process complexity across products, but I think the benefits of a shared, well-tested library outweigh that, and creating such test infrastructure will have long-term benefits as well. I'd be happy to lend a hand wherever it's needed.

On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <t...@cloudera.com> wrote:

> Hey folks,
>
> As Henry mentioned, Impala is starting to share more code with Kudu (most notably our RPC system, but that pulls in a fair bit of utility code as well), so we've been chatting periodically offline about the best way to do this. Having more projects potentially interested in collaborating is definitely welcome, though I think it does also increase the complexity of whatever solution we come up with.
>
> I think the potential benefits of collaboration are fairly self-evident, so I'll focus on my concerns here, which somewhat echo Henry's.
>
> 1) Open source release model
>
> The ASF is very much against having projects which do not do releases. So, if we were to create some new ASF project to hold this code, we'd be expected to do frequent releases thereof. Wes volunteered above to lead frequent releases, but we actually need at least 3 PMC members to vote on each release, and given people can come and go, we'd probably need at least 5-8 people who are actively committed to helping with the release process of this "commons" project.
>
> Unlike our existing projects, which seem to release every 2-3 months, if that, I think this one would have to release _much_ more frequently, if we expect downstream projects to depend on released versions rather than just pulling in some recent (or even trunk) git hash. Since the ASF requires the normal voting period and process for every release, I don't think we could do something like have "daily automatic releases", etc.
>
> We could probably campaign the ASF membership to treat this project differently, either as (a) a repository of code that never releases, in which case the "downstream" projects are responsible for vetting IP, etc., as part of their own release processes, or (b) a project which does automatic releases voted upon by robots. I'm guessing that (a) is more palatable from an IP perspective, and also from the perspective of the downstream projects.
>
> 2) Governance/review model
>
> The more projects there are sharing this common code, the more difficult it is to know whether a change would break something, or even whether a change is considered desirable for all of the projects. I don't want to get into some world where any change to a central library requires a multi-week proposal/design-doc/review across 3+ different groups of committers, all of whom may have different near-term priorities. On the other hand, it would be pretty frustrating if, the week before we're trying to cut a Kudu release branch, someone in another community decides to make a potentially destabilizing change to the RPC library.
>
> 3) Pre-commit/test mechanics
>
> Semi-related to the above: we currently feel pretty confident when we make a change to a central library like kudu/util/thread.cc that nothing broke, because we run the full suite of Kudu tests. Of course the central libraries have some unit test coverage, but I wouldn't be confident with any sort of model where shared code can change without verification by a larger suite of tests.
>
> On the other hand, I also don't want to move to a model where any change to shared code requires a 6+-hour precommit spanning several projects, each of which may have its own set of potentially-flaky pre-commit tests, etc. I can imagine that if an Arrow developer made some change to "thread.cc" and saw that TabletServerStressTest failed their precommit, they'd have no idea how to triage it, etc. That could be a strong disincentive to continued innovation in these areas of common code, which we'll need a good way to avoid.
>
> I think some of the above could be ameliorated with really good infrastructure -- e.g. on a test failure, automatically re-run the failed test on both pre-patch and post-patch builds and do a t-test to check for a statistically significant change in flakiness level. But that's a lot of infrastructure that doesn't currently exist.
>
> 4) Integration mechanics for breaking changes
>
> Currently these common libraries are treated as components of monolithic projects. That means it's no extra overhead for us to make some kind of change which breaks an API in src/kudu/util/ and at the same time updates all call sites. The internal libraries have no semblance of API compatibility guarantees, etc., and adding one is not without cost.
>
> Before sharing code, we should figure out how exactly we'll manage the cases where we want to make some change in a common library that breaks an API used by other projects, given there's no way to make an atomic commit across many repositories. One option is that each "user" of the libraries manually "rolls" to new versions when they feel like it, but there's still now a case where a common change "pushes work onto" the consumers to update call sites, etc.
>
> Admittedly, the number of breaking API changes in these common libraries is relatively small, but it would still be good to understand how we would plan to manage them.
>
> -Todd
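
To make Todd's point (3) concrete: the "re-run and compare" idea boils down to rerunning the failed test some number of times against both the pre-patch and post-patch builds and checking whether the difference in failure rate is statistically significant. Below is a minimal, purely illustrative sketch of that check -- the names are invented, and a two-proportion z-test stands in for the t-test Todd mentions; as he says, the hard part is the surrounding automation, which doesn't exist today.

```cpp
// Illustrative only: rerun a failing test N times on the pre-patch and
// post-patch builds and ask whether the failure rates differ significantly.
// Names are invented; a two-proportion z-test stands in for the t-test.
#include <cmath>
#include <cstdio>

struct RerunResult {
  int runs;      // number of reruns of the flaky test
  int failures;  // how many of those reruns failed
};

// z statistic for the difference in failure proportions (pooled variance).
double FlakinessZScore(const RerunResult& pre, const RerunResult& post) {
  const double p_pre = static_cast<double>(pre.failures) / pre.runs;
  const double p_post = static_cast<double>(post.failures) / post.runs;
  const double pooled = static_cast<double>(pre.failures + post.failures) /
                        (pre.runs + post.runs);
  const double se = std::sqrt(pooled * (1.0 - pooled) *
                              (1.0 / pre.runs + 1.0 / post.runs));
  if (se == 0.0) return 0.0;  // every rerun passed (or failed) in both builds
  return (p_post - p_pre) / se;
}

int main() {
  // Example: 3/50 failures before the patch vs. 12/50 after it.
  const RerunResult pre{50, 3}, post{50, 12};
  const double z = FlakinessZScore(pre, post);
  // |z| > 1.96 corresponds to p < 0.05 for a two-sided test.
  std::printf("z = %.2f -> %s\n", z,
              std::fabs(z) > 1.96 ? "likely a real regression"
                                  : "within normal flakiness");
  return 0;
}
```
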
> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Henry,
> >
> > Thank you for these comments.
> >
> > I think having a kind of "Apache Commons for [Modern] C++" would be an ideal (though perhaps initially more labor intensive) solution. There's code in Arrow that I would move into this project if it existed. I am happy to help make this happen if there is interest from the Kudu and Impala communities. I am not sure logistically what would be the most expedient way to establish the project, whether as an ASF Incubator project or possibly as a new TLP that could be created by spinning IP out of Apache Kudu.
> >
> > I'm interested to hear the opinions of others, and possible next steps.
> >
> > Thanks
> > Wes
> >
> > On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> > > Thanks for bringing this up, Wes.
> > >
> > > On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > >> Dear Apache Kudu and Apache Impala (incubating) communities,
> > >>
> > >> (I'm not sure the best way to have a cross-list discussion, so I apologize if this does not work well)
> > >>
> > >> On the recent Apache Parquet sync call, we discussed C++ code sharing between the codebases in Apache Arrow and Apache Parquet, and opportunities for more code sharing with Kudu and Impala as well.
> > >>
> > >> As context:
> > >>
> > >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the first C++ release within Apache Parquet. I got involved with this project a little over a year ago and was faced with the unpleasant decision to copy and paste a significant amount of code out of Impala's codebase to bootstrap the project.
> > >>
> > >> * In parallel, we began the Apache Arrow project, which is designed to be a complementary library for file formats (like Parquet), storage engines (like Kudu), and compute engines (like Impala and pandas).
> > >>
> > >> * As Arrow and parquet-cpp matured, an increasing amount of code overlap crept in around buffer memory management and IO interfaces. We recently decided in PARQUET-818 (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02) to remove some of the obvious code overlap in Parquet and make libarrow.a/so a hard compile- and link-time dependency for libparquet.a/so.
> > >>
> > >> * There is still quite a bit of code in parquet-cpp that would better fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding, compression, bit utilities, and so forth. Much of this code originated from Impala.
> > >>
> > >> This brings me to my next set of points:
> > >>
> > >> * parquet-cpp contains quite a bit of code that was extracted from Impala. This is mostly self-contained in https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> > >>
> > >> * My understanding is that Kudu extracted certain computational utilities from Impala in its early days, but these tools have likely diverged as the needs of the projects have evolved.
> > >>
> > >> Since all of these projects are quite different in their end goals (runtime systems vs. libraries), touching code that is tightly coupled to either Kudu or Impala's runtimes is probably not worth discussing. However, I think there is a strong basis for collaboration on computational utilities and vectorized array processing. Some obvious areas that come to mind:
> > >>
> > >> * SIMD utilities (for hashing or processing of preallocated contiguous memory)
> > >> * Array encoding utilities: RLE / Dictionary, etc.
> > >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire contributed a patch to parquet-cpp around this)
> > >> * Date and time utilities
> > >> * Compression utilities
> > >
> > > Between Kudu and Impala (at least) there are many more opportunities for sharing. Threads, logging, metrics, concurrent primitives - the list is quite long.
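
For a sense of scale, the "array encoding utilities" Wes lists are exactly the kind of small, dependency-free, easily unit-tested code that a shared library could host. The sketch below is a toy illustration only -- the names are invented, and the real encoders in Impala and parquet-cpp are more involved (Parquet's RLE encoding, for instance, is combined with bit-packing):

```cpp
// Toy run-length encoder/decoder, standing in for the kind of self-contained
// encoding utility discussed above. Not the actual Impala/parquet-cpp RLE,
// which operates on bit-packed buffers.
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Collapse consecutive equal values into (value, run length) pairs.
template <typename T>
std::vector<std::pair<T, uint32_t>> RunLengthEncode(const std::vector<T>& values) {
  std::vector<std::pair<T, uint32_t>> runs;
  for (const T& v : values) {
    if (!runs.empty() && runs.back().first == v) {
      ++runs.back().second;
    } else {
      runs.emplace_back(v, 1);
    }
  }
  return runs;
}

// Expand (value, run length) pairs back into a flat vector.
template <typename T>
std::vector<T> RunLengthDecode(const std::vector<std::pair<T, uint32_t>>& runs) {
  std::vector<T> out;
  for (const auto& run : runs) {
    out.insert(out.end(), run.second, run.first);
  }
  return out;
}

int main() {
  std::vector<int> values = {7, 7, 7, 3, 3, 9};
  auto runs = RunLengthEncode(values);      // {(7,3), (3,2), (9,1)}
  assert(RunLengthDecode(runs) == values);  // round-trips losslessly
  return 0;
}
```

Whether a utility like this belongs in Arrow or in a new, dedicated project is the scoping question Henry raises below.
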
> > >> I hope the benefits are obvious: consolidating efforts on unit testing, benchmarking, performance optimizations, continuous integration, and platform compatibility.
> > >>
> > >> Logistically speaking, one possible avenue might be to use Apache Arrow as the place to assemble this code. Its thirdparty toolchain is small, and it builds and installs fast. It is intended as a library to have its headers used and linked against other applications. (As an aside, I'm very interested in building optional support for Arrow columnar messages into the kudu client).
> > >
> > > In principle I'm in favour of code sharing, and it seems very much in keeping with the Apache way. However, practically speaking I'm of the opinion that it only makes sense to house shared support code in a separate, dedicated project.
> > >
> > > Embedding the shared libraries in, e.g., Arrow naturally limits the scope of sharing to utilities that Arrow is interested in. It would make no sense to add a threading library to Arrow if it was never used natively. Muddying the waters of the project's charter seems likely to lead to user, and developer, confusion. Similarly, we should not necessarily couple Arrow's design goals to those it inherits from Kudu and Impala's source code.
> > >
> > > I think I'd rather see a new Apache project than re-use a current one for two independent purposes.
> > >
> > >> The downside of code sharing, which may have prevented it so far, is the logistics of coordinating ASF release cycles and keeping build toolchains in sync. It's taken us the past year to stabilize the design of Arrow for its intended use cases, so at this point, if we went down this road, I would be OK with helping the community commit to a regular release cadence that would be faster than Impala, Kudu, and Parquet's respective release cadences. Since members of the Kudu and Impala PMC are also on the Arrow PMC, I trust we would be able to collaborate to each other's mutual benefit and success.
> > >>
> > >> Note that Arrow does not throw C++ exceptions and follows the Google C++ style guide to the same extent as Kudu and Impala.
> > >>
> > >> If this is something that either the Kudu or Impala communities would like to pursue in earnest, I would be happy to work with you on next steps. I would suggest that we start with something small so that we could address the necessary build toolchain changes, and develop a workflow for moving around code and tests, a protocol for code reviews (e.g. Gerrit), and coordinating ASF releases.
> > >
> > > I think, if I'm reading this correctly, that you're assuming integration with the 'downstream' projects (e.g. Impala and Kudu) would be done via their toolchains. For something as fast-moving as utility code - and critical, where you want the latency between adding a fix and including it in your build to be ~0 - that's a non-starter to me, at least with how the toolchains are currently realised.
> > >
> > > I'd rather have the source code directly imported into Impala's tree - whether by git submodule or other mechanism. That way the coupling is looser, and we can move more quickly. I think that's important to other projects as well.
> > >
> > > Henry
> > >
> > >> Let me know what you think.
> > >>
> > >> best
> > >> Wes
>
> --
> Todd Lipcon
> Software Engineer, Cloudera

--
Cheers,
Leif