Hey folks,

As Henry mentioned, Impala is starting to share more code with Kudu (most
notably our RPC system, but that pulls in a fair bit of utility code as
well), so we've been chatting periodically offline about the best way to do
this. Having more projects potentially interested in collaborating is
definitely welcome, though I think it also increases the complexity of
whatever solution we come up with.

I think the potential benefits of collaboration are fairly self-evident, so
I'll focus on my concerns here, which somewhat echo Henry's.

1) Open source release model

The ASF is very much against having projects which do not do releases. So,
if we were to create some new ASF project to hold this code, we'd be
expected to do frequent releases thereof. Wes volunteered above to lead
frequent releases, but we actually need at least 3 PMC members to vote on
each release, and given that people come and go, we'd probably need at least
5-8 people who are actively committed to helping with the release process
of this "commons" project.

Unlike our existing projects, which seem to release every 2-3 months, if
that, I think this one would have to release _much_ more frequently, if we
expect downstream projects to depend on released versions rather than just
pulling in some recent (or even trunk) git hash. Since the ASF requires the
normal voting period and process for every release, I don't think we could
do something like have "daily automatic releases", etc.

We could probably campaign the ASF membership to treat this project
differently, either as (a) a repository of code that never releases, in
which case the "downstream" projects are responsible for vetting IP, etc,
as part of their own release processes, or (b) a project which does
automatic releases voted upon by robots. I'm guessing that (a) is more
palatable from an IP perspective, and also from the perspective of the
downstream projects.


2) Governance/review model

The more projects there are sharing this common code, the more difficult it
is to know whether a change would break something, or even whether a change
is considered desirable for all of the projects. I don't want to get into
some world where any change to a central library requires a multi-week
proposal/design-doc/review across 3+ different groups of committers, all of
whom may have different near-term priorities. On the other hand, it would
be pretty frustrating if the week before we're trying to cut a Kudu release
branch, someone in another community decides to make a potentially
destabilizing change to the RPC library.


3) Pre-commit/test mechanics

Semi-related to the above: we currently feel pretty confident when we make
a change to a central library like kudu/util/thread.cc that nothing broke
because we run the full suite of Kudu tests. Of course the central
libraries have some unit test coverage, but I wouldn't be confident with
any sort of model where shared code can change without verification by a
larger suite of tests.

On the other hand, I also don't want to move to a model where any change to
shared code requires a 6+-hour precommit spanning several projects, each of
which may have its own set of potentially-flaky pre-commit tests, etc. I
can imagine that if an Arrow developer made some change to "thread.cc" and
saw that TabletServerStressTest failed their precommit, they'd have no idea
how to triage it, etc. That could be a strong disincentive to continued
innovation in these areas of common code, so we'd need a good way to avoid
that outcome.

I think some of the above could be ameliorated with really good
infrastructure -- eg on a test failure, automatically re-run the failed
test on both pre-patch and post-patch, do a t-test to check statistical
significance in flakiness level, etc. But, that's a lot of infrastructure
that doesn't currently exist.
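To make that concrete, here's a minimal sketch of the kind of check such
infrastructure might run (nothing like this exists today; the function names
are made up). Since individual test runs are pass/fail, this uses a
two-proportion z-test on failure rates rather than a t-test, but the idea is
the same: compare pre-patch and post-patch flakiness and only blame the
patch when the difference is statistically significant:

```python
# Hypothetical sketch of the flakiness check described above: re-run the
# failed test many times at the pre-patch and post-patch revisions, then ask
# whether the observed failure rates differ significantly.
import math

def flakiness_p_value(pre_failures, pre_runs, post_failures, post_runs):
    """Two-sided p-value for 'the failure rate changed with this patch',
    via a pooled two-proportion z-test (normal approximation)."""
    p1 = pre_failures / pre_runs
    p2 = post_failures / post_runs
    pooled = (pre_failures + post_failures) / (pre_runs + post_runs)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pre_runs + 1 / post_runs))
    if se == 0:  # both rates identical (e.g. zero failures on both sides)
        return 1.0
    z = (p2 - p1) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail

def patch_looks_responsible(pre_failures, pre_runs,
                            post_failures, post_runs, alpha=0.05):
    """True if the post-patch failure rate is significantly *higher*,
    i.e. the precommit failure probably isn't pre-existing flakiness."""
    worse = post_failures / post_runs > pre_failures / pre_runs
    return worse and flakiness_p_value(
        pre_failures, pre_runs, post_failures, post_runs) < alpha
```

So a test failing 2/100 runs before the patch and 20/100 after would be
flagged, while 2/100 vs. 3/100 would be written off as existing flakiness.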


4) Integration mechanics for breaking changes

Currently these common libraries are treated as components of monolithic
projects. That means there's no extra overhead for us to make some kind of
change which breaks an API in src/kudu/util/ and at the same time updates
all call sites. The internal libraries have no semblance of API
compatibility guarantees, and adding one is not without cost.

Before sharing code, we should figure out how exactly we'll manage the
cases where we want to make some change in a common library that breaks an
API used by other projects, given there's no way to make an atomic commit
across many repositories. One option is that each "user" of the libraries
manually "rolls" to new versions when they feel like it, but even then a
common change "pushes work onto" the consumers to update call sites, etc.

Admittedly, the number of breaking API changes in these common libraries is
relatively small, but it would still be good to understand how we plan to
manage them.

-Todd

On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Henry,
>
> Thank you for these comments.
>
> I think having a kind of "Apache Commons for [Modern] C++" would be an
> ideal (though perhaps initially more labor intensive) solution.
> There's code in Arrow that I would move into this project if it
> existed. I am happy to help make this happen if there is interest from
> the Kudu and Impala communities. I am not sure logistically what would
> be the most expedient way to establish the project, whether as an ASF
> Incubator project or possibly as a new TLP that could be created by
> spinning IP out of Apache Kudu.
>
> I'm interested to hear the opinions of others, and possible next steps.
>
> Thanks
> Wes
>
> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> > Thanks for bringing this up, Wes.
> >
> > On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>
> >> (I'm not sure the best way to have a cross-list discussion, so I
> >> apologize if this does not work well)
> >>
> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> >> between the codebases in Apache Arrow and Apache Parquet, and
> >> opportunities for more code sharing with Kudu and Impala as well.
> >>
> >> As context
> >>
> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >> first C++ release within Apache Parquet. I got involved with this
> >> project a little over a year ago and was faced with the unpleasant
> >> decision to copy and paste a significant amount of code out of
> >> Impala's codebase to bootstrap the project.
> >>
> >> * In parallel, we began the Apache Arrow project, which is designed to
> >> be a complementary library for file formats (like Parquet), storage
> >> engines (like Kudu), and compute engines (like Impala and pandas).
> >>
> >> * As Arrow and parquet-cpp matured, an increasing amount of code
> >> overlap crept in surrounding buffer memory management and IO
> >> interfaces. We recently decided in PARQUET-818
> >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
> >> to remove some of the obvious code overlap in Parquet and make
> >> libarrow.a/so a hard compile and link-time dependency for
> >> libparquet.a/so.
> >>
> >> * There is still quite a bit of code in parquet-cpp that would better
> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> >> compression, bit utilities, and so forth. Much of this code originated
> >> from Impala
> >>
> >> This brings me to a next set of points:
> >>
> >> * parquet-cpp contains quite a bit of code that was extracted from
> >> Impala. This is mostly self-contained in
> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>
> >> * My understanding is that Kudu extracted certain computational
> >> utilities from Impala in its early days, but these tools have likely
> >> diverged as the needs of the projects have evolved.
> >>
> >> Since all of these projects are quite different in their end goals
> >> (runtime systems vs. libraries), touching code that is tightly coupled
> >> to either Kudu or Impala's runtimes is probably not worth discussing.
> >> However, I think there is a strong basis for collaboration on
> >> computational utilities and vectorized array processing. Some obvious
> >> areas that come to mind:
> >>
> >> * SIMD utilities (for hashing or processing of preallocated contiguous
> >> memory)
> >> * Array encoding utilities: RLE / Dictionary, etc.
> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >> contributed a patch to parquet-cpp around this)
> >> * Date and time utilities
> >> * Compression utilities
> >>
> >
> > Between Kudu and Impala (at least) there are many more opportunities for
> > sharing. Threads, logging, metrics, concurrent primitives - the list is
> > quite long.
> >
> >
> >>
> >> I hope the benefits are obvious: consolidating efforts on unit
> >> testing, benchmarking, performance optimizations, continuous
> >> integration, and platform compatibility.
> >>
> >> Logistically speaking, one possible avenue might be to use Apache
> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> >> small, and it builds and installs fast. It is intended as a library to
> >> have its headers used and linked against other applications. (As an
> >> aside, I'm very interested in building optional support for Arrow
> >> columnar messages into the kudu client).
> >>
> >
> > In principle I'm in favour of code sharing, and it seems very much in
> > keeping with the Apache way. However, practically speaking I'm of the
> > opinion that it only makes sense to house shared support code in a
> > separate, dedicated project.
> >
> > Embedding the shared libraries in, e.g., Arrow naturally limits the
> > scope of sharing to utilities that Arrow is interested in. It would make
> > no sense to add a threading library to Arrow if it was never used
> > natively. Muddying the waters of the project's charter seems likely to
> > lead to user, and developer, confusion. Similarly, we should not
> > necessarily couple Arrow's design goals to those it inherits from Kudu
> > and Impala's source code.
> >
> > I think I'd rather see a new Apache project than re-use a current one for
> > two independent purposes.
> >
> >
> >>
> >> The downsides of code sharing, which may have prevented it so far, are
> >> the logistics of coordinating ASF release cycles and keeping build
> >> toolchains in sync. It's taken us the past year to stabilize the
> >> design of Arrow for its intended use cases, so at this point if we
> >> went down this road I would be OK with helping the community commit to
> >> a regular release cadence that would be faster than Impala, Kudu, and
> >> Parquet's respective release cadences. Since members of the Kudu and
> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >> collaborate to each other's mutual benefit and success.
> >>
> >> Note that Arrow does not throw C++ exceptions and similarly follows
> >> the Google C++ style guide to the same extent as Kudu and Impala.
> >>
> >> If this is something that either the Kudu or Impala communities would
> >> like to pursue in earnest, I would be happy to work with you on next
> >> steps. I would suggest that we start with something small so that we
> >> could address the necessary build toolchain changes, and develop a
> >> workflow for moving around code and tests, a protocol for code reviews
> >> (e.g. Gerrit), and coordinating ASF releases.
> >>
> >
> > I think, if I'm reading this correctly, that you're assuming integration
> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > their toolchains. For something as fast moving as utility code - and
> > critical, where you want the latency between adding a fix and including
> > it in your build to be ~0 - that's a non-starter to me, at least with
> > how the toolchains are currently realised.
> >
> > I'd rather have the source code directly imported into Impala's tree -
> > whether by git submodule or other mechanism. That way the coupling is
> > looser, and we can move more quickly. I think that's important to other
> > projects as well.
> >
> > Henry
> >
> >
> >
> >>
> >> Let me know what you think.
> >>
> >> best
> >> Wes
> >>
>



-- 
Todd Lipcon
Software Engineer, Cloudera
