Yes. Since an Apache project is a community, it is very easy for it to produce more than one piece of code.
> On Feb 27, 2017, at 10:34 AM, Leif Walsh <leif.wa...@gmail.com> wrote:
>
> Julian, are you proposing the Arrow project ship two artifacts,
> arrow-common and arrow, where arrow depends on arrow-common?
>
> On Mon, Feb 27, 2017 at 11:51 Julian Hyde <jh...@apache.org> wrote:
>
>> “Commons” projects are often problematic. It is difficult to tell what is
>> in scope and out of scope. If the scope is drawn too wide, there is a real
>> problem of orphaned features, because people contribute one feature and
>> then disappear.
>>
>> Let’s remember the Apache mantra: community over code. If you create a
>> sustainable community, the code will get looked after. Would this project
>> form a new community, or just a new piece of code? As I read the current
>> proposal, it would be the intersection of some existing communities, not a
>> new community.
>>
>> I think it would take a considerable effort to create a new project and
>> community around the idea of “C++ commons” (or is it “database-related C++
>> commons”?). I think you already have such a community, to a first
>> approximation, in the Arrow project, because Kudu and Impala developers are
>> already part of the Arrow community. There’s no reason why Arrow cannot
>> contain new modules that have different release schedules than the rest of
>> Arrow. As a TLP, releases are less burdensome, and can happen in a little
>> over 3 days if the component is kept stable.
>>
>> Lastly, the code is fungible. It can be marked “experimental” within Arrow
>> and moved to another project, or into a new project, as it matures. The
>> Apache license and the ASF CLA make this very easy. We are doing something
>> like this in Calcite: the Avatica sub-project [1] has a community that
>> intersects with Calcite's, is disconnected at a code level, and may over
>> time evolve into a separate project. In the meantime, being part of an
>> established project is helpful, because there are PMC members to vote.
>>
>> Julian
>>
>> [1] https://calcite.apache.org/avatica/
>
>>> On Feb 27, 2017, at 6:41 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> Responding to Todd's e-mail:
>>>
>>> 1) Open source release model
>>>
>>> My expectation is that this library would release about once a month,
>>> with occasional faster releases for critical fixes.
>>>
>>> 2) Governance/review model
>>>
>>> Beyond having centralized code reviews, it's hard to predict how the
>>> governance would play out. I understand that OSS projects behave
>>> differently in their planning / design / review process, so work on a
>>> common need may require more of a negotiation than the prior
>>> "unilateral" process.
>>>
>>> I think it says something for our communities that we would make a
>>> commitment in our collaboration on this to the success of the
>>> "consumer" projects. So if the Arrow or Parquet communities were
>>> contemplating a change that might impact Kudu, for example, it would
>>> be in our best interest to be careful and communicate proactively.
>>>
>>> This all makes sense. From an Arrow and Parquet perspective, we do not
>>> add very much testing burden because our continuous integration suites
>>> do not take long to run.
>>>
>>> 3) Pre-commit/test mechanics
>>>
>>> One thing that would help would be community-maintained
>>> Dockerfiles/Docker images (or equivalent) to assist with validation
>>> and testing for developers.
>>>
>>> I am happy to comply with a pre-commit testing protocol that works for
>>> the Kudu and Impala teams.
>>>
>>> 4) Integration mechanics for breaking changes
>>>
>>>> One option is that each "user" of the libraries manually "rolls" to new
>>>> versions when they feel like it, but there's still now a case where a
>>>> common change "pushes work onto" the consumers to update call sites, etc.
>>>
>>> Breaking API changes will create extra work, because any automated
>>> testing that we create will not be able to validate the patch to the
>>> common library. Perhaps we can configure a manual way (in Jenkins,
>>> say) to test two patches together.
>>>
>>> In the event that a community member has a patch containing an API
>>> break that impacts a project that they are not a contributor to,
>>> there should be some expectation to either work with the affected
>>> project on a coordinated patch or obtain their +1 to merge the patch,
>>> even though it may require a follow-up patch if the roll-forward
>>> in the consumer project exposes bugs in the common library. There may
>>> be situations like:
>>>
>>> * Kudu changes API in $COMMON that impacts Arrow
>>> * Arrow says +1, we will roll forward $COMMON later
>>> * Patch merged
>>> * Arrow rolls forward, discovers bug caused by patch in $COMMON
>>> * Arrow proposes patch to $COMMON
>>> * ...
>>>
>>> This is the worst-case scenario, of course, but I actually think it is
>>> good because it would indicate that the unit testing in $COMMON needs
>>> to be improved. Unit testing in the common library, therefore, would
>>> take on more of a "defensive" quality than it does currently.
>>>
>>> In any case, I'm keen to move forward with coming up with a concrete
>>> plan if we can reach consensus on the particulars.
>>>
>>> Thanks
>>> Wes
>>>
>>> On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
>>>> I also support the idea of creating an "Apache Commons modern C++"-style
>>>> library, maybe tailored toward the needs of columnar data processing
>>>> tools. I think APR is the wrong project, but I think that *style* of
>>>> project is the right direction to aim for.
>>>>
>>>> I agree this adds test and release process complexity across products,
>>>> but I think the benefits of a shared, well-tested library outweigh that,
>>>> and creating such test infrastructure will have long-term benefits as well.
>>>>
>>>> I'd be happy to lend a hand wherever it's needed.
>>>>
>>>> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <t...@cloudera.com> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>>>>> notably our RPC system, but that pulls in a fair bit of utility code as
>>>>> well), so we've been chatting periodically offline about the best way to
>>>>> do this. Having more projects potentially interested in collaborating is
>>>>> definitely welcome, though I think it does also increase the complexity
>>>>> of whatever solution we come up with.
>>>>>
>>>>> I think the potential benefits of collaboration are fairly self-evident,
>>>>> so I'll focus on my concerns here, which somewhat echo Henry's.
>>>>>
>>>>> 1) Open source release model
>>>>>
>>>>> The ASF is very much against having projects which do not do releases.
>>>>> So, if we were to create some new ASF project to hold this code, we'd be
>>>>> expected to do frequent releases thereof. Wes volunteered above to lead
>>>>> frequent releases, but we actually need at least 3 PMC members to vote on
>>>>> each release, and given that people can come and go, we'd probably need
>>>>> at least 5-8 people who are actively committed to helping with the
>>>>> release process of this "commons" project.
>>>>>
>>>>> Unlike our existing projects, which seem to release every 2-3 months, if
>>>>> that, I think this one would have to release _much_ more frequently, if
>>>>> we expect downstream projects to depend on released versions rather than
>>>>> just pulling in some recent (or even trunk) git hash. Since the ASF
>>>>> requires the normal voting period and process for every release, I don't
>>>>> think we could do something like "daily automatic releases", etc.
>>>>>
>>>>> We could probably campaign the ASF membership to treat this project
>>>>> differently, either as (a) a repository of code that never releases, in
>>>>> which case the "downstream" projects are responsible for vetting IP,
>>>>> etc., as part of their own release processes, or (b) a project which does
>>>>> automatic releases voted upon by robots. I'm guessing that (a) is more
>>>>> palatable from an IP perspective, and also from the perspective of the
>>>>> downstream projects.
>>>>>
>>>>> 2) Governance/review model
>>>>>
>>>>> The more projects there are sharing this common code, the more difficult
>>>>> it is to know whether a change would break something, or even whether a
>>>>> change is considered desirable for all of the projects. I don't want to
>>>>> get into some world where any change to a central library requires a
>>>>> multi-week proposal/design-doc/review across 3+ different groups of
>>>>> committers, all of whom may have different near-term priorities. On the
>>>>> other hand, it would be pretty frustrating if, the week before we're
>>>>> trying to cut a Kudu release branch, someone in another community decides
>>>>> to make a potentially destabilizing change to the RPC library.
>>>>>
>>>>> 3) Pre-commit/test mechanics
>>>>>
>>>>> Semi-related to the above: we currently feel pretty confident when we
>>>>> make a change to a central library like kudu/util/thread.cc that nothing
>>>>> broke, because we run the full suite of Kudu tests. Of course the central
>>>>> libraries have some unit test coverage, but I wouldn't be confident with
>>>>> any sort of model where shared code can change without verification by a
>>>>> larger suite of tests.
>>>>>
>>>>> On the other hand, I also don't want to move to a model where any change
>>>>> to shared code requires a 6+-hour precommit spanning several projects,
>>>>> each of which may have its own set of potentially flaky pre-commit tests,
>>>>> etc.
>>>>> I can imagine that if an Arrow developer made some change to "thread.cc"
>>>>> and saw that TabletServerStressTest failed their precommit, they'd have
>>>>> no idea how to triage it, etc. That could be a strong disincentive to
>>>>> continued innovation in these areas of common code, which we'll need a
>>>>> good way to avoid.
>>>>>
>>>>> I think some of the above could be ameliorated with really good
>>>>> infrastructure -- e.g., on a test failure, automatically re-run the
>>>>> failed test on both pre-patch and post-patch builds and do a t-test to
>>>>> check for a statistically significant change in flakiness level, etc.
>>>>> But that's a lot of infrastructure that doesn't currently exist.
>>>>>
>>>>> 4) Integration mechanics for breaking changes
>>>>>
>>>>> Currently these common libraries are treated as components of monolithic
>>>>> projects. That means it's no extra overhead for us to make some kind of
>>>>> change which breaks an API in src/kudu/util/ and at the same time updates
>>>>> all call sites. The internal libraries have no semblance of API
>>>>> compatibility guarantees, etc., and adding one is not without cost.
>>>>>
>>>>> Before sharing code, we should figure out how exactly we'll manage the
>>>>> cases where we want to make some change in a common library that breaks
>>>>> an API used by other projects, given there's no way to make an atomic
>>>>> commit across many repositories. One option is that each "user" of the
>>>>> libraries manually "rolls" to new versions when they feel like it, but
>>>>> there's still now a case where a common change "pushes work onto" the
>>>>> consumers to update call sites, etc.
>>>>>
>>>>> Admittedly, the number of breaking API changes in these common libraries
>>>>> is relatively small, but it would still be good to understand how we
>>>>> would plan to manage them.
>>>>>
>>>>> -Todd
>>>>>
>>>>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <wesmck...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> hi Henry,
>>>>>>
>>>>>> Thank you for these comments.
>>>>>>
>>>>>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>>>>>> ideal (though perhaps initially more labor-intensive) solution.
>>>>>> There's code in Arrow that I would move into this project if it
>>>>>> existed. I am happy to help make this happen if there is interest from
>>>>>> the Kudu and Impala communities. I am not sure logistically what would
>>>>>> be the most expedient way to establish the project, whether as an ASF
>>>>>> Incubator project or possibly as a new TLP that could be created by
>>>>>> spinning IP out of Apache Kudu.
>>>>>>
>>>>>> I'm interested to hear the opinions of others, and possible next steps.
>>>>>>
>>>>>> Thanks
>>>>>> Wes
>>>>>>
>>>>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
>>>>>> wrote:
>>>>>>> Thanks for bringing this up, Wes.
>>>>>>>
>>>>>>> On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
>>>>>>>>
>>>>>>>> (I'm not sure of the best way to have a cross-list discussion, so I
>>>>>>>> apologize if this does not work well.)
>>>>>>>>
>>>>>>>> On the recent Apache Parquet sync call, we discussed C++ code sharing
>>>>>>>> between the codebases in Apache Arrow and Apache Parquet, and
>>>>>>>> opportunities for more code sharing with Kudu and Impala as well.
>>>>>>>>
>>>>>>>> As context:
>>>>>>>>
>>>>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>>>>>>>> first C++ release within Apache Parquet. I got involved with this
>>>>>>>> project a little over a year ago and was faced with the unpleasant
>>>>>>>> decision to copy and paste a significant amount of code out of
>>>>>>>> Impala's codebase to bootstrap the project.
>>>>>>>>
>>>>>>>> * In parallel, we began the Apache Arrow project, which is designed
>>>>>>>> to be a complementary library for file formats (like Parquet),
>>>>>>>> storage engines (like Kudu), and compute engines (like Impala and
>>>>>>>> pandas).
>>>>>>>>
>>>>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code
>>>>>>>> overlap crept up surrounding buffer memory management and IO
>>>>>>>> interfaces. We recently decided in PARQUET-818
>>>>>>>> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
>>>>>>>> to remove some of the obvious code overlap in Parquet and make
>>>>>>>> libarrow.a/so a hard compile- and link-time dependency for
>>>>>>>> libparquet.a/so.
>>>>>>>>
>>>>>>>> * There is still quite a bit of code in parquet-cpp that would better
>>>>>>>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>>>>>>>> compression, bit utilities, and so forth. Much of this code
>>>>>>>> originated from Impala.
>>>>>>>>
>>>>>>>> This brings me to the next set of points:
>>>>>>>>
>>>>>>>> * parquet-cpp contains quite a bit of code that was extracted from
>>>>>>>> Impala. This is mostly self-contained in
>>>>>>>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>>>>>>>>
>>>>>>>> * My understanding is that Kudu extracted certain computational
>>>>>>>> utilities from Impala in its early days, but these tools have likely
>>>>>>>> diverged as the needs of the projects have evolved.
>>>>>>>>
>>>>>>>> Since all of these projects are quite different in their end goals
>>>>>>>> (runtime systems vs. libraries), touching code that is tightly
>>>>>>>> coupled to either Kudu's or Impala's runtime is probably not worth
>>>>>>>> discussing. However, I think there is a strong basis for
>>>>>>>> collaboration on computational utilities and vectorized array
>>>>>>>> processing.
>>>>>>>> Some obvious areas that come to mind:
>>>>>>>>
>>>>>>>> * SIMD utilities (for hashing or processing of preallocated
>>>>>>>> contiguous memory)
>>>>>>>> * Array encoding utilities: RLE / dictionary, etc.
>>>>>>>> * Bit manipulation (packing and unpacking; e.g., Daniel Lemire
>>>>>>>> contributed a patch to parquet-cpp around this)
>>>>>>>> * Date and time utilities
>>>>>>>> * Compression utilities
>>>>>>>
>>>>>>> Between Kudu and Impala (at least) there are many more opportunities
>>>>>>> for sharing. Threads, logging, metrics, concurrency primitives - the
>>>>>>> list is quite long.
>>>>>>>
>>>>>>>> I hope the benefits are obvious: consolidating efforts on unit
>>>>>>>> testing, benchmarking, performance optimizations, continuous
>>>>>>>> integration, and platform compatibility.
>>>>>>>>
>>>>>>>> Logistically speaking, one possible avenue might be to use Apache
>>>>>>>> Arrow as the place to assemble this code. Its third-party toolchain
>>>>>>>> is small, and it builds and installs fast. It is intended as a
>>>>>>>> library to have its headers used and be linked against by other
>>>>>>>> applications. (As an aside, I'm very interested in building optional
>>>>>>>> support for Arrow columnar messages into the Kudu client.)
>>>>>>>
>>>>>>> In principle I'm in favour of code sharing, and it seems very much in
>>>>>>> keeping with the Apache way. However, practically speaking, I'm of the
>>>>>>> opinion that it only makes sense to house shared support code in a
>>>>>>> separate, dedicated project.
>>>>>>>
>>>>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the
>>>>>>> scope of sharing to utilities that Arrow is interested in. It would
>>>>>>> make no sense to add a threading library to Arrow if it was never used
>>>>>>> natively. Muddying the waters of the project's charter seems likely to
>>>>>>> lead to user, and developer, confusion.
>>>>>>> Similarly, we should not necessarily couple Arrow's design goals to
>>>>>>> those it inherits from Kudu's and Impala's source code.
>>>>>>>
>>>>>>> I think I'd rather see a new Apache project than re-use a current one
>>>>>>> for two independent purposes.
>>>>>>>
>>>>>>>> The downside of code sharing, which may have prevented it so far, is
>>>>>>>> the logistics of coordinating ASF release cycles and keeping build
>>>>>>>> toolchains in sync. It's taken us the past year to stabilize the
>>>>>>>> design of Arrow for its intended use cases, so at this point, if we
>>>>>>>> went down this road, I would be OK with helping the community commit
>>>>>>>> to a regular release cadence that would be faster than Impala's,
>>>>>>>> Kudu's, and Parquet's respective release cadences. Since members of
>>>>>>>> the Kudu and Impala PMCs are also on the Arrow PMC, I trust we would
>>>>>>>> be able to collaborate to each other's mutual benefit and success.
>>>>>>>>
>>>>>>>> Note that Arrow does not throw C++ exceptions and similarly follows
>>>>>>>> the Google C++ style guide to the same extent as Kudu and Impala do.
>>>>>>>>
>>>>>>>> If this is something that either the Kudu or Impala communities would
>>>>>>>> like to pursue in earnest, I would be happy to work with you on next
>>>>>>>> steps. I would suggest that we start with something small so that we
>>>>>>>> could address the necessary build toolchain changes, and develop a
>>>>>>>> workflow for moving around code and tests, a protocol for code
>>>>>>>> reviews (e.g. Gerrit), and a process for coordinating ASF releases.
>>>>>>>
>>>>>>> I think, if I'm reading this correctly, that you're assuming
>>>>>>> integration with the 'downstream' projects (e.g. Impala and Kudu)
>>>>>>> would be done via their toolchains.
>>>>>>> For something as fast-moving as utility code - and critical, where
>>>>>>> you want the latency between adding a fix and including it in your
>>>>>>> build to be ~0 - that's a non-starter to me, at least with how the
>>>>>>> toolchains are currently realised.
>>>>>>>
>>>>>>> I'd rather have the source code directly imported into Impala's tree -
>>>>>>> whether by git submodule or some other mechanism. That way the
>>>>>>> coupling is looser, and we can move more quickly. I think that's
>>>>>>> important to other projects as well.
>>>>>>>
>>>>>>> Henry
>>>>>>>
>>>>>>>> Let me know what you think.
>>>>>>>>
>>>>>>>> best
>>>>>>>> Wes
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>
>>>> --
>>>> Cheers,
>>>> Leif
>
> --
> Cheers,
> Leif