Yes. Since an Apache project is a community, it is very easy for it to produce more than one piece of code.
> On Feb 27, 2017, at 10:34 AM, Leif Walsh <leif.wa...@gmail.com> wrote:
>
> Julian, are you proposing the Arrow project ship two artifacts,
> arrow-common and arrow, where arrow depends on arrow-common?
>
> On Mon, Feb 27, 2017 at 11:51 Julian Hyde <jh...@apache.org> wrote:
>
>> “Commons” projects are often problematic. It is difficult to tell what is
>> in scope and out of scope. If the scope is drawn too wide, there is a real
>> problem of orphaned features, because people contribute one feature and
>> then disappear.
>>
>> Let’s remember the Apache mantra: community over code. If you create a
>> sustainable community, the code will get looked after. Would this project
>> form a new community, or just a new piece of code? As I read the current
>> proposal, it would be the intersection of some existing communities, not a
>> new community.
>>
>> I think it would take a considerable effort to create a new project and
>> community around the idea of “C++ commons” (or is it “database-related C++
>> commons”?). I think you already have such a community, to a first
>> approximation, in the Arrow project, because Kudu and Impala developers are
>> already part of the Arrow community. There’s no reason why Arrow cannot
>> contain new modules that have different release schedules than the rest of
>> Arrow. As a TLP, releases are less burdensome, and can happen in a little
>> over 3 days if the component is kept stable.
>>
>> Lastly, the code is fungible. It can be marked “experimental” within Arrow
>> and moved to another project, or into a new project, as it matures. The
>> Apache license and the ASF CLA make this very easy. We are doing something
>> like this in Calcite: the Avatica sub-project [1] has a community that
>> intersects with Calcite's, is disconnected at a code level, and may over
>> time evolve into a separate project. In the meantime, being part of an
>> established project is helpful, because there are PMC members to vote.
>>
>> Julian
>>
>> [1] https://calcite.apache.org/avatica/
>
>>> On Feb 27, 2017, at 6:41 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> Responding to Todd's e-mail:
>>>
>>> 1) Open source release model
>>>
>>> My expectation is that this library would release about once a month,
>>> with occasional faster releases for critical fixes.
>>>
>>> 2) Governance/review model
>>>
>>> Beyond having centralized code reviews, it's hard to predict how the
>>> governance would play out. I understand that OSS projects behave
>>> differently in their planning / design / review process, so work on a
>>> common need may require more of a negotiation than the prior
>>> "unilateral" process.
>>>
>>> I think it says something for our communities that we would make a
>>> commitment in our collaboration on this to the success of the
>>> "consumer" projects. So if the Arrow or Parquet communities were
>>> contemplating a change that might impact Kudu, for example, it would
>>> be in our best interest to be careful and communicate proactively.
>>>
>>> This all makes sense. From an Arrow and Parquet perspective, we do not
>>> add very much testing burden because our continuous integration suites
>>> do not take long to run.
>>>
>>> 3) Pre-commit/test mechanics
>>>
>>> One thing that would help would be community-maintained
>>> Dockerfiles/Docker images (or equivalent) to assist with validation
>>> and testing for developers.
>>>
>>> I am happy to comply with a pre-commit testing protocol that works for
>>> the Kudu and Impala teams.
>>>
>>> 4) Integration mechanics for breaking changes
>>>
>>>> One option is that each "user" of the libraries manually "rolls" to new
>>>> versions when they feel like it, but there's still now a case where a
>>>> common change "pushes work onto" the consumers to update call sites, etc.
>>>
>>> Breaking API changes will create extra work, because any automated
>>> testing that we create will not be able to validate the patch to the
>>> common library. Perhaps we can configure a manual way (in Jenkins,
>>> say) to test two patches together.
>>>
>>> In the event that a community member has a patch containing an API
>>> break that impacts a project that they are not a contributor to,
>>> there should be some expectation to either work with the affected
>>> project on a coordinated patch or obtain their +1 to merge the patch,
>>> even though it may require a follow-up patch if the roll-forward
>>> in the consumer project exposes bugs in the common library. There may
>>> be situations like:
>>>
>>> * Kudu changes API in $COMMON that impacts Arrow
>>> * Arrow says +1, we will roll forward $COMMON later
>>> * Patch merged
>>> * Arrow rolls forward, discovers bug caused by patch in $COMMON
>>> * Arrow proposes patch to $COMMON
>>> * ...
>>>
>>> This is the worst-case scenario, of course, but I actually think it is
>>> good because it would indicate that the unit testing in $COMMON needs
>>> to be improved. Unit testing in the common library, therefore, would
>>> take on more of a "defensive" quality than it does currently.
>>>
>>> In any case, I'm keen to move forward with coming up with a concrete
>>> plan if we can reach consensus on the particulars.
>>>
>>> Thanks
>>> Wes
>>>
>>> On Sun, Feb 26, 2017 at 10:18 PM, Leif Walsh <leif.wa...@gmail.com> wrote:
>>>> I also support the idea of creating an "Apache Commons modern C++"-style
>>>> library, maybe tailored toward the needs of columnar data processing
>>>> tools. I think APR is the wrong project, but I think that *style* of
>>>> project is the right direction to aim for.
>>>>
>>>> I agree this adds test and release process complexity across products,
>>>> but I think the benefits of a shared, well-tested library outweigh that,
>>>> and creating such test infrastructure will have long-term benefits as well.
>>>>
>>>> I'd be happy to lend a hand wherever it's needed.
>>>>
>>>> On Sun, Feb 26, 2017 at 4:03 PM Todd Lipcon <t...@cloudera.com> wrote:
>>>>
>>>>> Hey folks,
>>>>>
>>>>> As Henry mentioned, Impala is starting to share more code with Kudu (most
>>>>> notably our RPC system, but that pulls in a fair bit of utility code as
>>>>> well), so we've been chatting periodically offline about the best way to
>>>>> do this. Having more projects potentially interested in collaborating is
>>>>> definitely welcome, though I think it does also increase the complexity
>>>>> of whatever solution we come up with.
>>>>>
>>>>> I think the potential benefits of collaboration are fairly self-evident,
>>>>> so I'll focus on my concerns here, which somewhat echo Henry's.
>>>>>
>>>>> 1) Open source release model
>>>>>
>>>>> The ASF is very much against having projects which do not do releases.
>>>>> So, if we were to create some new ASF project to hold this code, we'd be
>>>>> expected to do frequent releases thereof. Wes volunteered above to lead
>>>>> frequent releases, but we actually need at least 3 PMC members to vote on
>>>>> each release, and given that people can come and go, we'd probably need
>>>>> at least 5-8 people who are actively committed to helping with the
>>>>> release process of this "commons" project.
>>>>>
>>>>> Unlike our existing projects, which seem to release every 2-3 months, if
>>>>> that, I think this one would have to release _much_ more frequently, if
>>>>> we expect downstream projects to depend on released versions rather than
>>>>> just pulling in some recent (or even trunk) git hash. Since the ASF
>>>>> requires the normal voting period and process for every release, I don't
>>>>> think we could do something like "daily automatic releases", etc.
>>>>>
>>>>> We could probably campaign the ASF membership to treat this project
>>>>> differently, either as (a) a repository of code that never releases, in
>>>>> which case the "downstream" projects are responsible for vetting IP,
>>>>> etc., as part of their own release processes, or (b) a project which does
>>>>> automatic releases voted upon by robots. I'm guessing that (a) is more
>>>>> palatable from an IP perspective, and also from the perspective of the
>>>>> downstream projects.
>>>>>
>>>>> 2) Governance/review model
>>>>>
>>>>> The more projects there are sharing this common code, the more difficult
>>>>> it is to know whether a change would break something, or even whether a
>>>>> change is considered desirable for all of the projects. I don't want to
>>>>> get into some world where any change to a central library requires a
>>>>> multi-week proposal/design-doc/review across 3+ different groups of
>>>>> committers, all of whom may have different near-term priorities. On the
>>>>> other hand, it would be pretty frustrating if, the week before we're
>>>>> trying to cut a Kudu release branch, someone in another community decides
>>>>> to make a potentially destabilizing change to the RPC library.
>>>>>
>>>>> 3) Pre-commit/test mechanics
>>>>>
>>>>> Semi-related to the above: we currently feel pretty confident when we
>>>>> make a change to a central library like kudu/util/thread.cc that nothing
>>>>> broke, because we run the full suite of Kudu tests. Of course the central
>>>>> libraries have some unit test coverage, but I wouldn't be confident with
>>>>> any sort of model where shared code can change without verification by a
>>>>> larger suite of tests.
>>>>>
>>>>> On the other hand, I also don't want to move to a model where any change
>>>>> to shared code requires a 6+-hour precommit spanning several projects,
>>>>> each of which may have its own set of potentially flaky pre-commit tests,
>>>>> etc.
>>>>> I can imagine that if an Arrow developer made some change to "thread.cc"
>>>>> and saw that TabletServerStressTest failed their precommit, they'd have
>>>>> no idea how to triage it, etc. That could be a strong disincentive to
>>>>> continued innovation in these areas of common code, which we'll need a
>>>>> good way to avoid.
>>>>>
>>>>> I think some of the above could be ameliorated with really good
>>>>> infrastructure -- e.g., on a test failure, automatically re-run the
>>>>> failed test on both pre-patch and post-patch builds and do a t-test to
>>>>> check for a statistically significant change in flakiness level, etc.
>>>>> But that's a lot of infrastructure that doesn't currently exist.
>>>>>
>>>>> 4) Integration mechanics for breaking changes
>>>>>
>>>>> Currently these common libraries are treated as components of monolithic
>>>>> projects. That means it's no extra overhead for us to make some kind of
>>>>> change which breaks an API in src/kudu/util/ and at the same time updates
>>>>> all call sites. The internal libraries have no semblance of API
>>>>> compatibility guarantees, etc., and adding one is not without cost.
>>>>>
>>>>> Before sharing code, we should figure out how exactly we'll manage the
>>>>> cases where we want to make some change in a common library that breaks
>>>>> an API used by other projects, given there's no way to make an atomic
>>>>> commit across many repositories. One option is that each "user" of the
>>>>> libraries manually "rolls" to new versions when they feel like it, but
>>>>> there's still now a case where a common change "pushes work onto" the
>>>>> consumers to update call sites, etc.
>>>>>
>>>>> Admittedly, the number of breaking API changes in these common libraries
>>>>> is relatively small, but it would still be good to understand how we
>>>>> would plan to manage them.
>>>>>
>>>>> -Todd
>>>>>
>>>>> On Sun, Feb 26, 2017 at 10:12 AM, Wes McKinney <wesmck...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> hi Henry,
>>>>>>
>>>>>> Thank you for these comments.
>>>>>>
>>>>>> I think having a kind of "Apache Commons for [Modern] C++" would be an
>>>>>> ideal (though perhaps initially more labor-intensive) solution.
>>>>>> There's code in Arrow that I would move into this project if it
>>>>>> existed. I am happy to help make this happen if there is interest from
>>>>>> the Kudu and Impala communities. I am not sure logistically what would
>>>>>> be the most expedient way to establish the project, whether as an ASF
>>>>>> Incubator project or possibly as a new TLP that could be created by
>>>>>> spinning IP out of Apache Kudu.
>>>>>>
>>>>>> I'm interested to hear the opinions of others, and possible next steps.
>>>>>>
>>>>>> Thanks
>>>>>> Wes
>>>>>>
>>>>>> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org>
>>>>>> wrote:
>>>>>>> Thanks for bringing this up, Wes.
>>>>>>>
>>>>>>> On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Dear Apache Kudu and Apache Impala (incubating) communities,
>>>>>>>>
>>>>>>>> (I'm not sure of the best way to have a cross-list discussion, so I
>>>>>>>> apologize if this does not work well.)
>>>>>>>>
>>>>>>>> On the recent Apache Parquet sync call, we discussed C++ code sharing
>>>>>>>> between the codebases in Apache Arrow and Apache Parquet, and
>>>>>>>> opportunities for more code sharing with Kudu and Impala as well.
>>>>>>>>
>>>>>>>> As context:
>>>>>>>>
>>>>>>>> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
>>>>>>>> first C++ release within Apache Parquet. I got involved with this
>>>>>>>> project a little over a year ago and was faced with the unpleasant
>>>>>>>> decision to copy and paste a significant amount of code out of
>>>>>>>> Impala's codebase to bootstrap the project.
>>>>>>>>
>>>>>>>> * In parallel, we began the Apache Arrow project, which is designed
>>>>>>>> to be a complementary library for file formats (like Parquet),
>>>>>>>> storage engines (like Kudu), and compute engines (like Impala and
>>>>>>>> pandas).
>>>>>>>>
>>>>>>>> * As Arrow and parquet-cpp matured, an increasing amount of code
>>>>>>>> overlap crept up surrounding buffer memory management and IO
>>>>>>>> interfaces. We recently decided in PARQUET-818
>>>>>>>> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
>>>>>>>> to remove some of the obvious code overlap in Parquet and make
>>>>>>>> libarrow.a/so a hard compile- and link-time dependency for
>>>>>>>> libparquet.a/so.
>>>>>>>>
>>>>>>>> * There is still quite a bit of code in parquet-cpp that would better
>>>>>>>> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
>>>>>>>> compression, bit utilities, and so forth. Much of this code
>>>>>>>> originated from Impala.
>>>>>>>>
>>>>>>>> This brings me to the next set of points:
>>>>>>>>
>>>>>>>> * parquet-cpp contains quite a bit of code that was extracted from
>>>>>>>> Impala. This is mostly self-contained in
>>>>>>>> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
>>>>>>>>
>>>>>>>> * My understanding is that Kudu extracted certain computational
>>>>>>>> utilities from Impala in its early days, but these tools have likely
>>>>>>>> diverged as the needs of the projects have evolved.
>>>>>>>>
>>>>>>>> Since all of these projects are quite different in their end goals
>>>>>>>> (runtime systems vs. libraries), touching code that is tightly
>>>>>>>> coupled to either Kudu's or Impala's runtime is probably not worth
>>>>>>>> discussing. However, I think there is a strong basis for
>>>>>>>> collaboration on computational utilities and vectorized array
>>>>>>>> processing.
>>>>>>>> Some obvious areas that come to mind:
>>>>>>>>
>>>>>>>> * SIMD utilities (for hashing or processing of preallocated
>>>>>>>> contiguous memory)
>>>>>>>> * Array encoding utilities: RLE / dictionary, etc.
>>>>>>>> * Bit manipulation (packing and unpacking; e.g., Daniel Lemire
>>>>>>>> contributed a patch to parquet-cpp around this)
>>>>>>>> * Date and time utilities
>>>>>>>> * Compression utilities
>>>>>>>
>>>>>>> Between Kudu and Impala (at least) there are many more opportunities
>>>>>>> for sharing. Threads, logging, metrics, concurrency primitives - the
>>>>>>> list is quite long.
>>>>>>>
>>>>>>>> I hope the benefits are obvious: consolidating efforts on unit
>>>>>>>> testing, benchmarking, performance optimizations, continuous
>>>>>>>> integration, and platform compatibility.
>>>>>>>>
>>>>>>>> Logistically speaking, one possible avenue might be to use Apache
>>>>>>>> Arrow as the place to assemble this code. Its third-party toolchain
>>>>>>>> is small, and it builds and installs fast. It is intended as a
>>>>>>>> library to have its headers used and be linked against by other
>>>>>>>> applications. (As an aside, I'm very interested in building optional
>>>>>>>> support for Arrow columnar messages into the Kudu client.)
>>>>>>>
>>>>>>> In principle I'm in favour of code sharing, and it seems very much in
>>>>>>> keeping with the Apache way. However, practically speaking, I'm of the
>>>>>>> opinion that it only makes sense to house shared support code in a
>>>>>>> separate, dedicated project.
>>>>>>>
>>>>>>> Embedding the shared libraries in, e.g., Arrow naturally limits the
>>>>>>> scope of sharing to utilities that Arrow is interested in. It would
>>>>>>> make no sense to add a threading library to Arrow if it was never used
>>>>>>> natively. Muddying the waters of the project's charter seems likely to
>>>>>>> lead to user, and developer, confusion.
>>>>>>> Similarly, we should not necessarily couple Arrow's design goals to
>>>>>>> those it inherits from Kudu's and Impala's source code.
>>>>>>>
>>>>>>> I think I'd rather see a new Apache project than re-use a current one
>>>>>>> for two independent purposes.
>>>>>>>
>>>>>>>> The downside of code sharing, which may have prevented it so far, is
>>>>>>>> the logistics of coordinating ASF release cycles and keeping build
>>>>>>>> toolchains in sync. It's taken us the past year to stabilize the
>>>>>>>> design of Arrow for its intended use cases, so at this point, if we
>>>>>>>> went down this road, I would be OK with helping the community commit
>>>>>>>> to a regular release cadence that would be faster than Impala's,
>>>>>>>> Kudu's, and Parquet's respective release cadences. Since members of
>>>>>>>> the Kudu and Impala PMCs are also on the Arrow PMC, I trust we would
>>>>>>>> be able to collaborate to each other's mutual benefit and success.
>>>>>>>>
>>>>>>>> Note that Arrow does not throw C++ exceptions and similarly follows
>>>>>>>> the Google C++ style guide to the same extent as Kudu and Impala do.
>>>>>>>>
>>>>>>>> If this is something that either the Kudu or Impala communities would
>>>>>>>> like to pursue in earnest, I would be happy to work with you on next
>>>>>>>> steps. I would suggest that we start with something small so that we
>>>>>>>> could address the necessary build toolchain changes, and develop a
>>>>>>>> workflow for moving around code and tests, a protocol for code
>>>>>>>> reviews (e.g. Gerrit), and a process for coordinating ASF releases.
>>>>>>>
>>>>>>> I think, if I'm reading this correctly, that you're assuming
>>>>>>> integration with the 'downstream' projects (e.g. Impala and Kudu)
>>>>>>> would be done via their toolchains.
>>>>>>> For something as fast-moving as utility code - and critical, where
>>>>>>> you want the latency between adding a fix and including it in your
>>>>>>> build to be ~0 - that's a non-starter to me, at least with how the
>>>>>>> toolchains are currently realised.
>>>>>>>
>>>>>>> I'd rather have the source code directly imported into Impala's tree -
>>>>>>> whether by git submodule or some other mechanism. That way the
>>>>>>> coupling is looser, and we can move more quickly. I think that's
>>>>>>> important to other projects as well.
>>>>>>>
>>>>>>> Henry
>>>>>>>
>>>>>>>> Let me know what you think.
>>>>>>>>
>>>>>>>> best
>>>>>>>> Wes
>>>>>
>>>>> --
>>>>> Todd Lipcon
>>>>> Software Engineer, Cloudera
>>>>
>>>> --
>>>> Cheers,
>>>> Leif
>
> --
> Cheers,
> Leif