Re: [DISCUSS] C++ code sharing amongst Apache {Arrow, Kudu, Impala, Parquet}

Miki Tebeka Sun, 26 Feb 2017 10:58:53 -0800

Can't some (most) of it be added to APR <https://apr.apache.org/>?


On Sun, Feb 26, 2017 at 8:12 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Henry,
>
> Thank you for these comments.
>
> I think having a kind of "Apache Commons for [Modern] C++" would be an
> ideal (though perhaps initially more labor intensive) solution.
> There's code in Arrow that I would move into this project if it
> existed. I am happy to help make this happen if there is interest from
> the Kudu and Impala communities. I am not sure logistically what would
> be the most expedient way to establish the project, whether as an ASF
> Incubator project or possibly as a new TLP that could be created by
> spinning IP out of Apache Kudu.
>
> I'm interested to hear the opinions of others, and possible next steps.
>
> Thanks
> Wes
>
> On Sun, Feb 26, 2017 at 2:12 AM, Henry Robinson <he...@apache.org> wrote:
> > Thanks for bringing this up, Wes.
> >
> > On 25 February 2017 at 14:18, Wes McKinney <wesmck...@gmail.com> wrote:
> >
> >> Dear Apache Kudu and Apache Impala (incubating) communities,
> >>
> >> (I'm not sure the best way to have a cross-list discussion, so I
> >> apologize if this does not work well)
> >>
> >> On the recent Apache Parquet sync call, we discussed C++ code sharing
> >> between the codebases in Apache Arrow and Apache Parquet, and
> >> opportunities for more code sharing with Kudu and Impala as well.
> >>
> >> As context:
> >>
> >> * We have an RC out for the 1.0.0 release of apache-parquet-cpp, the
> >> first C++ release within Apache Parquet. I got involved with this
> >> project a little over a year ago and was faced with the unpleasant
> >> decision to copy and paste a significant amount of code out of
> >> Impala's codebase to bootstrap the project.
> >>
> >> * In parallel, we began the Apache Arrow project, which is designed to
> >> be a complementary library for file formats (like Parquet), storage
> >> engines (like Kudu), and compute engines (like Impala and pandas).
> >>
> >> * As Arrow and parquet-cpp matured, an increasing amount of code
> >> overlap crept in around buffer memory management and IO
> >> interfaces. We recently decided in PARQUET-818
> >> (https://github.com/apache/parquet-cpp/commit/2154e873d5aa7280314189a2683fb1e12a590c02)
> >> to remove some of the obvious code overlap in Parquet and make
> >> libarrow.a/so a hard compile and link-time dependency for
> >> libparquet.a/so.
> >>
> >> * There is still quite a bit of code in parquet-cpp that would better
> >> fit in Arrow: SIMD hash utilities, RLE encoding, dictionary encoding,
> >> compression, bit utilities, and so forth. Much of this code originated
> >> from Impala.
> >>
> >> This brings me to a next set of points:
> >>
> >> * parquet-cpp contains quite a bit of code that was extracted from
> >> Impala. This is mostly self-contained in
> >> https://github.com/apache/parquet-cpp/tree/master/src/parquet/util
> >>
> >> * My understanding is that Kudu extracted certain computational
> >> utilities from Impala in its early days, but these tools have likely
> >> diverged as the needs of the projects have evolved.
> >>
> >> Since all of these projects are quite different in their end goals
> >> (runtime systems vs. libraries), touching code that is tightly coupled
> >> to either Kudu or Impala's runtimes is probably not worth discussing.
> >> However, I think there is a strong basis for collaboration on
> >> computational utilities and vectorized array processing. Some obvious
> >> areas that come to mind:
> >>
> >> * SIMD utilities (for hashing or processing of preallocated contiguous
> >> memory)
> >> * Array encoding utilities: RLE / Dictionary, etc.
> >> * Bit manipulation (packing and unpacking, e.g. Daniel Lemire
> >> contributed a patch to parquet-cpp around this)
> >> * Date and time utilities
> >> * Compression utilities
> >>
> >
> > Between Kudu and Impala (at least) there are many more opportunities for
> > sharing. Threads, logging, metrics, concurrency primitives - the list is
> > quite long.
> >
> >
> >>
> >> I hope the benefits are obvious: consolidating efforts on unit
> >> testing, benchmarking, performance optimizations, continuous
> >> integration, and platform compatibility.
> >>
> >> Logistically speaking, one possible avenue might be to use Apache
> >> Arrow as the place to assemble this code. Its thirdparty toolchain is
> >> small, and it builds and installs fast. It is intended as a library to
> >> have its headers used and linked against other applications. (As an
> >> aside, I'm very interested in building optional support for Arrow
> >> columnar messages into the Kudu client).
> >>
> >
> > In principle I'm in favour of code sharing, and it seems very much in
> > keeping with the Apache way. However, practically speaking I'm of the
> > opinion that it only makes sense to house shared support code in a
> > separate, dedicated project.
> >
> > Embedding the shared libraries in, e.g., Arrow naturally limits the scope
> > of sharing to utilities that Arrow is interested in. It would make no
> > sense to add a threading library to Arrow if it was never used natively.
> > Muddying the waters of the project's charter seems likely to lead to user,
> > and developer, confusion. Similarly, we should not necessarily couple
> > Arrow's design goals to those it inherits from Kudu and Impala's source
> > code.
> >
> > I think I'd rather see a new Apache project than re-use a current one for
> > two independent purposes.
> >
> >
> >>
> >> The downside of code sharing, which may have prevented it so far, is
> >> the logistics of coordinating ASF release cycles and keeping build
> >> toolchains in sync. It's taken us the past year to stabilize the
> >> design of Arrow for its intended use cases, so at this point if we
> >> went down this road I would be OK with helping the community commit to
> >> a regular release cadence that would be faster than Impala, Kudu, and
> >> Parquet's respective release cadences. Since members of the Kudu and
> >> Impala PMC are also on the Arrow PMC, I trust we would be able to
> >> collaborate to each other's mutual benefit and success.
> >>
> >> Note that Arrow does not throw C++ exceptions and similarly follows
> >> the Google C++ style guide to the same extent as Kudu and Impala.
> >>
> >> If this is something that either the Kudu or Impala communities would
> >> like to pursue in earnest, I would be happy to work with you on next
> >> steps. I would suggest that we start with something small so that we
> >> could address the necessary build toolchain changes, and develop a
> >> workflow for moving around code and tests, a protocol for code reviews
> >> (e.g. Gerrit), and coordinating ASF releases.
> >>
> >
> > I think, if I'm reading this correctly, that you're assuming integration
> > with the 'downstream' projects (e.g. Impala and Kudu) would be done via
> > their toolchains. For something as fast moving as utility code - and
> > critical, where you want the latency between adding a fix and including
> > it in your build to be ~0 - that's a non-starter to me, at least with how
> > the toolchains are currently realised.
> >
> > I'd rather have the source code directly imported into Impala's tree -
> > whether by git submodule or other mechanism. That way the coupling is
> > looser, and we can move more quickly. I think that's important to other
> > projects as well.
> >
> > Henry
> >
> >
> >
> >>
> >> Let me know what you think.
> >>
> >> best
> >> Wes
> >>
>
