Re: [DISCUSS] Splitting out the Arrow format directory

Wes McKinney Thu, 12 Aug 2021 06:23:49 -0700

On Thu, Aug 12, 2021 at 3:16 PM Neal Richardson
<neal.p.richard...@gmail.com> wrote:
>
> > Maintain this "Arrow types and ComputeIR library" as an always
> zero-dependency library to facilitate vendoring
>
> Would/should this hypothetical zero-dep, vendorable library also include
> the IPC format? Or if you want to interact with IPC in that case, the C
> data interface is the best/only option?


No, to do anything with the IPC format would pull in arrow::Buffer,
arrow::Array, and many other inextricable components which are used
with the IPC read/write implementation.

> Or if you want to interact with IPC in that case, the C data interface is the 
> best/only option?

I'm not clear on what you mean since the C data interface is only for
data interchange at function call sites in-process, and not for
serialization (interprocess).
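
To make the in-process point concrete, here is a minimal sketch of a schema handed across a function call with the C data interface. The `ArrowSchema` layout follows the published C data interface spec; `ExportInt32Field`, `ReleaseStaticSchema`, and `DescribeAndRelease` are hypothetical names made up for illustration and are not Arrow APIs.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Struct layout as defined by the Arrow C data interface specification.
extern "C" {
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  int64_t flags;
  int64_t n_children;
  struct ArrowSchema** children;
  struct ArrowSchema* dictionary;
  void (*release)(struct ArrowSchema*);
  void* private_data;
};
}

// Hypothetical release callback: the strings below are static literals, so
// nothing needs freeing; the struct only has to be marked as released.
static void ReleaseStaticSchema(struct ArrowSchema* schema) {
  schema->release = nullptr;
}

// Hypothetical producer: describes a nullable int32 field named "x".
void ExportInt32Field(struct ArrowSchema* out) {
  out->format = "i";  // "i" is the format string for int32
  out->name = "x";
  out->metadata = nullptr;
  out->flags = 2;  // value of ARROW_FLAG_NULLABLE in the spec
  out->n_children = 0;
  out->children = nullptr;
  out->dictionary = nullptr;
  out->release = &ReleaseStaticSchema;
  out->private_data = nullptr;
}

// Hypothetical consumer: reads the struct through raw pointers, then
// releases it. The pointers are only meaningful inside this process.
std::string DescribeAndRelease(struct ArrowSchema* schema) {
  assert(schema->release != nullptr);  // must not already be released
  std::string description = std::string(schema->name) + ":" + schema->format;
  schema->release(schema);
  return description;
}
```

Everything in the struct is a raw pointer or a function pointer, so nothing here can be written to a socket or a file as-is; that is the job of the IPC format.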

> On Thu, Aug 12, 2021 at 9:06 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > It seems that one adjacent problem here is how to make it simpler for
> > third parties (especially ones that act as front end interfaces) to
> > build and serialize/deserialize the IR structures with some kind of
> > ready-to-go middleware library, written in a language like C++.
> >
> > To do that, one would need the equivalent of arrow/type.h and related
> > Flatbuffers schema serialization code that lives in arrow/ipc. If you
> > want to be able to completely and accurately serialize Schemas, you
> > need quite a bit of code now.
> >
> > One possible approach (without going crazy) would be to:
> >
> > * Move arrow/type.h and its dependencies into a standalone C++
> > library that can be vendored into the main apache/arrow C++ library. I
> > don't know how onerous arrow/type.h's transitive dependencies /
> > interactions are at this point (there's a lot of stuff going on in
> > type.cc [1] now)
> > * Make the namespaces exported by this library configurable, so any
> > library can vendor the Arrow types / IR builder APIs privately into
> > their project
> > * Maintain this "Arrow types and ComputeIR library" as an always
> > zero-dependency library to facilitate vendoring
> > * Lightweight bindings in languages we care about (like Python or R or
> > GLib/Ruby) could be built to the IR builder middleware library
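
As a rough sketch of the configurable-namespace bullet above, one common approach is to let a preprocessor macro choose the namespace at build time. Every name here (`ARROW_TYPES_NAMESPACE`, `arrow_types`, `TypeId`, `TypeName`) is hypothetical and only for illustration; none of this exists in Arrow today.

```cpp
#include <cassert>
#include <string>

// Hypothetical macro: a vendoring project would define this before including
// the header (for example with -DARROW_TYPES_NAMESPACE=myproject_arrow) to
// keep its private copy from clashing with another copy in the same binary.
#ifndef ARROW_TYPES_NAMESPACE
#define ARROW_TYPES_NAMESPACE arrow_types
#endif

namespace ARROW_TYPES_NAMESPACE {

// Tiny stand-in for the type definitions such a library would export.
enum class TypeId { kInt32, kUtf8 };

inline std::string TypeName(TypeId id) {
  switch (id) {
    case TypeId::kInt32:
      return "int32";
    case TypeId::kUtf8:
      return "utf8";
  }
  assert(false && "unknown TypeId");
  return "";
}

}  // namespace ARROW_TYPES_NAMESPACE
```

With the default, callers would write `arrow_types::TypeName(...)`; a project that vendors the header under a different macro value gets the same symbols under its own namespace, provided the header has no external dependencies.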
> >
> > This seems to be more at issue than whether projects are copying the
> > Flatbuffers files into their project from apache/arrow or
> > apache/arrow-format.
> >
> > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
> >
> > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com> wrote:
> > >
> > > I support the idea of an independent repo that has the arrow flatbuffers
> > > format definition files.
> > >
> > > My rationale is that the Rust implementation has a copy of the `format`
> > > directory [1] and potential drift worries me (a bit). Having a single
> > > source of truth for the format that is not part of the large mono repo
> > > would be a good thing.
> > >
> > > Andrew
> > >
> > > [1] https://github.com/apache/arrow-rs/tree/master/format
> > >
> > > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to bring up an idea from a recent thread ([1]) about moving
> > > > the `format/` directory out of the primary apache/arrow repository.
> > > >
> > > > I understand from that thread there are some concerns about using
> > > > submodules, and I definitely sympathize with those concerns.
> > > >
> > > > In talking with David Li (disclaimer: we work together at Voltron
> > > > Data), he has a great idea that I think makes everyone happy: an
> > > > `apache/arrow-format` repository that is the official mirror for the
> > > > flatbuffers IDL, that library authors should use as the source of
> > > > truth.
> > > >
> > > > It doesn't require a submodule, yet it also allows external projects
> > > > the ability to access the IDL without having to interact with the
> > > > main arrow repository and is backwards compatible to boot.
> > > >
> > > > In this scenario, repositories that are currently copying in the
> > > > flatbuffers IDL can migrate to this repository at their leisure.
> > > >
> > > > My motivation for this was around sharing data structures for the
> > > > compute IR proposal ([2]).
> > > >
> > > > I can think of at least two ways for IR producers and consumers of all
> > > > languages to share the flatbuffers IDL:
> > > >
> > > > 1. A set of bindings built in some language that other languages can
> > > >    integrate with, likely C++, that allows library users to build IR
> > > >    using an API.
> > > >
> > > > The primary downside to this is that we'd have to deal with building
> > > > another library while working out any kinks in the IR design and I'd
> > > > rather avoid that in the initial phases of this project.
> > > >
> > > > The benefit is that IR components don't interact much with
> > > > `flatbuffers` or `flatc` directly.
> > > >
> > > > 2. A single location where the format lives, that doesn't require
> > > >    depending on a large multi-language repository to access a handful
> > > >    of files.
> > > >
> > > > I think the downside to this is that there's a bit of additional
> > > > infrastructure to automate copying in `arrow-format`.
> > > >
> > > > The benefit there is that producers and consumers can immediately
> > > > start getting value from compute IR without having to wait for
> > > > development of a new API.
> > > >
> > > > One counter-proposal might be to just put the compute IR IDL in a
> > > > separate repo, but that isn't tenable because the compute IR needs
> > > > arrow's type information contained in `Schema.fbs`.
> > > >
> > > > I was hoping to avoid conflating the discussion about bindings vs
> > > > direct flatbuffer usage (at least initially just supporting one, I
> > > > predict we'll need both ultimately) with the decision about whether
> > > > to split out the format directory, but it's a good example of a
> > > > choice that would be well served by splitting out the format
> > > > directory.
> > > >
> > > > I'll note that this doesn't block anything on the compute IR side,
> > > > just wanted to surface this in a parallel thread and see what folks
> > > > think.
> > > >
> > > > [1]: https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> > > > [2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l
> > > >
> >
