Re: [DISCUSS] Splitting out the Arrow format directory

Jorge Cardoso Leitão Thu, 12 Aug 2021 10:03:13 -0700

I agree with Antoine that we should weigh the pros and cons of flatbuffers
(or protobuf or thrift for that matter) against a more human-friendly,
simpler format like JSON or MsgPack. I also struggle a bit to reason about
the complexity of using flatbuffers for this.


E.g. there is no async support for thrift, flatbuffers, or protobuf in
Rust, which means that we currently can't read either Parquet or Arrow IPC
asynchronously. These problems are usually easier to work around in simpler
formats.
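
To illustrate why a simpler format is easier to handle asynchronously, here
is a minimal sketch in Python using only the standard library. The framing
(a 4-byte little-endian length followed by a JSON payload), the function
names, and the example schema layout are all assumptions made up for this
illustration; none of them is part of Arrow or of any proposal in this thread.

```python
import asyncio
import json
import struct


def encode_message(schema: dict) -> bytes:
    """Frame a JSON document as: 4-byte little-endian length + UTF-8 payload.

    The framing is hypothetical and chosen only for this sketch.
    """
    payload = json.dumps(schema).encode("utf-8")
    return struct.pack("<I", len(payload)) + payload


async def read_message(reader: asyncio.StreamReader) -> dict:
    """Read one framed JSON message without blocking the event loop."""
    header = await reader.readexactly(4)
    (length,) = struct.unpack("<I", header)
    payload = await reader.readexactly(length)
    return json.loads(payload)


async def roundtrip(schema: dict) -> dict:
    """Feed an encoded message into an in-memory stream and read it back."""
    reader = asyncio.StreamReader()
    reader.feed_data(encode_message(schema))
    reader.feed_eof()
    return await read_message(reader)


# Hypothetical schema description, not the Arrow JSON integration format.
EXAMPLE_SCHEMA = {
    "fields": [{"name": "id", "type": "int64", "nullable": False}]
}
```

Running `asyncio.run(roundtrip(EXAMPLE_SCHEMA))` should hand back a dictionary
equal to `EXAMPLE_SCHEMA`. The point is only that the decoder is an ordinary
library call on a byte buffer, so wrapping it in async I/O needs no support
from a code generator.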

Best,
Jorge



On Thu, Aug 12, 2021 at 2:43 PM Antoine Pitrou <[email protected]> wrote:

>
> On 12/08/2021 at 15:05, Wes McKinney wrote:
> > It seems that one adjacent problem here is how to make it simpler for
> > third parties (especially ones that act as front end interfaces) to
> > build and serialize/deserialize the IR structures with some kind of
> > ready-to-go middleware library, written in a language like C++.
>
> A C++ library sounds a bit complicated to deal with for Java, Rust, Go,
> etc. developers.
>
> I'm not sure which design decision and set of compromises would make the
> most sense.  But this is why I'm asking the question "why not JSON?" (+
> JSON-Schema if you want to ease validation by third parties).
>
> (note I have already mentioned MsgPack, but only in the case a binary
> encoding is really required; it doesn't have any other advantage that I
> know of over JSON, and it's less ubiquitous)
>
> Regards
>
> Antoine.
>
>
> > To do that, one would need the equivalent of arrow/type.h and related
> > Flatbuffers schema serialization code that lives in arrow/ipc. If you
> > want to be able to completely and accurately serialize Schemas, you
> > need quite a bit of code now.
> >
> > One possible approach (and not go crazy) would be to:
> >
> > * Move arrow/type.h and its dependencies into a standalone C++
> > library that can be vendored into the main apache/arrow C++ library. I
> > don't know how onerous arrow/type.h's transitive dependencies /
> > interactions are at this point (there's a lot of stuff going on in
> > type.cc [1] now)
> > * Make the namespaces exported by this library configurable, so any
> > library can vendor the Arrow types / IR builder APIs privately into
> > their project
> > * Maintain this "Arrow types and ComputeIR library" as an always
> > zero-dependency library to facilitate vendoring
> > * Lightweight bindings in languages we care about (like Python or R or
> > GLib/Ruby) could be built to the IR builder middleware library
> >
> > This seems to be more at issue than whether projects are copying
> > the Flatbuffers files into their project from apache/arrow or
> > apache/arrow-format.
> >
> > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
> >
> > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb <[email protected]> wrote:
> >>
> >> I support the idea of an independent repo that has the arrow flatbuffers
> >> format definition files.
> >>
> >> My rationale is that the Rust implementation has a copy of the `format`
> >> directory [1] and potential drift worries me (a bit). Having a single
> >> source of truth for the format that is not part of the large mono repo
> >> would be a good thing.
> >>
> >> Andrew
> >>
> >> [1] https://github.com/apache/arrow-rs/tree/master/format
> >>
> >> On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud <[email protected]> wrote:
> >>
> >>> Hi all,
> >>>
> >>> I'd like to bring up an idea from a recent thread ([1]) about moving the
> >>> `format/` directory out of the primary apache/arrow repository.
> >>>
> >>> I understand from that thread there are some concerns about using
> >>> submodules, and I definitely sympathize with those concerns.
> >>>
> >>> In talking with David Li (disclaimer: we work together at Voltron Data),
> >>> he has a great idea that I think makes everyone happy: an
> >>> `apache/arrow-format` repository that is the official mirror for the
> >>> flatbuffers IDL, that library authors should use as the source of truth.
> >>>
> >>> It doesn't require a submodule, yet it also allows external projects the
> >>> ability to access the IDL without having to interact with the main arrow
> >>> repository and is backwards compatible to boot.
> >>>
> >>> In this scenario, repositories that are currently copying in the
> >>> flatbuffers IDL can migrate to this repository at their leisure.
> >>>
> >>> My motivation for this was around sharing data structures for the
> >>> compute IR proposal ([2]).
> >>>
> >>> I can think of at least two ways for IR producers and consumers of all
> >>> languages to share the flatbuffers IDL:
> >>>
> >>> 1. A set of bindings built in some language that other languages can
> >>>    integrate with, likely C++, that allows library users to build IR
> >>>    using an API.
> >>>
> >>> The primary downside to this is that we'd have to deal with
> >>> building another library while working out any kinks in the IR design,
> >>> and I'd rather avoid that in the initial phases of this project.
> >>>
> >>> The benefit is that IR components don't interact much with `flatbuffers`
> >>> or `flatc` directly.
> >>>
> >>> 2. A single location where the format lives, that doesn't require
> >>>    depending on a large multi-language repository to access a handful
> >>>    of files.
> >>>
> >>> I think the downside to this is that there's a bit of additional
> >>> infrastructure to automate copying in `arrow-format`.
> >>>
> >>> The benefit there is that producers and consumers can immediately start
> >>> getting value from compute IR without having to wait for development of
> >>> a new API.
> >>>
> >>> One counter-proposal might be to just put the compute IR IDL in a
> >>> separate repo, but that isn't tenable because the compute IR needs
> >>> arrow's type information contained in `Schema.fbs`.
> >>>
> >>> I was hoping to avoid conflating the discussion about bindings vs direct
> >>> flatbuffer usage (at least initially just supporting one, I predict
> >>> we'll need both ultimately) with the decision about whether to split out
> >>> the format directory, but it's a good example of a choice that would be
> >>> well served by splitting out the format directory.
> >>>
> >>> I'll note that this doesn't block anything on the compute IR side, just
> >>> wanted to surface this in a parallel thread and see what folks think.
> >>>
> >>> [1]: https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> >>> [2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l
> >>>
>
