On Thu, Aug 12, 2021 at 3:16 PM Neal Richardson <neal.p.richard...@gmail.com> wrote:
>
> > Maintain this "Arrow types and ComputeIR library" as an always
> > zero-dependency library to facilitate vendoring
>
> Would/should this hypothetical zero-dep, vendorable library also include
> the IPC format? Or if you want to interact with IPC in that case, the C
> data interface is the best/only option?

No, to do anything with the IPC format would pull in arrow::Buffer,
arrow::Array, and many other inextricable components which are used with
the IPC read/write implementation.

> Or if you want to interact with IPC in that case, the C data interface
> is the best/only option?

I'm not clear on what you mean since the C data interface is only for
data interchange at function call sites in-process, and not for
serialization (interprocess).

> On Thu, Aug 12, 2021 at 9:06 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > It seems that one adjacent problem here is how to make it simpler for
> > third parties (especially ones that act as front end interfaces) to
> > build and serialize/deserialize the IR structures with some kind of
> > ready-to-go middleware library, written in a language like C++.
> >
> > To do that, one would need the equivalent of arrow/type.h and related
> > Flatbuffers schema serialization code that lives in arrow/ipc. If you
> > want to be able to completely and accurately serialize Schemas, you
> > need quite a bit of code now.
> >
> > One possible approach (and not go crazy) would be to:
> >
> > * Move arrow/type.h and its dependencies into a standalone C++
> > library that can be vendored into the main apache/arrow C++ library.
> > I don't know how onerous arrow/type.h's transitive dependencies /
> > interactions are at this point (there's a lot of stuff going on in
> > type.cc [1] now)
> > * Make the namespaces exported by this library configurable, so any
> > library can vendor the Arrow types / IR builder APIs privately into
> > their project
> > * Maintain this "Arrow types and ComputeIR library" as an always
> > zero-dependency library to facilitate vendoring
> > * Lightweight bindings in languages we care about (like Python or R or
> > GLib/Ruby) could be built to the IR builder middleware library
> >
> > This seems like what is more at issue compared with whether projects
> > are copying the Flatbuffers files out of their project from
> > apache/arrow or apache/arrow-format.
> >
> > [1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/type.cc
> >
> > On Thu, Aug 12, 2021 at 2:05 PM Andrew Lamb <al...@influxdata.com> wrote:
> > >
> > > I support the idea of an independent repo that has the arrow
> > > flatbuffers format definition files.
> > >
> > > My rationale is that the Rust implementation has a copy of the
> > > `format` directory [1] and potential drift worries me (a bit). Having
> > > a single source of truth for the format that is not part of the large
> > > mono repo would be a good thing.
> > >
> > > Andrew
> > >
> > > [1] https://github.com/apache/arrow-rs/tree/master/format
> > >
> > > On Wed, Aug 11, 2021 at 2:40 PM Phillip Cloud <cpcl...@gmail.com> wrote:
> > >
> > > > Hi all,
> > > >
> > > > I'd like to bring up an idea from a recent thread ([1]) about moving
> > > > the `format/` directory out of the primary apache/arrow repository.
> > > >
> > > > I understand from that thread there are some concerns about using
> > > > submodules, and I definitely sympathize with those concerns.
> > > >
> > > > In talking with David Li (disclaimer: we work together at Voltron
> > > > Data), he has a great idea that I think makes everyone happy: an
> > > > `apache/arrow-format` repository that is the official mirror for the
> > > > flatbuffers IDL, that library authors should use as the source of
> > > > truth.
> > > >
> > > > It doesn't require a submodule, yet it also allows external projects
> > > > the ability to access the IDL without having to interact with the
> > > > main arrow repository and is backwards compatible to boot.
> > > >
> > > > In this scenario, repositories that are currently copying in the
> > > > flatbuffers IDL can migrate to this repository at their leisure.
> > > >
> > > > My motivation for this was around sharing data structures for the
> > > > compute IR proposal ([2]).
> > > >
> > > > I can think of at least two ways for IR producers and consumers of
> > > > all languages to share the flatbuffers IDL:
> > > >
> > > > 1. A set of bindings built in some language that other languages can
> > > > integrate with, likely C++, that allows library users to build IR
> > > > using an API.
> > > >
> > > > The primary downside to this is that we'd have to deal with building
> > > > another library while working out any kinks in the IR design and I'd
> > > > rather avoid that in the initial phases of this project.
> > > >
> > > > The benefit is that IR components don't interact much with
> > > > `flatbuffers` or `flatc` directly.
> > > >
> > > > 2. A single location where the format lives, that doesn't require
> > > > depending on a large multi-language repository to access a handful
> > > > of files.
> > > >
> > > > I think the downside to this is that there's a bit of additional
> > > > infrastructure to automate copying in `arrow-format`.
> > > >
> > > > The benefit there is that producers and consumers can immediately
> > > > start getting value from compute IR without having to wait for
> > > > development of a new API.
> > > >
> > > > One counter-proposal might be to just put the compute IR IDL in a
> > > > separate repo, but that isn't tenable because the compute IR needs
> > > > arrow's type information contained in `Schema.fbs`.
> > > >
> > > > I was hoping to avoid conflating the discussion about bindings vs
> > > > direct flatbuffer usage (at least initially just supporting one, I
> > > > predict we'll need both ultimately) with the decision about whether
> > > > to split out the format directory, but it's a good example of a
> > > > choice for which splitting out the format directory would be
> > > > well-served.
> > > >
> > > > I'll note that this doesn't block anything on the compute IR side,
> > > > just wanted to surface this in a parallel thread and see what folks
> > > > think.
> > > >
> > > > [1]: https://lists.apache.org/thread.html/rcebfcb4c5d0b7752fcdda6587871c2f94661b8c4e35119f0bcfb883b%40%3Cdev.arrow.apache.org%3E
> > > > [2]: https://docs.google.com/document/d/1C_XVOG7iFkl6cgWWMyzUoIjfKt-X2UxqagPJrla0bAE/edit#heading=h.ie0ne0gm762l