On Tue, Oct 8, 2019 at 3:34 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Jacques,
>
> On Tue, Oct 8, 2019 at 1:54 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >
> > I am removing all my objections to this work.
> >
> > I wish there was more feedback from additional community members. I continue to be concerned about fragmentation. I don't agree with the arguments here that we need to add a new API to make it easy for people to *not* use the Arrow codebase. It seems like a punt on building useful libraries within the project that will ultimately hurt the interoperability story.
> >
>
> I think we'll have to take a "wait and see" approach. I believe the community needs to build accessible libraries that offer value to third party users, and we will continue to do that. I think there are use cases here that fall outside of which library to use, but time will tell.
>
> > As a side note, it seems like much of this is about people's distaste for flatbuffers. I know I regret using it. If we had a chance to do it over again, I would have chosen to use protobuf for everything except the data header, where I would hand write the encoding (since it is so simple anyway). If it is such a problem that people are contorting to work around it, maybe we should address that? Just a thought.
> >
>
> I think that using a Protobuf-like format with an IDL and a compiler presents a problem.

To clarify some inarticulate language, since people reading may misinterpret it: using an IDL-based metadata representation _in this C API_ presents a potential roadblock for users. As a canonical metadata representation with backward and forward compatibility guarantees, it would be ill-advised not to use Protobuf/Flatbuffers/Thrift.

> Note that Flatbuffers is much better for C/C++ programmers and I still think it was the right choice for the project. Unlike with Flatbuffers, C/C++ applications using Protobuf must link either libprotobuf.so or libprotobuf.a. Flatbuffers in C++ is a header-only dependency that is trivial to bundle with a project [1]. The same is true for Thrift, and this came up in the TF discussion [2]
>
> [1]: https://github.com/apache/arrow/tree/master/cpp/thirdparty/flatbuffers/include/flatbuffers
> [2]: https://github.com/tensorflow/community/pull/162#discussion_r332610486
>
> > Thanks for the discourse and patience.
> >
> > On Wed, Oct 2, 2019 at 10:12 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >>
> >> Hi Wes,
> >> I agree for third-parties "A" (Field data structures) is the most useful.
> >>
> >> At least in my mind the discussion was for both first and third-parties. I was trying to point out that "A" is less necessary as a first step for first-party integrations and could potentially require more effort if we already have the code that does "B" (field reassembly).
> >>
> >> Thanks,
> >> Micah
> >>
> >> On Wed, Oct 2, 2019 at 10:28 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>
> >> > On Wed, Oct 2, 2019 at 11:05 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
> >> > >
> >> > > I've tried to summarize my understanding of the debate so far and give some initial thoughts. I think there are two potentially different sets of users that we are targeting with a stable C API/ABI: ourselves and external parties.
> >> > >
> >> > > 1. Different language implementations within the Arrow project that want to call into each other's code. We still don't have a great story around this in terms of reusable libraries, and questions like [1] are motivating examples of making something better in this context.
> >> > > 2. Third parties wishing to support/integrate with Arrow. Some conjectures about these users:
> >> > >    - Users in this group are NOT necessarily familiar with existing technologies Arrow uses (i.e. flatbuffers)
> >> > >    - The stability of the API is the primary concern (consumers don't want to change when a new version of the library ships)
> >> > >    - An important secondary concern is additional libraries that need to be integrated in addition to the API
> >> > >
> >> > > The main debate points seem to be:
> >> > >
> >> > > 1. Vector/Array oriented API vs existing Record Batch. Will an additional column oriented API become too much of a maintenance headache/cause fragmentation?
> >> > >
> >> > >    - In my mind the question here is which set of users we are prioritizing. IMO the combination of flatbuffers and translation to/from RecordBatch format offers too much friction to make it easy for a third-party implementer to use. If we are prioritizing for our own internal use-cases I think we should try out a RecordBatch+Flatbuffers based C-API. We already have all the necessary building blocks.
> >> > >
> >> >
> >> > If a C function passes you a string containing a RecordBatch Flatbuffers message, what happens next? This message has to be reassembled into a recursive data structure before you can "do" anything with it. Are we expecting every third party project to implement:
> >> >
> >> > A. Data structures appropriate to represent a logical "field" in a record batch (which have to be recursive to account for nested types' children)
> >> > B. The logic to convert from the flattened Flatbuffers representation to some implementation of A
> >> >
> >> > I'm arguing that we should provide both to third parties. To build B, you need A. Some consumers will only use A. This discussion is essentially about developing an ultraminimalist "drop-in" C implementation of A.
> >> >
> >> > > 2. How onerous is the dependency on flat-buffers both from a learning curve perspective and as a dependency for third-party integrators?
> >> > >    - Flatbuffers aren't entirely straight-forward and I think if we do move forward with an API based on Column/Array we should consider alternatives as long as the necessary parsing code can be done in a small amount of code (I'm personally against JSON for this, but can see the arguments for it).
> >> > >
> >> > > 3. Do all existing library implementations need to support both Column/Array a ABI? How will compliance be checked for the new API/ABI?
> >> > >
> >> > >    - I'm still thinking this through.
> >> > >
> >> > > [1] https://lists.apache.org/thread.html/18244b294d0b9bd568b5cfd1b1ac2b6a25088383a08202cc7a8a3563@%3Cuser.arrow.apache.org%3E
> >> > >
> >> > > On Wed, Oct 2, 2019 at 6:46 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >> > >
> >> > > > I'd like to hear more opinions from others on this topic. This conversation seems mostly dominated by comments from myself, Wes and Antoine.
> >> > > >
> >> > > > I think it is reasonable to argue that keeping any ABI (or header/struct pattern) as narrow as possible would allow us to minimize overlap with the existing in-memory specification. In Arrow's case, this could be as simple as a single memory pointer for schema (backed by flatbuffers) and a single memory location for data (that references the record batch header, which in turn provides pointers into the actual arrow data). Extensions would need to be added for reference management as done here but I continue to think we should defer discussion of that until the base data structures are resolved. I see the comments here as arguing for a much broader ABI, in part to support having people build "Arrow" components that interconnect using this new interface. I understand the desire to expand the ABI to be driven by needs to reduce dependencies and ease usability.
> >> > > >
> >> > > > The representation within the related patch is being presented as a way for applications to share Arrow data but is not easily accessible to all languages. I want to avoid a situation where someone says "I produced an Arrow API" when what they've really done is created a C interface which only a small subset of languages can actually leverage. For example, every language now knows how to parse the existing schema definition as rendered in flatbuf. In order to interact with something that implements this new pattern one would also be required to implement completely new schema consumption code. The proposal itself suggests this (for example, enhancing the C++ library to consume structures produced this way).
> >> > > >
> >> > > > As I said, I really want to hear more opinions. Running this past various developers I know, many have echoed my concerns but that really doesn't matter (and who knows how much of that is colored by my presentation of the issue). What do people here think? If someone builds an "Arrow" library that implements this set of structures, how does one use it in Node? In Java? Does it drive creation of a secondary set of interfaces in each of those languages to work with this kind of pattern? (For example, in a JVM view of the world, working with a plain struct in Java rather than a set of memory pointers against our existing IPC formats would be quite painful and we'd definitely need to create some glue code for users. I worry the same pattern would occur in many other languages.)
> >> > > >
> >> > > > To respond directly to some of Wes's most recent comments from the email below: I struggle to map your description of the situation to the rest of the thread and the proposed patch. For example, you say that a non-goal is "creating a new canonical way to serialize metadata" but the patch proposes a concrete string based encoding system to describe data types. Aren't those things in conflict?
> >> > > >
> >> > > > I'll also think more on this and challenge my own perspective. This isn't where my focus is so my comments aren't as developed/thoughtful as I'd like.
> >> > > >
> >> > > > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >> > > >
> >> > > > > hi Jacques,
> >> > > > >
> >> > > > > I think we've veered off course a bit and maybe we could reframe the discussion.
> >> > > > >
> >> > > > > Goals
> >> > > > > * A "drop-in" header-only C file that projects can use as a programming interface either internally only or to expose in-memory data structures between C functions at call sites. Ideally little to no disassembly/reassembly should be required on either "side" of the call site.
> >> > > > > * Simplifying adoption of Arrow for C programmers, or languages based around C FFI
> >> > > > >
> >> > > > > Non-goals
> >> > > > > * Expanding the columnar format or creating an alternative canonical in-memory representation
> >> > > > > * Creating a new canonical way to serialize metadata
> >> > > > >
> >> > > > > Note that this use case has been on my mind for more than 2 years: https://issues.apache.org/jira/browse/ARROW-1058
> >> > > > >
> >> > > > > I think there are a couple of potentially misleading things at play here
> >> > > > >
> >> > > > > 1. The use of the word "protocol". In C, a struct has a well-defined binary layout, so a C API is also an ABI. Using C structs to communicate data can be considered to be a protocol, but it means something different in the context of the "Arrow protocol". I think we need to call this a "C API"
> >> > > > >
> >> > > > > 2. The documentation for this in Antoine's PR is in the format/ directory. It would probably be better to have a "C API" section in the documentation.
> >> > > > >
> >> > > > > The header file under discussion and the documentation about it is best considered as a "library".
> >> > > > >
> >> > > > > It might be useful at some point to create a C99 implementation of the IPC protocol as well using FlatCC with the goal of having a complete implementation of the columnar format in C with minimal binary footprint. This is analogous to the NanoPB project which is an implementation of Protocol Buffers with small code size
> >> > > > >
> >> > > > > https://github.com/nanopb/nanopb
> >> > > > >
> >> > > > > Let me know if this makes more sense.
> >> > > > >
> >> > > > > I think it's important to communicate clearly about this primarily for the benefit of the outside world which can confuse easily as we have observed over the last few years =)
> >> > > > >
> >> > > > > Wes
> >> > > > >
> >> > > > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >> > > > > >
> >> > > > > > I disagree with this statement:
> >> > > > > >
> >> > > > > > - the IPC format is meant for serialization while the C data protocol is meant for in-memory communication, so different concerns apply
> >> > > > > >
> >> > > > > > If that is how a particular implementation presents it, that is a weakness of the implementation, not the format. The primary use case I was focused on when working on the initial format was communication within the same process. It seems like this is being used as a basis for the introduction of new things when the premise is inconsistent with the intention of the creation. The specific reason we used flatbuffers in the project was to collapse the separation of in-process and out-of-process communication. It means the same thing it does with the Arrow data itself: that a consumer doesn't have to use a particular library to interact with and use the data.
> >> > > > > >
> >> > > > > > It seems like there are two ideas here:
> >> > > > > >
> >> > > > > > 1) How do we make it easier for people to use Arrow?
> >> > > > > > 2) Should we implement a new in-memory representation of Arrow that is language specific?
> >> > > > > >
> >> > > > > > I'm entirely in support of number one. If for a particular type of domain, people want an easier way to interact with Arrow, let's make a new library that helps with that. In each of our current libraries, we do many things to make it easier to work with Arrow. None of those require a change to the core format or are formalized as a new in-memory standard. The in-memory representations of Rust or JavaScript or Java objects are implementation details.
> >> > > > > >
> >> > > > > > I'm against number two as it creates a fragmentation problem. Arrow is about having a single canonical format for memory for both metadata and data. Having multiple in-memory formats (especially when some are not language independent) is counter to the goals of the project.
> >> > > > >
> >> > > > > I don't think anyone is proposing anything that would cause fragmentation.
> >> > > > >
> >> > > > > A central question is whether it is useful to define a reusable C ABI for the Arrow columnar format, and if there is sufficient interest, a tiny C implementation of the IPC protocol (which uses the Flatbuffers message) that assembles and disassembles the data structures defined in the C ABI.
> >> > > > >
> >> > > > > We could separately create a tiny implementation of the Arrow IPC protocol using FlatCC that could be dropped into applications requiring only a C compiler and nothing else.
> >> > > > >
> >> > > > > >
> >> > > > > > Two other, separate comments:
> >> > > > > > 1) I don't understand the idea that we need to change the way Arrow fundamentally works so that people can avoid using a dependency. If the dependency is small, open source and easy to build, people can fork it and include it directly if they want to. Let's not violate project principles because DuckDB has a religious perspective on dependencies. If the problem is people have to swallow too large of a pill to do basic things with Arrow in C, let's focus on fixing that (to our definition of ease, not someone else's). If FlatCC solves some of those things, great. If we need to build a baby integration library that is more C centric, great. Neither of those things requires implementing something at the format level.
> >> > > > > >
> >> > > > > > 2) It seems like we should discuss the data structure problem separately from the reference management concern.
> >> > > > > >
> >> > > > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > hi Antoine,
> >> > > > > > >
> >> > > > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <anto...@python.org> wrote:
> >> > > > > > > >
> >> > > > > > > > On 01/10/2019 at 00:39, Wes McKinney wrote:
> >> > > > > > > > > A couple things:
> >> > > > > > > > >
> >> > > > > > > > > * I think it would be better for a C protocol / FFI for Arrow arrays/vectors to have the same "shape" as an assembled array. Note that the C structs here have very nearly the same "shape" as the data structure representing a C++ Array object [1]. The disassembly and reassembly here is substantially simpler than the IPC protocol. A recursive structure in Flatbuffers would make RecordBatch messages much larger, so the flattened / disassembled representation we use for serialized record batches is the correct one
> >> > > > > > > >
> >> > > > > > > > I'm not sure I agree:
> >> > > > > > > >
> >> > > > > > > > - indeed, it's not a coincidence that the ArrowArray struct closely resembles the C++ ArrayData object :-) We have good experience with that abstraction and it has proven to work quite well
> >> > > > > > > >
> >> > > > > > > > - the IPC format is meant for serialization while the C data protocol is meant for in-memory communication, so different concerns apply
> >> > > > > > > >
> >> > > > > > > > - the fact that this makes the layout slightly larger doesn't seem important at all; we're not talking about transferring data over the wire
> >> > > > > > > >
> >> > > > > > > > There's also another argument for having a recursive struct: it simplifies how the data type is represented, since we can encode each child type individually instead of encoding it in the parent's format string (same applies for metadata and individual flags).
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > I was saying something different here. I was making an argument about why we use the flattened array-of-structs in the IPC protocol. One reason is that it's a more compact representation. That is not very important here because this protocol is only for *in-process* (for languages that have a C FFI facility) rather than *inter-process* communication.
> >> > > > > > >
> >> > > > > > > I agree also that the type encoding is simple here, too, since we aren't having to split the schema and record batch between different serialized messages. There is some potential waste with having to populate the type fields multiple times when communicating a sequence of "chunks" from the same logical dataset.
> >> > > > > > >
> >> > > > > > > > > * The "formal" C protocol having the "assembled" shape means that many minimal Arrow users won't have to implement any separate data structures. They can just use the C struct directly or a slightly wrapped version thereof with some convenience functions.
> >> > > > > > > >
> >> > > > > > > > Yes, but the same applies to the current proposal.
> >> > > > > > > >
> >> > > > > > > > > * I think that requiring building a Flatbuffer for minimal use cases (e.g. communicating simple record batches with primitive types) passes on implementation burden to minimal users.
> >> > > > > > > >
> >> > > > > > > > It certainly does.
> >> > > > > > > >
> >> > > > > > > > > I think the mantra of the C protocol should be the following:
> >> > > > > > > > >
> >> > > > > > > > > * Users of the protocol have to write little to no code to use it. For example, populating an INT32 array should require only a few lines of code
> >> > > > > > > >
> >> > > > > > > > Agreed. As a sidenote, the spec should have an example of doing this in raw C.
> >> > > > > > > >
> >> > > > > > > > Regards
> >> > > > > > > >
> >> > > > > > > > Antoine.
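
The thread closes with a request for a raw C example of populating an INT32 array. The following is only a rough sketch of what that might look like, not the layout from the PR under discussion. The struct `SketchArray`, its field names, the `"i"` format code, the two-buffer layout (validity bitmap, then values), and the `release` callback are all assumptions made up for illustration. The struct is recursive (a `children` array) in the spirit of the "assembled shape" argued for above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical struct for illustration only; NOT the definition from the PR. */
struct SketchArray {
  const char *format;            /* assumed type code; "i" stands for int32 here */
  int64_t length;                /* number of logical values */
  int64_t null_count;            /* number of null values */
  int64_t n_buffers;             /* entries in `buffers` */
  int64_t n_children;            /* entries in `children` */
  const void **buffers;          /* assumed layout: [0] validity bitmap or NULL, [1] values */
  struct SketchArray **children; /* recursive part: one entry per nested child */
  void (*release)(struct SketchArray *); /* assumed ownership hook */
};

/* Frees everything owned by `arr`. Children are assumed to be heap-allocated
 * and owned by their parent. Sets `release` to NULL to mark the array released. */
static void sketch_release(struct SketchArray *arr) {
  if (arr == NULL || arr->release == NULL) return;
  for (int64_t i = 0; i < arr->n_children; ++i) {
    struct SketchArray *child = arr->children[i];
    if (child != NULL && child->release != NULL) child->release(child);
    free(child);
  }
  free(arr->children);
  if (arr->buffers != NULL) {
    for (int64_t i = 0; i < arr->n_buffers; ++i) free((void *)arr->buffers[i]);
    free(arr->buffers);
  }
  arr->release = NULL;
}

/* Fills `out` with a non-nullable int32 array holding a copy of `values`.
 * Returns 0 on success, -1 if an allocation fails. */
static int sketch_make_int32(struct SketchArray *out, const int32_t *values,
                             int64_t n) {
  assert(out != NULL && n >= 0 && (values != NULL || n == 0));
  memset(out, 0, sizeof(*out));

  const void **buffers = calloc(2, sizeof(*buffers));
  int32_t *data = malloc(n > 0 ? (size_t)n * sizeof(int32_t) : 1);
  if (buffers == NULL || data == NULL) {
    free(buffers);
    free(data);
    return -1;
  }
  if (n > 0) memcpy(data, values, (size_t)n * sizeof(int32_t));

  buffers[0] = NULL; /* no validity bitmap: every value is valid */
  buffers[1] = data; /* the int32 values themselves */

  out->format = "i";
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = buffers;
  out->children = NULL;
  out->release = sketch_release;
  return 0;
}
```

Under these assumptions, a producer would call `sketch_make_int32(&arr, values, n)` and pass `&arr` across the C boundary, and the consumer would call `arr.release(&arr)` once it is done with the data. Whether ownership is handled by a callback like this, or by some other mechanism, is exactly the reference management question the thread suggests discussing separately.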