hi Jacques, I think we've veered off course a bit and maybe we could reframe the discussion.

Goals

* A "drop-in" header-only C file that projects can use as a programming interface, either internally only or to expose in-memory data structures between C functions at call sites. Ideally little to no disassembly/reassembly should be required on either "side" of the call site.
* Simplifying adoption of Arrow for C programmers, or languages based around C FFI

Non-goals

* Expanding the columnar format or creating an alternative canonical in-memory representation
* Creating a new canonical way to serialize metadata

Note that this use case has been on my mind for more than 2 years: https://issues.apache.org/jira/browse/ARROW-1058

I think there are a couple of potentially misleading things at play here

1. The use of the word "protocol". In C, a struct has a well-defined binary layout on a given platform, so a C API is also an ABI. Using C structs to communicate data can be considered to be a protocol, but it means something different in the context of the "Arrow protocol". I think we need to call this a "C API"

2. The documentation for this in Antoine's PR is in the format/ directory. It would probably be better to have a "C API" section in the documentation.

The header file under discussion and the documentation about it are best considered as a "library".

It might be useful at some point to create a C99 implementation of the IPC protocol as well using FlatCC, with the goal of having a complete implementation of the columnar format in C with minimal binary footprint. This is analogous to the NanoPB project, which is an implementation of Protocol Buffers with small code size: https://github.com/nanopb/nanopb

Let me know if this makes more sense.
I think it's important to communicate clearly about this, primarily for the benefit of the outside world, which is easily confused, as we have observed over the last few years =)

Wes

On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> I disagree with this statement:
>
> - the IPC format is meant for serialization while the C data protocol is
> meant for in-memory communication, so different concerns apply
>
> If that is how a particular implementation presents it, that is a
> weakness of the implementation, not the format. The primary use case I
> was focused on when working on the initial format was communication within
> the same process. It seems like this is being used as a basis for the
> introduction of new things when the premise is inconsistent with the
> intention of the creation. The specific reason we used flatbuffers in the
> project was to collapse the separation of in-process and out-of-process
> communication. It means the same thing it does with the Arrow data itself:
> that a consumer doesn't have to use a particular library to interact with
> and use the data.
>
> It seems like there are two ideas here:
>
> 1) How do we make it easier for people to use Arrow?
> 2) Should we implement a new in-memory representation of Arrow that is
> language specific?
>
> I'm entirely in support of number one. If for a particular type of domain,
> people want an easier way to interact with Arrow, let's make a new library
> that helps with that. In each of our current libraries, we do many things
> to make it easier to work with Arrow. None of those require a change to the
> core format or are formalized as a new in-memory standard. The in-memory
> representation of rust or javascript or java objects are implementation
> details.
>
> I'm against number two as it creates a fragmentation problem. Arrow is
> about having a single canonical format for memory for both metadata and
> data.
> Having multiple in-memory formats (especially when some are not
> language independent) is counter to the goals of the project.

I don't think anyone is proposing anything that would cause fragmentation. A central question is whether it is useful to define a reusable C ABI for the Arrow columnar format, and if there is sufficient interest, a tiny C implementation of the IPC protocol (which uses the Flatbuffers message) that assembles and disassembles the data structures defined in the C ABI.

We could separately create a tiny implementation of the Arrow IPC protocol using FlatCC that could be dropped into applications requiring only a C compiler and nothing else.

>
> Two other, separate comments:
> 1) I don't understand the idea that we need to change the way Arrow
> fundamentally works so that people can avoid using a dependency. If the
> dependency is small, open source and easy to build, people can fork it and
> include directly if they want to. Let's not violate project principles
> because DuckDB has a religious perspective on dependencies. If the problem
> is people have to swallow too large of a pill to do basic things with Arrow
> in C, let's focus on fixing that (to our definition of ease, not someone
> else's). If FlatCC solves some of those things, great. If we need to build a
> baby integration library that is more C centric, great. Neither of those
> things require implementing something at the format level.
>
> 2) It seems like we should discuss the data structure problem separately
> from the reference management concern.
>
>
> On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <wesmck...@gmail.com> wrote:
>
> > hi Antoine,
> >
> > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > On 01/10/2019 at 00:39, Wes McKinney wrote:
> > > > A couple things:
> > > >
> > > > * I think a C protocol / FFI for Arrow array/vectors would be better
> > > > to have the same "shape" as an assembled array.
> > > > Note that the C
> > > > structs here have very nearly the same "shape" as the data structure
> > > > representing a C++ Array object [1]. The disassembly and reassembly
> > > > here is substantially simpler than the IPC protocol. A recursive
> > > > structure in Flatbuffers would make RecordBatch messages much larger,
> > > > so the flattened / disassembled representation we use for serialized
> > > > record batches is the correct one
> > >
> > > I'm not sure I agree:
> > >
> > > - indeed, it's not a coincidence that the ArrowArray struct looks quite
> > > closely like the C++ ArrayData object :-) We have good experience with
> > > that abstraction and it has proven to work quite well
> > >
> > > - the IPC format is meant for serialization while the C data protocol is
> > > meant for in-memory communication, so different concerns apply
> > >
> > > - the fact that this makes the layout slightly larger doesn't seem
> > > important at all; we're not talking about transferring data over the wire
> > >
> > > There's also another argument for having a recursive struct: it
> > > simplifies how the data type is represented, since we can encode each
> > > child type individually instead of encoding it in the parent's format
> > > string (same applies for metadata and individual flags).
> > >
> >
> > I was saying something different here. I was making an argument about
> > why we use the flattened array-of-structs in the IPC protocol. One
> > reason is that it's a more compact representation. That is not very
> > important here because this protocol is only for *in-process* (for
> > languages that have a C FFI facility) rather than *inter-process*
> > communication.
> >
> > I agree also that the type encoding is simple, here, too, since we
> > aren't having to split the schema and record batch between different
> > serialized messages.
> > There is some potential waste with having to
> > populate the type fields multiple times when communicating a sequence
> > of "chunks" from the same logical dataset.
> >
> > > > * The "formal" C protocol having the "assembled" shape means that many
> > > > minimal Arrow users won't have to implement any separate data
> > > > structures. They can just use the C struct directly or a slightly
> > > > wrapped version thereof with some convenience functions.
> > >
> > > Yes, but the same applies to the current proposal.
> > >
> > > > * I think that requiring building a Flatbuffer for minimal use cases
> > > > (e.g. communicating simple record batches with primitive types) passes
> > > > on implementation burden to minimal users.
> > >
> > > It certainly does.
> > >
> > > > I think the mantra of the C protocol should be the following:
> > > >
> > > > * Users of the protocol have to write little to no code to use it. For
> > > > example, populating an INT32 array should require only a few lines of
> > > > code
> > >
> > > Agreed. As a sidenote, the spec should have an example of doing this in
> > > raw C.
> > >
> > > Regards
> > >
> > > Antoine.
> >
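
To make the "few lines of code" goal from the thread concrete, here is a rough sketch in raw C of populating and consuming an INT32 array through a plain struct. Everything in it is an assumption made for illustration: the struct name `ExampleArray`, its fields, the "i" format string, the two-buffer convention (validity bitmap, then values) and the `example_*` helpers are hypothetical and are not taken from Antoine's PR or from any Arrow header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout, for illustration only (not the struct from the PR). */
struct ExampleArray {
  const char* format;    /* assumed type string; "i" stands for int32 here */
  int64_t length;        /* number of logical values */
  int64_t null_count;    /* number of null values */
  int64_t n_buffers;     /* number of entries in `buffers` */
  const void** buffers;  /* assumed: [0] validity bitmap or NULL, [1] values */
  void (*release)(struct ExampleArray*); /* producer-supplied cleanup */
};

/* Producer-side cleanup: frees the buffers and marks the struct released. */
static void example_release(struct ExampleArray* a) {
  if (a->buffers != NULL) {
    free((void*)a->buffers[1]);
    free((void*)a->buffers);
  }
  a->buffers = NULL;
  a->release = NULL;
}

/* Producer: fill `out` with n int32 values 0, 10, 20, ... and no nulls.
   Returns 0 on success, -1 on invalid input or allocation failure. */
static int example_make_int32(struct ExampleArray* out, int64_t n) {
  if (n <= 0) return -1;
  int32_t* values = malloc((size_t)n * sizeof *values);
  const void** buffers = malloc(2 * sizeof *buffers);
  if (values == NULL || buffers == NULL) {
    free(values);
    free((void*)buffers);
    return -1;
  }
  for (int64_t i = 0; i < n; ++i) values[i] = (int32_t)(i * 10);
  buffers[0] = NULL;   /* no validity bitmap: every value is valid */
  buffers[1] = values;
  out->format = "i";
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->release = example_release;
  return 0;
}

/* Consumer: reads the values buffer in place, with no copy or reassembly. */
static int64_t example_sum_int32(const struct ExampleArray* a) {
  assert(a->n_buffers == 2 && a->null_count == 0);
  const int32_t* values = a->buffers[1];
  int64_t sum = 0;
  for (int64_t i = 0; i < a->length; ++i) sum += values[i];
  return sum;
}
```

A caller would declare a `struct ExampleArray`, pass its address to `example_make_int32`, hand the struct to any consumer such as `example_sum_int32`, and finally call its `release` callback. The consumer needs nothing beyond the struct definition, which is the property being argued for in the thread.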