A couple things:

* I think it would be better for a C protocol / FFI for Arrow
  arrays/vectors to have the same "shape" as an assembled array. Note
  that the C structs here have very nearly the same "shape" as the
  data structure representing a C++ Array object [1]. The disassembly
  and reassembly here is substantially simpler than in the IPC
  protocol. A recursive structure in Flatbuffers would make
  RecordBatch messages much larger, so the flattened / disassembled
  representation we use for serialized record batches is the correct
  one for IPC.
* The "formal" C protocol having the "assembled" shape means that many
  minimal Arrow users won't have to implement any separate data
  structures. They can just use the C struct directly or a slightly
  wrapped version thereof with some convenience functions.
* I think that requiring building a Flatbuffer for minimal use cases
  (e.g. communicating simple record batches with primitive types)
  passes on implementation burden to minimal users.

I think the mantra of the C protocol should be the following:

* Users of the protocol have to write little to no code to use it. For
  example, populating an INT32 array should require only a few lines
  of code
* The data structure in the protocol is suitable as an in-memory data
  structure for recursive assembly of nested structures

I think that having a string miniformat or a pre-parsed type struct
with enum values (along the lines of what Antoine is describing above)
places less burden on downstream users.

[1]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L203

On Mon, Sep 30, 2019 at 4:08 PM Antoine Pitrou <[email protected]> wrote:
>
>
> FlatCC is still a dependency, with generated files etc.
> Perhaps you want to evaluate FlatCC on a schema-like example and see
> what the generated code and compile instructions look like?
>
> I'll point out again that the format string in my proposal uses an
> extremely simple mini-format, that should be parsable very easily by any
> developer, even in raw C:
> https://github.com/apache/arrow/blob/3806fa9ba3ddf95f0d09b865071bf19c5e912756/docs/source/format/CProtocol.rst#data-type-description----format-strings
>
> The parent-child structure in the schema is represented as-is in the
> ArrowArray parent-child relationship, so it doesn't need any encoding.
> Using Flatbuffers for an enum-like field + (at most) a couple parameters
> sounds overkill.
>
> Another possibility would be to replace the format string with
> pre-parsed fields, for example:
>
>   int32_t type;
>   int32_t subtype;  // type-dependent (e.g. unit for temporal types)
>   int32_t type_width;  // for width-parameterized types
>   const int8_t* child_ids;  // for unions
>   const char* auxiliary_type_param;  // e.g. timezone for timestamp
>
> The downside is that there are more fields to consider (including two
> optional pointers).
>
> Regards
>
> Antoine.
>
>
> On 30/09/2019 at 22:48, Ben Kietzman wrote:
> > FlatCC seems germane: https://github.com/dvidelabs/flatcc
> >
> > It compiles flatbuffer schemas down to (idiomatic?) C
> >
> > Perhaps the schema and batch serialization problems should be solved by
> > storing everything in the flatbuffer format.
> > Then the results of running flatcc plus a few simple helpers can be checked
> > in to provide an accessible C API.
> > With respect to lifetime, Antoine has already done good work on specifying
> > a move-only contract which could probably be adapted.
> >
> >
> > On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou <[email protected]> wrote:
> >
> >>
> >> One basic design point is to allow exchanging Arrow data with no
> >> mandatory dependency (the exception is JSON and base64 if you want to
> >> act on metadata - but that's highly optional, and those are extremely
> >> widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
> >> not only does it introduce a library, but it requires the use of a
> >> compiler to produce generated code.  It also requires familiarizing
> >> oneself with, well, Flatbuffers :-)
> >>
> >> We can of course discuss this and feel it's not a problem.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> On 29/09/2019 at 19:47, Wes McKinney wrote:
> >>> There are two pieces of serialized data needed to communicate a record
> >>> batch from one library to another
> >>>
> >>> * Serialized schema (i.e. what's in Schema.fbs)
> >>> * Serialized "data header", i.e.
> >>>   the "RecordBatch" message in Message.fbs
> >>>
> >>> You _do_ need to use a Flatbuffers library to fully create these
> >>> message types to interact with any existing record batch disassembly /
> >>> reassembly.
> >>>
> >>> I think I'm most concerned about having a new way to serialize
> >>> schemas. We already have JSON-based schema serialization for
> >>> integration test purposes, so one possibility is to standardize that
> >>> and make it a more formalized part of the project specification.
> >>>
> >>> As far as a C protocol goes, I don't see an especial downside to using
> >>> the Flatbuffers schema to communicate types.
> >>>
> >>> Another thought is to not deviate from the flattened
> >>> Flatbuffers-styled representation but to translate the Flatbuffers
> >>> types into C types: namely a C struct-based version of the
> >>> "RecordBatch" message.
> >>>
> >>> Independent of the means to communicate the two pieces of serialized
> >>> information above (respectively: schemas and record batch field memory
> >>> addresses and field lengths), having a C-based FFI where projects can
> >>> drop in a header file containing the ABI they are supposed to
> >>> implement seems pretty reasonable to me.
> >>>
> >>> If we don't define a standardized in-memory FFI (whether it uses the
> >>> Flatbuffers objects as inputs/outputs or not) then downstream projects
> >>> will devise their own, and that will cause issues long term.
> >>>
> >>> On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <[email protected]> wrote:
> >>>>
> >>>>
> >>>> On 29/09/2019 at 06:10, Jacques Nadeau wrote:
> >>>>> * No dependency on Flatbuffers.
> >>>>> * No buffer reassembly (data is already exposed in logical Arrow format).
> >>>>> * Zero-copy by design.
> >>>>> * Easy to reimplement from scratch.
> >>>>>
> >>>>> I don't see how the flatbuffer pattern for data headers doesn't accomplish
> >>>>> all of these things.
> >>>>> At its definition, it is a very simple representation of
> >>>>> data that could be worked with independently of the flatbuffers codebase.
> >>>>> It was designed so systems could map directly into that memory without
> >>>>> interacting with a flatbuffers library.
> >>>>>
> >>>>> Specifically the following three structures were designed to already allow
> >>>>> what I think this proposal is trying to recreate. All three are very simple
> >>>>> to construct in a direct, non-flatbuffer dependent read/write pattern.
> >>>>
> >>>> Are they?  Personally, I wouldn't know how to do that.  I don't know
> >>>> which encoding Flatbuffers uses, whether it's C ABI-compatible (how could
> >>>> it be? if it's portable across different platforms, then it's probably
> >>>> not compatible with any particular platform's C ABI, or only as a
> >>>> coincidence), how I'm supposed to make use of the "offset" field, or
> >>>> what the lifetime / ownership of all this data is.
> >>>>
> >>>> I may be missing something, but if the answer is that it's easy to
> >>>> reimplement Flatbuffers' encoding without relying on the Flatbuffers
> >>>> project's source code, I'm a bit skeptical.
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>>>
> >>>>> struct FieldNode {
> >>>>>   length: long;
> >>>>>   null_count: long;
> >>>>> }
> >>>>>
> >>>>> struct Buffer {
> >>>>>   offset: long;
> >>>>>   length: long;
> >>>>> }
> >>>>>
> >>>>> table RecordBatch {
> >>>>>   length: long;
> >>>>>   nodes: [FieldNode];
> >>>>>   buffers: [Buffer];
> >>>>> }
> >>>>>
> >>>>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <[email protected]> wrote:
> >>>>>
> >>>>>> I'm not clear on why we need to introduce something beyond what
> >>>>>> flatbuffers already provides. Can someone explain that to me? I'm not
> >>>>>> really a fan of introducing a second representation of the same data (as I
> >>>>>> understand it).
> >>>>>>
> >>>>>> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <[email protected]> wrote:
> >>>>>>
> >>>>>>> This is helpful, I will leave some comments on the proposal when I
> >>>>>>> can, sometime in the next week.
> >>>>>>>
> >>>>>>> I agree that it would likely be opening a can of worms to create a
> >>>>>>> semantic mapping between a generalized type grammar and Arrow's
> >>>>>>> specific logical types defined in Schema.fbs. If we go down this
> >>>>>>> route, we should probably utilize the simplest possible grammar that
> >>>>>>> is capable of encoding the Type Flatbuffers union values.
> >>>>>>>
> >>>>>>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I've posted a draft specification PR here, this should help orient the
> >>>>>>>> discussion a bit:
> >>>>>>>> https://github.com/apache/arrow/pull/5442
> >>>>>>>>
> >>>>>>>> Regards
> >>>>>>>>
> >>>>>>>> Antoine.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Wed, 18 Sep 2019 19:52:38 +0200
> >>>>>>>> Antoine Pitrou <[email protected]> wrote:
> >>>>>>>>> Hello,
> >>>>>>>>>
> >>>>>>>>> One thing that was discussed in the sync call is the ability to easily
> >>>>>>>>> pass arrays at runtime between Arrow implementations or Arrow-supporting
> >>>>>>>>> libraries in the same process, without bearing the cost of linking to
> >>>>>>>>> e.g. the C++ Arrow library.
> >>>>>>>>>
> >>>>>>>>> (for example: "Duckdb wants to provide an option to return Arrow data of
> >>>>>>>>> result sets, but they don't like having Arrow as a dependency")
> >>>>>>>>>
> >>>>>>>>> One possibility would be to define a C-level protocol similar in spirit
> >>>>>>>>> to the Python buffer protocol, which some people may be familiar with (*).
> >>>>>>>>>
> >>>>>>>>> The basic idea is to define a simple C struct, which is ABI-stable and
> >>>>>>>>> describes an Arrow array adequately.
> >>>>>>>>> The struct can be stack-allocated.
> >>>>>>>>> Its definition can also be copied in another project (or interfaced with
> >>>>>>>>> using a C FFI layer, depending on the language).
> >>>>>>>>>
> >>>>>>>>> There is no formal proposal, this message is meant to stir the discussion.
> >>>>>>>>>
> >>>>>>>>> Issues to work out:
> >>>>>>>>>
> >>>>>>>>> * Memory lifetime issues: where Python simply associates the Py_buffer
> >>>>>>>>> with a PyObject owner (a garbage-collected Python object), we need
> >>>>>>>>> another means to control the lifetime of pointed-to areas.  One simple
> >>>>>>>>> possibility is to include a destructor function pointer in the protocol
> >>>>>>>>> struct.
> >>>>>>>>>
> >>>>>>>>> * Arrow type representation.  We probably need some kind of "format"
> >>>>>>>>> mini-language to represent Arrow types, so that a type can be described
> >>>>>>>>> using a `const char*`.  Ideally, primitive types at least should be
> >>>>>>>>> trivially parsable.  We may take inspiration from Python here (`struct`
> >>>>>>>>> module format characters, PEP 3118 format additions).
> >>>>>>>>>
> >>>>>>>>> Example C struct definition (not a formal proposal!):
> >>>>>>>>>
> >>>>>>>>> struct ArrowBuffer {
> >>>>>>>>>   void* data;
> >>>>>>>>>   int64_t nbytes;
> >>>>>>>>>   // Called by the consumer when it doesn't need the buffer anymore
> >>>>>>>>>   void (*release)(struct ArrowBuffer*);
> >>>>>>>>>   // Opaque user data (for e.g. the release callback)
> >>>>>>>>>   void* user_data;
> >>>>>>>>> };
> >>>>>>>>>
> >>>>>>>>> struct ArrowArray {
> >>>>>>>>>   // Type description
> >>>>>>>>>   const char* format;
> >>>>>>>>>   // Data description
> >>>>>>>>>   int64_t length;
> >>>>>>>>>   int64_t null_count;
> >>>>>>>>>   int64_t n_buffers;
> >>>>>>>>>   // Note: these pointers are probably owned by the ArrowArray struct
> >>>>>>>>>   // and will be released and free()ed by the release callback.
> >>>>>>>>>   struct ArrowBuffer* buffers;
> >>>>>>>>>   struct ArrowArray* dictionary;
> >>>>>>>>>   // Called by the consumer when it doesn't need the array anymore
> >>>>>>>>>   void (*release)(struct ArrowArray*);
> >>>>>>>>>   // Opaque user data (for e.g. the release callback)
> >>>>>>>>>   void* user_data;
> >>>>>>>>> };
> >>>>>>>>>
> >>>>>>>>> Thoughts?
> >>>>>>>>>
> >>>>>>>>> (*) For the record, the reference for the Python buffer protocol:
> >>>>>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> >>>>>>>>> and its C struct definition:
> >>>>>>>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>>
> >>>>>>>>> Antoine.
> >>>>>>>>>
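
To make the "few lines of code" goal from the top of this message a bit more concrete, here is a rough sketch in C of populating an INT32 array. It is only an illustration under stated assumptions: the two structs are the example definitions from the original message in this thread (not the draft specification), the "i" format string for INT32 is assumed, the two-buffer layout (validity bitmap, then values) is assumed, and `make_int32_array` / `release_int32_array` are hypothetical names invented here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Example structs from the original message (illustrative, not the spec). */
struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  void (*release)(struct ArrowBuffer*);
  void* user_data;
};

struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  struct ArrowBuffer* buffers;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* user_data;
};

/* Hypothetical release callback: frees what make_int32_array allocated
   and marks the struct as released by clearing the callback. */
static void release_int32_array(struct ArrowArray* array) {
  assert(array != NULL);
  if (array->buffers != NULL) {
    free(array->buffers[1].data); /* values buffer */
    free(array->buffers);
  }
  array->buffers = NULL;
  array->release = NULL;
}

/* Hypothetical helper: fill `out` with a non-nullable INT32 array holding
   0, 1, ..., n - 1. Returns 0 on success, -1 on allocation failure. */
static int make_int32_array(int64_t n, struct ArrowArray* out) {
  assert(out != NULL && n >= 0);
  struct ArrowBuffer* buffers = malloc(2 * sizeof *buffers);
  /* Allocate at least one element so that n == 0 is not seen as failure. */
  int32_t* values = malloc((size_t)(n > 0 ? n : 1) * sizeof *values);
  if (buffers == NULL || values == NULL) {
    free(buffers);
    free(values);
    return -1;
  }
  for (int64_t i = 0; i < n; ++i) values[i] = (int32_t)i;

  /* Assumed layout: buffers[0] is the validity bitmap (absent here because
     there are no nulls), buffers[1] holds the values. */
  buffers[0] = (struct ArrowBuffer){NULL, 0, NULL, NULL};
  buffers[1] = (struct ArrowBuffer){values, n * (int64_t)sizeof *values,
                                    NULL, NULL};

  out->format = "i"; /* assumed format string for INT32 */
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->dictionary = NULL;
  out->release = release_int32_array;
  out->user_data = NULL;
  return 0;
}
```

A consumer would read `length` values from `buffers[1].data` and call `array.release(&array)` once it is done. Here the array owns its buffers, so the per-buffer `release` pointers are left NULL; whether ownership should sit at the array or the buffer level is one of the lifetime questions the thread leaves open.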
