There are two pieces of serialized data needed to communicate a record batch from one library to another
* Serialized schema (i.e. what's in Schema.fbs) * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs You _do_ need to use a Flatbuffers library to fully create these message types to interact with any existing record batch disassembly / reassembly. I think I'm most concerned about having a new way to serialize schemas. We already have JSON-based schema serialization for integration test purposes, so one possibility is to standardize that and make it a more formalized part of the project specification. As far as a C protocol, I don't see an especial downside to using the Flatbuffers schema to communicate types. Another thought is to not deviate from the flattened Flatbuffers-styled representation but to translate the Flatbuffers types into C types: namely a C struct-based version of the "RecordBatch" message. Independent of the means to communicate the two pieces of serialized information above (respectively: schemas and record batch field memory addresses and field lengths), having a C-based FFI where project can drop in a header file containing the ABI they are supposed to implement, that seems pretty reasonable to me. If we don't define a standardized in-memory FFI (whether it uses the Flatbuffers objects as inputs/outputs or not) then downstream project will devise their own, and that will cause issues long term. On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <anto...@python.org> wrote: > > > Le 29/09/2019 à 06:10, Jacques Nadeau a écrit : > > * No dependency on Flatbuffers. > > * No buffer reassembly (data is already exposed in logical Arrow format). > > * Zero-copy by design. > > * Easy to reimplement from scratch. > > > > I don't see how the flatbuffer pattern for data headers doesn't accomplish > > all of these things. At its definition, is a very simple representation of > > data that could be worked with independently of the flatbuffers codebase. > > It was designed so systems could map directly into that memory without > > interacting with a flatbuffers library. > > > > Specifically the following three structures were designed to already allow > > what I think this proposal is trying to recreate. All three are very simple > > to construct in a direct, non-flatbuffer dependent read/write pattern. > > Are they? Personally, I wouldn't know how to do that. I don't know > which encoding Flatbuffers use, whether it's C ABI-compatible (how could > it be? if it's portable accross different platforms, then it's probably > not compatible with any particular platform's C ABI, or only as a > conincidence), how I'm supposed to make use of the "offset" field, or > what the lifetime / ownership of all this data is. > > I may be missing something, but if the answer is that it's easy to > reimplement Flatbuffers' encoding without relying on the Flatbuffers > project's source code, I'm a bit skeptical. > > Regards > > Antoine. > > > > > > struct FieldNode { > > length: long; > > null_count: long; > > } > > > > struct Buffer { > > offset: long; > > length: long; > > } > > > > table RecordBatch { > > length: long; > > nodes: [FieldNode]; > > buffers: [Buffer]; > > } > > > > On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <jacq...@apache.org> wrote: > > > >> I'm not clear on why we need to introduce something beyond what > >> flatbuffers already provides. Can someone explain that to me? I'm not > >> really a fan of introducing a second representation of the same data (as I > >> understand it). > >> > >> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <wesmck...@gmail.com> wrote: > >> > >>> This is helpful, I will leave some comments on the proposal when I > >>> can, sometime in the next week. > >>> > >>> I agree that it would likely be opening a can of worms to create a > >>> semantic mapping between a generalized type grammar and Arrow's > >>> specific logical types defined in Schema.fbs. If we go down this > >>> route, we should probably utilize the simplest possible grammar that > >>> is capable of encoding the Type Flatbuffers union values. > >>> > >>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <solip...@pitrou.net> > >>> wrote: > >>>> > >>>> > >>>> I've posted a draft specification PR here, this should help orient the > >>>> discussion a bit: > >>>> https://github.com/apache/arrow/pull/5442 > >>>> > >>>> Regards > >>>> > >>>> Antoine. > >>>> > >>>> > >>>> > >>>> On Wed, 18 Sep 2019 19:52:38 +0200 > >>>> Antoine Pitrou <anto...@python.org> wrote: > >>>>> Hello, > >>>>> > >>>>> One thing that was discussed in the sync call is the ability to easily > >>>>> pass arrays at runtime between Arrow implementations or > >>> Arrow-supporting > >>>>> libraries in the same process, without bearing the cost of linking to > >>>>> e.g. the C++ Arrow library. > >>>>> > >>>>> (for example: "Duckdb wants to provide an option to return Arrow data > >>> of > >>>>> result sets, but they don't like having Arrow as a dependency") > >>>>> > >>>>> One possibility would be to define a C-level protocol similar in > >>> spirit > >>>>> to the Python buffer protocol, which some people may be familiar with > >>> (*). > >>>>> > >>>>> The basic idea is to define a simple C struct, which is ABI-stable and > >>>>> describes an Arrow away adequately. The struct can be > >>> stack-allocated. > >>>>> Its definition can also be copied in another project (or interfaced > >>> with > >>>>> using a C FFI layer, depending on the language). > >>>>> > >>>>> There is no formal proposal, this message is meant to stir the > >>> discussion. > >>>>> > >>>>> Issues to work out: > >>>>> > >>>>> * Memory lifetime issues: where Python simply associates the Py_buffer > >>>>> with a PyObject owner (a garbage-collected Python object), we need > >>>>> another means to control lifetime of pointed areas. One simple > >>>>> possibility is to include a destructor function pointer in the > >>> protocol > >>>>> struct. > >>>>> > >>>>> * Arrow type representation. We probably need some kind of "format" > >>>>> mini-language to represent Arrow types, so that a type can be > >>> described > >>>>> using a `const char*`. Ideally, primitives types at least should be > >>>>> trivially parsable. We may take inspiration from Python here > >>> (`struct` > >>>>> module format characters, PEP 3118 format additions). > >>>>> > >>>>> Example C struct definition (not a formal proposal!): > >>>>> > >>>>> struct ArrowBuffer { > >>>>> void* data; > >>>>> int64_t nbytes; > >>>>> // Called by the consumer when it doesn't need the buffer anymore > >>>>> void (*release)(struct ArrowBuffer*); > >>>>> // Opaque user data (for e.g. the release callback) > >>>>> void* user_data; > >>>>> }; > >>>>> > >>>>> struct ArrowArray { > >>>>> // Type description > >>>>> const char* format; > >>>>> // Data description > >>>>> int64_t length; > >>>>> int64_t null_count; > >>>>> int64_t n_buffers; > >>>>> // Note: this pointers are probably owned by the ArrowArray struct > >>>>> // and will be released and free()ed by the release callback. > >>>>> struct BufferDescriptor* buffers; > >>>>> struct ArrowDescriptor* dictionary; > >>>>> // Called by the consumer when it doesn't need the array anymore > >>>>> void (*release)(struct ArrowArrayDescriptor*); > >>>>> // Opaque user data (for e.g. the release callback) > >>>>> void* user_data; > >>>>> }; > >>>>> > >>>>> Thoughts? > >>>>> > >>>>> (*) For the record, the reference for the Python buffer protocol: > >>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure > >>>>> and its C struct definition: > >>>>> > >>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195 > >>>>> > >>>>> Regards > >>>>> > >>>>> Antoine. > >>>>> > >>>> > >>>> > >>>> > >>> > >> > >