One basic design point is to allow exchanging Arrow data with no mandatory dependency (the exception is JSON and base64 if you want to act on metadata - but that's highly optional, and those are extremely widespread formats). I'm afraid that Flatbuffers may be a deterrent: not only it introduces a library, but it requires the use of a compiler to produce generated code. It also requires familiarizing with, well, Flatbuffers :-)
We can of course discuss this and feel it's not a problem. Regards Antoine. Le 29/09/2019 à 19:47, Wes McKinney a écrit : > There are two pieces of serialized data needed to communicate a record > batch from one library to another > > * Serialized schema (i.e. what's in Schema.fbs) > * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs > > You _do_ need to use a Flatbuffers library to fully create these > message types to interact with any existing record batch disassembly / > reassembly. > > I think I'm most concerned about having a new way to serialize > schemas. We already have JSON-based schema serialization for > integration test purposes, so one possibility is to standardize that > and make it a more formalized part of the project specification. > > As far as a C protocol, I don't see an especial downside to using the > Flatbuffers schema to communicate types. > > Another thought is to not deviate from the flattened > Flatbuffers-styled representation but to translate the Flatbuffers > types into C types: namely a C struct-based version of the > "RecordBatch" message. > > Independent of the means to communicate the two pieces of serialized > information above (respectively: schemas and record batch field memory > addresses and field lengths), having a C-based FFI where project can > drop in a header file containing the ABI they are supposed to > implement, that seems pretty reasonable to me. > > If we don't define a standardized in-memory FFI (whether it uses the > Flatbuffers objects as inputs/outputs or not) then downstream project > will devise their own, and that will cause issues long term. > > On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <anto...@python.org> wrote: >> >> >> Le 29/09/2019 à 06:10, Jacques Nadeau a écrit : >>> * No dependency on Flatbuffers. >>> * No buffer reassembly (data is already exposed in logical Arrow format). >>> * Zero-copy by design. >>> * Easy to reimplement from scratch. >>> >>> I don't see how the flatbuffer pattern for data headers doesn't accomplish >>> all of these things. At its definition, is a very simple representation of >>> data that could be worked with independently of the flatbuffers codebase. >>> It was designed so systems could map directly into that memory without >>> interacting with a flatbuffers library. >>> >>> Specifically the following three structures were designed to already allow >>> what I think this proposal is trying to recreate. All three are very simple >>> to construct in a direct, non-flatbuffer dependent read/write pattern. >> >> Are they? Personally, I wouldn't know how to do that. I don't know >> which encoding Flatbuffers use, whether it's C ABI-compatible (how could >> it be? if it's portable accross different platforms, then it's probably >> not compatible with any particular platform's C ABI, or only as a >> conincidence), how I'm supposed to make use of the "offset" field, or >> what the lifetime / ownership of all this data is. >> >> I may be missing something, but if the answer is that it's easy to >> reimplement Flatbuffers' encoding without relying on the Flatbuffers >> project's source code, I'm a bit skeptical. >> >> Regards >> >> Antoine. >> >> >>> >>> struct FieldNode { >>> length: long; >>> null_count: long; >>> } >>> >>> struct Buffer { >>> offset: long; >>> length: long; >>> } >>> >>> table RecordBatch { >>> length: long; >>> nodes: [FieldNode]; >>> buffers: [Buffer]; >>> } >>> >>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <jacq...@apache.org> wrote: >>> >>>> I'm not clear on why we need to introduce something beyond what >>>> flatbuffers already provides. Can someone explain that to me? I'm not >>>> really a fan of introducing a second representation of the same data (as I >>>> understand it). >>>> >>>> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <wesmck...@gmail.com> wrote: >>>> >>>>> This is helpful, I will leave some comments on the proposal when I >>>>> can, sometime in the next week. >>>>> >>>>> I agree that it would likely be opening a can of worms to create a >>>>> semantic mapping between a generalized type grammar and Arrow's >>>>> specific logical types defined in Schema.fbs. If we go down this >>>>> route, we should probably utilize the simplest possible grammar that >>>>> is capable of encoding the Type Flatbuffers union values. >>>>> >>>>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <solip...@pitrou.net> >>>>> wrote: >>>>>> >>>>>> >>>>>> I've posted a draft specification PR here, this should help orient the >>>>>> discussion a bit: >>>>>> https://github.com/apache/arrow/pull/5442 >>>>>> >>>>>> Regards >>>>>> >>>>>> Antoine. >>>>>> >>>>>> >>>>>> >>>>>> On Wed, 18 Sep 2019 19:52:38 +0200 >>>>>> Antoine Pitrou <anto...@python.org> wrote: >>>>>>> Hello, >>>>>>> >>>>>>> One thing that was discussed in the sync call is the ability to easily >>>>>>> pass arrays at runtime between Arrow implementations or >>>>> Arrow-supporting >>>>>>> libraries in the same process, without bearing the cost of linking to >>>>>>> e.g. the C++ Arrow library. >>>>>>> >>>>>>> (for example: "Duckdb wants to provide an option to return Arrow data >>>>> of >>>>>>> result sets, but they don't like having Arrow as a dependency") >>>>>>> >>>>>>> One possibility would be to define a C-level protocol similar in >>>>> spirit >>>>>>> to the Python buffer protocol, which some people may be familiar with >>>>> (*). >>>>>>> >>>>>>> The basic idea is to define a simple C struct, which is ABI-stable and >>>>>>> describes an Arrow away adequately. The struct can be >>>>> stack-allocated. >>>>>>> Its definition can also be copied in another project (or interfaced >>>>> with >>>>>>> using a C FFI layer, depending on the language). >>>>>>> >>>>>>> There is no formal proposal, this message is meant to stir the >>>>> discussion. >>>>>>> >>>>>>> Issues to work out: >>>>>>> >>>>>>> * Memory lifetime issues: where Python simply associates the Py_buffer >>>>>>> with a PyObject owner (a garbage-collected Python object), we need >>>>>>> another means to control lifetime of pointed areas. One simple >>>>>>> possibility is to include a destructor function pointer in the >>>>> protocol >>>>>>> struct. >>>>>>> >>>>>>> * Arrow type representation. We probably need some kind of "format" >>>>>>> mini-language to represent Arrow types, so that a type can be >>>>> described >>>>>>> using a `const char*`. Ideally, primitives types at least should be >>>>>>> trivially parsable. We may take inspiration from Python here >>>>> (`struct` >>>>>>> module format characters, PEP 3118 format additions). >>>>>>> >>>>>>> Example C struct definition (not a formal proposal!): >>>>>>> >>>>>>> struct ArrowBuffer { >>>>>>> void* data; >>>>>>> int64_t nbytes; >>>>>>> // Called by the consumer when it doesn't need the buffer anymore >>>>>>> void (*release)(struct ArrowBuffer*); >>>>>>> // Opaque user data (for e.g. the release callback) >>>>>>> void* user_data; >>>>>>> }; >>>>>>> >>>>>>> struct ArrowArray { >>>>>>> // Type description >>>>>>> const char* format; >>>>>>> // Data description >>>>>>> int64_t length; >>>>>>> int64_t null_count; >>>>>>> int64_t n_buffers; >>>>>>> // Note: this pointers are probably owned by the ArrowArray struct >>>>>>> // and will be released and free()ed by the release callback. >>>>>>> struct BufferDescriptor* buffers; >>>>>>> struct ArrowDescriptor* dictionary; >>>>>>> // Called by the consumer when it doesn't need the array anymore >>>>>>> void (*release)(struct ArrowArrayDescriptor*); >>>>>>> // Opaque user data (for e.g. the release callback) >>>>>>> void* user_data; >>>>>>> }; >>>>>>> >>>>>>> Thoughts? >>>>>>> >>>>>>> (*) For the record, the reference for the Python buffer protocol: >>>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure >>>>>>> and its C struct definition: >>>>>>> >>>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195 >>>>>>> >>>>>>> Regards >>>>>>> >>>>>>> Antoine. >>>>>>> >>>>>> >>>>>> >>>>>> >>>>> >>>> >>>