FlatCC is still a dependency, with generated files etc. Perhaps you want to evaluate FlatCC on a schema-like example and see what the generated code and compile instructions look like?
I'll point out again that the format string in my proposal uses an
extremely simple mini-format that should be very easy for any developer
to parse, even in raw C:
https://github.com/apache/arrow/blob/3806fa9ba3ddf95f0d09b865071bf19c5e912756/docs/source/format/CProtocol.rst#data-type-description----format-strings

The parent-child structure in the schema is represented as-is in the
ArrowArray parent-child relationship, so it doesn't need any encoding.
Using Flatbuffers for an enum-like field + (at most) a couple of
parameters sounds like overkill.

Another possibility would be to replace the format string with
pre-parsed fields, for example:

  int32_t type;
  int32_t subtype;  // type-dependent (e.g. unit for temporal types)
  int32_t type_width;  // for width-parameterized types
  const int8_t* child_ids;  // for unions
  const char* auxiliary_type_param;  // e.g. timezone for timestamp

The downside is that there are more fields to consider (including two
optional pointers).

Regards

Antoine.


On 30/09/2019 at 22:48, Ben Kietzman wrote:
> FlatCC seems germane: https://github.com/dvidelabs/flatcc
>
> It compiles flatbuffer schemas down to (idiomatic?) C
>
> Perhaps the schema and batch serialization problems should be solved by
> storing everything in the flatbuffer format.
> Then the results of running flatcc plus a few simple helpers can be checked
> in to provide an accessible C API.
> With respect to lifetime, Antoine has already done good work on specifying
> a move-only contract which could probably be adapted.
>
>
> On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou <[email protected]> wrote:
>
>>
>> One basic design point is to allow exchanging Arrow data with no
>> mandatory dependency (the exception is JSON and base64 if you want to
>> act on metadata - but that's highly optional, and those are extremely
>> widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
>> not only does it introduce a library, but it also requires the use of a
>> compiler to produce generated code.
>> It also requires familiarizing oneself with, well,
>> Flatbuffers :-)
>>
>> We can of course discuss this and conclude it's not a problem.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 29/09/2019 at 19:47, Wes McKinney wrote:
>>> There are two pieces of serialized data needed to communicate a record
>>> batch from one library to another
>>>
>>> * Serialized schema (i.e. what's in Schema.fbs)
>>> * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs
>>>
>>> You _do_ need to use a Flatbuffers library to fully create these
>>> message types to interact with any existing record batch disassembly /
>>> reassembly.
>>>
>>> I think I'm most concerned about having a new way to serialize
>>> schemas. We already have JSON-based schema serialization for
>>> integration test purposes, so one possibility is to standardize that
>>> and make it a more formalized part of the project specification.
>>>
>>> As far as a C protocol, I don't see an especial downside to using the
>>> Flatbuffers schema to communicate types.
>>>
>>> Another thought is to not deviate from the flattened
>>> Flatbuffers-styled representation but to translate the Flatbuffers
>>> types into C types: namely a C struct-based version of the
>>> "RecordBatch" message.
>>>
>>> Independent of the means to communicate the two pieces of serialized
>>> information above (respectively: schemas and record batch field memory
>>> addresses and field lengths), having a C-based FFI where projects can
>>> drop in a header file containing the ABI they are supposed to
>>> implement, that seems pretty reasonable to me.
>>>
>>> If we don't define a standardized in-memory FFI (whether it uses the
>>> Flatbuffers objects as inputs/outputs or not) then downstream projects
>>> will devise their own, and that will cause issues long term.
>>>
>>> On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <[email protected]> wrote:
>>>>
>>>>
>>>> On 29/09/2019 at 06:10, Jacques Nadeau wrote:
>>>>> * No dependency on Flatbuffers.
>>>>> * No buffer reassembly (data is already exposed in logical Arrow format).
>>>>> * Zero-copy by design.
>>>>> * Easy to reimplement from scratch.
>>>>>
>>>>> I don't see how the flatbuffer pattern for data headers doesn't accomplish
>>>>> all of these things. At its definition, it is a very simple representation of
>>>>> data that could be worked with independently of the flatbuffers codebase.
>>>>> It was designed so systems could map directly into that memory without
>>>>> interacting with a flatbuffers library.
>>>>>
>>>>> Specifically the following three structures were designed to already allow
>>>>> what I think this proposal is trying to recreate. All three are very simple
>>>>> to construct in a direct, non-flatbuffer dependent read/write pattern.
>>>>
>>>> Are they?  Personally, I wouldn't know how to do that.  I don't know
>>>> which encoding Flatbuffers use, whether it's C ABI-compatible (how could
>>>> it be? if it's portable across different platforms, then it's probably
>>>> not compatible with any particular platform's C ABI, or only as a
>>>> coincidence), how I'm supposed to make use of the "offset" field, or
>>>> what the lifetime / ownership of all this data is.
>>>>
>>>> I may be missing something, but if the answer is that it's easy to
>>>> reimplement Flatbuffers' encoding without relying on the Flatbuffers
>>>> project's source code, I'm a bit skeptical.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>>>
>>>>> struct FieldNode {
>>>>>   length: long;
>>>>>   null_count: long;
>>>>> }
>>>>>
>>>>> struct Buffer {
>>>>>   offset: long;
>>>>>   length: long;
>>>>> }
>>>>>
>>>>> table RecordBatch {
>>>>>   length: long;
>>>>>   nodes: [FieldNode];
>>>>>   buffers: [Buffer];
>>>>> }
>>>>>
>>>>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <[email protected]> wrote:
>>>>>
>>>>>> I'm not clear on why we need to introduce something beyond what
>>>>>> flatbuffers already provides. Can someone explain that to me?
>>>>>> I'm not really a fan of introducing a second representation of the same
>>>>>> data (as I understand it).
>>>>>>
>>>>>> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <[email protected]> wrote:
>>>>>>
>>>>>>> This is helpful, I will leave some comments on the proposal when I
>>>>>>> can, sometime in the next week.
>>>>>>>
>>>>>>> I agree that it would likely be opening a can of worms to create a
>>>>>>> semantic mapping between a generalized type grammar and Arrow's
>>>>>>> specific logical types defined in Schema.fbs. If we go down this
>>>>>>> route, we should probably utilize the simplest possible grammar that
>>>>>>> is capable of encoding the Type Flatbuffers union values.
>>>>>>>
>>>>>>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I've posted a draft specification PR here, this should help orient the
>>>>>>>> discussion a bit:
>>>>>>>> https://github.com/apache/arrow/pull/5442
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Antoine.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 18 Sep 2019 19:52:38 +0200
>>>>>>>> Antoine Pitrou <[email protected]> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> One thing that was discussed in the sync call is the ability to easily
>>>>>>>>> pass arrays at runtime between Arrow implementations or Arrow-supporting
>>>>>>>>> libraries in the same process, without bearing the cost of linking to
>>>>>>>>> e.g. the C++ Arrow library.
>>>>>>>>>
>>>>>>>>> (for example: "Duckdb wants to provide an option to return Arrow data of
>>>>>>>>> result sets, but they don't like having Arrow as a dependency")
>>>>>>>>>
>>>>>>>>> One possibility would be to define a C-level protocol similar in spirit
>>>>>>>>> to the Python buffer protocol, which some people may be familiar with (*).
>>>>>>>>>
>>>>>>>>> The basic idea is to define a simple C struct, which is ABI-stable and
>>>>>>>>> describes an Arrow array adequately.
>>>>>>>>> The struct can be stack-allocated.
>>>>>>>>> Its definition can also be copied in another project (or interfaced with
>>>>>>>>> using a C FFI layer, depending on the language).
>>>>>>>>>
>>>>>>>>> There is no formal proposal, this message is meant to stir the discussion.
>>>>>>>>>
>>>>>>>>> Issues to work out:
>>>>>>>>>
>>>>>>>>> * Memory lifetime issues: where Python simply associates the Py_buffer
>>>>>>>>> with a PyObject owner (a garbage-collected Python object), we need
>>>>>>>>> another means to control the lifetime of the pointed-to areas.  One simple
>>>>>>>>> possibility is to include a destructor function pointer in the protocol
>>>>>>>>> struct.
>>>>>>>>>
>>>>>>>>> * Arrow type representation.  We probably need some kind of "format"
>>>>>>>>> mini-language to represent Arrow types, so that a type can be described
>>>>>>>>> using a `const char*`.  Ideally, primitive types at least should be
>>>>>>>>> trivially parsable.  We may take inspiration from Python here (`struct`
>>>>>>>>> module format characters, PEP 3118 format additions).
>>>>>>>>>
>>>>>>>>> Example C struct definition (not a formal proposal!):
>>>>>>>>>
>>>>>>>>> struct ArrowBuffer {
>>>>>>>>>   void* data;
>>>>>>>>>   int64_t nbytes;
>>>>>>>>>   // Called by the consumer when it doesn't need the buffer anymore
>>>>>>>>>   void (*release)(struct ArrowBuffer*);
>>>>>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>>>>>   void* user_data;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> struct ArrowArray {
>>>>>>>>>   // Type description
>>>>>>>>>   const char* format;
>>>>>>>>>   // Data description
>>>>>>>>>   int64_t length;
>>>>>>>>>   int64_t null_count;
>>>>>>>>>   int64_t n_buffers;
>>>>>>>>>   // Note: these pointers are probably owned by the ArrowArray struct
>>>>>>>>>   // and will be released and free()ed by the release callback.
>>>>>>>>>   struct ArrowBuffer* buffers;
>>>>>>>>>   struct ArrowArray* dictionary;
>>>>>>>>>   // Called by the consumer when it doesn't need the array anymore
>>>>>>>>>   void (*release)(struct ArrowArray*);
>>>>>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>>>>>   void* user_data;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> (*) For the record, the reference for the Python buffer protocol:
>>>>>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>>>>>>>>> and its C struct definition:
>>>>>>>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Antoine.
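
To illustrate the claim at the top of this message that the format string can be parsed in raw C without any dependency, here is a minimal sketch. It is not taken from the proposal: the names `sketch_type`, `sketch_parsed` and `sketch_parse_format` are hypothetical, and the handful of format codes it accepts (`i`, `l`, `g`, `w:<width>`, `tss:<timezone>`) are assumptions chosen for illustration, to be checked against the linked CProtocol.rst draft rather than relied upon.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical type tags, for this sketch only. */
enum sketch_type {
  SKETCH_INVALID = 0,
  SKETCH_INT32,
  SKETCH_INT64,
  SKETCH_FLOAT64,
  SKETCH_FIXED_BINARY,
  SKETCH_TIMESTAMP_SEC
};

struct sketch_parsed {
  enum sketch_type type;
  int width;         /* byte width, only set for SKETCH_FIXED_BINARY */
  const char* param; /* borrowed pointer into the format string, or NULL */
};

/* Parses an assumed subset of the format mini-language.
 * Returns 0 on success, -1 if the string is not recognized. */
static int sketch_parse_format(const char* fmt, struct sketch_parsed* out) {
  assert(out != NULL);
  out->type = SKETCH_INVALID;
  out->width = 0;
  out->param = NULL;
  if (fmt == NULL || fmt[0] == '\0') return -1;

  /* Assumed single-character codes for primitive types. */
  if (fmt[1] == '\0') {
    switch (fmt[0]) {
      case 'i': out->type = SKETCH_INT32; return 0;
      case 'l': out->type = SKETCH_INT64; return 0;
      case 'g': out->type = SKETCH_FLOAT64; return 0;
      default: return -1;
    }
  }

  /* Assumed "w:<n>" form: fixed-width binary of n bytes. */
  if (fmt[0] == 'w' && fmt[1] == ':') {
    char* end = NULL;
    long n;
    if (fmt[2] < '0' || fmt[2] > '9') return -1;
    n = strtol(fmt + 2, &end, 10);
    if (*end != '\0' || n <= 0 || n > INT_MAX) return -1;
    out->type = SKETCH_FIXED_BINARY;
    out->width = (int)n;
    return 0;
  }

  /* Assumed "tss:<tz>" form: timestamp in seconds with a timezone string. */
  if (strncmp(fmt, "tss:", 4) == 0) {
    out->type = SKETCH_TIMESTAMP_SEC;
    out->param = fmt + 4;
    return 0;
  }

  return -1;
}
```

A consumer would call `sketch_parse_format(array->format, &parsed)` and switch on `parsed.type`. The parameter pointer is borrowed from the format string, so it stays valid only as long as the producer keeps the array alive.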
