* No dependency on Flatbuffers. * No buffer reassembly (data is already exposed in logical Arrow format). * Zero-copy by design. * Easy to reimplement from scratch.
I don't see how the flatbuffer pattern for data headers doesn't accomplish all of these things. At its definition, is a very simple representation of data that could be worked with independently of the flatbuffers codebase. It was designed so systems could map directly into that memory without interacting with a flatbuffers library. Specifically the following three structures were designed to already allow what I think this proposal is trying to recreate. All three are very simple to construct in a direct, non-flatbuffer dependent read/write pattern. struct FieldNode { length: long; null_count: long; } struct Buffer { offset: long; length: long; } table RecordBatch { length: long; nodes: [FieldNode]; buffers: [Buffer]; } On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <jacq...@apache.org> wrote: > I'm not clear on why we need to introduce something beyond what > flatbuffers already provides. Can someone explain that to me? I'm not > really a fan of introducing a second representation of the same data (as I > understand it). > > On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <wesmck...@gmail.com> wrote: > >> This is helpful, I will leave some comments on the proposal when I >> can, sometime in the next week. >> >> I agree that it would likely be opening a can of worms to create a >> semantic mapping between a generalized type grammar and Arrow's >> specific logical types defined in Schema.fbs. If we go down this >> route, we should probably utilize the simplest possible grammar that >> is capable of encoding the Type Flatbuffers union values. >> >> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <solip...@pitrou.net> >> wrote: >> > >> > >> > I've posted a draft specification PR here, this should help orient the >> > discussion a bit: >> > https://github.com/apache/arrow/pull/5442 >> > >> > Regards >> > >> > Antoine. >> > >> > >> > >> > On Wed, 18 Sep 2019 19:52:38 +0200 >> > Antoine Pitrou <anto...@python.org> wrote: >> > > Hello, >> > > >> > > One thing that was discussed in the sync call is the ability to easily >> > > pass arrays at runtime between Arrow implementations or >> Arrow-supporting >> > > libraries in the same process, without bearing the cost of linking to >> > > e.g. the C++ Arrow library. >> > > >> > > (for example: "Duckdb wants to provide an option to return Arrow data >> of >> > > result sets, but they don't like having Arrow as a dependency") >> > > >> > > One possibility would be to define a C-level protocol similar in >> spirit >> > > to the Python buffer protocol, which some people may be familiar with >> (*). >> > > >> > > The basic idea is to define a simple C struct, which is ABI-stable and >> > > describes an Arrow away adequately. The struct can be >> stack-allocated. >> > > Its definition can also be copied in another project (or interfaced >> with >> > > using a C FFI layer, depending on the language). >> > > >> > > There is no formal proposal, this message is meant to stir the >> discussion. >> > > >> > > Issues to work out: >> > > >> > > * Memory lifetime issues: where Python simply associates the Py_buffer >> > > with a PyObject owner (a garbage-collected Python object), we need >> > > another means to control lifetime of pointed areas. One simple >> > > possibility is to include a destructor function pointer in the >> protocol >> > > struct. >> > > >> > > * Arrow type representation. We probably need some kind of "format" >> > > mini-language to represent Arrow types, so that a type can be >> described >> > > using a `const char*`. Ideally, primitives types at least should be >> > > trivially parsable. We may take inspiration from Python here >> (`struct` >> > > module format characters, PEP 3118 format additions). >> > > >> > > Example C struct definition (not a formal proposal!): >> > > >> > > struct ArrowBuffer { >> > > void* data; >> > > int64_t nbytes; >> > > // Called by the consumer when it doesn't need the buffer anymore >> > > void (*release)(struct ArrowBuffer*); >> > > // Opaque user data (for e.g. the release callback) >> > > void* user_data; >> > > }; >> > > >> > > struct ArrowArray { >> > > // Type description >> > > const char* format; >> > > // Data description >> > > int64_t length; >> > > int64_t null_count; >> > > int64_t n_buffers; >> > > // Note: this pointers are probably owned by the ArrowArray struct >> > > // and will be released and free()ed by the release callback. >> > > struct BufferDescriptor* buffers; >> > > struct ArrowDescriptor* dictionary; >> > > // Called by the consumer when it doesn't need the array anymore >> > > void (*release)(struct ArrowArrayDescriptor*); >> > > // Opaque user data (for e.g. the release callback) >> > > void* user_data; >> > > }; >> > > >> > > Thoughts? >> > > >> > > (*) For the record, the reference for the Python buffer protocol: >> > > https://docs.python.org/3/c-api/buffer.html#buffer-structure >> > > and its C struct definition: >> > > >> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195 >> > > >> > > Regards >> > > >> > > Antoine. >> > > >> > >> > >> > >> >