Re: [DISCUSS] C-level in-process array protocol

Antoine Pitrou Sun, 29 Sep 2019 11:45:20 -0700


One basic design point is to allow exchanging Arrow data with no
mandatory dependency (the exception is JSON and base64 if you want to
act on metadata - but that's highly optional, and those are extremely
widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
not only does it introduce a library, but it also requires the use of a
compiler to produce generated code.  It also requires familiarizing
oneself with, well, Flatbuffers :-)


We can of course discuss this and decide it's not a problem.

Regards

Antoine.


Le 29/09/2019 à 19:47, Wes McKinney a écrit :
> There are two pieces of serialized data needed to communicate a record
> batch from one library to another
> 
> * Serialized schema (i.e. what's in Schema.fbs)
> * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs
> 
> You _do_ need to use a Flatbuffers library to fully create these
> message types to interact with any existing record batch disassembly /
> reassembly.
> 
> I think I'm most concerned about having a new way to serialize
> schemas. We already have JSON-based schema serialization for
> integration test purposes, so one possibility is to standardize that
> and make it a more formalized part of the project specification.
> 
> As far as a C protocol, I don't see an especial downside to using the
> Flatbuffers schema to communicate types.
> 
> Another thought is to not deviate from the flattened
> Flatbuffers-styled representation but to translate the Flatbuffers
> types into C types: namely a C struct-based version of the
> "RecordBatch" message.
> 
> Independent of the means to communicate the two pieces of serialized
> information above (respectively: schemas and record batch field memory
> addresses and field lengths), having a C-based FFI where projects can
> drop in a header file containing the ABI they are supposed to
> implement seems pretty reasonable to me.
> 
> If we don't define a standardized in-memory FFI (whether it uses the
> Flatbuffers objects as inputs/outputs or not) then downstream projects
> will devise their own, and that will cause issues long term.
> 
> On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <[email protected]> wrote:
>>
>>
>> Le 29/09/2019 à 06:10, Jacques Nadeau a écrit :
>>> * No dependency on Flatbuffers.
>>> * No buffer reassembly (data is already exposed in logical Arrow format).
>>> * Zero-copy by design.
>>> * Easy to reimplement from scratch.
>>>
>>> I don't see how the flatbuffer pattern for data headers doesn't accomplish
>>> all of these things. By its definition, it is a very simple representation of
>>> data that could be worked with independently of the flatbuffers codebase.
>>> It was designed so systems could map directly into that memory without
>>> interacting with a flatbuffers library.
>>>
>>> Specifically the following three structures were designed to already allow
>>> what I think this proposal is trying to recreate. All three are very simple
>>> to construct in a direct, non-flatbuffer dependent read/write pattern.
>>
>> Are they?  Personally, I wouldn't know how to do that.  I don't know
>> which encoding Flatbuffers uses, whether it's C ABI-compatible (how could
>> it be? if it's portable across different platforms, then it's probably
>> not compatible with any particular platform's C ABI, or only as a
>> coincidence), how I'm supposed to make use of the "offset" field, or
>> what the lifetime / ownership of all this data is.
>>
>> I may be missing something, but if the answer is that it's easy to
>> reimplement Flatbuffers' encoding without relying on the Flatbuffers
>> project's source code, I'm a bit skeptical.
>>
>> Regards
>>
>> Antoine.
>>
>>
>>>
>>> struct FieldNode {
>>>   length: long;
>>>   null_count: long;
>>> }
>>>
>>> struct Buffer {
>>>   offset: long;
>>>   length: long;
>>> }
>>>
>>> table RecordBatch {
>>>   length: long;
>>>   nodes: [FieldNode];
>>>   buffers: [Buffer];
>>> }
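
As a rough illustration of the "C struct-based version of the RecordBatch message" idea raised in this thread, the three Flatbuffers definitions quoted above could be mirrored in plain C along these lines. This is only a sketch under stated assumptions: the names `CFieldNode`, `CBuffer`, `CRecordBatch` and the helper `c_record_batch_total_bytes` are hypothetical and not part of any Arrow specification, Flatbuffers `long` is assumed to map to `int64_t`, and each vector is represented as a pointer plus an element count. It says nothing about ownership or about what base address `offset` is relative to, which are exactly the open questions in the discussion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C mirror of the Flatbuffers FieldNode struct. */
struct CFieldNode {
  int64_t length;
  int64_t null_count;
};

/* Hypothetical C mirror of the Flatbuffers Buffer struct.
 * The base that `offset` is relative to is left unspecified here. */
struct CBuffer {
  int64_t offset;
  int64_t length;
};

/* Hypothetical C mirror of the RecordBatch table: the two Flatbuffers
 * vectors become (count, pointer) pairs. */
struct CRecordBatch {
  int64_t length;
  int64_t n_nodes;
  const struct CFieldNode* nodes;
  int64_t n_buffers;
  const struct CBuffer* buffers;
};

/* Sums the lengths of all buffers in the batch.
 * Returns -1 for a NULL batch, a missing buffer array, or a negative length. */
int64_t c_record_batch_total_bytes(const struct CRecordBatch* batch) {
  if (batch == NULL || batch->n_buffers < 0) return -1;
  if (batch->n_buffers > 0 && batch->buffers == NULL) return -1;
  int64_t total = 0;
  for (int64_t i = 0; i < batch->n_buffers; ++i) {
    if (batch->buffers[i].length < 0) return -1;
    total += batch->buffers[i].length;
  }
  return total;
}
```

A consumer could call `c_record_batch_total_bytes(&batch)` to sanity-check a batch before touching any memory; the Flatbuffers wire encoding itself is not reproduced by these structs.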
>>>
>>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <[email protected]> wrote:
>>>
>>>> I'm not clear on why we need to introduce something beyond what
>>>> flatbuffers already provides. Can someone explain that to me? I'm not
>>>> really a fan of introducing a second representation of the same data (as I
>>>> understand it).
>>>>
>>>> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <[email protected]> wrote:
>>>>
>>>>> This is helpful, I will leave some comments on the proposal when I
>>>>> can, sometime in the next week.
>>>>>
>>>>> I agree that it would likely be opening a can of worms to create a
>>>>> semantic mapping between a generalized type grammar and Arrow's
>>>>> specific logical types defined in Schema.fbs. If we go down this
>>>>> route, we should probably utilize the simplest possible grammar that
>>>>> is capable of encoding the Type Flatbuffers union values.
>>>>>
>>>>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>> I've posted a draft specification PR here, this should help orient the
>>>>>> discussion a bit:
>>>>>> https://github.com/apache/arrow/pull/5442
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, 18 Sep 2019 19:52:38 +0200
>>>>>> Antoine Pitrou <[email protected]> wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> One thing that was discussed in the sync call is the ability to easily
>>>>>>> pass arrays at runtime between Arrow implementations or Arrow-supporting
>>>>>>> libraries in the same process, without bearing the cost of linking to
>>>>>>> e.g. the C++ Arrow library.
>>>>>>>
>>>>>>> (for example: "Duckdb wants to provide an option to return Arrow data of
>>>>>>> result sets, but they don't like having Arrow as a dependency")
>>>>>>>
>>>>>>> One possibility would be to define a C-level protocol similar in spirit
>>>>>>> to the Python buffer protocol, which some people may be familiar with (*).
>>>>>>>
>>>>>>> The basic idea is to define a simple C struct, which is ABI-stable and
>>>>>>> describes an Arrow array adequately.  The struct can be stack-allocated.
>>>>>>> Its definition can also be copied in another project (or interfaced with
>>>>>>> using a C FFI layer, depending on the language).
>>>>>>>
>>>>>>> There is no formal proposal, this message is meant to stir the discussion.
>>>>>>>
>>>>>>> Issues to work out:
>>>>>>>
>>>>>>> * Memory lifetime issues: where Python simply associates the Py_buffer
>>>>>>> with a PyObject owner (a garbage-collected Python object), we need
>>>>>>> another means to control the lifetime of the pointed-to memory areas.
>>>>>>> One simple possibility is to include a destructor function pointer in
>>>>>>> the protocol struct.
>>>>>>>
>>>>>>> * Arrow type representation.  We probably need some kind of "format"
>>>>>>> mini-language to represent Arrow types, so that a type can be described
>>>>>>> using a `const char*`.  Ideally, primitive types at least should be
>>>>>>> trivially parsable.  We may take inspiration from Python here (`struct`
>>>>>>> module format characters, PEP 3118 format additions).
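
To make the "trivially parsable" goal concrete, here is a minimal sketch of what parsing a primitive type from a `const char*` format string might look like. It is an assumption for illustration, not the format language the thread settles on: the function name `primitive_format_width` is hypothetical, and the character table simply borrows a few standard-size codes from Python's `struct` module (`b`/`B`, `h`/`H`, `i`/`I`, `q`/`Q`, `f`, `d`).

```c
#include <stddef.h>

/* Returns the element width in bytes for a single-character primitive
 * format string, or -1 if the string is NULL, empty, longer than one
 * character, or not in the (illustrative) table below. */
int primitive_format_width(const char* format) {
  if (format == NULL || format[0] == '\0' || format[1] != '\0') return -1;
  switch (format[0]) {
    case 'b': case 'B':            /* 8-bit signed / unsigned integer */
      return 1;
    case 'h': case 'H':            /* 16-bit signed / unsigned integer */
      return 2;
    case 'i': case 'I': case 'f':  /* 32-bit integers, 32-bit float */
      return 4;
    case 'q': case 'Q': case 'd':  /* 64-bit integers, 64-bit float */
      return 8;
    default:
      return -1;
  }
}
```

With such a table, a consumer that only handles fixed-width primitives can reject anything else with a single `switch`, and nested or parametrized types would need a richer grammar on top.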
>>>>>>>
>>>>>>> Example C struct definition (not a formal proposal!):
>>>>>>>
>>>>>>> struct ArrowBuffer {
>>>>>>>   void* data;
>>>>>>>   int64_t nbytes;
>>>>>>>   // Called by the consumer when it doesn't need the buffer anymore
>>>>>>>   void (*release)(struct ArrowBuffer*);
>>>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>>>   void* user_data;
>>>>>>> };
>>>>>>>
>>>>>>> struct ArrowArray {
>>>>>>>   // Type description
>>>>>>>   const char* format;
>>>>>>>   // Data description
>>>>>>>   int64_t length;
>>>>>>>   int64_t null_count;
>>>>>>>   int64_t n_buffers;
>>>>>>>   // Note: these pointers are probably owned by the ArrowArray struct
>>>>>>>   // and will be released and free()ed by the release callback.
>>>>>>>   struct ArrowBuffer* buffers;
>>>>>>>   struct ArrowArray* dictionary;
>>>>>>>   // Called by the consumer when it doesn't need the array anymore
>>>>>>>   void (*release)(struct ArrowArray*);
>>>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>>>   void* user_data;
>>>>>>> };
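
The release-callback idea can be sketched end to end with the `ArrowBuffer` struct quoted above. Everything beyond the struct layout is an assumption made for illustration: the helpers `make_heap_buffer`, `release_heap_buffer` and `consume_sum_int32` are hypothetical names, the producer is assumed to own a `malloc()`ed copy of the data, `user_data` is used here only as an optional release counter, and setting `release` to NULL to mark a released struct is a convention chosen for this sketch rather than something the message specifies.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Same layout as the ArrowBuffer sketch in the quoted message. */
struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  void (*release)(struct ArrowBuffer*);
  void* user_data;
};

/* Hypothetical producer-side callback: frees the heap data, bumps the
 * optional counter stored in user_data, and marks the struct released. */
void release_heap_buffer(struct ArrowBuffer* buffer) {
  if (buffer == NULL || buffer->release == NULL) return; /* already released */
  free(buffer->data);
  if (buffer->user_data != NULL) ++*(int*)buffer->user_data;
  buffer->data = NULL;
  buffer->nbytes = 0;
  buffer->release = NULL;
}

/* Hypothetical producer: fills a caller-provided (possibly stack-allocated)
 * struct with a heap copy of src. Returns 0 on success, -1 on failure. */
int make_heap_buffer(struct ArrowBuffer* out, const void* src,
                     int64_t nbytes, int* release_counter) {
  if (out == NULL || nbytes < 0 || (src == NULL && nbytes > 0)) return -1;
  void* data = malloc(nbytes > 0 ? (size_t)nbytes : 1);
  if (data == NULL) return -1;
  if (nbytes > 0) memcpy(data, src, (size_t)nbytes);
  out->data = data;
  out->nbytes = nbytes;
  out->release = release_heap_buffer;
  out->user_data = release_counter;
  return 0;
}

/* Hypothetical consumer: reads the buffer as int32 values, then tells the
 * producer it is done by invoking the release callback. */
int64_t consume_sum_int32(struct ArrowBuffer* buffer) {
  const int32_t* values = (const int32_t*)buffer->data;
  int64_t n = buffer->nbytes / (int64_t)sizeof(int32_t);
  int64_t sum = 0;
  for (int64_t i = 0; i < n; ++i) sum += values[i];
  if (buffer->release != NULL) buffer->release(buffer);
  return sum;
}
```

The consumer never needs to know how the memory was allocated; it only calls `release`, which is the property that lets the two sides live in different libraries or languages.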
>>>>>>>
>>>>>>> Thoughts?
>>>>>>>
>>>>>>> (*) For the record, the reference for the Python buffer protocol:
>>>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>>>>>>> and its C struct definition:
>>>>>>>
>>>>>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> Antoine.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
