Re: [DISCUSS] C-level in-process array protocol

Antoine Pitrou Mon, 30 Sep 2019 14:08:01 -0700


FlatCC is still a dependency, with generated files etc.
Perhaps you want to evaluate FlatCC on a schema-like example and see
what the generated code and compile instructions look like?


I'll point out again that the format string in my proposal uses an
extremely simple mini-format that should be very easy for any developer
to parse, even in raw C:
https://github.com/apache/arrow/blob/3806fa9ba3ddf95f0d09b865071bf19c5e912756/docs/source/format/CProtocol.rst#data-type-description----format-strings
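
To give an idea of how little code such parsing takes, here is a minimal
sketch in plain C.  The grammar it accepts is an assumption made for
illustration only ("i" for int32, "l" for int64, "g" for float64, "w:<N>"
for an N-byte fixed-width binary); the linked document is the reference for
the actual format strings.  All `sketch_` / `SK_` names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type ids, for illustration only. */
enum sketch_type {
  SK_INVALID = 0, SK_INT32, SK_INT64, SK_FLOAT64, SK_FIXED_BINARY
};

struct sketch_parsed {
  enum sketch_type type;
  int32_t width;  /* only meaningful for SK_FIXED_BINARY */
};

/* Parse a format string under the assumed grammar:
   "i" = int32, "l" = int64, "g" = float64, "w:<N>" = N-byte binary.
   Returns 0 on success, -1 on an unrecognized or malformed string. */
static int sketch_parse_format(const char* format, struct sketch_parsed* out) {
  assert(out != NULL);
  out->type = SK_INVALID;
  out->width = 0;
  if (format == NULL || format[0] == '\0') return -1;

  /* Single-character primitive types. */
  if (format[1] == '\0') {
    switch (format[0]) {
      case 'i': out->type = SK_INT32; return 0;
      case 'l': out->type = SK_INT64; return 0;
      case 'g': out->type = SK_FLOAT64; return 0;
      default: return -1;
    }
  }

  /* Width-parameterized type: "w:" followed by a positive decimal number. */
  if (format[0] == 'w' && format[1] == ':' &&
      format[2] >= '0' && format[2] <= '9') {
    char* end = NULL;
    errno = 0;
    long width = strtol(format + 2, &end, 10);
    if (errno == ERANGE || *end != '\0' || width <= 0 || width > INT32_MAX) {
      return -1;
    }
    out->type = SK_FIXED_BINARY;
    out->width = (int32_t)width;
    return 0;
  }
  return -1;
}
```

A consumer would call `sketch_parse_format(array->format, &parsed)` once per
array and then switch on `parsed.type`; nested types need no extra parsing
here because they are carried by the parent-child relationship.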

The parent-child structure in the schema is represented as-is in the
ArrowArray parent-child relationship, so it doesn't need any encoding.
Using Flatbuffers for an enum-like field + (at most) a couple of
parameters sounds like overkill.

Another possibility would be to replace the format string with
pre-parsed fields, for example:

  int32_t type;
  int32_t subtype;      // type-dependent (e.g. unit for temporal types)
  int32_t type_width;   // for width-parameterized types
  const int8_t* child_ids;   // for unions
  const char* auxiliary_type_param;  // e.g. timezone for timestamp

The downside is that there are more fields to consider (including two
optional pointers).
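
As a rough sketch of what filling in those pre-parsed fields could look
like, the snippet below groups the five fields in a struct and describes a
timestamp with a timezone.  The struct name, the numeric type and unit ids,
and the helper functions are all hypothetical; nothing in this thread
assigns concrete values to them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type and unit ids; the thread does not define any values. */
enum { SKETCH_TYPE_INT32 = 1, SKETCH_TYPE_TIMESTAMP = 2 };
enum { SKETCH_UNIT_NONE = 0, SKETCH_UNIT_SECOND = 1, SKETCH_UNIT_MILLI = 2 };

/* The five fields listed above, grouped in a struct for illustration. */
struct sketch_type_desc {
  int32_t type;
  int32_t subtype;                   /* e.g. unit for temporal types */
  int32_t type_width;                /* for width-parameterized types */
  const int8_t* child_ids;           /* for unions, else NULL */
  const char* auxiliary_type_param;  /* e.g. timezone, else NULL */
};

/* Describe a timestamp; `tz` may be NULL for a timezone-less timestamp.
   The string is borrowed, not copied. */
static struct sketch_type_desc sketch_make_timestamp(int32_t unit,
                                                     const char* tz) {
  struct sketch_type_desc d = {SKETCH_TYPE_TIMESTAMP, unit, 0, NULL, tz};
  return d;
}

/* A consumer has to check both the type id and the optional pointer. */
static int sketch_has_timezone(const struct sketch_type_desc* d) {
  assert(d != NULL);
  return d->type == SKETCH_TYPE_TIMESTAMP &&
         d->auxiliary_type_param != NULL &&
         d->auxiliary_type_param[0] != '\0';
}
```

Compared with a format string, no parsing is needed, but every consumer has
to know which of the optional fields are meaningful for a given type.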

Regards

Antoine.


On 30/09/2019 at 22:48, Ben Kietzman wrote:
> FlatCC seems germane: https://github.com/dvidelabs/flatcc
> 
> It compiles flatbuffer schemas down to (idiomatic?) C
> 
> Perhaps the schema and batch serialization problems should be solved by
> storing everything in the flatbuffer format.
> Then the results of running flatcc plus a few simple helpers can be checked
> in to provide an accessible C API.
> With respect to lifetime, Antoine has already done good work on specifying
> a move-only contract which could probably be adapted.
> 
> 
> On Sun, Sep 29, 2019 at 2:44 PM Antoine Pitrou <[email protected]> wrote:
> 
>>
>> One basic design point is to allow exchanging Arrow data with no
>> mandatory dependency (the exception is JSON and base64 if you want to
>> act on metadata - but that's highly optional, and those are extremely
>> widespread formats).  I'm afraid that Flatbuffers may be a deterrent:
>> not only does it introduce a library, but it also requires the use of a
>> compiler to produce generated code.  It also requires becoming familiar
>> with, well, Flatbuffers :-)
>>
>> We can of course discuss this and feel it's not a problem.
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 29/09/2019 at 19:47, Wes McKinney wrote:
>>> There are two pieces of serialized data needed to communicate a record
>>> batch from one library to another
>>>
>>> * Serialized schema (i.e. what's in Schema.fbs)
>>> * Serialized "data header", i.e. the "RecordBatch" message in Message.fbs
>>>
>>> You _do_ need to use a Flatbuffers library to fully create these
>>> message types to interact with any existing record batch disassembly /
>>> reassembly.
>>>
>>> I think I'm most concerned about having a new way to serialize
>>> schemas. We already have JSON-based schema serialization for
>>> integration test purposes, so one possibility is to standardize that
>>> and make it a more formalized part of the project specification.
>>>
>>> As far as a C protocol, I don't see an especial downside to using the
>>> Flatbuffers schema to communicate types.
>>>
>>> Another thought is to not deviate from the flattened
>>> Flatbuffers-styled representation but to translate the Flatbuffers
>>> types into C types: namely a C struct-based version of the
>>> "RecordBatch" message.
>>>
>>> Independent of the means to communicate the two pieces of serialized
>>> information above (respectively: schemas and record batch field memory
>>> addresses and field lengths), having a C-based FFI where projects can
>>> drop in a header file containing the ABI they are supposed to
>>> implement seems pretty reasonable to me.
>>>
>>> If we don't define a standardized in-memory FFI (whether it uses the
>>> Flatbuffers objects as inputs/outputs or not) then downstream projects
>>> will devise their own, and that will cause issues long term.
>>>
>>> On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <[email protected]>
>>> wrote:
>>>>
>>>>
>>>> On 29/09/2019 at 06:10, Jacques Nadeau wrote:
>>>>> * No dependency on Flatbuffers.
>>>>> * No buffer reassembly (data is already exposed in logical Arrow
>>>>> format).
>>>>> * Zero-copy by design.
>>>>> * Easy to reimplement from scratch.
>>>>>
>>>>> I don't see how the flatbuffer pattern for data headers doesn't
>>>>> accomplish all of these things. By its definition, it is a very simple
>>>>> representation of data that could be worked with independently of the
>>>>> flatbuffers codebase. It was designed so systems could map directly
>>>>> into that memory without interacting with a flatbuffers library.
>>>>>
>>>>> Specifically the following three structures were designed to already
>>>>> allow what I think this proposal is trying to recreate. All three are
>>>>> very simple to construct in a direct, non-flatbuffer dependent
>>>>> read/write pattern.
>>>>
>>>> Are they?  Personally, I wouldn't know how to do that.  I don't know
>>>> which encoding Flatbuffers uses, whether it's C ABI-compatible (how could
>>>> it be? if it's portable across different platforms, then it's probably
>>>> not compatible with any particular platform's C ABI, or only as a
>>>> coincidence), how I'm supposed to make use of the "offset" field, or
>>>> what the lifetime / ownership of all this data is.
>>>>
>>>> I may be missing something, but if the answer is that it's easy to
>>>> reimplement Flatbuffers' encoding without relying on the Flatbuffers
>>>> project's source code, I'm a bit skeptical.
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>>>
>>>>> struct FieldNode {
>>>>>   length: long;
>>>>>   null_count: long;
>>>>> }
>>>>>
>>>>> struct Buffer {
>>>>>   offset: long;
>>>>>   length: long;
>>>>> }
>>>>>
>>>>> table RecordBatch {
>>>>>   length: long;
>>>>>   nodes: [FieldNode];
>>>>>   buffers: [Buffer];
>>>>> }
>>>>>
>>>>> On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> I'm not clear on why we need to introduce something beyond what
>>>>>> flatbuffers already provides. Can someone explain that to me? I'm not
>>>>>> really a fan of introducing a second representation of the same data
>>>>>> (as I understand it).
>>>>>>
>>>>>> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> This is helpful, I will leave some comments on the proposal when I
>>>>>>> can, sometime in the next week.
>>>>>>>
>>>>>>> I agree that it would likely be opening a can of worms to create a
>>>>>>> semantic mapping between a generalized type grammar and Arrow's
>>>>>>> specific logical types defined in Schema.fbs. If we go down this
>>>>>>> route, we should probably utilize the simplest possible grammar that
>>>>>>> is capable of encoding the Type Flatbuffers union values.
>>>>>>>
>>>>>>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <[email protected]>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> I've posted a draft specification PR here; this should help orient
>>>>>>>> the discussion a bit:
>>>>>>>> https://github.com/apache/arrow/pull/5442
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Antoine.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, 18 Sep 2019 19:52:38 +0200
>>>>>>>> Antoine Pitrou <[email protected]> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> One thing that was discussed in the sync call is the ability to
>>>>>>>>> easily pass arrays at runtime between Arrow implementations or
>>>>>>>>> Arrow-supporting libraries in the same process, without bearing the
>>>>>>>>> cost of linking to e.g. the C++ Arrow library.
>>>>>>>>>
>>>>>>>>> (for example: "Duckdb wants to provide an option to return Arrow
>>>>>>>>> data of result sets, but they don't like having Arrow as a
>>>>>>>>> dependency")
>>>>>>>>>
>>>>>>>>> One possibility would be to define a C-level protocol similar in
>>>>>>>>> spirit to the Python buffer protocol, which some people may be
>>>>>>>>> familiar with (*).
>>>>>>>>>
>>>>>>>>> The basic idea is to define a simple C struct, which is ABI-stable
>>>>>>>>> and describes an Arrow array adequately.  The struct can be
>>>>>>>>> stack-allocated.  Its definition can also be copied into another
>>>>>>>>> project (or interfaced with using a C FFI layer, depending on the
>>>>>>>>> language).
>>>>>>>>>
>>>>>>>>> There is no formal proposal; this message is meant to stir the
>>>>>>>>> discussion.
>>>>>>>>>
>>>>>>>>> Issues to work out:
>>>>>>>>>
>>>>>>>>> * Memory lifetime issues: where Python simply associates the
>>>>>>>>> Py_buffer with a PyObject owner (a garbage-collected Python object),
>>>>>>>>> we need another means to control the lifetime of pointed-to areas.
>>>>>>>>> One simple possibility is to include a destructor function pointer
>>>>>>>>> in the protocol struct.
>>>>>>>>>
>>>>>>>>> * Arrow type representation.  We probably need some kind of "format"
>>>>>>>>> mini-language to represent Arrow types, so that a type can be
>>>>>>>>> described using a `const char*`.  Ideally, primitive types at least
>>>>>>>>> should be trivially parsable.  We may take inspiration from Python
>>>>>>>>> here (`struct` module format characters, PEP 3118 format additions).
>>>>>>>>>
>>>>>>>>> Example C struct definition (not a formal proposal!):
>>>>>>>>>
>>>>>>>>> struct ArrowBuffer {
>>>>>>>>>   void* data;
>>>>>>>>>   int64_t nbytes;
>>>>>>>>>   // Called by the consumer when it doesn't need the buffer anymore
>>>>>>>>>   void (*release)(struct ArrowBuffer*);
>>>>>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>>>>>   void* user_data;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> struct ArrowArray {
>>>>>>>>>   // Type description
>>>>>>>>>   const char* format;
>>>>>>>>>   // Data description
>>>>>>>>>   int64_t length;
>>>>>>>>>   int64_t null_count;
>>>>>>>>>   int64_t n_buffers;
>>>>>>>>>   // Note: these pointers are probably owned by the ArrowArray struct
>>>>>>>>>   // and will be released and free()ed by the release callback.
>>>>>>>>>   struct ArrowBuffer* buffers;
>>>>>>>>>   struct ArrowArray* dictionary;
>>>>>>>>>   // Called by the consumer when it doesn't need the array anymore
>>>>>>>>>   void (*release)(struct ArrowArray*);
>>>>>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>>>>>   void* user_data;
>>>>>>>>> };
>>>>>>>>>
>>>>>>>>> Thoughts?
>>>>>>>>>
>>>>>>>>> (*) For the record, the reference for the Python buffer protocol:
>>>>>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>>>>>>>>> and its C struct definition:
>>>>>>>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Antoine.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>
>
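
For reference, the release-callback idea from the first message of the
thread (quoted above) can be sketched roughly as follows.  The struct layout
is the ArrowBuffer example from that message; everything else is an
assumption made for illustration: the `sketch_` helpers are hypothetical,
and so is the convention of setting `release` to NULL to mark the struct as
released, which the message does not specify.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same layout as the ArrowBuffer example in the original message. */
struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  /* Called by the consumer when it doesn't need the buffer anymore */
  void (*release)(struct ArrowBuffer*);
  /* Opaque user data (for e.g. the release callback) */
  void* user_data;
};

/* Producer-side callback: frees the data and marks the struct released.
   Here user_data optionally points at an int counter so that a caller can
   observe how many times the callback ran (an illustrative choice). */
static void sketch_release_malloced(struct ArrowBuffer* buf) {
  assert(buf != NULL);
  if (buf->user_data != NULL) ++*(int*)buf->user_data;
  free(buf->data);
  buf->data = NULL;
  buf->nbytes = 0;
  buf->release = NULL;  /* assumed "already released" marker */
}

/* Producer: fill a caller-provided (possibly stack-allocated) struct with
   a zero-initialized buffer.  Returns 0 on success, -1 on allocation
   failure. */
static int sketch_export_zeros(int64_t nbytes, int* release_counter,
                               struct ArrowBuffer* out) {
  assert(out != NULL && nbytes > 0);
  out->data = calloc((size_t)nbytes, 1);
  if (out->data == NULL) return -1;
  out->nbytes = nbytes;
  out->release = sketch_release_malloced;
  out->user_data = release_counter;
  return 0;
}

/* Consumer: done with the buffer; the callback runs at most once. */
static void sketch_consume(struct ArrowBuffer* buf) {
  assert(buf != NULL);
  if (buf->release != NULL) buf->release(buf);
}
```

The consumer never calls free() itself: it only invokes the callback that
the producer installed, so each side can use its own allocator.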
