Re: [DISCUSS] C-level in-process array protocol

Wes McKinney Sun, 29 Sep 2019 11:12:02 -0700

There are two pieces of serialized data needed to communicate a record
batch from one library to another


* Serialized schema (i.e. what's in Schema.fbs)
* Serialized "data header", i.e. the "RecordBatch" message in Message.fbs

You _do_ need to use a Flatbuffers library to fully create these
message types to interact with any existing record batch disassembly /
reassembly.

I think I'm most concerned about having a new way to serialize
schemas. We already have JSON-based schema serialization for
integration test purposes, so one possibility is to standardize that
and make it a more formalized part of the project specification.

As far as a C protocol, I don't see an especial downside to using the
Flatbuffers schema to communicate types.

Another thought is to not deviate from the flattened
Flatbuffers-styled representation but to translate the Flatbuffers
types into C types: namely a C struct-based version of the
"RecordBatch" message.

Independent of the means to communicate the two pieces of serialized
information above (respectively: schemas and record batch field memory
addresses and field lengths), having a C-based FFI where project can
drop in a header file containing the ABI they are supposed to
implement, that seems pretty reasonable to me.

If we don't define a standardized in-memory FFI (whether it uses the
Flatbuffers objects as inputs/outputs or not) then downstream project
will devise their own, and that will cause issues long term.

On Sun, Sep 29, 2019 at 2:59 AM Antoine Pitrou <[email protected]> wrote:
>
>
> Le 29/09/2019 à 06:10, Jacques Nadeau a écrit :
> > * No dependency on Flatbuffers.
> > * No buffer reassembly (data is already exposed in logical Arrow format).
> > * Zero-copy by design.
> > * Easy to reimplement from scratch.
> >
> > I don't see how the flatbuffer pattern for data headers doesn't accomplish
> > all of these things. At its definition, is a very simple representation of
> > data that could be worked with independently of the flatbuffers codebase.
> > It was designed so systems could map directly into that memory without
> > interacting with a flatbuffers library.
> >
> > Specifically the following three structures were designed to already allow
> > what I think this proposal is trying to recreate. All three are very simple
> > to construct in a direct, non-flatbuffer dependent read/write pattern.
>
> Are they?  Personally, I wouldn't know how to do that.  I don't know
> which encoding Flatbuffers use, whether it's C ABI-compatible (how could
> it be? if it's portable accross different platforms, then it's probably
> not compatible with any particular platform's C ABI, or only as a
> conincidence), how I'm supposed to make use of the "offset" field, or
> what the lifetime / ownership of all this data is.
>
> I may be missing something, but if the answer is that it's easy to
> reimplement Flatbuffers' encoding without relying on the Flatbuffers
> project's source code, I'm a bit skeptical.
>
> Regards
>
> Antoine.
>
>
> >
> > struct FieldNode {
> >   length: long;
> >   null_count: long;
> > }
> >
> > struct Buffer {
> >   offset: long;
> >   length: long;
> > }
> >
> > table RecordBatch {
> >   length: long;
> >   nodes: [FieldNode];
> >   buffers: [Buffer];
> > }
> >
> > On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <[email protected]> wrote:
> >
> >> I'm not clear on why we need to introduce something beyond what
> >> flatbuffers already provides. Can someone explain that to me? I'm not
> >> really a fan of introducing a second representation of the same data (as I
> >> understand it).
> >>
> >> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <[email protected]> wrote:
> >>
> >>> This is helpful, I will leave some comments on the proposal when I
> >>> can, sometime in the next week.
> >>>
> >>> I agree that it would likely be opening a can of worms to create a
> >>> semantic mapping between a generalized type grammar and Arrow's
> >>> specific logical types defined in Schema.fbs. If we go down this
> >>> route, we should probably utilize the simplest possible grammar that
> >>> is capable of encoding the Type Flatbuffers union values.
> >>>
> >>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <[email protected]>
> >>> wrote:
> >>>>
> >>>>
> >>>> I've posted a draft specification PR here, this should help orient the
> >>>> discussion a bit:
> >>>> https://github.com/apache/arrow/pull/5442
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>>
> >>>>
> >>>> On Wed, 18 Sep 2019 19:52:38 +0200
> >>>> Antoine Pitrou <[email protected]> wrote:
> >>>>> Hello,
> >>>>>
> >>>>> One thing that was discussed in the sync call is the ability to easily
> >>>>> pass arrays at runtime between Arrow implementations or
> >>> Arrow-supporting
> >>>>> libraries in the same process, without bearing the cost of linking to
> >>>>> e.g. the C++ Arrow library.
> >>>>>
> >>>>> (for example: "Duckdb wants to provide an option to return Arrow data
> >>> of
> >>>>> result sets, but they don't like having Arrow as a dependency")
> >>>>>
> >>>>> One possibility would be to define a C-level protocol similar in
> >>> spirit
> >>>>> to the Python buffer protocol, which some people may be familiar with
> >>> (*).
> >>>>>
> >>>>> The basic idea is to define a simple C struct, which is ABI-stable and
> >>>>> describes an Arrow away adequately.  The struct can be
> >>> stack-allocated.
> >>>>> Its definition can also be copied in another project (or interfaced
> >>> with
> >>>>> using a C FFI layer, depending on the language).
> >>>>>
> >>>>> There is no formal proposal, this message is meant to stir the
> >>> discussion.
> >>>>>
> >>>>> Issues to work out:
> >>>>>
> >>>>> * Memory lifetime issues: where Python simply associates the Py_buffer
> >>>>> with a PyObject owner (a garbage-collected Python object), we need
> >>>>> another means to control lifetime of pointed areas.  One simple
> >>>>> possibility is to include a destructor function pointer in the
> >>> protocol
> >>>>> struct.
> >>>>>
> >>>>> * Arrow type representation.  We probably need some kind of "format"
> >>>>> mini-language to represent Arrow types, so that a type can be
> >>> described
> >>>>> using a `const char*`.  Ideally, primitives types at least should be
> >>>>> trivially parsable.  We may take inspiration from Python here
> >>> (`struct`
> >>>>> module format characters, PEP 3118 format additions).
> >>>>>
> >>>>> Example C struct definition (not a formal proposal!):
> >>>>>
> >>>>> struct ArrowBuffer {
> >>>>>   void* data;
> >>>>>   int64_t nbytes;
> >>>>>   // Called by the consumer when it doesn't need the buffer anymore
> >>>>>   void (*release)(struct ArrowBuffer*);
> >>>>>   // Opaque user data (for e.g. the release callback)
> >>>>>   void* user_data;
> >>>>> };
> >>>>>
> >>>>> struct ArrowArray {
> >>>>>   // Type description
> >>>>>   const char* format;
> >>>>>   // Data description
> >>>>>   int64_t length;
> >>>>>   int64_t null_count;
> >>>>>   int64_t n_buffers;
> >>>>>   // Note: this pointers are probably owned by the ArrowArray struct
> >>>>>   // and will be released and free()ed by the release callback.
> >>>>>   struct BufferDescriptor* buffers;
> >>>>>   struct ArrowDescriptor* dictionary;
> >>>>>   // Called by the consumer when it doesn't need the array anymore
> >>>>>   void (*release)(struct ArrowArrayDescriptor*);
> >>>>>   // Opaque user data (for e.g. the release callback)
> >>>>>   void* user_data;
> >>>>> };
> >>>>>
> >>>>> Thoughts?
> >>>>>
> >>>>> (*) For the record, the reference for the Python buffer protocol:
> >>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> >>>>> and its C struct definition:
> >>>>>
> >>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> >>>>>
> >>>>> Regards
> >>>>>
> >>>>> Antoine.
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >

Re: [DISCUSS] C-level in-process array protocol

Reply via email to