Re: [DISCUSS] C-level in-process array protocol

Jacques Nadeau Sat, 28 Sep 2019 21:11:04 -0700

* No dependency on Flatbuffers.
* No buffer reassembly (data is already exposed in logical Arrow format).
* Zero-copy by design.
* Easy to reimplement from scratch.


I don't see how the flatbuffer pattern for data headers doesn't accomplish
all of these things. By definition, it is a very simple representation of
data that can be worked with independently of the flatbuffers codebase.
It was designed so systems could map directly into that memory without
interacting with a flatbuffers library.

Specifically the following three structures were designed to already allow
what I think this proposal is trying to recreate. All three are very simple
to construct in a direct, non-flatbuffer dependent read/write pattern.

struct FieldNode {
  length: long;
  null_count: long;
}

struct Buffer {
  offset: long;
  length: long;
}

table RecordBatch {
  length: long;
  nodes: [FieldNode];
  buffers: [Buffer];
}
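
As a rough illustration of the point above (a sketch only, not code from the
Arrow or flatbuffers projects): a flatbuffers struct is a fixed-size run of
little-endian scalars, so FieldNode and Buffer are each two 64-bit integers
(16 bytes) and can be decoded with plain byte reads. The helper names
read_i64_le, read_field_node and read_buffer_ref are hypothetical, and the
sketch assumes the caller has already located the first element of the
nodes or buffers vector; finding that position inside the RecordBatch table
is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mirrors FieldNode from the schema: two 64-bit integers, 16 bytes. */
typedef struct {
  int64_t length;
  int64_t null_count;
} FieldNode;

/* Mirrors Buffer from the schema: two 64-bit integers, 16 bytes. */
typedef struct {
  int64_t offset;
  int64_t length;
} BufferRef;

/* Decode one little-endian 64-bit integer, independent of host byte order. */
static int64_t read_i64_le(const uint8_t* p) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; --i) {
    v = (v << 8) | p[i];
  }
  int64_t out;
  memcpy(&out, &v, sizeof out); /* reinterpret the bits as signed */
  return out;
}

/* Read the index-th FieldNode from a run of packed nodes.
   Assumption: `first` points at element 0 of the vector. */
static FieldNode read_field_node(const uint8_t* first, size_t index) {
  assert(first != NULL);
  const uint8_t* p = first + index * 16;
  FieldNode n;
  n.length = read_i64_le(p);
  n.null_count = read_i64_le(p + 8);
  return n;
}

/* Read the index-th Buffer entry from a run of packed entries.
   Assumption: `first` points at element 0 of the vector. */
static BufferRef read_buffer_ref(const uint8_t* first, size_t index) {
  assert(first != NULL);
  const uint8_t* p = first + index * 16;
  BufferRef b;
  b.offset = read_i64_le(p);
  b.length = read_i64_le(p + 8);
  return b;
}
```

Nothing here links against a flatbuffers library; the only knowledge needed
is the field order and the 16-byte element size.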

On Sat, Sep 28, 2019 at 9:02 PM Jacques Nadeau <jacq...@apache.org> wrote:

> I'm not clear on why we need to introduce something beyond what
> flatbuffers already provides. Can someone explain that to me? I'm not
> really a fan of introducing a second representation of the same data (as I
> understand it).
>
> On Thu, Sep 19, 2019 at 1:15 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> This is helpful, I will leave some comments on the proposal when I
>> can, sometime in the next week.
>>
>> I agree that it would likely be opening a can of worms to create a
>> semantic mapping between a generalized type grammar and Arrow's
>> specific logical types defined in Schema.fbs. If we go down this
>> route, we should probably utilize the simplest possible grammar that
>> is capable of encoding the Type Flatbuffers union values.
>>
>> On Thu, Sep 19, 2019 at 2:49 PM Antoine Pitrou <solip...@pitrou.net>
>> wrote:
>> >
>> >
>> > I've posted a draft specification PR here, this should help orient the
>> > discussion a bit:
>> > https://github.com/apache/arrow/pull/5442
>> >
>> > Regards
>> >
>> > Antoine.
>> >
>> >
>> >
>> > On Wed, 18 Sep 2019 19:52:38 +0200
>> > Antoine Pitrou <anto...@python.org> wrote:
>> > > Hello,
>> > >
>> > > One thing that was discussed in the sync call is the ability to easily
>> > > pass arrays at runtime between Arrow implementations or Arrow-supporting
>> > > libraries in the same process, without bearing the cost of linking to
>> > > e.g. the C++ Arrow library.
>> > >
>> > > (for example: "Duckdb wants to provide an option to return Arrow data of
>> > > result sets, but they don't like having Arrow as a dependency")
>> > >
>> > > One possibility would be to define a C-level protocol similar in spirit
>> > > to the Python buffer protocol, which some people may be familiar with (*).
>> > >
>> > > The basic idea is to define a simple C struct, which is ABI-stable and
>> > > describes an Arrow array adequately.  The struct can be stack-allocated.
>> > > Its definition can also be copied in another project (or interfaced with
>> > > using a C FFI layer, depending on the language).
>> > >
>> > > There is no formal proposal, this message is meant to stir the discussion.
>> > >
>> > > Issues to work out:
>> > >
>> > > * Memory lifetime issues: where Python simply associates the Py_buffer
>> > > with a PyObject owner (a garbage-collected Python object), we need
>> > > another means to control lifetime of pointed areas.  One simple
>> > > possibility is to include a destructor function pointer in the protocol
>> > > struct.
>> > >
>> > > * Arrow type representation.  We probably need some kind of "format"
>> > > mini-language to represent Arrow types, so that a type can be described
>> > > using a `const char*`.  Ideally, primitive types at least should be
>> > > trivially parsable.  We may take inspiration from Python here (`struct`
>> > > module format characters, PEP 3118 format additions).
>> > >
>> > > Example C struct definition (not a formal proposal!):
>> > >
>> > > struct ArrowBuffer {
>> > >   void* data;
>> > >   int64_t nbytes;
>> > >   // Called by the consumer when it doesn't need the buffer anymore
>> > >   void (*release)(struct ArrowBuffer*);
>> > >   // Opaque user data (for e.g. the release callback)
>> > >   void* user_data;
>> > > };
>> > >
>> > > struct ArrowArray {
>> > >   // Type description
>> > >   const char* format;
>> > >   // Data description
>> > >   int64_t length;
>> > >   int64_t null_count;
>> > >   int64_t n_buffers;
>> > >   // Note: these pointers are probably owned by the ArrowArray struct
>> > >   // and will be released and free()ed by the release callback.
>> > >   struct ArrowBuffer* buffers;
>> > >   struct ArrowArray* dictionary;
>> > >   // Called by the consumer when it doesn't need the array anymore
>> > >   void (*release)(struct ArrowArray*);
>> > >   // Opaque user data (for e.g. the release callback)
>> > >   void* user_data;
>> > > };
>> > >
>> > > Thoughts?
>> > >
>> > > (*) For the record, the reference for the Python buffer protocol:
>> > > https://docs.python.org/3/c-api/buffer.html#buffer-structure
>> > > and its C struct definition:
>> > >
>> > > https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>> > >
>> > > Regards
>> > >
>> > > Antoine.
>> > >
>> >
>> >
>> >
>>
>
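
For readers following the lifetime discussion in the quoted message, here is
a minimal sketch of how the destructor-pointer idea could work in practice.
It is not a formal proposal and not the Arrow API. The struct definitions
follow the quoted example, with the struct names made consistent. Everything
else is an assumption made for illustration: the helper names export_int32,
sum_int32, release_malloc_buffer and release_array, the use of "i" as a
format string for 32-bit integers (borrowed from Python's `struct` module),
the single data buffer with no validity bitmap, and the convention that a
released array has its release pointer set to NULL. The producer copies its
input only to keep the example self-contained; a real producer would point
at memory it already owns.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  void (*release)(struct ArrowBuffer*);
  void* user_data;
};

struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  struct ArrowBuffer* buffers;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* user_data;
};

/* Producer-side callback: frees a buffer that was malloc()ed. */
static void release_malloc_buffer(struct ArrowBuffer* buf) {
  free(buf->data);
  buf->data = NULL;
  buf->nbytes = 0;
}

/* Producer-side callback: releases each buffer, then the buffer table. */
static void release_array(struct ArrowArray* arr) {
  for (int64_t i = 0; i < arr->n_buffers; ++i) {
    if (arr->buffers[i].release != NULL) {
      arr->buffers[i].release(&arr->buffers[i]);
    }
  }
  free(arr->buffers);
  arr->buffers = NULL;
  arr->n_buffers = 0;
  arr->release = NULL; /* assumed convention: NULL means "already released" */
}

/* Hypothetical producer: exports n int32 values with no nulls.
   Returns 0 on success, -1 on allocation failure. */
static int export_int32(const int32_t* values, int64_t n,
                        struct ArrowArray* out) {
  assert(out != NULL && n >= 0);
  memset(out, 0, sizeof *out);
  size_t nbytes = (size_t)n * sizeof(int32_t);
  struct ArrowBuffer* bufs =
      (struct ArrowBuffer*)calloc(1, sizeof(struct ArrowBuffer));
  if (bufs == NULL) return -1;
  void* data = malloc(nbytes > 0 ? nbytes : 1);
  if (data == NULL) {
    free(bufs);
    return -1;
  }
  if (nbytes > 0) memcpy(data, values, nbytes);
  bufs[0].data = data;
  bufs[0].nbytes = (int64_t)nbytes;
  bufs[0].release = release_malloc_buffer;
  out->format = "i"; /* assumed format character for int32 */
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 1;
  out->buffers = bufs;
  out->release = release_array;
  return 0;
}

/* Hypothetical consumer: reads the data buffer without knowing how the
   producer allocated it. */
static int64_t sum_int32(const struct ArrowArray* arr) {
  const int32_t* v = (const int32_t*)arr->buffers[0].data;
  int64_t total = 0;
  for (int64_t i = 0; i < arr->length; ++i) total += v[i];
  return total;
}
```

A consumer would call export_int32 (or receive the struct from another
library), read the values, and then call arr.release(&arr) exactly once. The
consumer never calls free() itself, so the producer stays free to use any
allocator.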
