Re: [DISCUSS] C-level in-process array protocol

Antoine Pitrou Thu, 19 Sep 2019 01:07:56 -0700


Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
> I like the idea of a stable ABI that can be used for in-process
> communication.  For instance, there was a recent question on
> Stack Overflow on how to solve this [1].
> 
> A couple of thoughts/questions:
> * Would ArrowArray also need a self reference for children arrays?


Yes, I forgot that.  I also think we don't need a separate Buffer
struct; instead, the Array struct should own all its buffers.
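
To make this concrete, here is a rough sketch (not a proposal) of what the
struct could look like with child arrays and without a separate buffer
struct.  The `n_children` and `children` fields, the `void**` buffer layout,
the helper names `example_release` and `example_make_int32`, and the "i"
format string are all assumptions made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical revision of the draft struct: no separate buffer struct,
   and child arrays are referenced through the same struct type. */
struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  int64_t n_children;            /* assumed field, not in the draft */
  void** buffers;                /* owned by the array */
  struct ArrowArray** children;  /* assumed field, owned by the array */
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* user_data;
};

/* Example release callback for arrays whose memory came from malloc().
   It releases children first, then the buffers, then marks the struct
   as released by clearing the callback. */
static void example_release(struct ArrowArray* array) {
  assert(array != NULL);
  for (int64_t i = 0; i < array->n_children; ++i) {
    struct ArrowArray* child = array->children[i];
    if (child->release != NULL) child->release(child);
    free(child);
  }
  free(array->children);
  for (int64_t i = 0; i < array->n_buffers; ++i) free(array->buffers[i]);
  free(array->buffers);
  array->release = NULL;
}

/* Example producer: fills `out` with the int32 values 0..length-1 and no
   nulls.  Returns 0 on success, -1 on allocation failure. */
static int example_make_int32(int64_t length, struct ArrowArray* out) {
  int32_t* values = malloc((size_t)length * sizeof(int32_t));
  void** buffers = malloc(2 * sizeof(void*));
  if (values == NULL || buffers == NULL) {
    free(values);
    free(buffers);
    return -1;
  }
  for (int64_t i = 0; i < length; ++i) values[i] = (int32_t)i;
  buffers[0] = NULL;    /* validity bitmap omitted since there are no nulls */
  buffers[1] = values;  /* data buffer */
  out->format = "i";    /* assumed format code for int32 */
  out->length = length;
  out->null_count = 0;
  out->n_buffers = 2;
  out->n_children = 0;
  out->buffers = buffers;
  out->children = NULL;
  out->dictionary = NULL;
  out->release = example_release;
  out->user_data = NULL;
  return 0;
}
```

A consumer would read `((int32_t*)array.buffers[1])[i]` and call
`array.release(&array)` once it is done; the struct itself can live on the
consumer's stack.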

> * Should transferring key-value metadata be in scope?

Yes.  It could either be in the format string or a separate string.  The
upside of a separate string is that a consumer may ignore it trivially
if it doesn't need the information.

Another open question is for nested types: does the format string
represent the entire type including children?  Or must child types be
read in the child arrays?  If we mimic ArrayData, then the format
string should represent the entire type; it will then be more complex to
parse.
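
As an illustration of the trade-off, here is a sketch of how cheap parsing
stays if primitive types are single characters and child types live in the
child arrays.  The character set is borrowed from the Python `struct`
module purely as an assumption, the function name `example_primitive_width`
is hypothetical, and nothing here is a proposed format.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the byte width of a primitive type described by a
   single-character format string, or -1 if the string is anything else
   (for example a nested type, which would need a real parser or a look
   at the child arrays).  `format` must not be NULL. */
static int example_primitive_width(const char* format) {
  assert(format != NULL);
  if (strlen(format) != 1) return -1;
  switch (format[0]) {
    case 'b': case 'B': return 1;            /* int8 / uint8 */
    case 'h': case 'H': return 2;            /* int16 / uint16 */
    case 'i': case 'I': case 'f': return 4;  /* int32 / uint32 / float32 */
    case 'q': case 'Q': case 'd': return 8;  /* int64 / uint64 / float64 */
    default: return -1;
  }
}
```

If the format string instead had to describe the entire nested type, the
`strlen(format) != 1` shortcut would no longer be enough and every consumer
would need a recursive parser.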

We should also make sure that extension types fit in the protocol.

> * Should the API more closely align the IPC spec (pass a schema separately
> and list of buffers instead of individual arrays)?

Then you have something that's not immediately usable (you have to do some
processing to reconstitute the individual arrays).  One goal here is to
minimize implementation costs for producers and consumers.  The
assumption is a data model similar to the C++ ArrayData model; do we
have implementations that use an entirely different model?  Perhaps I
should take a look :-)

Note that the draft I posted only concerns arrays.  We may also want to
have a C struct for batches or tables.

Regards

Antoine.


> 
> Thanks,
> Micah
> 
> [1]
> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> 
> On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <[email protected]> wrote:
> 
>>
>> Hello,
>>
>> One thing that was discussed in the sync call is the ability to easily
>> pass arrays at runtime between Arrow implementations or Arrow-supporting
>> libraries in the same process, without bearing the cost of linking to
>> e.g. the C++ Arrow library.
>>
>> (for example: "Duckdb wants to provide an option to return Arrow data of
>> result sets, but they don't like having Arrow as a dependency")
>>
>> One possibility would be to define a C-level protocol similar in spirit
>> to the Python buffer protocol, which some people may be familiar with (*).
>>
>> The basic idea is to define a simple C struct, which is ABI-stable and
>> describes an Arrow array adequately.  The struct can be stack-allocated.
>> Its definition can also be copied in another project (or interfaced with
>> using a C FFI layer, depending on the language).
>>
>> There is no formal proposal; this message is meant to stir the discussion.
>>
>> Issues to work out:
>>
>> * Memory lifetime issues: where Python simply associates the Py_buffer
>> with a PyObject owner (a garbage-collected Python object), we need
>> another means to control lifetime of pointed areas.  One simple
>> possibility is to include a destructor function pointer in the protocol
>> struct.
>>
>> * Arrow type representation.  We probably need some kind of "format"
>> mini-language to represent Arrow types, so that a type can be described
>> using a `const char*`.  Ideally, primitive types at least should be
>> trivially parsable.  We may take inspiration from Python here (`struct`
>> module format characters, PEP 3118 format additions).
>>
>> Example C struct definition (not a formal proposal!):
>>
>> struct ArrowBuffer {
>>   void* data;
>>   int64_t nbytes;
>>   // Called by the consumer when it doesn't need the buffer anymore
>>   void (*release)(struct ArrowBuffer*);
>>   // Opaque user data (for e.g. the release callback)
>>   void* user_data;
>> };
>>
>> struct ArrowArray {
>>   // Type description
>>   const char* format;
>>   // Data description
>>   int64_t length;
>>   int64_t null_count;
>>   int64_t n_buffers;
>>   // Note: these pointers are probably owned by the ArrowArray struct
>>   // and will be released and free()ed by the release callback.
>>   struct ArrowBuffer* buffers;
>>   struct ArrowArray* dictionary;
>>   // Called by the consumer when it doesn't need the array anymore
>>   void (*release)(struct ArrowArray*);
>>   // Opaque user data (for e.g. the release callback)
>>   void* user_data;
>> };
>>
>> Thoughts?
>>
>> (*) For the record, the reference for the Python buffer protocol:
>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>> and its C struct definition:
>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>
>> Regards
>>
>> Antoine.
>>
>
