Re: [DISCUSS] C-level in-process array protocol

Antoine Pitrou Thu, 19 Sep 2019 12:44:30 -0700


I suppose it could be possible for an Arrow array to describe itself
using the ndtypes vocabulary at some point.  However, this is
non-trivial, both on the producer and consumer side.  Moreover, both
sides must ensure they use the same ndtypes description.


The idea here is to have a C data protocol, without any need for a
helper C library, that's as simple as possible and directly expresses the
Arrow data without needing any semantic mapping.  Also it should allow
transmission via FFI layers with as little complication as possible.

Which is why it most probably needs to be Arrow-specific.

Regards

Antoine.


On 19/09/2019 at 21:14, Travis Oliphant wrote:
> I know some on this list are familiar, but many may not have seen ndtypes
> in xnd:  https://github.com/xnd-project/ndtypes
> 
> It generalizes PEP 3118 for cross-language data-structure handling.
> 
> Either a dependency on the small C-library libndtypes or using the concepts
> could be done.
> 
> -Travis
> 
> 
> On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org> wrote:
> 
>>
>> Hello,
>>
>> One thing that was discussed in the sync call is the ability to easily
>> pass arrays at runtime between Arrow implementations or Arrow-supporting
>> libraries in the same process, without bearing the cost of linking to
>> e.g. the C++ Arrow library.
>>
>> (for example: "Duckdb wants to provide an option to return Arrow data of
>> result sets, but they don't like having Arrow as a dependency")
>>
>> One possibility would be to define a C-level protocol similar in spirit
>> to the Python buffer protocol, which some people may be familiar with (*).
>>
>> The basic idea is to define a simple C struct, which is ABI-stable and
>> describes an Arrow array adequately.  The struct can be stack-allocated.
>> Its definition can also be copied in another project (or interfaced with
>> using a C FFI layer, depending on the language).
>>
>> There is no formal proposal; this message is meant to stir the discussion.
>>
>> Issues to work out:
>>
>> * Memory lifetime issues: where Python simply associates the Py_buffer
>> with a PyObject owner (a garbage-collected Python object), we need
>> another means to control lifetime of pointed areas.  One simple
>> possibility is to include a destructor function pointer in the protocol
>> struct.
>>
>> * Arrow type representation.  We probably need some kind of "format"
>> mini-language to represent Arrow types, so that a type can be described
>> using a `const char*`.  Ideally, primitive types at least should be
>> trivially parsable.  We may take inspiration from Python here (`struct`
>> module format characters, PEP 3118 format additions).
>>
>> Example C struct definition (not a formal proposal!):
>>
>> struct ArrowBuffer {
>>   void* data;
>>   int64_t nbytes;
>>   // Called by the consumer when it doesn't need the buffer anymore
>>   void (*release)(struct ArrowBuffer*);
>>   // Opaque user data (for e.g. the release callback)
>>   void* user_data;
>> };
>>
>> struct ArrowArray {
>>   // Type description
>>   const char* format;
>>   // Data description
>>   int64_t length;
>>   int64_t null_count;
>>   int64_t n_buffers;
>>   // Note: these pointers are probably owned by the ArrowArray struct
>>   // and will be released and free()ed by the release callback.
>>   struct ArrowBuffer* buffers;
>>   struct ArrowArray* dictionary;
>>   // Called by the consumer when it doesn't need the array anymore
>>   void (*release)(struct ArrowArray*);
>>   // Opaque user data (for e.g. the release callback)
>>   void* user_data;
>> };
>>
>> Thoughts?
>>
>> (*) For the record, the reference for the Python buffer protocol:
>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>> and its C struct definition:
>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>
>> Regards
>>
>> Antoine.
>>
> 
>
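
To make the lifetime discussion above concrete, here is a minimal sketch of
how a producer and a consumer might use the release callback.  It reuses the
example struct layout from the thread (with the type names made consistent)
and is not a finalized ABI.  The function names export_int32_range,
release_int32_array and sum_int32_array are hypothetical, as are three
assumptions: the format string "i" (borrowed from the Python `struct` module's
character for a C int) stands for a 32-bit integer array, an array without
nulls carries a single data buffer, and the release callback sets its own
pointer to NULL to mark the array as released.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Layout copied from the example in the thread; illustrative only. */
struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  void (*release)(struct ArrowBuffer*);
  void* user_data;
};

struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  struct ArrowBuffer* buffers;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* user_data;
};

/* Hypothetical producer-side callback: frees everything the producer
   allocated, then clears `release` to mark the array as released
   (an assumption of this sketch, not something the thread specifies). */
static void release_int32_array(struct ArrowArray* array) {
  for (int64_t i = 0; i < array->n_buffers; ++i) {
    free(array->buffers[i].data);
  }
  free(array->buffers);
  array->buffers = NULL;
  array->release = NULL;
}

/* Hypothetical producer: exports the values 1..n (n > 0) as a 32-bit
   integer array without nulls.  Returns 0 on success, -1 on failure. */
static int export_int32_range(int64_t n, struct ArrowArray* out) {
  if (n <= 0) return -1;
  int32_t* values = (int32_t*)malloc((size_t)n * sizeof(int32_t));
  struct ArrowBuffer* buffers =
      (struct ArrowBuffer*)malloc(sizeof(struct ArrowBuffer));
  if (values == NULL || buffers == NULL) {
    free(values);
    free(buffers);
    return -1;
  }
  for (int64_t i = 0; i < n; ++i) values[i] = (int32_t)(i + 1);

  buffers[0].data = values;
  buffers[0].nbytes = n * (int64_t)sizeof(int32_t);
  buffers[0].release = NULL;   /* owned by the enclosing array */
  buffers[0].user_data = NULL;

  out->format = "i";           /* assumed spelling for int32 */
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 1;          /* assumption: data buffer only */
  out->buffers = buffers;
  out->dictionary = NULL;
  out->release = release_int32_array;
  out->user_data = NULL;
  return 0;
}

/* Hypothetical consumer: sums the values if it understands the format,
   and in every case releases the array once it is done with it. */
static int sum_int32_array(struct ArrowArray* array, int64_t* sum) {
  int status = -1;
  assert(array->release != NULL);  /* must not already be released */
  if (strcmp(array->format, "i") == 0 && array->n_buffers == 1) {
    const int32_t* values = (const int32_t*)array->buffers[0].data;
    int64_t total = 0;
    for (int64_t i = 0; i < array->length; ++i) total += values[i];
    *sum = total;
    status = 0;
  }
  array->release(array);
  return status;
}
```

Note that the consumer never calls free() itself: it only invokes the
callback, so the producer stays free to allocate the memory however it
likes.  The struct itself can live on the consumer's stack, as suggested in
the thread.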
