I've posted a draft specification PR here; this should help orient the
discussion a bit:
https://github.com/apache/arrow/pull/5442

Regards

Antoine.



On Wed, 18 Sep 2019 19:52:38 +0200
Antoine Pitrou <anto...@python.org> wrote:
> Hello,
> 
> One thing that was discussed in the sync call is the ability to easily
> pass arrays at runtime between Arrow implementations or Arrow-supporting
> libraries in the same process, without bearing the cost of linking to
> e.g. the C++ Arrow library.
> 
> (for example: "Duckdb wants to provide an option to return Arrow data of
> result sets, but they don't like having Arrow as a dependency")
> 
> One possibility would be to define a C-level protocol similar in spirit
> to the Python buffer protocol, which some people may be familiar with (*).
> 
> The basic idea is to define a simple C struct, which is ABI-stable and
> describes an Arrow array adequately.  The struct can be stack-allocated.
> Its definition can also be copied in another project (or interfaced with
> using a C FFI layer, depending on the language).
> 
> There is no formal proposal; this message is meant to stir the discussion.
> 
> Issues to work out:
> 
> * Memory lifetime issues: where Python simply associates the Py_buffer
> with a PyObject owner (a garbage-collected Python object), we need
> another means to control the lifetime of pointed-to areas.  One simple
> possibility is to include a destructor function pointer in the protocol
> struct.
> 
> * Arrow type representation.  We probably need some kind of "format"
> mini-language to represent Arrow types, so that a type can be described
> using a `const char*`.  Ideally, primitive types at least should be
> trivially parsable.  We may take inspiration from Python here (`struct`
> module format characters, PEP 3118 format additions).
> 
> Example C struct definition (not a formal proposal!):
> 
> struct ArrowBuffer {
>   void* data;
>   int64_t nbytes;
>   // Called by the consumer when it doesn't need the buffer anymore
>   void (*release)(struct ArrowBuffer*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
> 
> struct ArrowArray {
>   // Type description
>   const char* format;
>   // Data description
>   int64_t length;
>   int64_t null_count;
>   int64_t n_buffers;
>   // Note: these pointers are probably owned by the ArrowArray struct
>   // and will be released and free()ed by the release callback.
>   struct ArrowBuffer* buffers;
>   struct ArrowArray* dictionary;
>   // Called by the consumer when it doesn't need the array anymore
>   void (*release)(struct ArrowArray*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
> 
> Thoughts?
> 
> (*) For the record, the reference for the Python buffer protocol:
> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> and its C struct definition:
> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> 
> Regards
> 
> Antoine.
> 
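
To make the release-callback idea in the quoted message more concrete, here
is a minimal sketch of how a producer and a consumer might use such structs.
It is only an illustration, not the draft specification: the helper names
(export_int32_array, sum_and_release), the "i" format string (borrowed from
Python's `struct` module for a 32-bit int) and the convention of setting
`release` to NULL after release are all assumptions made for this example.
The struct member types use the names ArrowBuffer and ArrowArray
consistently, and the array carries a single data buffer with no validity
bitmap to keep things short.

```c
#include <stdint.h>
#include <stdlib.h>

struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  void (*release)(struct ArrowBuffer*);
  void* user_data;
};

struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  struct ArrowBuffer* buffers;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* user_data;
};

/* Producer side: frees the memory owned by one buffer. */
static void release_buffer(struct ArrowBuffer* buf) {
  free(buf->data);
  buf->data = NULL;
  buf->release = NULL;
}

/* Producer side: releases every buffer, then the buffers array itself. */
static void release_array(struct ArrowArray* arr) {
  for (int64_t i = 0; i < arr->n_buffers; i++) {
    if (arr->buffers[i].release != NULL) {
      arr->buffers[i].release(&arr->buffers[i]);
    }
  }
  free(arr->buffers);
  arr->buffers = NULL;
  arr->release = NULL;  /* assumed convention: NULL means "released" */
}

/* Hypothetical producer: fills a caller-provided (possibly stack-allocated)
 * struct with the values 0..n-1 (n > 0).  Returns 0 on success. */
static int export_int32_array(struct ArrowArray* out, int64_t n) {
  int32_t* values = malloc((size_t)n * sizeof(int32_t));
  struct ArrowBuffer* bufs = malloc(sizeof(struct ArrowBuffer));
  if (values == NULL || bufs == NULL) {
    free(values);
    free(bufs);
    return -1;
  }
  for (int64_t i = 0; i < n; i++) {
    values[i] = (int32_t)i;
  }
  bufs[0].data = values;
  bufs[0].nbytes = n * (int64_t)sizeof(int32_t);
  bufs[0].release = release_buffer;
  bufs[0].user_data = NULL;

  out->format = "i";  /* hypothetical format string for int32 */
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 1;
  out->buffers = bufs;
  out->dictionary = NULL;
  out->release = release_array;
  out->user_data = NULL;
  return 0;
}

/* Hypothetical consumer: reads the data, then calls the producer's release
 * callback.  It never calls free() itself. */
static int64_t sum_and_release(struct ArrowArray* arr) {
  const int32_t* values = arr->buffers[0].data;
  int64_t total = 0;
  for (int64_t i = 0; i < arr->length; i++) {
    total += values[i];
  }
  arr->release(arr);
  return total;
}
```

The point of the design is that the consumer never needs to know how the
memory was allocated: it only calls `release`, so the producer is free to
use malloc(), a memory pool, or a reference to a garbage-collected object
stashed in `user_data`.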


