Re: [DISCUSS] C-level in-process array protocol

Micah Kornfield Thu, 19 Sep 2019 00:40:14 -0700

I like the idea of a stable ABI that can be used for in-process
communication.  For instance, there was a recent question on
Stack Overflow about how to solve this [1].


A couple of thoughts/questions:
* Would ArrowArray also need a self-reference for child arrays?
* Should transferring key-value metadata be in scope?
* Should the API align more closely with the IPC spec (pass a schema separately
and a list of buffers instead of individual arrays)?

Thanks,
Micah

[1]
https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220

On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Hello,
>
> One thing that was discussed in the sync call is the ability to easily
> pass arrays at runtime between Arrow implementations or Arrow-supporting
> libraries in the same process, without bearing the cost of linking to
> e.g. the C++ Arrow library.
>
> (for example: "Duckdb wants to provide an option to return Arrow data of
> result sets, but they don't like having Arrow as a dependency")
>
> One possibility would be to define a C-level protocol similar in spirit
> to the Python buffer protocol, which some people may be familiar with (*).
>
> The basic idea is to define a simple C struct, which is ABI-stable and
> describes an Arrow array adequately.  The struct can be stack-allocated.
> Its definition can also be copied in another project (or interfaced with
> using a C FFI layer, depending on the language).
>
> There is no formal proposal; this message is meant to stir the discussion.
>
> Issues to work out:
>
> * Memory lifetime issues: where Python simply associates the Py_buffer
> with a PyObject owner (a garbage-collected Python object), we need
> another means to control lifetime of pointed areas.  One simple
> possibility is to include a destructor function pointer in the protocol
> struct.
>
> * Arrow type representation.  We probably need some kind of "format"
> mini-language to represent Arrow types, so that a type can be described
> using a `const char*`.  Ideally, primitive types at least should be
> trivially parsable.  We may take inspiration from Python here (`struct`
> module format characters, PEP 3118 format additions).
>
> Example C struct definition (not a formal proposal!):
>
> struct ArrowBuffer {
>   void* data;
>   int64_t nbytes;
>   // Called by the consumer when it doesn't need the buffer anymore
>   void (*release)(struct ArrowBuffer*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
>
> struct ArrowArray {
>   // Type description
>   const char* format;
>   // Data description
>   int64_t length;
>   int64_t null_count;
>   int64_t n_buffers;
>   // Note: these pointers are probably owned by the ArrowArray struct
>   // and will be released and free()ed by the release callback.
>   struct ArrowBuffer* buffers;
>   struct ArrowArray* dictionary;
>   // Called by the consumer when it doesn't need the array anymore
>   void (*release)(struct ArrowArray*);
>   // Opaque user data (for e.g. the release callback)
>   void* user_data;
> };
>
> Thoughts?
>
> (*) For the record, the reference for the Python buffer protocol:
> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> and its C struct definition:
> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>
> Regards
>
> Antoine.
>
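
For illustration, here is a minimal sketch of how the release callback idea
from the quoted message might work for a single buffer.  The struct layout is
copied from the example above; the producer and consumer functions
(export_int32_range, release_malloced_buffer, consume_sum) are hypothetical
names invented for this sketch and are not part of any proposal.  It assumes
the producer allocates with malloc() and that the consumer calls release
exactly once when it is done.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Same layout as the ArrowBuffer example in the quoted message. */
struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  /* Called by the consumer when it doesn't need the buffer anymore */
  void (*release)(struct ArrowBuffer*);
  /* Opaque user data (unused in this sketch) */
  void* user_data;
};

/* Hypothetical producer-side callback: frees the malloc()ed data and
   clears the struct so the consumer can see it has been released. */
static void release_malloced_buffer(struct ArrowBuffer* buf) {
  assert(buf != NULL);
  free(buf->data);
  buf->data = NULL;
  buf->nbytes = 0;
  buf->release = NULL;
}

/* Hypothetical producer: fills a caller-provided (possibly stack-allocated)
   struct with the int32 values 0..n-1.  Returns 0 on success, -1 on error. */
static int export_int32_range(int32_t n, struct ArrowBuffer* out) {
  if (n <= 0 || out == NULL) return -1;
  int32_t* values = malloc((size_t)n * sizeof(int32_t));
  if (values == NULL) return -1;
  for (int32_t i = 0; i < n; ++i) values[i] = i;
  out->data = values;
  out->nbytes = (int64_t)n * (int64_t)sizeof(int32_t);
  out->release = release_malloced_buffer;
  out->user_data = NULL;
  return 0;
}

/* Hypothetical consumer: sums the values, then hands the memory back to
   the producer through the callback instead of calling free() itself. */
static int64_t consume_sum(struct ArrowBuffer* buf) {
  const int32_t* values = buf->data;
  int64_t count = buf->nbytes / (int64_t)sizeof(int32_t);
  int64_t sum = 0;
  for (int64_t i = 0; i < count; ++i) sum += values[i];
  if (buf->release != NULL) buf->release(buf);
  return sum;
}
```

The point of the design is that the consumer never needs to know how the
memory was allocated; it only needs the struct definition and the convention
of calling release when finished.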

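On the "format" mini-language: as a rough sketch of what "trivially
parsable" could mean for primitive types, the function below maps a
single-character format string to a byte width.  The character codes are
borrowed from the standard sizes of Python's `struct` module purely as an
assumption for this example; the thread does not define any codes, and
primitive_byte_width is a hypothetical helper name.

```c
#include <stddef.h>

/* Returns the byte width of a primitive type described by a one-character
   format string, or -1 if the string is not a recognized primitive.
   Assumed codes (from Python's struct module, standard sizes):
     b/B = int8/uint8, h/H = int16/uint16, i/I = int32/uint32,
     q/Q = int64/uint64, f = float32, d = float64. */
static int primitive_byte_width(const char* format) {
  /* Only exactly one character is accepted here; anything longer would be
     a nested or parametrized type, which this sketch does not handle. */
  if (format == NULL || format[0] == '\0' || format[1] != '\0') return -1;
  switch (format[0]) {
    case 'b': case 'B': return 1;
    case 'h': case 'H': return 2;
    case 'i': case 'I': case 'f': return 4;
    case 'q': case 'Q': case 'd': return 8;
    default: return -1;
  }
}
```

A consumer could use such a check to take a fast path for primitive arrays
and fall back to a fuller parser for anything else.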