I like the idea of a stable ABI for in-processing that can be used for in process communication. For instance, there was a recent question on stack-overflow on how to solve this [1].
A couple of thoughts/questions: * Would ArrowArray also need a self reference for children arrays? * Should transferring key-value metadata be in scope? * Should the API more closely align the IPC spec (pass a schema separately and list of buffers instead of individual arrays)? Thanks, Micah [1] https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220 On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org> wrote: > > Hello, > > One thing that was discussed in the sync call is the ability to easily > pass arrays at runtime between Arrow implementations or Arrow-supporting > libraries in the same process, without bearing the cost of linking to > e.g. the C++ Arrow library. > > (for example: "Duckdb wants to provide an option to return Arrow data of > result sets, but they don't like having Arrow as a dependency") > > One possibility would be to define a C-level protocol similar in spirit > to the Python buffer protocol, which some people may be familiar with (*). > > The basic idea is to define a simple C struct, which is ABI-stable and > describes an Arrow away adequately. The struct can be stack-allocated. > Its definition can also be copied in another project (or interfaced with > using a C FFI layer, depending on the language). > > There is no formal proposal, this message is meant to stir the discussion. > > Issues to work out: > > * Memory lifetime issues: where Python simply associates the Py_buffer > with a PyObject owner (a garbage-collected Python object), we need > another means to control lifetime of pointed areas. One simple > possibility is to include a destructor function pointer in the protocol > struct. > > * Arrow type representation. We probably need some kind of "format" > mini-language to represent Arrow types, so that a type can be described > using a `const char*`. Ideally, primitives types at least should be > trivially parsable. We may take inspiration from Python here (`struct` > module format characters, PEP 3118 format additions). > > Example C struct definition (not a formal proposal!): > > struct ArrowBuffer { > void* data; > int64_t nbytes; > // Called by the consumer when it doesn't need the buffer anymore > void (*release)(struct ArrowBuffer*); > // Opaque user data (for e.g. the release callback) > void* user_data; > }; > > struct ArrowArray { > // Type description > const char* format; > // Data description > int64_t length; > int64_t null_count; > int64_t n_buffers; > // Note: this pointers are probably owned by the ArrowArray struct > // and will be released and free()ed by the release callback. > struct BufferDescriptor* buffers; > struct ArrowDescriptor* dictionary; > // Called by the consumer when it doesn't need the array anymore > void (*release)(struct ArrowArrayDescriptor*); > // Opaque user data (for e.g. the release callback) > void* user_data; > }; > > Thoughts? > > (*) For the record, the reference for the Python buffer protocol: > https://docs.python.org/3/c-api/buffer.html#buffer-structure > and its C struct definition: > https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195 > > Regards > > Antoine. >