Hello,
One thing that was discussed in the sync call is the ability to easily
pass arrays at runtime between Arrow implementations or Arrow-supporting
libraries in the same process, without bearing the cost of linking to
e.g. the C++ Arrow library.
(for example: "Duckdb wants to provide an option to return Arrow data of
result sets, but they don't like having Arrow as a dependency")
One possibility would be to define a C-level protocol similar in spirit
to the Python buffer protocol, which some people may be familiar with (*).
The basic idea is to define a simple C struct, which is ABI-stable and
describes an Arrow away adequately. The struct can be stack-allocated.
Its definition can also be copied in another project (or interfaced with
using a C FFI layer, depending on the language).
There is no formal proposal, this message is meant to stir the discussion.
Issues to work out:
* Memory lifetime issues: where Python simply associates the Py_buffer
with a PyObject owner (a garbage-collected Python object), we need
another means to control lifetime of pointed areas. One simple
possibility is to include a destructor function pointer in the protocol
struct.
* Arrow type representation. We probably need some kind of "format"
mini-language to represent Arrow types, so that a type can be described
using a `const char*`. Ideally, primitives types at least should be
trivially parsable. We may take inspiration from Python here (`struct`
module format characters, PEP 3118 format additions).
Example C struct definition (not a formal proposal!):
struct ArrowBuffer {
void* data;
int64_t nbytes;
// Called by the consumer when it doesn't need the buffer anymore
void (*release)(struct ArrowBuffer*);
// Opaque user data (for e.g. the release callback)
void* user_data;
};
struct ArrowArray {
// Type description
const char* format;
// Data description
int64_t length;
int64_t null_count;
int64_t n_buffers;
// Note: this pointers are probably owned by the ArrowArray struct
// and will be released and free()ed by the release callback.
struct BufferDescriptor* buffers;
struct ArrowDescriptor* dictionary;
// Called by the consumer when it doesn't need the array anymore
void (*release)(struct ArrowArrayDescriptor*);
// Opaque user data (for e.g. the release callback)
void* user_data;
};
Thoughts?
(*) For the record, the reference for the Python buffer protocol:
https://docs.python.org/3/c-api/buffer.html#buffer-structure
and its C struct definition:
https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
Regards
Antoine.