Hello,

I like this proposal, as it will make interfacing between the various Arrow implementations and Arrow-supporting libraries inside one process much easier. I'm a bit critical, though, of using a string as the format representation, since every consumer then has to parse it correctly. Couldn't we use the enums we already have and reimplement them as C defines instead?
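To illustrate the idea, here is a minimal sketch of a type id expressed as C defines. The names (`ARROW_TYPE_*`, `ArrowTypeId`, `arrow_type_byte_width`) and the numeric values are made up for this example and are not taken from any Arrow header; a real version would presumably mirror the existing enums and would still need an answer for parametrized and nested types.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type ids as C defines. Names and values are illustrative
   only; they are not copied from the Arrow C++ enums. */
#define ARROW_TYPE_INT32   1
#define ARROW_TYPE_INT64   2
#define ARROW_TYPE_FLOAT64 3
#define ARROW_TYPE_UTF8    4

/* Hypothetical fixed-size descriptor that would stand in for the
   `const char* format` member of the draft struct. */
struct ArrowTypeId {
  int32_t id;
};

/* Returns the byte width of a fixed-width type, 0 for a variable-width
   type, and -1 for an id this consumer does not know. */
static int arrow_type_byte_width(const struct ArrowTypeId* type) {
  assert(type != NULL);
  switch (type->id) {
    case ARROW_TYPE_INT32:
      return 4;
    case ARROW_TYPE_INT64:
    case ARROW_TYPE_FLOAT64:
      return 8;
    case ARROW_TYPE_UTF8:
      return 0;
    default:
      return -1;
  }
}
```

A consumer would dispatch on `type->id` with a plain `switch` and would not need any string parsing for primitive types; an unknown id is reported instead of being misread.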
Uwe

On Thu, Sep 19, 2019, at 6:21 PM, Zhuo Peng wrote:
> Hi Antoine,
>
> I'm also interested in a stable ABI (previously I posted on this mailing
> list about the ABI issues I had [1]). Does having such an ABI-stable
> C-struct imply that there will be a set of C-APIs exposed by the Arrow
> (C++) library (which I think would lead to a solution to all the inherent
> ABI issues caused by C++)?
>
> [1]
> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
>
> On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou <anto...@python.org> wrote:
>
> >
> > On 19/09/2019 at 09:39, Micah Kornfield wrote:
> > > I like the idea of a stable ABI for in-processing that can be used for
> > > in-process communication. For instance, there was a recent question on
> > > stack-overflow on how to solve this [1].
> > >
> > > A couple of thoughts/questions:
> > > * Would ArrowArray also need a self reference for children arrays?
> >
> > Yes, I forgot that. I also think we don't need a separate Buffer
> > struct, instead the Array struct should own all its buffers.
> >
> > > * Should transferring key-value metadata be in scope?
> >
> > Yes. It could either be in the format string or a separate string. The
> > upside of a separate string is that a consumer may ignore it trivially
> > if it doesn't need the information.
> >
> > Another open question is for nested types: does the format string
> > represent the entire type including children? Or must child types be
> > read in the child arrays? If we mimic ArrayData, then the format
> > string should represent the entire type; it will then be more complex to
> > parse.
> >
> > We should also make sure that extension types fit in the protocol.
> >
> > > * Should the API more closely align the IPC spec (pass a schema
> > > separately and list of buffers instead of individual arrays)?
> >
> > Then you have something that's not immediately usable (you have to do some
> > processing to reconstitute the individual arrays). One goal here is to
> > minimize implementation costs for producers and consumers. The
> > assumption is a data model similar to the C++ ArrayData model; do we
> > have implementations that use an entirely different model? Perhaps I
> > should take a look :-)
> >
> > Note that the draft I posted only concerns arrays. We may also want to
> > have a C struct for batches or tables.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > > https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> > >
> > > On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org>
> > > wrote:
> > >
> > >>
> > >> Hello,
> > >>
> > >> One thing that was discussed in the sync call is the ability to easily
> > >> pass arrays at runtime between Arrow implementations or Arrow-supporting
> > >> libraries in the same process, without bearing the cost of linking to
> > >> e.g. the C++ Arrow library.
> > >>
> > >> (for example: "DuckDB wants to provide an option to return Arrow data of
> > >> result sets, but they don't like having Arrow as a dependency")
> > >>
> > >> One possibility would be to define a C-level protocol similar in spirit
> > >> to the Python buffer protocol, which some people may be familiar with (*).
> > >>
> > >> The basic idea is to define a simple C struct, which is ABI-stable and
> > >> describes an Arrow array adequately. The struct can be stack-allocated.
> > >> Its definition can also be copied in another project (or interfaced with
> > >> using a C FFI layer, depending on the language).
> > >>
> > >> There is no formal proposal, this message is meant to stir the discussion.
> > >>
> > >> Issues to work out:
> > >>
> > >> * Memory lifetime issues: where Python simply associates the Py_buffer
> > >>   with a PyObject owner (a garbage-collected Python object), we need
> > >>   another means to control the lifetime of the pointed-to areas. One
> > >>   simple possibility is to include a destructor function pointer in the
> > >>   protocol struct.
> > >>
> > >> * Arrow type representation. We probably need some kind of "format"
> > >>   mini-language to represent Arrow types, so that a type can be described
> > >>   using a `const char*`. Ideally, primitive types at least should be
> > >>   trivially parsable. We may take inspiration from Python here (`struct`
> > >>   module format characters, PEP 3118 format additions).
> > >>
> > >> Example C struct definition (not a formal proposal!):
> > >>
> > >> #include <stdint.h>
> > >>
> > >> struct ArrowBuffer {
> > >>   void* data;
> > >>   int64_t nbytes;
> > >>   // Called by the consumer when it doesn't need the buffer anymore
> > >>   void (*release)(struct ArrowBuffer*);
> > >>   // Opaque user data (for e.g. the release callback)
> > >>   void* user_data;
> > >> };
> > >>
> > >> struct ArrowArray {
> > >>   // Type description
> > >>   const char* format;
> > >>   // Data description
> > >>   int64_t length;
> > >>   int64_t null_count;
> > >>   int64_t n_buffers;
> > >>   // Note: these pointers are probably owned by the ArrowArray struct
> > >>   // and will be released and free()ed by the release callback.
> > >>   struct ArrowBuffer* buffers;
> > >>   struct ArrowArray* dictionary;
> > >>   // Called by the consumer when it doesn't need the array anymore
> > >>   void (*release)(struct ArrowArray*);
> > >>   // Opaque user data (for e.g. the release callback)
> > >>   void* user_data;
> > >> };
> > >>
> > >> Thoughts?
> > >>
> > >> (*) For the record, the reference for the Python buffer protocol:
> > >> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> > >> and its C struct definition:
> > >> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> > >>
> > >> Regards
> > >>
> > >> Antoine.
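To make the lifetime point from the quoted draft concrete, here is a minimal sketch of how a producer and a consumer might use the release callback. It uses a simplified, single-buffer variant of the draft `ArrowArray` struct. The helpers `make_int32_array`, `release_malloced` and `consume_sum` are hypothetical, the format string "i" for int32 is only an assumption borrowed from the Python `struct` module, and setting `release` to NULL to mark a released array is likewise an assumption, not something the draft specifies.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified single-buffer variant of the draft struct (assumption). */
struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  void* data;
  /* Called by the consumer when it doesn't need the array anymore */
  void (*release)(struct ArrowArray*);
  /* Opaque user data (unused in this sketch) */
  void* user_data;
};

/* Producer-side destructor: frees the malloc()ed buffer and marks the
   struct as released by clearing the callback (assumed convention). */
static void release_malloced(struct ArrowArray* array) {
  assert(array != NULL);
  free(array->data);
  array->data = NULL;
  array->release = NULL;
}

/* Hypothetical producer: fills a caller-provided (e.g. stack-allocated)
   struct with the int32 values 0..n-1. Returns 0 on success, -1 on error. */
static int make_int32_array(int64_t n, struct ArrowArray* out) {
  if (n <= 0 || out == NULL) return -1;
  int32_t* values = malloc((size_t)n * sizeof(int32_t));
  if (values == NULL) return -1;
  for (int64_t i = 0; i < n; ++i) values[i] = (int32_t)i;
  out->format = "i"; /* assumed format character for int32 */
  out->length = n;
  out->null_count = 0;
  out->data = values;
  out->release = release_malloced;
  out->user_data = NULL;
  return 0;
}

/* Hypothetical consumer: sums the values, then hands the memory back
   through the callback without knowing how it was allocated. */
static int64_t consume_sum(struct ArrowArray* array) {
  const int32_t* values = array->data;
  int64_t sum = 0;
  for (int64_t i = 0; i < array->length; ++i) sum += values[i];
  array->release(array);
  return sum;
}
```

The consumer never calls `free()` itself; it only invokes `release`, so the producer stays free to use any allocator or reference-counting scheme behind the callback.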