On 19/09/2019 at 19:11, Uwe L. Korn wrote:
> Hello,
>
> I like this proposal as it will make interfacing inside a process between
> various Arrow supports much easier. I'm a bit critical though of using a
> string as the format representation as one needs to parse it correctly.
> Couldn't we use the enums we already have and reimplement them as C-defines
> instead?

We could, but then we need to represent type parameters separately, as
some types are parametric (such as Time-related types). So we would
still have some kind of encoded representation for those parameters.

So it may be just as easy to represent everything inside the format string:
the type class (a single character perhaps) and optionally the type
instance parameters (if necessary).

Note that for non-parametric primitive types such as int64_t, double,
utf8... the format string will be a single character anyway.

Regards

Antoine.

>
> Uwe
>
> On Thu, Sep 19, 2019, at 6:21 PM, Zhuo Peng wrote:
>> Hi Antoine,
>>
>> I'm also interested in a stable ABI (previously I posted on this mailing
>> list about the ABI issues I had [1]). Does having such an ABI-stable
>> C-struct imply that there will be a set of C-APIs exposed by the Arrow
>> (C++) library (which I think would lead to a solution to all the inherent
>> ABI issues caused by C++)?
>>
>> [1]
>> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
>>
>> On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>>>
>>> On 19/09/2019 at 09:39, Micah Kornfield wrote:
>>>> I like the idea of a stable ABI for in-processing that can be used for
>>>> in-process communication. For instance, there was a recent question on
>>>> stack-overflow on how to solve this [1].
>>>>
>>>> A couple of thoughts/questions:
>>>> * Would ArrowArray also need a self reference for children arrays?
>>>
>>> Yes, I forgot that. I also think we don't need a separate Buffer
>>> struct, instead the Array struct should own all its buffers.
>>>
>>>> * Should transferring key-value metadata be in scope?
>>>
>>> Yes. It could either be in the format string or a separate string. The
>>> upside of a separate string is that a consumer may ignore it trivially
>>> if it doesn't need the information.
>>>
>>> Another open question is for nested types: does the format string
>>> represent the entire type including children? Or must child types be
>>> read in the child arrays? If we mimic ArrayData, then the format
>>> string should represent the entire type; it will then be more complex to
>>> parse.
>>>
>>> We should also make sure that extension types fit in the protocol.
>>>
>>>> * Should the API more closely align with the IPC spec (pass a schema
>>>> separately and list of buffers instead of individual arrays)?
>>>
>>> Then you have something that's not immediately usable (you have to do
>>> some processing to reconstitute the individual arrays). One goal here is
>>> to minimize implementation costs for producers and consumers. The
>>> assumption is a data model similar to the C++ ArrayData model; do we
>>> have implementations that use an entirely different model? Perhaps I
>>> should take a look :-)
>>>
>>> Note that the draft I posted only concerns arrays. We may also want to
>>> have a C struct for batches or tables.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1]
>>>> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
>>>>
>>>> On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org> wrote:
>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> One thing that was discussed in the sync call is the ability to easily
>>>>> pass arrays at runtime between Arrow implementations or Arrow-supporting
>>>>> libraries in the same process, without bearing the cost of linking to
>>>>> e.g. the C++ Arrow library.
>>>>>
>>>>> (for example: "Duckdb wants to provide an option to return Arrow data of
>>>>> result sets, but they don't like having Arrow as a dependency")
>>>>>
>>>>> One possibility would be to define a C-level protocol similar in spirit
>>>>> to the Python buffer protocol, which some people may be familiar with (*).
>>>>>
>>>>> The basic idea is to define a simple C struct, which is ABI-stable and
>>>>> describes an Arrow array adequately. The struct can be stack-allocated.
>>>>> Its definition can also be copied into another project (or interfaced with
>>>>> using a C FFI layer, depending on the language).
>>>>>
>>>>> There is no formal proposal, this message is meant to stir the discussion.
>>>>>
>>>>> Issues to work out:
>>>>>
>>>>> * Memory lifetime issues: where Python simply associates the Py_buffer
>>>>>   with a PyObject owner (a garbage-collected Python object), we need
>>>>>   another means to control lifetime of pointed areas. One simple
>>>>>   possibility is to include a destructor function pointer in the protocol
>>>>>   struct.
>>>>>
>>>>> * Arrow type representation. We probably need some kind of "format"
>>>>>   mini-language to represent Arrow types, so that a type can be described
>>>>>   using a `const char*`. Ideally, primitive types at least should be
>>>>>   trivially parsable. We may take inspiration from Python here (`struct`
>>>>>   module format characters, PEP 3118 format additions).
>>>>>
>>>>> Example C struct definition (not a formal proposal!):
>>>>>
>>>>> struct ArrowBuffer {
>>>>>   void* data;
>>>>>   int64_t nbytes;
>>>>>   // Called by the consumer when it doesn't need the buffer anymore
>>>>>   void (*release)(struct ArrowBuffer*);
>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>   void* user_data;
>>>>> };
>>>>>
>>>>> struct ArrowArray {
>>>>>   // Type description
>>>>>   const char* format;
>>>>>   // Data description
>>>>>   int64_t length;
>>>>>   int64_t null_count;
>>>>>   int64_t n_buffers;
>>>>>   // Note: these pointers are probably owned by the ArrowArray struct
>>>>>   // and will be released and free()ed by the release callback.
>>>>>   struct ArrowBuffer* buffers;
>>>>>   struct ArrowArray* dictionary;
>>>>>   // Called by the consumer when it doesn't need the array anymore
>>>>>   void (*release)(struct ArrowArray*);
>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>   void* user_data;
>>>>> };
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> (*) For the record, the reference for the Python buffer protocol:
>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>>>>> and its C struct definition:
>>>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>
>>>
>>
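
To make the ideas in this thread concrete, here is a minimal C sketch of how a producer and a consumer might use such a struct. It is an illustration under assumptions, not the proposal itself: the struct follows the later suggestion that the array owns its buffers directly (no separate buffer struct), the format code "l" for int64 is an assumed single-character code, the two-buffer layout (validity bitmap, then values) is assumed, and the function names release_malloced_array, export_int64_range and sum_int64_array are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variant of the draft struct, for illustration only. */
struct ArrowArray {
  const char* format;      /* type description; "l" = int64 is an assumed code */
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  void** buffers;          /* owned by the array, freed by release() */
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* user_data;         /* opaque producer data, unused in this sketch */
};

/* Producer-side destructor: frees everything the array owns. */
void release_malloced_array(struct ArrowArray* array) {
  for (int64_t i = 0; i < array->n_buffers; ++i) {
    free(array->buffers[i]);   /* free(NULL) is a no-op */
  }
  free(array->buffers);
  array->buffers = NULL;
  array->n_buffers = 0;
  array->release = NULL;       /* marks the struct as released */
}

/* Producer: export the int64 values 0..n-1 into a caller-provided struct.
   Returns 0 on success, -1 on a negative length or allocation failure. */
int export_int64_range(int64_t n, struct ArrowArray* out) {
  if (n < 0) return -1;
  void** buffers = malloc(2 * sizeof(void*));
  int64_t* values = malloc((size_t)(n > 0 ? n : 1) * sizeof(int64_t));
  if (buffers == NULL || values == NULL) {
    free(buffers);
    free(values);
    return -1;
  }
  for (int64_t i = 0; i < n; ++i) values[i] = i;
  buffers[0] = NULL;     /* no validity bitmap, since there are no nulls */
  buffers[1] = values;
  out->format = "l";
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->dictionary = NULL;
  out->release = release_malloced_array;
  out->user_data = NULL;
  return 0;
}

/* Consumer: checks the format string, then sums the values.
   Returns 0 on success, -1 if the array is not a plain int64 array. */
int sum_int64_array(const struct ArrowArray* array, int64_t* total) {
  assert(array->release != NULL);   /* must not be used after release */
  if (strcmp(array->format, "l") != 0 || array->n_buffers != 2) return -1;
  const int64_t* values = array->buffers[1];
  int64_t sum = 0;
  for (int64_t i = 0; i < array->length; ++i) sum += values[i];
  *total = sum;
  return 0;
}
```

A consumer would stack-allocate a struct ArrowArray, call export_int64_range(5, &arr), read the data with sum_int64_array (which yields 10 for the values 0..4), and finally call arr.release(&arr). The consumer never needs to know how the memory was allocated, which is the point of carrying the destructor in the struct.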