Re: [DISCUSS] C-level in-process array protocol

Antoine Pitrou Thu, 19 Sep 2019 10:21:22 -0700


On 19/09/2019 at 19:11, Uwe L. Korn wrote:
> Hello,
> 
> I like this proposal as it will make interfacing inside a process between
> various Arrow-supporting libraries much easier. I'm a bit critical, though,
> of using a string as the format representation, as one needs to parse it correctly.
> Couldn't we use the enums we already have and reimplement them as C-defines 
> instead?


We could, but then we need to represent type parameters separately, as
some types are parametric (such as Time-related types).  So we would
still have some kind of encoded representation for those parameters.

So it may be as easy to represent everything inside the format string:
the type class (a single character perhaps) and optionally the type
instance parameters (if necessary).

Note that for non-parametric primitive types such as int64_t, double,
utf8... the format string will be a single character anyway.
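
As a rough sketch of what parsing such a format string could look like in C
(the letter codes, the enum and the function name below are made up for
illustration; nothing in this thread fixes them):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical type classes; the letter codes used below are assumptions. */
enum TypeClass { TYPE_INT64, TYPE_FLOAT64, TYPE_UTF8, TYPE_TIME, TYPE_UNKNOWN };

struct ParsedType {
  enum TypeClass type_class;
  char unit; /* TYPE_TIME only: 's', 'm', 'u' or 'n'; 0 otherwise */
};

/* Returns 0 on success, -1 if the format string is not understood. */
static int parse_format(const char* format, struct ParsedType* out) {
  out->type_class = TYPE_UNKNOWN;
  out->unit = 0;
  if (format == NULL || format[0] == '\0') return -1;

  switch (format[0]) {
    case 'l': out->type_class = TYPE_INT64; break;
    case 'g': out->type_class = TYPE_FLOAT64; break;
    case 'u': out->type_class = TYPE_UTF8; break;
    case 't':
      /* Parametric type: exactly one unit character must follow. */
      if (format[1] == '\0' || strchr("smun", format[1]) == NULL ||
          format[2] != '\0') {
        return -1;
      }
      out->type_class = TYPE_TIME;
      out->unit = format[1];
      return 0;
    default:
      return -1;
  }

  /* Non-parametric types are exactly one character long. */
  if (format[1] != '\0') {
    out->type_class = TYPE_UNKNOWN;
    return -1;
  }
  return 0;
}
```

With these assumed codes, "l" parses as int64 with no parameter, "ts" parses
as a time type with unit 's', and "t" alone is rejected because the parameter
is missing.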

Regards

Antoine.


> 
> Uwe
> 
> On Thu, Sep 19, 2019, at 6:21 PM, Zhuo Peng wrote:
>> Hi Antoine,
>>
>> I'm also interested in a stable ABI (previously I posted on this mailing
>> list about the ABI issues I had [1]). Does having such an ABI-stable
>> C-struct imply that there will be a set of C-APIs exposed by the Arrow
>> (C++) library (which I think would lead to a solution to all the inherent
>> ABI issues caused by C++)?
>>
>> [1]
>> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
>>
>> On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou <anto...@python.org> wrote:
>>
>>>
>>> On 19/09/2019 at 09:39, Micah Kornfield wrote:
>>>> I like the idea of a stable ABI that can be used for in-process
>>>> communication.  For instance, there was a recent question on
>>>> Stack Overflow on how to solve this [1].
>>>>
>>>> A couple of thoughts/questions:
>>>> * Would ArrowArray also need a self reference for children arrays?
>>>
>>> Yes, I forgot that.  I also think we don't need a separate Buffer
>>> struct, instead the Array struct should own all its buffers.
>>>
>>>> * Should transferring key-value metadata be in scope?
>>>
>>> Yes.  It could either be in the format string or a separate string.  The
>>> upside of a separate string is that a consumer may ignore it trivially
>>> if it doesn't need the information.
>>>
>>> Another open question is for nested types: does the format string
>>> represent the entire type including children?  Or must child types be
>>> read in the child arrays?  If we mimic ArrayData, then the format
>>> string should represent the entire type; it will then be more complex to
>>> parse.
>>>
>>> We should also make sure that extension types fit in the protocol.
>>>
>>>> * Should the API more closely align with the IPC spec (pass a schema
>>>> separately and a list of buffers instead of individual arrays)?
>>>
>>> Then you have something that's not immediately usable (you have to do
>>> some processing to reconstitute the individual arrays).  One goal here
>>> is to minimize implementation costs for producers and consumers.  The
>>> assumption is a data model similar to the C++ ArrayData model; do we
>>> have implementations that use an entirely different model?  Perhaps I
>>> should take a look :-)
>>>
>>> Note that the draft I posted only concerns arrays.  We may also want to
>>> have a C struct for batches or tables.
>>>
>>> Regards
>>>
>>> Antoine.
>>>
>>>
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> [1]
>>>> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
>>>>
>>>> On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org> wrote:
>>>>
>>>>>
>>>>> Hello,
>>>>>
>>>>> One thing that was discussed in the sync call is the ability to easily
>>>>> pass arrays at runtime between Arrow implementations or Arrow-supporting
>>>>> libraries in the same process, without bearing the cost of linking to
>>>>> e.g. the C++ Arrow library.
>>>>>
>>>>> (for example: "Duckdb wants to provide an option to return Arrow data of
>>>>> result sets, but they don't like having Arrow as a dependency")
>>>>>
>>>>> One possibility would be to define a C-level protocol similar in spirit
>>>>> to the Python buffer protocol, which some people may be familiar with (*).
>>>>>
>>>>> The basic idea is to define a simple C struct, which is ABI-stable and
>>>>> describes an Arrow array adequately.  The struct can be stack-allocated.
>>>>> Its definition can also be copied in another project (or interfaced with
>>>>> using a C FFI layer, depending on the language).
>>>>>
>>>>> There is no formal proposal; this message is meant to stir the
>>>>> discussion.
>>>>>
>>>>> Issues to work out:
>>>>>
>>>>> * Memory lifetime issues: where Python simply associates the Py_buffer
>>>>> with a PyObject owner (a garbage-collected Python object), we need
>>>>> another means to control lifetime of pointed areas.  One simple
>>>>> possibility is to include a destructor function pointer in the protocol
>>>>> struct.
>>>>>
>>>>> * Arrow type representation.  We probably need some kind of "format"
>>>>> mini-language to represent Arrow types, so that a type can be described
>>>>> using a `const char*`.  Ideally, primitive types at least should be
>>>>> trivially parsable.  We may take inspiration from Python here (`struct`
>>>>> module format characters, PEP 3118 format additions).
>>>>>
>>>>> Example C struct definition (not a formal proposal!):
>>>>>
>>>>> struct ArrowBuffer {
>>>>>   void* data;
>>>>>   int64_t nbytes;
>>>>>   // Called by the consumer when it doesn't need the buffer anymore
>>>>>   void (*release)(struct ArrowBuffer*);
>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>   void* user_data;
>>>>> };
>>>>>
>>>>> struct ArrowArray {
>>>>>   // Type description
>>>>>   const char* format;
>>>>>   // Data description
>>>>>   int64_t length;
>>>>>   int64_t null_count;
>>>>>   int64_t n_buffers;
>>>>>   // Note: these pointers are probably owned by the ArrowArray struct
>>>>>   // and will be released and free()ed by the release callback.
>>>>>   struct ArrowBuffer* buffers;
>>>>>   struct ArrowArray* dictionary;
>>>>>   // Called by the consumer when it doesn't need the array anymore
>>>>>   void (*release)(struct ArrowArray*);
>>>>>   // Opaque user data (for e.g. the release callback)
>>>>>   void* user_data;
>>>>> };
>>>>>
>>>>> Thoughts?
>>>>>
>>>>> (*) For the record, the reference for the Python buffer protocol:
>>>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
>>>>> and its C struct definition:
>>>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
>>>>>
>>>>> Regards
>>>>>
>>>>> Antoine.
>>>>>
>>>>
>>>
>>
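
For reference, the release-callback lifetime scheme from the quoted draft
could be exercised roughly as follows. This is only a sketch under stated
assumptions: it drops the separate buffer struct (as suggested in the reply
quoted above) and stores plain pointers instead, it uses "l" as a made-up
format code for int64, it resets `release` to NULL to mark the array as
released, and the function names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Simplified variant of the draft struct; the field layout is an assumption. */
struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  void** buffers; /* owned by the array, freed by the release callback */
  /* Called by the consumer when it doesn't need the array anymore */
  void (*release)(struct ArrowArray*);
  void* user_data;
};

/* Producer-provided destructor: frees every buffer, then the pointer table. */
static void release_malloced_array(struct ArrowArray* array) {
  for (int64_t i = 0; i < array->n_buffers; ++i) free(array->buffers[i]);
  free(array->buffers);
  array->buffers = NULL;
  array->n_buffers = 0;
  array->release = NULL; /* assumption: NULL marks a released array */
}

/* Producer side: export the values 0..length-1 as an int64 array.
   Returns 0 on success, -1 on invalid length or allocation failure. */
static int export_int64_range(int64_t length, struct ArrowArray* out) {
  if (length < 0) return -1;
  int64_t* values = malloc((size_t)length * sizeof(int64_t));
  void** buffers = malloc(2 * sizeof(void*));
  if ((values == NULL && length > 0) || buffers == NULL) {
    free(values);
    free(buffers);
    return -1;
  }
  for (int64_t i = 0; i < length; ++i) values[i] = i;
  buffers[0] = NULL;   /* no validity bitmap, since there are no nulls */
  buffers[1] = values; /* the data buffer */
  out->format = "l";   /* hypothetical code for int64 */
  out->length = length;
  out->null_count = 0;
  out->n_buffers = 2;
  out->buffers = buffers;
  out->release = release_malloced_array;
  out->user_data = NULL;
  return 0;
}

/* Consumer side: read the data, then hand the memory back to the producer. */
static int64_t sum_and_release(struct ArrowArray* array) {
  const int64_t* values = array->buffers[1];
  int64_t sum = 0;
  for (int64_t i = 0; i < array->length; ++i) sum += values[i];
  array->release(array);
  return sum;
}
```

The consumer never calls free() itself; it only invokes the callback, so the
producer stays free to allocate the memory however it likes.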
