Re: [DISCUSS] C-level in-process array protocol

Zhuo Peng Thu, 19 Sep 2019 10:53:31 -0700

On Thu, Sep 19, 2019 at 10:18 AM Antoine Pitrou <anto...@python.org> wrote:


>
> No, the plan for this proposal is to avoid providing a C API.  Each
> Arrow implementation could produce and consume the C data protocol, for
> example the C++ Array class could add these methods:
>
> class Array {
>   // ...
>
>  public:
>   // Export array to the C data protocol
>   void Share(ArrowArray* out);
>   // Import a C data protocol array
>   static Status FromShared(ArrowArray* input,
>                            std::shared_ptr<Array>* out);
> };
>
> Also, I don't know why a C API exposed by the C++ library would solve
> your problem.  You would still have a problem with bundling the .so,
> symbol conflicts if several libraries load libarrow.so, etc.

The problem is mainly that C++ cannot provide a stable ABI for templates
(and thus for the STL). If the Arrow C++ library's public headers contain
templates or definitions from the STL, the only way to guarantee safety is
to force the client library to use the same toolchain and the same flags
with which the Arrow DSO was built. (Yes, distribution methods like Conda
help mitigate that issue by enforcing an (almost) uniform toolchain, but
problems can still occur if, say, a client is built with --std=c++17 while
libarrow.so is built with --std=gnu11; example at [1].)

These problems are only potential, though: they won't bite anyone until
they actually occur, and they are more likely to happen with pip/wheel than
with conda.

But anyways, this idea is still nice. I could imagine that, at least in
Arrow's Python C API, there would be a

PyObject* pyarrow_array_from_c_protocol(ArrowArray*);

This way the C++ APIs can be avoided while still allowing arrays to be
created in C/C++ and used in Python.
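
To make the lifetime handling concrete, here is a minimal sketch in C of a
producer and a consumer exchanging an array through the draft struct quoted
below. It is only an illustration under several assumptions, not the proposed
protocol: the pointer types inside ArrowArray are spelled with the struct
names ArrowBuffer and ArrowArray, the format string "i" for int32 is borrowed
from Python's struct module, the buffer order (validity bitmap, then values)
follows the Arrow columnar layout, and the example_* functions are
hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Draft structs from the quoted message (not a final specification). */
struct ArrowBuffer {
  void* data;
  int64_t nbytes;
  void (*release)(struct ArrowBuffer*);
  void* user_data;
};

struct ArrowArray {
  const char* format;
  int64_t length;
  int64_t null_count;
  int64_t n_buffers;
  struct ArrowBuffer* buffers;
  struct ArrowArray* dictionary;
  void (*release)(struct ArrowArray*);
  void* user_data;
};

/* Hypothetical release callback: frees everything the producer allocated
   and clears the callback so the struct is visibly released. */
void example_release(struct ArrowArray* array) {
  for (int64_t i = 0; i < array->n_buffers; ++i) {
    free(array->buffers[i].data); /* free(NULL) is a no-op */
  }
  free(array->buffers);
  array->buffers = NULL;
  array->n_buffers = 0;
  array->release = NULL;
}

/* Hypothetical producer: exports n int32 values that contain no nulls.
   Returns 0 on success, -1 on allocation failure. */
int example_export_int32(const int32_t* values, int64_t n,
                         struct ArrowArray* out) {
  assert(values != NULL && out != NULL && n >= 0);
  memset(out, 0, sizeof(*out));

  out->buffers = calloc(2, sizeof(struct ArrowBuffer));
  if (out->buffers == NULL) return -1;

  /* buffers[0] (validity bitmap) stays NULL because there are no nulls;
     buffers[1] holds a private copy of the values. */
  size_t nbytes = (size_t)n * sizeof(int32_t);
  out->buffers[1].data = malloc(nbytes > 0 ? nbytes : 1);
  if (out->buffers[1].data == NULL) {
    free(out->buffers);
    out->buffers = NULL;
    return -1;
  }
  memcpy(out->buffers[1].data, values, nbytes);
  out->buffers[1].nbytes = (int64_t)nbytes;

  out->format = "i"; /* assumed int32 format character */
  out->length = n;
  out->null_count = 0;
  out->n_buffers = 2;
  out->release = example_release;
  return 0;
}

/* Hypothetical consumer: sums the values, then tells the producer it no
   longer needs the array by calling the release callback. */
int64_t example_consume_sum(struct ArrowArray* array) {
  const int32_t* values = (const int32_t*)array->buffers[1].data;
  int64_t sum = 0;
  for (int64_t i = 0; i < array->length; ++i) {
    sum += values[i];
  }
  array->release(array);
  return sum;
}
```

The consumer here only needs the struct definitions and the release pointer
stored in the struct; it never has to link against libarrow, which is the
point of the proposal.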

[1] https://github.com/tensorflow/tensorflow/issues/23561

> Regards
>
> Antoine.
>
>
> On 19/09/2019 at 18:21, Zhuo Peng wrote:
> > Hi Antoine,
> >
> > I'm also interested in a stable ABI (previously I posted on this mailing
> > list about the ABI issues I had [1]). Does having such an ABI-stable
> > C-struct imply that there will be a set of C-APIs exposed by the Arrow
> > (C++) library (which I think would lead to a solution to all the inherent
> > ABI issues caused by C++)?
> >
> > [1]
> > https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
> >
> > On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >>
> >> On 19/09/2019 at 09:39, Micah Kornfield wrote:
> >>> I like the idea of a stable ABI that can be used for in-process
> >>> communication.  For instance, there was a recent question on
> >>> Stack Overflow on how to solve this [1].
> >>>
> >>> A couple of thoughts/questions:
> >>> * Would ArrowArray also need a self reference for children arrays?
> >>
> >> Yes, I forgot that.  I also think we don't need a separate Buffer
> >> struct, instead the Array struct should own all its buffers.
> >>
> >>> * Should transferring key-value metadata be in scope?
> >>
> >> Yes.  It could either be in the format string or a separate string.  The
> >> upside of a separate string is that a consumer may ignore it trivially
> >> if it doesn't need the information.
> >>
> >> Another open question is for nested types: does the format string
> >> represent the entire type including children?  Or must child types be
> >> read in the child arrays?  If we mimic ArrayData, then the format
> >> string should represent the entire type; it will then be more complex to
> >> parse.
> >>
> >> We should also make sure that extension types fit in the protocol.
> >>
> >>> * Should the API more closely align with the IPC spec (pass a schema
> >>> separately and a list of buffers instead of individual arrays)?
> >>
> >> Then you have something that's not immediately usable (you have to do
> >> some processing to reconstitute the individual arrays).  One goal here
> >> is to minimize implementation costs for producers and consumers.  The
> >> assumption is a data model similar to the C++ ArrayData model; do we
> >> have implementations that use an entirely different model?  Perhaps I
> >> should take a look :-)
> >>
> >> Note that the draft I posted only concerns arrays.  We may also want to
> >> have a C struct for batches or tables.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>>
> >>> Thanks,
> >>> Micah
> >>>
> >>> [1]
> >>> https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> >>>
> >>> On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org> wrote:
> >>>
> >>>>
> >>>> Hello,
> >>>>
> >>>> One thing that was discussed in the sync call is the ability to easily
> >>>> pass arrays at runtime between Arrow implementations or
> >>>> Arrow-supporting libraries in the same process, without bearing the
> >>>> cost of linking to e.g. the C++ Arrow library.
> >>>>
> >>>> (for example: "Duckdb wants to provide an option to return Arrow data
> >>>> of result sets, but they don't like having Arrow as a dependency")
> >>>>
> >>>> One possibility would be to define a C-level protocol similar in
> >>>> spirit to the Python buffer protocol, which some people may be
> >>>> familiar with (*).
> >>>>
> >>>> The basic idea is to define a simple C struct, which is ABI-stable and
> >>>> describes an Arrow array adequately.  The struct can be
> >>>> stack-allocated.  Its definition can also be copied in another project
> >>>> (or interfaced with using a C FFI layer, depending on the language).
> >>>>
> >>>> There is no formal proposal, this message is meant to stir the
> >>>> discussion.
> >>>>
> >>>> Issues to work out:
> >>>>
> >>>> * Memory lifetime issues: where Python simply associates the Py_buffer
> >>>> with a PyObject owner (a garbage-collected Python object), we need
> >>>> another means to control lifetime of pointed areas.  One simple
> >>>> possibility is to include a destructor function pointer in the
> >>>> protocol struct.
> >>>>
> >>>> * Arrow type representation.  We probably need some kind of "format"
> >>>> mini-language to represent Arrow types, so that a type can be
> >>>> described using a `const char*`.  Ideally, primitive types at least
> >>>> should be trivially parsable.  We may take inspiration from Python
> >>>> here (`struct` module format characters, PEP 3118 format additions).
> >>>>
> >>>> Example C struct definition (not a formal proposal!):
> >>>>
> >>>> struct ArrowBuffer {
> >>>>   void* data;
> >>>>   int64_t nbytes;
> >>>>   // Called by the consumer when it doesn't need the buffer anymore
> >>>>   void (*release)(struct ArrowBuffer*);
> >>>>   // Opaque user data (for e.g. the release callback)
> >>>>   void* user_data;
> >>>> };
> >>>>
> >>>> struct ArrowArray {
> >>>>   // Type description
> >>>>   const char* format;
> >>>>   // Data description
> >>>>   int64_t length;
> >>>>   int64_t null_count;
> >>>>   int64_t n_buffers;
> >>>>   // Note: these pointers are probably owned by the ArrowArray struct
> >>>>   // and will be released and free()ed by the release callback.
> >>>>   struct ArrowBuffer* buffers;
> >>>>   struct ArrowArray* dictionary;
> >>>>   // Called by the consumer when it doesn't need the array anymore
> >>>>   void (*release)(struct ArrowArray*);
> >>>>   // Opaque user data (for e.g. the release callback)
> >>>>   void* user_data;
> >>>> };
> >>>>
> >>>> Thoughts?
> >>>>
> >>>> (*) For the record, the reference for the Python buffer protocol:
> >>>> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> >>>> and its C struct definition:
> >>>> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> >>>>
> >>>> Regards
> >>>>
> >>>> Antoine.
> >>>>
> >>>
> >>
> >
>
