Hello,

I like this proposal, as it will make interfacing between the various
Arrow-supporting libraries inside a process much easier. I'm a bit critical,
though, of using a string as the format representation, as one needs to parse
it correctly. Couldn't we use the enums we already have and reimplement them
as C defines instead?
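
To make the idea concrete, here is a rough sketch. The macro names, their
values and the helper function are made up for illustration and are not taken
from any existing Arrow header. The point is only that a consumer could
dispatch on an integer with a plain switch instead of parsing a string.

```c
#include <stdint.h>

/* Hypothetical type ids: names and values are illustrative only and
   do not correspond to any existing Arrow definition. */
#define ARROW_TYPE_INT32  1
#define ARROW_TYPE_INT64  2
#define ARROW_TYPE_DOUBLE 3

/* Return the byte width of a fixed-width primitive type,
   or -1 if the type id is not recognized. */
static int arrow_type_byte_width(int32_t type_id) {
  switch (type_id) {
    case ARROW_TYPE_INT32:  return 4;
    case ARROW_TYPE_INT64:  return 8;
    case ARROW_TYPE_DOUBLE: return 8;
    default:                return -1;
  }
}
```

This only covers primitive types. Parametrized or nested types would
presumably still need additional fields next to the integer id.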

Uwe

On Thu, Sep 19, 2019, at 6:21 PM, Zhuo Peng wrote:
> Hi Antoine,
> 
> I'm also interested in a stable ABI (previously I posted on this mailing
> list about the ABI issues I had [1]). Does having such an ABI-stable
> C-struct imply that there will be a set of C-APIs exposed by the Arrow
> (C++) library (which I think would lead to a solution to all the inherent
> ABI issues caused by C++)?
> 
> [1]
> https://lists.apache.org/thread.html/27b6e2a30cf93c5f5f78de970c68c7d996f538d94ab61431fa342f41@%3Cdev.arrow.apache.org%3E
> 
> On Thu, Sep 19, 2019 at 1:07 AM Antoine Pitrou <anto...@python.org> wrote:
> 
> >
> > Le 19/09/2019 à 09:39, Micah Kornfield a écrit :
> > > I like the idea of a stable ABI that can be used for in-process
> > > communication.  For instance, there was a recent question on
> > > Stack Overflow on how to solve this [1].
> > >
> > > A couple of thoughts/questions:
> > > * Would ArrowArray also need a self reference for children arrays?
> >
> > Yes, I forgot that.  I also think we don't need a separate Buffer
> > struct, instead the Array struct should own all its buffers.
> >
> > > * Should transferring key-value metadata be in scope?
> >
> > Yes.  It could either be in the format string or a separate string.  The
> > upside of a separate string is that a consumer may ignore it trivially
> > if it doesn't need the information.
> >
> > Another open question is for nested types: does the format string
> > represent the entire type including children?  Or must child types be
> > read in the child arrays?  If we mimic ArrayData, then the format
> > string should represent the entire type; it will then be more complex to
> > parse.
> >
> > We should also make sure that extension types fit in the protocol.
> >
> > > * Should the API more closely align with the IPC spec (pass a schema
> > > separately and a list of buffers instead of individual arrays)?
> >
> > Then you have something that's not immediately usable (you have to do some
> > processing to reconstitute the individual arrays).  One goal here is to
> > minimize implementation costs for producers and consumers.  The
> > assumption is a data model similar to the C++ ArrayData model; do we
> > have implementations that use an entirely different model?  Perhaps I
> > should take a look :-)
> >
> > Note that the draft I posted only concerns arrays.  We may also want to
> > have a C struct for batches or tables.
> >
> > Regards
> >
> > Antoine.
> >
> >
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > > https://stackoverflow.com/questions/57966032/how-does-apache-arrow-facilitate-no-overhead-for-cross-system-communication/57967220#57967220
> > >
> > > On Wed, Sep 18, 2019 at 10:52 AM Antoine Pitrou <anto...@python.org>
> > wrote:
> > >
> > >>
> > >> Hello,
> > >>
> > >> One thing that was discussed in the sync call is the ability to easily
> > >> pass arrays at runtime between Arrow implementations or Arrow-supporting
> > >> libraries in the same process, without bearing the cost of linking to
> > >> e.g. the C++ Arrow library.
> > >>
> > >> (for example: "Duckdb wants to provide an option to return Arrow data of
> > >> result sets, but they don't like having Arrow as a dependency")
> > >>
> > >> One possibility would be to define a C-level protocol similar in spirit
> > >> to the Python buffer protocol, which some people may be familiar with (*).
> > >>
> > >> The basic idea is to define a simple C struct, which is ABI-stable and
> > >> describes an Arrow array adequately.  The struct can be stack-allocated.
> > >> Its definition can also be copied in another project (or interfaced with
> > >> using a C FFI layer, depending on the language).
> > >>
> > >> There is no formal proposal; this message is meant to stir the discussion.
> > >>
> > >> Issues to work out:
> > >>
> > >> * Memory lifetime issues: where Python simply associates the Py_buffer
> > >> with a PyObject owner (a garbage-collected Python object), we need
> > >> another means to control lifetime of pointed areas.  One simple
> > >> possibility is to include a destructor function pointer in the protocol
> > >> struct.
> > >>
> > >> * Arrow type representation.  We probably need some kind of "format"
> > >> mini-language to represent Arrow types, so that a type can be described
> > >> using a `const char*`.  Ideally, primitive types at least should be
> > >> trivially parsable.  We may take inspiration from Python here (`struct`
> > >> module format characters, PEP 3118 format additions).
> > >>
> > >> Example C struct definition (not a formal proposal!):
> > >>
> > >> struct ArrowBuffer {
> > >>   void* data;
> > >>   int64_t nbytes;
> > >>   // Called by the consumer when it doesn't need the buffer anymore
> > >>   void (*release)(struct ArrowBuffer*);
> > >>   // Opaque user data (for e.g. the release callback)
> > >>   void* user_data;
> > >> };
> > >>
> > >> struct ArrowArray {
> > >>   // Type description
> > >>   const char* format;
> > >>   // Data description
> > >>   int64_t length;
> > >>   int64_t null_count;
> > >>   int64_t n_buffers;
> > >>   // Note: these pointers are probably owned by the ArrowArray struct
> > >>   // and will be released and free()ed by the release callback.
> > >>   struct ArrowBuffer* buffers;
> > >>   struct ArrowArray* dictionary;
> > >>   // Called by the consumer when it doesn't need the array anymore
> > >>   void (*release)(struct ArrowArray*);
> > >>   // Opaque user data (for e.g. the release callback)
> > >>   void* user_data;
> > >> };
> > >>
> > >> Thoughts?
> > >>
> > >> (*) For the record, the reference for the Python buffer protocol:
> > >> https://docs.python.org/3/c-api/buffer.html#buffer-structure
> > >> and its C struct definition:
> > >> https://github.com/python/cpython/blob/v3.7.4/Include/object.h#L181-L195
> > >>
> > >> Regards
> > >>
> > >> Antoine.
> > >>
> > >
> >
>
