Re: [DISCUSS] C-level in-process array protocol

Wes McKinney Wed, 02 Oct 2019 20:26:09 -0700

On Wed, Oct 2, 2019 at 10:19 PM Wes McKinney <[email protected]> wrote:
>
> On Wed, Oct 2, 2019 at 7:46 PM Jacques Nadeau <[email protected]> wrote:
> >
> > I'd like to hear more opinions from others on this topic. This conversation
> > seems mostly dominated by comments from myself, Wes and Antoine.
> >
> > I think it is reasonable to argue that keeping any ABI (or header/struct
> > pattern) as narrow as possible would allow us to minimize overlap with the
> > existing in-memory specification. In Arrow's case, this could be as simple
> > as a single memory pointer for schema (backed by flatbuffers) and a single
> > memory location for data (that references the record batch header, which in
> > turn provides pointers into the actual arrow data). Extensions would need
> > to be added for reference management as done here but I continue to think
> > we should defer discussion of that until the base data structures are
> > resolved. I see the comments here as arguing for a much broader ABI, in
> > part to support having people build "Arrow" components that interconnect
> > using this new interface. I understand the desire to expand the ABI to be
> > driven by needs to reduce dependencies and ease usability.
> >
> > The representation within the related patch is being presented as a way for
> > applications to share Arrow data but is not easily accessible to all
> > languages. I want to avoid a situation where someone says "I produced an
> > Arrow API" when what they've really done is created a C interface which
> > only a small subset of languages can actually leverage. For example, every
> > language now knows how to parse the existing schema definition as rendered
> > in flatbuf. In order to interact with something that implements this new
> > pattern one would also be required to implement completely new schema
> > consumption code. In the proposal itself it suggests this (for example
> > enhancing the C++ library to consume structures produced this way).
>
> I think we are creating a C-based in-memory representation of Arrow
> (significantly simpler than what we have in C++, which involves smart
> pointers and other C++ concepts) and how people use these structs is
> up to them.
>
> > As I said, I really want to hear more opinions. Running this past various
> > developers I know, many have echoed my concerns but that really doesn't
> > matter (and who knows how much of that is colored by my presentation of the
> > issue). What do people here think? If someone builds an "Arrow" library
> > that implements this set of structures, how does one use it in Node? In
> > Java? Does it drive creation of a secondary set of interfaces in each of
> > those languages to work with this kind of pattern? (For example, in a JVM
> > view of the world, working with a plain struct in java rather than a set of
> > memory pointers against our existing IPC formats would be quite painful and
> > we'd definitely need to create some glue code for users. I worry the same
> > pattern would occur in many other languages.)
> >
>
> I'm fine to wait for more opinions, but I don't think that creating a
> strict C programming interface means that all languages have to figure
> out how to do FFI with it.
>
> > To respond directly to some of Wes's most recent comments from the email
> > below. I struggle to map your description of the situation to the rest of
> > > the thread and the proposed patch.  For example, you say that a non-goal is
> > > "creating a new canonical way to serialize metadata" but the patch
> > proposes a concrete string based encoding system to describe data types.
> > Aren't those things in conflict?
> >
>
> Each language implementation represents in-memory schemas in a
> different way. In C++ we have the arrow::DataType classes. If the goal
> is to create a very compact C-based data model for Arrow, why is using
> a string representation of types instead of a more verbose object
> model inappropriate?
>


FWIW, the string-style representation of types is widespread. It's
used in the so-called C "buffer protocol" in Python and the "struct"
standard library module:

https://docs.python.org/3/library/struct.html#module-struct
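
As a rough illustration of that string style (a sketch using only the
Python standard library; these are the "struct" module's type codes, not
the format strings proposed in the patch), a short string is enough to
describe a binary layout, and the buffer protocol reports the same kind
of code for an array's element type:

```python
import struct
from array import array

# "<id" reads as: little-endian, one 32-bit signed int, one 64-bit float.
packed = struct.pack("<id", 7, 2.5)
print(struct.calcsize("<id"))        # 12 bytes: 4 + 8, no padding with "<"
print(struct.unpack("<id", packed))  # (7, 2.5)

# The buffer protocol exposes a format string of the same style.
view = memoryview(array("i", [1, 2, 3]))
print(view.format, view.itemsize)    # type code "i" and its platform-dependent size
```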

> > I'll also think more on this and challenge my own perspective. This isn't
> > where my focus is so my comments aren't as developed/thoughtful as I'd like.
> >
> >
> > On Tue, Oct 1, 2019 at 7:33 PM Wes McKinney <[email protected]> wrote:
> >
> > > hi Jacques,
> > >
> > > I think we've veered off course a bit and maybe we could reframe the
> > > discussion.
> > >
> > > Goals
> > > * A "drop-in" header-only C file that projects can use as a
> > > programming interface either internally only or to expose in-memory
> > > data structures between C functions at call sites. Ideally little to
> > > no disassembly/reassembly should be required on either "side" of the
> > > call site.
> > > * Simplifying adoption of Arrow for C programmers, or languages based
> > > around C FFI
> > >
> > > Non-goals
> > > * Expanding the columnar format or creating an alternative canonical
> > > in-memory representation
> > > * Creating a new canonical way to serialize metadata
> > >
> > > Note that this use case has been on my mind for more than 2 years:
> > > https://issues.apache.org/jira/browse/ARROW-1058
> > >
> > > I think there are a couple of potentially misleading things at play here
> > >
> > > 1. The use of the word "protocol". In C, a struct has a well-defined
> > > binary layout, so a C API is also an ABI. Using C structs to
> > > communicate data can be considered to be a protocol, but it means
> > > something different in the context of the "Arrow protocol". I think we
> > > need to call this a "C API"
> > >
> > > 2. The documentation for this in Antoine's PR is in the format/
> > > directory. It would probably be better to have a "C API" section in
> > > the documentation.
> > >
> > > The header file under discussion and the documentation about it is
> > > best considered as a "library".
> > >
> > > It might be useful at some point to create a C99 implementation of the
> > > IPC protocol as well using FlatCC with the goal of having a complete
> > > implementation of the columnar format in C with minimal binary
> > > footprint. This is analogous to the NanoPB project which is an
> > > implementation of Protocol Buffers with small code size
> > >
> > > https://github.com/nanopb/nanopb
> > >
> > > Let me know if this makes more sense.
> > >
> > > I think it's important to communicate clearly about this primarily for
> > > the benefit of the outside world which can confuse easily as we have
> > > observed over the last few years =)
> > >
> > > Wes
> > >
> > > On Tue, Oct 1, 2019 at 2:55 PM Jacques Nadeau <[email protected]> wrote:
> > > >
> > > > I disagree with this statement:
> > > >
> > > > - the IPC format is meant for serialization while the C data protocol is
> > > > > meant for in-memory communication, so different concerns apply
> > > >
> > > > > If that is how a particular implementation presents it, that is a
> > > > > weakness of the implementation, not the format. The primary use case I
> > > > > was focused on when working on the initial format was communication within
> > > > > the same process. It seems like this is being used as a basis for the
> > > > > introduction of new things when the premise is inconsistent with the
> > > > > intention of the creation. The specific reason we used flatbuffers in the
> > > > > project was to collapse the separation of in-process and out-of-process
> > > > > communication. It means the same thing it does with the Arrow data itself:
> > > > > that a consumer doesn't have to use a particular library to interact with
> > > > > and use the data.
> > > >
> > > > It seems like there are two ideas here:
> > > >
> > > > 1) How do we make it easier for people to use Arrow?
> > > > > 2) Should we implement a new in-memory representation of Arrow that is
> > > > > language specific?
> > > >
> > > > > I'm entirely in support of number one. If for a particular type of domain,
> > > > > people want an easier way to interact with Arrow, let's make a new library
> > > > > that helps with that. In each of our current libraries, we do many things
> > > > > to make it easier to work with Arrow. None of those require a change to the
> > > > > core format or are formalized as a new in-memory standard. The in-memory
> > > > > representations of Rust or JavaScript or Java objects are implementation
> > > > > details.
> > > >
> > > > I'm against number two as it creates a fragmentation problem. Arrow is
> > > > about having a single canonical format for memory for both metadata and
> > > > data. Having multiple in-memory formats (especially when some are not
> > > > language independent) is counter to the goals of the project.
> > >
> > > I don't think anyone is proposing anything that would cause fragmentation.
> > >
> > > A central question is whether it is useful to define a reusable C ABI
> > > for the Arrow columnar format, and if there is sufficient interest, a
> > > tiny C implementation of the IPC protocol (which uses the Flatbuffers
> > > message) that assembles and disassembles the data structures defined
> > > in the C ABI.
> > >
> > > We could separately create a tiny implementation of the Arrow IPC
> > > protocol using FlatCC that could be dropped into applications
> > > requiring only a C compiler and nothing else.
> > >
> > >
> > > >
> > > > Two other, separate comments:
> > > > > 1) I don't understand the idea that we need to change the way Arrow
> > > > > fundamentally works so that people can avoid using a dependency. If the
> > > > > dependency is small, open source and easy to build, people can fork it and
> > > > > include directly if they want to. Let's not violate project principles
> > > > > because DuckDB has a religious perspective on dependencies. If the problem
> > > > > is people have to swallow too large of a pill to do basic things with Arrow
> > > > > in C, let's focus on fixing that (to our definition of ease, not someone
> > > > > else's). If FlatCC solves some of those things, great. If we need to build a
> > > > > baby integration library that is more C centric, great. Neither of those
> > > > > things requires implementing something at the format level.
> > > >
> > > > 2) It seems like we should discuss the data structure problem separately
> > > > from the reference management concern.
> > > >
> > > >
> > > > On Tue, Oct 1, 2019 at 5:42 AM Wes McKinney <[email protected]> wrote:
> > > >
> > > > > hi Antoine,
> > > > >
> > > > > > On Tue, Oct 1, 2019 at 4:29 AM Antoine Pitrou <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > > Le 01/10/2019 à 00:39, Wes McKinney a écrit :
> > > > > > > A couple things:
> > > > > > >
> > > > > > > * I think a C protocol / FFI for Arrow array/vectors would be better
> > > > > > > to have the same "shape" as an assembled array. Note that the C
> > > > > > > structs here have very nearly the same "shape" as the data structure
> > > > > > > representing a C++ Array object [1]. The disassembly and reassembly
> > > > > > > here is substantially simpler than the IPC protocol. A recursive
> > > > > > > structure in Flatbuffers would make RecordBatch messages much larger,
> > > > > > > so the flattened / disassembled representation we use for serialized
> > > > > > > record batches is the correct one
> > > > > >
> > > > > > I'm not sure I agree:
> > > > > >
> > > > > > - indeed, it's not a coincidence that the ArrowArray struct looks quite
> > > > > > closely like the C++ ArrayData object :-)  We have good experience with
> > > > > > that abstraction and it has proven to work quite well
> > > > > >
> > > > > > - the IPC format is meant for serialization while the C data protocol is
> > > > > > meant for in-memory communication, so different concerns apply
> > > > > >
> > > > > > - the fact that this makes the layout slightly larger doesn't seem
> > > > > > important at all; we're not talking about transferring data over the wire
> > > > > >
> > > > > > There's also another argument for having a recursive struct: it
> > > > > > simplifies how the data type is represented, since we can encode each
> > > > > > child type individually instead of encoding it in the parent's format
> > > > > > string (same applies for metadata and individual flags).
> > > > > >
> > > > >
> > > > > I was saying something different here. I was making an argument about
> > > > > why we use the flattened array-of-structs in the IPC protocol. One
> > > > > reason is that it's a more compact representation. That is not very
> > > > > important here because this protocol is only for *in-process* (for
> > > > > languages that have a C FFI facility) rather than *inter-process*
> > > > > communication.
> > > > >
> > > > > I agree also that the type encoding is simple, here, too, since we
> > > > > aren't having to split the schema and record batch between different
> > > > > serialized messages. There is some potential waste with having to
> > > > > populate the type fields multiple times when communicating a sequence
> > > > > of "chunks" from the same logical dataset.
> > > > >
> > > > > > > * The "formal" C protocol having the "assembled" shape means that many
> > > > > > > minimal Arrow users won't have to implement any separate data
> > > > > > > structures. They can just use the C struct directly or a slightly
> > > > > > > wrapped version thereof with some convenience functions.
> > > > > >
> > > > > > Yes, but the same applies to the current proposal.
> > > > > >
> > > > > > > * I think that requiring building a Flatbuffer for minimal use cases
> > > > > > > (e.g. communicating simple record batches with primitive types) passes
> > > > > > > on implementation burden to minimal users.
> > > > > >
> > > > > > It certainly does.
> > > > > >
> > > > > > > I think the mantra of the C protocol should be the following:
> > > > > > >
> > > > > > > * Users of the protocol have to write little to no code to use it. For
> > > > > > > example, populating an INT32 array should require only a few lines of
> > > > > > > code
> > > > > >
> > > > > > Agreed.  As a sidenote, the spec should have an example of doing this in
> > > > > > raw C.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.
> > > > >
> > >
