Re: [DISCUSS] C Data Interface, take 2

Jacques Nadeau Mon, 20 Jan 2020 11:00:59 -0800

I don't see this as an endogenous concern of the C++ project. I appreciate
your goal in saying so, but I think this has broader ramifications around
fragmentation of the project.


The core challenge that we're dealing with is that we introduced foundational
concepts in some implementations that go beyond the spec and then provided
useful features based on them (in this case, the offset concept). Ideally,
those concepts are first introduced at the specification level so there
aren't inconsistent viewpoints of what Arrow is (which I believe is what is
happening here). Having a cross-language specification for in-memory
processing is a new concept so it isn't surprising that we're going to
learn these things along the way.

Without this, we create a slippery slope of fragmentation between the
specifications and the implementations. I understand that the toothpaste is
out of the tube in this particular case. We can respond in two ways: stop
the slip or continue to slide down the slope. I'm inclined to stop the slip.

As I said on GitHub, I'm struggling with how much of this should be
solved in the project. I'm going to pause a bit on responding, both to
reflect further on this and to reduce the likelihood that this devolves into
a flame war (which is always a risk with complex issues such as these).



On Mon, Jan 20, 2020 at 9:59 AM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Jacques,
>
> Taking a step back from the discussion, the original problem statement
> was to enable third party projects to produce the data structure used
> by C++ Array classes in C without depending on the C++ code.
>
> That's the ArrayData class here:
>
> https://github.com/apache/arrow/blob/master/cpp/src/arrow/array.h#L232
>
> It is important for us to simplify the programming interface with the C++
> library, so I think that we should address this as an endogenous
> concern of the C++ project, namely providing a "C API for the C++
> project". The C API for the C++ library needs to mirror what's in the
> C++ project (i.e. the ArrayData data structure). We should not
> advertise this as being a part of the project specification.
>
> - Wes
>
> On Mon, Jan 20, 2020 at 11:51 AM Jacques Nadeau <jacq...@apache.org>
> wrote:
> >
> > As I noted on the pull request, I think fundamentally this work is at odds
> > with the Arrow specification and is being used to introduce a shadow
> > specification.
> >
> > I don't think our intentions about how people should use something really
> > influence how people will actually use or perceive it. They'll just find
> > supported Arrow code and expose things based on it and call it "Arrow
> > compatible". In other words, I don't think people in the outside world will
> > be able to perceive the distinction between "Arrow C++ compatible" and
> > "Arrow compatible".
> >
> > On Mon, Jan 20, 2020 at 9:28 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hi folks,
> > >
> > > I just made a comment in https://github.com/apache/arrow/pull/6026
> > > that I wanted to surface here on the mailing list.
> > >
> > > It seems that to reach consensus for a C interface that is intended to
> > > be broadly used by multiple programming languages, we may make some
> > > compromises that harm or outright undermine some of the use cases that
> > > motivated the creation of the C interface in the first place. That
> > > does not seem good. I wonder if it would be more productive to reduce
> > > the scope of the project to merely providing a C-header-based data
> > > interface to the C++ project only. That was the original problem
> > > statement, and it seems that attempting to make it useful beyond C++ has
> > > made it difficult to reach consensus.
> > >
> > > Thanks
> > > Wes
> > >
> > > > On Sat, Dec 21, 2019 at 4:38 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > >
> > > > > Thanks for addressing my comments. I'm actively reviewing the proposal.
> > > > > It is taking me more time than I would like given the time of the year,
> > > > > but I want to make sure that you know that I'm looking at it and hope to
> > > > > provide additional feedback beyond that which I've provided thus far on
> > > > > the PR. Will update soon.
> > > >
> > > > Thanks for your patience.
> > > >
> > > > > On Tue, Dec 17, 2019 at 11:16 AM Antoine Pitrou <solip...@pitrou.net> wrote:
> > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > Following Jacques's feedback, I drafted a new version of the C data
> > > > > interface spec.
> > > > >
> > > > > The spec PR is here:
> > > > > https://github.com/apache/arrow/pull/6040
> > > > > Direct link to the RST file:
> > > > >
> > > > >
> > >
> https://github.com/apache/arrow/blob/5d8669d371401f9db12326b079e13c0058ba972b/docs/source/format/CDataInterface.rst
> > > > >
> > > > > There is also a C++ implementation, together with a Python <-> R
> > > > > bridge demonstrating the functionality:
> > > > > https://github.com/apache/arrow/pull/6026
> > > > >
> > > > > The main change from the previous spec is that there are now two C
> > > > > structures; one for the type or schema information, one for the
> > > > > array or record batch data. This allows exchanging both kinds of
> > > > > > information independently (and so, potentially, to exchange schema
> > > > > > once and then multiple arrays or record batches).
> > > > >
> > > > > Comments and questions welcome.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > > >
> > >
>
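
To make the two-structure split described in the quoted proposal concrete, here is a minimal sketch in C. It is not the layout from the draft spec or from the C++ ArrayData class: the names SketchSchema, SketchArray and sketch_sum_int32, the "int32" type tag, and the exact fields are all assumptions made for illustration. It shows one type description being exchanged once and reused for several arrays, and it includes an offset field, since that is the concept under discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type description: exchanged once, reused for many arrays. */
struct SketchSchema {
  const char* format; /* hypothetical type tag, here "int32" */
  const char* name;   /* optional field name */
};

/* Hypothetical data description: one per array or record batch. */
struct SketchArray {
  int64_t length;     /* number of logical elements */
  int64_t offset;     /* logical start position within the buffer */
  const void* values; /* borrowed pointer to the value buffer */
};

/* Sum an int32 array. The consumer takes the type from the schema
   and the data (including the offset) from the array. */
static int64_t sketch_sum_int32(const struct SketchSchema* schema,
                                const struct SketchArray* array) {
  assert(strcmp(schema->format, "int32") == 0);
  const int32_t* v = (const int32_t*)array->values;
  int64_t total = 0;
  for (int64_t i = 0; i < array->length; ++i) {
    total += v[array->offset + i];
  }
  return total;
}
```

Under these assumptions, a producer would hand over one SketchSchema and then any number of SketchArray values that refer to it. Two arrays can also share a single buffer and differ only in offset, which is the kind of zero-copy slicing the offset concept enables.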
