Re: [DISCUSS] Improving Arrow columnar implementation guidelines for third parties

Wes McKinney Tue, 17 Sep 2019 18:01:10 -0700

On Tue, Sep 17, 2019 at 7:09 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> >
> > Let's take an example:
> >
> > * Dremio can execute SQL and uses Arrow as its native runtime format
> > * Apache Spark can execute SQL and offers UDF support with Arrow
> > format, i.e. using Arrow for IO
> >
> > Both of these projects can say that they "use Apache Arrow", but the
> > extent to which Arrow is a key ingredient may not be obvious to the
> > average onlooker. To have more "Arrow-native" systems seems like one
> > of the missions of the project.
> >
>
> I'm not following you here. Are you suggesting that these systems are
> Arrow-native or not Arrow-native? Or that one is and the other is not? What
> does Arrow-native mean to you?
>
> Do you think there are enough problems around this right now that we need to
> do something? It seems like you're concerned about people claiming they are
> using Arrow when they aren't quite. Right now, it seems like the community
> mostly benefits from people saying they are using Arrow. Have you seen
> situations where users/consumers were frustrated because something was
> Arrow but not really Arrow?


I think it's good that using Arrow in some way has become a mark of
quality for systems.

My argument is mostly about brand quality control. Early on in Apache
Arrow, some people who learned about the project asked me, essentially,
"what's the point of developing reference implementations if everyone
'just follows the specification'?" Even now, people have said similar
things to me in the context of our occasional difficulties scaling our build
and packaging, i.e. "why are you making your life so difficult
building all this systems software, if the specification is all you
really need to use Arrow?"

In an extreme case, Apache Arrow could be a single Markdown document
in a git repository describing the Arrow protocol and that's it.

As a project insider who's been overseeing the development of the
reference implementations, the prospect of a proliferation of
implementations lacking in integration tests with each other terrifies
me. This has already happened with the Parquet format in some ways.

One of the raisons d'être of the project is interoperability. I would
like for people to see "Arrow" and understand what they're getting, or
at least be advised about where a project falls short of
interoperability.

- Wes
