[DISCUSS] The road from Arrow 0.5.0 to 1.0.0

Wes McKinney Mon, 24 Jul 2017 09:48:59 -0700

hi folks,

As has come up in recent discussions, now that the Arrow memory format
and metadata have become reasonably stable, and we're more likely to
add new data types than change existing ones, we may consider making a
1.0.0 release to declare to the rest of the open source world that
"Arrow is open for business" and can be relied upon in production
applications (with some reasonable tolerance for library API changes
from major release to major release). I hope we can all agree that
forward and backward compatibility in the zero-copy wire format and
metadata is the most essential thing.


To that end, I'd like to collect ideas for what needs to be
accomplished in the project before we'd be comfortable making a 1.0.0
release. I think it would be a good show of project stability /
production-readiness to do this (with the caveat that the APIs will
continue to evolve).

The main things on my end are hardening the memory format and
integration tests for the remaining data types:

- Decimals
  - Lingering issues with 128-bit decimals
  - Need integration tests
- Fixed size list
  - Java has implemented, but not C++. Need integration tests
- Union
  - Two kinds of unions, Java only implements one. Need integration tests

Of these, Decimals need the most work since the memory format needs to
be specified. On Unions, we may decide to not implement the dense
variant and focus on integration testing the sparse variant. I don't
think this is going to be too much work, but it needs to get sorted
out so we don't have incomplete or under-tested parts of the
specification.

There are some other things being discussed, like a Map logical type,
but that (at least as currently proposed) won't require any disruptive
modifications to the metadata.

As for the metadata and memory format, we would use the Open/Closed
principle to guide our efforts
(https://en.wikipedia.org/wiki/Open/closed_principle). For example, it
would be possible to add compression or encoding at the field level
without disrupting earlier versions of the software that lack these
features.
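
To make the "open for extension, closed for modification" idea
concrete, here is a rough sketch in plain Python. It is not Arrow code:
FieldMeta, parse_field and the "compression" attribute are hypothetical
names made up for illustration. It only shows the additive pattern,
where a new attribute is optional and its default matches how every
existing message already behaves, so metadata written before the
attribute existed still parses unchanged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FieldMeta:
    """Hypothetical stand-in for per-field metadata (not the Arrow schema)."""
    name: str
    type_name: str
    # Hypothetical later addition. None means "uncompressed", which is
    # how every message written before this attribute existed behaves.
    compression: Optional[str] = None


def parse_field(raw: dict) -> FieldMeta:
    """Build a FieldMeta, treating the newer attribute as optional."""
    return FieldMeta(
        name=raw["name"],
        type_name=raw["type"],
        compression=raw.get("compression"),  # absent in older metadata
    )


# Metadata written before the attribute was introduced still parses.
old_style = parse_field({"name": "price", "type": "float64"})
# Metadata that uses the new attribute parses with the same code path.
new_style = parse_field(
    {"name": "price", "type": "float64", "compression": "lz4"}
)
```

The existing attributes are never changed or reinterpreted; the only
change is an addition with a default.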

In the event that we do need to change the metadata or memory format
in the future (which would probably be an extreme circumstance), we
have the option of increasing the MetadataVersion, which is one of the
first tags accompanying Arrow messages
(https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22).
So if you encounter a message that you do not support, you can raise
an appropriate exception.
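
As a rough sketch of what that check could look like on the reader
side (plain Python, not the actual Arrow API): the enum values below
are illustrative only, the authoritative definition is the one in
format/Schema.fbs, and SUPPORTED_VERSIONS, UnsupportedMetadataVersion
and check_metadata_version are hypothetical names.

```python
from enum import IntEnum


class MetadataVersion(IntEnum):
    # Illustrative values only; see format/Schema.fbs for the real enum.
    V1 = 0
    V2 = 1
    V3 = 2


# Hypothetical: the versions this particular reader can decode.
SUPPORTED_VERSIONS = frozenset({MetadataVersion.V3})


class UnsupportedMetadataVersion(Exception):
    """Raised when a message carries a metadata version we cannot read."""


def check_metadata_version(raw_version: int) -> MetadataVersion:
    """Validate the version tag before touching the rest of the message."""
    try:
        version = MetadataVersion(raw_version)
    except ValueError:
        # A tag newer than anything this reader has heard of.
        raise UnsupportedMetadataVersion(
            f"unknown metadata version tag: {raw_version}"
        ) from None
    if version not in SUPPORTED_VERSIONS:
        raise UnsupportedMetadataVersion(
            f"metadata version {version.name} is not supported by this reader"
        )
    return version
```

A reader would call check_metadata_version() on the tag of each
incoming message and only go on to decode the body if it returns.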

There are some other things that would be nice to prototype or
specify, like a REST protocol for exposing Arrow datasets in a
client-server model (sending Arrow record batches via REST HTTP
calls).

Is there anything else that would need to happen before we move to a
1.x mainline for development? One idea: if we later need to make any
breaking changes, we would leap from 1.x to 2.0.0 and put the 1.x
branches into maintenance mode.

Thanks
Wes
