Semantic versioning is a great tool, and we should use it as far as it
goes, but not push it.

I suggest that the Arrow specification should have a paragraph that
states the level of maturity of each part of the API; and each
implementation should have a paragraph that states which parts of the
spec are implemented, and to what quality. A lot can be accomplished
in one paragraph in terms of setting people's expectations.

And since you mentioned the open-closed principle earlier, the
robustness principle [1] should apply: be liberal in what you accept,
conservative in what you do. An Arrow library should (ideally) not
fall over if it encounters a data structure that was experimental in a
previous version and has recently been removed.
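
To sketch what I mean (hypothetical names, not any existing Arrow
class): a reader that maps unknown type ids to an explicit
"unsupported" placeholder degrades gracefully instead of crashing.

    // Hypothetical sketch only -- tolerant field decoding.
    final class TolerantSchemaReader {
      static final int TYPE_INT32 = 1, TYPE_FLOAT64 = 2;  // made-up ids

      /** Unknown ids become placeholders the caller can skip or report. */
      static String describeField(String name, int typeId) {
        switch (typeId) {
          case TYPE_INT32:   return name + ": int32";
          case TYPE_FLOAT64: return name + ": float64";
          default:           return name + ": <unsupported type " + typeId + ">";
        }
      }
    }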

Julian

[1] https://en.wikipedia.org/wiki/Robustness_principle


On Wed, Jul 26, 2017 at 12:30 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> The combinatorics of code-level API stability are worrisome (with
> 5 different language APIs in the project already), while the maturity
> and development pace of different implementations may remain variable
> for some time.
>
> There are two possible things we can communicate with some form of
> major version number:
>
> * The Arrow specification (independent of any implementation) is
> complete, with more than one reference implementation demonstrating
> that it can be implemented
>
> * The code is complete and stable
>
> The latter seems undesirable, at least on a 6-month horizon. I don't
> think it should keep us from making a public statement that we've
> hardened the Arrow format itself. Perhaps we need two kinds of major
> versions for the project.
>
> The worry I have is that strict semantic versioning might prove
> onerous to immature implementations. As a concrete example, suppose
> that someone starts a Go implementation shortly after we've made a 1.0
> release with integration tests for all the well-specified Arrow types.
> After a couple of months, the Go developers need to make some breaking
> API changes. Does that mean we need to bump the whole project to 2.x?
> As more languages come into the fold, this could happen more and more
> often. How would people interpret a fast-escalating major version
> number?
>
> I am curious how Avro or Thrift have addressed this issue.
>
> - Wes
>
> On Wed, Jul 26, 2017 at 3:13 PM, Julian Hyde <jh...@apache.org> wrote:
>> I agree with all that. But semantic versioning only pertains to public APIs.
>> So, for it to work, you need to declare which APIs are public. If you
>> don’t, people will make assumptions about which APIs are public, and
>> they may get it wrong.
>>
>> The ability to add experimental APIs (not subject to semantic versioning 
>> until they are officially declared public) will help the project evolve and 
>> stay relevant.
>>
>> Julian
>>
>>
>>> On Jul 26, 2017, at 12:02 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> I see the semantic versioning like this:
>>>
>>> Major version: format and metadata stability
>>> Minor version: library API changes (APIs stay stable across fix versions)
>>> Fix version: bug fixes
>>>
>>> So an API might be deprecated from 1.0.0 to 1.1.0, but we could not
>>> make a breaking change to the memory format without increasing the
>>> major version. We also have the added protection of a version enum in
>>> the metadata:
>>>
>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
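>>>
>>> As a sketch (assuming the Flatbuffers-generated Java bindings for
>>> Message and MetadataVersion, and that V3 is the newest version this
>>> reader supports):
>>>
>>>     import java.nio.ByteBuffer;
>>>     import org.apache.arrow.flatbuf.Message;
>>>     import org.apache.arrow.flatbuf.MetadataVersion;
>>>
>>>     // Reject payloads written with a metadata version we don't know.
>>>     static Message readMessage(ByteBuffer buf) {
>>>       Message message = Message.getRootAsMessage(buf);
>>>       if (message.version() > MetadataVersion.V3) {
>>>         throw new UnsupportedOperationException(
>>>             "Unsupported metadata version: " + message.version());
>>>       }
>>>       return message;
>>>     }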
>>>
>>> On Wed, Jul 26, 2017 at 2:56 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>> Given the nature of the Arrow project, where any number of different
>>>> implementations will be in flux at any given time, claiming any sort
>>>> of API stability at the code level across the whole project seems
>>>> impossible any time soon.
>>>>
>>>> The important commitment of a 1.0 release is that the metadata and
>>>> memory format are not changing (without a change in the major version
>>>> number, i.e. Arrow 1.x.y to 2.x.y); so Arrow's "API" in a sense is the
>>>> memory format and serialized metadata representation. That is, the
>>>> files in
>>>>
>>>> https://github.com/apache/arrow/tree/master/format
>>>>
>>>> Having this kind of stability is really important so that if any
>>>> systems know how to parse or emit Arrow 1.x data, but aren't
>>>> necessarily using the libraries provided by the project, they can have
>>>> some assurance that we aren't going to break the Flatbuffers or the
>>>> arrangement of bytes in a record batch on the wire. If that makes
>>>> sense.
>>>>
>>>> - Wes
>>>>
>>>> On Wed, Jul 26, 2017 at 2:35 PM, Julian Hyde <jh...@apache.org> wrote:
>>>>> 1.0 is a Big Deal because, under semantic versioning, there is a 
>>>>> commitment to not change public APIs. If it weren’t for that, 1.0 would 
>>>>> have vague marketing connotations of robustness, adoption etc. but 
>>>>> otherwise be no different from another release.
>>>>>
>>>>> So, if API and data format lifecycle and compatibility are the goals here,
>>>>> would it be useful to introduce explicit flags on API maturity? Call out 
>>>>> which APIs are public, and therefore bound by the semantic versioning 
>>>>> contract. This will also give Arrow some room to add experimental 
>>>>> features after 1.0, and avoid calcification.
>>>>>
>>>>> Julian
>>>>>
>>>>>
>>>>>
>>>>>> On Jul 26, 2017, at 7:40 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>
>>>>>> I created https://issues.apache.org/jira/browse/ARROW-1277 about
>>>>>> integration testing the remaining data types. We are so close to having
>>>>>> everything tested and stable that we should push to complete these as
>>>>>> soon as possible (save for Map, which has only just been added to the
>>>>>> metadata).
>>>>>>
>>>>>> On Mon, Jul 24, 2017 at 5:35 PM, Wes McKinney <wesmck...@gmail.com> 
>>>>>> wrote:
>>>>>>> I agree those things would be nice to have. Hardening the memory
>>>>>>> format details probably would not take longer than a month or so if we
>>>>>>> were to focus on it.
>>>>>>>
>>>>>>> Formalizing REST / RPC or IPC seems like it will be more work, or will
>>>>>>> require a design period and then initial implementation. I think
>>>>>>> having the streaming format implementations is a good start, but the
>>>>>>> streams are a bit monolithic -- e.g. in REST you might want to request
>>>>>>> metadata only, or only record batches given a known schema. We should
>>>>>>> create a proposal document (Google Docs?) for the community to comment
>>>>>>> on, where we can iterate on requirements.
>>>>>>>
>>>>>>> Separately, I'm interested in embedding Arrow streams in other
>>>>>>> transport layers, like GRPC. The recent refactoring in C++ to make the
>>>>>>> streams less monolithic was intended to help with that.
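>>>>>>>
>>>>>>> As a rough sketch of the idea (the ArrowStreamWriter calls are
>>>>>>> approximate and commented out -- names may differ from the actual
>>>>>>> Java API): targeting a WritableByteChannel keeps the encoding
>>>>>>> independent of the transport.
>>>>>>>
>>>>>>>     import java.io.ByteArrayOutputStream;
>>>>>>>     import java.nio.channels.Channels;
>>>>>>>     import java.nio.channels.WritableByteChannel;
>>>>>>>
>>>>>>>     // Encode a stream into a buffer; the same bytes can ride on a
>>>>>>>     // GRPC message, an HTTP body, a socket, or shared memory.
>>>>>>>     ByteArrayOutputStream sink = new ByteArrayOutputStream();
>>>>>>>     WritableByteChannel channel = Channels.newChannel(sink);
>>>>>>>     // ArrowStreamWriter writer = new ArrowStreamWriter(root, channel);
>>>>>>>     // writer.writeBatch();
>>>>>>>     // writer.close();
>>>>>>>     byte[] payload = sink.toByteArray();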
>>>>>>>
>>>>>>> - Wes
>>>>>>>
>>>>>>> On Mon, Jul 24, 2017 at 4:01 PM, Jacques Nadeau <jacq...@apache.org> 
>>>>>>> wrote:
>>>>>>>> Top things on my list:
>>>>>>>>
>>>>>>>> - Formalize Arrow RPC and/or REST
>>>>>>>> - Some reference transformation algorithms
>>>>>>>> - Prototype IPC
>>>>>>>>
>>>>>>>> On Mon, Jul 24, 2017 at 9:47 AM, Wes McKinney <wesmck...@gmail.com> 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> hi folks,
>>>>>>>>>
>>>>>>>>> In recent discussions, since the Arrow memory format and metadata have
>>>>>>>>> become reasonably stable, and we're more likely to add new data
>>>>>>>>> types than change existing ones, we may consider making a 1.0.0
>>>>>>>>> release to declare to the rest of the open source world that "Arrow
>>>>>>>>> is open for business" and can be relied upon in production
>>>>>>>>> applications (with some reasonable tolerance for library API changes
>>>>>>>>> from major release to major release). I hope we can all agree that
>>>>>>>>> forward and backward compatibility in the zero-copy wire format and
>>>>>>>>> metadata is the most essential thing.
>>>>>>>>>
>>>>>>>>> To that end, I'd like to collect ideas for what needs to be
>>>>>>>>> accomplished in the project before we'd be comfortable making a 1.0.0
>>>>>>>>> release. I think it would be a good show of project stability /
>>>>>>>>> production-readiness to do this (with the caveat that APIs will
>>>>>>>>> continue to evolve).
>>>>>>>>>
>>>>>>>>> The main things on my end are hardening the memory format and
>>>>>>>>> integration tests for the remaining data types:
>>>>>>>>>
>>>>>>>>> - Decimals
>>>>>>>>>   - Lingering issues with 128-bit decimals
>>>>>>>>>   - Need integration tests
>>>>>>>>> - Fixed size list
>>>>>>>>>   - Java has implemented, but not C++. Need integration tests
>>>>>>>>> - Union
>>>>>>>>>   - Two kinds of unions, Java only implements one. Need integration 
>>>>>>>>> tests
>>>>>>>>>
>>>>>>>>> On these, Decimals have the most work since the memory format needs to
>>>>>>>>> be specified. On Unions, we may decide to not implement the dense
>>>>>>>>> variant and focus on integration testing the sparse variant. I don't
>>>>>>>>> think this is going to be too much work, but it needs to get sorted
>>>>>>>>> out so we don't have incomplete or under-tested parts of the
>>>>>>>>> specification.
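>>>>>>>>>
>>>>>>>>> For reference, a sketch of the two layouts (values illustrative,
>>>>>>>>> per the format documents) for a union of (Int32, Float64):
>>>>>>>>>
>>>>>>>>>     // Sparse union, length 4 -- every child spans the full length:
>>>>>>>>>     //   types:   [0, 1, 0, 1]      (int8 per slot, picks the child)
>>>>>>>>>     //   child 0: [5, *, 7, *]      (Int32; * = slot undefined)
>>>>>>>>>     //   child 1: [*, 1.2, *, 3.4]  (Float64)
>>>>>>>>>     //
>>>>>>>>>     // Dense union adds int32 offsets and children hold only their
>>>>>>>>>     // own values:
>>>>>>>>>     //   types:   [0, 1, 0, 1]
>>>>>>>>>     //   offsets: [0, 0, 1, 1]
>>>>>>>>>     //   child 0: [5, 7]
>>>>>>>>>     //   child 1: [1.2, 3.4]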
>>>>>>>>>
>>>>>>>>> There are some other things being discussed, like a Map logical type,
>>>>>>>>> but that (at least as currently proposed) won't require any disruptive
>>>>>>>>> modifications to the metadata.
>>>>>>>>>
>>>>>>>>> As far as the metadata and memory format, we would use the Open/Closed
>>>>>>>>> principle to guide our efforts
>>>>>>>>> (https://en.wikipedia.org/wiki/Open/closed_principle). For example, it
>>>>>>>>> would be possible to add compression or encoding at the field level
>>>>>>>>> without disrupting earlier versions of the software that lack these
>>>>>>>>> features.
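>>>>>>>>>
>>>>>>>>> At read time that might look like this (the compression accessor
>>>>>>>>> is hypothetical -- no such field exists in Schema.fbs today):
>>>>>>>>>
>>>>>>>>>     // Flatbuffers tables return the declared default for any field
>>>>>>>>>     // the writer never set, so a reader built after the feature is
>>>>>>>>>     // added still handles messages from older producers.
>>>>>>>>>     short codec = field.compression();  // 0 (= none) for old data
>>>>>>>>>     if (codec != 0) {
>>>>>>>>>       buffer = decompress(buffer, codec);  // hypothetical helper
>>>>>>>>>     }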
>>>>>>>>>
>>>>>>>>> In the event that we do need to change the metadata or memory format
>>>>>>>>> in the future (which would probably be an extreme circumstance), we
>>>>>>>>> have the option of increasing the MetadataVersion, which is one of
>>>>>>>>> the first tags accompanying Arrow messages
>>>>>>>>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22).
>>>>>>>>> So if you encounter a message that you do not support, you can raise
>>>>>>>>> an appropriate exception.
>>>>>>>>>
>>>>>>>>> There are some other things that would be nice to prototype or
>>>>>>>>> specify, like a REST protocol for exposing Arrow datasets in a
>>>>>>>>> client-server model (sending Arrow record batches via REST HTTP
>>>>>>>>> calls).
>>>>>>>>>
>>>>>>>>> Anything else that would need to happen before we move to a 1.x
>>>>>>>>> mainline for development? One idea would be that if we need to make
>>>>>>>>> any breaking changes, we would leap from 1.x to 2.0.0 and throw the
>>>>>>>>> 1.x branches into maintenance mode.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wes
>>>>>>>>>
>>>>>
>>