Semantic versioning is a great tool, and we should use it as far as it goes, but not push it.
I suggest that the Arrow specification should have a paragraph that
states the level of maturity of each part of the API, and each
implementation should have a paragraph that states which parts of the
spec are implemented, and to what quality. A lot can be accomplished in
one paragraph in terms of setting people's expectations.

And since you mentioned the open-closed principle earlier, the
robustness principle [1] should apply: be liberal in what you accept,
conservative in what you do. An Arrow library should (ideally) not fall
over if it encounters a data structure that was experimental in a
previous version and has recently been removed.
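
To make "not fall over" concrete, here is a minimal sketch of a
tolerant schema reader; the type tags, names, and skip-with-warning
policy are all illustrative, not Arrow's actual API:

    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    // Hypothetical wire-level field description; raw_tag is whatever
    // type code the producer wrote, which may be newer or older than us.
    struct FieldDesc {
      std::string name;
      int16_t raw_tag;
    };

    // The type tags this reader knows about (illustrative values).
    enum KnownTag : int16_t { kInt32 = 0, kFloat64 = 1, kUtf8 = 2 };

    // Liberal in what we accept: an unrecognized tag (say, a type that
    // was experimental in an older version and later removed) is skipped
    // with a warning instead of crashing or rejecting the whole batch.
    std::vector<FieldDesc> ReadSchema(const std::vector<FieldDesc>& wire) {
      std::vector<FieldDesc> accepted;
      for (const FieldDesc& field : wire) {
        if (field.raw_tag >= kInt32 && field.raw_tag <= kUtf8) {
          accepted.push_back(field);
        } else {
          std::cerr << "warning: skipping field '" << field.name
                    << "' with unknown type tag " << field.raw_tag << "\n";
        }
      }
      return accepted;
    }

    int main() {
      // Tag 99 plays the role of a removed experimental type.
      std::vector<FieldDesc> wire = {{"id", 0}, {"legacy", 99}, {"name", 2}};
      std::cout << ReadSchema(wire).size() << " of " << wire.size()
                << " fields accepted\n";
    }
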
Julian

[1] https://en.wikipedia.org/wiki/Robustness_principle

On Wed, Jul 26, 2017 at 12:30 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> The combinatorics of code-level API stability are worrisome (with
> already 5 different language APIs in the project) while the maturity
> and development pace of different implementations may remain variable
> for some time.
>
> There are two possible things we can communicate with some form of
> major version number:
>
> * The Arrow specification (independent of implementation) is complete,
> with more than one reference implementation showing that it can be
> implemented
>
> * The code is complete and stable
>
> The latter seems undesirable, at least on a 6-month horizon. I don't
> think it should keep us from making a public statement that we've
> hardened the Arrow format itself. Perhaps we need two kinds of major
> versions for the project.
>
> The worry I have is that strict semantic versioning might prove
> onerous to immature implementations. As a concrete example, suppose
> that someone starts a Go implementation shortly after we've made a 1.0
> release with integration tests for all the well-specified Arrow types.
> After a couple of months, the Go developers need to make some breaking
> API changes. Does that mean we need to bump the whole project to 2.x?
> As more languages come into the fold, this could happen more and more
> often. How would people interpret a fast-escalating major version
> number?
>
> I am curious how Avro or Thrift have addressed this issue.
>
> - Wes
>
> On Wed, Jul 26, 2017 at 3:13 PM, Julian Hyde <jh...@apache.org> wrote:
>> I agree with all that. But semantic versioning only pertains to public
>> APIs. So, for it to work, you need to declare what your public APIs
>> are. If you don’t, people will make assumptions about what they are,
>> and they may get it wrong.
>>
>> The ability to add experimental APIs (not subject to semantic
>> versioning until they are officially declared public) will help the
>> project evolve and stay relevant.
>>
>> Julian
>>
>>
>>> On Jul 26, 2017, at 12:02 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> I see the semantic versioning like this:
>>>
>>> Major version: Format and Metadata stability
>>> Minor version: API stability within fix versions
>>> Fix version: Bug fixes
>>>
>>> So an API might be deprecated from 1.0.0 to 1.1.0, but we could not
>>> make a breaking change to the memory format without increasing the
>>> major version. We also have the added protection of a version enum in
>>> the metadata:
>>>
>>> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22
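
As an illustration of gating on that version enum, here is a minimal
sketch; the enum values and names are stand-ins, not Arrow's actual
generated code:

    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    // Illustrative stand-in for the MetadataVersion enum in
    // format/Schema.fbs; the real generated code and values may differ.
    enum class MetadataVersion : int16_t { V1 = 0, V2 = 1 };

    // The highest metadata version this build of the library understands.
    constexpr int16_t kMaxSupportedVersion =
        static_cast<int16_t>(MetadataVersion::V2);

    // Called with the raw version tag read off an incoming message.
    void CheckVersion(int16_t wire_version) {
      if (wire_version > kMaxSupportedVersion) {
        throw std::runtime_error(
            "unsupported Arrow metadata version " +
            std::to_string(wire_version) + "; upgrade the library");
      }
      // Older versions fall through: new readers keep accepting old data.
    }

    int main() {
      CheckVersion(1);  // a version we understand: fine
      try {
        CheckVersion(7);  // a future version: reject cleanly, not crash
      } catch (const std::exception& e) {
        std::cerr << e.what() << "\n";
      }
    }
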
>>> On Wed, Jul 26, 2017 at 2:56 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>> Given the nature of the Arrow project, where any number of different
>>>> implementations will be in flux at any given time, claiming any sort
>>>> of API stability at the code level across the whole project seems
>>>> impossible any time soon.
>>>>
>>>> The important commitment of a 1.0 release is that the metadata and
>>>> memory format are not changing (without a change in the major version
>>>> number, i.e. Arrow 1.x.y to 2.x.y); so Arrow's "API" in a sense is the
>>>> memory format and serialized metadata representation. That is, the
>>>> files in
>>>>
>>>> https://github.com/apache/arrow/tree/master/format
>>>>
>>>> Having this kind of stability is really important so that if any
>>>> systems know how to parse or emit Arrow 1.x data, but aren't
>>>> necessarily using the libraries provided by the project, they can have
>>>> some assurance that we aren't going to break the Flatbuffers or the
>>>> arrangement of bytes in a record batch on the wire. If that makes
>>>> sense.
>>>>
>>>> - Wes
>>>>
>>>> On Wed, Jul 26, 2017 at 2:35 PM, Julian Hyde <jh...@apache.org> wrote:
>>>>> 1.0 is a Big Deal because, under semantic versioning, there is a
>>>>> commitment to not change public APIs. If it weren’t for that, 1.0
>>>>> would have vague marketing connotations of robustness, adoption,
>>>>> etc., but otherwise be no different from any other release.
>>>>>
>>>>> So, if API and data format lifecycle and compatibility are the goal
>>>>> here, would it be useful to introduce explicit flags on API maturity?
>>>>> Call out which APIs are public, and therefore bound by the semantic
>>>>> versioning contract. This will also give Arrow some room to add
>>>>> experimental features after 1.0, and avoid calcification.
>>>>>
>>>>> Julian
>>>>>
>>>>>
>>>>>> On Jul 26, 2017, at 7:40 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>
>>>>>> I created https://issues.apache.org/jira/browse/ARROW-1277 about
>>>>>> integration testing the remaining data types. We are so close to
>>>>>> having everything tested and stable that we should push to complete
>>>>>> these as soon as possible (save for Map, which has only just been
>>>>>> added to the metadata).
>>>>>>
>>>>>> On Mon, Jul 24, 2017 at 5:35 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>> I agree those things would be nice to have. Hardening the memory
>>>>>>> format details probably would not take longer than a month or so
>>>>>>> if we were to focus on it.
>>>>>>>
>>>>>>> Formalizing REST / RPC or IPC seems like it will be more work, or
>>>>>>> will require a design period and then an initial implementation. I
>>>>>>> think having the streaming format implementations is a good start,
>>>>>>> but the streams are a bit monolithic -- e.g. in REST you might want
>>>>>>> to request metadata only, or only record batches given a known
>>>>>>> schema. We should create a proposal document (Google Docs?) for the
>>>>>>> community to comment on, where we can iterate on requirements.
>>>>>>>
>>>>>>> Separately, I'm interested in embedding Arrow streams in other
>>>>>>> transport layers, like GRPC. The recent refactoring in C++ to make
>>>>>>> the streams less monolithic was intended to help with that.
>>>>>>>
>>>>>>> - Wes
>>>>>>>
>>>>>>> On Mon, Jul 24, 2017 at 4:01 PM, Jacques Nadeau <jacq...@apache.org> wrote:
>>>>>>>> Top things on my list:
>>>>>>>>
>>>>>>>> - Formalize Arrow RPC and/or REST
>>>>>>>> - Some reference transformation algorithms
>>>>>>>> - Prototype IPC
>>>>>>>>
>>>>>>>> On Mon, Jul 24, 2017 at 9:47 AM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> hi folks,
>>>>>>>>>
>>>>>>>>> In recent discussions, since the Arrow memory format and metadata
>>>>>>>>> have become reasonably stable, and we're more likely to add new
>>>>>>>>> data types than change existing ones, we may consider making a
>>>>>>>>> 1.0.0 release to declare to the rest of the open source world
>>>>>>>>> that "Arrow is open for business" and can be relied upon in
>>>>>>>>> production applications (with some reasonable tolerance for
>>>>>>>>> library API changes from major release to major release). I hope
>>>>>>>>> we can all agree that forward and backward compatibility in the
>>>>>>>>> zero-copy wire format and metadata is the most essential thing.
>>>>>>>>>
>>>>>>>>> To that end, I'd like to collect ideas for what needs to be
>>>>>>>>> accomplished in the project before we'd be comfortable making a
>>>>>>>>> 1.0.0 release. I think it would be a good show of project
>>>>>>>>> stability / production-readiness to do this (with the caveat that
>>>>>>>>> APIs will continue to evolve).
>>>>>>>>>
>>>>>>>>> The main things on my end are hardening the memory format and
>>>>>>>>> integration tests for the remaining data types:
>>>>>>>>>
>>>>>>>>> - Decimals
>>>>>>>>>   - Lingering issues with 128-bit decimals
>>>>>>>>>   - Need integration tests
>>>>>>>>> - Fixed size list
>>>>>>>>>   - Java has implemented, but not C++. Need integration tests
>>>>>>>>> - Union
>>>>>>>>>   - Two kinds of unions; Java only implements one. Need
>>>>>>>>>     integration tests
>>>>>>>>>
>>>>>>>>> Of these, Decimals need the most work, since the memory format
>>>>>>>>> needs to be specified. On Unions, we may decide not to implement
>>>>>>>>> the dense variant and focus on integration testing the sparse
>>>>>>>>> variant. I don't think this is going to be too much work, but it
>>>>>>>>> needs to get sorted out so we don't have incomplete or
>>>>>>>>> under-tested parts of the specification.
>>>>>>>>>
>>>>>>>>> There are some other things being discussed, like a Map logical
>>>>>>>>> type, but that (at least as currently proposed) won't require any
>>>>>>>>> disruptive modifications to the metadata.
>>>>>>>>>
>>>>>>>>> As far as the metadata and memory format go, we would use the
>>>>>>>>> Open/Closed principle to guide our efforts
>>>>>>>>> (https://en.wikipedia.org/wiki/Open/closed_principle). For
>>>>>>>>> example, it would be possible to add compression or encoding at
>>>>>>>>> the field level without disrupting earlier versions of the
>>>>>>>>> software that lack these features.
>>>>>>>>>
>>>>>>>>> In the event that we do need to change the metadata or memory
>>>>>>>>> format in the future (which would probably be an extreme
>>>>>>>>> circumstance), we have the option of increasing the
>>>>>>>>> MetadataVersion, which is one of the first tags accompanying
>>>>>>>>> Arrow messages
>>>>>>>>> (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22).
>>>>>>>>> So if you encounter a message that you do not support, you can
>>>>>>>>> raise an appropriate exception.
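
As an illustration of that open/closed idea, here is a minimal sketch
in which a hypothetical optional field-level compression attribute
stands in for whatever actually gets added; absence simply means the
old behavior:

    #include <iostream>
    #include <optional>
    #include <string>

    // Hypothetical field-level metadata. The "compression" member is a
    // later, optional addition; data written before the feature existed
    // simply omits it.
    struct FieldMeta {
      std::string name;
      std::optional<std::string> compression;
    };

    void Describe(const FieldMeta& field) {
      // A reader written before compression existed never looks at the
      // new member; a newer reader treats "absent" as the old behavior.
      std::cout << field.name << ": "
                << field.compression.value_or("uncompressed") << "\n";
    }

    int main() {
      Describe({"ints", std::nullopt});  // pre-feature metadata still works
      Describe({"text", "lz4"});         // hypothetical codec name
    }
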
>>>>>>>>> There are some other things that would be nice to prototype or
>>>>>>>>> specify, like a REST protocol for exposing Arrow datasets in a
>>>>>>>>> client-server model (sending Arrow record batches via REST HTTP
>>>>>>>>> calls).
>>>>>>>>>
>>>>>>>>> Anything else that would need to happen before we move to a 1.x
>>>>>>>>> mainline for development? One idea would be that if we need to
>>>>>>>>> make any breaking changes, we would leap from 1.x to 2.0.0 and
>>>>>>>>> throw the 1.x branches into maintenance mode.
>>>>>>>>>
>>>>>>>>> Thanks
>>>>>>>>> Wes