Thanks Tim for raising this, and thanks JB and Ismaël for all the great
points.

I agree that a one-size-fits-all solution will not work when it comes to
dependencies. Based on past examples, there are clearly many cases where we
should proceed with caution and upgrade dependencies with care.

That said, given that Beam respects semantic versioning, and most of our
dependencies do as well, I think we should be able to upgrade most minor
(and patch) versions of dependencies with relative ease. The current policy
is to automatically create JIRAs when we are more than three minor versions
behind. I'm not sure whether HBase respects semantic versioning; if it does
not, I think it should be treated as the exception, not the norm.
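
As a rough illustration (a sketch of the idea, not the actual tooling),
the trigger could be as simple as a minor-version comparison, assuming the
dependency follows MAJOR.MINOR.PATCH:

  // Hypothetical sketch: flag a dependency when its minor version lags
  // the latest release by more than three.
  def minorLag(String current, String latest) {
    def (curMajor, curMinor) = current.tokenize('.').take(2)*.toInteger()
    def (latMajor, latMinor) = latest.tokenize('.').take(2)*.toInteger()
    // A major-version gap is a separate discussion (see below).
    return (curMajor == latMajor) ? latMinor - curMinor : Integer.MAX_VALUE
  }
  assert minorLag('2.3.1', '2.7.0') > 3  // would trigger a JIRA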

When it comes to major version upgrades, though, we'll have to proceed
with caution. In addition to all the case-by-case reasoning Ismaël gave
above, there's also the real possibility of a major version upgrade
changing the Beam API (syntax or semantics) in a non-backwards-compatible
way and breaking the backwards compatibility guarantee offered by Beam.
The current dependency policy [1] tries to capture this in a separate
section and requires all PRs that upgrade dependencies to contain a
statement regarding backwards compatibility.

I agree that there may be many modifications we have to make to the
existing policies to bring Beam dependency upgrades in line with industry
standards. The current policies are there as a first version for us to try
out; we should definitely reevaluate and update them from time to time as
needed. I'm also extremely eager to hear what others in the community
think about this.

Thanks,
Cham

[1] https://beam.apache.org/contribute/dependencies/

On Tue, Aug 28, 2018 at 8:38 AM Ismaël Mejía <ieme...@gmail.com> wrote:

> I think we should refine the strategy on dependencies discussed
> recently. Sorry to come to this late (I did not follow the previous
> discussion closely), but the current approach is clearly not in line
> with the industry reality (at least not for IO connectors + Hadoop +
> Spark/Flink use).
>
> A really proactive approach to dependency updates is a good practice
> for the core dependencies we have, e.g. Guava, ByteBuddy, Avro,
> Protobuf, etc., and of course for cloud-based IOs, e.g. GCS,
> BigQuery, AWS S3, etc. However, when we talk about self-hosted data
> sources or processing systems this gets more complicated, and I think
> we should be more flexible and handle these case by case (and remove
> them from the auto-update email reminder).
>
> Some open source projects have at least three maintained versions:
> - LTS – maps to what most people have installed (or what the big
> data distributions use), e.g. HBase 1.1.x, Hadoop 2.6.x
> - Stable – the current recommended version, e.g. HBase 1.4.x, Hadoop 2.8.x
> - Next – the latest release, e.g. HBase 2.1.x, Hadoop 3.1.x
>
> Following the most recent versions can be good for staying close to the
> current development of other projects (and picking up fixes), but these
> versions are commonly not yet deployed by most users, and adopting an
> LTS-only or stable-only approach won't satisfy all cases either. To
> understand why this is complex, let's look at some historical issues:
>
> IO versioning
> * Elasticsearch. We delayed the move to version 6 until we heard of
> more active users needing it (more deployments). We support 2.x and
> 5.x (but 2.x recently went EOL). Support for 6.x is in progress.
> * SolrIO. The stable version is 7.x and the LTS is 6.x, yet we support
> only 5.x because most big data distributions still ship 5.x (however,
> 5.x has reached EOL).
> * KafkaIO uses version 1.x, but Kafka recently moved to 2.x, and
> most Kafka deployments use versions earlier than 1.x. This module
> uses a single version, with the kafka client as a provided
> dependency, and so far it works (but we don't have multi-version
> tests); see the sketch just below.
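>
> As an illustration only (this is not our actual build code), a
> provided-style dependency in the Gradle Groovy DSL could look like:
>
>   dependencies {
>     // Compile and test against a single pinned client version...
>     compileOnly "org.apache.kafka:kafka-clients:1.0.0"
>     testImplementation "org.apache.kafka:kafka-clients:1.0.0"
>     // ...while users supply their own kafka-clients jar at runtime.
>   }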
>
> Runners versioning
> * The move from Spark 1 to Spark 2 was decided after weighing the cost
> of maintaining support for multiple versions against the impact of a
> breaking change. This is a rare case, but one with consequences. This
> dependency is provided, but we don't actively test issues around
> version migration.
> * Flink moved to version 1.5, introducing an incompatibility in
> checkpointing (discussed recently, with no consensus yet on how to
> handle it).
>
> As you can see, it seems really hard to have a solution that fits all
> cases. Probably the only rule I can draw from this list is that we
> should upgrade versions for connectors whose supported version has been
> deprecated or has reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).
>
> For the case of the provided dependencies, I wonder if, as part of our
> tests, we should test against multiple versions (note that this is
> currently blocked by BEAM-4087).
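>
> As a hypothetical Gradle sketch (task names, configuration names, and
> versions here are illustrative, not our actual build code), we could
> create one test task per supported client version:
>
>   ['0.11.0.2', '1.0.0', '2.0.0'].each { v ->
>     def suffix = v.replace('.', '_')
>     def conf = configurations.create("kafkaClients$suffix")
>     dependencies.add(conf.name, "org.apache.kafka:kafka-clients:$v")
>     tasks.create(name: "kafkaVersionTest$suffix", type: Test) {
>       testClassesDirs = sourceSets.test.output.classesDirs
>       // The pinned client jar comes first so it shadows the default one.
>       classpath = conf + sourceSets.test.runtimeClasspath
>     }
>   }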
>
> Any other ideas or opinions on how we can handle this? What do other
> people in the community think? (Note that this may be related to the
> ongoing LTS discussion.)
>
>
> On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
> <timrobertson...@gmail.com> wrote:
> >
> > Hi folks,
> >
> > I'd like to revisit the discussion around our versioning policy,
> > specifically for the Hadoop ecosystem, and make sure we are aware of
> > the implications.
> >
> > As an example, our policy today would have us on HBase 2.1, and I
> > have reminders to address this.
> >
> > However, the versions of HBase currently in the major Hadoop distros
> > are:
> >
> >  - Cloudera 5 on HBase 1.2 (Cloudera 6 is on 2.1 but is only in beta)
> >  - Hortonworks HDP 3 on HBase 2.0 (only recently released, so we can
> >    assume it is not widely adopted)
> >  - AWS EMR on HBase 1.4
> >
> > On the versioning, I think we need a more nuanced approach to
> > ensure that we target real communities of existing and potential
> > users. Enterprise users need to stick to the supported versions in
> > the distributions to maintain support contracts with the vendors.
> >
> > Should our versioning policy have more room to consider things on a
> > case-by-case basis?
> >
> > For Hadoop, might we benefit from a strategy on which community of
> > users Beam is targeting?
> >
> > (OT: I'm collecting some thoughts on what we might consider to
> > target enterprise Hadoop users - Kerberos on all relevant IOs,
> > performance, leaking beyond encryption zones with temporary files,
> > etc.)
> >
> > Thanks,
> > Tim
>
