I think we should refine the dependency strategy discussed recently.
Sorry to come to this late (I did not follow the previous discussion
closely), but the current approach is clearly not in line with
industry reality (at least not for IO connectors + Hadoop +
Spark/Flink use).

A proactive approach to dependency updates is good practice for our
core dependencies, e.g. Guava, ByteBuddy, Avro, Protobuf, etc., and of
course for cloud-based IOs, e.g. GCS, BigQuery, AWS S3, etc. However,
when we talk about self-hosted data sources or processing systems,
this gets more complicated, and I think we should be more flexible and
handle these case by case (and remove them from the auto-update email
reminder).

Some open source projects maintain at least three versions:
- LTS – what most people have installed (or what the big data
distributions ship), e.g. HBase 1.1.x, Hadoop 2.6.x
- Stable – the currently recommended version, e.g. HBase 1.4.x, Hadoop 2.8.x
- Next – the latest release, e.g. HBase 2.1.x, Hadoop 3.1.x

Tracking the most recent versions keeps us close to other projects'
current development and to recent fixes, but these versions are
usually not the ones most users have deployed, and adopting an
LTS-only or stable-only approach won't satisfy all cases either. To
understand why this is complex, let's look at some historical issues:

IO versioning
* Elasticsearch: we delayed the move to version 6 until we heard from
more active users needing it (more deployments). We support 2.x and
5.x (but 2.x recently went EOL). Support for 6.x is in progress.
* SolrIO: the stable version is 7.x and the LTS is 6.x, yet we support
only 5.x because most big data distributions still ship 5.x (even
though 5.x is EOL).
* KafkaIO: we build against version 1.x, but Kafka recently moved to
2.x, while most Kafka deployments run versions earlier than 1.x. This
module builds against a single version, with the Kafka client as a
provided dependency (a sketch follows below), and so far this works
(but we don't have multi-version tests).
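
To make the provided-dependency pattern concrete, here is a minimal
sketch in Gradle's Kotlin DSL (the coordinates and version are
illustrative, not the actual KafkaIO build): the connector compiles
and tests against one client version but does not force that version
onto the user's runtime classpath.

    dependencies {
        // Compile against a single kafka-clients version, but keep it
        // off the runtime classpath: users bring the client that
        // matches their own cluster.
        compileOnly("org.apache.kafka:kafka-clients:1.0.0")
        // Tests still need a concrete client on the classpath.
        testImplementation("org.apache.kafka:kafka-clients:1.0.0")
    }

Users then declare their own kafka-clients dependency matching the
brokers they actually run against.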

Runners versioning
* The move from Spark 1 to Spark 2 was decided after weighing the cost
of maintaining support for multiple versions against the impact of a
breaking change. This is a rare case, but one with consequences. This
dependency is provided, but we don't actively test for
version-migration issues.
* Flink moved to version 1.5, introducing an incompatibility in
checkpointing (discussed recently, with no consensus yet on how to
handle it).

As you can see, it seems really hard to find a solution that fits all
cases. Probably the only rule I can draw from this list is that we
should upgrade versions for connectors whose supported version has
been deprecated or has reached EOL (e.g. Solr 5.x, Elasticsearch 2.x).

For provided dependencies, I wonder whether we should test against
multiple versions as part of our test suite (note that this is
currently blocked by BEAM-4087); a rough sketch of what that could
look like follows below.
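
As a purely hypothetical illustration (again in Gradle's Kotlin DSL;
the version list, task names, and wiring are all assumptions, not an
existing Beam setup), one could register one test task per supported
client version, each swapping the provided dependency on the test
runtime classpath:

    // Hypothetical: run the existing suite once per kafka-clients version.
    val kafkaVersions = listOf("0.10.2.2", "0.11.0.3", "1.0.0")
    kafkaVersions.forEach { version ->
        tasks.register<Test>("kafkaVersionTest_" + version.replace('.', '_')) {
            testClassesDirs = sourceSets["test"].output.classesDirs
            // Replace whatever kafka-clients jar the default test
            // classpath resolved with the version under test.
            classpath = sourceSets["test"].runtimeClasspath.filter {
                !it.name.startsWith("kafka-clients")
            } + configurations.detachedConfiguration(
                dependencies.create("org.apache.kafka:kafka-clients:$version")
            )
        }
    }

Running all of these in CI would at least catch client-side API and
compatibility breakage across the versions we claim to support.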

Any other ideas or opinions on how we can handle this? What do other
people in the community think? (Note that this may be related to the
ongoing LTS discussion.)


On Tue, Aug 28, 2018 at 10:44 AM Tim Robertson
<timrobertson...@gmail.com> wrote:
>
> Hi folks,
>
> I'd like to revisit the discussion around our versioning policy specifically 
> for the Hadoop ecosystem and make sure we are aware of the implications.
>
> As an example our policy today would have us on HBase 2.1 and I have 
> reminders to address this.
>
> However, currently the versions of HBase in the major hadoop distros are:
>
>  - Cloudera 5 on HBase 1.2 (Cloudera 6 is 2.1 but is only in beta)
>  - Hortonworks HDP3 on HBase 2.0 (only recently released, so we can assume it
> is not widely adopted)
>  - AWS EMR HBase on 1.4
>
> On the versioning I think we might need a more nuanced approach to ensure 
> that we target real communities of existing and potential users. Enterprise 
> users need to stick to the supported versions in the distributions to 
> maintain support contracts from the vendors.
>
> Should our versioning policy leave more room for case-by-case
> consideration?
>
> For Hadoop might we benefit from a strategy on which community of users Beam 
> is targeting?
>
> (OT: I'm collecting some thoughts on what we might consider to target 
> enterprise hadoop users - kerberos on all relevant IO, performance, leaking 
> beyond encryption zones with temporary files etc)
>
> Thanks,
> Tim
