Re: [DISCUSS] Supporting Hadoop-1 and experimental features

kulkarni.swar...@gmail.com Fri, 22 May 2015 11:44:36 -0700

+1 on the new proposal. Feedback below:

> New features must be put into master.  Whether to put them into branch-1
> is at the discretion of the developer.


How about we change this to "*All* features must be put into master.
Whether to put them into branch-1 is at the discretion of the *committer*."
My reasoning is that, for us to sustain a happy and healthy community going
forward, it's imperative to make things easy not only for users, but also
for developers and committers who contribute/commit patches. As a Hive
contributor, I would find it hard to determine which branch my code belongs
in. Also, IMO (and I might be wrong), many committers have their own areas
of expertise, and it would be very hard for them to immediately determine
which branch a patch should go to unless that is very well documented
somewhere. Putting all code into master would be an easy approach to
follow, and cherry-picking to other branches can then be done. Even if
people forget to do that, we can always go back to master and port the
patches out to these branches. So we have a master branch, a branch-1 for
stable code, a branch-2 for experimental and "bleeding edge" code, and so
on. Once branch-2 is stable, we deprecate branch-1, create branch-3, and
move on.

Another reason I say this is that, in my experience, a pretty significant
amount of the work in Hive is still bug fixes, and I think that is what
users care most about (correctness above anything else). With this approach,
it would be very obvious which branches to commit these to.

On Fri, May 22, 2015 at 1:11 PM, Alan Gates <alanfga...@gmail.com> wrote:

> Thanks for your feedback Chris.  It sounds like there are a couple of
> reasonable concerns being voiced repeatedly:
> 1) Fragmentation, the two branches will drift too far apart.
> 2) Stagnation, branch-1 will effectively become a dead-end.
>
> So I modify the proposal as follows to deal with those:
>
> 1) New features must be put into master.  Whether to put them into
> branch-1 is at the discretion of the developer.  The exception would be
> features that would not apply in master (e.g. say someone developed a way
> to double the speed of map reduce jobs Hive produces).  For example, I
> might choose to put the materialized view work I'm doing in both branch-1
> and master, but the HBase metastore work only in master.  This should avoid
> fragmentation by keeping branch-1 a subset of master.
>
> 2) For the next 12 months we will port critical bug fixes (crashes,
> security issues, wrong results) to branch-1 as well as fixing them on
> master.  We might choose to lengthen this time depending on how stable
> master is and how fast the uptake is.  This avoids branch-1 being
> immediately abandoned by developers while users are still depending on it.
>
> Alan.
>
>   Chris Drome <cdr...@yahoo-inc.com.INVALID>
>  May 22, 2015 at 0:49
> I understand the motivation and benefits of creating a branch-2 where more
> disruptive work can go on without affecting branch-1. While not necessarily
> against this approach, from Yahoo's standpoint, I do have some questions
> (concerns).
> Upgrading to a new version of Hive requires a significant commitment of
> time and resources to stabilize and certify a build for deployment to our
> clusters. Given the size of our clusters and scale of datasets, we have to
> be particularly careful about adopting new functionality. However, at the
> same time we are interested in testing and making available new
> features and functionality. That said, we would have to rely on branch-1
> for the immediate future.
> One concern is that branch-1 would be left to stagnate, at which point
> there would be no option but for users to move to branch-2 as branch-1
> would be effectively end-of-lifed. I'm not sure how long this would take,
> but it would eventually happen as a direct result of the very reason for
> creating branch-2.
> A related concern is how disruptive the code changes will be in branch-2.
> I imagine that changes early in branch-2 will be easy to backport to
> branch-1, while this effort will become more difficult, if not impractical,
> as time goes on. If the code bases diverge too much then this could lead to
> more pressure for users of branch-1 to add features just to branch-1, which
> has been mentioned as undesirable. By the same token, backporting any code
> in branch-2 will require an increasing amount of effort, which contributors
> to branch-2 may not be interested in committing to.
> These questions affect us directly because, while we require a certain
> amount of stability, we also like to pull in new functionality that will be
> of value to our users. For example, our current 0.13 release is probably
> closer to 0.14 at this point. Given the lifespan of a release, it is often
> more palatable to backport features and bugfixes than to jump to a new
> version.
>
> The good thing about this proposal is the opportunity to evaluate and
> clean up a lot of the old code.
> Thanks,
> chris
>
>
>
> On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin
> <ser...@hortonworks.com> wrote:
>
>
> Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some
> people are set in their ways or have practical considerations and don’t
> care for new shiny stuff.
>
>
>   Sergey Shelukhin <ser...@hortonworks.com>
>  May 18, 2015 at 11:46
> I think we need some path for deprecating old Hadoop versions, the same
> way we deprecate old Java version support or old RDBMS version support.
> At some point the cost of supporting Hadoop 1 exceeds the benefit. Same
> goes for stuff like MR; supporting it, esp. for perf work, becomes a
> burden, and it’s outdated with 2 alternatives, one of which has been
> around for 2 releases.
> The branches are a graceful way to get rid of the legacy burden.
>
> Alternatively, when sweeping changes are made, we can do what HBase did
> (which is not pretty imho), where the 0.94 version had ~30 dot releases
> because people cannot upgrade to the 0.96 “singularity” release.
>
>
> I posit that people who run Hadoop 1 and MR in this day and age (and more
> so as time passes) are people who don’t care about perf and new
> features, only stability; so a stability-focused branch would be perfect to
> support them.
>
>
>
>   Edward Capriolo <edlinuxg...@gmail.com>
>  May 18, 2015 at 10:04
> Up until recently Hive supported numerous versions of the Hadoop code base
> with a simple shim layer. I would rather we stick to the shim layer. I
> think easily the best part about Hive was that a single release worked
> well regardless of your Hadoop version. It was also a key element of Hive's
> success. I do not want to see us have multiple branches.
>
>
>   Xuefu Zhang <xzh...@cloudera.com>
>  May 15, 2015 at 22:29
> Thanks for the explanation, Alan!
>
> While I now understand the proposal better, I actually see more problems
> than the confusion of two lines of releases. Essentially, this proposal
> forces a user to make a hard choice between a more stable, legacy-aware
> release line and an adventurous, pioneering release line. And once the
> choice is made, there is no easy way back or forward.
>
> Here is my interpretation. Let's say we have two main branches as
> proposed. I develop a new feature which I think useful for both branches.
> So, I commit it to both branches. My feature requires additional schema
> support, so I provide upgrade scripts for both branches. The scripts are
> different because the two branches have already diverged in schema.
>
> Now the two branches evolve in a diverging fashion like this. This is all
> good as long as a user stays in his line. The moment the user considers a
> switch, most likely from branch-1 to branch-2, he is stuck. Why? Because
> there is no upgrade path from a release in branch-1 to a release in
> branch-2!
>
> If we want to provide an upgrade path, then there will be MxN paths, where
> M and N are the number of releases in the two branches, respectively. This
> is going to be close to a nightmare, not only for users, but also for us.
>
> Also, the proposal will require two sets of things that Hive provides:
> double documentation, double feature tracking, double build/test
> infrastructures, etc.
>
> This approach can also potentially cause the problem we saw in Hadoop
> releases, where the 0.23 release was greater than the 1.0 release.
>
> To me, the problem we are trying to solve is deprecating old things such
> as hadoop-1, Hive CLI, etc. This is a valid problem to be solved. As I see
> it, however, we approached the problem in less favorable ways.
>
> First, it seemed we wanted to deprecate something just for the sake of
> deprecation, without a rationale that supports the desire. A dev might
> write code that accidentally breaks the hadoop-1 build. However, this
> is more a build infrastructure problem than a burden of supporting
> hadoop-1. If our build could catch it at precommit test, then I would think
> the accident can be well avoided. Most of the time, fixing the build is
> trivial. And we have already addressed the build infrastructure problem.
>
> Secondly, if we do have a strong reason to deprecate something, we should
> have a deprecation plan rather than declaring on the spot that the current
> release is the last one supporting X. I think Microsoft did a better job in
> terms of product deprecation. For instance, they announced the end of
> Windows XP support long before its last day. In my opinion, we should have
> a similar vision, giving users and distributions enough time to adjust
> rather than shocking them with breaking news.
>
> In summary, I do see the need for deprecation in Hive, but I am afraid the
> approach we are taking, including the proposal here, isn't going to solve
> the problem nicely. On the contrary, I foresee a spectrum of confusion,
> frustration, and burden for users as well as for developers.
>
> Thanks,
> Xuefu
>
>
>


-- 
Swarnim
