Re: [DISCUSS] Supporting Hadoop-1 and experimental features

Alan Gates Fri, 22 May 2015 11:12:07 -0700

Thanks for your feedback Chris. It sounds like there are a couple of reasonable concerns being voiced repeatedly:

1) Fragmentation, the two branches will drift too far apart.
2) Stagnation, branch-1 will effectively become a dead-end.


So I modify the proposal as follows to deal with those:

1) New features must be put into master. Whether to put them into branch-1 is at the discretion of the developer. The exception would be features that would not apply in master (e.g. say someone developed a way to double the speed of map reduce jobs Hive produces). For example, I might choose to put the materialized view work I'm doing in both branch-1 and master, but the HBase metastore work only in master. This should avoid fragmentation by keeping branch-1 a subset of master.

2) For the next 12 months we will port critical bug fixes (crashes, security issues, wrong results) to branch-1 as well as fixing them on master. We might choose to lengthen this time depending on how stable master is and how fast the uptake is. This avoids branch-1 being immediately abandoned by developers while users are still depending on it.


Alan.

Chris Drome <mailto:cdr...@yahoo-inc.com.INVALID>
May 22, 2015 at 0:49
I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint, I do have some questions (concerns).
Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and scale of datasets, we have to be particularly careful about adopting new functionality. However, at the same time we are interested in testing and making available new features and functionality. That said, we would have to rely on branch-1 for the immediate future.
One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2 as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the very reason for creating branch-2.
A related concern is how disruptive the code changes will be in branch-2. I imagine that changes early in branch-2 will be easy to backport to branch-1, while this effort will become more difficult, if not impractical, as time goes on. If the code bases diverge too much then this could lead to more pressure for users of branch-1 to add features just to branch-1, which has been mentioned as undesirable. By the same token, backporting any code in branch-2 will require an increasing amount of effort, which contributors to branch-2 may not be interested in committing to.
These questions affect us directly because, while we require a certain amount of stability, we also like to pull in new functionality that will be of value to our users. For example, our current 0.13 release is probably closer to 0.14 at this point. Given the lifespan of a release, it is often more palatable to backport features and bugfixes than to jump to a new version.
The good thing about this proposal is the opportunity to evaluate and clean up a lot of the old code.
Thanks,
chris

Sergey Shelukhin <mailto:ser...@hortonworks.com>
May 18, 2015 at 11:47
Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some
people are set in their ways or have practical considerations and don’t
care for new shiny stuff.


Sergey Shelukhin <mailto:ser...@hortonworks.com>
May 18, 2015 at 11:46
I think we need some path for deprecating old Hadoop versions, the same
way we deprecate old Java version support or old RDBMS version support.
At some point the cost of supporting Hadoop 1 exceeds the benefit. Same
goes for stuff like MR; supporting it, esp. for perf work, becomes a
burden, and it’s outdated with 2 alternatives, one of which has been
around for 2 releases.
The branches are a graceful way to get rid of the legacy burden.

Alternatively, when sweeping changes are made, we can do what HBase did
(which is not pretty imho), where the 0.94 version had ~30 dot releases
because people cannot upgrade to the 0.96 “singularity” release.


I posit that people who run Hadoop 1 and MR in this day and age (and more
so as time passes) are people who don’t care about perf and new
features, only stability; so a stability-focused branch would be perfect to
support them.



Edward Capriolo <mailto:edlinuxg...@gmail.com>
May 18, 2015 at 10:04
Up until recently Hive supported numerous versions of the Hadoop code base with
a simple shim layer. I would rather we stick to the shim layer. I think
easily the best part about Hive was that a single release worked
well regardless of your Hadoop version. It was also a key element of Hive's
success. I do not want to see us have multiple branches.


Xuefu Zhang <mailto:xzh...@cloudera.com>
May 15, 2015 at 22:29
Thanks for the explanation, Alan!
While I have understood more on the proposal, I actually see more problems than the confusion of two lines of releases. Essentially, this proposal forces a user to make a hard choice between a stabler, legacy-aware release line and an adventurous, pioneering release line. And once the choice is made, there is no easy way back or forward.
Here is my interpretation. Let's say we have two main branches as proposed. I develop a new feature which I think useful for both branches. So, I commit it to both branches. My feature requires additional schema support, so I provide upgrade scripts for both branches. The scripts are different because the two branches have already diverged in schema.
Now the two branches evolve in a diverging fashion like this. This is all good as long as a user stays in his line. The moment the user considers a switch, most likely from branch-1 to branch-2, he is stuck. Why? Because there is no upgrade path from a release in branch-1 to a release in branch-2!
If we want to provide an upgrade path, then there will be MxN paths, where M and N are the number of releases in the two branches, respectively. This is going to be next to a nightmare, not only for users, but also for us.
Also, the proposal will require two sets of things that Hive provides: double documentation, double feature tracking, double build/test infrastructures, etc.
This approach can also potentially cause the problem we saw in Hadoop releases, where the 0.23 release was greater than the 1.0 release.
To me, the problem we are trying to solve is deprecating old things such as hadoop-1, Hive CLI, etc. This is a valid problem to be solved. As I see, however, we approached the problem in less favorable ways.
First, it seemed we wanted to deprecate something just for the sake of deprecation, and it's not based on the rationale that supports the desire. A dev might write code that accidentally breaks the hadoop-1 build. However, this is more a build infrastructure problem rather than the burden of supporting hadoop-1. If our build could catch it at precommit test, then I would think the accident can be well avoided. Most of the time, fixing the build is trivial. And we have already addressed the build infrastructure problem.
Secondly, if we do have a strong reason to deprecate something, we should have a deprecation plan rather than declaring on the spot that the current release is the last one supporting X. I think Microsoft did a better job in terms of product deprecation. For instance, they announced long before the last day desupporting Windows XP. In my opinion, we should have a similar vision, giving users and distributions enough time to adjust rather than shocking them with breaking news.
In summary, I do see the need of deprecation in Hive, but I am afraid the way we take, including the proposal here, isn't going to nicely solve the problem. On the contrary, I foresee a spectrum of confusion, frustration, and burden for the user as well as for developers.
Thanks,
Xuefu
