Re: [DISCUSS] Supporting Hadoop-1 and experimental features

Chris Drome Fri, 22 May 2015 00:50:19 -0700

I understand the motivation and benefits of creating a branch-2 where more 
disruptive work can go on without affecting branch-1. While not necessarily 
against this approach, from Yahoo's standpoint, I do have some questions 
(concerns).
Upgrading to a new version of Hive requires a significant commitment of time 
and resources to stabilize and certify a build for deployment to our clusters. 
Given the size of our clusters and scale of datasets, we have to be 
particularly careful about adopting new functionality. However, at the same 
time we are interested in testing and making available new features and 
functionality. That said, we would have to rely on branch-1 for the immediate 
future.
One concern is that branch-1 would be left to stagnate, at which point there 
would be no option but for users to move to branch-2 as branch-1 would be 
effectively end-of-lifed. I'm not sure how long this would take, but it would 
eventually happen as a direct result of the very reason for creating branch-2.
A related concern is how disruptive the code changes will be in branch-2. I 
imagine that changes made early in branch-2 will be easy to backport to branch-1, 
but this effort will become more difficult, if not impractical, as time goes on. 
If the code bases diverge too much then this could lead to more pressure for 
users of branch-1 to add features just to branch-1, which has been mentioned as 
undesirable. By the same token, backporting any code from branch-2 will require 
an increasing amount of effort, which contributors to branch-2 may not be 
interested in committing to.
These questions affect us directly because, while we require a certain amount 
of stability, we also like to pull in new functionality that will be of value 
to our users. For example, our current 0.13 release is probably closer to 0.14 
at this point. Given the lifespan of a release, it is often more palatable to 
backport features and bugfixes than to jump to a new version.


The good thing about this proposal is the opportunity to evaluate and clean up 
a lot of the old code.
Thanks,
chris
 


     On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin 
<ser...@hortonworks.com> wrote:
   

 Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some
people are set in their ways or have practical considerations and don’t
care for new shiny stuff.

On 15/5/18, 11:46, "Sergey Shelukhin" <ser...@hortonworks.com> wrote:

>I think we need some path for deprecating old Hadoop versions, the same
>way we deprecate old Java version support or old RDBMS version support.
>At some point the cost of supporting Hadoop 1 exceeds the benefit. Same
>goes for stuff like MR; supporting it, esp. for perf work, becomes a
>burden, and it’s outdated with 2 alternatives, one of which has been
>around for 2 releases.
>The branches are a graceful way to get rid of the legacy burden.
>
>Alternatively, when sweeping changes are made, we can do what HBase did
>(which is not pretty imho), where the 0.94 version had ~30 dot releases
>because people cannot upgrade to the 0.96 “singularity” release.
>
>
>I posit that people who run Hadoop 1 and MR in this day and age (and more
>so as time passes) are people who don’t care about perf and new
>features, only stability; so a stability-focused branch would be perfect to
>support them.
>
>
>On 15/5/18, 10:04, "Edward Capriolo" <edlinuxg...@gmail.com> wrote:
>
>>Up until recently Hive supported numerous versions of Hadoop code base
>>with
>>a simple shim layer. I would rather we stick to the shim layer. I think
>>easily the best part about Hive was that a single release worked
>>well regardless of your Hadoop version. It was also a key element of
>>Hive's success. I do not want to see us have multiple branches.
>>
>>On Sat, May 16, 2015 at 1:29 AM, Xuefu Zhang <xzh...@cloudera.com> wrote:
>>
>>> Thanks for the explanation, Alan!
>>>
>>> While I have understood more on the proposal, I actually see more
>>>problems
>>> than the confusion of two lines of releases. Essentially, this proposal
>>> forces a user to make a hard choice between a stabler, legacy-aware
>>>release
>>> line and an adventurous, pioneering release line. And once the choice
>>>is
>>> made, there is no easy way back or forward.
>>>
>>> Here is my interpretation. Let's say we have two main branches as
>>> proposed. I develop a new feature which I think is useful for both
>>>branches.
>>> So, I commit it to both branches. My feature requires additional schema
>>> support, so I provide upgrade scripts for both branches. The scripts
>>>are
>>> different because the two branches have already diverged in schema.
>>>
>>> Now the two branches evolve in a diverging fashion like this. This is
>>>all
>>> good as long as a user stays in his line. The moment the user considers
>>>a
>>> switch, most likely from branch-1 to branch-2, he is stuck. Why?
>>>Because
>>> there is no upgrade path from a release in branch-1 to a release in
>>> branch-2!
>>>
>>> If we want to provide an upgrade path, then there will be MxN paths,
>>>where
>>> M and N are the number of releases in the two branches, respectively.
>>>This
>>> is going to be next to a nightmare, not only for users, but also for
>>>us.
>>>
>>> Also, the proposal will require two sets of things that Hive provides:
>>> double documentation, double feature tracking, double build/test
>>> infrastructures, etc.
>>>
>>> This approach can also potentially cause the problem we saw in hadoop
>>> releases, where 0.23 release was greater than 1.0 release.
>>>
>>> To me, the problem we are trying to solve is deprecating old things
>>> such as hadoop-1, Hive CLI, etc. This is a valid problem to be solved. As I
>>> see it, however, we approached the problem in less favorable ways.
>>>
>>> First, it seemed we wanted to deprecate something just for the sake of
>>> deprecation, and it's not based on the rationale that supports the
>>>desire.
>>> Devs might write code that accidentally breaks the hadoop-1 build. However,
>>>this
>>> is more a build infrastructure problem rather than the burden of
>>>supporting
>>> hadoop-1. If our build could catch it at precommit test, then I would
>>>think
>>> the accident can be well avoided. Most of the time, fixing the build
>>>is
>>> trivial. And we have already addressed the build infrastructure
>>>problem.
>>>
>>> Secondly, if we do have a strong reason to deprecate something, we
>>>should
>>> have a deprecation plan rather than declaring on the spot that the
>>>current
>>> release is the last one supporting X. I think Microsoft did a better
>>>job in
>>> terms of product deprecation. For instance, they announced the end of
>>> Windows XP support long before the last day. In my opinion, we should have a
>>> similar vision, giving users and distributions enough time to adjust rather than
>>> shocking them with breaking news.
>>>
>>> In summary, I do see the need of deprecation in Hive, but I am afraid
>>>the
>>> way we take, including the proposal here, isn't going to nicely solve
>>>the
>>> problem. On the contrary, I foresee a spectrum of confusion,
>>>frustration,
>>> and burden for the user as well as for developers.
>>>
>>> Thanks,
>>> Xuefu
>>>
>>> On Fri, May 15, 2015 at 8:19 PM, Alan Gates <alanfga...@gmail.com>
>>>wrote:
>>>
>>>>
>>>>
>>>>  Xuefu Zhang <xzh...@cloudera.com>
>>>>  May 15, 2015 at 17:31
>>>>
>>>> Just make sure that I understand the proposal correctly: we are going
>>>>to
>>>> have two main branches, one for hadoop-1 and one for hadoop-2.
>>>>
>>>>  We shouldn't tie this to hadoop-1 and 2.  It's about Hive not Hadoop.
>>>> It will be some time before Hive's branch-2 is stable, while Hadoop-2
>>>>is
>>>> already well established.
>>>>
>>>>  New features
>>>> are only merged to branch-2. That essentially says we stop development
>>>>for
>>>> hadoop-1, right?
>>>>
>>>>  If developers want to keep contributing patches to branch-1 then
>>>> there's no need for it to stop.  We would want to avoid putting new
>>>> features only on branch-1, unless they only made sense in that
>>>>context.
>>>> But I assume we'll see people contributing to branch-1 for some time.
>>>>
>>>>  Are we also making two lines of releases: one for branch-1
>>>> and one for branch-2? Won't that be confusing and also burdensome if
>>>>we
>>>> release say 1.3, 2.0, 2.1, 1.4...
>>>>
>>>>  I'm asserting that it will be less confusing than the alternatives.
>>>>We
>>>> need some way to make early releases of many of the new features.  I
>>>> believe that this proposal is less confusing than if we start putting
>>>>the
>>>> new features in 1.x branches.  This is particularly true because it
>>>>would
>>>> help us to start being able to drop older functionality like Hadoop-1
>>>>and
>>>> MapReduce, which is very hard to do in the 1.x line without stranding
>>>>users.
>>>>
>>>>  Please note that we will have hadoop 3 soon. What's the story there?
>>>>
>>>>  As I said above, I don't see this as tied to Hadoop versions.
>>>>
>>>> Alan.
>>>>
>>>>  Thanks,
>>>> Xuefu
>>>>
>>>>
>>>>
>>>> On Fri, May 15, 2015 at 4:43 PM, Vaibhav Gumashta
>>>> <vgumas...@hortonworks.com> wrote:
>>>>
>>>>  +1 on the new branch. I think it’ll help in faster dev time for these
>>>> important changes.
>>>>
>>>>  —Vaibhav
>>>>
>>>>  From: Alan Gates <alanfga...@gmail.com>
>>>> Reply-To: "dev@hive.apache.org" <dev@hive.apache.org>
>>>> Date: Friday, May 15, 2015 at 4:11 PM
>>>> To: "dev@hive.apache.org" <dev@hive.apache.org>
>>>> Subject: Re: [DISCUSS] Supporting Hadoop-1 and experimental features
>>>>
>>>>  Anyone else have feedback on this?  If not I'll start a vote next
>>>>week.
>>>>
>>>> Alan.
>>>>
>>>>    Gopal Vijayaraghavan <gop...@apache.org>
>>>> May 14, 2015 at 10:44
>>>>  Hi,
>>>>
>>>> +1 on the idea.
>>>>
>>>> Having a stable release branch with ongoing fixes where we do not drop
>>>> major features would be good all around.
>>>>
>>>> It lets us accelerate the pace of development, drop major features or
>>>> rewrite them entirely without dragging everyone else kicking &
>>>>screaming
>>>> into that release.
>>>>
>>>> Cheers,
>>>> Gopal
>>>>
>>>>
>>>>
>>>>    Sergey Shelukhin <ser...@hortonworks.com>
>>>> May 11, 2015 at 19:17
>>>>  That sounds like a good idea.
>>>> Some features could be back ported to branch-1 if viable, but at least
>>>>new
>>>> stuff would not be burdened by Hadoop 1/MR code paths.
>>>> Probably also a good place to enable vectorization and other perf
>>>>features
>>>> by default while we make alpha releases.
>>>>
>>>> +1
>>>>
>>>>
>>>>    Alan Gates <alanfga...@gmail.com>
>>>> May 11, 2015 at 15:38
>>>>  There is a lot of forward-looking work going on in various branches
>>>>of
>>>> Hive:  LLAP, the HBase metastore, and the work to drop the CLI.  It
>>>>would
>>>> be good to have a way to release this code to users so that they can
>>>> experiment with it.  Releasing it will also provide feedback to
>>>>developers.
>>>>
>>>> At the same time there are discussions on whether to keep supporting
>>>> Hadoop-1.  The burden of supporting older, less used functionality
>>>>such as
>>>> Hadoop-1 is becoming ever heavier as many new features are added.
>>>>
>>>> I propose that the best way to deal with this would be to make a
>>>> branch-1.  We could continue to make new feature releases off of this
>>>> branch (1.3, 1.4, etc.).  This branch would not drop old
>>>>functionality.
>>>> This provides stability and continuity for users and developers.
>>>>
>>>> We could then merge these new feature branches (LLAP, HBase
>>>>metastore,
>>>> CLI drop) into the trunk, as well as turn on by default newer features
>>>>such
>>>> as vectorization and ACID.  We could also drop older, less used
>>>> features such as support for Hadoop-1 and MapReduce.  It will be a
>>>>while
>>>> before we are ready to make stable, production ready releases of this
>>>> code.  But we could start making alpha quality releases soon.  We
>>>>would
>>>> call these releases 2.x, to stress the non-backward compatible changes
>>>>such
>>>> as dropping Hadoop-1.  This will give users a chance to play with the
>>>>new
>>>> code and developers a chance to get feedback.
>>>>
>>>> Thoughts?
>>>>
>>>>
>>>>
>>>
>
