"Same goes for stuff like MR; supporting it, esp. for perf work, becomes a burden, and it’s outdated with 2 alternatives, one of which has been around for 2 releases."
I am not trying to pick on your words here, but I want to acknowledge something. "Been around for 2 releases" means less to people than you would think. Many users are locked in by when their distribution chooses to cut a release. As it turns out, there are two major distributions, and one of them does pretty much nothing to support Tez. Here is what "around for two releases" means for a CDH user: http://search-hadoop.com/m/8er9RFVSf2&subj=Re+Getting+Tez+working+against+cdh+5+3 After much hacking with a rather new CDH version, I was actually unable to get that alternative running. The other alternative, which I presume means hive-on-spark, probably has not shipped in many distributions either. I do not think either "alternative" has much real-world battlefield experience.

The reality is that a normal user has to test a series of processes before they can pull the trigger on an upgrade. For example, I used to work at an adtech company. Hive added a feature called "exchange partitions". This actually broke a number of our processes, because we use the word "exchange" all the time: it became a keyword, and many of our scripts broke (a minimal sketch of this kind of breakage follows this message). This is not a fault of Hive or of the feature; it is just a fact that no one wants to touch-test big lumbering ETL processes (even with lightning-fast sexy engines) five times a year.

I mentioned this before, but I want to repeat it. Hive was a "releasable trunk" for a long time, and it served users well. We never had 2-4 feature branches. One binary dropped on top of Hadoop 0.17, 0.20, 0.21, 0.20.203, and 2.0. If we get into a situation where all the "old users" "don't care about new features", we can easily land in a situation where our actual users are running the "old" Hadoop, unable to upgrade to the "hive with the new features" because it requires dependencies less than two months old that have not been ported to their distribution yet. As a user I am already starting to see this: the distributions lag behind Hive because a point upgrade is not compelling for the distributor.
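Here is a minimal, hypothetical sketch of the kind of breakage described above, assuming a HiveServer2 endpoint and the Hive JDBC driver on the classpath; the table name, column names, and URL are invented for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class ExchangeKeywordDemo {
        public static void main(String[] args) throws Exception {
            // Hypothetical HiveServer2 endpoint.
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://localhost:10000/default");
                 Statement stmt = conn.createStatement()) {

                // Before ALTER TABLE ... EXCHANGE PARTITION existed, this
                // parsed fine; once EXCHANGE became a reserved word, the
                // bare identifier was rejected by the parser:
                // stmt.executeQuery("SELECT exchange, clicks FROM ad_events");

                // The workaround: backtick-quote the identifier.
                stmt.executeQuery("SELECT `exchange`, clicks FROM ad_events");
            }
        }
    }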
On Fri, May 22, 2015 at 4:19 PM, Alan Gates <alanfga...@gmail.com> wrote:

> I agree with *All* features, with the exception that some features might be branch-1 specific (if it's a feature on something no longer supported in master, like hadoop-1). Without this we prevent new features for older technology, which doesn't strike me as reasonable.
>
> I see your point in saying the contributor may not understand where best to put the patch, and thus the committer decides. However, it would be very disappointing for a contributor who uses branch-1 to build a new feature only to have the committer put it only in master. So I would modify your modification to say "at the discretion of the contributor and Hive committers".
>
> Alan.
>
> kulkarni.swar...@gmail.com
> May 22, 2015 at 11:41
> +1 on the new proposal. Feedback below:
>
> > New features must be put into master. Whether to put them into branch-1 is at the discretion of the developer.
>
> How about we change this to "*All* features must be put into master. Whether to put them into branch-1 is at the discretion of the *committer*."? The reason, I think, is that going forward, for us to sustain a happy and healthy community, it is imperative to make it easy not only for users but also for developers and committers to contribute and commit patches. For me as a Hive contributor, it would be hard to determine which branch my code belongs in. Also, IMO (and I might be wrong), many committers have their own areas of expertise, and it is very hard for them to immediately determine which branch a patch should go to unless this is very well documented somewhere. Putting all code into master would be an easy approach to follow, and cherry-picking to other branches can then be done. So even if people forget to do that, we can always go back to master and port the patches out to these branches. So we have a master branch, a branch-1 for stable code, a branch-2 for experimental and "bleeding edge" code, and so on. Once branch-2 is stable, we deprecate branch-1, create branch-3, and move on.
>
> Another reason I say this is because, in my experience, a pretty significant amount of work in Hive is still bug fixes, and I think that is what the user cares most about (correctness above anything else). With this approach, it would be very obvious which branches to commit those to.
>
> --
> Swarnim
>
> Chris Drome <cdr...@yahoo-inc.com.INVALID>
> May 22, 2015 at 0:49
> I understand the motivation and benefits of creating a branch-2 where more disruptive work can go on without affecting branch-1. While not necessarily against this approach, from Yahoo's standpoint I do have some questions (concerns).
> Upgrading to a new version of Hive requires a significant commitment of time and resources to stabilize and certify a build for deployment to our clusters. Given the size of our clusters and the scale of our datasets, we have to be particularly careful about adopting new functionality. At the same time, we are interested in testing and making available new features and functionality. That said, we would have to rely on branch-1 for the immediate future.
> One concern is that branch-1 would be left to stagnate, at which point there would be no option but for users to move to branch-2, as branch-1 would be effectively end-of-lifed. I'm not sure how long this would take, but it would eventually happen as a direct result of the very reason for creating branch-2.
> A related concern is how disruptive the code changes will be in branch-2. I imagine that changes early in branch-2 will be easy to backport to branch-1, while this effort will become more difficult, if not impractical, as time goes on. If the code bases diverge too much, this could lead to more pressure from users of branch-1 to add features just to branch-1, which has been mentioned as undesirable. By the same token, backporting any code from branch-2 will require an increasing amount of effort, which contributors to branch-2 may not be interested in committing to.
> These questions affect us directly because, while we require a certain amount of stability, we also like to pull in new functionality that will be of value to our users. For example, our current 0.13 release is probably closer to 0.14 at this point. Given the lifespan of a release, it is often more palatable to backport features and bugfixes than to jump to a new version.
>
> The good thing about this proposal is the opportunity to evaluate and clean up a lot of the old code.
> Thanks,
> chris
>
> On Monday, May 18, 2015 11:48 AM, Sergey Shelukhin <ser...@hortonworks.com> wrote:
>
> Note: by “cannot” I mean “are unwilling to”; upgrade paths exist, but some people are set in their ways or have practical considerations and don’t care for new shiny stuff.
> Sergey Shelukhin <ser...@hortonworks.com>
> May 18, 2015 at 11:46
> I think we need some path for deprecating old Hadoop versions, the same way we deprecate old Java version support or old RDBMS version support. At some point the cost of supporting Hadoop 1 exceeds the benefit. The same goes for stuff like MR; supporting it, especially for perf work, becomes a burden, and it is outdated, with two alternatives, one of which has been around for two releases.
> The branches are a graceful way to get rid of the legacy burden.
>
> Alternatively, when sweeping changes are made, we can do what HBase did (which is not pretty, imho), where the 0.94 version had ~30 dot releases because people cannot upgrade to the 0.96 “singularity” release.
>
> I posit that people who run Hadoop 1 and MR in this day and age (and more so as time passes) are people who don’t care about perf and new features, only stability; so a stability-focused branch would be perfect to support them.
>
> Edward Capriolo <edlinuxg...@gmail.com>
> May 18, 2015 at 10:04
> Up until recently, Hive supported numerous versions of the Hadoop code base with a simple shim layer. I would rather we stick to the shim layer. I think this was easily the best part about Hive: a single release worked well regardless of your Hadoop version. It was also a key element of Hive's success. I do not want to see us have multiple branches.
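For readers unfamiliar with the shim approach mentioned above, here is a minimal, self-contained sketch of the pattern: one stable interface, one implementation per supported Hadoop line, selected at runtime from the Hadoop version string. The class and method names are illustrative, not Hive's actual HadoopShims API.

    // Sketch of a version shim: callers code against HadoopShim and never
    // touch version-specific details directly.
    public final class ShimLoader {

        /** Version-specific operations hidden behind a stable interface. */
        public interface HadoopShim {
            // Example of a detail that differs across Hadoop lines
            // (configuration key for the job-scheduling endpoint).
            String jobEndpointConfKey();
        }

        static final class Hadoop1Shim implements HadoopShim {
            public String jobEndpointConfKey() { return "mapred.job.tracker"; }
        }

        static final class Hadoop2Shim implements HadoopShim {
            public String jobEndpointConfKey() { return "yarn.resourcemanager.address"; }
        }

        /** Pick the shim matching the Hadoop version found at runtime. */
        public static HadoopShim load(String hadoopVersion) {
            return hadoopVersion.startsWith("1.") ? new Hadoop1Shim()
                                                  : new Hadoop2Shim();
        }

        public static void main(String[] args) {
            HadoopShim shim = load("2.6.0");
            System.out.println(shim.jobEndpointConfKey()); // yarn.resourcemanager.address
        }
    }

One binary can then run against whichever Hadoop is on the classpath, which is the property Edward credits for Hive's early portability.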