+1

On Wed, Mar 18, 2015 at 3:50 PM, Michael Kjellman <mkjell...@internalcircle.com> wrote:
> For most of my life I've lived on the software bleeding edge, both personally and professionally. Maybe it's a personal weakness, but I guess I get a thrill out of the problem solving aspect?
>
> Recently I came to a bit of an epiphany: the closer I keep to the daily build, the happier I generally am on a daily basis. Bugs happen, but for the most part (aside from show stopper bugs), pain points for myself in a given daily build can generally be debugged to 1 or maybe 2 root causes, fixed in ~24 hours, and then life is better the next day again. In comparison, the old waterfall model generally means taking an "official" release at some point and waiting for some poor soul (or developer) to actually run the thing. No matter how good the QA team is, until it's actually used in the real world, most bugs aren't found.
>
> If you and your organization can wait 24 hours * the number of bugs discovered after people actually start using the thing, you end up with a "usable build" around the holy-grail minor X.X.5 release of Cassandra.
>
> I love the idea of the LTS model Jonathan describes because it means more code can get real testing and "bake" for longer instead of sitting largely unused on some git repository in a datacenter far far away. A lot of code has changed between 2.0 and trunk today. The code has diverged to the point that if you write something for 2.0 (as the most stable major branch currently available), merging it forward to 3.0 or later generally means rewriting it. If the only thing that comes out of this is a smaller delta of LOC between the deployable version/branch, what we can develop against, and what QA is focused on, I think that's a massive win.
>
> Something like CASSANDRA-8099 will need 2x the baking time of even many of the riskier changes the project has made.
> While I wouldn't want to run a build with CASSANDRA-8099 in it anytime soon, there are now hundreds of other changes blocked behind it, many of which likely contain new bugs of their own, yet have no exposure at all to even the most involved C* developers.
>
> I really think this will be a huge win for the project, and I'm super thankful to Sylvain, Ariel, Jonathan, Aleksey, and Jake for guiding this change to a much more sustainable release model for the entire community.
>
> best,
> kjellman
>
> On Mar 18, 2015, at 3:02 PM, Ariel Weisberg <ariel.weisb...@datastax.com> wrote:
>
>> Hi,
>>
>> Keep in mind it is a bug fix release every month and a feature release every two months.
>>
>> For development that is really a two month cycle, with all bug fixes being backported one release. As a developer, if you want to get something into a release you have two months, and you should be sizing pieces of large tasks so they ship at least every two months.
>>
>> Ariel
>>
>> On Mar 18, 2015, at 5:58 PM, Terrance Shepherd <tscana...@gmail.com> wrote:
>>
>>> I like the idea, but I agree that every month is a bit aggressive. I have no say, but:
>>>
>>> I would say 4 releases a year instead of 12, with 2 months of new features and 1 month of bug squashing per release, and the 4th quarter just bugs.
>>>
>>> I would also propose 2 year LTS releases for the releases after the 4th quarter. So everyone could get a new feature release every quarter, plus the stability of super major versions for 2 years.
>>>
>>> On Wed, Mar 18, 2015 at 2:34 PM, Dave Brosius <dbros...@mebigfatguy.com> wrote:
>>>
>>>> It would seem the practical implication of this is that there would be significantly more development on branches, with potentially more significant delays in merging these branches.
>>>> This would imply to me that more Jenkins servers would need to be set up to handle auto-testing of more branches, since if feature work spends more time on external branches, it is then likely to be less tested (even if by accident), as fewer developers would be working on any given branch. Only when a feature was blessed to make it to the release-tracked branch would it become exposed to the majority of developers/testers, etc. doing normal running/playing/testing.
>>>>
>>>> This isn't to knock the idea in any way, just wanted to mention what I think the outcome would be.
>>>>
>>>> dave
>>>>
>>>> On Tue, Mar 17, 2015 at 5:06 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>
>>>>> Cassandra 2.1 was released in September, which means that if we were on track with our stated goal of six month releases, 3.0 would be done about now. Instead, we haven't even delivered a beta. The immediate cause this time is blocking for 8099 <https://issues.apache.org/jira/browse/CASSANDRA-8099>, but the reality is that nobody should really be surprised. Something always comes up -- we've averaged about nine months since 1.0, with 2.1 taking an entire year.
>>>>>
>>>>> We could make theory align with reality by acknowledging, "if nine months is our 'natural' release schedule, then so be it." But I think we can do better.
>>>>>
>>>>> Broadly speaking, we have two constituencies with Cassandra releases:
>>>>>
>>>>> First, we have the users who are building or porting an application on Cassandra. These users want the newest features to make their job easier. If 2.1.0 has a few bugs, it's not the end of the world.
>>>>> They have time to wait for 2.1.x to stabilize while they write their code. They would like to see us deliver on our six month schedule, or even faster.
>>>>>
>>>>> Second, we have the users who have an application in production. These users, or their bosses, want Cassandra to be as stable as possible. Assuming they deploy on a stable release like 2.0.12, they don't want to touch it. They would like to see us release *less* often. (Because that means they have to do fewer upgrades while remaining in our backwards compatibility window.)
>>>>>
>>>>> With our current "big release every X months" model, these users' needs are in tension.
>>>>>
>>>>> We discussed this six months ago, and ended up with this:
>>>>>
>>>>>> What if we tried a [four month] release cycle, BUT we would guarantee that you could do a rolling upgrade until we bump the supermajor version? So 2.0 could upgrade to 3.0 without having to go through 2.1. (But to go to 3.1 or 4.0 you would have to go through 3.0.)
>>>>>
>>>>> Crucially, I added
>>>>>
>>>>>> Whether this is reasonable depends on how fast we can stabilize releases. 2.1.0 will be a good test of this.
>>>>>
>>>>> Unfortunately, even after DataStax hired half a dozen full-time test engineers, 2.1.0 continued the proud tradition of being unready for production use, with "wait for .5 before upgrading" once again looking like a good guideline.
>>>>>
>>>>> I'm starting to think that the entire model of "write a bunch of new features all at once and then try to stabilize it for release" is broken.
>>>>> We've been trying that for years, and empirically speaking, the evidence is that it just doesn't work, either from a stability standpoint or even just shipping on time.
>>>>>
>>>>> A big reason that it takes us so long to stabilize new releases now is that, because our major release cycle is so long, it's super tempting to slip "just one" new feature into bugfix releases, and I'm as guilty of that as anyone.
>>>>>
>>>>> For similar reasons, it's difficult to do a meaningful freeze with big feature releases. A look at 3.0 shows why: we have 8099 coming, but we also have significant work done (but not finished) on 6230, 7970, 6696, and 6477, all of which are meaningful improvements that address demonstrated user pain. So if we keep doing what we've been doing, our choices are either to delay 3.0 further while we finish and stabilize these, or to wait nine months to a year for the next release. Either way, one of our constituencies gets disappointed.
>>>>>
>>>>> So, I'd like to try something different. I think we were on the right track with shorter releases and more compatibility. But I'd like to throw in a twist. Intel cuts down on risk with a "tick-tock" schedule for new architectures and process shrinks instead of trying to do both at once. We can do something similar here:
>>>>>
>>>>> One month releases. Period. If it's not done, it can wait.
>>>>> *Every other release only accepts bug fixes.*
>>>>>
>>>>> By itself, one-month releases are going to dramatically reduce the complexity of testing and debugging new releases -- and bugs that do slip past us will only affect a smaller percentage of users, avoiding the "big release has a bunch of bugs no one has seen before and pretty much everyone is hit by something" scenario. But by adding in the second rule, I think we have a real chance to make a quantum leap here: stable, production-ready releases every two months.
>>>>>
>>>>> So here is my proposal for 3.0:
>>>>>
>>>>> We're just about ready to start serious review of 8099. When that's done, we branch 3.0 and cut a beta and then release candidates. Whatever isn't done by then has to wait; unlike prior betas, we will only accept bug fixes into 3.0 after branching.
>>>>>
>>>>> One month after 3.0, we will ship 3.1 (with new features). At the same time, we will branch 3.2. New features in trunk will go into 3.3. The 3.2 branch will only get bug fixes. We will maintain backwards compatibility for all of 3.x; eventually (no less than a year out) we will pick a release to be 4.0, and drop deprecated features and old backwards compatibilities. Otherwise there will be nothing special about the 4.0 designation. (Note that with an "odd releases have new features, even releases only have bug fixes" policy, 4.0 will actually be *more* stable than 3.11.)
>>>>>
>>>>> Larger features can continue to be developed in separate branches, the way 8099 is being worked on today, and committed to trunk when ready.
>>>>> So this is not saying that we are limited only to features we can build in a single month.
>>>>>
>>>>> Some things will have to change with our dev process, for the better. In particular, with one month to commit new features, we don't have room for committing sloppy work and stabilizing it later. Trunk has to be stable at all times. I asked Ariel Weisberg to put together his thoughts separately on what worked for his team at VoltDB, and how we can apply that to Cassandra -- see his email from Friday <http://bit.ly/1MHaOKX>. (TLDR: Redefine "done" to include automated tests. Infrastructure to run tests against github branches before merging to trunk. A new test harness for long-running regression tests.)
>>>>>
>>>>> I'm optimistic that as we improve our process this way, our even releases will become increasingly stable. If so, we can skip sub-minor releases (3.2.x) entirely and focus on keeping the release train moving. In the meantime, we will continue delivering 2.1.x stability releases.
>>>>>
>>>>> This won't be an entirely smooth transition. In particular, you will have noticed that 3.1 will get more than a month's worth of new features while we stabilize 3.0, as the last of the old way of doing things, so some patience is in order as we try this out. By 3.4 and 3.6 later this year we should have a good idea whether this is working, and we can make adjustments as warranted.
>>>>>
>>>>> --
>>>>> Jonathan Ellis
>>>>> Project Chair, Apache Cassandra
>>>>> co-founder, http://www.datastax.com
>>>>> @spyced
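[Editorial note: the two rules proposed in this thread -- odd minor versions accept new features while even minors are bug-fix only, and rolling upgrades work within a supermajor line or to the next supermajor's .0 release -- can be sketched as a small helper. This is a hypothetical illustration of the proposal as written, not code from Cassandra itself.]

```python
def accepts_new_features(version: str) -> bool:
    """Tick-tock rule from the proposal: odd minors (3.1, 3.3, ...) take
    new features; even minors (3.0, 3.2, ...) take only bug fixes."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return minor % 2 == 1

def rolling_upgrade_ok(src: str, dst: str) -> bool:
    """Compatibility guarantee from the proposal: within a supermajor you
    can jump to any later minor (2.0 -> 2.1 -> ...), and you can cross into
    the next supermajor only at its .0 release (2.0 -> 3.0, not 2.0 -> 3.1)."""
    smaj, smin = (int(p) for p in src.split(".")[:2])
    dmaj, dmin = (int(p) for p in dst.split(".")[:2])
    if dmaj == smaj:
        return dmin > smin                      # later minor, same line
    return dmaj == smaj + 1 and dmin == 0       # next supermajor, .0 only
```

Under these rules, 4.0 (even minor, bug-fix only) really would be more stable than 3.11 (odd minor, feature release), as the proposal notes.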