+1

On Wed, Mar 18, 2015 at 3:50 PM, Michael Kjellman <mkjell...@internalcircle.com> wrote:
> For most of my life I've lived on the software bleeding edge, both personally and professionally. Maybe it's a personal weakness, but I guess I get a thrill out of the problem solving aspect?
>
> Recently I came to a bit of an epiphany: the closer I keep to the daily build, the happier I generally am on a daily basis. Bugs happen, but for the most part (aside from show stopper bugs), pain points for myself in a given daily build can generally be debugged to 1 or maybe 2 root causes, fixed in ~24 hours, and then life is better the next day again. In comparison, the old waterfall model generally means taking an "official" release at some point and waiting for some poor soul (or developer) to actually run the thing. No matter how good the QA team is, until it's actually used in the real world, most bugs aren't found.
>
> If you and your organization can wait 24 hours * the number of bugs discovered after people actually start using the thing, you end up with a "usable build" around the holy-grail minor X.X.5 release of Cassandra.
>
> I love the idea of the LTS model Jonathan describes because it means more code can get real testing and "bake" for longer instead of sitting largely unused on some git repository in a datacenter far far away. A lot of code has changed between 2.0 and trunk today. The code has diverged to the point that if you write something for 2.0 (as the most stable major branch currently available), merging it forward to 3.0 or later generally means rewriting it. If the only thing that comes out of this is a smaller delta of LOC between the deployable version/branch, what we can develop against, and what QA is focused on, I think that's a massive win.
>
> Something like CASSANDRA-8099 will need 2x the baking time of even many of the riskier changes the project has made.
> While I wouldn't want to run a build with CASSANDRA-8099 in it anytime soon, there are now hundreds of other changes blocked behind it, many of which likely contain new bugs of their own, yet have no exposure at all to even the most involved C* developers.
>
> I really think this will be a huge win for the project, and I'm super thankful to Sylvain, Ariel, Jonathan, Aleksey, and Jake for guiding this change to a much more sustainable release model for the entire community.
>
> best,
> kjellman
>
> On Mar 18, 2015, at 3:02 PM, Ariel Weisberg <ariel.weisb...@datastax.com> wrote:
>
>> Hi,
>>
>> Keep in mind it is a bug fix release every month and a feature release every two months.
>>
>> For development that is really a two month cycle, with all bug fixes being backported one release. As a developer, if you want to get something into a release you have two months, and you should be sizing pieces of large tasks so they ship at least every two months.
>>
>> Ariel
>>
>> On Mar 18, 2015, at 5:58 PM, Terrance Shepherd <tscana...@gmail.com> wrote:
>>
>>> I like the idea, but I agree that every month is a bit aggressive. I have no say, but:
>>>
>>> I would say 4 releases a year instead of 12, with 2 months of new features and 1 month of bug squashing per release, and the 4th quarter just bugs.
>>>
>>> I would also propose 2 year LTS releases for the releases after the 4th quarter. So everyone could get a new feature release every quarter, plus the stability of super major versions for 2 years.
>>>
>>> On Wed, Mar 18, 2015 at 2:34 PM, Dave Brosius <dbros...@mebigfatguy.com> wrote:
>>>
>>>> It would seem the practical implication of this is that there would be significantly more development on branches, with potentially more significant delays in merging these branches.
>>>> This would imply to me that more Jenkins servers would need to be set up to handle auto-testing of more branches, since if feature work spends more time on external branches, it is then likely to be less tested (even if by accident), as fewer developers would be working on any given branch. Only when a feature was blessed to make it to the release-tracked branch would it become exposed to the majority of developers/testers, etc. doing normal running/playing/testing.
>>>>
>>>> This isn't to knock the idea in any way, just wanted to mention what I think the outcome would be.
>>>>
>>>> dave
>>>>
>>>> On Tue, Mar 17, 2015 at 5:06 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>
>>>>> Cassandra 2.1 was released in September, which means that if we were on track with our stated goal of six month releases, 3.0 would be done about now. Instead, we haven't even delivered a beta. The immediate cause this time is blocking for 8099 <https://issues.apache.org/jira/browse/CASSANDRA-8099>, but the reality is that nobody should really be surprised. Something always comes up -- we've averaged about nine months since 1.0, with 2.1 taking an entire year.
>>>>>
>>>>> We could make theory align with reality by acknowledging, "if nine months is our 'natural' release schedule, then so be it." But I think we can do better.
>>>>>
>>>>> Broadly speaking, we have two constituencies with Cassandra releases:
>>>>>
>>>>> First, we have the users who are building or porting an application on Cassandra. These users want the newest features to make their job easier. If 2.1.0 has a few bugs, it's not the end of the world.
>>>>> They have time to wait for 2.1.x to stabilize while they write their code. They would like to see us deliver on our six month schedule, or even faster.
>>>>>
>>>>> Second, we have the users who have an application in production. These users, or their bosses, want Cassandra to be as stable as possible. Assuming they deploy on a stable release like 2.0.12, they don't want to touch it. They would like to see us release *less* often. (Because that means they have to do fewer upgrades while remaining in our backwards compatibility window.)
>>>>>
>>>>> With our current "big release every X months" model, these users' needs are in tension.
>>>>>
>>>>> We discussed this six months ago, and ended up with this:
>>>>>
>>>>>> What if we tried a [four month] release cycle, BUT we would guarantee that you could do a rolling upgrade until we bump the supermajor version? So 2.0 could upgrade to 3.0 without having to go through 2.1. (But to go to 3.1 or 4.0 you would have to go through 3.0.)
>>>>>
>>>>> Crucially, I added
>>>>>
>>>>>> Whether this is reasonable depends on how fast we can stabilize releases. 2.1.0 will be a good test of this.
>>>>>
>>>>> Unfortunately, even after DataStax hired half a dozen full-time test engineers, 2.1.0 continued the proud tradition of being unready for production use, with "wait for .5 before upgrading" once again looking like a good guideline.
>>>>>
>>>>> I'm starting to think that the entire model of "write a bunch of new features all at once and then try to stabilize it for release" is broken.
>>>>> We've been trying that for years, and empirically speaking, the evidence is that it just doesn't work, either from a stability standpoint or even just shipping on time.
>>>>>
>>>>> A big reason that it takes us so long to stabilize new releases now is that, because our major release cycle is so long, it's super tempting to slip "just one" new feature into bugfix releases, and I'm as guilty of that as anyone.
>>>>>
>>>>> For similar reasons, it's difficult to do a meaningful freeze with big feature releases. A look at 3.0 shows why: we have 8099 coming, but we also have significant work done (but not finished) on 6230, 7970, 6696, and 6477, all of which are meaningful improvements that address demonstrated user pain. So if we keep doing what we've been doing, our choices are either to delay 3.0 further while we finish and stabilize these, or to wait nine months to a year for the next release. Either way, one of our constituencies gets disappointed.
>>>>>
>>>>> So, I'd like to try something different. I think we were on the right track with shorter releases and more compatibility. But I'd like to throw in a twist. Intel cuts down on risk with a "tick-tock" schedule for new architectures and process shrinks instead of trying to do both at once. We can do something similar here:
>>>>>
>>>>> One month releases. Period. If it's not done, it can wait.
>>>>> *Every other release only accepts bug fixes.*
>>>>>
>>>>> By itself, one-month releases are going to dramatically reduce the complexity of testing and debugging new releases -- and bugs that do slip past us will only affect a smaller percentage of users, avoiding the "big release has a bunch of bugs no one has seen before and pretty much everyone is hit by something" scenario. But by adding in the second rule, I think we have a real chance to make a quantum leap here: stable, production-ready releases every two months.
>>>>>
>>>>> So here is my proposal for 3.0:
>>>>>
>>>>> We're just about ready to start serious review of 8099. When that's done, we branch 3.0 and cut a beta and then release candidates. Whatever isn't done by then has to wait; unlike prior betas, we will only accept bug fixes into 3.0 after branching.
>>>>>
>>>>> One month after 3.0, we will ship 3.1 (with new features). At the same time, we will branch 3.2. New features in trunk will go into 3.3. The 3.2 branch will only get bug fixes. We will maintain backwards compatibility for all of 3.x; eventually (no less than a year out) we will pick a release to be 4.0, and drop deprecated features and old backwards compatibilities. Otherwise there will be nothing special about the 4.0 designation. (Note that with an "odd releases have new features, even releases only have bug fixes" policy, 4.0 will actually be *more* stable than 3.11.)
>>>>>
>>>>> Larger features can continue to be developed in separate branches, the way 8099 is being worked on today, and committed to trunk when ready.
>>>>> So this is not saying that we are limited only to features we can build in a single month.
>>>>>
>>>>> Some things will have to change with our dev process, for the better. In particular, with one month to commit new features, we don't have room for committing sloppy work and stabilizing it later. Trunk has to be stable at all times. I asked Ariel Weisberg to put together his thoughts separately on what worked for his team at VoltDB, and how we can apply that to Cassandra -- see his email from Friday <http://bit.ly/1MHaOKX>. (TLDR: Redefine "done" to include automated tests. Infrastructure to run tests against github branches before merging to trunk. A new test harness for long-running regression tests.)
>>>>>
>>>>> I'm optimistic that as we improve our process this way, our even releases will become increasingly stable. If so, we can skip sub-minor releases (3.2.x) entirely and focus on keeping the release train moving. In the meantime, we will continue delivering 2.1.x stability releases.
>>>>>
>>>>> This won't be an entirely smooth transition. In particular, you will have noticed that 3.1 will get more than a month's worth of new features while we stabilize 3.0, as the last of the old way of doing things, so some patience is in order as we try this out. By 3.4 and 3.6 later this year we should have a good idea whether this is working, and we can make adjustments as warranted.
>>>>>
>>>>> --
>>>>> Jonathan Ellis
>>>>> Project Chair, Apache Cassandra
>>>>> co-founder, http://www.datastax.com
>>>>> @spyced
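[Editorial note: the two rules proposed in this thread -- odd minor versions accept new features while even minors are bug-fix only, and rolling upgrades work within a supermajor line or to the next supermajor's .0 release -- can be sketched as a small helper. This is a hypothetical illustration of the proposal as written, not code from Cassandra itself.]

```python
def accepts_new_features(version: str) -> bool:
    """Tick-tock rule from the proposal: odd minors (3.1, 3.3, ...) take
    new features; even minors (3.0, 3.2, ...) take only bug fixes."""
    major, minor = (int(p) for p in version.split(".")[:2])
    return minor % 2 == 1

def rolling_upgrade_ok(src: str, dst: str) -> bool:
    """Compatibility guarantee from the proposal: within a supermajor you
    can jump to any later minor (2.0 -> 2.1 -> ...), and you can cross into
    the next supermajor only at its .0 release (2.0 -> 3.0, not 2.0 -> 3.1)."""
    smaj, smin = (int(p) for p in src.split(".")[:2])
    dmaj, dmin = (int(p) for p in dst.split(".")[:2])
    if dmaj == smaj:
        return dmin > smin                      # later minor, same line
    return dmaj == smaj + 1 and dmin == 0       # next supermajor, .0 only
```

Under these rules, 4.0 (even minor, bug-fix only) really would be more stable than 3.11 (odd minor, feature release), as the proposal notes.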