I took notes on the infrastructure doc and wrote down some action
items for my team:

https://gist.github.com/EnigmaCurry/d53eccb55f5d0986c976


--

Ryan McGuire

Software Engineering Manager in Test | r...@datastax.com


On Thu, Mar 19, 2015 at 1:08 PM, Ariel Weisberg <ariel.weisb...@datastax.com> wrote:

> Hi,
>
> I realized one of the documents we didn't send out was the infrastructure
> side changes I am looking for. This one is maybe a little rougher as it was
> the first one I wrote on the subject.
>
>
> https://docs.google.com/document/d/1Seku0vPwChbnH3uYYxon0UO-b6LDtSqluZiH--sWWi0/edit?usp=sharing
>
> The goal is to have infrastructure that gives developers as close to
> immediate feedback as possible on their code before they merge. Feedback
> that is delayed until after merging to trunk should come within a day or
> two, and there is a product owner (Michael Shuler) responsible for making
> sure that issues are addressed quickly.
>
> QA is going to help by providing developers with better tools for writing
> higher level functional tests that explore all of the functions together
> along with the configuration space without developers having to do any work
> other than plugging in functionality to exercise and then validate
> something specific. This kind of harness is hard to get right and make
> reliable and expressive so they have their work cut out for them.
>
> It's going to be an iterative process where the tests improve as new work
> introduces missing coverage and as bugs/regressions drive the introduction
> of new tests. The monthly retrospective (planning on doing that first of
> the month) is also going to help us refine the testing and development
> process.
>
> Ariel
>
> On Thu, Mar 19, 2015 at 7:23 AM, Jason Brown <jasedbr...@gmail.com> wrote:
>
> > +1 to this general proposal. I think the time has finally come for us to
> > try something new, and this sounds legit. Thanks!
> >
> > On Thu, Mar 19, 2015 at 12:49 AM, Phil Yang <ud1...@gmail.com> wrote:
> >
> > > Can I regard the odd version as the "development preview" and the even
> > > version as the "production ready"?
> > >
> > > > IMO, for a database infrastructure project, stability is more
> > > > important than for other kinds of projects. LTS is a good idea, but
> > > > if we don't support non-LTS releases for long enough to fix their
> > > > bugs, users on a non-LTS release may have to upgrade to a new major
> > > > release to get the fixes, and may then have to handle new bugs
> > > > introduced by the new features. I'm afraid that eventually people
> > > > would only consider the LTS release.
> > >
> > >
> > > 2015-03-19 8:48 GMT+08:00 Pavel Yaskevich <pove...@gmail.com>:
> > >
> > > > +1
> > > >
> > > > On Wed, Mar 18, 2015 at 3:50 PM, Michael Kjellman <
> > > > mkjell...@internalcircle.com> wrote:
> > > >
> > > > > > For most of my life I’ve lived on the software bleeding edge both
> > > > > > personally and professionally. Maybe it’s a personal weakness, but I
> > > > > > guess I get a thrill out of the problem solving aspect?
> > > > >
> > > > > > Recently I came to a bit of an epiphany: the closer I keep to the
> > > > > > daily build, generally the happier I am on a daily basis. Bugs
> > > > > > happen, but for the most part (aside from show stopper bugs), pain
> > > > > > points for myself in a given daily build can generally be debugged
> > > > > > to 1 or maybe 2 root causes, fixed in ~24 hours, and then life is
> > > > > > better again the next day. In comparison, the old waterfall model
> > > > > > generally means taking an “official” release at some point and
> > > > > > waiting for some poor soul (or developer) to actually run the
> > > > > > thing. No matter how good the QA team is, until it’s actually used
> > > > > > in the real world, most bugs aren’t found.
> > > > >
> > > > > > If you and your organization can wait 24 hours * number of bugs
> > > > > > discovered after people actually started using the thing, you end
> > > > > > up with a “usable build” around the holy-grail minor X.X.5 release
> > > > > > of Cassandra.
> > > > >
> > > > > > I love the idea of the LTS model Jonathan describes because it
> > > > > > means more code can get real testing and “bake” for longer instead
> > > > > > of sitting largely unused on some git repository in a datacenter
> > > > > > far far away. A lot of code has changed between 2.0 and trunk
> > > > > > today. The code has diverged to the point that if you write
> > > > > > something for 2.0 (as the most stable major branch currently
> > > > > > available), merging it forward to 3.0 or after generally means
> > > > > > rewriting it. If the only thing that comes out of this is a smaller
> > > > > > delta of LOC between the deployable version/branch and what we can
> > > > > > develop against and what QA is focused on, I think that’s a massive
> > > > > > win.
> > > > >
> > > > > > Something like CASSANDRA-8099 will need 2x the baking time of even
> > > > > > many of the more risky changes the project has made. While I
> > > > > > wouldn’t want to run a build with CASSANDRA-8099 in it anytime
> > > > > > soon, there are now hundreds of other changes blocked, most likely
> > > > > > many containing new bugs of their own, that have no exposure at all
> > > > > > to even the most involved C* developers.
> > > > >
> > > > > > I really think this will be a huge win for the project and I’m
> > > > > > super thankful to Sylvain, Ariel, Jonathan, Aleksey, and Jake for
> > > > > > guiding this change to a much more sustainable release model for
> > > > > > the entire community.
> > > > >
> > > > > best,
> > > > > kjellman
> > > > >
> > > > >
> > > > > > > On Mar 18, 2015, at 3:02 PM, Ariel Weisberg <ariel.weisb...@datastax.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > > Keep in mind it is a bug fix release every month and a feature
> > > > > > > release every two months.
> > > > > > >
> > > > > > > For development that is really a two month cycle with all bug
> > > > > > > fixes being backported one release. As a developer if you want to
> > > > > > > get something in a release you have two months and you should be
> > > > > > > sizing pieces of large tasks so they ship at least every two
> > > > > > > months.
> > > > > >
> > > > > > Ariel
> > > > > > >> On Mar 18, 2015, at 5:58 PM, Terrance Shepherd <tscana...@gmail.com> wrote:
> > > > > >>
> > > > > > >> I like the idea but I agree that every month is a bit aggressive. I
> > > > > > >> have no say but:
> > > > > > >>
> > > > > > >> I would say 4 releases a year instead of 12, with 2 months of new
> > > > > > >> features and 1 month of bug squashing per release, and the 4th
> > > > > > >> quarter just bugs.
> > > > > > >>
> > > > > > >> I would also propose 2 year LTS releases for the releases after the
> > > > > > >> 4th quarter. So everyone could get a new feature release every
> > > > > > >> quarter and the stability of super major versions for 2 years.
> > > > > >>
> > > > > > >> On Wed, Mar 18, 2015 at 2:34 PM, Dave Brosius <dbros...@mebigfatguy.com> wrote:
> > > > > >>
> > > > > > >>> It would seem the practical implication of this is that there
> > > > > > >>> would be significantly more development on branches, with
> > > > > > >>> potentially more significant delays on merging these branches.
> > > > > > >>> This would imply to me that more Jenkins servers would need to be
> > > > > > >>> set up to handle auto-testing of more branches, because if feature
> > > > > > >>> work spends more time on external branches, it is then likely to
> > > > > > >>> be less tested (even if by accident) as fewer developers would be
> > > > > > >>> working on that branch. Only when a feature was blessed to make it
> > > > > > >>> to the release-tracked branch would it become exposed to the
> > > > > > >>> majority of developers/testers, etc. doing normal
> > > > > > >>> running/playing/testing.
> > > > > > >>>
> > > > > > >>> This isn't to knock the idea in any way, just wanted to mention
> > > > > > >>> what I think the outcome would be.
> > > > > >>>
> > > > > >>> dave
> > > > > >>>
> > > > > >>>
> > > > > >>>
> > > > > >>>>
> > > > > > >>>>>> On Tue, Mar 17, 2015 at 5:06 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> > > > > > >>>>>>> Cassandra 2.1 was released in September, which means that if we
> > > > > > >>>>>>> were on track with our stated goal of six month releases, 3.0
> > > > > > >>>>>>> would be done about now.  Instead, we haven't even delivered a
> > > > > > >>>>>>> beta.  The immediate cause this time is blocking for 8099
> > > > > > >>>>>>> <https://issues.apache.org/jira/browse/CASSANDRA-8099>, but the
> > > > > > >>>>>>> reality is that nobody should really be surprised.  Something
> > > > > > >>>>>>> always comes up -- we've averaged about nine months since 1.0,
> > > > > > >>>>>>> with 2.1 taking an entire year.
> > > > > >>>>>>>
> > > > > > >>>>>>> We could make theory align with reality by acknowledging, "if
> > > > > > >>>>>>> nine months is our 'natural' release schedule, then so be it."
> > > > > > >>>>>>> But I think we can do better.
> > > > > >>>>>>>
> > > > > > >>>>>>> Broadly speaking, we have two constituencies with Cassandra
> > > > > > >>>>>>> releases:
> > > > > > >>>>>>>
> > > > > > >>>>>>> First, we have the users who are building or porting an
> > > > > > >>>>>>> application on Cassandra.  These users want the newest features
> > > > > > >>>>>>> to make their job easier.  If 2.1.0 has a few bugs, it's not the
> > > > > > >>>>>>> end of the world.  They have time to wait for 2.1.x to stabilize
> > > > > > >>>>>>> while they write their code.  They would like to see us deliver
> > > > > > >>>>>>> on our six month schedule or even faster.
> > > > > >>>>>>>
> > > > > > >>>>>>> Second, we have the users who have an application in
> > > > > > >>>>>>> production.  These users, or their bosses, want Cassandra to be
> > > > > > >>>>>>> as stable as possible.  Assuming they deploy on a stable release
> > > > > > >>>>>>> like 2.0.12, they don't want to touch it.  They would like to
> > > > > > >>>>>>> see us release *less* often.  (Because that means they have to
> > > > > > >>>>>>> do fewer upgrades while remaining in our backwards compatibility
> > > > > > >>>>>>> window.)
> > > > > >>>>>>>
> > > > > > >>>>>>> With our current "big release every X months" model, these
> > > > > > >>>>>>> users' needs are in tension.
> > > > > >>>>>>>
> > > > > >>>>>>> We discussed this six months ago, and ended up with this:
> > > > > >>>>>>>
> > > > > > >>>>>>>> What if we tried a [four month] release cycle, BUT we would
> > > > > > >>>>>>>> guarantee that you could do a rolling upgrade until we bump the
> > > > > > >>>>>>>> supermajor version? So 2.0 could upgrade to 3.0 without having
> > > > > > >>>>>>>> to go through 2.1. (But to go to 3.1 or 4.0 you would have to
> > > > > > >>>>>>>> go through 3.0.)
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> Crucially, I added
> > > > > >>>>>>>
> > > > > > >>>>>>>> Whether this is reasonable depends on how fast we can stabilize
> > > > > > >>>>>>>> releases.  2.1.0 will be a good test of this.
> > > > > >>>>>>>>
> > > > > >>>>>>>
> > > > > > >>>>>>> Unfortunately, even after DataStax hired half a dozen full-time
> > > > > > >>>>>>> test engineers, 2.1.0 continued the proud tradition of being
> > > > > > >>>>>>> unready for production use, with "wait for .5 before upgrading"
> > > > > > >>>>>>> once again looking like a good guideline.
> > > > > >>>>>>>
> > > > > > >>>>>>> I’m starting to think that the entire model of “write a bunch of
> > > > > > >>>>>>> new features all at once and then try to stabilize it for
> > > > > > >>>>>>> release” is broken.  We’ve been trying that for years and
> > > > > > >>>>>>> empirically speaking the evidence is that it just doesn’t work,
> > > > > > >>>>>>> either from a stability standpoint or even just shipping on
> > > > > > >>>>>>> time.
> > > > > >>>>>>>
> > > > > > >>>>>>> A big reason that it takes us so long to stabilize new releases
> > > > > > >>>>>>> now is that, because our major release cycle is so long, it’s
> > > > > > >>>>>>> super tempting to slip in “just one” new feature into bugfix
> > > > > > >>>>>>> releases, and I’m as guilty of that as anyone.
> > > > > >>>>>>>
> > > > > > >>>>>>> For similar reasons, it’s difficult to do a meaningful freeze
> > > > > > >>>>>>> with big feature releases.  A look at 3.0 shows why: we have 8099
> > > > > > >>>>>>> coming, but we also have significant work done (but not
> > > > > > >>>>>>> finished) on 6230, 7970, 6696, and 6477, all of which are
> > > > > > >>>>>>> meaningful improvements that address demonstrated user pain.  So
> > > > > > >>>>>>> if we keep doing what we’ve been doing, our choices are to
> > > > > > >>>>>>> either delay 3.0 further while we finish and stabilize these, or
> > > > > > >>>>>>> we wait nine months to a year for the next release.  Either way,
> > > > > > >>>>>>> one of our constituencies gets disappointed.
> > > > > >>>>>>>
> > > > > > >>>>>>> So, I’d like to try something different.  I think we were on the
> > > > > > >>>>>>> right track with shorter releases with more compatibility.  But
> > > > > > >>>>>>> I’d like to throw in a twist.  Intel cuts down on risk with a
> > > > > > >>>>>>> “tick-tock” schedule for new architectures and process shrinks
> > > > > > >>>>>>> instead of trying to do both at once.  We can do something
> > > > > > >>>>>>> similar here:
> > > > > >>>>>>>
> > > > > > >>>>>>> One month releases.  Period.  If it’s not done, it can wait.
> > > > > > >>>>>>> *Every other release only accepts bug fixes.*
> > > > > > >>>>>>>
> > > > > > >>>>>>> By itself, one-month releases are going to dramatically reduce
> > > > > > >>>>>>> the complexity of testing and debugging new releases -- and bugs
> > > > > > >>>>>>> that do slip past us will only affect a smaller percentage of
> > > > > > >>>>>>> users, avoiding the “big release has a bunch of bugs no one has
> > > > > > >>>>>>> seen before and pretty much everyone is hit by something”
> > > > > > >>>>>>> scenario.  But by adding in the second rule, I think we have a
> > > > > > >>>>>>> real chance to make a quantum leap here: stable,
> > > > > > >>>>>>> production-ready releases every two months.
> > > > > >>>>>>>
> > > > > >>>>>>> So here is my proposal for 3.0:
> > > > > >>>>>>>
> > > > > > >>>>>>> We’re just about ready to start serious review of 8099.  When
> > > > > > >>>>>>> that’s done, we branch 3.0 and cut a beta and then release
> > > > > > >>>>>>> candidates.  Whatever isn’t done by then, has to wait; unlike
> > > > > > >>>>>>> prior betas, we will only accept bug fixes into 3.0 after
> > > > > > >>>>>>> branching.
> > > > > >>>>>>>
> > > > > > >>>>>>> One month after 3.0, we will ship 3.1 (with new features).  At
> > > > > > >>>>>>> the same time, we will branch 3.2.  New features in trunk will
> > > > > > >>>>>>> go into 3.3.  The 3.2 branch will only get bug fixes.  We will
> > > > > > >>>>>>> maintain backwards compatibility for all of 3.x; eventually (no
> > > > > > >>>>>>> less than a year) we will pick a release to be 4.0, and drop
> > > > > > >>>>>>> deprecated features and old backwards compatibilities.
> > > > > > >>>>>>> Otherwise there will be nothing special about the 4.0
> > > > > > >>>>>>> designation.  (Note that with an “odd releases have new
> > > > > > >>>>>>> features, even releases only have bug fixes” policy, 4.0 will
> > > > > > >>>>>>> actually be *more* stable than 3.11.)
> > > > > >>>>>>>
> > > > > > >>>>>>> Larger features can continue to be developed in separate
> > > > > > >>>>>>> branches, the way 8099 is being worked on today, and committed
> > > > > > >>>>>>> to trunk when ready.  So this is not saying that we are limited
> > > > > > >>>>>>> only to features we can build in a single month.
> > > > > >>>>>>>
> > > > > > >>>>>>> Some things will have to change with our dev process, for the
> > > > > > >>>>>>> better.  In particular, with one month to commit new features,
> > > > > > >>>>>>> we don’t have room for committing sloppy work and stabilizing it
> > > > > > >>>>>>> later.  Trunk has to be stable at all times.  I asked Ariel
> > > > > > >>>>>>> Weisberg to put together his thoughts separately on what worked
> > > > > > >>>>>>> for his team at VoltDB, and how we can apply that to Cassandra
> > > > > > >>>>>>> -- see his email from Friday <http://bit.ly/1MHaOKX>.  (TLDR:
> > > > > > >>>>>>> Redefine “done” to include automated tests.  Infrastructure to
> > > > > > >>>>>>> run tests against github branches before merging to trunk.  A
> > > > > > >>>>>>> new test harness for long-running regression tests.)
> > > > > >>>>>>>
> > > > > > >>>>>>> I’m optimistic that as we improve our process this way, our even
> > > > > > >>>>>>> releases will become increasingly stable.  If so, we can skip
> > > > > > >>>>>>> sub-minor releases (3.2.x) entirely, and focus on keeping the
> > > > > > >>>>>>> release train moving.  In the meantime, we will continue
> > > > > > >>>>>>> delivering 2.1.x stability releases.
> > > > > >>>>>>>
> > > > > > >>>>>>> This won’t be an entirely smooth transition.  In particular, you
> > > > > > >>>>>>> will have noticed that 3.1 will get more than a month’s worth of
> > > > > > >>>>>>> new features while we stabilize 3.0 as the last of the old way
> > > > > > >>>>>>> of doing things, so some patience is in order as we try this
> > > > > > >>>>>>> out.  By 3.4 and 3.6 later this year we should have a good idea
> > > > > > >>>>>>> if this is working, and we can make adjustments as warranted.
> > > > > >>>>>>>
> > > > > >>>>>>> --
> > > > > >>>>>>> Jonathan Ellis
> > > > > >>>>>>> Project Chair, Apache Cassandra
> > > > > >>>>>>> co-founder, http://www.datastax.com
> > > > > >>>>>>> @spyced
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >
> > > > >
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Thanks,
> > > Phil Yang
> > >
> >
>
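
For anyone skimming the thread, here is a rough sketch of the numbering
rule from Jonathan's proposal quoted above: one release a month after 3.0,
odd minors carry new features, even minors only take bug fixes. The
function names (release_kind, schedule) and the fixed one-month spacing are
assumptions made for illustration only; nothing like this exists in the
Cassandra tooling as far as the thread says, and 3.0 itself is left out
because it is the transition release.

```python
from typing import Iterator, Tuple


def release_kind(minor: int) -> str:
    """Classify a 3.x minor version under the proposed tick-tock rule.

    Odd minors accept new features; even minors accept only bug fixes.
    """
    return "feature" if minor % 2 == 1 else "bugfix-only"


def schedule(last_minor: int, major: int = 3) -> Iterator[Tuple[str, int, str]]:
    """Yield (version, months after X.0, kind) for minors 1..last_minor.

    Assumes exactly one release per month with no slips, which is the
    stated goal of the proposal, not a guarantee.
    """
    for minor in range(1, last_minor + 1):
        yield f"{major}.{minor}", minor, release_kind(minor)


if __name__ == "__main__":
    for version, months, kind in schedule(6):
        print(f"{version}: +{months} month(s) after 3.0, {kind}")
```

Read this way, a stable (even) release lands every two months, which
matches the "production-ready releases every two months" goal in the
proposal.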
