I agree tick-tock is a failure.  But for two reasons IMO:

1) Ultimately, the users are the real testers and it takes a while for a
release to percolate into the wild for feedback.  The reality is that a
release doesn't have its tires properly kicked for at least three months
after it's cut.  So if we are to have any tocks, they should be completely
unwed from the ticks, and should probably happen on a ~3M cadence to keep
the labour down but the utility up (and there should probably still be more
than one tock per tick)

2) The promised resources to improve the process never materialized.  We only
reached parity with the 2.1 release (i.e. no failing u/dtests) very
recently.


On 15 September 2016 at 19:08, Jeff Jirsa <jeff.ji...@crowdstrike.com>
wrote:

> I know we’ve got a lot of folks following the dev list without a lot of
> background, so let’s make sure we get some context here so everyone can be
> on the same page.
>
> Going to preface this wall of text by saying I’m +1 on a 3.5.1 (and 3.3.1,
> etc) if it’s done AFTER 3.9 (I think we need to get 3.9 out first before
> the RE manpower is spent on backporting fixes, even critical fixes, because
> 3.9 has multiple critical fixes for people running 3.7).
>
> Now some background:
>
> For many years, Cassandra used to have a dev process that kept 3 active
> branches - “bleeding edge”, a “stable”, and an “old stable” branch, where
> developers would be committing ALL new contributions to the bleeding edge,
> non-api-breaking changes to stable, and bugfixes only to old stable. While
> the api changed and major features were added, that bleeding edge would
> just be ‘trunk’, and it’d get cut into a major version when it was ready to
> ship. We saw that with 2.2 / 2.1 / 2.0 (and before that, 2.1 / 2.0 / 1.2,
> and before that 2.0 / 1.2 / 1.1 ). When that bleeding edge got released as
> a major x.y.0, the third, oldest, most stable branch went EOL, and new
> features would go into trunk for the next major version.
>
> There were two big negatives observed with this:
>
> The first big negative is that if multiple major new features were in
> flight, releases were prone to delay. Nobody wants to break an API on a
> x.y.1 release, and nobody wants to add a new feature to a x.y.2 release, so
> the project would delay the x.y releases if major features were close, and
> then there’d be pressure to slip them in before they were fully tested, or
> cut features to avoid delaying the release. This pressure was observed to
> be bad for the project – it forced technical compromises.
>
> The second downside that was observed was that nobody would try to run the
> new versions when they launched, because they were filled with new features
> and therefore buggy. 2.2, for example, introduced RBAC, commitlog
> compression, and user defined functions – major features that needed to be
> tested. Unfortunately, because there were few real-world testers, there
> were still major bugs being found for months – the first production-ready
> version of 2.2 is probably in the 2.2.5 or 2.2.6 range.
>
> For version 3, we moved to an alternate release process, modeled on Intel’s
> tick/tock: https://en.wikipedia.org/wiki/Tick-Tock_model
>
> The intention was to allow new features into 3.even releases (3.0, 3.2,
> 3.4, 3.6, and so on), with bugfixes in 3.odd releases (3.1, … ). The hope
> was to allow more frequent releases to address the first big negative
> (flood of new features that blocked releases), while also helping to
> address the second: with fewer major features in a release, each one
> would get better test coverage.
>
> In the tick/tock model, anyone running 3.odd (like 3.5) should be looking
> for bugfixes in 3.7. It’s certainly true that 3.5 is horribly broken (as is
> 3.3, and 3.4, etc), but with this release model, the bugfix SHOULD BE in
> 3.7. As I mentioned previously, we have precedent for backporting critical
> fixes, but we don’t have a well defined bar (that I see) for what’s
> critical enough for a backport.
>
> What Jon is noting (and what many of us who run Cassandra in production have
> really known for a very long time) is that nobody wants to run 3.newest
> (even or odd), because 3.newest is likely broken (because it’s a complex
> distributed database, and testing is hard, and it takes time and complex
> workloads to find bugs). In the tick/tock model, because new features went
> into 3.6, 3.7 contains new features that may not be adequately
> tested/validated, that a user of 3.5 doesn’t want, and whose risk that user
> isn’t willing to accept.
>
> The bottom line here is that tick/tock is probably a well intentioned but
> failed attempt to bring stability to Cassandra’s releases. The problems
> tick/tock was meant to solve are real problems, but tick/tock doesn’t seem
> to be addressing them – new features invalidate old testing, which makes it
> difficult/impossible for real users to sit on the 3.odd versions.
>
> We’re due for cutting 3.9 and 3.0.9, and we have limited RE manpower to
> get those out. Only after those are out would I be +1 on a 3.5.1, and then
> only because if I were running 3.5, and I hit this bug, I wouldn’t want to
> spend the ~$100k it would cost my organization to validate 3.7 prior to
> upgrading, and I don’t think it’s reasonable to ask users to recompile a
> release for a ~10 line fix for a very nasty bug.
>
> I also very strongly recommend we (committers/PMC) reconsider tick/tock
> for 4.x releases, because this is exactly the type of problem that will
> continue to happen as we move forward. I suggest that we either need to go
> back to the old model and do a better job of dealing with feature creep and
> testing, or we need to better define what gets backported, because the
> community needs a stable version to run, and running latest odd release of
> tick/tock isn’t it.
>
> - Jeff
>
>
> On 9/15/16, 10:31 AM, "dave_les...@apple.com on behalf of Dave Lester" <dave_les...@apple.com> wrote:
>
> >How would cutting a 3.5.1 release possibly confuse users of the software?
> >It would be easy to document the change and to send release notes.
> >
> >Given the bug’s critical nature and that it's a minor fix, I’m +1
> >(non-binding) to a new release.
> >
> >Dave
> >
> >> On Sep 15, 2016, at 7:18 AM, Jeremiah D Jordan <jeremiah.jordan@gmail.com> wrote:
> >>
> >> I’m with Jeff on this, 3.7 (bug fixes on 3.6) has already been released
> >> with the fix.  Since the fix applies cleanly anyone is free to put it on
> >> top of 3.5 on their own if they like, but I see no reason to put out a
> >> 3.5.1 right now and confuse people further.
> >>
> >> -Jeremiah
> >>
> >>
> >>> On Sep 15, 2016, at 9:07 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
> >>>
> >>> As a follow up, I suppose I'm only advocating for a fix to the odd
> >>> releases.  Sadly, Tick Tock versioning is misleading.
> >>>
> >>> If tick tock were to continue (and I'm very much against how it currently
> >>> works) the whole even-features odd-fixes thing needs to stop ASAP, all it
> >>> does is confuse people.
> >>>
> >>> The follow up to 3.4 (3.5) should have been 3.4.1, following semver, so
> >>> people know it's bug fixes only to 3.4.
> >>>
> >>> Jon
> >>>
> >>> On Wed, Sep 14, 2016 at 10:37 PM Jonathan Haddad <j...@jonhaddad.com> wrote:
> >>>
> >>>> In this particular case, I'd say adding a bug fix release for every
> >>>> version that's affected would be the right thing.  The issue is so easily
> >>>> reproducible and will likely result in massive data loss for anyone on 3.X
> >>>> (where X < 6) who uses the "date" type.
> >>>>
> >>>> This is how easy it is to reproduce:
> >>>>
> >>>> 1. Start Cassandra 3.5
> >>>> 2. create KEYSPACE test WITH replication = {'class': 'SimpleStrategy',
> >>>> 'replication_factor': 1};
> >>>> 3. use test;
> >>>> 4. create table fail (id int primary key, d date);
> >>>> 5. delete d from fail where id = 1;
> >>>> 6. Stop Cassandra
> >>>> 7. Start Cassandra
> >>>>
> >>>> You will get this, and startup will fail:
> >>>>
> >>>> ERROR 05:32:09 Exiting due to error while processing commit log during initialization.
> >>>> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Unexpected error deserializing mutation; saved to /var/folders/0l/g2p6cnyd5kx_1wkl83nd3y4r0000gn/T/mutation6313332720566971713dat.
> >>>> This may be caused by replaying a mutation against a table with the same name but incompatible schema.  Exception follows:
> >>>> org.apache.cassandra.serializers.MarshalException: Expected 4 byte long for date (0)
> >>>>
> >>>> I mean.. come on.  It's an easy fix.  It cleanly merges against 3.5 (and
> >>>> probably the other releases) and requires very little investment from
> >>>> anyone.
> >>>>
> >>>>
> >>>> On Wed, Sep 14, 2016 at 9:40 PM Jeff Jirsa <jeff.ji...@crowdstrike.com> wrote:
> >>>>
> >>>>> We did 3.1.1 and 3.2.1, so there’s SOME precedent for emergency fixes,
> >>>>> but we certainly didn’t/won’t go back and cut new releases from every
> >>>>> branch for every critical bug in future releases, so I think we need to
> >>>>> draw the line somewhere. If it’s fixed in 3.7 and 3.0.x (x >= 6), it seems
> >>>>> like you’ve got options (either stay on the tick and go up to 3.7, or bail
> >>>>> down to 3.0.x)
> >>>>>
> >>>>> Perhaps, though, this highlights the fact that tick/tock may not be the
> >>>>> best option long term. We’ve tried it for a year, perhaps we should instead
> >>>>> discuss whether or not it should continue, or if there’s another process
> >>>>> that gives us a better way to get useful patches into versions people are
> >>>>> willing to run in production.
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 9/14/16, 8:55 PM, "Jonathan Haddad" <j...@jonhaddad.com> wrote:
> >>>>>
> >>>>>> Common sense is what prevents someone from upgrading to yet another
> >>>>>> completely unknown version with new features which have probably broken
> >>>>>> even more stuff that nobody is aware of.  The folks I'm helping right now
> >>>>>> deployed 3.5 when they got started because http://cassandra.apache.org
> >>>>>> suggests it's acceptable for production.  It turns out using 4 of the
> >>>>>> built in datatypes of the database results in the server being unable to
> >>>>>> restart without clearing out the commit logs and running a repair.  That
> >>>>>> screams critical to me.  You shouldn't even be able to install 3.5 without
> >>>>>> the patch I've supplied - that bug is a ticking time bomb for anyone that
> >>>>>> installs it.
> >>>>>>
> >>>>>> On Wed, Sep 14, 2016 at 8:12 PM Michael Shuler <mich...@pbandjelly.org>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> What's preventing the use of the 3.6 or 3.7 releases where this bug is
> >>>>>>> already fixed? This is also fixed in the 3.0.6/7/8 releases.
> >>>>>>>
> >>>>>>> Michael
> >>>>>>>
> >>>>>>> On 09/14/2016 08:30 PM, Jonathan Haddad wrote:
> >>>>>>>> Unfortunately CASSANDRA-11618 was fixed in 3.6 but was not back ported
> >>>>>>>> to 3.5 as well, and it makes Cassandra effectively unusable if someone
> >>>>>>>> is using any of the 4 types affected in any of their schema.
> >>>>>>>>
> >>>>>>>> I have cherry picked & merged the patch back to here and will put it
> >>>>>>>> in a JIRA as well tonight, I just wanted to get the ball rolling asap
> >>>>>>>> on this.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>>> https://github.com/rustyrazorblade/cassandra/tree/fix_commitlog_exception
> >>>>>>>>
> >>>>>>>> Jon
> >>>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>
> >
>
