Re: Proposal - 3.5.1

Edward Capriolo Thu, 15 Sep 2016 13:20:17 -0700

It is funny you say this:

"tick-tock started based off of the 3.0 big bang “we broke everything”
release"


*"Brain battles itself over short-term rewards, long-term goals"*
https://www.princeton.edu/pr/news/04/q4/1014-brain.htm

*Normalization of deviance in software: how broken practices become
standard*
https://news.ycombinator.com/item?id=10811822

I had something really long written, but I summarized it down to this
thought. Huge generalization coming:

Group 1 "I have 1GB of data on a 200GB disk, I am going to switch to level
DB and see what happens. YOLO DB!"

vs.

Group 2 "I have 60GB data on a 200GB disk. If I switch to level DB I have
to do it in a way that does not impact my current users, and in a way that
won't fill my disks, and doing this in a controlled way might take days"

Users gravitate toward Group 2 over time, and as they do they become more
risk averse. They are not going to want to upgrade more than twice a year.
If they see risk they will not upgrade at all. If Group 2 is not upgrading,
all the "testers" come from Group 1.

I think a new metric system would be fun. In the readme.txt:

T = TestAdded
D = DTestAdded
F = Feature
B = Fix
N = Ninja Fix
R = Refactor

Version 3.0
DDFFBBBBBBRRRRRTTTTDDDDDD

Version 3.1
FBBBBBBBBBBRRRRTTDD

Over time, if these do not gravitate toward FTD, we know we are headed in
the wrong direction.
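
A minimal sketch of how such strings could be scored, assuming Python and a
hypothetical helper named healthy_share (neither is part of the proposal). It
only measures what share of each release string is made up of F, T and D
letters, using the two example strings above.

```python
from collections import Counter

# Letters from the legend above: T = test added, D = dtest added,
# F = feature, B = fix, N = ninja fix, R = refactor.
# "Healthy" here means the F/T/D letters the message wants releases
# to gravitate toward; that grouping is this sketch's assumption.
HEALTHY = set("FTD")


def healthy_share(changes: str) -> float:
    """Return the fraction of changes that are features, tests, or dtests."""
    counts = Counter(changes)
    total = sum(counts.values())
    if total == 0:
        # An empty release string has no changes to score.
        return 0.0
    return sum(counts[letter] for letter in HEALTHY) / total


# The two example release strings from the message.
releases = {
    "3.0": "DDFFBBBBBBRRRRRTTTTDDDDDD",
    "3.1": "FBBBBBBBBBBRRRRTTDD",
}

for version, changes in releases.items():
    print(f"{version}: {healthy_share(changes):.0%} F/T/D")
```

A share that falls from one release to the next would be the "wrong
direction" signal described above.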

On Thu, Sep 15, 2016 at 2:57 PM, Jeremiah D Jordan <
jeremiah.jor...@gmail.com> wrote:

> Because tick-tock started based off of the 3.0 big bang “we broke
> everything” release I don’t think we can judge whether or not it is working
> until we are another 6 months in.  AKA when we would have been releasing
> the next big bang release.  Right now a lot if not most of the bugs in a
> given tick tock release are bugs that were introduced in 3.0.  Even the bug
> mentioned here, it is not a tick tock bug, it is a 3.0 bug.
>
>
> > On Sep 15, 2016, at 1:48 PM, Jake Luciani <jak...@gmail.com> wrote:
> >
> > I'm pretty sure everyone will agree Tick-Tock didn't go well and needs to
> > change.
> >
> > The problem for me is going back to the old way doesn't sound great.
> There
> > are parts of tick-tock I really like,
> > for example, the cadence and limited scope per release.
> >
> > I know at the summit there were a lot of ideas thrown around I can
> > regurgitate but perhaps people
> > who have been thinking about this would like to chime in and present
> ideas?
> >
> > -Jake
> >
> > On Thu, Sep 15, 2016 at 2:28 PM, Benedict Elliott Smith <
> bened...@apache.org
> >> wrote:
> >
> >> I agree tick-tock is a failure.  But for two reasons IMO:
> >>
> >> 1) Ultimately, the users are the real testers and it takes a while for a
> >> release to percolate into the wild for feedback.  The reality is that a
> >> release doesn't have its tires properly kicked for at least three months
> >> after it's cut.  So if we are to have any tocks, they should be
> completely
> >> unwed from the ticks, and should probably happen on a ~3M cadence to
> keep
> >> the labour down but the utility up (and there should probably still be
> more
> >> than one tock per tick)
> >>
> >> 2) Those promised resources to improve process never happened.  We
> haven't
> >> even reached parity with the 2.1 release until very recently, i.e. no
> >> failing u/dtests.
> >>
> >>
> >> On 15 September 2016 at 19:08, Jeff Jirsa <jeff.ji...@crowdstrike.com>
> >> wrote:
> >>
> >>> I know we’ve got a lot of folks following the dev list without a lot of
> >>> background, so let’s make sure we get some context here so everyone can
> >> be
> >>> on the same page.
> >>>
> >>> Going to preface this wall of text by saying I’m +1 on a 3.5.1 (and
> >> 3.3.1,
> >>> etc) if it’s done AFTER 3.9 (I think we need to get 3.9 out first
> before
> >>> the RE manpower is spent on backporting fixes, even critical fixes,
> >> because
> >>> 3.9 has multiple critical fixes for people running 3.7).
> >>>
> >>> Now some background:
> >>>
> >>> For many years, Cassandra used to have a dev process that kept 3 active
> >>> branches - “bleeding edge”, a “stable”, and an “old stable” branch,
> where
> >>> developers would be committing ALL new contributions to the bleeding
> >> edge,
> >>> non-api-breaking changes to stable, and bugfixes only to old stable.
> >> While
> >>> the api changed and major features were added, that bleeding edge would
> >>> just be ‘trunk’, and it’d get cut into a major version when it was
> ready
> >> to
> >>> ship. We saw that with 2.2 / 2.1 / 2.0 (and before that, 2.1 / 2.0 /
> 1.2,
> >>> and before that 2.0 / 1.2 / 1.1 ). When that bleeding edge got released
> >> as
> >>> a major x.y.0, the third, oldest, most stable branch went EOL, and new
> >>> features would go into trunk for the next major version.
> >>>
> >>> There were two big negatives observed with this:
> >>>
> >>> The first big negative is that if multiple major new features were in
> >>> flight, releases were prone to delay. Nobody wants to break an API on a
> >>> x.y.1 release, and nobody wants to add a new feature to a x.y.2
> release,
> >> so
> >>> the project would delay the x.y releases if major features were close,
> >> and
> >>> then there’d be pressure to slip them in before they were fully tested,
> >> or
> >>> cut features to avoid delaying the release. This pressure was observed
> to
> >>> be bad for the project – it forced technical compromises.
> >>>
> >>> The second downside that was observed was that nobody would try to run
> >> the
> >>> new versions when they launched, because they were buggy because they
> >> were
> >>> filled with new features. 2.2, for example, introduced RBAC, commitlog
> >>> compression, and user defined functions – major features that needed to
> >> be
> >>> tested. Unfortunately, because there were few real-world testers, there
> >>> were still major bugs being found for months – the first
> production-ready
> >>> version of 2.2 is probably in the 2.2.5 or 2.2.6 range.
> >>>
> >>> For version 3, we moved to an alternate release, modeled on Intel’s
> >>> tick/tock https://en.wikipedia.org/wiki/Tick-Tock_model
> >>>
> >>> The intention was to allow new features into 3.even releases (3.0, 3.2,
> >>> 3.4, 3.6, and so on), with bugfixes in 3.odd releases (3.1, … ). The
> hope
> >>> was to allow more frequent releases to address the first big negative
> >>> (flood of new features that blocked releases), while also helping to
> >>> address the second – with fewer major features in a release, they
> better
> >>> get more/better test coverage.
> >>>
> >>> In the tick/tock model, anyone running 3.odd (like 3.5) should be
> looking
> >>> for bugfixes in 3.7. It’s certainly true that 3.5 is horribly broken
> (as
> >> is
> >>> 3.3, and 3.4, etc), but with this release model, the bugfix SHOULD BE
> in
> >>> 3.7. As I mentioned previously, we have precedent for backporting
> >> critical
> >>> fixes, but we don’t have a well defined bar (that I see) for what’s
> >>> critical enough for a backport.
> >>>
> >>> What Jon is noting (and what many of us who run Cassandra in production have
> >>> really known for a very long time) is that nobody wants to run 3.newest
> >>> (even or odd), because 3.newest is likely broken (because it’s a
> complex
> >>> distributed database, and testing is hard, and it takes time and
> complex
> >>> workloads to find bugs). In the tick/tock model, because new features
> >> went
> >>> into 3.6, there are new features that may not be adequately
> >>> tested/validated in 3.7 a user of 3.5 doesn’t want, and isn’t willing
> to
> >>> accept the risk.
> >>>
> >>> The bottom line here is that tick/tock is probably a well intentioned
> but
> >>> failed attempt to bring stability to Cassandra’s releases. The problems
> >>> tick/tock was meant to solve are real problems, but tick/tock doesn’t
> >> seem
> >>> to be addressing them – new features invalidate old testing, which
> makes
> >> it
> >>> difficult/impossible for real users to sit on the 3.odd versions.
> >>>
> >>> We’re due for cutting 3.9 and 3.0.9, and we have limited RE manpower to
> >>> get those out. Only after those are out would I be +1 on a 3.5.1, and
> >> then
> >>> only because if I were running 3.5, and I hit this bug, I wouldn’t want
> >> to
> >>> spend the ~$100k it would cost my organization to validate 3.7 prior to
> >>> upgrading, and I don’t think it’s reasonable to ask users to recompile
> a
> >>> release for a ~10 line fix for a very nasty bug.
> >>>
> >>> I’d also very strongly recommend we (committers/PMC) reconsider
> tick/tock
> >>> for 4.x releases, because this is exactly the type of problem that will
> >>> continue to happen as we move forward. I suggest that we either need to
> >> go
> >>> back to the old model and do a better job of dealing with feature creep
> >> and
> >>> testing, or we need to better define what gets backported, because the
> >>> community needs a stable version to run, and running latest odd release
> >> of
> >>> tick/tock isn’t it.
> >>>
> >>> - Jeff
> >>>
> >>>
> >>> On 9/15/16, 10:31 AM, "dave_les...@apple.com on behalf of Dave
> Lester" <
> >>> dave_les...@apple.com> wrote:
> >>>
> >>>> How would cutting a 3.5.1 release possibly confuse users of the
> >> software?
> >>> It would be easy to document the change and to send release notes.
> >>>>
> >>>> Given the bug’s critical nature and that it's a minor fix, I’m +1
> >>> (non-binding) to a new release.
> >>>>
> >>>> Dave
> >>>>
> >>>>> On Sep 15, 2016, at 7:18 AM, Jeremiah D Jordan <jeremiah.jor...@gmail.com> wrote:
> >>>>>
> >>>>> I’m with Jeff on this, 3.7 (bug fixes on 3.6) has already been
> >> released
> >>> with the fix.  Since the fix applies cleanly anyone is free to put it
> on
> >>> top of 3.5 on their own if they like, but I see no reason to put out a
> >>> 3.5.1 right now and confuse people further.
> >>>>>
> >>>>> -Jeremiah
> >>>>>
> >>>>>
> >>>>>> On Sep 15, 2016, at 9:07 AM, Jonathan Haddad <j...@jonhaddad.com>
> >>> wrote:
> >>>>>>
> >>>>>> As a follow up, I suppose I'm only advocating for a fix to the odd
> >>>>>> releases.  Sadly, Tick Tock versioning is misleading.
> >>>>>>
> >>>>>> If tick tock were to continue (and I'm very much against how it
> >>> currently
> >>>>>> works) the whole even-features odd-fixes thing needs to stop ASAP,
> >> all
> >>> it
> >>>>>> does is confuse people.
> >>>>>>
> >>>>>> The follow up to 3.4 (3.5) should have been 3.4.1, following semver,
> >> so
> >>>>>> people know it's bug fixes only to 3.4.
> >>>>>>
> >>>>>> Jon
> >>>>>>
> >>>>>> On Wed, Sep 14, 2016 at 10:37 PM Jonathan Haddad <j...@jonhaddad.com
> >
> >>> wrote:
> >>>>>>
> >>>>>>> In this particular case, I'd say adding a bug fix release for every
> >>>>>>> version that's affected would be the right thing.  The issue is so
> >>> easily
> >>>>>>> reproducible and will likely result in massive data loss for anyone
> >>> on 3.X
> >>>>>>> WHERE X < 6 and uses the "date" type.
> >>>>>>>
> >>>>>>> This is how easy it is to reproduce:
> >>>>>>>
> >>>>>>> 1. Start Cassandra 3.5
> >>>>>>> 2. create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> >>>>>>> 3. use test;
> >>>>>>> 4. create table fail (id int primary key, d date);
> >>>>>>> 5. delete d from fail where id = 1;
> >>>>>>> 6. Stop Cassandra
> >>>>>>> 7. Start Cassandra
> >>>>>>>
> >>>>>>> You will get this, and startup will fail:
> >>>>>>>
> >>>>>>> ERROR 05:32:09 Exiting due to error while processing commit log during initialization.
> >>>>>>> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException:
> >>>>>>> Unexpected error deserializing mutation; saved to
> >>>>>>> /var/folders/0l/g2p6cnyd5kx_1wkl83nd3y4r0000gn/T/mutation6313332720566971713dat.
> >>>>>>> This may be caused by replaying a mutation against a table with the same
> >>>>>>> name but incompatible schema.  Exception follows:
> >>>>>>> org.apache.cassandra.serializers.MarshalException: Expected 4 byte long for date (0)
> >>>>>>>
> >>>>>>> I mean.. come on.  It's an easy fix.  It cleanly merges against 3.5
> >>> (and
> >>>>>>> probably the other releases) and requires very little investment
> >> from
> >>>>>>> anyone.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Wed, Sep 14, 2016 at 9:40 PM Jeff Jirsa <
> >>> jeff.ji...@crowdstrike.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> We did 3.1.1 and 3.2.1, so there’s SOME precedent for emergency
> >>> fixes,
> >>>>>>>> but we certainly didn’t/won’t go back and cut new releases from
> >> every
> >>>>>>>> branch for every critical bug in future releases, so I think we
> >> need
> >>> to
> >>>>>>>> draw the line somewhere. If it’s fixed in 3.7 and 3.0.x (x >= 6),
> >> it
> >>> seems
> >>>>>>>> like you’ve got options (either stay on the tick and go up to 3.7,
> >>> or bail
> >>>>>>>> down to 3.0.x)
> >>>>>>>>
> >>>>>>>> Perhaps, though, this highlights the fact that tick/tock may not
> be
> >>> the
> >>>>>>>> best option long term. We’ve tried it for a year, perhaps we
> should
> >>> instead
> >>>>>>>> discuss whether or not it should continue, or if there’s another
> >>> process
> >>>>>>>> that gives us a better way to get useful patches into versions
> >>> people are
> >>>>>>>> willing to run in production.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 9/14/16, 8:55 PM, "Jonathan Haddad" <j...@jonhaddad.com> wrote:
> >>>>>>>>
> >>>>>>>>> Common sense is what prevents someone from upgrading to yet
> >> another
> >>>>>>>>> completely unknown version with new features which have probably
> >>> broken
> >>>>>>>>> even more stuff that nobody is aware of.  The folks I'm helping
> >>> right
> >>>>>>>>> deployed 3.5 when they got started because
> >>>>>>>>> http://cassandra.apache.org suggests
> >>>>>>>>> it's acceptable for production.  It turns out using 4 of the
> built
> >>> in
> >>>>>>>>> datatypes of the database result in the server being unable to
> >>> restart
> >>>>>>>>> without clearing out the commit logs and running a repair.  That
> >>> screams
> >>>>>>>>> critical to me.  You shouldn't even be able to install 3.5
> without
> >>> the
> >>>>>>>>> patch I've supplied - that bug is a ticking time bomb for anyone
> >>> that
> >>>>>>>>> installs it.
> >>>>>>>>>
> >>>>>>>>> On Wed, Sep 14, 2016 at 8:12 PM Michael Shuler <
> >>> mich...@pbandjelly.org>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> What's preventing the use of the 3.6 or 3.7 releases where this
> >>> bug is
> >>>>>>>>>> already fixed? This is also fixed in the 3.0.6/7/8 releases.
> >>>>>>>>>>
> >>>>>>>>>> Michael
> >>>>>>>>>>
> >>>>>>>>>> On 09/14/2016 08:30 PM, Jonathan Haddad wrote:
> >>>>>>>>>>> Unfortunately CASSANDRA-11618 was fixed in 3.6 but was not back
> >>>>>>>> ported to
> >>>>>>>>>>> 3.5 as well, and it makes Cassandra effectively unusable if
> >>> someone
> >>>>>>>> is
> >>>>>>>>>>> using any of the 4 types affected in any of their schema.
> >>>>>>>>>>>
> >>>>>>>>>>> I have cherry picked & merged the patch back to here and will
> >> put
> >>> it
> >>>>>>>> in a
> >>>>>>>>>>> JIRA as well tonight, I just wanted to get the ball rolling
> asap
> >>> on
> >>>>>>>> this.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>> https://github.com/rustyrazorblade/cassandra/tree/fix_commitlog_exception
> >>>>>>>>>>>
> >>>>>>>>>>> Jon
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
> >
> >
> > --
> > http://twitter.com/tjake
>
>
