Right - I think, like Jake and others have said, it seems appropriate to do
something at this point.  Would a clearer, more liberal backport policy for the
odd versions be worthwhile until we find our footing?  As Jeremiah said, it
does seem like the big bang 3.0 release has caused much of the baggage that
we’re facing.  Combined with that is the slow uptake of any specific version so
far, at least partly because of the newness of the release model.

To me, the hard thing about 3 month releases is that you then get into
larger, untested feature releases, which is what tick-tock was originally
supposed to get away from.

So in essence, would we
1) do nothing and see it through
2) have a more liberal backport policy in the 3.x line and revisit once we get 
to 4
3) do a tick-tock(-tock-tock) sort of model
4) do some sort of LTS
5) go back to the drawing board
6) go back to the old model

I think the earlier numbers imply some confidence in the thinking behind 
tick-tock.  Would 2 be acceptable to see the 3.x line through with the current 
release model?  Or do we need to do something more extensive at this stage?

> On Sep 15, 2016, at 1:59 PM, Jonathan Haddad <j...@jonhaddad.com> wrote:
> 
> I don't think it's binary - we don't have to do year-long insanity or
> bleeding-edge craziness.
> 
> How about a release every 3 months, with each release accepting 6 months of
> patches?  (oldstable & newstable)  Also provide nightly builds & stick to
> the idea of stable trunk.
> 
> The issue is the number of bug fixes a given release gets.  A single bug
> fix release for a new feature release is just terrible.  The community as a
> whole despises this system, and it is lowering confidence in the project.
> 
> Jon
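Jon's proposal can be checked with a little arithmetic. The sketch below is a hypothetical model, not an agreed policy: it assumes releases are cut every 3 months starting at an arbitrary month 0, and that each release takes patches for the 6 months after it is cut. Under those assumptions, from month 3 onward exactly two releases (an oldstable and a newstable) accept patches at any time.

```python
# Hypothetical model of "a release every 3 months, each accepting 6 months of patches".
# Months are counted from an arbitrary month 0; none of this is project policy.
CADENCE_MONTHS = 3        # assumed: a new release is cut every 3 months
PATCH_WINDOW_MONTHS = 6   # assumed: each release accepts patches for 6 months


def releases_cut_by(month: int) -> list[int]:
    """Months at which releases have been cut, up to and including `month`."""
    return list(range(0, month + 1, CADENCE_MONTHS))


def open_branches(month: int) -> list[int]:
    """Cut-months of the releases still accepting patches at `month`.

    A release cut at month c accepts patches while c <= month < c + PATCH_WINDOW_MONTHS.
    """
    return [c for c in releases_cut_by(month) if month < c + PATCH_WINDOW_MONTHS]
```

For example, `open_branches(7)` returns `[3, 6]`: the release cut at month 0 stopped taking patches at month 6, leaving the month-3 release as oldstable and the month-6 release as newstable.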
> 
> 
> On Thu, Sep 15, 2016 at 11:48 AM Jake Luciani <jak...@gmail.com> wrote:
> 
>> I'm pretty sure everyone will agree Tick-Tock didn't go well and needs to
>> change.
>> 
>> The problem for me is that going back to the old way doesn't sound great.
>> There are parts of tick-tock I really like, for example the cadence and
>> limited scope per release.
>> 
>> I know at the summit there were a lot of ideas thrown around that I can
>> regurgitate, but perhaps people who have been thinking about this would
>> like to chime in and present ideas?
>> 
>> -Jake
>> 
>> On Thu, Sep 15, 2016 at 2:28 PM, Benedict Elliott Smith <bened...@apache.org>
>> wrote:
>> 
>>> I agree tick-tock is a failure.  But for two reasons IMO:
>>> 
>>> 1) Ultimately, the users are the real testers, and it takes a while for a
>>> release to percolate into the wild for feedback.  The reality is that a
>>> release doesn't have its tires properly kicked for at least three months
>>> after it's cut.  So if we are to have any tocks, they should be completely
>>> unwed from the ticks, and should probably happen on a ~3M cadence to keep
>>> the labour down but the utility up (and there should probably still be more
>>> than one tock per tick).
>>> 
>>> 2) Those promised resources for improving the process never materialized.
>>> We didn't even reach parity with the 2.1 release (i.e. no failing
>>> u/dtests) until very recently.
>>> 
>>> 
>>> On 15 September 2016 at 19:08, Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>> wrote:
>>> 
>>>> I know we’ve got a lot of folks following the dev list without a lot of
>>>> background, so let’s make sure we get some context here so everyone can
>>>> be on the same page.
>>>> 
>>>> Going to preface this wall of text by saying I’m +1 on a 3.5.1 (and
>>>> 3.3.1, etc.) if it’s done AFTER 3.9 (I think we need to get 3.9 out
>>>> before the RE manpower is spent on backporting fixes, even critical
>>>> fixes, because 3.9 has multiple critical fixes for people running 3.7).
>>>> 
>>>> Now some background:
>>>> 
>>>> For many years, Cassandra used to have a dev process that kept 3 active
>>>> branches - a “bleeding edge”, a “stable”, and an “old stable” branch -
>>>> where developers would commit ALL new contributions to the bleeding edge,
>>>> non-API-breaking changes to stable, and only bugfixes to old stable.
>>>> While the API changed and major features were added, that bleeding edge
>>>> would just be ‘trunk’, and it’d get cut into a major version when it was
>>>> ready to ship. We saw that with 2.2 / 2.1 / 2.0 (and before that,
>>>> 2.1 / 2.0 / 1.2, and before that 2.0 / 1.2 / 1.1). When that bleeding edge
>>>> got released as a major x.y.0, the third, oldest, most stable branch went
>>>> EOL, and new features would go into trunk for the next major version.
>>>> 
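The old three-branch model Jeff describes can be summarized as a routing table plus a rotation step. This is a hypothetical sketch for illustration only: `Change`, `eligible_branches()` and `release_trunk()` are invented names, it assumes bugfixes count as non-API-breaking changes, and it models only the rules stated in the thread (what each branch accepts, and that releasing trunk sends the oldest branch EOL).

```python
from enum import Enum


class Change(Enum):
    """Kinds of contribution distinguished in the old model (illustrative names)."""
    NEW_FEATURE = "new feature"             # may change or break the API
    NON_API_BREAKING = "non-API-breaking"   # changes that keep the API intact
    BUGFIX = "bugfix"


# What each active branch accepts. Assumption: bugfixes are non-API-breaking,
# so stable takes them too; old stable takes bugfixes only.
ACCEPTS = {
    "bleeding edge": {Change.NEW_FEATURE, Change.NON_API_BREAKING, Change.BUGFIX},
    "stable": {Change.NON_API_BREAKING, Change.BUGFIX},
    "old stable": {Change.BUGFIX},
}


def eligible_branches(change: Change) -> list[str]:
    """Branches a change of this kind may be committed to, newest first."""
    return [branch for branch, kinds in ACCEPTS.items() if change in kinds]


def release_trunk(old_stable: str, stable: str, trunk: str, next_trunk: str):
    """Cut trunk as a major release and rotate the three branches.

    Returns ((new_old_stable, new_stable, new_trunk), eol_branch): the released
    trunk becomes the newest stable branch and the oldest branch goes EOL.
    """
    return (stable, trunk, next_trunk), old_stable
```

Using the sequence from the thread, `release_trunk("1.2", "2.0", "2.1", "2.2")` returns `(("2.0", "2.1", "2.2"), "1.2")`: 2.1 ships, 1.2 goes EOL, and new features start landing on the 2.2 trunk.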
>>>> There were two big negatives observed with this:
>>>> 
>>>> The first big negative is that if multiple major new features were in
>>>> flight, releases were prone to delay. Nobody wants to break an API on a
>>>> x.y.1 release, and nobody wants to add a new feature to a x.y.2
>> release,
>>> so
>>>> the project would delay the x.y releases if major features were close,
>>> and
>>>> then there’d be pressure to slip them in before they were fully tested,
>>> or
>>>> cut features to avoid delaying the release. This pressure was observed
>> to
>>>> be bad for the project – it forced technical compromises.
>>>> 
>>>> The second downside that was observed was that nobody would try to run
>>>> the new versions when they launched, because they were buggy: they were
>>>> filled with new features. 2.2, for example, introduced RBAC, commitlog
>>>> compression, and user defined functions – major features that needed to
>>>> be tested. Unfortunately, because there were few real-world testers,
>>>> major bugs were still being found for months – the first
>>>> production-ready version of 2.2 is probably in the 2.2.5 or 2.2.6 range.
>>>> 
>>>> For version 3, we moved to an alternate release model, based on Intel’s
>>>> tick/tock https://en.wikipedia.org/wiki/Tick-Tock_model
>>>> 
>>>> The intention was to allow new features into 3.even releases (3.0, 3.2,
>>>> 3.4, 3.6, and so on), with bugfixes in 3.odd releases (3.1, … ). The hope
>>>> was to allow more frequent releases to address the first big negative
>>>> (a flood of new features that blocked releases), while also helping to
>>>> address the second – with fewer major features in a release, each would
>>>> get more/better test coverage.
>>>> 
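The tick/tock numbering rule, and the semver-style numbering Jon argues for later in the thread (3.5 would have been 3.4.1), can be sketched as two small functions. They are illustrative and hypothetical (`parse_3x`, `tick_tock_kind` and `semver_equivalent` are invented names); they only handle plain "3.N" strings and ignore the separately maintained 3.0.x line and emergency point releases such as 3.1.1.

```python
def parse_3x(version: str) -> int:
    """Return N for a plain '3.N' tick/tock version string.

    Raises ValueError for anything else, e.g. '3.0.9' (too many parts) or '2.2'.
    """
    major, minor = (int(part) for part in version.split("."))
    if major != 3:
        raise ValueError(f"not a 3.x tick/tock version: {version!r}")
    return minor


def tick_tock_kind(version: str) -> str:
    """Even minors are feature releases (ticks); odd minors are bugfix-only (tocks)."""
    return "feature (tick)" if parse_3x(version) % 2 == 0 else "bugfix (tock)"


def semver_equivalent(version: str) -> str:
    """Renumber as Jon suggests: an odd bugfix release becomes a patch release
    of the preceding feature release (3.5 -> 3.4.1); feature releases get .0."""
    minor = parse_3x(version)
    if minor % 2 == 0:
        return f"3.{minor}.0"
    return f"3.{minor - 1}.1"
```

The point of the semver form is that the version string itself says "bug fixes only to 3.4", whereas under tick/tock a reader has to know the even/odd convention.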
>>>> In the tick/tock model, anyone running 3.odd (like 3.5) should be looking
>>>> for bugfixes in 3.7. It’s certainly true that 3.5 is horribly broken (as
>>>> is 3.3, and 3.4, etc.), but with this release model, the bugfix SHOULD BE
>>>> in 3.7. As I mentioned previously, we have precedent for backporting
>>>> critical fixes, but we don’t have a well-defined bar (that I see) for
>>>> what’s critical enough for a backport.
>>>> 
>>>> What Jon is noting (and what many of us who run Cassandra in production
>>>> have really known for a very long time) is that nobody wants to run
>>>> 3.newest (even or odd), because 3.newest is likely broken (because it’s a
>>>> complex distributed database, and testing is hard, and it takes time and
>>>> complex workloads to find bugs). In the tick/tock model, because new
>>>> features went into 3.6, 3.7 contains new features that may not be
>>>> adequately tested/validated, which a user of 3.5 doesn’t want and isn’t
>>>> willing to accept the risk of.
>>>> 
>>>> The bottom line here is that tick/tock is probably a well-intentioned but
>>>> failed attempt to bring stability to Cassandra’s releases. The problems
>>>> tick/tock was meant to solve are real problems, but tick/tock doesn’t seem
>>>> to be addressing them – new features invalidate old testing, which makes
>>>> it difficult/impossible for real users to sit on the 3.odd versions.
>>>> 
>>>> We’re due for cutting 3.9 and 3.0.9, and we have limited RE manpower to
>>>> get those out. Only after those are out would I be +1 on a 3.5.1, and
>>>> then only because if I were running 3.5 and I hit this bug, I wouldn’t
>>>> want to spend the ~$100k it would cost my organization to validate 3.7
>>>> prior to upgrading, and I don’t think it’s reasonable to ask users to
>>>> recompile a release for a ~10 line fix for a very nasty bug.
>>>> 
>>>> I also very strongly recommend that we (committers/PMC) reconsider
>>>> tick/tock for 4.x releases, because this is exactly the type of problem
>>>> that will continue to happen as we move forward. I suggest that we
>>>> either need to go back to the old model and do a better job of dealing
>>>> with feature creep and testing, or we need to better define what gets
>>>> backported, because the community needs a stable version to run, and
>>>> running the latest odd release of tick/tock isn’t it.
>>>> 
>>>> - Jeff
>>>> 
>>>> 
>>>> On 9/15/16, 10:31 AM, "dave_les...@apple.com on behalf of Dave Lester"
>>>> <dave_les...@apple.com> wrote:
>>>> 
>>>>> How would cutting a 3.5.1 release possibly confuse users of the
>>>>> software? It would be easy to document the change and to send release
>>>>> notes.
>>>>> 
>>>>> Given the bug’s critical nature and that it's a minor fix, I’m +1
>>>>> (non-binding) to a new release.
>>>>> 
>>>>> Dave
>>>>> 
>>>>>> On Sep 15, 2016, at 7:18 AM, Jeremiah D Jordan
>>>>>> <jeremiah.jor...@gmail.com> wrote:
>>>>>> 
>>>>>> I’m with Jeff on this: 3.7 (bug fixes on 3.6) has already been released
>>>>>> with the fix.  Since the fix applies cleanly, anyone is free to put it on
>>>>>> top of 3.5 on their own if they like, but I see no reason to put out a
>>>>>> 3.5.1 right now and confuse people further.
>>>>>> 
>>>>>> -Jeremiah
>>>>>> 
>>>>>> 
>>>>>>> On Sep 15, 2016, at 9:07 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
>>>>>>> 
>>>>>>> As a follow-up, I suppose I'm only advocating for a fix to the odd
>>>>>>> releases.  Sadly, Tick Tock versioning is misleading.
>>>>>>> 
>>>>>>> If tick tock were to continue (and I'm very much against how it
>>>>>>> currently works), the whole even-features odd-fixes thing needs to
>>>>>>> stop ASAP; all it does is confuse people.
>>>>>>> 
>>>>>>> The follow-up to 3.4 (3.5) should have been 3.4.1, following semver, so
>>>>>>> people know it's bug fixes only to 3.4.
>>>>>>> 
>>>>>>> Jon
>>>>>>> 
>>>>>>> On Wed, Sep 14, 2016 at 10:37 PM Jonathan Haddad <j...@jonhaddad.com>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> In this particular case, I'd say adding a bug fix release for every
>>>>>>>> version that's affected would be the right thing.  The issue is so
>>>>>>>> easily reproducible and will likely result in massive data loss for
>>>>>>>> anyone on 3.X where X < 6 who uses the "date" type.
>>>>>>>> 
>>>>>>>> This is how easy it is to reproduce:
>>>>>>>> 
>>>>>>>> 1. Start Cassandra 3.5
>>>>>>>> 2. create KEYSPACE test WITH replication = {'class': 'SimpleStrategy',
>>>>>>>>    'replication_factor': 1};
>>>>>>>> 3. use test;
>>>>>>>> 4. create table fail (id int primary key, d date);
>>>>>>>> 5. delete d from fail where id = 1;
>>>>>>>> 6. Stop Cassandra
>>>>>>>> 7. Start Cassandra
>>>>>>>> 
>>>>>>>> You will get this, and startup will fail:
>>>>>>>> 
>>>>>>>> ERROR 05:32:09 Exiting due to error while processing commit log during initialization.
>>>>>>>> org.apache.cassandra.db.commitlog.CommitLogReplayer$CommitLogReplayException: Unexpected error deserializing mutation; saved to /var/folders/0l/g2p6cnyd5kx_1wkl83nd3y4r0000gn/T/mutation6313332720566971713dat.  This may be caused by replaying a mutation against a table with the same name but incompatible schema.  Exception follows: org.apache.cassandra.serializers.MarshalException: Expected 4 byte long for date (0)
>>>>>>>> 
>>>>>>>> I mean... come on.  It's an easy fix.  It merges cleanly against 3.5
>>>>>>>> (and probably the other releases) and requires very little investment
>>>>>>>> from anyone.
>>>>>>>> 
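The `MarshalException` in the log says the replayer was handed a zero-length value (the `(0)`) for a `date`, which it only accepts as exactly 4 bytes; given the repro, that value presumably comes from the `delete d` in step 5. The Python below is a hypothetical model of that failure mode only. It is not Cassandra's code, the function names are invented, it assumes the CQL `date` encoding (an unsigned 32-bit big-endian day count centered at 2^31), and it does not claim to show how CASSANDRA-11618 was actually fixed.

```python
import struct
from typing import Optional


class MarshalException(Exception):
    """Stand-in for org.apache.cassandra.serializers.MarshalException (illustrative only)."""


def strict_deserialize_date(buf: bytes) -> int:
    """Decode a date as a 4-byte big-endian unsigned int, rejecting any other length."""
    if len(buf) != 4:
        # Same shape of message as the log: expected size, then the actual length.
        raise MarshalException(f"Expected 4 byte long for date ({len(buf)})")
    return struct.unpack(">I", buf)[0]


def tolerant_deserialize_date(buf: bytes) -> Optional[int]:
    """Hypothetical lenient variant: treat an empty buffer as 'no value' (e.g. a deleted cell)."""
    if not buf:
        return None
    return strict_deserialize_date(buf)
```

With the strict check, a mutation whose `date` cell carries no bytes cannot be deserialized at all, so commit log replay aborts and startup fails; a reader that accepts an empty value for a deletion would not hit that path.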
>>>>>>>> 
>>>>>>>> On Wed, Sep 14, 2016 at 9:40 PM Jeff Jirsa <jeff.ji...@crowdstrike.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> We did 3.1.1 and 3.2.1, so there’s SOME precedent for emergency fixes,
>>>>>>>>> but we certainly didn’t/won’t go back and cut new releases from every
>>>>>>>>> branch for every critical bug in future releases, so I think we need
>>>>>>>>> to draw the line somewhere. If it’s fixed in 3.7 and 3.0.x (x >= 6),
>>>>>>>>> it seems like you’ve got options (either stay on the tick and go up to
>>>>>>>>> 3.7, or bail down to 3.0.x).
>>>>>>>>> 
>>>>>>>>> Perhaps, though, this highlights the fact that tick/tock may not be
>>>>>>>>> the best option long term. We’ve tried it for a year; perhaps we
>>>>>>>>> should instead discuss whether or not it should continue, or if
>>>>>>>>> there’s another process that gives us a better way to get useful
>>>>>>>>> patches into versions people are willing to run in production.
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
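Going only by what is stated in this thread (CASSANDRA-11618 is fixed in 3.6, 3.7 and 3.0.6 onward, while 3.X with X < 6 is affected), an operator-side check could look like the sketch below. `has_cassandra_11618_fix` is a hypothetical helper, not part of any Cassandra tooling; it treats point releases such as 3.2.1 as part of their 3.x line, so a 3.5.1 carrying the backport, if one is cut, would need a special case.

```python
def has_cassandra_11618_fix(version: str) -> bool:
    """Return True if a 3.x version is at or past the releases this thread names
    as containing the CASSANDRA-11618 fix (3.0.6+ on the 3.0.x line, 3.6+ otherwise)."""
    parts = [int(p) for p in version.split(".")]
    if len(parts) < 2 or parts[0] != 3:
        raise ValueError(f"only 3.x versions are modelled here: {version!r}")
    minor = parts[1]
    if minor == 0:
        # 3.0.x is its own patch line; the thread says 3.0.6/7/8 carry the fix.
        patch = parts[2] if len(parts) > 2 else 0
        return patch >= 6
    # Tick/tock line: 3.X with X < 6 is reported as affected; 3.6 and 3.7 are fixed.
    return minor >= 6
```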
>>>>>>>>> On 9/14/16, 8:55 PM, "Jonathan Haddad" <j...@jonhaddad.com> wrote:
>>>>>>>>> 
>>>>>>>>>> Common sense is what prevents someone from upgrading to yet another
>>>>>>>>>> completely unknown version with new features which have probably
>>>>>>>>>> broken even more stuff that nobody is aware of.  The folks I'm helping
>>>>>>>>>> right now deployed 3.5 when they got started because
>>>>>>>>>> http://cassandra.apache.org suggests it's acceptable for production.
>>>>>>>>>> It turns out that using any of 4 of the built-in datatypes of the
>>>>>>>>>> database results in the server being unable to restart without
>>>>>>>>>> clearing out the commit logs and running a repair.  That screams
>>>>>>>>>> critical to me.  You shouldn't even be able to install 3.5 without the
>>>>>>>>>> patch I've supplied - that bug is a ticking time bomb for anyone who
>>>>>>>>>> installs it.
>>>>>>>>>> 
>>>>>>>>>> On Wed, Sep 14, 2016 at 8:12 PM Michael Shuler <mich...@pbandjelly.org>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> What's preventing the use of the 3.6 or 3.7 releases, where this bug
>>>>>>>>>>> is already fixed? This is also fixed in the 3.0.6/7/8 releases.
>>>>>>>>>>> 
>>>>>>>>>>> Michael
>>>>>>>>>>> 
>>>>>>>>>>> On 09/14/2016 08:30 PM, Jonathan Haddad wrote:
>>>>>>>>>>>> Unfortunately CASSANDRA-11618 was fixed in 3.6 but was not
>>>>>>>>>>>> backported to 3.5 as well, and it makes Cassandra effectively
>>>>>>>>>>>> unusable if someone is using any of the 4 affected types in any of
>>>>>>>>>>>> their schemas.
>>>>>>>>>>>> 
>>>>>>>>>>>> I have cherry-picked & merged the patch back to here and will put
>>>>>>>>>>>> it in a JIRA as well tonight; I just wanted to get the ball
>>>>>>>>>>>> rolling ASAP on this.
>>>>>>>>>>>> 
>>>>>>>>>>>> https://github.com/rustyrazorblade/cassandra/tree/fix_commitlog_exception
>>>>>>>>>>>> 
>>>>>>>>>>>> Jon
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> http://twitter.com/tjake
>> 
