+1 on only allowing critical bug fixes.
I agree with Sylvain that CASSANDRA-11349 is a borderline critical bug. I
would vote for CASSANDRA-11349 as being critical, since over-streaming is a
big issue for us as well. I am also fine with taking it as an internal patch,
since we already maintain an internal branch with bug fixes.
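For anyone following along, the over-streaming mechanism under discussion can be sketched roughly as follows. This is a hypothetical, much-simplified model, not Cassandra's actual Merkle-tree implementation: each replica computes a digest per token range, and any digest mismatch causes the whole range to be streamed, so digests that are sensitive to representation rather than content (the essence of CASSANDRA-11349) produce false mismatches and over-stream. All names here are illustrative.

```python
import hashlib

def range_digest(rows):
    """Digest of a token range: an order-sensitive hash of its rows.

    Because the hash depends on row ordering/representation and not just
    logical content, two replicas holding identical data can disagree.
    """
    h = hashlib.md5()
    for key, value in rows:
        h.update(f"{key}:{value}".encode())
    return h.hexdigest()

# Two replicas with logically identical data, but replica B holds the
# rows of range 1 in a different (still equivalent) form.
node_a = {0: [("k1", "v1")], 1: [("k2", "v2"), ("k3", "v3")]}
node_b = {0: [("k1", "v1")], 1: [("k3", "v3"), ("k2", "v2")]}

# Any range whose digests differ is streamed in its entirety.
mismatched = [r for r in node_a
              if range_digest(node_a[r]) != range_digest(node_b[r])]
print(mismatched)  # range 1 is streamed in full despite identical data
```

A content-based digest (e.g. hashing a canonicalized form of each range) would make the two replicas agree here, which is the spirit of the fix.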

On Wed, Jul 20, 2016 at 2:10 PM, Fabien Rousseau <fabifab...@gmail.com>
wrote:

> Hi,
>
> I'm the reporter for CASSANDRA-11349.
> Just a bit of history:
>
> This was a big pain point for us because the over-streaming was really
> significant.
> Each week, repair increased data size by a few percent.
> To give an example, one cluster with 12 nodes had a CF with around 250GB
> of data per node which, after compaction, was reduced to 80GB.
> Before fully understanding the problem, we doubled the cluster size (from
> 6 nodes to 12 nodes) just because of this issue.
> We spotted the problem in November 2015, in one cluster, without having
> any hint of what was happening at first (we just knew that there was a lot
> of streaming during repairs, i.e. a few MB/s of data, and that there was
> quite a high ratio of mismatches on read repairs).
> After putting more clusters (with completely different data models) in
> production, we continued to observe this pattern (without finding any
> correlation).
> It took some time to identify the issue and fix it.
> Note that, once we reported the issue, we stopped repairing the affected
> CFs. (I know it's bad practice, but this was a temporary solution on
> bare-metal servers, which are quite stable.)
>
> I'd say I'm surprised that this has not been reported by more people
> (there is nothing unusual about our data model).
> (Maybe other people are affected but haven't paid attention to this
> phenomenon.)
>
> Please note that this patch does not affect how SSTables are serialized,
> but only how digests are computed, which reduces the risk.
>
> So, to conclude, it was critical for us, and it may help other people in
> the wild.
> Nevertheless, we can live with this patch not being included in 2.1.X (by
> maintaining our own C* package until we migrate to 3.0.X).
>
> 2016-07-20 18:06 GMT+02:00 Sylvain Lebresne <sylv...@datastax.com>:
>
> > Definitely agreed that CASSANDRA-10433 and CASSANDRA-12030 aren't
> > critical. In fact, they are both marked as "improvements" and "minor".
> > I'm to blame for their commit, so mea culpa. But in my defense, I've long
> > advocated for being stricter about sticking to critical-only fixes on old
> > releases and have long felt that this sentiment wasn't really shared in
> > practice by other devs/committers (it could be my fault for getting the
> > wrong impression, but I've lost track of the number of times when other
> > devs/committers argued for "just this time because it's really simple"
> > when those questions came up), so I had partly given up on arguing. I'm
> > happy to get more consistent on this and +1 reverting those.
> >
> > I think CASSANDRA-11349 is the one that's possibly more debatable. Fabien
> > on the ticket seemed to suggest that on his clusters the bug was making
> > repair unmanageable ("a few days after filing the bug, we decided to
> > temporarily stop repairing some tables [...] which were heavily impacted
> > by those bugs"). He mentions in particular having seen "around a hundred"
> > mismatches with the patch against "a few hundred of thousands" without. I
> > suppose one could make a case that having repair over-stream by 3 orders
> > of magnitude in some cases is critical-ish. Note that I wouldn't oppose
> > reverting this too, as it doesn't fully meet *my* definition of critical,
> > but I'm willing to accept my definition is on the stronger side of things
> > and leave it in.
> >
> > --
> > Sylvain
> >
> > On Wed, Jul 20, 2016 at 5:29 PM, Aleksey Yeschenko <alek...@apache.org>
> > wrote:
> >
> > > +1 from me (and I don’t see any resistance to it either).
> > >
> > > --
> > > AY
> > >
> > > On 18 July 2016 at 18:36:42, Jonathan Ellis (jbel...@gmail.com) wrote:
> > >
> > > Except there really wasn't.
> > >
> > > Patch submitter: "I want this in 2.1."
> > > Reviewer: "Okay."
> > >
> > > That's not exactly the bar we're looking for. To consider a performance
> > > fix "critical", for example, you really need to show at the very least
> > > what new workload you found that isn't able to live with it the way
> > > everyone else did for the previous 15 releases.
> > >
> > > I note that on 10433 the committer even said, "I'm not [sure] I agree
> > > this is critical for 2.1 at this point, but as it's simple enough and
> > > has been somewhat vetted on 2.2 by now, not going to argue."
> > >
> > > So consider this me putting on my bad cop hat and opening up the
> > > argument.
> > >
> > > On Mon, Jul 18, 2016 at 12:24 PM, Jeremiah D Jordan
> > > <jerem...@datastax.com> wrote:
> > >
> > > > Looking at those tickets: in all three of them the “is this critical
> > > > to fix” question came up in the JIRA discussion, and it was decided
> > > > that they were indeed critical enough to commit to 2.1.
> > > >
> > > > > On Jul 18, 2016, at 11:47 AM, Jonathan Ellis <jbel...@gmail.com>
> > > wrote:
> > > > >
> > > > > We're at the stage of the release cycle where we should be
> > > > > committing critical fixes only to the 2.1 branch. Many people depend
> > > > > on 2.1 working reliably and it's not worth the risk of introducing
> > > > > regressions for (e.g.) performance improvements.
> > > > >
> > > > > I think some of the patches committed so far for 2.1.16 do not meet
> > > > > this bar and should be reverted. I include a summary of what people
> > > > > have to live with if we leave them unfixed:
> > > > >
> > > > > https://issues.apache.org/jira/browse/CASSANDRA-11349
> > > > > Repair suffers false-negative tree mismatches and overstreams data.
> > > > >
> > > > > https://issues.apache.org/jira/browse/CASSANDRA-10433
> > > > > Reduced performance on inserts (and reads?) (for Thrift clients
> > > > > only?)
> > > > >
> > > > > https://issues.apache.org/jira/browse/CASSANDRA-12030
> > > > > Reduced performance on reads for workloads with range tombstones
> > > > >
> > > > > Anyone want to make a case that these are more critical than they
> > > > > appear and should not be reverted?
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > Project Chair, Apache Cassandra
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
> > > >
> > > >
> > >
> > >
> > > --
> > > Jonathan Ellis
> > > Project Chair, Apache Cassandra
> > > co-founder, http://www.datastax.com
> > > @spyced
> > >
> >
>
