Re: Reminder: critical fixes only in 2.1

Jonathan Ellis Thu, 21 Jul 2016 07:38:48 -0700

Sounds like we have consensus on that.  I will handle the reverts and
update Jira.


On Wed, Jul 20, 2016 at 7:11 PM, Aleksey Yeschenko <alek...@apache.org>
wrote:

> Keep #11349, revert the rest sounds reasonable.
>
> --
> AY
>
> On 20 July 2016 at 22:27:05, sankalp kohli (kohlisank...@gmail.com) wrote:
>
> +1 on only allowing critical bug fixes.
> I agree with Sylvain that CASSANDRA-11349 is a border line critical bug. I
> would vote for CASSANDRA-11349 as being critical since over streaming is a
> big issue for us as well. I am also fine taking it as an internal patch
> since we already maintain an internal branch with bug fixes.
>
> On Wed, Jul 20, 2016 at 2:10 PM, Fabien Rousseau <fabifab...@gmail.com>
> wrote:
>
> > Hi,
> >
> > I'm the reporter for CASSANDRA-11349.
> > Just a bit of history:
> >
> > This was a big pain point for us because the over-streaming was really
> > important.
> > Each week, repair increased data size by a few percents.
> > To give an example, one cluster with 12 nodes had a CF with around 250GB
> of
> > data per node and after compaction, reduced to 80GB.
> > Before fully understanding the problem, we just doubled the cluster size
> > just because of this issue (from 6 nodes to 12 nodes).
> > We spotted the problem in November 2015, in one cluster, without having
> > hints of what was happening at first (we just knew that there was a lot
> of
> > streaming during repairs, ie a few MB/s of data, and that there was
> quite a
> > high ratio of mismatch on read repairs)
> > After putting in production more clusters (with completely different data
> > models), we continued to observe this pattern (without finding any
> > correlation).
> > It took some time to identify the issue and fix it.
> > Note that, once we reported the issue, we stopped repairing the affected
> > CFs. (I know it's a bad practice, but this was a temporary solution on
> > bare-metal servers which are quite stable).
> >
> > I'd say that I'm surprised that this was not reported by more people (our
> > data model has nothing particular).
> > (Maybe other people are affected but have not paid attention to this
> > phenomenon)
> >
> > Please note that this patch does not affect how SSTables are serialized,
> > but only how digest are computed thus reduces risks.
> >
> > So, to conclude, it was critical for us, and it may help other people in
> > the wild.
> > Nevertheless, we can live with this patch not being included in 2.1.X (by
> > maintaining our own C* package until we migrate to 3.0.X)
> >
> > 2016-07-20 18:06 GMT+02:00 Sylvain Lebresne <sylv...@datastax.com>:
> >
> > > Definitively agrees that CASSANDRA-10433 and CASSANDRA-12030 aren't
> > > critical. In fact, they are both marked as "improvements" and "minor".
> > I'm
> > > to blame for their commit, so mea culpa. But to my defense, I've long
> > > advocated for being stricter on sticking to critical-only fixes on old
> > > releases and have long felt that this sentiment wasn't really shared in
> > > practice by other devs/committers (could be my fault getting the wrong
> > > impression, but I've lost track of the number of time were other
> > > devs/committers argue for "just this time because it's really simple"
> > when
> > > those questions cam up), so I've partly gave up on arguing. I'm happy
> to
> > > get more consistent on this and +1 reverting those.
> > >
> > > I think CASSANDRA-11349 is the one possibly more debatable. Fabien on
> the
> > > ticket seemed to suggest that on his clusters the bug was make repair
> > > unmanageable ("a few days after filing the bug, we decided to
> temporarily
> > > stop repairing some tables [...] which were heavily impacted by those
> > > bugs"). He mention in particular having seen "around a hundred"
> > mismatches
> > > with the patch against "a few hundred of thousands" without. I suppose
> > one
> > > could make a case that having repair over-stream by 3 order of
> magnitude
> > in
> > > some case is critical-ish. Note that I wouldn't oppose reverting this
> > too,
> > > as it doesn't fully meet *my* definition of critical, but I'm willing
> to
> > > accept my definition is on the stronger side of things and leave it in.
> > >
> > > --
> > > Sylvain
> > >
> > > On Wed, Jul 20, 2016 at 5:29 PM, Aleksey Yeschenko <alek...@apache.org
> >
> > > wrote:
> > >
> > > > +1 from me (and I don’t see any resistance to it either).
> > > >
> > > > --
> > > > AY
> > > >
> > > > On 18 July 2016 at 18:36:42, Jonathan Ellis (jbel...@gmail.com)
> wrote:
> > > >
> > > > Except there really wasn't.
> > > >
> > > > Patch submitter: "I want this in 2.1."
> > > > Reviewer: "Okay."
> > > >
> > > > That's not exactly the bar we're looking for. To consider a
> performance
> > > > fix "critical" for example, you really need to show at the very least
> > > what
> > > > new workload you found that isn't able to live with it the way
> everyone
> > > > else did for the previous 15 releases.
> > > >
> > > > I note that on 10433 the committer even said, "I'm not [sure] I agree
> > > this
> > > > is critical for 2.1 at this point, but as it's simple enough and has
> > been
> > > > somewhat vetted on 2.2 by now, not going to argue."
> > > >
> > > > So consider this me putting on my bad cop hat and opening up the
> > > argument.
> > > >
> > > > On Mon, Jul 18, 2016 at 12:24 PM, Jeremiah D Jordan <
> > > jerem...@datastax.com
> > > > >
> > > > wrote:
> > > >
> > > > > Looking at those tickets in all three of them the “is this critical
> > to
> > > > > fix” question came up in the JIRA discussion and it was decided
> that
> > > they
> > > > > were indeed critical enough to commit to 2.1.
> > > > >
> > > > > > On Jul 18, 2016, at 11:47 AM, Jonathan Ellis <jbel...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > We're at the stage of the release cycle where we should be
> > committing
> > > > > > critical fixes only to the 2.1 branch. Many people depend on 2.1
> > > > working
> > > > > > reliably and it's not worth the risk of introducing regressions
> for
> > > > > (e.g.)
> > > > > > performance improvements.
> > > > > >
> > > > > > I think some of the patches committed so far for 2.1.16 do not
> meet
> > > > this
> > > > > > bar and should be reverted. I include a summary of what people
> have
> > > to
> > > > > > live with if we leave them unfixed:
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/CASSANDRA-11349
> > > > > > Repair suffers false-negative tree mismatches and overstreams
> data.
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/CASSANDRA-10433
> > > > > > Reduced performance on inserts (and reads?) (for Thrift clients
> > > only?)
> > > > > >
> > > > > > https://issues.apache.org/jira/browse/CASSANDRA-12030
> > > > > > Reduced performance on reads for workloads with range tombstones
> > > > > >
> > > > > > Anyone want to make a case that these are more critical than they
> > > > appear
> > > > > > and should not be reverted?
> > > > > >
> > > > > > --
> > > > > > Jonathan Ellis
> > > > > > Project Chair, Apache Cassandra
> > > > > > co-founder, http://www.datastax.com
> > > > > > @spyced
> > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > Project Chair, Apache Cassandra
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> > > >
> > >
> >
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder, http://www.datastax.com
@spyced

Re: Reminder: critical fixes only in 2.1

Reply via email to