Keep #11349, revert the rest sounds reasonable. -- AY
On 20 July 2016 at 22:27:05, sankalp kohli (kohlisank...@gmail.com) wrote: +1 on only allowing critical bug fixes. I agree with Sylvain that CASSANDRA-11349 is a border line critical bug. I would vote for CASSANDRA-11349 as being critical since over streaming is a big issue for us as well. I am also fine taking it as an internal patch since we already maintain an internal branch with bug fixes. On Wed, Jul 20, 2016 at 2:10 PM, Fabien Rousseau <fabifab...@gmail.com> wrote: > Hi, > > I'm the reporter for CASSANDRA-11349. > Just a bit of history: > > This was a big pain point for us because the over-streaming was really > important. > Each week, repair increased data size by a few percents. > To give an example, one cluster with 12 nodes had a CF with around 250GB of > data per node and after compaction, reduced to 80GB. > Before fully understanding the problem, we just doubled the cluster size > just because of this issue (from 6 nodes to 12 nodes). > We spotted the problem in November 2015, in one cluster, without having > hints of what was happening at first (we just knew that there was a lot of > streaming during repairs, ie a few MB/s of data, and that there was quite a > high ratio of mismatch on read repairs) > After putting in production more clusters (with completely different data > models), we continued to observe this pattern (without finding any > correlation). > It took some time to identify the issue and fix it. > Note that, once we reported the issue, we stopped repairing the affected > CFs. (I know it's a bad practice, but this was a temporary solution on > bare-metal servers which are quite stable). > > I'd say that I'm surprised that this was not reported by more people (our > data model has nothing particular). > (Maybe other people are affected but have not paid attention to this > phenomenon) > > Please note that this patch does not affect how SSTables are serialized, > but only how digest are computed thus reduces risks. > > So, to conclude, it was critical for us, and it may help other people in > the wild. > Nevertheless, we can live with this patch not being included in 2.1.X (by > maintaining our own C* package until we migrate to 3.0.X) > > 2016-07-20 18:06 GMT+02:00 Sylvain Lebresne <sylv...@datastax.com>: > > > Definitively agrees that CASSANDRA-10433 and CASSANDRA-12030 aren't > > critical. In fact, they are both marked as "improvements" and "minor". > I'm > > to blame for their commit, so mea culpa. But to my defense, I've long > > advocated for being stricter on sticking to critical-only fixes on old > > releases and have long felt that this sentiment wasn't really shared in > > practice by other devs/committers (could be my fault getting the wrong > > impression, but I've lost track of the number of time were other > > devs/committers argue for "just this time because it's really simple" > when > > those questions cam up), so I've partly gave up on arguing. I'm happy to > > get more consistent on this and +1 reverting those. > > > > I think CASSANDRA-11349 is the one possibly more debatable. Fabien on the > > ticket seemed to suggest that on his clusters the bug was make repair > > unmanageable ("a few days after filing the bug, we decided to temporarily > > stop repairing some tables [...] which were heavily impacted by those > > bugs"). He mention in particular having seen "around a hundred" > mismatches > > with the patch against "a few hundred of thousands" without. I suppose > one > > could make a case that having repair over-stream by 3 order of magnitude > in > > some case is critical-ish. Note that I wouldn't oppose reverting this > too, > > as it doesn't fully meet *my* definition of critical, but I'm willing to > > accept my definition is on the stronger side of things and leave it in. > > > > -- > > Sylvain > > > > On Wed, Jul 20, 2016 at 5:29 PM, Aleksey Yeschenko <alek...@apache.org> > > wrote: > > > > > +1 from me (and I don’t see any resistance to it either). > > > > > > -- > > > AY > > > > > > On 18 July 2016 at 18:36:42, Jonathan Ellis (jbel...@gmail.com) wrote: > > > > > > Except there really wasn't. > > > > > > Patch submitter: "I want this in 2.1." > > > Reviewer: "Okay." > > > > > > That's not exactly the bar we're looking for. To consider a performance > > > fix "critical" for example, you really need to show at the very least > > what > > > new workload you found that isn't able to live with it the way everyone > > > else did for the previous 15 releases. > > > > > > I note that on 10433 the committer even said, "I'm not [sure] I agree > > this > > > is critical for 2.1 at this point, but as it's simple enough and has > been > > > somewhat vetted on 2.2 by now, not going to argue." > > > > > > So consider this me putting on my bad cop hat and opening up the > > argument. > > > > > > On Mon, Jul 18, 2016 at 12:24 PM, Jeremiah D Jordan < > > jerem...@datastax.com > > > > > > > wrote: > > > > > > > Looking at those tickets in all three of them the “is this critical > to > > > > fix” question came up in the JIRA discussion and it was decided that > > they > > > > were indeed critical enough to commit to 2.1. > > > > > > > > > On Jul 18, 2016, at 11:47 AM, Jonathan Ellis <jbel...@gmail.com> > > > wrote: > > > > > > > > > > We're at the stage of the release cycle where we should be > committing > > > > > critical fixes only to the 2.1 branch. Many people depend on 2.1 > > > working > > > > > reliably and it's not worth the risk of introducing regressions for > > > > (e.g.) > > > > > performance improvements. > > > > > > > > > > I think some of the patches committed so far for 2.1.16 do not meet > > > this > > > > > bar and should be reverted. I include a summary of what people have > > to > > > > > live with if we leave them unfixed: > > > > > > > > > > https://issues.apache.org/jira/browse/CASSANDRA-11349 > > > > > Repair suffers false-negative tree mismatches and overstreams data. > > > > > > > > > > https://issues.apache.org/jira/browse/CASSANDRA-10433 > > > > > Reduced performance on inserts (and reads?) (for Thrift clients > > only?) > > > > > > > > > > https://issues.apache.org/jira/browse/CASSANDRA-12030 > > > > > Reduced performance on reads for workloads with range tombstones > > > > > > > > > > Anyone want to make a case that these are more critical than they > > > appear > > > > > and should not be reverted? > > > > > > > > > > -- > > > > > Jonathan Ellis > > > > > Project Chair, Apache Cassandra > > > > > co-founder, http://www.datastax.com > > > > > @spyced > > > > > > > > > > > > > > > > > -- > > > Jonathan Ellis > > > Project Chair, Apache Cassandra > > > co-founder, http://www.datastax.com > > > @spyced > > > > > >