Hi, I'm the reporter for CASSANDRA-11349. Just a bit of history:
This was a big pain point for us because the over-streaming was significant. Each week, repair increased data size by a few percent. To give an example, one cluster with 12 nodes had a CF with around 250GB of data per node which, after compaction, reduced to 80GB. Before fully understanding the problem, we doubled the cluster size (from 6 nodes to 12 nodes) just because of this issue.

We spotted the problem in November 2015, in one cluster, without any hint of what was happening at first (we just knew that there was a lot of streaming during repairs, i.e. a few MB/s of data, and that there was quite a high ratio of mismatches on read repairs). After putting more clusters into production (with completely different data models), we continued to observe this pattern (without finding any correlation). It took some time to identify the issue and fix it.

Note that, once we reported the issue, we stopped repairing the affected CFs. (I know it's bad practice, but it was a temporary solution on bare-metal servers, which are quite stable.) I'd say I'm surprised that this was not reported by more people, since our data model has nothing particular about it. (Maybe other people are affected but have not paid attention to this phenomenon.)

Please note that this patch does not affect how SSTables are serialized, but only how digests are computed, which limits the risk.

So, to conclude, it was critical for us, and it may help other people in the wild. Nevertheless, we can live with this patch not being included in 2.1.x (by maintaining our own C* package until we migrate to 3.0.x).

2016-07-20 18:06 GMT+02:00 Sylvain Lebresne <sylv...@datastax.com>:

> Definitely agree that CASSANDRA-10433 and CASSANDRA-12030 aren't critical. In fact, they are both marked as "improvements" and "minor". I'm to blame for their commit, so mea culpa. But in my defense, I've long advocated for being stricter about sticking to critical-only fixes on old releases and have long felt that this sentiment wasn't really shared in practice by other devs/committers (it could be my fault for getting the wrong impression, but I've lost track of the number of times other devs/committers argued for "just this time because it's really simple" when those questions came up), so I've partly given up on arguing. I'm happy to get more consistent on this and +1 reverting those.
>
> I think CASSANDRA-11349 is the one that is possibly more debatable. Fabien on the ticket seemed to suggest that on his clusters the bug was making repair unmanageable ("a few days after filing the bug, we decided to temporarily stop repairing some tables [...] which were heavily impacted by those bugs"). He mentions in particular having seen "around a hundred" mismatches with the patch against "a few hundred of thousands" without. I suppose one could make a case that having repair over-stream by three orders of magnitude in some cases is critical-ish. Note that I wouldn't oppose reverting this too, as it doesn't fully meet *my* definition of critical, but I'm willing to accept that my definition is on the stronger side of things and leave it in.
>
> --
> Sylvain
>
> On Wed, Jul 20, 2016 at 5:29 PM, Aleksey Yeschenko <alek...@apache.org> wrote:
>
> > +1 from me (and I don’t see any resistance to it either).
> >
> > --
> > AY
> >
> > On 18 July 2016 at 18:36:42, Jonathan Ellis (jbel...@gmail.com) wrote:
> >
> > Except there really wasn't.
> >
> > Patch submitter: "I want this in 2.1."
> > Reviewer: "Okay."
> >
> > That's not exactly the bar we're looking for.
> > To consider a performance fix "critical", for example, you really need to show at the very least what new workload you found that isn't able to live with it the way everyone else did for the previous 15 releases.
> >
> > I note that on 10433 the committer even said, "I'm not [sure] I agree this is critical for 2.1 at this point, but as it's simple enough and has been somewhat vetted on 2.2 by now, not going to argue."
> >
> > So consider this me putting on my bad cop hat and opening up the argument.
> >
> > On Mon, Jul 18, 2016 at 12:24 PM, Jeremiah D Jordan <jerem...@datastax.com> wrote:
> >
> > > Looking at those tickets, in all three of them the “is this critical to fix” question came up in the JIRA discussion and it was decided that they were indeed critical enough to commit to 2.1.
> > >
> > > > On Jul 18, 2016, at 11:47 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> > > >
> > > > We're at the stage of the release cycle where we should be committing critical fixes only to the 2.1 branch. Many people depend on 2.1 working reliably and it's not worth the risk of introducing regressions for (e.g.) performance improvements.
> > > >
> > > > I think some of the patches committed so far for 2.1.16 do not meet this bar and should be reverted. I include a summary of what people have to live with if we leave them unfixed:
> > > >
> > > > https://issues.apache.org/jira/browse/CASSANDRA-11349
> > > > Repair suffers false-negative tree mismatches and overstreams data.
> > > >
> > > > https://issues.apache.org/jira/browse/CASSANDRA-10433
> > > > Reduced performance on inserts (and reads?) (for Thrift clients only?)
> > > >
> > > > https://issues.apache.org/jira/browse/CASSANDRA-12030
> > > > Reduced performance on reads for workloads with range tombstones
> > > >
> > > > Anyone want to make a case that these are more critical than they appear and should not be reverted?
> > > >
> > > > --
> > > > Jonathan Ellis
> > > > Project Chair, Apache Cassandra
> > > > co-founder, http://www.datastax.com
> > > > @spyced
> >
> > --
> > Jonathan Ellis
> > Project Chair, Apache Cassandra
> > co-founder, http://www.datastax.com
> > @spyced