Sounds like we have consensus on that. I will handle the reverts and update Jira.
On Wed, Jul 20, 2016 at 7:11 PM, Aleksey Yeschenko <alek...@apache.org> wrote: > Keep #11349, revert the rest sounds reasonable. > > -- > AY > > On 20 July 2016 at 22:27:05, sankalp kohli (kohlisank...@gmail.com) wrote: > > +1 on only allowing critical bug fixes. > I agree with Sylvain that CASSANDRA-11349 is a border line critical bug. I > would vote for CASSANDRA-11349 as being critical since over streaming is a > big issue for us as well. I am also fine taking it as an internal patch > since we already maintain an internal branch with bug fixes. > > On Wed, Jul 20, 2016 at 2:10 PM, Fabien Rousseau <fabifab...@gmail.com> > wrote: > > > Hi, > > > > I'm the reporter for CASSANDRA-11349. > > Just a bit of history: > > > > This was a big pain point for us because the over-streaming was really > > important. > > Each week, repair increased data size by a few percents. > > To give an example, one cluster with 12 nodes had a CF with around 250GB > of > > data per node and after compaction, reduced to 80GB. > > Before fully understanding the problem, we just doubled the cluster size > > just because of this issue (from 6 nodes to 12 nodes). > > We spotted the problem in November 2015, in one cluster, without having > > hints of what was happening at first (we just knew that there was a lot > of > > streaming during repairs, ie a few MB/s of data, and that there was > quite a > > high ratio of mismatch on read repairs) > > After putting in production more clusters (with completely different data > > models), we continued to observe this pattern (without finding any > > correlation). > > It took some time to identify the issue and fix it. > > Note that, once we reported the issue, we stopped repairing the affected > > CFs. (I know it's a bad practice, but this was a temporary solution on > > bare-metal servers which are quite stable). > > > > I'd say that I'm surprised that this was not reported by more people (our > > data model has nothing particular). > > (Maybe other people are affected but have not paid attention to this > > phenomenon) > > > > Please note that this patch does not affect how SSTables are serialized, > > but only how digest are computed thus reduces risks. > > > > So, to conclude, it was critical for us, and it may help other people in > > the wild. > > Nevertheless, we can live with this patch not being included in 2.1.X (by > > maintaining our own C* package until we migrate to 3.0.X) > > > > 2016-07-20 18:06 GMT+02:00 Sylvain Lebresne <sylv...@datastax.com>: > > > > > Definitively agrees that CASSANDRA-10433 and CASSANDRA-12030 aren't > > > critical. In fact, they are both marked as "improvements" and "minor". > > I'm > > > to blame for their commit, so mea culpa. But to my defense, I've long > > > advocated for being stricter on sticking to critical-only fixes on old > > > releases and have long felt that this sentiment wasn't really shared in > > > practice by other devs/committers (could be my fault getting the wrong > > > impression, but I've lost track of the number of time were other > > > devs/committers argue for "just this time because it's really simple" > > when > > > those questions cam up), so I've partly gave up on arguing. I'm happy > to > > > get more consistent on this and +1 reverting those. > > > > > > I think CASSANDRA-11349 is the one possibly more debatable. Fabien on > the > > > ticket seemed to suggest that on his clusters the bug was make repair > > > unmanageable ("a few days after filing the bug, we decided to > temporarily > > > stop repairing some tables [...] which were heavily impacted by those > > > bugs"). He mention in particular having seen "around a hundred" > > mismatches > > > with the patch against "a few hundred of thousands" without. I suppose > > one > > > could make a case that having repair over-stream by 3 order of > magnitude > > in > > > some case is critical-ish. Note that I wouldn't oppose reverting this > > too, > > > as it doesn't fully meet *my* definition of critical, but I'm willing > to > > > accept my definition is on the stronger side of things and leave it in. > > > > > > -- > > > Sylvain > > > > > > On Wed, Jul 20, 2016 at 5:29 PM, Aleksey Yeschenko <alek...@apache.org > > > > > wrote: > > > > > > > +1 from me (and I don’t see any resistance to it either). > > > > > > > > -- > > > > AY > > > > > > > > On 18 July 2016 at 18:36:42, Jonathan Ellis (jbel...@gmail.com) > wrote: > > > > > > > > Except there really wasn't. > > > > > > > > Patch submitter: "I want this in 2.1." > > > > Reviewer: "Okay." > > > > > > > > That's not exactly the bar we're looking for. To consider a > performance > > > > fix "critical" for example, you really need to show at the very least > > > what > > > > new workload you found that isn't able to live with it the way > everyone > > > > else did for the previous 15 releases. > > > > > > > > I note that on 10433 the committer even said, "I'm not [sure] I agree > > > this > > > > is critical for 2.1 at this point, but as it's simple enough and has > > been > > > > somewhat vetted on 2.2 by now, not going to argue." > > > > > > > > So consider this me putting on my bad cop hat and opening up the > > > argument. > > > > > > > > On Mon, Jul 18, 2016 at 12:24 PM, Jeremiah D Jordan < > > > jerem...@datastax.com > > > > > > > > > wrote: > > > > > > > > > Looking at those tickets in all three of them the “is this critical > > to > > > > > fix” question came up in the JIRA discussion and it was decided > that > > > they > > > > > were indeed critical enough to commit to 2.1. > > > > > > > > > > > On Jul 18, 2016, at 11:47 AM, Jonathan Ellis <jbel...@gmail.com> > > > > wrote: > > > > > > > > > > > > We're at the stage of the release cycle where we should be > > committing > > > > > > critical fixes only to the 2.1 branch. Many people depend on 2.1 > > > > working > > > > > > reliably and it's not worth the risk of introducing regressions > for > > > > > (e.g.) > > > > > > performance improvements. > > > > > > > > > > > > I think some of the patches committed so far for 2.1.16 do not > meet > > > > this > > > > > > bar and should be reverted. I include a summary of what people > have > > > to > > > > > > live with if we leave them unfixed: > > > > > > > > > > > > https://issues.apache.org/jira/browse/CASSANDRA-11349 > > > > > > Repair suffers false-negative tree mismatches and overstreams > data. > > > > > > > > > > > > https://issues.apache.org/jira/browse/CASSANDRA-10433 > > > > > > Reduced performance on inserts (and reads?) (for Thrift clients > > > only?) > > > > > > > > > > > > https://issues.apache.org/jira/browse/CASSANDRA-12030 > > > > > > Reduced performance on reads for workloads with range tombstones > > > > > > > > > > > > Anyone want to make a case that these are more critical than they > > > > appear > > > > > > and should not be reverted? > > > > > > > > > > > > -- > > > > > > Jonathan Ellis > > > > > > Project Chair, Apache Cassandra > > > > > > co-founder, http://www.datastax.com > > > > > > @spyced > > > > > > > > > > > > > > > > > > > > > > -- > > > > Jonathan Ellis > > > > Project Chair, Apache Cassandra > > > > co-founder, http://www.datastax.com > > > > @spyced > > > > > > > > > > -- Jonathan Ellis Project Chair, Apache Cassandra co-founder, http://www.datastax.com @spyced