Keep #11349, revert the rest sounds reasonable.

-- 
AY

On 20 July 2016 at 22:27:05, sankalp kohli (kohlisank...@gmail.com) wrote:

+1 on only allowing critical bug fixes.  
I agree with Sylvain that CASSANDRA-11349 is a border line critical bug. I  
would vote for CASSANDRA-11349 as being critical since over streaming is a  
big issue for us as well. I am also fine taking it as an internal patch  
since we already maintain an internal branch with bug fixes.  

On Wed, Jul 20, 2016 at 2:10 PM, Fabien Rousseau <fabifab...@gmail.com>  
wrote:  

> Hi,  
>  
> I'm the reporter for CASSANDRA-11349.  
> Just a bit of history:  
>  
> This was a big pain point for us because the over-streaming was really  
> important.  
> Each week, repair increased data size by a few percents.  
> To give an example, one cluster with 12 nodes had a CF with around 250GB of  
> data per node and after compaction, reduced to 80GB.  
> Before fully understanding the problem, we just doubled the cluster size  
> just because of this issue (from 6 nodes to 12 nodes).  
> We spotted the problem in November 2015, in one cluster, without having  
> hints of what was happening at first (we just knew that there was a lot of  
> streaming during repairs, ie a few MB/s of data, and that there was quite a  
> high ratio of mismatch on read repairs)  
> After putting in production more clusters (with completely different data  
> models), we continued to observe this pattern (without finding any  
> correlation).  
> It took some time to identify the issue and fix it.  
> Note that, once we reported the issue, we stopped repairing the affected  
> CFs. (I know it's a bad practice, but this was a temporary solution on  
> bare-metal servers which are quite stable).  
>  
> I'd say that I'm surprised that this was not reported by more people (our  
> data model has nothing particular).  
> (Maybe other people are affected but have not paid attention to this  
> phenomenon)  
>  
> Please note that this patch does not affect how SSTables are serialized,  
> but only how digest are computed thus reduces risks.  
>  
> So, to conclude, it was critical for us, and it may help other people in  
> the wild.  
> Nevertheless, we can live with this patch not being included in 2.1.X (by  
> maintaining our own C* package until we migrate to 3.0.X)  
>  
> 2016-07-20 18:06 GMT+02:00 Sylvain Lebresne <sylv...@datastax.com>:  
>  
> > Definitively agrees that CASSANDRA-10433 and CASSANDRA-12030 aren't  
> > critical. In fact, they are both marked as "improvements" and "minor".  
> I'm  
> > to blame for their commit, so mea culpa. But to my defense, I've long  
> > advocated for being stricter on sticking to critical-only fixes on old  
> > releases and have long felt that this sentiment wasn't really shared in  
> > practice by other devs/committers (could be my fault getting the wrong  
> > impression, but I've lost track of the number of time were other  
> > devs/committers argue for "just this time because it's really simple"  
> when  
> > those questions cam up), so I've partly gave up on arguing. I'm happy to  
> > get more consistent on this and +1 reverting those.  
> >  
> > I think CASSANDRA-11349 is the one possibly more debatable. Fabien on the  
> > ticket seemed to suggest that on his clusters the bug was make repair  
> > unmanageable ("a few days after filing the bug, we decided to temporarily  
> > stop repairing some tables [...] which were heavily impacted by those  
> > bugs"). He mention in particular having seen "around a hundred"  
> mismatches  
> > with the patch against "a few hundred of thousands" without. I suppose  
> one  
> > could make a case that having repair over-stream by 3 order of magnitude  
> in  
> > some case is critical-ish. Note that I wouldn't oppose reverting this  
> too,  
> > as it doesn't fully meet *my* definition of critical, but I'm willing to  
> > accept my definition is on the stronger side of things and leave it in.  
> >  
> > --  
> > Sylvain  
> >  
> > On Wed, Jul 20, 2016 at 5:29 PM, Aleksey Yeschenko <alek...@apache.org>  
> > wrote:  
> >  
> > > +1 from me (and I don’t see any resistance to it either).  
> > >  
> > > --  
> > > AY  
> > >  
> > > On 18 July 2016 at 18:36:42, Jonathan Ellis (jbel...@gmail.com) wrote:  
> > >  
> > > Except there really wasn't.  
> > >  
> > > Patch submitter: "I want this in 2.1."  
> > > Reviewer: "Okay."  
> > >  
> > > That's not exactly the bar we're looking for. To consider a performance  
> > > fix "critical" for example, you really need to show at the very least  
> > what  
> > > new workload you found that isn't able to live with it the way everyone  
> > > else did for the previous 15 releases.  
> > >  
> > > I note that on 10433 the committer even said, "I'm not [sure] I agree  
> > this  
> > > is critical for 2.1 at this point, but as it's simple enough and has  
> been  
> > > somewhat vetted on 2.2 by now, not going to argue."  
> > >  
> > > So consider this me putting on my bad cop hat and opening up the  
> > argument.  
> > >  
> > > On Mon, Jul 18, 2016 at 12:24 PM, Jeremiah D Jordan <  
> > jerem...@datastax.com  
> > > >  
> > > wrote:  
> > >  
> > > > Looking at those tickets in all three of them the “is this critical  
> to  
> > > > fix” question came up in the JIRA discussion and it was decided that  
> > they  
> > > > were indeed critical enough to commit to 2.1.  
> > > >  
> > > > > On Jul 18, 2016, at 11:47 AM, Jonathan Ellis <jbel...@gmail.com>  
> > > wrote:  
> > > > >  
> > > > > We're at the stage of the release cycle where we should be  
> committing  
> > > > > critical fixes only to the 2.1 branch. Many people depend on 2.1  
> > > working  
> > > > > reliably and it's not worth the risk of introducing regressions for  
> > > > (e.g.)  
> > > > > performance improvements.  
> > > > >  
> > > > > I think some of the patches committed so far for 2.1.16 do not meet  
> > > this  
> > > > > bar and should be reverted. I include a summary of what people have  
> > to  
> > > > > live with if we leave them unfixed:  
> > > > >  
> > > > > https://issues.apache.org/jira/browse/CASSANDRA-11349  
> > > > > Repair suffers false-negative tree mismatches and overstreams data.  
> > > > >  
> > > > > https://issues.apache.org/jira/browse/CASSANDRA-10433  
> > > > > Reduced performance on inserts (and reads?) (for Thrift clients  
> > only?)  
> > > > >  
> > > > > https://issues.apache.org/jira/browse/CASSANDRA-12030  
> > > > > Reduced performance on reads for workloads with range tombstones  
> > > > >  
> > > > > Anyone want to make a case that these are more critical than they  
> > > appear  
> > > > > and should not be reverted?  
> > > > >  
> > > > > --  
> > > > > Jonathan Ellis  
> > > > > Project Chair, Apache Cassandra  
> > > > > co-founder, http://www.datastax.com  
> > > > > @spyced  
> > > >  
> > > >  
> > >  
> > >  
> > > --  
> > > Jonathan Ellis  
> > > Project Chair, Apache Cassandra  
> > > co-founder, http://www.datastax.com  
> > > @spyced  
> > >  
> >  
>  

Reply via email to