D'oh, too late to respond :) I was going to comment that leaving the non-critical commits in 2.1 is probably OK at this point (the deed has already been done), as long as we agree to become more rigorous in the future about critical bugs vs bugs vs minor enhancements vs major features and in which versions they land.
On Thu, Jul 21, 2016 at 7:37 AM, Jonathan Ellis <jbel...@gmail.com> wrote:

> Sounds like we have consensus on that. I will handle the reverts and
> update Jira.
>
> On Wed, Jul 20, 2016 at 7:11 PM, Aleksey Yeschenko <alek...@apache.org>
> wrote:
>
> > Keep #11349, revert the rest sounds reasonable.
> >
> > --
> > AY
> >
> > On 20 July 2016 at 22:27:05, sankalp kohli (kohlisank...@gmail.com)
> > wrote:
> >
> > +1 on only allowing critical bug fixes.
> > I agree with Sylvain that CASSANDRA-11349 is a borderline critical
> > bug. I would vote for CASSANDRA-11349 as being critical, since
> > over-streaming is a big issue for us as well. I am also fine taking
> > it as an internal patch, since we already maintain an internal branch
> > with bug fixes.
> >
> > On Wed, Jul 20, 2016 at 2:10 PM, Fabien Rousseau
> > <fabifab...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I'm the reporter for CASSANDRA-11349.
> > > Just a bit of history:
> > >
> > > This was a big pain point for us because the over-streaming was
> > > really significant. Each week, repair increased data size by a few
> > > percent. To give an example, one cluster with 12 nodes had a CF
> > > with around 250GB of data per node, which compaction reduced to
> > > 80GB.
> > > Before fully understanding the problem, we doubled the cluster size
> > > (from 6 nodes to 12 nodes) just because of this issue.
> > > We spotted the problem in November 2015, in one cluster, without
> > > having any hint of what was happening at first (we just knew that
> > > there was a lot of streaming during repairs, i.e. a few MB/s of
> > > data, and that there was quite a high ratio of mismatches on read
> > > repairs).
> > > After putting more clusters into production (with completely
> > > different data models), we continued to observe this pattern
> > > (without finding any correlation).
> > > It took some time to identify the issue and fix it.
> > > Note that, once we reported the issue, we stopped repairing the
> > > affected CFs. (I know it's bad practice, but this was a temporary
> > > solution on bare-metal servers, which are quite stable.)
> > >
> > > I'd say I'm surprised that this was not reported by more people
> > > (our data model has nothing particular about it). Maybe other
> > > people are affected but have not paid attention to this phenomenon.
> > >
> > > Please note that this patch does not affect how SSTables are
> > > serialized, but only how digests are computed, which reduces risk.
> > >
> > > So, to conclude, it was critical for us, and it may help other
> > > people in the wild. Nevertheless, we can live with this patch not
> > > being included in 2.1.x (by maintaining our own C* package until we
> > > migrate to 3.0.x).
> > >
> > > 2016-07-20 18:06 GMT+02:00 Sylvain Lebresne <sylv...@datastax.com>:
> > >
> > > > Definitely agree that CASSANDRA-10433 and CASSANDRA-12030 aren't
> > > > critical. In fact, they are both marked as "improvements" and
> > > > "minor". I'm to blame for their commit, so mea culpa. But in my
> > > > defense, I've long advocated for being stricter about sticking to
> > > > critical-only fixes on old releases, and have long felt that this
> > > > sentiment wasn't really shared in practice by other
> > > > devs/committers (it could be my fault for getting the wrong
> > > > impression, but I've lost track of the number of times other
> > > > devs/committers argued for "just this time, because it's really
> > > > simple" when those questions came up), so I had partly given up
> > > > on arguing. I'm happy to get more consistent on this, and +1 on
> > > > reverting those.
> > > >
> > > > I think CASSANDRA-11349 is the one possibly more debatable.
> > > > Fabien on the ticket seemed to suggest that on his clusters the
> > > > bug was making repair unmanageable ("a few days after filing the
> > > > bug, we decided to temporarily stop repairing some tables [...]
> > > > which were heavily impacted by those bugs"). He mentioned in
> > > > particular having seen "around a hundred" mismatches with the
> > > > patch against "a few hundred of thousands" without. I suppose one
> > > > could make a case that having repair over-stream by three orders
> > > > of magnitude in some cases is critical-ish. Note that I wouldn't
> > > > oppose reverting this too, as it doesn't fully meet *my*
> > > > definition of critical, but I'm willing to accept that my
> > > > definition is on the stronger side of things and leave it in.
> > > >
> > > > --
> > > > Sylvain
> > > >
> > > > On Wed, Jul 20, 2016 at 5:29 PM, Aleksey Yeschenko
> > > > <alek...@apache.org> wrote:
> > > >
> > > > > +1 from me (and I don’t see any resistance to it either).
> > > > >
> > > > > --
> > > > > AY
> > > > >
> > > > > On 18 July 2016 at 18:36:42, Jonathan Ellis (jbel...@gmail.com)
> > > > > wrote:
> > > > >
> > > > > Except there really wasn't.
> > > > >
> > > > > Patch submitter: "I want this in 2.1."
> > > > > Reviewer: "Okay."
> > > > >
> > > > > That's not exactly the bar we're looking for. To consider a
> > > > > performance fix "critical", for example, you really need to
> > > > > show, at the very least, what new workload you found that isn't
> > > > > able to live with it the way everyone else did for the previous
> > > > > 15 releases.
> > > > >
> > > > > I note that on 10433 the committer even said, "I'm not [sure] I
> > > > > agree this is critical for 2.1 at this point, but as it's
> > > > > simple enough and has been somewhat vetted on 2.2 by now, not
> > > > > going to argue."
> > > > >
> > > > > So consider this me putting on my bad cop hat and opening up
> > > > > the argument.
> > > > > On Mon, Jul 18, 2016 at 12:24 PM, Jeremiah D Jordan
> > > > > <jerem...@datastax.com> wrote:
> > > > >
> > > > > > Looking at those tickets, in all three of them the “is this
> > > > > > critical to fix” question came up in the JIRA discussion, and
> > > > > > it was decided that they were indeed critical enough to
> > > > > > commit to 2.1.
> > > > > >
> > > > > > > On Jul 18, 2016, at 11:47 AM, Jonathan Ellis
> > > > > > > <jbel...@gmail.com> wrote:
> > > > > > >
> > > > > > > We're at the stage of the release cycle where we should be
> > > > > > > committing critical fixes only to the 2.1 branch. Many
> > > > > > > people depend on 2.1 working reliably, and it's not worth
> > > > > > > the risk of introducing regressions for (e.g.) performance
> > > > > > > improvements.
> > > > > > >
> > > > > > > I think some of the patches committed so far for 2.1.16 do
> > > > > > > not meet this bar and should be reverted. I include a
> > > > > > > summary of what people have to live with if we leave them
> > > > > > > unfixed:
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/CASSANDRA-11349
> > > > > > > Repair suffers false-negative tree mismatches and
> > > > > > > overstreams data.
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/CASSANDRA-10433
> > > > > > > Reduced performance on inserts (and reads?) (for Thrift
> > > > > > > clients only?)
> > > > > > >
> > > > > > > https://issues.apache.org/jira/browse/CASSANDRA-12030
> > > > > > > Reduced performance on reads for workloads with range
> > > > > > > tombstones
> > > > > > >
> > > > > > > Anyone want to make a case that these are more critical
> > > > > > > than they appear and should not be reverted?
> > > > > > >
> > > > > > > --
> > > > > > > Jonathan Ellis
> > > > > > > Project Chair, Apache Cassandra
> > > > > > > co-founder, http://www.datastax.com
> > > > > > > @spyced
> > > > >
> > > > > --
> > > > > Jonathan Ellis
> > > > > Project Chair, Apache Cassandra
> > > > > co-founder, http://www.datastax.com
> > > > > @spyced
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder, http://www.datastax.com
> @spyced
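[Editor's note for readers following the thread: the failure mode Jonathan summarizes for CASSANDRA-11349 — repair suffering false-negative tree mismatches and over-streaming data that is actually in sync — can be sketched in a few lines. This is a simplified, hypothetical illustration, not Cassandra's actual Merkle-tree repair code; the `digest` function and the `replica_a`/`replica_b` data are invented for the example.]

```python
import hashlib

def digest(rows, canonical=True):
    """Compute a digest over a replica's rows.

    With canonical=False, rows are hashed in whatever order they happen
    to be stored, mimicking a digest that depends on incidental
    representation rather than logical content -- a stand-in for the
    class of bug discussed in CASSANDRA-11349.
    """
    h = hashlib.sha256()
    for row in (sorted(rows) if canonical else rows):
        h.update(repr(row).encode())
    return h.hexdigest()

# Two replicas hold logically identical data, stored in different order.
replica_a = [("k1", 1), ("k2", 2)]
replica_b = [("k2", 2), ("k1", 1)]

# A representation-dependent digest mismatches even though the replicas
# agree, so repair would stream these ranges needlessly (over-streaming).
assert digest(replica_a, canonical=False) != digest(replica_b, canonical=False)

# A canonical digest agrees, so nothing is streamed.
assert digest(replica_a) == digest(replica_b)
```

The effect Fabien describes (repair growing data by a few percent per run) follows: every false mismatch streams a range that was already consistent, and the copies then wait for compaction to be merged away.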