To be fair, even though the patch totals 20K LoC, the core of the patch
is within reasonable bounds (around net.async.*). Many of the changes
come from code that got moved around, and some parts look large because
Java is quite a verbose language (e.g., metric tables). Unfortunately,
I do not see a good way to split the bug fixes specific to the netty
refactor out from the other changes, since some things were
fundamentally reworked.

I'll leave a more elaborate comment summarising the way I've been
approaching the patch review, which might help identify the "important"
parts to concentrate on while reviewing.

I've been reviewing the patch and can say that it has several advantages,
including the fact that the incoming and outgoing paths are now easier to
test (e.g., to write unit tests for). This has helped to validate rather
complex scenarios, such as the handshake protocol, back pressure, dropping
messages, large messages, expiry, counting and more. The patch also
integrates well with in-JVM tests, which allowed us to verify several
behaviours that otherwise would've been somewhat difficult to reproduce.
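
To give a flavour of that testability, here's a minimal, self-contained
sketch of the kind of unit test this structure makes possible for the
expiry scenario. The names (ExpiringMessage, MessageExpiryTest) are
hypothetical stand-ins for illustration, not the actual classes in the
patch:

    import java.util.concurrent.TimeUnit;

    // Hypothetical stand-in for a message with an expiry deadline.
    final class ExpiringMessage
    {
        final long createdAtNanos;
        final long expiresAfterNanos;

        ExpiringMessage(long createdAtNanos, long expiresAfterNanos)
        {
            this.createdAtNanos = createdAtNanos;
            this.expiresAfterNanos = expiresAfterNanos;
        }

        // Expiry is a pure function of a supplied clock value, so tests
        // control time directly instead of sleeping.
        boolean isExpired(long nowNanos)
        {
            return nowNanos - createdAtNanos >= expiresAfterNanos;
        }
    }

    public class MessageExpiryTest
    {
        public static void main(String[] args)
        {
            long start = 0L;
            ExpiringMessage msg =
                new ExpiringMessage(start, TimeUnit.SECONDS.toNanos(10));
            assert !msg.isExpired(start + TimeUnit.SECONDS.toNanos(9));
            assert msg.isExpired(start + TimeUnit.SECONDS.toNanos(10));
            System.out.println("expiry checks passed");
        }
    }

The point is that time and I/O are passed in explicitly, so complex paths
can be exercised without real sockets or sleeps.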

I do agree that validating this patch is no easy endeavour, but that
said, at this point we would have to invest a similar amount of time to
validate 4.0 with or without this patch.

I'm personally leaning towards 4.0, since this helps keep the community
unified, testing the same version together, instead of some concentrating
on making 4.0 work while others skip it and proceed with 4.next.


On Wed, Apr 10, 2019 at 4:55 PM Benedict Elliott Smith <bened...@apache.org>
wrote:

> Appreciate everyone's input. It sounds like there's broad agreement that
> the fixes need to make it into 4.0, which I'm really pleased to see. The
> question seems to be moving towards scope and timing.
>
> TL;DR: This patch is probably the most efficient route to a stable 4.0.
> The patch is already reviewed, extensively tested, and more thorough
> testing is coming. From our perspective, a timeline for merging this is on
> the order of two weeks.
>
> To best answer the question of ideal scope and timing, it's perhaps
> worthwhile considering what our ideal approach would look like, given
> realistic breathing room, and then how other pressures might prefer to
> modify that. In my opinion, both views endorse proceeding with the patch
> largely as it stands, though I will highlight some of the strongest cases
> for splitting.
>
> To answer the first question: Were I starting with 3.0, and producing a
> major refactor of messaging, would I have produced this patch, or one very
> like it? The answer is yes.
>
> These changes have been designed together, intentionally. By making them
> simultaneously, we not only reduce the review and work burden, we reduce
> the overall risk by ensuring all of the semantics of each component are
> designed in harmony. A trickle of changes requires subtly updating the
> semantics of each component, across multiple constantly shifting interfaces
> between old and new. Each time we touch one of these interfaces, we risk
> failing to properly modify (or more likely hack) one side to understand the
> other, and carrying those misunderstandings and hacks through to later
> patches.
>
> Each time we do this, it is costly in both time and risk. The project
> can't really afford either, even in the wider picture sense. There is
> already too much work to do, and risk to mitigate.
>
> Just as importantly, most of these changes are necessary, and actually fix
> bugs. For example, the “performance improvements” from re-implementing
> Netty classes are actually to avoid bugs in our usage, since they are
> designed for use only on the event loop.
>
> - Using our own allocator avoids cross-thread recycling, which can cause
> unbounded heap growth in Netty today (and is the topic of active discussion
> on the Netty bug tracker [1]). We already have our own allocator that
> works, and arguably it is lowest risk to repurpose it. We also reduce risk
> by using existing ByteBuffer methods, keeping the code more idiomatic.
> - AsyncPromise is used to avoid blocking the event loop. Modifying and
> invoking listeners on a DefaultPromise requires taking a mutex, and we do
> this on threads besides the event loop. This could stall the event loop for
> an entire scheduler quantum (or more, if the owning thread is low priority
> or has been extremely busy recently), which is a bug for a real-time
> networking thread that could be servicing dozens of connections.
> - FutureCombiner in Netty is not thread-safe, and we need it to be.
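
To illustrate the AsyncPromise point above with a hedged sketch: the core
idea of a non-blocking promise is that listener registration and
completion race via compare-and-swap instead of a mutex. This is an
illustrative toy supporting a single listener only, not the actual
AsyncPromise code from the patch:

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.Consumer;

    // Toy single-listener promise: registration and completion race via
    // CAS, so completing it from any thread never takes a mutex that
    // could stall the event loop.
    final class CasPromise<V>
    {
        // null -> empty; Consumer -> registered listener; otherwise -> result
        private final AtomicReference<Object> state = new AtomicReference<>();

        @SuppressWarnings("unchecked")
        void onComplete(Consumer<V> listener)
        {
            if (!state.compareAndSet(null, listener))
                listener.accept((V) state.get()); // already completed
        }

        @SuppressWarnings("unchecked")
        void complete(V value)
        {
            Object prev = state.getAndSet(value);
            if (prev instanceof Consumer)
                ((Consumer<V>) prev).accept(value); // run listener, lock-free
        }
    }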
>
> Some are perhaps not bug fixes, but a fundamental part of the design of
> the patch. You either get them for free, or you write a different patch.
> For example, checksumming, which falls naturally out of framing - which is
> integral to the new message flow. Or dropping messages eagerly when reading
> off the wire, which is simply natural to do in this design. Back pressure
> falls roughly into this category as well, and fixes a major stability bug.
> Perhaps the biggest stability bug we have today.
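
As a hedged sketch of what per-frame checksumming looks like in
principle - the real frame layout and checksum choice in the patch may
differ, and FrameCodec is a made-up name - the idea is simply that each
frame carries a CRC over its payload, verified on receipt:

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    // Sketch of a framed payload: [length][payload][crc32 of payload].
    // Hypothetical layout for illustration only.
    final class FrameCodec
    {
        static ByteBuffer encode(ByteBuffer payload)
        {
            CRC32 crc = new CRC32();
            crc.update(payload.duplicate());
            ByteBuffer frame = ByteBuffer.allocate(4 + payload.remaining() + 4);
            frame.putInt(payload.remaining());
            frame.put(payload.duplicate());
            frame.putInt((int) crc.getValue());
            frame.flip();
            return frame;
        }

        static ByteBuffer decode(ByteBuffer frame)
        {
            int length = frame.getInt();
            ByteBuffer payload = frame.slice();
            payload.limit(length);
            frame.position(frame.position() + length);
            CRC32 crc = new CRC32();
            crc.update(payload.duplicate());
            if ((int) crc.getValue() != frame.getInt())
                throw new IllegalStateException("corrupt frame: CRC mismatch");
            return payload;
        }
    }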
>
> The new handshake protocol is probably the best candidate for splitting
> into its own patch. There's a good case to be made here, but the question
> is: what are we hoping to achieve by separating it? Reviewing these
> specific changes is not a significant burden, I don't think. Testing it
> independently probably doesn't yield significant benefit - if anything it's
> probably advantageous to include the changes in our atypically thorough
> testing of this subsystem.
>
> Perhaps we can take this discussion onto Jira, to dive into specifics on
> more items? I've tried to keep it high level, and only select a few items,
> and it's perhaps already too deep in the weeds. We're absolutely open to
> modifying the scope if it definitely makes sense, and that's probably best
> established with some deep discussions on specific points.
>
> So, what about our current pressures?
>
> I think it's clear that landing this patch is the most efficient route to
> 4.0 release. It has already exceeded the normal barrier for inclusion in
> the project - it has been reviewed by two people already (50% each by me
> and Aleksey, and 100% by Alex Petrov). It is already well tested, and I
> will shortly be posting a test plan to the Wiki outlining plans for further
> coverage. Suffice to say we anticipate atypically thorough test coverage
> before the patch is committed, and an extremely high confidence in the
> result. We also don't anticipate this taking a significant length of time
> to achieve.
>
> Attempting to incorporate the patch in a piecemeal fashion can only slow
> things down and, in my opinion, lead to a worse and riskier result.
>
> [1] https://github.com/netty/netty/pull/8052
>
>
> > On 10 Apr 2019, at 03:28, Joseph Lynch <joe.e.ly...@gmail.com> wrote:
> >
> > Let's try this again, apparently email is hard ...
> >
> > I am relatively new to these code paths—especially compared to the
> > committers that have been working on these issues for years such as
> > the 15066 authors as well as Jason Brown—but like many Cassandra users
> > I am familiar with many of the classes of issues Aleksey and Benedict
> > have identified with this patchset (especially related to messaging
> > correctness, performance and the lack of message backpressure). We
> > believe that every single fix and feature in this patch is valuable
> > and we desire that we are able to get them all merged in and
> > validated. We don’t think it’s even a question whether we want to merge
> > these: we should want these excellent changes. The only questions—in
> > my opinion—are how do we safely merge them and when do we merge them?
> >
> > Due to my and Vinay’s relative lack of knowledge of these code paths,
> > we hope that we can get as many experienced eyes as we can to review
> > the patch and evaluate the risk-reward tradeoffs of some of the deeper
> > changes. We don’t feel qualified to make assertions about risk vs
> > reward in this patchset, but I know there are a number of people on
> > this mailing list who are qualified and I think we would all
> > appreciate their insight and help.
> >
> > I completely understand that we don’t live in an ideal world, but I do
> > personally feel that in an ideal world it would be possible to pull
> > the bug fixes (bugs specific to the 4.0 netty refactor) out from the
> > semantic changes (e.g. droppability, checksumming, back pressure,
> > handshake changes), code refactors (e.g. verb handler,
> > MessageIn/MessageOut) and performance changes (various
> > re-implementations of Netty internals, some optimizations around
> > dropping dead messages earlier). Then we can review, validate, and
> > benchmark each change independently and iteratively move towards
> > better messaging. At the same time, I recognize that it may be hard
> > to pull these changes apart, but I worry that the patch, as is, may
> > take the testing community many months to review and properly vet,
> > and will mean either that we cut 4.0 many, many months from now or
> > that we cut 4.0 before we can properly test the patchset.
> >
> > I think we are all agreed we don’t want an unstable 4.0, so the main
> > decision point here is: what set of changes from this valuable and
> > important patch set do we put in 4.0, and which do we try to put in
> > 4.next? Once we determine that, the community can hopefully start
> > allocating the necessary review, testing, and benchmarking resources
> > to ensure that 4.0 is our first ever rock solid “.0” release.
> >
> > -Joey
> >
> >
> > On Thu, Apr 4, 2019 at 5:56 PM Jon Haddad <j...@jonhaddad.com> wrote:
> >>
> >> Given the number of issues that are addressed, I definitely think it's
> >> worth strongly considering merging this in.  I think it might be a
> >> little unrealistic to cut the first alpha after the merge though.
> >> Being realistic, any 20K+ LOC change is going to introduce its own
> >> bugs, and we should be honest with ourselves about that.  It seems
> >> likely the issues the patch addressed would have affected the 4.0
> >> release in some form *anyways* so the question might be do we fix them
> >> now or after someone's cluster burns down because there's no inbound /
> >> outbound message load shedding.
> >>
> >> Giving it a quick code review and going through the JIRA comments
> >> (well written, thanks guys) there seem to be some pretty important bug
> >> fixes in here as well as paying off a bit of technical debt.
> >>
> >> Jon
> >>
> >> On Thu, Apr 4, 2019 at 1:37 PM Pavel Yaskevich <xe...@apache.org>
> >> wrote:
> >>>
> >>> Great to see such significant progress made in the area!
> >>>
> >>> On Thu, Apr 4, 2019 at 1:13 PM Aleksey Yeschenko <alek...@apache.org>
> >>> wrote:
> >>>
> >>>> I would like to propose CASSANDRA-15066 [1] - an important set of
> >>>> bug fixes and stability improvements to internode messaging code
> >>>> that Benedict, I, and others have been working on for the past
> >>>> couple of months.
> >>>>
> >>>> First, some context.  This work started off as a review of
> >>>> CASSANDRA-14503 (Internode connection management is race-prone [2]),
> >>>> CASSANDRA-13630 (Support large internode messages with netty) [3],
> >>>> and a pre-4.0 confirmatory review of such a major new feature.
> >>>>
> >>>> However, as we dug in, we realized this was insufficient. With more
> >>>> than 50 bugs uncovered [4] - dozens of them critical to correctness
> >>>> and/or stability of the system - a substantial rework was necessary
> >>>> to guarantee a solid internode messaging subsystem for the 4.0
> >>>> release.
> >>>>
> >>>> In addition to addressing all of the uncovered bugs [4] that were
> >>>> unique to trunk + 13630 [3] + 14503 [2], we used this opportunity to
> >>>> correct some long-existing, pre-4.0 bugs and stability issues. For
> >>>> the complete list of notable bug fixes, read the comments to
> >>>> CASSANDRA-15066 [1]. But I’d like to highlight a few.
> >>>>
> >>>> # Lack of message integrity checks
> >>>>
> >>>> It’s known that TCP checksums are too weak [5] and Ethernet CRC
> >>>> cannot be relied upon [6] for integrity. With sufficient scale or
> >>>> time, you will hit bit flips. Sadly, most of the time these go
> >>>> undetected.  Cassandra’s replication model makes this issue much
> >>>> more serious, as the faulty data can infect the cluster.
> >>>>
> >>>> We recognised this problem, and recently introduced a fix for
> >>>> server-client messages, implementing CRCs in CASSANDRA-13304 (Add
> >>>> checksumming to the native protocol) [7].
> >>>>
> >>>> But until CASSANDRA-15066 [1] lands, this is also a critical flaw
> >>>> internode. We have addressed it by ensuring that no matter what,
> >>>> whether you use SSL or not, whether you use internode compression or
> >>>> not, a protocol level CRC is always present, for every message
> >>>> frame. It’s our deep and sincere belief that shipping a new
> >>>> implementation of the messaging protocol without application-level
> >>>> data integrity checks would be unacceptable for a modern database.
> >>>>
> >>>
> >>> I'm all for introducing more correctness checks at all levels,
> >>> especially in communication. Having dealt with multiple data
> >>> corruption bugs that could have been easily prevented by having a
> >>> checksum, it's great to see that we are moving in this direction.
> >>>
> >>>
> >>>> # Lack of back-pressure and memory usage constraints
> >>>>
> >>>> As it stands today, it’s far too easy for a single slow node to
> >>>> become overwhelmed by messages from its peers.  Conversely, multiple
> >>>> coordinators can be made unstable by the backlog of messages to
> >>>> deliver to just one struggling node.
> >>>>
> >>>> To address this problem, we have introduced strict memory usage
> >>>> constraints that apply TCP-level back-pressure, on both outbound and
> >>>> inbound.  It is now impossible for a node to be swamped on inbound,
> >>>> and on outbound it is made significantly harder to overcommit
> >>>> resources.  It’s a simple, reliable mechanism that drastically
> >>>> improves cluster stability under load, and especially overload.
> >>>>
> >>>> Cassandra is a mature system, and introducing an entirely new
> >>>> messaging implementation without resolving this fundamental
> >>>> stability issue is difficult to justify in our view.
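
For illustration, a hedged sketch of the TCP-level back-pressure idea
described above, using Netty's autoRead toggle - InboundLimiter and the
64MiB limit are made up for this sketch, not the patch's actual code:

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import java.util.concurrent.atomic.AtomicLong;

    // Sketch: stop reading from the socket once unprocessed bytes exceed
    // a limit; the kernel's TCP window then pushes back on the sender.
    final class InboundLimiter extends ChannelInboundHandlerAdapter
    {
        private static final long LIMIT_BYTES = 64 << 20; // hypothetical cap
        private final AtomicLong inflight = new AtomicLong();

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg)
        {
            int size = ((ByteBuf) msg).readableBytes();
            if (inflight.addAndGet(size) >= LIMIT_BYTES)
                ctx.channel().config().setAutoRead(false); // apply back-pressure
            ctx.fireChannelRead(msg);
        }

        // Called by whoever finishes processing a message (hypothetical hook).
        void onProcessed(ChannelHandlerContext ctx, int size)
        {
            if (inflight.addAndGet(-size) < LIMIT_BYTES)
                ctx.channel().config().setAutoRead(true); // resume reading
        }
    }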
> >>>>
> >>>
> >>> I'd say that this is required to be able to ship 4.0 as a release
> >>> focused on stability. I personally have been waiting for this to
> >>> happen for years. Significant step forward in our QoS story.
> >>>
> >>>
> >>>>
> >>>> # State of the patch, feature freeze and 4.0 timeline concerns
> >>>>
> >>>> The patch is essentially complete, with much improved unit tests all
> >>>> passing, dtests green, and extensive fuzz testing underway - with
> >>>> initial results all positive.  We intend to further improve in-code
> >>>> documentation and test coverage in the next week or two, and do some
> >>>> minor additional code review, but we believe it will be basically
> >>>> good to commit in a couple weeks.
> >>>>
> >>>> The patch could also use some performance testing. We are hoping
> >>>> that our colleagues at Netflix and new Cassandra committers Joey and
> >>>> Vinay will help us with this aspect.  However, this work needs to be
> >>>> done regardless, so provides no significant delay.
> >>>>
> >>>> I would classify absolutely most of the work done in this patch as
> >>>> necessary bug fixes and stability improvements - in line with the
> >>>> stated goals of the feature freeze.
> >>>>
> >>>> The only clear “feature” introduced is the expanded metrics, and
> >>>> virtual tables to observe them.  If the freeze is to be strictly
> >>>> interpreted these can be removed, but we think this would be unwise.
> >>>>
> >>>> We hope that the community will appreciate and welcome these
> >>>> changes. We’ve worked hard to deliver this promptly, to minimise
> >>>> delays to 4.0 were these issues to be addressed piecemeal, and we
> >>>> sincerely believe this work will have a positive impact on stability
> >>>> and performance of Cassandra clusters for years to come.
> >>>>
> >>>> P.S. I believe that once it’s committed, we should cut our first 4.0
> >>>> alpha. It’s about time we started this train (:
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-15066
> >>>> [2] https://issues.apache.org/jira/browse/CASSANDRA-14503
> >>>> [3] https://issues.apache.org/jira/browse/CASSANDRA-13630
> >>>> [4] https://gist.github.com/belliottsmith/0d12df678d8e9ab06776e29116d56b91
> >>>> (incomplete list)
> >>>> [5] https://www.evanjones.ca/tcp-checksums.html
> >>>> [6] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
> >>>> [7] https://issues.apache.org/jira/browse/CASSANDRA-13304

-- 
alex p
