Re: Stabilising Internode Messaging in 4.0

Pavel Yaskevich Thu, 04 Apr 2019 13:38:09 -0700

Great to see such a significant progress made in the area!

On Thu, Apr 4, 2019 at 1:13 PM Aleksey Yeschenko <alek...@apache.org> wrote:


> I would like to propose CASSANDRA-15066 [1] - an important set of bug fixes
> and stability improvements to internode messaging code that Benedict, I,
> and others have been working on for the past couple of months.
>
> First, some context.   This work started off as a review of CASSANDRA-14503
> (Internode connection management is race-prone [2]), CASSANDRA-13630
> (Support large internode messages with netty) [3], and a pre-4.0
> confirmatory review of such a major new feature.
>
> However, as we dug in, we realized this was insufficient. With more than 50
> bugs uncovered [4] - dozens of them critical to correctness and/or
> stability of the system - a substantial rework was necessary to guarantee a
> solid internode messaging subsystem for the 4.0 release.
>
> In addition to addressing all of the uncovered bugs [4] that were unique to
> trunk + 13630 [3] + 14503 [2], we used this opportunity to correct some
> long-existing, pre-4.0 bugs and stability issues. For the complete list of
> notable bug fixes, read the comments to CASSANDRA-15066 [1]. But I’d like
> to highlight a few.
>
> # Lack of message integrity checks
>
> It’s known that TCP checksums are too weak [5] and Ethernet CRC cannot be
> relied upon [6] for integrity. With sufficient scale or time, you will hit
> bit flips. Sadly, most of the time these go undetected.  Cassandra’s
> replication model makes this issue much more serious, as the faulty data
> can infect the cluster.
>
> We recognised this problem, and recently introduced a fix for server-client
> messages, implementing CRCs in CASSANDRA-13304 (Add checksumming to the
> native protocol) [7].
>
> But until CASSANDRA-15066 [1] lands, this is also a critical flaw
> internode. We have addressed it by ensuring that no matter what, whether
> you use SSL or not, whether you use internode compression or not, a
> protocol level CRC is always present, for every message frame. It’s our
> deep and sincere belief that shipping a new implementation of the messaging
> protocol without application-level data integrity checks would be
> unacceptable for a modern database.
>

I'm all for introducing more correctness checks at all levels especially in
communication.
Having dealt with multiple data corruption bugs that could have been easily
prevented by
having a checksum, it's great to see that we are moving in this direction.


> # Lack of back-pressure and memory usage constraints
>
> As it stands today, it’s far too easy for a single slow node to become
> overwhelmed by messages from its peers.  Conversely, multiple coordinators
> can be made unstable by the backlog of messages to deliver to just one
> struggling node.
>
> To address this problem, we have introduced strict memory usage constraints
> that apply TCP-level back-pressure, on both outbound and inbound.  It is
> now impossible for a node to be swamped on inbound, and on outbound it is
> made significantly harder to overcommit resources.  It’s a simple, reliable
> mechanism that drastically improves cluster stability under load, and
> especially overload.
>
> Cassandra is a mature system, and introducing an entirely new messaging
> implementation without resolving this fundamental stability issue is
> difficult to justify in our view.
>

I'd say that this is required to be able to ship 4.0 as a release focused
on stability.
I personally have been waiting for this to happen for years. Significant
step forward in our QoS story.


>
> # State of the patch, feature freeze and 4.0 timeline concerns
>
> The patch is essentially complete, with much improved unit tests all
> passing, dtests green, and extensive fuzz testing underway - with initial
> results all positive.  We intend to further improve in-code documentation
> and test coverage in the next week or two, and do some minor additional
> code review, but we believe it will be basically good to commit in a couple
> weeks.
>
> The patch could also use some performance testing. We are hoping that our
> colleagues at Netflix and new Cassandra committers Joey and Vinay will help
> us with this aspect.  However, this work needs to be done regardless, so
> provides no significant delay.
>
> I would classify absolutely most of the work done in this patch as
> necessary bug fixes and stability improvements - in line with the stated
> goals of the feature freeze.
>
> The only clear “feature” introduced is the expanded metrics, and virtual
> tables to observe them.  If the freeze is to be strictly interpreted these
> can be removed, but we think this would be unwise.
>
> We hope that the community will appreciate and welcome these changes.
> We’ve worked hard to deliver this promptly, to minimise delays to 4.0 were
> these issues to be addressed piecemeal, and we sincerely believe this work
> will have a positive impact on stability and performance of Cassandra
> clusters for years to come.
>
> P.S. I believe that once it’s committed, we should cut our first 4.0 alpha.
> It’s about time we started this train (:
>
> [1] https://issues.apache.org/jira/browse/CASSANDRA-15066
> [2] https://issues.apache.org/jira/browse/CASSANDRA-14503
> [3] https://issues.apache.org/jira/browse/CASSANDRA-13630
> [4] https://gist.github.com/belliottsmith/0d12df678d8e9ab06776e29116d56b91
> (incomplete list)
> [5] https://www.evanjones.ca/tcp-checksums.html
> [6] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
> [7] https://issues.apache.org/jira/browse/CASSANDRA-13304
>

Re: Stabilising Internode Messaging in 4.0

Reply via email to