Re: Stabilising Internode Messaging in 4.0

Dinesh Joshi Wed, 10 Apr 2019 08:22:09 -0700

Here's are my 2¢.

I think the general direction of this work is valuable but I have a few 
concerns I’d like to address. More inline.

> On Apr 4, 2019, at 1:13 PM, Aleksey Yeschenko <alek...@apache.org> wrote:
> 
> I would like to propose CASSANDRA-15066 [1] - an important set of bug fixes
> and stability improvements to internode messaging code that Benedict, I,
> and others have been working on for the past couple of months.
> 
> First, some context.   This work started off as a review of CASSANDRA-14503
> (Internode connection management is race-prone [2]), CASSANDRA-13630
> (Support large internode messages with netty) [3], and a pre-4.0
> confirmatory review of such a major new feature.
> 
> However, as we dug in, we realized this was insufficient. With more than 50
> bugs uncovered [4] - dozens of them critical to correctness and/or
> stability of the system - a substantial rework was necessary to guarantee a
> solid internode messaging subsystem for the 4.0 release.
> 
> In addition to addressing all of the uncovered bugs [4] that were unique to
> trunk + 13630 [3] + 14503 [2], we used this opportunity to correct some
> long-existing, pre-4.0 bugs and stability issues. For the complete list of
> notable bug fixes, read the comments to CASSANDRA-15066 [1]. But I’d like
> to highlight a few.

Do you have regression tests that will fail if these bugs are reintroduced at a 
later point?

> # Lack of message integrity checks
> 
> It’s known that TCP checksums are too weak [5] and Ethernet CRC cannot be
> relied upon [6] for integrity. With sufficient scale or time, you will hit
> bit flips. Sadly, most of the time these go undetected.  Cassandra’s
> replication model makes this issue much more serious, as the faulty data
> can infect the cluster.
> 
> We recognised this problem, and recently introduced a fix for server-client
> messages, implementing CRCs in CASSANDRA-13304 (Add checksumming to the
> native protocol) [7].

This was discussed in my review and Jason created a ticket[1] to track it. We 
explicitly decided to defer this work not only due to the feature freeze in the 
community but also for technical reasons detailed below. 

Regarding new features during the feature freeze window, we have had such 
discussions in the past. The most recent being the one I initiated on Zstd 
Compressor which went positively and we have moved forward after assessing risk 
& community consensus.

Regarding checksumming, please scroll down to the comments section in the 
link[2] you provided. You'll notice this discussion –

>> Daniel Fox Franke:
>> Please don't design new network protocols that don't either run over TLS or 
>> do some other kind of cryptographic authentication. If you have 
>> cryptographic authentication, then CRC is redundant.
> 
> Evan Jones:
> Good point. Even internal applications inside a data center should be using 
> encryption, and today the performance impact is probably small (I haven't 
> actually measured it myself these days).

Enabling TLS & internode compression are mitigation strategies to avoid data 
corruption in transit. By your own admission in CASSANDRA-13304[3], we don't 
require checksumming if TLS is present. Here's your full quote –

> Aleksey Yeschenko:
> 

> Checksums and TLS are orthogonal. It just so happens that you don't need the 
> former if you already have the latter.

I want to be fair, later you did say that we don't want to force people to pay 
the cost of TLS overhead. However, I would also like to point out that with 
introduction of Netty for internode communication, we have 4-5x the TLS 
performance thanks to OpenSSL JNI bindings. You can refer to Norman or my talks 
on the topic. So TLS is practical & compression is necessary for performance. 
Both strategies work fine to protect against data corruption making 
checksumming redundant. With SSL certificate hot reloading, it also avoids 
unnecessary cluster restarts providing maximum availability.

In the same vein, it's 2019 and if people are not using TLS for internode, then 
it is really really bad for data security in our industry and we should not be 
encouraging it. In fact, I would go so far as to make TLS as the default. 

Managing TLS infrastructure is beyond Cassandra's scope and operators should 
figure it out by now for their & their user's sake. Cassandra makes it super 
easy & performant to have TLS enabled. People should be using it.

> But until CASSANDRA-15066 [1] lands, this is also a critical flaw
> internode. We have addressed it by ensuring that no matter what, whether
> you use SSL or not, whether you use internode compression or not, a
> protocol level CRC is always present, for every message frame. It’s our
> deep and sincere belief that shipping a new implementation of the messaging
> protocol without application-level data integrity checks would be
> unacceptable for a modern database.

My previous technical arguments have provided enough evidence that protocol 
level checksumming is not a show stopper.

The only reason I see for adding checksums in the protocol is when some user 
doesn't want to enable TLS and internode compression. As it stands, from your 
comments[7] it seems to be mandatory and adds unnecessary overhead when TLS 
and/or Compression is enabled. Frankly I don't think we need to risk 
destabilizing trunk for these use-cases. I want to reiterate that I believe in 
doing the right thing but we have to make acceptable tradeoffs – as a community.

> # Lack of back-pressure and memory usage constraints
> 
> As it stands today, it’s far too easy for a single slow node to become
> overwhelmed by messages from its peers.  Conversely, multiple coordinators
> can be made unstable by the backlog of messages to deliver to just one
> struggling node.

This is a known issue and it could have been addressed a separate bug fix – one 
that could be independently verified.

> To address this problem, we have introduced strict memory usage constraints
> that apply TCP-level back-pressure, on both outbound and inbound.  It is
> now impossible for a node to be swamped on inbound, and on outbound it is
> made significantly harder to overcommit resources.  It’s a simple, reliable
> mechanism that drastically improves cluster stability under load, and
> especially overload.
> 
> Cassandra is a mature system, and introducing an entirely new messaging
> implementation without resolving this fundamental stability issue is
> difficult to justify in our view.

This is great in theory. I would really like to see objective measurements like 
Chris did in CASSANDRA-14654[4]. Netflix engineers tested the Netty refactor 
with a 200 node cluster[5], Zero Copy Streaming[6] and reported their results. 
It's invaluable work. It would be great to see something similar.

> # State of the patch, feature freeze and 4.0 timeline concerns
> 
> The patch is essentially complete, with much improved unit tests all
> passing, dtests green, and extensive fuzz testing underway - with initial
> results all positive.  We intend to further improve in-code documentation
> and test coverage in the next week or two, and do some minor additional
> code review, but we believe it will be basically good to commit in a couple
> weeks.

A 20K LoC patch is unverifiable especially without much documentation. It 
places undue burden on reviewers. It also messes up everyone's branches once 
you commit such a large refactor. Let's be considerate to others in the 
community. It is a good engineering practice to limit patches to a size that is 
reasonable. 

More importantly such large refactors lend themselves to new bugs that cannot 
be caught easily unless you have a very strong regression test suite which 
Cassandra arguably lacks. Therefore I am of the opinion that the bug fixes can 
be applied piecemeal into the codebase. They should be small enough that can be 
individually reviewed and independently verified. 

I also noticed that you have replaced Netty's classes[7]. I am of the opinion 
that they should be upstreamed if they're better so the wider Netty community 
benefits from it and we don't have to maintain our own classes.

> P.S. I believe that once it’s committed, we should cut our first 4.0 alpha.
> It’s about time we started this train (:

+1 on working towards an alpha but that is a separate discussion from this 
issue.

Thanks,

Dinesh

[1] https://issues.apache.org/jira/browse/CASSANDRA-14578
[2] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
[3] 
https://issues.apache.org/jira/browse/CASSANDRA-13304?focusedCommentId=16183034&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16183034
[4] https://issues.apache.org/jira/browse/CASSANDRA-14654
[5] https://issues.apache.org/jira/browse/CASSANDRA-14747
[6] https://issues.apache.org/jira/browse/CASSANDRA-14765
[7] 
https://issues.apache.org/jira/browse/CASSANDRA-15066?focusedCommentId=16801277&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16801277

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org

Re: Stabilising Internode Messaging in 4.0

Reply via email to