Great to see such a significant progress made in the area! On Thu, Apr 4, 2019 at 1:13 PM Aleksey Yeschenko <alek...@apache.org> wrote:
> I would like to propose CASSANDRA-15066 [1] - an important set of bug fixes > and stability improvements to internode messaging code that Benedict, I, > and others have been working on for the past couple of months. > > First, some context. This work started off as a review of CASSANDRA-14503 > (Internode connection management is race-prone [2]), CASSANDRA-13630 > (Support large internode messages with netty) [3], and a pre-4.0 > confirmatory review of such a major new feature. > > However, as we dug in, we realized this was insufficient. With more than 50 > bugs uncovered [4] - dozens of them critical to correctness and/or > stability of the system - a substantial rework was necessary to guarantee a > solid internode messaging subsystem for the 4.0 release. > > In addition to addressing all of the uncovered bugs [4] that were unique to > trunk + 13630 [3] + 14503 [2], we used this opportunity to correct some > long-existing, pre-4.0 bugs and stability issues. For the complete list of > notable bug fixes, read the comments to CASSANDRA-15066 [1]. But I’d like > to highlight a few. > > # Lack of message integrity checks > > It’s known that TCP checksums are too weak [5] and Ethernet CRC cannot be > relied upon [6] for integrity. With sufficient scale or time, you will hit > bit flips. Sadly, most of the time these go undetected. Cassandra’s > replication model makes this issue much more serious, as the faulty data > can infect the cluster. > > We recognised this problem, and recently introduced a fix for server-client > messages, implementing CRCs in CASSANDRA-13304 (Add checksumming to the > native protocol) [7]. > > But until CASSANDRA-15066 [1] lands, this is also a critical flaw > internode. We have addressed it by ensuring that no matter what, whether > you use SSL or not, whether you use internode compression or not, a > protocol level CRC is always present, for every message frame. It’s our > deep and sincere belief that shipping a new implementation of the messaging > protocol without application-level data integrity checks would be > unacceptable for a modern database. > I'm all for introducing more correctness checks at all levels especially in communication. Having dealt with multiple data corruption bugs that could have been easily prevented by having a checksum, it's great to see that we are moving in this direction. > # Lack of back-pressure and memory usage constraints > > As it stands today, it’s far too easy for a single slow node to become > overwhelmed by messages from its peers. Conversely, multiple coordinators > can be made unstable by the backlog of messages to deliver to just one > struggling node. > > To address this problem, we have introduced strict memory usage constraints > that apply TCP-level back-pressure, on both outbound and inbound. It is > now impossible for a node to be swamped on inbound, and on outbound it is > made significantly harder to overcommit resources. It’s a simple, reliable > mechanism that drastically improves cluster stability under load, and > especially overload. > > Cassandra is a mature system, and introducing an entirely new messaging > implementation without resolving this fundamental stability issue is > difficult to justify in our view. > I'd say that this is required to be able to ship 4.0 as a release focused on stability. I personally have been waiting for this to happen for years. Significant step forward in our QoS story. > > # State of the patch, feature freeze and 4.0 timeline concerns > > The patch is essentially complete, with much improved unit tests all > passing, dtests green, and extensive fuzz testing underway - with initial > results all positive. We intend to further improve in-code documentation > and test coverage in the next week or two, and do some minor additional > code review, but we believe it will be basically good to commit in a couple > weeks. > > The patch could also use some performance testing. We are hoping that our > colleagues at Netflix and new Cassandra committers Joey and Vinay will help > us with this aspect. However, this work needs to be done regardless, so > provides no significant delay. > > I would classify absolutely most of the work done in this patch as > necessary bug fixes and stability improvements - in line with the stated > goals of the feature freeze. > > The only clear “feature” introduced is the expanded metrics, and virtual > tables to observe them. If the freeze is to be strictly interpreted these > can be removed, but we think this would be unwise. > > We hope that the community will appreciate and welcome these changes. > We’ve worked hard to deliver this promptly, to minimise delays to 4.0 were > these issues to be addressed piecemeal, and we sincerely believe this work > will have a positive impact on stability and performance of Cassandra > clusters for years to come. > > P.S. I believe that once it’s committed, we should cut our first 4.0 alpha. > It’s about time we started this train (: > > [1] https://issues.apache.org/jira/browse/CASSANDRA-15066 > [2] https://issues.apache.org/jira/browse/CASSANDRA-14503 > [3] https://issues.apache.org/jira/browse/CASSANDRA-13630 > [4] https://gist.github.com/belliottsmith/0d12df678d8e9ab06776e29116d56b91 > (incomplete list) > [5] https://www.evanjones.ca/tcp-checksums.html > [6] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html > [7] https://issues.apache.org/jira/browse/CASSANDRA-13304 >