+1 Thanks for articulating that so well Josh.

Sam

> On 12 Apr 2019, at 16:19, Blake Eggleston <beggles...@apple.com.INVALID> 
> wrote:
> 
> Well said Josh. You’ve pretty much summarized my thoughts on this as well.
> 
> +1 to moving forward with this
> 
>> On Apr 11, 2019, at 10:15 PM, Joshua McKenzie <jmcken...@apache.org> wrote:
>> 
>> As one of the two people who re-wrote all our unit tests to try to help
>> Sylvain get 8099 out the door, I think it's inaccurate to compare the scope
>> and potential stability impact of this work to the truly sweeping work that
>> went into 8099 (not to downplay the scope and extent of this work here).
>> 
>>> TBH, one of the big reasons we tend to drop such large PRs is the fact that
>>> Cassandra's code is highly intertwined and it makes it hard to precisely
>>> change things. We need to iterate towards interfaces that allow us to
>>> iterate quickly and reduce the amount of highly intertwined code. It helps
>>> with testing as well. I want us to have a meaningful discussion around it
>>> before we drop a big PR.
>> 
>> This has been a huge issue with our codebase since at least back when I
>> first encountered it five years ago. To date, while we have made progress
>> on this front, it's been nowhere near sufficient to mitigate the issues in
>> the codebase and allow for large, meaningful changes in smaller incremental
>> patches or commits. Making yet another discussion around this (there have
>> been many, many of them over the years) a blocker for significant work to
>> go into the codebase is unnecessary and dangerous. Not to say we shouldn't
>> formalize a path to try to make incremental progress and improve the
>> situation, far from it, but blocking other progress on a decade's worth of
>> accumulated hygiene problems isn't going to make the community focus on
>> fixing those problems, imo; it'll just turn away contributions.
>> 
>> So let me second jd's (and many others') opinion here: "it makes sense to get
>> it right the first time, rather than applying bandaids to 4.0 and rewriting
>> things for 4.next". And fwiw, asking people who have already done a huge
>> body of work to reformat that work into a series of commits or to break up
>> that work in a fashion that's more to the liking of people not involved in
>> either the writing of the patch or reviewing of it doesn't make much sense
>> to me. As I am neither an assignee nor reviewer on this contribution, I
>> leave it up to the parties involved to do things professionally and with a
>> high standard of quality. Admittedly, a large code change merging in like
>> this has implications for rebasing on anyone else's work that's in flight,
>> but be it one commit merged or 50, or be it one JIRA ticket or ten, the
>> end-result is the same; any large contribution in any format will ripple
>> outwards and require re-work from others in the community.
>> 
>> The one thing I *would* strongly argue for is performance benchmarking of
>> the new messaging code on a representative sample of different
>> general-purpose queries, LWTs, etc., preferably in a 3-node RF=3 cluster,
>> plus a healthy suite of jmh micro-benches (assuming they're not already in
>> the diff; if they are, disregard / sorry). From speaking with Aleksey
>> offline about this work, my understanding is that that's something they
>> plan on doing before putting a bow on things.
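>> 
>> To be concrete about the micro-bench side, something as small as this
>> would do (a sketch only; the class and what it measures are made up for
>> illustration, not taken from the patch):
>> 
>>     import java.nio.ByteBuffer;
>>     import java.util.concurrent.TimeUnit;
>>     import org.openjdk.jmh.annotations.*;
>> 
>>     @BenchmarkMode(Mode.AverageTime)
>>     @OutputTimeUnit(TimeUnit.NANOSECONDS)
>>     @State(Scope.Thread)
>>     public class FrameEncodeBench
>>     {
>>         private ByteBuffer buf;
>> 
>>         @Setup
>>         public void setup()
>>         {
>>             buf = ByteBuffer.allocateDirect(64);
>>         }
>> 
>>         @Benchmark
>>         public ByteBuffer encodeHeader()
>>         {
>>             // hypothetical frame header: magic, verb id, creation time
>>             buf.clear();
>>             buf.putInt(0xCA55AA17);
>>             buf.putInt(42);
>>             buf.putLong(System.nanoTime());
>>             buf.flip();
>>             return buf;
>>         }
>>     }
>> 
>> A handful of those around serialization, framing and checksumming, plus
>> the cluster-level runs, would give us a decent before/after picture.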
>> 
>> In the balance between "fear change because it destabilizes" and "go forth
>> blindly into that dark night, rewriting All The Things", I think the
>> Cassandra project's willingness to jettison the old and introduce the new
>> has served it well in staying relevant as the years have gone by. I'd hate
>> to see that culture of progress get mired in a dogmatic adherence to
>> requirements on commit counts, lines of code allowed / expected on a given
>> patch set, or any other metrics that might stymie the professional needs of
>> some of the heaviest contributors to the project.
>> 
>> On Wed, Apr 10, 2019 at 5:03 PM Oleksandr Petrov <oleksandr.pet...@gmail.com>
>> wrote:
>> 
>>> Sorry to pick only a few points to address, but I think these are ones
>>> where I can contribute productively to the discussion.
>>> 
>>>> In principle, I agree with the technical improvements you
>>>> mention (backpressure / checksumming / etc). These things should be there.
>>>> Are they a hard requirement for 4.0?
>>> 
>>> One thing that comes to mind is protocol versioning and consistency. If
>>> the changes adding checksumming and the handshake do not make it into 4.0,
>>> we grow the upgrade matrix and have to put those changes into a separate
>>> protocol version. I'm not sure how many other internode protocol changes
>>> we have planned for 4.next, but this is definitely something we should
>>> keep in mind.
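>>> 
>>> To illustrate the cost: every internode serializer ends up gated on the
>>> peer's messaging version, roughly like this (hypothetical names and
>>> constants, just to show the pattern, not our actual code):
>>> 
>>>     import java.io.DataOutput;
>>>     import java.io.IOException;
>>> 
>>>     final class VersionedSerializer
>>>     {
>>>         // hypothetical version constant
>>>         static final int VERSION_40 = 12;
>>> 
>>>         interface Message
>>>         {
>>>             int frameCrc();
>>>             void writeBody(DataOutput out, int version) throws IOException;
>>>         }
>>> 
>>>         void serialize(Message msg, DataOutput out, int version) throws IOException
>>>         {
>>>             // newer peers get the integrity check, older peers the bare
>>>             // body; every post-4.0 protocol change adds another branch
>>>             if (version >= VERSION_40)
>>>                 out.writeInt(msg.frameCrc());
>>>             msg.writeBody(out, version);
>>>         }
>>>     }
>>> 
>>> Each extra protocol version multiplies the branches we have to carry and
>>> the upgrade paths we have to test.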
>>> 
>>>> 2. We should not be measuring complexity in LoC with the exception that
>>>> all 20k lines *do need to be reviewed* (not just the important parts and
>>>> because code refactoring tools have bugs too) and more lines take more
>>>> time.
>>> 
>>> Everything should definitely be reviewed, but with different rigour: it is
>>> one thing to review byte arithmetic and protocol formats, and a completely
>>> different thing to verify that a Verb moved from one place to another is
>>> still used. Concentrating on a smaller subset doesn't make a patch
>>> smaller; it just helps to guide reviewers and observers. So my primary
>>> goal was to help people, hence my reference to the jira comment I'm
>>> working on.
>>> 
>>> 
>>> On Wed, Apr 10, 2019 at 6:13 PM Sankalp Kohli <kohlisank...@gmail.com>
>>> wrote:
>>> 
>>>> I think we should wait for the testing doc on Confluence to come up and
>>>> discuss what needs to be added there to gain confidence.
>>>> 
>>>> If splitting the patch into smaller parts is more work and delays 4.0
>>>> even further, can we use that time instead to add more test coverage,
>>>> documentation, and design docs for this component? Would that be a middle
>>>> ground here?
>>>> 
>>>> The examples of large changes not going well come down to lack of
>>>> testing, documentation, and design docs, not to the changes being big.
>>>> Being big adds to the complexity and increases the total bug count, but
>>>> small changes can be equally bad in terms of impact.
>>>> 
>>>> 
>>>>> On Apr 10, 2019, at 8:53 AM, Jordan West <jorda...@gmail.com> wrote:
>>>>> 
>>>>> There is a lot to discuss here so I’ll try to keep my opinions brief:
>>>>> 
>>>>> 1. The bug fixes are a requirement in order to have a stable 4.0.
>>>>> Whether they come from this patch or the original, I have less of an
>>>>> opinion. I do think it’s important to minimize code changes at this time
>>>>> in the development/freeze cycle — including large refactors, which add
>>>>> risk despite how they are being discussed here. From that perspective, I
>>>>> would prefer to see more targeted fixes, but since we don’t have them
>>>>> and we have this patch, the decision is different.
>>>>> 
>>>>> 2. We should not be measuring complexity in LoC with the exception that
>>>>> all 20k lines *do need to be reviewed* (not just the important parts,
>>>>> and because code refactoring tools have bugs too) and more lines take
>>>>> more time. Otherwise, it’s a poor metric for how long this will take to
>>>>> review. Further, it seems odd that the authors are projecting how long
>>>>> it will take to review — this should be the charge of the reviewers, and
>>>>> I would like to hear from them on this. It’s clear this is a complex
>>>>> patch — as risky as something like 8099 (and the original rewrite by
>>>>> Jason). We should treat it as such and not try to merge it in quickly
>>>>> for the sake of it, repeating past mistakes. The goal of reviewing the
>>>>> messaging service work was to do just that. It would be a disservice to
>>>>> rush in these changes now. If the goal is to get the patch in, that
>>>>> should be the priority, not completing it “in two weeks”. It’s clear
>>>>> several community members have pushed back on that and are not
>>>>> comfortable with the time frame.
>>>>> 
>>>>> 3. If we need to add special forks of Netty classes to the code because
>>>>> of “how we use Netty”, that is a concern to me w.r.t. quality,
>>>>> stability, and maintenance. Is there a place that documents/justifies
>>>>> our non-traditional usage? (I saw some in the JavaDocs, but found it
>>>>> lacking in the *why*, with a lot of how/what was changed.) Given folks
>>>>> in the community have decent relationships with the Netty team, perhaps
>>>>> we should leverage that as well. Have we reached out to them?
>>>>> 
>>>>> 4. In principle, I agree with the technical improvements you mention
>>>>> (backpressure / checksumming / etc). These things should be there. Are
>>>>> they a hard requirement for 4.0? In my opinion, no, and Dinesh has done
>>>>> a good job of providing some reasons as to why, so I won’t go into much
>>>>> detail here. In short, a bug and a missing safety mechanism are not the
>>>>> same thing. It’s also important to consider how many users a change like
>>>>> that covers and for what risk — we found a bug in 13304 late in review;
>>>>> had it slipped through, it would have subjected users to silent
>>>>> corruption they thought couldn’t occur anymore, because we included the
>>>>> feature in a prod Cassandra release.
>>>>> 
>>>>> 5. The patch could seriously benefit from some commit hygiene that
>>>>> would make it easier for folks to review. Had this been done, not only
>>>>> would review be easier, but the piecemeal breakup of features/fixes
>>>>> could also have been done more easily. I don’t buy the premise that this
>>>>> wasn’t possible. If we had to add a feature/fix later, it would have
>>>>> been possible. I’m sure there was a smart way we could have organized
>>>>> it, if it was a priority.
>>>>> 
>>>>> 6. Have any upgrade tests been done/added? I would also like to see
>>>>> some performance benchmarks before merging, so that the patch is in a
>>>>> similar state to 14503 in terms of performance testing.
>>>>> 
>>>>> I’m sure I have more thoughts but these seem like the important ones
>>>>> for now.
>>>>> 
>>>>> Jordan
>>>>> 
>>>>> On Wed, Apr 10, 2019 at 8:21 AM Dinesh Joshi <djos...@icloud.com.invalid>
>>>>> wrote:
>>>>> 
>>>>>> Here are my 2¢.
>>>>>> 
>>>>>> I think the general direction of this work is valuable but I have a
>>>>>> few concerns I’d like to address. More inline.
>>>>>> 
>>>>>>> On Apr 4, 2019, at 1:13 PM, Aleksey Yeschenko <alek...@apache.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>> I would like to propose CASSANDRA-15066 [1] - an important set of bug
>>>>>>> fixes and stability improvements to internode messaging code that
>>>>>>> Benedict, I, and others have been working on for the past couple of
>>>>>>> months.
>>>>>>> 
>>>>>>> First, some context. This work started off as a review of
>>>>>>> CASSANDRA-14503 (Internode connection management is race-prone [2]),
>>>>>>> CASSANDRA-13630 (Support large internode messages with netty) [3], and
>>>>>>> a pre-4.0 confirmatory review of such a major new feature.
>>>>>>> 
>>>>>>> However, as we dug in, we realized this was insufficient. With more
>>>>>>> than 50 bugs uncovered [4] - dozens of them critical to correctness
>>>>>>> and/or stability of the system - a substantial rework was necessary to
>>>>>>> guarantee a solid internode messaging subsystem for the 4.0 release.
>>>>>>> 
>>>>>>> In addition to addressing all of the uncovered bugs [4] that were
>>>>>>> unique to trunk + 13630 [3] + 14503 [2], we used this opportunity to
>>>>>>> correct some long-existing, pre-4.0 bugs and stability issues. For the
>>>>>>> complete list of notable bug fixes, read the comments to
>>>>>>> CASSANDRA-15066 [1]. But I’d like to highlight a few.
>>>>>> 
>>>>>> Do you have regression tests that will fail if these bugs are
>>>>>> reintroduced at a later point?
>>>>>> 
>>>>>>> # Lack of message integrity checks
>>>>>>> 
>>>>>>> It’s known that TCP checksums are too weak [5] and Ethernet CRC cannot
>>>>>>> be relied upon [6] for integrity. With sufficient scale or time, you
>>>>>>> will hit bit flips. Sadly, most of the time these go undetected.
>>>>>>> Cassandra’s replication model makes this issue much more serious, as
>>>>>>> the faulty data can infect the cluster.
>>>>>>> 
>>>>>>> We recognised this problem, and recently introduced a fix for
>>>>>>> server-client messages, implementing CRCs in CASSANDRA-13304 (Add
>>>>>>> checksumming to the native protocol) [7].
>>>>>> 
>>>>>> This was discussed in my review and Jason created a ticket[1] to track
>>>>>> it. We explicitly decided to defer this work not only due to the
>>>>>> feature freeze in the community but also for technical reasons detailed
>>>>>> below.
>>>>>> 
>>>>>> Regarding new features during the feature freeze window, we have had
>>>>>> such discussions in the past. The most recent was the one I initiated
>>>>>> on the Zstd compressor, which went positively, and we moved forward
>>>>>> after assessing risk & community consensus.
>>>>>> 
>>>>>> Regarding checksumming, please scroll down to the comments section in
>>>>>> the link[2] you provided. You'll notice this discussion –
>>>>>> 
>>>>>>>> Daniel Fox Franke:
>>>>>>>> Please don't design new network protocols that don't either run over
>>>>>>>> TLS or do some other kind of cryptographic authentication. If you
>>>>>>>> have cryptographic authentication, then CRC is redundant.
>>>>>>> 
>>>>>>> Evan Jones:
>>>>>>> Good point. Even internal applications inside a data center should be
>>>>>>> using encryption, and today the performance impact is probably small
>>>>>>> (I haven't actually measured it myself these days).
>>>>>> 
>>>>>> Enabling TLS & internode compression are mitigation strategies to
>>>>>> avoid data corruption in transit. By your own admission in
>>>>>> CASSANDRA-13304[3], we don't require checksumming if TLS is present.
>>>>>> Here's your full quote –
>>>>>> 
>>>>>>> Aleksey Yeschenko:
>>>>>>> Checksums and TLS are orthogonal. It just so happens that you don't
>>>>>>> need the former if you already have the latter.
>>>>>> 
>>>>>> I want to be fair: later you did say that we don't want to force
>>>>>> people to pay the cost of TLS overhead. However, I would also like to
>>>>>> point out that with the introduction of Netty for internode
>>>>>> communication, we have 4-5x the TLS performance thanks to the OpenSSL
>>>>>> JNI bindings. You can refer to Norman's or my talks on the topic. So
>>>>>> TLS is practical & compression is necessary for performance. Both
>>>>>> strategies work fine to protect against data corruption, making
>>>>>> checksumming redundant. With SSL certificate hot reloading, it also
>>>>>> avoids unnecessary cluster restarts, providing maximum availability.
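>>>>>> 
>>>>>> For anyone who hasn't looked at it, switching to the OpenSSL provider
>>>>>> is a one-line difference in how the Netty SslContext is built. A
>>>>>> sketch (the file paths and class name are placeholders, not our actual
>>>>>> config plumbing):
>>>>>> 
>>>>>>     import java.io.File;
>>>>>>     import io.netty.channel.socket.SocketChannel;
>>>>>>     import io.netty.handler.ssl.SslContext;
>>>>>>     import io.netty.handler.ssl.SslContextBuilder;
>>>>>>     import io.netty.handler.ssl.SslProvider;
>>>>>> 
>>>>>>     final class InternodeTls
>>>>>>     {
>>>>>>         static SslContext serverContext(File cert, File key) throws Exception
>>>>>>         {
>>>>>>             return SslContextBuilder.forServer(cert, key)
>>>>>>                                     // netty-tcnative JNI bindings, not the JDK provider
>>>>>>                                     .sslProvider(SslProvider.OPENSSL)
>>>>>>                                     .build();
>>>>>>         }
>>>>>> 
>>>>>>         static void install(SocketChannel ch, SslContext ctx)
>>>>>>         {
>>>>>>             // first in the pipeline, so everything downstream sees plaintext
>>>>>>             ch.pipeline().addFirst("ssl", ctx.newHandler(ch.alloc()));
>>>>>>         }
>>>>>>     }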
>>>>>> 
>>>>>> In the same vein, it's 2019, and if people are not using TLS for
>>>>>> internode, then it is really, really bad for data security in our
>>>>>> industry and we should not be encouraging it. In fact, I would go so
>>>>>> far as to make TLS the default.
>>>>>> 
>>>>>> Managing TLS infrastructure is beyond Cassandra's scope, and operators
>>>>>> should have figured it out by now, for their & their users' sake.
>>>>>> Cassandra makes it super easy & performant to have TLS enabled. People
>>>>>> should be using it.
>>>>>> 
>>>>>>> But until CASSANDRA-15066 [1] lands, this is also a critical flaw
>>>>>>> internode. We have addressed it by ensuring that no matter what,
>>>>>>> whether you use SSL or not, whether you use internode compression or
>>>>>>> not, a protocol level CRC is always present, for every message frame.
>>>>>>> It’s our deep and sincere belief that shipping a new implementation of
>>>>>>> the messaging protocol without application-level data integrity checks
>>>>>>> would be unacceptable for a modern database.
>>>>>> 
>>>>>> My previous technical arguments have provided enough evidence that
>>>>>> protocol level checksumming is not a show stopper.
>>>>>> 
>>>>>> The only reason I see for adding checksums in the protocol is when
>>>>>> some user doesn't want to enable TLS and internode compression. As it
>>>>>> stands, from your comments[7] it seems to be mandatory and adds
>>>>>> unnecessary overhead when TLS and/or compression is enabled. Frankly,
>>>>>> I don't think we need to risk destabilizing trunk for these use-cases.
>>>>>> I want to reiterate that I believe in doing the right thing, but we
>>>>>> have to make acceptable tradeoffs – as a community.
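>>>>>> 
>>>>>> For reference, the per-frame CRC in question is mechanically simple.
>>>>>> Something like this sketch (the framing layout here is made up; this
>>>>>> is not the actual 15066 code):
>>>>>> 
>>>>>>     import java.nio.ByteBuffer;
>>>>>>     import java.util.zip.CRC32;
>>>>>> 
>>>>>>     final class FrameCrc
>>>>>>     {
>>>>>>         // frame is in write mode; append a 4-byte CRC trailer over the body
>>>>>>         static void appendCrc(ByteBuffer frame)
>>>>>>         {
>>>>>>             CRC32 crc = new CRC32();
>>>>>>             ByteBuffer body = frame.duplicate();
>>>>>>             body.flip(); // view of the bytes written so far
>>>>>>             crc.update(body);
>>>>>>             frame.putInt((int) crc.getValue());
>>>>>>         }
>>>>>> 
>>>>>>         // frame is in read mode; recompute and compare against the trailer
>>>>>>         static boolean verify(ByteBuffer frame)
>>>>>>         {
>>>>>>             int crcPosition = frame.limit() - 4;
>>>>>>             CRC32 crc = new CRC32();
>>>>>>             ByteBuffer body = frame.duplicate();
>>>>>>             body.limit(crcPosition);
>>>>>>             crc.update(body);
>>>>>>             return frame.getInt(crcPosition) == (int) crc.getValue();
>>>>>>         }
>>>>>>     }
>>>>>> 
>>>>>> The debate isn't whether this is hard to write; it's whether every
>>>>>> deployment should pay for it per frame when TLS already authenticates
>>>>>> the stream.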
>>>>>> 
>>>>>>> # Lack of back-pressure and memory usage constraints
>>>>>>> 
>>>>>>> As it stands today, it’s far too easy for a single slow node to become
>>>>>>> overwhelmed by messages from its peers. Conversely, multiple
>>>>>>> coordinators can be made unstable by the backlog of messages to
>>>>>>> deliver to just one struggling node.
>>>>>> 
>>>>>> This is a known issue, and it could have been addressed as a separate
>>>>>> bug fix – one that could be independently verified.
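>>>>>> 
>>>>>> For anyone following along, the usual fix here (and what the patch
>>>>>> describes below) is TCP-level back-pressure: bound the buffered bytes
>>>>>> and let the kernel's flow control do the rest. A rough sketch; the
>>>>>> limit and handler are made up, not the patch's code:
>>>>>> 
>>>>>>     import io.netty.buffer.ByteBuf;
>>>>>>     import io.netty.channel.ChannelHandlerContext;
>>>>>>     import io.netty.channel.ChannelInboundHandlerAdapter;
>>>>>> 
>>>>>>     final class InboundLimiter extends ChannelInboundHandlerAdapter
>>>>>>     {
>>>>>>         private static final long LIMIT_BYTES = 64 << 20; // hypothetical budget
>>>>>>         private long buffered;
>>>>>> 
>>>>>>         @Override
>>>>>>         public void channelRead(ChannelHandlerContext ctx, Object msg)
>>>>>>         {
>>>>>>             buffered += ((ByteBuf) msg).readableBytes();
>>>>>>             // stop draining the socket; the TCP window fills and the
>>>>>>             // sender blocks, instead of us buffering without bound
>>>>>>             if (buffered >= LIMIT_BYTES)
>>>>>>                 ctx.channel().config().setAutoRead(false);
>>>>>>             ctx.fireChannelRead(msg);
>>>>>>         }
>>>>>> 
>>>>>>         // call from the event loop once a message has been processed
>>>>>>         void release(ChannelHandlerContext ctx, int bytes)
>>>>>>         {
>>>>>>             buffered -= bytes;
>>>>>>             if (buffered < LIMIT_BYTES && !ctx.channel().config().isAutoRead())
>>>>>>                 ctx.channel().config().setAutoRead(true);
>>>>>>         }
>>>>>>     }
>>>>>> 
>>>>>> Whatever the patch's actual accounting looks like, that mechanism is
>>>>>> exactly what we should be measuring under overload.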
>>>>>> 
>>>>>>> To address this problem, we have introduced strict memory usage
>>>>>>> constraints that apply TCP-level back-pressure, on both outbound and
>>>>>>> inbound. It is now impossible for a node to be swamped on inbound, and
>>>>>>> on outbound it is made significantly harder to overcommit resources.
>>>>>>> It’s a simple, reliable mechanism that drastically improves cluster
>>>>>>> stability under load, and especially overload.
>>>>>>> 
>>>>>>> Cassandra is a mature system, and introducing an entirely new
>>>>>>> messaging implementation without resolving this fundamental stability
>>>>>>> issue is difficult to justify in our view.
>>>>>> 
>>>>>> This is great in theory. I would really like to see objective
>>>>>> measurements like Chris did in CASSANDRA-14654[4]. Netflix engineers
>>>>>> tested the Netty refactor with a 200-node cluster[5] and Zero Copy
>>>>>> Streaming[6], and reported their results. It's invaluable work. It
>>>>>> would be great to see something similar.
>>>>>> 
>>>>>>> # State of the patch, feature freeze and 4.0 timeline concerns
>>>>>>> 
>>>>>>> The patch is essentially complete, with much improved unit tests all
>>>>>>> passing, dtests green, and extensive fuzz testing underway - with
>>>>>>> initial results all positive. We intend to further improve in-code
>>>>>>> documentation and test coverage in the next week or two, and do some
>>>>>>> minor additional code review, but we believe it will be basically good
>>>>>>> to commit in a couple of weeks.
>>>>>> 
>>>>>> A 20K LoC patch is unverifiable, especially without much
>>>>>> documentation. It places an undue burden on reviewers. It also messes
>>>>>> up everyone's branches once you commit such a large refactor. Let's be
>>>>>> considerate of others in the community. It is good engineering
>>>>>> practice to limit patches to a reasonable size.
>>>>>> 
>>>>>> More importantly, such large refactors lend themselves to new bugs
>>>>>> that cannot be caught easily unless you have a very strong regression
>>>>>> test suite, which Cassandra arguably lacks. Therefore I am of the
>>>>>> opinion that the bug fixes can be applied piecemeal to the codebase.
>>>>>> They should be small enough that they can be individually reviewed and
>>>>>> independently verified.
>>>>>> 
>>>>>> I also noticed that you have replaced Netty's classes[7]. I am of the
>>>>>> opinion that, if they're better, they should be upstreamed, so the
>>>>>> wider Netty community benefits and we don't have to maintain our own
>>>>>> classes.
>>>>>> 
>>>>>>> P.S. I believe that once it’s committed, we should cut our first 4.0
>>>>>>> alpha. It’s about time we started this train (:
>>>>>> 
>>>>>> +1 on working towards an alpha but that is a separate discussion from
>>>>>> this issue.
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Dinesh
>>>>>> 
>>>>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-14578
>>>>>> [2] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
>>>>>> [3] https://issues.apache.org/jira/browse/CASSANDRA-13304?focusedCommentId=16183034&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16183034
>>>>>> [4] https://issues.apache.org/jira/browse/CASSANDRA-14654
>>>>>> [5] https://issues.apache.org/jira/browse/CASSANDRA-14747
>>>>>> [6] https://issues.apache.org/jira/browse/CASSANDRA-14765
>>>>>> [7] https://issues.apache.org/jira/browse/CASSANDRA-15066?focusedCommentId=16801277&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16801277
>>>>>> 
>>> 
>>> --
>>> alex p
>>> 
> 
> 

