Well said Josh. You’ve pretty much summarized my thoughts on this as well.
+1 to moving forward with this.

> On Apr 11, 2019, at 10:15 PM, Joshua McKenzie <jmcken...@apache.org> wrote:
>
> As one of the two people that re-wrote all our unit tests to try and help Sylvain get 8099 out the door, I think it's inaccurate to compare the scope and potential stability impact of this work to the truly sweeping work that went into 8099 (not to downplay the scope and extent of the work here).
>
>> TBH, one of the big reasons we tend to drop such large PRs is the fact that Cassandra's code is highly intertwined and it makes it hard to precisely change things. We need to iterate towards interfaces that allow us to iterate quickly and reduce the amount of highly intertwined code. It helps with testing as well. I want us to have a meaningful discussion around it before we drop a big PR.
>
> This has been a huge issue with our codebase since at least back when I first encountered it five years ago. To date, while we have made progress on this front, it's been nowhere near sufficient to mitigate the issues in the codebase and allow large, meaningful changes to land as smaller incremental patches or commits. Having yet another discussion around this (there have been many, many of them over the years) as a blocker for significant work to go into the codebase is an unnecessary and dangerous blocker. Not to say we shouldn't formalize a path to try and make incremental progress to improve the situation, far from it, but blocking other progress on a decade's worth of accumulated hygiene problems isn't going to make the community focus on fixing those problems imo; it'll just turn away contributions.
>
> So let me second jd's (and many others') opinion here: "it makes sense to get it right the first time, rather than applying bandaids to 4.0 and rewriting things for 4.next".
> And fwiw, asking people who have already done a huge body of work to reformat that work into a series of commits, or to break up that work in a fashion that's more to the liking of people not involved in either the writing of the patch or the reviewing of it, doesn't make much sense to me. As I am neither an assignee nor a reviewer on this contribution, I leave it up to the parties involved to do things professionally and with a high standard of quality. Admittedly, a large code change merging in like this has implications for rebasing on anyone else's work that's in flight, but be it one commit merged or 50, or be it one JIRA ticket or ten, the end result is the same; any large contribution in any format will ripple outwards and require re-work from others in the community.
>
> The one thing I *would* strongly argue for is performance benchmarking of the new messaging code on a representative sample of different general-purpose queries, LWTs, etc., preferably in a 3 node RF=3 cluster, plus a healthy suite of jmh micro-benches (assuming they're not already in the diff; if they are, disregard / sorry). From speaking with Aleksey offline about this work, my understanding is that that's something they plan on doing before putting a bow on things.
>
> In the balance between "fear change because it destabilizes" and "go forth blindly into that dark night, rewriting All The Things", I think the Cassandra project's willingness to jettison the old and introduce the new has served it well in keeping relevant as the years have gone by. I'd hate to see that culture of progress get mired in a dogmatic adherence to requirements on commit counts, lines of code allowed / expected on a given patch set, or any other metrics that might stymie the professional needs of some of the heaviest contributors to the project.
> On Wed, Apr 10, 2019 at 5:03 PM Oleksandr Petrov <oleksandr.pet...@gmail.com> wrote:
>
>> Sorry to pick only a few points to address, but I think these are the ones where I can contribute productively to the discussion.
>>
>>> In principle, I agree with the technical improvements you mention (backpressure / checksumming / etc). These things should be there. Are they a hard requirement for 4.0?
>>
>> One thing that comes to mind is protocol versioning and consistency. If the changes adding checksumming and the handshake do not make it into 4.0, we grow the upgrade matrix and have to put the changes into a separate protocol version. I'm not sure how many other internode protocol changes we have planned for 4.next, but this is definitely something we should keep in mind.
>>
>>> 2. We should not be measuring complexity in LoC, with the exception that all 20k lines *do need to be reviewed* (not just the important parts, and because code refactoring tools have bugs too) and more lines take more time.
>>
>> Everything should definitely be reviewed, but with different rigour: one thing is to review byte arithmetic and protocol formats, and a completely different thing is to verify that a Verb moved from one place to another is still used. Concentrating on a smaller subset doesn't make a patch smaller; it just helps to guide reviewers and observers. My primary goal was to help people, hence my reference to the jira comment I'm working on.
>>
>> On Wed, Apr 10, 2019 at 6:13 PM Sankalp Kohli <kohlisank...@gmail.com> wrote:
>>
>>> I think we should wait for the testing doc on confluence to come up and discuss what all needs to be added there to gain confidence.
>>>
>>> If the work to split the patch into smaller parts would delay 4.0 even more, can we instead use the time to add more test coverage, documentation and design docs to this component? Would that be a middle ground here?
>>> Examples of large changes not going well come down to a lack of testing, documentation and design docs, not to the changes being big. Being big adds to the complexity and increases the total bug count, but small changes can be equally bad in terms of impact.
>>>
>>>> On Apr 10, 2019, at 8:53 AM, Jordan West <jorda...@gmail.com> wrote:
>>>>
>>>> There is a lot to discuss here, so I'll try to keep my opinions brief:
>>>>
>>>> 1. The bug fixes are a requirement in order to have a stable 4.0. Whether they come from this patch or the original, I have less of an opinion on. I do think it's important to minimize code changes at this time in the development/freeze cycle — including large refactors, which add risk despite how they are being discussed here. From that perspective, I would prefer to see more targeted fixes, but since we don't have them and we have this patch, the decision is different.
>>>>
>>>> 2. We should not be measuring complexity in LoC, with the exception that all 20k lines *do need to be reviewed* (not just the important parts, and because code refactoring tools have bugs too) and more lines take more time. Otherwise, it's a poor metric for how long this will take to review. Further, it seems odd that the authors are projecting how long it will take to review — this should be the charge of the reviewers, and I would like to hear from them on this. It's clear this is a complex patch — as risky as something like 8099 (and the original rewrite by Jason). We should treat it as such and not try to merge it in quickly for the sake of it, repeating past mistakes. The goal of reviewing the messaging service work was to do just that. It would be a disservice to rush in these changes now. If the goal is to get the patch in, that should be the priority, not completing it "in two weeks".
>>>> It's clear several community members have pushed back on that and are not comfortable with the time frame.
>>>>
>>>> 3. If we need to add special forks of Netty classes to the code because of "how we use Netty", that is a concern to me w.r.t. quality, stability, and maintenance. Is there a place that documents/justifies our non-traditional usage? (I saw some in JavaDocs, but found it lacking in *why*, though it had a lot of how/what was changed.) Given folks in the community have decent relationships with the Netty team, perhaps we should leverage that as well. Have we reached out to them?
>>>>
>>>> 4. In principle, I agree with the technical improvements you mention (backpressure / checksumming / etc). These things should be there. Are they a hard requirement for 4.0? In my opinion no, and Dinesh has done a good job of providing some reasons as to why, so I won't go into much detail here. In short, a bug and a missing safety mechanism are not the same thing. It's also important to consider how many users a change like that covers and for what risk — we found a bug in 13304 late in review; had it slipped through, it would have subjected users to silent corruption they thought couldn't occur anymore, because we included the feature in a prod Cassandra release.
>>>>
>>>> 5. The patch could seriously benefit from some commit hygiene that would make it easier for folks to review. Had this been done, not only would review have been easier, but the piecemeal breakup of features/fixes could also have been done more easily. I don't buy the premise that this wasn't possible. If we had to add a feature/fix later, it would have been possible. I'm sure there was a smart way we could have organized it, if it was a priority.
>>>>
>>>> 6. Have any upgrade tests been done/added?
>>>> I would also like to see some performance benchmarks before merging, so that the patch is in a similar state as 14503 in terms of performance testing.
>>>>
>>>> I'm sure I have more thoughts, but these seem like the important ones for now.
>>>>
>>>> Jordan
>>>>
>>>> On Wed, Apr 10, 2019 at 8:21 AM Dinesh Joshi <djos...@icloud.com.invalid> wrote:
>>>>
>>>>> Here are my 2¢.
>>>>>
>>>>> I think the general direction of this work is valuable, but I have a few concerns I'd like to address. More inline.
>>>>>
>>>>>> On Apr 4, 2019, at 1:13 PM, Aleksey Yeschenko <alek...@apache.org> wrote:
>>>>>>
>>>>>> I would like to propose CASSANDRA-15066 [1] - an important set of bug fixes and stability improvements to internode messaging code that Benedict, I, and others have been working on for the past couple of months.
>>>>>>
>>>>>> First, some context. This work started off as a review of CASSANDRA-14503 (Internode connection management is race-prone [2]) and CASSANDRA-13630 (Support large internode messages with netty [3]) - a pre-4.0 confirmatory review of such a major new feature.
>>>>>>
>>>>>> However, as we dug in, we realized this was insufficient. With more than 50 bugs uncovered [4] - dozens of them critical to correctness and/or stability of the system - a substantial rework was necessary to guarantee a solid internode messaging subsystem for the 4.0 release.
>>>>>>
>>>>>> In addition to addressing all of the uncovered bugs [4] that were unique to trunk + 13630 [3] + 14503 [2], we used this opportunity to correct some long-existing, pre-4.0 bugs and stability issues. For the complete list of notable bug fixes, read the comments to CASSANDRA-15066 [1]. But I'd like to highlight a few.
>>>>> Do you have regression tests that will fail if these bugs are reintroduced at a later point?
>>>>>
>>>>>> # Lack of message integrity checks
>>>>>>
>>>>>> It's known that TCP checksums are too weak [5] and Ethernet CRC cannot be relied upon [6] for integrity. With sufficient scale or time, you will hit bit flips. Sadly, most of the time these go undetected. Cassandra's replication model makes this issue much more serious, as the faulty data can infect the cluster.
>>>>>>
>>>>>> We recognised this problem, and recently introduced a fix for server-client messages, implementing CRCs in CASSANDRA-13304 (Add checksumming to the native protocol) [7].
>>>>>
>>>>> This was discussed in my review, and Jason created a ticket [1] to track it. We explicitly decided to defer this work not only due to the feature freeze in the community but also for technical reasons detailed below.
>>>>>
>>>>> Regarding new features during the feature freeze window, we have had such discussions in the past. The most recent was the one I initiated on the Zstd compressor, which went positively, and we moved forward after assessing risk & community consensus.
>>>>>
>>>>> Regarding checksumming, please scroll down to the comments section in the link [2] you provided. You'll notice this discussion –
>>>>>
>>>>>> Daniel Fox Franke:
>>>>>> Please don't design new network protocols that don't either run over TLS or do some other kind of cryptographic authentication. If you have cryptographic authentication, then CRC is redundant.
>>>>>>
>>>>>> Evan Jones:
>>>>>> Good point. Even internal applications inside a data center should be using encryption, and today the performance impact is probably small (I haven't actually measured it myself these days).
>>>>> Enabling TLS & internode compression are mitigation strategies to avoid data corruption in transit. By your own admission in CASSANDRA-13304 [3], we don't require checksumming if TLS is present. Here's your full quote –
>>>>>
>>>>>> Aleksey Yeschenko:
>>>>>> Checksums and TLS are orthogonal. It just so happens that you don't need the former if you already have the latter.
>>>>>
>>>>> I want to be fair: later you did say that we don't want to force people to pay the cost of TLS overhead. However, I would also like to point out that with the introduction of Netty for internode communication, we have 4-5x the TLS performance thanks to the OpenSSL JNI bindings; you can refer to Norman's or my talks on the topic. So TLS is practical, & compression is necessary for performance. Both strategies work fine to protect against data corruption, making checksumming redundant. With SSL certificate hot reloading, it also avoids unnecessary cluster restarts, providing maximum availability.
>>>>>
>>>>> In the same vein, it's 2019, and if people are not using TLS for internode, then that is really bad for data security in our industry and we should not be encouraging it. In fact, I would go so far as to make TLS the default.
>>>>>
>>>>> Managing TLS infrastructure is beyond Cassandra's scope, and operators should have figured it out by now, for their own & their users' sake. Cassandra makes it super easy & performant to have TLS enabled. People should be using it.
>>>>>
>>>>>> But until CASSANDRA-15066 [1] lands, this is also a critical flaw internode. We have addressed it by ensuring that no matter what, whether you use SSL or not, whether you use internode compression or not, a protocol-level CRC is always present, for every message frame.
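For illustration, a protocol-level per-frame CRC along the lines described above might look like this minimal sketch. The class name and the [length][payload][crc] framing are hypothetical, using plain java.util.zip.CRC32; the actual CASSANDRA-15066 frame format is more involved.

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical sketch: a self-delimiting frame of [length][payload][crc32],
// verified on receipt regardless of TLS/compression settings.
public final class FramedCrc {
    // Encode: 4-byte payload length, payload bytes, 4-byte CRC32 of the payload.
    public static ByteBuffer encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteBuffer frame = ByteBuffer.allocate(4 + payload.length + 4);
        frame.putInt(payload.length);
        frame.put(payload);
        frame.putInt((int) crc.getValue());
        frame.flip();
        return frame;
    }

    // Decode: recompute the CRC and reject the frame on mismatch (bit flip in transit).
    public static byte[] decode(ByteBuffer frame) {
        int len = frame.getInt();
        byte[] payload = new byte[len];
        frame.get(payload);
        int expected = frame.getInt();
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        if ((int) crc.getValue() != expected)
            throw new IllegalStateException("frame CRC mismatch: corrupt payload");
        return payload;
    }

    public static void main(String[] args) {
        byte[] msg = "mutation:ks.table".getBytes();
        byte[] roundTripped = decode(encode(msg));
        System.out.println("round trip ok: " + java.util.Arrays.equals(msg, roundTripped));

        // Simulate a single bit flip in transit: decode must now fail.
        ByteBuffer corrupted = encode(msg);
        corrupted.put(6, (byte) (corrupted.get(6) ^ 0x01));
        boolean caught = false;
        try { decode(corrupted); } catch (IllegalStateException e) { caught = true; }
        System.out.println("corruption detected: " + caught);
    }
}
```

The point of the debate above is exactly this receive path: with TLS, the record MAC already rejects corrupted frames, so the explicit CRC check is redundant; without TLS, it is the only application-level defence.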
>>>>>> It's our deep and sincere belief that shipping a new implementation of the messaging protocol without application-level data integrity checks would be unacceptable for a modern database.
>>>>>
>>>>> My previous technical arguments have provided enough evidence that protocol-level checksumming is not a show stopper.
>>>>>
>>>>> The only reason I see for adding checksums to the protocol is that some users don't want to enable TLS and internode compression. As it stands, from your comments [7] it seems to be mandatory, and it adds unnecessary overhead when TLS and/or compression is enabled. Frankly, I don't think we need to risk destabilizing trunk for these use-cases. I want to reiterate that I believe in doing the right thing, but we have to make acceptable tradeoffs – as a community.
>>>>>
>>>>>> # Lack of back-pressure and memory usage constraints
>>>>>>
>>>>>> As it stands today, it's far too easy for a single slow node to become overwhelmed by messages from its peers. Conversely, multiple coordinators can be made unstable by the backlog of messages to deliver to just one struggling node.
>>>>>
>>>>> This is a known issue, and it could have been addressed as a separate bug fix – one that could be independently verified.
>>>>>
>>>>>> To address this problem, we have introduced strict memory usage constraints that apply TCP-level back-pressure, on both outbound and inbound. It is now impossible for a node to be swamped on inbound, and on outbound it is made significantly harder to overcommit resources. It's a simple, reliable mechanism that drastically improves cluster stability under load, and especially overload.
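The mechanism described above can be sketched roughly as follows. This is a hypothetical illustration of the general idea, not the CASSANDRA-15066 implementation: cap the bytes of in-flight inbound messages with a permit pool, so that when the budget is spent the reader blocks, stops draining the socket, and lets TCP flow control push back on the sender. The class name and the 1 MiB cap are invented for the example.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: a byte budget for in-flight inbound messages.
// When exhausted, acquire() blocks, which stalls the socket reader and
// translates into TCP-level back-pressure on the sending node.
public final class InboundLimiter {
    private final Semaphore permits;

    public InboundLimiter(int maxInFlightBytes) {
        this.permits = new Semaphore(maxInFlightBytes);
    }

    // Called before deserializing a message; blocks once the budget is spent.
    public void acquire(int messageBytes) throws InterruptedException {
        permits.acquire(messageBytes);
    }

    // Called after the message has been processed and its memory released.
    public void release(int messageBytes) {
        permits.release(messageBytes);
    }

    public int availableBytes() {
        return permits.availablePermits();
    }

    public static void main(String[] args) throws InterruptedException {
        InboundLimiter limiter = new InboundLimiter(1 << 20); // 1 MiB budget
        limiter.acquire(512 * 1024);                          // half the budget in flight
        System.out.println("available after acquire: " + limiter.availableBytes());
        limiter.release(512 * 1024);                          // message processed
        System.out.println("available after release: " + limiter.availableBytes());
    }
}
```

The same shape, applied on outbound with per-peer queues, is what makes it harder for a coordinator to overcommit memory to one struggling replica.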
>>>>>> Cassandra is a mature system, and introducing an entirely new messaging implementation without resolving this fundamental stability issue is difficult to justify, in our view.
>>>>>
>>>>> This is great in theory. I would really like to see objective measurements, like Chris did in CASSANDRA-14654 [4]. Netflix engineers tested the Netty refactor on a 200-node cluster [5] and Zero Copy Streaming [6], and reported their results. It's invaluable work. It would be great to see something similar.
>>>>>
>>>>>> # State of the patch, feature freeze and 4.0 timeline concerns
>>>>>>
>>>>>> The patch is essentially complete, with much improved unit tests all passing, dtests green, and extensive fuzz testing underway - with initial results all positive. We intend to further improve in-code documentation and test coverage in the next week or two, and do some minor additional code review, but we believe it will be basically good to commit in a couple of weeks.
>>>>>
>>>>> A 20K LoC patch is unverifiable, especially without much documentation. It places an undue burden on reviewers. It also messes up everyone's branches once you commit such a large refactor. Let's be considerate of others in the community. It is good engineering practice to limit patches to a reasonable size.
>>>>>
>>>>> More importantly, such large refactors lend themselves to new bugs that cannot be caught easily unless you have a very strong regression test suite, which Cassandra arguably lacks. Therefore I am of the opinion that the bug fixes can be applied piecemeal to the codebase. They should be small enough that they can be individually reviewed and independently verified.
>>>>>
>>>>> I also noticed that you have replaced Netty's classes [7].
>>>>> I am of the opinion that they should be upstreamed if they're better, so the wider Netty community benefits from them and we don't have to maintain our own classes.
>>>>>
>>>>>> P.S. I believe that once it's committed, we should cut our first 4.0 alpha. It's about time we started this train (:
>>>>>
>>>>> +1 on working towards an alpha, but that is a separate discussion from this issue.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Dinesh
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-14578
>>>>> [2] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
>>>>> [3] https://issues.apache.org/jira/browse/CASSANDRA-13304?focusedCommentId=16183034&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16183034
>>>>> [4] https://issues.apache.org/jira/browse/CASSANDRA-14654
>>>>> [5] https://issues.apache.org/jira/browse/CASSANDRA-14747
>>>>> [6] https://issues.apache.org/jira/browse/CASSANDRA-14765
>>>>> [7] https://issues.apache.org/jira/browse/CASSANDRA-15066?focusedCommentId=16801277&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16801277
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
>>>>> For additional commands, e-mail: dev-h...@cassandra.apache.org
>>
>> --
>> alex p