Re: Stabilising Internode Messaging in 4.0

2019-04-10 Thread Benedict Elliott Smith
Appreciate everyone's input. It sounds like there's broad agreement that the 
fixes need to make it into 4.0, which I'm really pleased to see. The question 
seems to be moving towards scope and timing.

TL;DR: This patch is probably the most efficient route to a stable 4.0. The 
patch is already reviewed, extensively tested, and more thorough testing is 
coming. From our perspective, a timeline for merging this is on the order of 
two weeks.

To best answer the question of ideal scope and timing, it's perhaps worthwhile 
considering what our ideal approach would look like, given realistic breathing 
room, and then how other pressures might prefer to modify that. In my opinion, 
both views endorse proceeding with the patch largely as it stands, though I 
will highlight some of the strongest cases for splitting.

To answer the first question: Were I starting with 3.0, and producing a major 
refactor of messaging, would I have produced this patch, or one very like it? 
The answer is yes.

These changes have been designed together, intentionally. By making them 
simultaneously, we not only reduce the review and work burden, we reduce the 
overall risk by ensuring all of the semantics of each component are designed in 
harmony. A trickle of changes requires subtly updating the semantics of each 
component, across multiple constantly shifting interfaces between old and new. 
Each time we touch one of these interfaces, we risk failing to properly modify 
(or more likely hack) one side to understand the other, and carrying those 
misunderstandings and hacks through to later patches.

Each time we do this, it is costly in both time and risk. The project can't 
really afford either, even in the wider picture sense. There is already too 
much work to do, and risk to mitigate.

Just as importantly, most of these changes are necessary, and actually fix 
bugs. For example, the “performance improvements” from re-implementing Netty 
classes are really there to avoid bugs in our usage, since the originals are 
designed for use only on the event loop.

- Using our own allocator avoids cross-thread recycling, which can cause 
unbounded heap growth in Netty today (and is the topic of active discussion on 
the Netty bug tracker [1]). We already have our own allocator that works, and 
arguably it is lowest risk to repurpose it. We also reduce risk by using 
existing ByteBuffer methods, keeping the code more idiomatic.
- AsyncPromise is used to avoid blocking the event loop (see the sketch after 
this list). Modifying and invoking listeners on a DefaultPromise requires 
taking a mutex, and we do this on threads besides the event loop. This could 
stall the event loop for an entire scheduler quantum (or more, if the owning 
thread is low priority or has been extremely busy recently), which is a bug 
for a real-time networking thread that could be servicing dozens of 
connections.
- FutureCombiner in Netty is not thread-safe, and we need it to be.
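
To illustrate the AsyncPromise point: below is a minimal sketch of a lock-free 
promise in the same spirit (illustrative only — the patch's actual AsyncPromise 
is richer and its details may differ). DefaultPromise guards its listener list 
with synchronized(this), so completing a promise or adding a listener from a 
non-event-loop thread can hold a monitor the event loop then blocks on; here 
both operations are a single CAS instead:

    import java.util.concurrent.atomic.AtomicReference;
    import java.util.function.Consumer;

    @SuppressWarnings("unchecked")
    final class SketchPromise<V>
    {
        // Pending listeners form an immutable chain; once completed,
        // `state` holds the result value instead. No double-completion
        // guard — this is a sketch, not production code.
        private static final class Listener<V>
        {
            final Consumer<V> action;
            final Listener<V> next;
            Listener(Consumer<V> action, Listener<V> next) { this.action = action; this.next = next; }
        }

        private final AtomicReference<Object> state = new AtomicReference<>(null);

        void addListener(Consumer<V> action)
        {
            while (true)
            {
                Object cur = state.get();
                if (cur == null || cur instanceof Listener)
                {
                    if (state.compareAndSet(cur, new Listener<>(action, (Listener<V>) cur)))
                        return;                 // registered without locking
                }
                else
                {
                    action.accept((V) cur);     // already complete: fire now
                    return;
                }
            }
        }

        void setSuccess(V result)
        {
            Object prev = state.getAndSet(result);  // publish result atomically
            for (Listener<V> l = (Listener<V>) prev; l != null; l = l.next)
                l.action.accept(result);            // notify without holding a lock
        }
    }

No thread ever waits on another here: the worst case is a retried CAS, rather 
than a monitor held across listener invocation.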

Some are perhaps not bug fixes, but a fundamental part of the design of the 
patch. You either get them for free, or you write a different patch. For 
example, checksumming falls naturally out of framing, which is integral to the 
new message flow. Dropping messages eagerly when reading off the wire is 
likewise simply natural in this design. Back pressure falls roughly into this 
category as well, and fixes a major stability bug - perhaps the biggest 
stability bug we have today.
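
To make the framing point concrete, here is a toy frame codec (not the actual 
4.0 frame format — the real layout, CRC choice, and corruption handling 
differ): once every message travels inside a length-prefixed frame, the 
checksum is a few extra lines at each end:

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    final class ToyFrameCodec
    {
        // Invented frame layout, for illustration only:
        //   [4-byte length][payload][4-byte CRC over length + payload]
        static ByteBuffer encode(ByteBuffer payload)
        {
            int len = payload.remaining();
            ByteBuffer frame = ByteBuffer.allocate(4 + len + 4);
            frame.putInt(len);
            frame.put(payload.duplicate());

            CRC32 crc = new CRC32();
            ByteBuffer covered = frame.duplicate();
            covered.flip();                      // CRC covers prefix + payload
            crc.update(covered);
            frame.putInt((int) crc.getValue());  // trailer
            frame.flip();
            return frame;
        }

        // Expects `frame` positioned at 0 with limit at the frame end.
        static ByteBuffer decode(ByteBuffer frame)
        {
            CRC32 crc = new CRC32();
            ByteBuffer covered = frame.duplicate();
            covered.limit(frame.limit() - 4);    // everything but the trailer
            crc.update(covered);
            if (frame.getInt(frame.limit() - 4) != (int) crc.getValue())
                throw new IllegalStateException("frame CRC mismatch: corrupt frame");

            ByteBuffer payload = frame.duplicate();
            payload.position(4).limit(frame.limit() - 4);
            return payload.slice();              // payload only
        }
    }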

The new handshake protocol is probably the best candidate for splitting into 
its own patch. There's a good case to be made here, but the question is: what 
are we hoping to achieve by separating it? Reviewing these specific changes is 
not a significant burden, I don't think. Testing it independently probably 
doesn't yield significant benefit - if anything it's probably advantageous to 
include the changes in our atypically thorough testing of this subsystem.

Perhaps we can take this discussion onto Jira, to dive into the specifics of 
more items? I've tried to keep this high level and to select only a few items, 
and it's perhaps already too deep in the weeds. We're absolutely open to 
modifying the scope where it clearly makes sense, and that's probably best 
established with some deep discussions on specific points.

So, what about our current pressures?

I think it's clear that landing this patch is the most efficient route to a 
4.0 release. It has already exceeded the normal barrier for inclusion in the 
project - it has been reviewed by two people (50% each by me and Aleksey, and 
100% by Alex Petrov). It is already well tested, and I will shortly be posting 
a test plan to the Wiki outlining plans for further coverage. Suffice to say 
we anticipate atypically thorough test coverage before the patch is committed, 
and extremely high confidence in the result. We also don't anticipate this 
taking a significant length of time to achieve.

Attempting to incorporate the patch in a piecemeal fashion can only slow things 
down and, in my

Re: Stabilising Internode Messaging in 4.0

2019-04-10 Thread Oleksandr Petrov
To be fair, even though the patch totals 20K LoC, the core of the patch is
within reasonable bounds (around net.async.*). There are many changes because
code got moved around, and some parts of the change look large because Java
is quite a verbose language (e.g., metric tables). Unfortunately, I do not
see a good way to split out the bug fixes specific to the Netty refactor from
the other changes, since some things were fundamentally reworked.

I'll leave a more elaborate comment summarising the way I've been approaching
the patch review, which might help identify the "important" parts one should
concentrate on while reviewing.

I've been reviewing the patch and can say that it has several advantages,
including the fact that the incoming and outgoing paths are now easier to
test (e.g., to write unit tests for). This has helped to validate rather
complex scenarios, such as the handshake protocol, back pressure, dropping
messages, large messages, expiry, counting, and more. The patch also
integrates well with in-jvm tests, which allowed us to verify several
behaviours that would otherwise have been difficult to reproduce.
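
For readers who haven't seen the in-jvm dtest framework, a test exercising the
real internode path looks roughly like this (a sketch only — exact package
names and method signatures in the framework may differ from what's shown):

    import org.apache.cassandra.distributed.Cluster;
    import org.apache.cassandra.distributed.api.ConsistencyLevel;
    import org.junit.Assert;
    import org.junit.Test;

    public class MessagingSmokeTest
    {
        @Test
        public void writeAtAllThenReadLocally() throws Throwable
        {
            // Two Cassandra nodes inside one JVM, talking over the real
            // internode messaging subsystem.
            try (Cluster cluster = Cluster.create(2))
            {
                cluster.schemaChange("CREATE KEYSPACE ks WITH replication = " +
                                     "{'class': 'SimpleStrategy', 'replication_factor': 2}");
                cluster.schemaChange("CREATE TABLE ks.tbl (pk int PRIMARY KEY, v int)");

                // Coordinate through node 1 at ALL, forcing the write to
                // cross internode messaging to reach node 2.
                cluster.coordinator(1).execute("INSERT INTO ks.tbl (pk, v) VALUES (1, 1)",
                                               ConsistencyLevel.ALL);

                // Read node 2's local copy to confirm delivery.
                Object[][] rows = cluster.get(2).executeInternal("SELECT v FROM ks.tbl WHERE pk = 1");
                Assert.assertEquals(1, rows[0][0]);
            }
        }
    }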

I do agree that validating this patch is no easy endeavor. That said, at this
point we would have to invest a similar amount of time to validate 4.0
whether or not this patch lands.

I'm personally leaning towards 4.0, since that keeps the community unified,
testing the same version together, instead of some concentrating on making
4.0 work while others skip it and proceed with 4.next.



Re: Stabilising Internode Messaging in 4.0

2019-04-10 Thread Dinesh Joshi
Here are my 2¢.

I think the general direction of this work is valuable but I have a few 
concerns I’d like to address. More inline.

> On Apr 4, 2019, at 1:13 PM, Aleksey Yeschenko  wrote:
> 
> I would like to propose CASSANDRA-15066 [1] - an important set of bug fixes
> and stability improvements to internode messaging code that Benedict, I,
> and others have been working on for the past couple of months.
> 
> First, some context.   This work started off as a review of CASSANDRA-14503
> (Internode connection management is race-prone [2]), CASSANDRA-13630
> (Support large internode messages with netty) [3], and a pre-4.0
> confirmatory review of such a major new feature.
> 
> However, as we dug in, we realized this was insufficient. With more than 50
> bugs uncovered [4] - dozens of them critical to correctness and/or
> stability of the system - a substantial rework was necessary to guarantee a
> solid internode messaging subsystem for the 4.0 release.
> 
> In addition to addressing all of the uncovered bugs [4] that were unique to
> trunk + 13630 [3] + 14503 [2], we used this opportunity to correct some
> long-existing, pre-4.0 bugs and stability issues. For the complete list of
> notable bug fixes, read the comments to CASSANDRA-15066 [1]. But I’d like
> to highlight a few.

Do you have regression tests that will fail if these bugs are reintroduced at a 
later point?

> # Lack of message integrity checks
> 
> It’s known that TCP checksums are too weak [5] and Ethernet CRC cannot be
> relied upon [6] for integrity. With sufficient scale or time, you will hit
> bit flips. Sadly, most of the time these go undetected.  Cassandra’s
> replication model makes this issue much more serious, as the faulty data
> can infect the cluster.
> 
> We recognised this problem, and recently introduced a fix for server-client
> messages, implementing CRCs in CASSANDRA-13304 (Add checksumming to the
> native protocol) [7].

This was discussed in my review and Jason created a ticket[1] to track it. We 
explicitly decided to defer this work not only due to the feature freeze in the 
community but also for technical reasons detailed below. 

Regarding new features during the feature freeze window, we have had such 
discussions in the past. The most recent was the one I initiated on the Zstd 
compressor, which went positively, and we moved forward after assessing risk 
& community consensus.

Regarding checksumming, please scroll down to the comments section in the 
link[2] you provided. You'll notice this discussion –

>> Daniel Fox Franke:
>> Please don't design new network protocols that don't either run over TLS or 
>> do some other kind of cryptographic authentication. If you have 
>> cryptographic authentication, then CRC is redundant.
> 
> Evan Jones:
> Good point. Even internal applications inside a data center should be using 
> encryption, and today the performance impact is probably small (I haven't 
> actually measured it myself these days).

Enabling TLS & internode compression are mitigation strategies to avoid data 
corruption in transit. By your own admission in CASSANDRA-13304[3], we don't 
require checksumming if TLS is present. Here's your full quote –

> Aleksey Yeschenko:
> Checksums and TLS are orthogonal. It just so happens that you don't need the 
> former if you already have the latter.

I want to be fair: you did later say that we don't want to force people to pay 
the cost of TLS overhead. However, I would also like to point out that with 
the introduction of Netty for internode communication, we have 4-5x the TLS 
performance thanks to the OpenSSL JNI bindings. You can refer to Norman's or 
my talks on the topic. So TLS is practical, and compression is necessary for 
performance. Both strategies work fine to protect against data corruption, 
making checksumming redundant. With SSL certificate hot reloading, TLS also 
avoids unnecessary cluster restarts, providing maximum availability.
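
For the curious, selecting the OpenSSL-backed engine in Netty is a small 
amount of code — the SslContextBuilder and OpenSsl calls below are standard 
Netty 4.1 API, while the class name and certificate paths are placeholders:

    import java.io.File;
    import javax.net.ssl.SSLException;
    import io.netty.handler.ssl.OpenSsl;
    import io.netty.handler.ssl.SslContext;
    import io.netty.handler.ssl.SslContextBuilder;
    import io.netty.handler.ssl.SslProvider;

    final class InternodeSsl
    {
        // Prefer the netty-tcnative (OpenSSL JNI) engine when it is on
        // the classpath; fall back to the JDK engine otherwise. The
        // OpenSSL engine is where the TLS throughput gain comes from.
        static SslContext serverContext() throws SSLException
        {
            SslProvider provider = OpenSsl.isAvailable() ? SslProvider.OPENSSL
                                                         : SslProvider.JDK;
            return SslContextBuilder
                   .forServer(new File("/path/to/node.crt"), new File("/path/to/node.key"))
                   .sslProvider(provider)
                   .build();
        }
    }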

In the same vein, it's 2019, and if people are not using TLS for internode 
traffic, that is really, really bad for data security in our industry, and we 
should not be encouraging it. In fact, I would go so far as to make TLS the 
default.

Managing TLS infrastructure is beyond Cassandra's scope, and operators should 
have figured it out by now, for their own sake and their users'. Cassandra 
makes it super easy & performant to have TLS enabled. People should be using 
it.

> But until CASSANDRA-15066 [1] lands, this is also a critical flaw
> internode. We have addressed it by ensuring that no matter what, whether
> you use SSL or not, whether you use internode compression or not, a
> protocol level CRC is always present, for every message frame. It’s our
> deep and sincere belief that shipping a new implementation of the messaging
> protocol without application-level data integrity checks would be
> unacceptable for a modern database.

My previous technical arguments have provided enough evidence that 
protocol-level checksumming is not a show stopper.

Cassandra 4.0 Quality and Stability Update

2019-04-10 Thread Jordan West
In September, the community chose to freeze trunk to begin working on
Quality and Stability, with the goal of releasing the most stable Cassandra
major in the project’s history. While lots of work has been ongoing and
folks could follow along with progress on JIRA, I thought it would be useful
to cover what has been accomplished so far, since I’ve spent a good amount
of time working with others on various testing projects.

During this time we have made significant progress on improving the Quality
and Stability of Cassandra — not only Cassandra 4.0 but also the Cassandra
3.x series and future Cassandra releases. Additionally, testing has
provided the opportunity for new community members and committers to
contribute. While not comprehensive, the community has found at least 25
bugs that can be classified as Data Loss, Corruption, Incorrect Response,
Loss of Stability, Loss of Availability, Concurrency Issues, Performance
Issues, or Lack of Safety. These bugs have been found by a variety of
methodologies, including commonly used ones like unit testing and canary
deployments. However, the majority of the bugs have been found or confirmed
using new methodologies like the ones described in some recent blog
posts [1] [2].

Additionally, the state of the test suites and test tooling has improved.
CASSANDRA-14806 [3] brought some much-welcome improvements to the circleci
workflow and made it easier for people to run (d)tests on supported
platforms (jdk8/11), and the work to get upgrade tests running found several
bugs, including CASSANDRA-14958 [4].

While we have made significant progress, there is still more to do before we
can be truly confident in a Cassandra 4.0 release. Some ongoing and
outstanding work includes:

* Improving the state of the cqlsh tests [5]
* There is ongoing discussion on the new MessagingService [6] which will
require significant review and testing
* Additional upgrade testing for Cassandra 4.0 including additional support
for upgrade testing using in-jvm dtests [7]
* Work to increase coverage of important areas and new features in
Cassandra 4.0 [8]

While the list above may seem short, the last item contains a long list of
important areas the community has previously discussed adding coverage to.
If you are looking for areas to contribute, this is a great starting point.
If there is a name down on an area you are interested in, I would encourage
you to reach out to them to discuss how you can help further increase the
community’s confidence in the Quality and Stability of Cassandra.

Below is an incomplete list of many of the severe bugs found during this
part of the release cycle. Thanks again to all of the community members who
contributed to finding these bugs and improving Cassandra for everyone.

CASSANDRA-15004: Anti-compaction briefly removes sstables from the read path
CASSANDRA-14958: Counters fail to increment on 2.X to 3.X mixed version
clusters
CASSANDRA-14936: Anticompaction should throw exceptions on errors, not just
log them
CASSANDRA-14672: After deleting data in 3.11.3, reads fail: "open marker
and close marker have different deletion times"
CASSANDRA-14912: LegacyLayout errors on collection tombstones from dropped
columns
CASSANDRA-14843: Drop/add column name with different Kind can result in
corruption
CASSANDRA-14568: CorruptSSTableExceptions in 3.0.17.1 (CASSANDRA-14568 v2)
Static collection deletions are corrupted in 3.0 <-> 2.{1,2} messages
CASSANDRA-14749: Collection Deletions for Dropped Columns in 2.1/3.0
mixed-mode can delete rows
CASSANDRA-14568: Static collection deletions are corrupted in 3.0 ->
2.{1,2} messages
CASSANDRA-14861: Inaccurate sstable min/max metadata can cause data loss
CASSANDRA-14823: Legacy sstables with range tombstones spanning multiple
index blocks create invalid bound sequences on 3.0+ (#1193)
CASSANDRA-14873: Missing rows when reading 2.1 SSTables in 3.0
CASSANDRA-14838: Dropped columns can cause reverse sstable iteration to
return prematurely
CASSANDRA-14803: Rows that cross index block boundaries can cause
incomplete reverse reads in some cases.
CASSANDRA-14766: DESC order reads can fail to return the last Unfiltered in
the partition (#1170)
CASSANDRA-14991: SSL Cert Hot Reloading should defensively check for sanity
of the new keystore/truststore before loading it
CASSANDRA-14794: Avoid calling iter.next() in a loop when notifying
indexers about range tombstones
CASSANDRA-14780: Avoid creating empty compaction tasks after truncate
CASSANDRA-14657: Handle failures in upgradesstables/cleanup/relocatee
CASSANDRA-14638: Column result order can change in 'SELECT *' results when
upgrading from 2.1 to 3.0 causing response corruption for queries using
prepared statements when static columns are used
CASSANDRA-14919: Regression in paging queries in mixed version clusters
CASSANDRA-14554: LifecycleTransaction encounters
ConcurrentModificationException when used in multi-threaded context
CASSANDRA-14935: PendingAntiCompaction should be mor

Re: Stabilising Internode Messaging in 4.0

2019-04-10 Thread Jordan West
There is a lot to discuss here, so I’ll try to keep my opinions brief:

1. The bug fixes are a requirement in order to have a stable 4.0. Whether
they come from this patch or the original, I have less of an opinion. I do
think it’s important to minimize code changes at this time in the
development/freeze cycle — including large refactors, which add risk despite
how they are being discussed here. From that perspective, I would prefer to
see more targeted fixes, but since we don’t have them and we do have this
patch, the decision is different.

2. We should not be measuring complexity in LoC, with the exception that all
20k lines *do need to be reviewed* (not just the important parts, because
code refactoring tools have bugs too) and more lines take more time.
Otherwise, it’s a poor metric for how long this will take to review.
Further, it seems odd that the authors are projecting how long it will take
to review — this should be the charge of the reviewers, and I would like to
hear from them on this. It’s clear this is a complex patch — as risky as
something like 8099 (and the original rewrite by Jason). We should treat it
as such and not try to merge it in quickly for the sake of it, repeating
past mistakes. The goal of reviewing the messaging service work was to do
just that. It would be a disservice to rush in these changes now. If the
goal is to get the patch in, that should be the priority, not completing it
“in two weeks”. It’s clear several community members have pushed back on
that and are not comfortable with the time frame.

3. If we need to add special forks of Netty classes to the code because of
“how we use Netty”, that is a concern to me w.r.t. quality, stability, and
maintenance. Is there a place that documents/justifies our non-traditional
usage? (I saw some in the JavaDocs, but it was heavy on how/what was changed
and light on *why*.) Given folks in the community have decent relationships
with the Netty team, perhaps we should leverage that as well. Have we
reached out to them?

4. In principle, I agree with the technical improvements you mention
(backpressure / checksumming / etc). These things should be there. Are they
a hard requirement for 4.0? In my opinion, no, and Dinesh has done a good
job of providing some reasons as to why, so I won’t go into much detail
here. In short, a bug and a missing safety mechanism are not the same thing.
It’s also important to consider how many users a change like that covers and
at what risk — we found a bug in 13304 late in review; had it slipped
through, it would have subjected users to silent corruption they thought
could no longer occur, because we included the feature in a production
Cassandra release.

5. The patch could seriously benefit from some commit hygiene that would
make it easier for folks to review. Had this been done, not only would
review be easier, but the piecemeal breakup of features/fixes could also
have been done more easily. I don’t buy the premise that this wasn’t
possible. If we had to add the feature/fix later, it would have been
possible. I’m sure there was a smart way we could have organized it, if it
had been a priority.

6. Have any upgrade tests been done/added? I would also like to see some
performance benchmarks before merging, so that the patch is in a similar
state to 14503 in terms of performance testing.

I’m sure I have more thoughts but these seem like the important ones for
now.

Jordan


Re: Stabilising Internode Messaging in 4.0

2019-04-10 Thread Sankalp Kohli
I think we should wait for the testing doc on Confluence to come up, and 
discuss what needs to be added there to gain confidence.

If splitting the patch into smaller parts is more work and delays 4.0 even 
further, can we use the time instead to add more test coverage, 
documentation, and design docs to this component? Would that be a middle 
ground here?

Examples of large changes going badly come down to lack of testing, 
documentation, and design docs, not to the changes being big. Being big adds 
complexity and increases the total bug count, but small changes can be 
equally bad in terms of impact.



Re: Stabilising Internode Messaging in 4.0

2019-04-10 Thread Dinesh Joshi
> On Apr 10, 2019, at 9:06 AM, Sankalp Kohli  wrote:
> 
> I think we should wait for the testing doc on Confluence to come up, and 
> discuss what needs to be added there to gain confidence.
> 
> If splitting the patch into smaller parts is more work and delays 4.0 even 
> further, can we use the time instead to add more test coverage, 
> documentation, and design docs to this component? Would that be a middle 
> ground here?

+1 for finding the right middle ground.

If splitting is an issue, we can look for opportunities to split this. Here 
are the benefits –

1. It makes verifying and testing the code a lot easier.
2. It helps us pinpoint all sorts of regressions (git bisect, anybody?) 
precisely.
3. It gives us the opportunity to add the correct interfaces. More on this 
later.
4. It helps make the review quicker.

TBH, one of the big reasons we tend to drop such large PRs is that 
Cassandra's code is highly intertwined, which makes it hard to change things 
precisely. We need to iterate towards interfaces that allow us to move 
quickly and reduce the amount of highly intertwined code. It helps with 
testing as well. I want us to have a meaningful discussion around this before 
we drop a big PR.

> Examples of large changes going badly come down to lack of testing, 
> documentation, and design docs, not to the changes being big. Being big 
> adds complexity and increases the total bug count, but small changes can be 
> equally bad in terms of impact.

+1 so many times, Sankalp — I recognize this as well. Thank you for calling 
it out.

I was working with some of the PMC members to develop a document with 
concrete guidelines to help us standardize and avoid issues like these. For 
various reasons, that document did not go anywhere. We had a similar debate 
on the sidecar, and we used a CIP to help with it. It did help make the 
discussion more meaningful.

Lack of documentation, as many people have called out, makes it really hard 
for us to review things, especially with such a large change.

Dinesh



Re: Stabilising Internode Messaging in 4.0

2019-04-10 Thread Oleksandr Petrov
Sorry to pick only a few points to address, but I think these are the ones
where I can contribute productively to the discussion.

> In principle, I agree with the technical improvements you
mention (backpressure / checksumming / etc). These things should be there.
Are they a hard requirement for 4.0?

One thing that comes to mind is protocol versioning and consistency. If the
changes adding checksumming and the handshake do not make it into 4.0, we
grow the upgrade matrix and have to put the changes into a separate protocol
version. I'm not sure how many other internode protocol changes we have
planned for 4.next, but this is definitely something we should keep in mind.
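
A hypothetical sketch of that cost (every name and constant below is invented
for illustration): once 4.0 ships, a wire change that slips to 4.next means
every serializer touching the changed format must branch on the peer's
messaging version, and both paths must be maintained and upgrade-tested:

    import java.io.DataOutput;
    import java.io.IOException;

    final class VersionedFlagsSerializer
    {
        static final int VERSION_40 = 12;   // assumed 4.0 internode version
        static final int VERSION_4X = 13;   // hypothetical 4.next bump

        static void serialize(int flags, int crc, DataOutput out, int peerVersion)
                throws IOException
        {
            out.writeInt(flags);
            if (peerVersion >= VERSION_4X)
                out.writeInt(crc);          // field exists only on the newer protocol
            // Older peers get no checksum: this branch (and its tests,
            // and the mixed-version upgrade tests) lives on until
            // VERSION_40 peer support is dropped.
        }
    }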

> 2. We should not be measuring complexity in LoC, with the exception that
all 20k lines *do need to be reviewed* (not just the important parts, because
code refactoring tools have bugs too) and more lines take more time.

Everything should definitely be reviewed, but with different rigour: it is
one thing to review byte arithmetic and protocol formats, and a completely
different thing to verify that a Verb moved from one place to another is
still used. Concentrating on a smaller subset doesn't make the patch smaller;
it just helps guide reviewers and observers. My primary goal was to help
people, hence my reference to the Jira comment I'm working on.

