I don't have a lot to add to Josh's contribution, except that I'd like to really hammer home that many people were a party to 8099, and as a project we learned a great deal from the experience. It's a very complex topic that does not lend itself to simple comparisons, but I think anyone who participated in that work would find it strange to see these two pieces of work compared. I think it would also be helpful if we stopped using it as some kind of bogeyman. It seems too easy to forget how much positive change came out of 8099, and how many bugs we have since avoided because of it. A lot of people put a herculean effort into making it happen: by my recollection, Sylvain alone spent perhaps a year of his time on the initial and follow-up work. Most of the active contributors participated in some way, for months in many cases. Every time we talk about it in this way, it denigrates a lot of good work. Using it as a rhetorical device without seemingly appreciating what was involved, or where it went wrong, is even less helpful.
On a personal note, I found a couple of the responses to this thread disappointing. As far as I can tell, neither engaged with my email, in which I justify our approach on most of their areas of concern; nor accepted the third-party reviewer's comments that the patch is manageable to review and of acceptable scope; nor, seemingly, read the patch with care to reach their own conclusion - with the one concrete factual assertion about the code being false. We're trying to build a more positive and constructive community here than there has been in the past. I want to encourage and welcome critical feedback, but I think it is incumbent on critics to do some basic research and to engage with the target of their criticism - lest they appear to have a goal of frustrating a body of work rather than improving it. Please take a moment to read my email, take a closer look at the patch itself, and then engage with us on Jira with specific, constructive feedback and concrete, positive suggestions.

I'd like to thank everyone else for taking the time to provide their thoughts, and we hope to address any lingering concerns. I would love to hear your feedback on our testing and documentation plan [1] that we have put together and are executing on.

[1] https://cwiki.apache.org/confluence/display/CASSANDRA/4.0+Internode+Messaging+Test+Plan

> On 12 Apr 2019, at 08:56, Pavel Yaskevich <pove...@gmail.com> wrote:
>
> On Thu, Apr 11, 2019 at 10:15 PM Joshua McKenzie <jmcken...@apache.org> wrote:
>
>> As one of the two people that re-wrote all our unit tests to try and help Sylvain get 8099 out the door, I think it's inaccurate to compare the scope and potential stability impact of this work to the truly sweeping work that went into 8099 (not to downplay the scope and extent of this work here).

>>> TBH, one of the big reasons we tend to drop such large PRs is the fact that Cassandra's code is highly intertwined and it makes it hard to precisely change things. We need to iterate towards interfaces that allow us to iterate quickly and reduce the amount of highly intertwined code. It helps with testing as well. I want us to have a meaningful discussion around it before we drop a big PR.

>> This has been a huge issue with our codebase since at least back when I first encountered it five years ago. To date, while we have made progress on this front, it's been nowhere near sufficient to mitigate the issues in the codebase and allow for large, meaningful changes in smaller incremental patches or commits. Having yet another discussion around this (there have been many, many of them over the years) as a blocker for significant work to go into the codebase is unnecessary and dangerous. Not to say we shouldn't formalize a path to try and make incremental progress to improve the situation, far from it, but blocking other progress on a decade's worth of accumulated hygiene problems isn't going to make the community focus on fixing those problems imo, it'll just turn away contributions.

>> So let me second jd's (and many others') opinion here: "it makes sense to get it right the first time, rather than applying bandaids to 4.0 and rewriting things for 4.next". And fwiw, asking people who have already done a huge body of work to reformat that work into a series of commits, or to break up that work in a fashion that's more to the liking of people not involved in either the writing of the patch or the reviewing of it, doesn't make much sense to me. As I am neither an assignee nor a reviewer on this contribution, I leave it up to the parties involved to do things professionally and with a high standard of quality. Admittedly, a large code change merging in like this has implications for rebasing of anyone else's work that's in flight, but be it one commit merged or 50, or be it one JIRA ticket or ten, the end result is the same; any large contribution in any format will ripple outwards and require re-work from others in the community.

>> The one thing I *would* strongly argue for is performance benchmarking of the new messaging code on a representative sample of different general-purpose queries, LWTs, etc., preferably in a 3-node RF=3 cluster, plus a healthy suite of jmh micro-benches (assuming they're not already in the diff; if they are, disregard / sorry). From speaking with Aleksey offline about this work, my understanding is that that's something they plan on doing before putting a bow on things.
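[A note for readers unfamiliar with jmh: the micro-benchmarks asked for above are small annotated Java classes along the lines of the sketch below. The workload here - checksumming a frame-sized buffer - and all of the names are purely illustrative and are not taken from the 15066 patch.]

    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.TimeUnit;
    import java.util.zip.CRC32;

    import org.openjdk.jmh.annotations.*;

    @BenchmarkMode(Mode.Throughput)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @State(Scope.Thread)
    @Fork(1)
    @Warmup(iterations = 5)
    @Measurement(iterations = 10)
    public class FrameChecksumBench
    {
        @Param({ "64", "1024", "65536" }) // hypothetical frame payload sizes, in bytes
        int size;

        byte[] payload;

        @Setup
        public void setup()
        {
            payload = new byte[size];
            ThreadLocalRandom.current().nextBytes(payload);
        }

        @Benchmark
        public long crc32()
        {
            // The measured operation: one CRC32 pass over the payload.
            CRC32 crc = new CRC32();
            crc.update(payload, 0, payload.length);
            return crc.getValue();
        }
    }

[jmh handles forking, warmup and statistics; the interesting part is choosing operations and payload sizes representative of the internode messaging paths.]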
>> In the balance between "fear change because it destabilizes" and "go forth blindly into that dark night, rewriting All The Things", I think the Cassandra project's willingness to jettison the old and introduce the new has served it well in keeping relevant as the years have gone by. I'd hate to see that culture of progress get mired in a dogmatic adherence to requirements on commit counts, lines of code allowed / expected on a given patch set, or any other metrics that might stymie the professional needs of some of the heaviest contributors to the project.

> +1. Based on all of the discussion here and in JIRA, it seems to me that we'd be doing a big disservice to the users by outright rejecting the changes just based on +/- LoC or complexity of review. From the points raised, it seems like enabling encryption by default (or even making it the only available option?), upstreaming Netty-related changes, possible steps to improve the codebase, as well as how the changes should be formatted to aid the reviewers, could all be discussed separately.

> I think at the end of the day it _might be_ reasonable for the PMC to have a final say on the matter, maybe even on a point-by-point basis.

>> On Wed, Apr 10, 2019 at 5:03 PM Oleksandr Petrov <oleksandr.pet...@gmail.com> wrote:
>>
>>> Sorry to pick only a few points to address, but I think these are ones where I can contribute productively to the discussion.

>>>> In principle, I agree with the technical improvements you mention (backpressure / checksumming / etc). These things should be there. Are they a hard requirement for 4.0?

>>> One thing that comes to mind is protocol versioning and consistency. If changes adding checksumming and handshake do not make it to 4.0, we grow the upgrade matrix and have to put the changes behind a separate protocol version. I'm not sure how many other internode protocol changes we have planned for 4.next, but this is definitely something we should keep in mind.
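[To make the upgrade-matrix point concrete: internode serializers end up gated on the peer's protocol version, roughly as in the sketch below. The version constants and frame fields here are hypothetical, not Cassandra's actual MessagingService versions.]

    import java.io.DataOutput;
    import java.io.IOException;

    // Illustrative only: a later protocol version that adds a checksum field.
    final class VersionedFrameSerializer
    {
        static final int VERSION_40    = 12; // hypothetical version constants
        static final int VERSION_4NEXT = 13;

        void serialize(byte[] payload, int crc, DataOutput out, int peerVersion) throws IOException
        {
            out.writeInt(payload.length);
            out.write(payload);
            // Every branch like this must stay correct, tested and supported for as long
            // as a peer speaking the older version can appear in a mixed-version cluster.
            if (peerVersion >= VERSION_4NEXT)
                out.writeInt(crc);
        }
    }

[Each additional protocol version that ships adds to the set of mixed-version combinations that upgrade tests have to cover.]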
>>>> 2. We should not be measuring complexity in LoC with the exception that all 20k lines *do need to be reviewed* (not just the important parts, and because code refactoring tools have bugs too) and more lines take more time.

>>> Everything should definitely be reviewed, but with different rigour: one thing is to review byte arithmetic and protocol formats, and a completely different thing is to verify that a Verb moved from one place to another is still used. Concentrating on a smaller subset doesn't make a patch smaller, it just helps to guide reviewers and observers, so my primary goal was to help people - hence my reference to the jira comment I'm working on.

>>> On Wed, Apr 10, 2019 at 6:13 PM Sankalp Kohli <kohlisank...@gmail.com> wrote:
>>>
>>>> I think we should wait for the testing doc on confluence to come up and discuss what all needs to be added there to gain confidence.

>>>> If the work to split the patch into smaller parts delays 4.0 even more, can we use the time to add more test coverage, documentation and design docs to this component? Will that be a middle ground here?

>>>> Examples of large changes not going well come down to lack of testing, documentation and design docs, not to the changes being big. Being big adds to the complexity and increases the total bug count, but small changes can be equally bad in terms of impact.

>>>>> On Apr 10, 2019, at 8:53 AM, Jordan West <jorda...@gmail.com> wrote:
>>>>>
>>>>> There is a lot to discuss here so I'll try to keep my opinions brief:

>>>>> 1. The bug fixes are a requirement in order to have a stable 4.0. Whether they come from this patch or the original, I have less of an opinion. I do think it's important to minimize code changes at this time in the development/freeze cycle — including large refactors, which add risk despite how they are being discussed here. From that perspective, I would prefer to see more targeted fixes, but since we don't have them and we have this patch, the decision is different.

>>>>> 2. We should not be measuring complexity in LoC, with the exception that all 20k lines *do need to be reviewed* (not just the important parts, and because code refactoring tools have bugs too) and more lines take more time. Otherwise, it's a poor metric for how long this will take to review. Further, it seems odd that the authors are projecting how long it will take to review — this should be the charge of the reviewers, and I would like to hear from them on this. It's clear this is a complex patch — as risky as something like 8099 (and the original rewrite by Jason). We should treat it as such and not try to merge it in quickly for the sake of it, repeating past mistakes. The goal of reviewing the messaging service work was to do just that. It would be a disservice to rush in these changes now. If the goal is to get the patch in, that should be the priority, not completing it "in two weeks". It's clear several community members have pushed back on that and are not comfortable with the time frame.
>>>>> 3. If we need to add special forks of Netty classes to the code because of "how we use Netty", that is a concern to me w.r.t. quality, stability, and maintenance. Is there a place that documents/justifies our non-traditional usage? (I saw some in the JavaDocs, but found it lacking in the *why*, though it had a lot of the how/what was changed.) Given folks in the community have decent relationships with the Netty team, perhaps we should leverage that as well. Have we reached out to them?

>>>>> 4. In principle, I agree with the technical improvements you mention (backpressure / checksumming / etc). These things should be there. Are they a hard requirement for 4.0? In my opinion no, and Dinesh has done a good job of providing some reasons as to why, so I won't go into much detail here. In short, a bug and a missing safety mechanism are not the same thing. It's also important to consider how many users a change like that covers and for what risk — we found a bug in 13304 late in review; had it slipped through, it would have subjected users to silent corruption they thought couldn't occur anymore, because we included the feature in a prod Cassandra release.

>>>>> 5. The patch could seriously benefit from some commit hygiene that would make it easier for folks to review. Had this been done, not only would review be easier, but the piecemeal breakup of features/fixes could also have been done more easily. I don't buy the premise that this wasn't possible. If we had to add the feature/fix later, it would have been possible. I'm sure there was a smart way we could have organized it, if it was a priority.

>>>>> 6. Have any upgrade tests been done/added? I would also like to see some performance benchmarks before merging, so that the patch is in a similar state as 14503 in terms of performance testing.

>>>>> I'm sure I have more thoughts but these seem like the important ones for now.
>>>>>
>>>>> Jordan

>>>>> On Wed, Apr 10, 2019 at 8:21 AM Dinesh Joshi <djos...@icloud.com.invalid> wrote:
>>>>>
>>>>>> Here are my 2¢.

>>>>>> I think the general direction of this work is valuable, but I have a few concerns I'd like to address. More inline.

>>>>>>> On Apr 4, 2019, at 1:13 PM, Aleksey Yeschenko <alek...@apache.org> wrote:
>>>>>>>
>>>>>>> I would like to propose CASSANDRA-15066 [1] - an important set of bug fixes and stability improvements to internode messaging code that Benedict, I, and others have been working on for the past couple of months.

>>>>>>> First, some context. This work started off as a review of CASSANDRA-14503 (Internode connection management is race-prone [2]), CASSANDRA-13630 (Support large internode messages with netty) [3], and a pre-4.0 confirmatory review of such a major new feature.

>>>>>>> However, as we dug in, we realized this was insufficient. With more than 50 bugs uncovered [4] - dozens of them critical to correctness and/or stability of the system - a substantial rework was necessary to guarantee a solid internode messaging subsystem for the 4.0 release.
>>>>>>> In addition to addressing all of the uncovered bugs [4] that were unique to trunk + 13630 [3] + 14503 [2], we used this opportunity to correct some long-existing, pre-4.0 bugs and stability issues. For the complete list of notable bug fixes, read the comments to CASSANDRA-15066 [1]. But I'd like to highlight a few.

>>>>>> Do you have regression tests that will fail if these bugs are reintroduced at a later point?

>>>>>>> # Lack of message integrity checks
>>>>>>>
>>>>>>> It's known that TCP checksums are too weak [5] and Ethernet CRC cannot be relied upon [6] for integrity. With sufficient scale or time, you will hit bit flips. Sadly, most of the time these go undetected. Cassandra's replication model makes this issue much more serious, as the faulty data can infect the cluster.

>>>>>>> We recognised this problem, and recently introduced a fix for server-client messages, implementing CRCs in CASSANDRA-13304 (Add checksumming to the native protocol) [7].

>>>>>> This was discussed in my review, and Jason created a ticket [1] to track it. We explicitly decided to defer this work not only due to the feature freeze in the community but also for technical reasons detailed below.

>>>>>> Regarding new features during the feature freeze window, we have had such discussions in the past. The most recent was the one I initiated on the Zstd compressor, which went positively, and we moved forward after assessing risk & community consensus.

>>>>>> Regarding checksumming, please scroll down to the comments section in the link [2] you provided. You'll notice this discussion –

>>>>>>>> Daniel Fox Franke:
>>>>>>>> Please don't design new network protocols that don't either run over TLS or do some other kind of cryptographic authentication. If you have cryptographic authentication, then CRC is redundant.

>>>>>>> Evan Jones:
>>>>>>> Good point. Even internal applications inside a data center should be using encryption, and today the performance impact is probably small (I haven't actually measured it myself these days).

>>>>>> Enabling TLS & internode compression are mitigation strategies to avoid data corruption in transit. By your own admission in CASSANDRA-13304 [3], we don't require checksumming if TLS is present. Here's your full quote –

>>>>>>> Aleksey Yeschenko:
>>>>>>> Checksums and TLS are orthogonal. It just so happens that you don't need the former if you already have the latter.

>>>>>> I want to be fair: later you did say that we don't want to force people to pay the cost of TLS overhead. However, I would also like to point out that with the introduction of Netty for internode communication, we have 4-5x the TLS performance thanks to OpenSSL JNI bindings. You can refer to Norman's or my talks on the topic. So TLS is practical & compression is necessary for performance. Both strategies work fine to protect against data corruption, making checksumming redundant. With SSL certificate hot reloading, it also avoids unnecessary cluster restarts, providing maximum availability.

>>>>>> In the same vein, it's 2019, and if people are not using TLS for internode, then it is really, really bad for data security in our industry and we should not be encouraging it. In fact, I would go so far as to make TLS the default.

>>>>>> Managing TLS infrastructure is beyond Cassandra's scope, and operators should figure it out by now for their own & their users' sake. Cassandra makes it super easy & performant to have TLS enabled. People should be using it.

>>>>>>> But until CASSANDRA-15066 [1] lands, this is also a critical flaw internode. We have addressed it by ensuring that no matter what, whether you use SSL or not, whether you use internode compression or not, a protocol-level CRC is always present, for every message frame. It's our deep and sincere belief that shipping a new implementation of the messaging protocol without application-level data integrity checks would be unacceptable for a modern database.
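[For readers following the protocol detail: a per-frame CRC of the general kind described above amounts to something like the sketch below - compute a CRC over the payload on the sending side, verify it before the payload is handed to the application on the receiving side. The frame layout and names are illustrative, not the actual 15066 format.]

    import java.nio.ByteBuffer;
    import java.util.zip.CRC32;

    final class FrameCrcExample
    {
        // Sender: append a CRC32 of the payload to the outgoing frame.
        static ByteBuffer encode(ByteBuffer payload)
        {
            CRC32 crc = new CRC32();
            crc.update(payload.duplicate());
            ByteBuffer frame = ByteBuffer.allocate(payload.remaining() + Integer.BYTES);
            frame.put(payload.duplicate());
            frame.putInt((int) crc.getValue());
            frame.flip();
            return frame;
        }

        // Receiver: verify the trailing CRC32 before handing the payload upwards.
        static ByteBuffer decode(ByteBuffer frame)
        {
            ByteBuffer payload = frame.duplicate();
            payload.limit(payload.limit() - Integer.BYTES);
            CRC32 crc = new CRC32();
            crc.update(payload.duplicate());
            int expected = frame.getInt(frame.limit() - Integer.BYTES);
            if ((int) crc.getValue() != expected)
                throw new IllegalStateException("frame CRC mismatch: payload corrupted in transit");
            return payload;
        }
    }

[A real implementation also has to cover headers, chunking and the failure path on a mismatch; the sketch only shows the core idea.]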
>>>>>> My previous technical arguments have provided enough evidence that protocol-level checksumming is not a show stopper.

>>>>>> The only reason I see for adding checksums to the protocol is when some user doesn't want to enable TLS and internode compression. As it stands, from your comments [7] it seems to be mandatory, and it adds unnecessary overhead when TLS and/or compression is enabled. Frankly, I don't think we need to risk destabilizing trunk for these use-cases. I want to reiterate that I believe in doing the right thing, but we have to make acceptable tradeoffs – as a community.

>>>>>>> # Lack of back-pressure and memory usage constraints
>>>>>>>
>>>>>>> As it stands today, it's far too easy for a single slow node to become overwhelmed by messages from its peers. Conversely, multiple coordinators can be made unstable by the backlog of messages to deliver to just one struggling node.

>>>>>> This is a known issue and it could have been addressed as a separate bug fix – one that could be independently verified.

>>>>>>> To address this problem, we have introduced strict memory usage constraints that apply TCP-level back-pressure, on both outbound and inbound. It is now impossible for a node to be swamped on inbound, and on outbound it is made significantly harder to overcommit resources. It's a simple, reliable mechanism that drastically improves cluster stability under load, and especially overload.

>>>>>>> Cassandra is a mature system, and introducing an entirely new messaging implementation without resolving this fundamental stability issue is difficult to justify in our view.

>>>>>> This is great in theory. I would really like to see objective measurements like Chris did in CASSANDRA-14654 [4]. Netflix engineers tested the Netty refactor with a 200-node cluster [5] and Zero Copy Streaming [6], and reported their results. It's invaluable work. It would be great to see something similar.
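[For anyone trying to picture the inbound half of the mechanism debated above: the general technique is to bound the bytes in flight and let TCP itself push back by pausing socket reads. A deliberately simplistic Netty sketch, not the 15066 implementation:]

    import io.netty.buffer.ByteBuf;
    import io.netty.channel.ChannelHandlerContext;
    import io.netty.channel.ChannelInboundHandlerAdapter;
    import java.util.concurrent.atomic.AtomicLong;

    final class InboundLimitHandler extends ChannelInboundHandlerAdapter
    {
        private static final long LIMIT_BYTES = 4 << 20; // arbitrary 4MiB budget for the example
        private final AtomicLong inFlight = new AtomicLong();

        @Override
        public void channelRead(ChannelHandlerContext ctx, Object msg)
        {
            if (msg instanceof ByteBuf)
            {
                long pending = inFlight.addAndGet(((ByteBuf) msg).readableBytes());
                // Stop reading from the socket: the kernel receive buffer fills, the TCP
                // window closes, and the sending node experiences back-pressure.
                if (pending > LIMIT_BYTES)
                    ctx.channel().config().setAutoRead(false);
            }
            ctx.fireChannelRead(msg);
        }

        // To be called once downstream processing has consumed a message's bytes.
        void release(ChannelHandlerContext ctx, int bytes)
        {
            if (inFlight.addAndGet(-bytes) <= LIMIT_BYTES)
                ctx.channel().config().setAutoRead(true);
        }
    }

[On the outbound side, the analogous Netty knobs are the channel's write-buffer water marks and Channel.isWritable(); how 15066 actually implements either direction is described in the ticket, not here.]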
>>>>>>> # State of the patch, feature freeze and 4.0 timeline concerns
>>>>>>>
>>>>>>> The patch is essentially complete, with much improved unit tests all passing, dtests green, and extensive fuzz testing underway - with initial results all positive. We intend to further improve in-code documentation and test coverage in the next week or two, and do some minor additional code review, but we believe it will be basically good to commit in a couple of weeks.

>>>>>> A 20K LoC patch is unverifiable, especially without much documentation. It places an undue burden on reviewers. It also messes up everyone's branches once you commit such a large refactor. Let's be considerate to others in the community. It is good engineering practice to limit patches to a reasonable size.

>>>>>> More importantly, such large refactors lend themselves to new bugs that cannot be caught easily unless you have a very strong regression test suite, which Cassandra arguably lacks. Therefore I am of the opinion that the bug fixes can be applied piecemeal to the codebase. They should be small enough that they can be individually reviewed and independently verified.

>>>>>> I also noticed that you have replaced Netty's classes [7]. I am of the opinion that they should be upstreamed if they're better, so the wider Netty community benefits from them and we don't have to maintain our own classes.

>>>>>>> P.S. I believe that once it's committed, we should cut our first 4.0 alpha. It's about time we started this train (:

>>>>>> +1 on working towards an alpha, but that is a separate discussion from this issue.

>>>>>> Thanks,
>>>>>>
>>>>>> Dinesh

>>>>>> [1] https://issues.apache.org/jira/browse/CASSANDRA-14578
>>>>>> [2] https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
>>>>>> [3] https://issues.apache.org/jira/browse/CASSANDRA-13304?focusedCommentId=16183034&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16183034
>>>>>> [4] https://issues.apache.org/jira/browse/CASSANDRA-14654
>>>>>> [5] https://issues.apache.org/jira/browse/CASSANDRA-14747
>>>>>> [6] https://issues.apache.org/jira/browse/CASSANDRA-14765
>>>>>> [7] https://issues.apache.org/jira/browse/CASSANDRA-15066?focusedCommentId=16801277&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16801277

>>> --
>>> alex p