Could also be the app not detecting that the host is down, so it keeps trying to use it as a coordinator.
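If that's what's happening, making the client fail over faster can help a lot. A rough sketch with the DataStax Java driver 3.x (contact point, timings, and the query are placeholders, tune for your workload; note speculative executions only kick in for statements marked idempotent):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.SocketOptions;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.ConstantSpeculativeExecutionPolicy;

public class FastFailoverExample {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                .addContactPoint("10.0.0.10")   // placeholder contact point
                // if a coordinator hasn't answered within 500 ms, retry on the
                // next host, up to 2 extra attempts
                .withSpeculativeExecutionPolicy(new ConstantSpeculativeExecutionPolicy(500, 2))
                // don't let a request hang on an unresponsive host for more than 2 s
                .withSocketOptions(new SocketOptions().setReadTimeoutMillis(2000))
                .build();

        // speculative executions only apply to statements marked idempotent
        Statement stmt = new SimpleStatement("SELECT release_version FROM system.local")
                .setIdempotent(true);
        System.out.println(cluster.connect().execute(stmt).one());
        cluster.close();
    }
}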
--
Jeff Jirsa

> On Nov 27, 2018, at 6:33 PM, Ben Slater <ben.sla...@instaclustr.com> wrote:
>
> In what way does the cluster become unstable (i.e. more specifically, what
> are the symptoms)? My first thought would be the loss of the node causing
> the other nodes to become overloaded, but that doesn't seem to fit with
> your point 2.
>
> Cheers
> Ben
> ---
> Ben Slater
> Chief Product Officer
>
> Read our latest technical blog posts here.
> This email has been sent on behalf of Instaclustr Pty. Limited (Australia)
> and Instaclustr Inc (USA).
> This email and any attachments may contain confidential and legally
> privileged information. If you are not the intended recipient, do not copy
> or disclose its content, but please reply to this email immediately and
> highlight the error to the sender and then immediately delete the message.
>
>> On Tue, 27 Nov 2018 at 16:32, Agrawal, Pratik <paagr...@amazon.com.invalid> wrote:
>>
>> Hello all,
>>
>> Setup:
>>
>> - 18-node Cassandra cluster, Cassandra version 2.2.8
>> - Amazon c3.2xlarge machines
>> - Replication factor of 3 (in 3 different AZs)
>> - Reads and writes at QUORUM
>>
>> Use case:
>>
>> - Short-lived data with heavy updates (I know we are abusing Cassandra
>>   here) with a gc_grace_seconds of 15 minutes (I know it sounds
>>   ridiculous). Leveled compaction strategy.
>> - Time-series data, no updates, short-lived (1 hour). TTLed out using
>>   date-tiered compaction strategy.
>> - Time-series data, no updates, long-lived (7 days). TTLed out using
>>   date-tiered compaction strategy.
>>
>> Overall high read and write throughput (100,000/second).
>>
>> Problem:
>>
>> 1. An EC2 machine becomes unreachable (we reproduced the issue by taking
>>    down the network card) and the entire cluster becomes unstable until
>>    the down node is removed from the cluster. The node shows as DN in
>>    nodetool status. Our understanding was that a single node down in one
>>    AZ should not impact the other nodes. We are unable to understand why
>>    a single node going down causes the entire cluster to become unstable.
>>    Is there any open bug around this?
>> 2. We tried another experiment by killing the Cassandra process, but in
>>    this case we only saw a blip in latencies and all the other nodes
>>    stayed healthy and responsive (as expected).
>>
>> Any thoughts/comments on what could be the issue here?
>>
>> Thanks,
>> Pratik
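P.S. For anyone reproducing this client-side, the setup described above (QUORUM reads and writes against an RF=3 keyspace spread over three AZs) would look roughly like the sketch below with the Java driver 3.x. The contact points are placeholders and the single-DC, token-aware load balancing policy is an assumption about the deployment.

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.QueryOptions;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.policies.DCAwareRoundRobinPolicy;
import com.datastax.driver.core.policies.TokenAwarePolicy;

public class QuorumClusterSetup {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder()
                // one contact point per AZ; addresses are placeholders
                .addContactPoints("10.0.0.10", "10.0.1.10", "10.0.2.10")
                // token-aware routing on top of the single-DC round-robin policy
                .withLoadBalancingPolicy(new TokenAwarePolicy(DCAwareRoundRobinPolicy.builder().build()))
                // QUORUM for both reads and writes, as in the setup described above
                .withQueryOptions(new QueryOptions().setConsistencyLevel(ConsistencyLevel.QUORUM))
                .build();
        Session session = cluster.connect();
        System.out.println(session.execute("SELECT cluster_name FROM system.local").one());
        cluster.close();
    }
}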