Cassandra single unreachable node causing total cluster outage

Agrawal, Pratik Tue, 27 Nov 2018 16:32:49 -0800

Hello all,

Setup:


18-node Cassandra cluster, version 2.2.8.
Amazon EC2 c3.2xlarge instances.
Replication factor of 3 (replicas in 3 different AZs).
Reads and writes at QUORUM consistency.

Use case:


  1.  Short-lived data with heavy updates (I know we are abusing Cassandra
here), with a gc grace period of 15 minutes (I know it sounds ridiculous).
Uses the leveled compaction strategy.
  2.  Time-series data, no updates, short-lived (1 hour). TTLed out, using
the date-tiered compaction strategy.
  3.  Time-series data, no updates, long-lived (7 days). TTLed out, using
the date-tiered compaction strategy.

Overall read and write throughput is high (100,000 operations/second).

Problem:

  1.  When an EC2 machine becomes unreachable (we reproduced the issue by taking
down its network interface), the entire cluster becomes unstable until the down
node is removed from the cluster. The node is shown as DN in nodetool status.
Our understanding was that a single node going down in one AZ should not impact
the other nodes, so we are unable to understand why it makes the entire cluster
unstable. Is there an open bug around this?
  2.  We tried another experiment, killing the Cassandra process instead. In
that case we see only a blip in latencies, and all the other nodes remain
healthy and responsive (as expected).
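
For reference, the expectation in item 1 comes from simple quorum arithmetic.
Below is a minimal sketch of that reasoning, assuming each token range has one
replica per AZ; the helper names are hypothetical and not part of any Cassandra
API.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that must respond for a QUORUM read or write."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int) -> int:
    """Replicas of a token range that can be down while QUORUM still succeeds."""
    return replication_factor - quorum(replication_factor)


rf = 3  # replication factor from the setup above
print(quorum(rf))              # 2
print(tolerated_failures(rf))  # 1
```

With a replication factor of 3, QUORUM needs 2 replicas, so one unreachable
replica per token range should, on paper, still leave requests able to succeed.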

Any thoughts/comments on what could be the issue here?

Thanks,
Pratik
