Hello. We had some major latency problems yesterday with our 5-node
Cassandra cluster and wanted to get some feedback on where to start
looking to figure out what caused the issue. If there is more info I
should provide, please let me know.

Here are the basics of the cluster:
Clients: Hector and Cassie
Size: 5 nodes (2 in AWS US-West-1, 2 in AWS US-West-2, 1 in Linode Fremont)
Replication Factor: 5
Quorum Reads and Writes enabled
Read Repair set to true
Cassandra Version: 1.0.12
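
In case concrete code helps, here is a rough sketch of how the client side
is wired up (placeholder names and hosts, not our exact code): an RF=5
keyspace accessed through Hector at QUORUM for both reads and writes.

import me.prettyprint.cassandra.model.ConfigurableConsistencyLevel;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.HConsistencyLevel;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class ClientSetupSketch {
    public static void main(String[] args) {
        // Placeholder cluster name and seed hosts.
        Cluster cluster = HFactory.getOrCreateCluster("AppCluster",
                "10.0.0.1:9160,10.0.0.2:9160");

        // QUORUM on both reads and writes, as listed above.
        ConfigurableConsistencyLevel policy = new ConfigurableConsistencyLevel();
        policy.setDefaultReadConsistencyLevel(HConsistencyLevel.QUORUM);
        policy.setDefaultWriteConsistencyLevel(HConsistencyLevel.QUORUM);

        // The keyspace itself was created with replication_factor = 5.
        Keyspace keyspace = HFactory.createKeyspace("app_ks", cluster, policy);
        System.out.println("Connected to keyspace " + keyspace.getKeyspaceName());
    }
}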

We started experiencing catastrophic latency as seen from our app servers.
At the time we believed this was due to compactions running and the clients
not re-routing appropriately, so we disabled thrift on a single node that
had high load. This did not resolve the issue. After that, we stopped
gossip on the same high-load node; again, this did not resolve anything.
We then took down gossip on another node (leaving 3/5 up) and that fixed
the latency from the application side. For a period of ~4 hours, every
time we tried to bring up a fourth node, the app would see the latency
again. We then rotated which three nodes were up, to make sure it was not
a networking event related to a single region/provider, and we kept seeing
the same problem: with 3 nodes up there was no latency problem, with 4 or
5 nodes up the latency returned. After the ~4 hours, we brought the cluster
back up to 5 nodes and everything was fine.
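
One bit of arithmetic that may be relevant for readers: with RF=5, QUORUM
only needs (5/2)+1 = 3 replicas to respond, which is why the cluster could
still serve quorum reads and writes with only 3 of 5 nodes up. A trivial,
purely illustrative sanity check of that arithmetic:

public class QuorumMath {
    public static void main(String[] args) {
        int rf = 5;
        int quorum = (rf / 2) + 1; // 3 replicas needed for QUORUM at RF=5
        for (int live = 1; live <= rf; live++) {
            System.out.println(live + "/" + rf + " nodes up -> QUORUM "
                    + (live >= quorum ? "possible" : "NOT possible"));
        }
    }
}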

We currently have some ideas about what caused this behavior, but has
anyone else seen this type of problem, where the full cluster causes
issues and removing nodes fixes them? Any input on what to look for in
our logs to understand the issue?

Thanks

Arup
