Hello. We had some major latency problems yesterday with our 5-node Cassandra cluster, and I wanted to get some feedback on where to start looking to figure out what caused the issue. If there is more info I should provide, please let me know.
Here are the basics of the cluster:

Clients: Hector and Cassie
Size: 5 nodes (2 in AWS US-West-1, 2 in AWS US-West-2, 1 in Linode Fremont)
Replication factor: 5
Consistency: quorum reads and writes
Read repair: true
Cassandra version: 1.0.12

We started experiencing catastrophic latency from our app servers. At the time we believed this was caused by compactions running and the clients not re-routing around the busy node, so we disabled thrift on the single node that had high load. That did not resolve the issue. We then stopped gossip on that same node, which again changed nothing. Next we took down gossip on a second node (leaving 3 of 5 up, which with RF=5 is still enough replicas to satisfy quorum), and that fixed the latency from the application side.

For a period of roughly 4 hours, every time we tried to bring a fourth node back up, the app saw the latency again. We rotated which three nodes were up, to rule out a networking event in a single region/provider, and kept seeing the same pattern: with 3 nodes there was no latency problem, with 4 or 5 nodes there was. After those ~4 hours we brought the cluster back up to 5 nodes and everything was fine.

We have some ideas about what caused this behavior, but has anyone else seen this type of problem, where the full cluster causes trouble but removing nodes fixes it? Any input on what to look for in our logs to understand the issue?

Thanks,
Arup
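P.S. To be clear about terminology: by "disabled thrift" and "stopped gossip" I mean the standard nodetool toggles, roughly as below (hostnames are placeholders, and the exact order is from memory):

  # stop serving client (thrift) traffic on the hot node
  nodetool -h node1.example.com disablethrift

  # take the same node out of gossip, then a second node, leaving 3 of 5 gossiping
  nodetool -h node1.example.com disablegossip
  nodetool -h node2.example.com disablegossip

  # later, bring each node back into the cluster
  nodetool -h node1.example.com enablegossip
  nodetool -h node1.example.com enablethrift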