Hi all, I observed some odd behaviour with our Cassandra cluster the other day while doing some maintenance operations, and was wondering if anyone could provide some insight.
Initially, I started a node up to join the cluster. That node appeared to be having issues joining due to some SSTable corruption it encountered. Since it was still in the early stages of joining and I had never seen this failure before, I decided to take it out of commission and just try again. However, since it was in a bad state, I issued a "nodetool removenode <host id>" on a peer rather than a "nodetool decommission" on the node itself.

The removenode command hung indefinitely - my guess is that this is related to https://issues.apache.org/jira/browse/CASSANDRA-6542. We are using 2.1.11. While this was happening, the driver in the application started logging errors about not being able to reach a quorum of 4. This was mysterious to me, as none of my keyspaces have an RF > 3; a quorum of 4 implies an RF of 6 or 7.

I eventually forced that node out of the ring with "nodetool removenode force". This seemed to mostly fix the issue, though there appears to have been enough of a load spike that some of the machines' JVMs accumulated a lot of garbage very quickly and spat out a ton of "Not marking nodes down due to local pause of ..." messages while trying to clean it up. Some of these nodes became unresponsive to their peers, who marked them DOWN (as indicated by "nodetool status" and the Cassandra log file on those machines), further exacerbating the situation on the nodes that were still up.

I guess my question is two-fold. First, can anyone provide some insight into what may have happened? Second, what do you consider good practice when dealing with such issues? Any advice is greatly appreciated!

Thanks,
Rutvij
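For context, here is the quorum arithmetic behind my confusion (a quick sketch, not Cassandra's actual source - Cassandra computes quorum as floor(RF / 2) + 1, so only an RF of 6 or 7 would yield a quorum of 4):

```python
def quorum(rf: int) -> int:
    # Cassandra's QUORUM consistency level: a majority of replicas.
    return rf // 2 + 1

# My keyspaces all have RF <= 3, which should never require 4 replicas:
for rf in range(1, 8):
    print(f"RF={rf} -> quorum={quorum(rf)}")
# RF=3 gives quorum 2; only RF=6 or RF=7 gives quorum 4.
```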