Hello, I've been having some strange issues with one of our test clusters (4-day-old, 3-node, 2.1.10 cluster on AWS). I saw a number of messages like the following:

[0000] 10 Nov 20:21:00.406 * pri=WARN t=MessagingService-Incoming-/192.168.168.202 at=IncomingTcpConnection.run UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=3ecce750-84d3-11e5-bdd9-dd7717dcdbd5

A colleague suggested I run repair, but that failed with:

[2015-11-10 20:06:54,329] Nothing to repair for keyspace 'eventPipesState'
[2015-11-10 20:06:54,348] Starting repair command #1, repairing 768 ranges for keyspace dbs8okvd7jcurj (parallelism=SEQUENTIAL, full=true)
[2015-11-10 20:06:55,599] Repair command #1 finished
[2015-11-10 20:06:55,610] Starting repair command #2, repairing 487 ranges for keyspace context (parallelism=SEQUENTIAL, full=true)
[2015-11-10 20:11:21,213] Lost notification. You should check server log for repair status of keyspace context
[2015-11-10 20:11:21,288] Lost notification. You should check server log for repair status of keyspace context
Exception occurred during clean-up. java.lang.reflect.UndeclaredThrowableException
error: JMX connection closed. You should check server log for repair status of keyspace context(Subsequent keyspaces are not going to be repaired).
-- StackTrace --
java.io.IOException: JMX connection closed. You should check server log for repair status of keyspace context(Subsequent keyspaces are not going to be repaired).

I searched for other cases of similar issues, and found some posts (e.g., http://stackoverflow.com/questions/22783577/org-apache-cassandra-db-unknowncolumnfamilyexception-couldnt-find-cfid), but nothing that seemed directly relevant. Still, I tried `nodetool describecluster` and all the nodes showed up as being on the same schema version. The server log did not include any more info.

I asked about this on IRC and got the suggestion to run `nodetool resetlocalschema`.
I tried running that, and it completed (`nodetool describecluster` now shows this node as having a different schema version from the other two nodes). I still get the original error in the server logs, and now I also see:

[0000] 10 Nov 22:51:10.466 * pri=ERROR t=Thrift:12 at=CustomTThreadPoolServer.run Error occurred during processing of message.
java.lang.IllegalArgumentException: Unknown keyspace/cf pair (system_auth.credentials)

Further `nodetool repair` runs on the same node do complete, but only seem to process the `system` keyspace (and don't do anything with it):

[2015-11-10 22:38:07,415] Nothing to repair for keyspace 'system'

I also tried running `nodetool repair` from another node in the cluster, but that just seems to hang:

[2015-11-10 22:53:11,830] Starting repair command #7, repairing 768 ranges for keyspace dbs8okvd7jcurj (parallelism=SEQUENTIAL, full=true)
[2015-11-10 22:53:12,943] Repair command #7 finished
[2015-11-10 22:53:12,958] Starting repair command #8, repairing 534 ranges for keyspace context (parallelism=SEQUENTIAL, full=true)

How can I restore this cluster? And ideally, how can I figure out what went wrong here in the first place?
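
In case it helps anyone reproduce the schema-agreement check described above, here is a rough Python sketch that parses `nodetool describecluster` output and reports whether all nodes share one schema version. It assumes the "Schema versions:" section lists one `<uuid>: [ip, ip, ...]` entry per line, as 2.1-era output does as far as I know. The function names, the sample UUIDs, and every IP except 192.168.168.202 are made up for illustration.

```python
import re

# Assumed line shape inside the "Schema versions:" section:
#   <schema-uuid>: [ip1, ip2, ...]   (or "UNREACHABLE: [ip]")
SCHEMA_LINE = re.compile(r"^\s*([0-9a-fA-F-]{36}|UNREACHABLE):\s*\[(.*)\]\s*$")

# Hypothetical sample output; the UUIDs and two of the IPs are invented.
SAMPLE = """Cluster Information:
\tName: Test Cluster
\tSnitch: org.apache.cassandra.locator.DynamicEndpointSnitch
\tPartitioner: org.apache.cassandra.dht.Murmur3Partitioner
\tSchema versions:
\t\taaaaaaaa-1111-3222-8333-444444444444: [192.168.168.200, 192.168.168.201]
\t\tbbbbbbbb-5555-3666-8777-888888888888: [192.168.168.202]
"""


def schema_versions(describecluster_output):
    """Map each schema version (or 'UNREACHABLE') to the hosts reporting it."""
    versions = {}
    for line in describecluster_output.splitlines():
        match = SCHEMA_LINE.match(line)
        if match:
            hosts = [h.strip() for h in match.group(2).split(",") if h.strip()]
            versions.setdefault(match.group(1), []).extend(hosts)
    return versions


def schema_agrees(versions):
    """True only if exactly one version is reported and no host is unreachable."""
    return len(versions) == 1 and "UNREACHABLE" not in versions


if __name__ == "__main__":
    found = schema_versions(SAMPLE)
    for version, hosts in found.items():
        print(version, hosts)
    print("schema agreement:", schema_agrees(found))
```

To run it against a live node, the text could be obtained with something like `subprocess.run(["nodetool", "describecluster"], capture_output=True, text=True).stdout` and passed to `schema_versions`. Note that agreement here only means the nodes report the same version; as the logs above show, the cfId errors appeared even while all three nodes reported a single version.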