Hello,

I've been having some strange issues with one of our test clusters
(4-day-old, 3-node, 2.1.10 cluster on AWS). I saw a number of messages like
the following:

[0000] 10 Nov 20:21:00.406 * pri=WARN  t=MessagingService-Incoming-/
192.168.168.202 at=IncomingTcpConnection.run UnknownColumnFamilyException
reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find
cfId=3ecce750-84d3-11e5-bdd9-dd7717dcdbd5
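
One thing that might help narrow this down: the cfId in the message has the shape of a version 1 (time-based) UUID, so it carries a timestamp. The sketch below pulls the cfId out of a log line and decodes that timestamp. It assumes the timestamp reflects when the table was created, which is an assumption and not something the log states. The names `CFID_RE`, `UUID_EPOCH` and `cfid_created_at` are made up for illustration.

```python
import re
import uuid
from datetime import datetime, timedelta, timezone

# Hypothetical helper names; only the log line itself comes from the post.
CFID_RE = re.compile(r"cfId=([0-9a-fA-F-]{36})")

# Version 1 UUID timestamps count from the start of the Gregorian calendar.
UUID_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def cfid_created_at(log_line):
    """Return the UTC time embedded in a version 1 cfId, or None."""
    match = CFID_RE.search(log_line)
    if match is None:
        return None
    cfid = uuid.UUID(match.group(1))
    if cfid.version != 1:
        # Not time-based, so there is no timestamp to decode.
        return None
    # UUID.time is a count of 100-nanosecond intervals since UUID_EPOCH.
    return UUID_EPOCH + timedelta(microseconds=cfid.time // 10)


line = ("org.apache.cassandra.db.UnknownColumnFamilyException: "
        "Couldn't find cfId=3ecce750-84d3-11e5-bdd9-dd7717dcdbd5")
print(cfid_created_at(line))
```

Comparing the decoded time with when tables were created or dropped on the cluster may hint at which schema change the receiving node never saw.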

A colleague suggested I run repair, but that failed with:

[2015-11-10 20:06:54,329] Nothing to repair for keyspace 'eventPipesState'
[2015-11-10 20:06:54,348] Starting repair command #1, repairing 768 ranges
for keyspace dbs8okvd7jcurj (parallelism=SEQUENTIAL, full=true)
[2015-11-10 20:06:55,599] Repair command #1 finished
[2015-11-10 20:06:55,610] Starting repair command #2, repairing 487 ranges
for keyspace context (parallelism=SEQUENTIAL, full=true)
[2015-11-10 20:11:21,213] Lost notification. You should check server log
for repair status of keyspace context
[2015-11-10 20:11:21,288] Lost notification. You should check server log
for repair status of keyspace context
Exception occurred during clean-up.
java.lang.reflect.UndeclaredThrowableException
error: JMX connection closed. You should check server log for repair status
of keyspace context(Subsequent keyspaces are not going to be repaired).
-- StackTrace --
java.io.IOException: JMX connection closed. You should check server log for
repair status of keyspace context(Subsequent keyspaces are not going to be
repaired).

I searched for similar cases and found some posts (e.g.,
http://stackoverflow.com/questions/22783577/org-apache-cassandra-db-unknowncolumnfamilyexception-couldnt-find-cfid),
but nothing that seemed directly relevant. Still, I tried `nodetool
describecluster`, and all the nodes showed up as being on the same schema
version.
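
For repeated checks, the schema agreement test can be scripted. This is a minimal sketch that parses the "Schema versions" part of `nodetool describecluster` output and reports whether every node is on one version. The sample text, its UUIDs and addresses, and the names `SAMPLE`, `schema_versions` and `in_agreement` are all hypothetical; the line layout is assumed to be `<version>: [host, host]`.

```python
import re

# Hypothetical sample, not output from the cluster described in this post.
SAMPLE = """\
Cluster Information:
    Name: Test Cluster
    Schema versions:
        11111111-2222-3333-4444-555555555555: [10.0.0.1, 10.0.0.2]
        aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee: [10.0.0.3]
"""

# A schema version line: a UUID (or UNREACHABLE), a colon, a bracketed host list.
VERSION_LINE = re.compile(r"^\s*([0-9a-fA-F-]{36}|UNREACHABLE):\s*\[(.*)\]\s*$")


def schema_versions(text):
    """Map each reported schema version to the hosts listed for it."""
    versions = {}
    for raw in text.splitlines():
        match = VERSION_LINE.match(raw)
        if match:
            hosts = [h.strip() for h in match.group(2).split(",") if h.strip()]
            versions[match.group(1)] = hosts
    return versions


def in_agreement(versions):
    """True only if exactly one version is reported and no host is unreachable."""
    return len(versions) == 1 and "UNREACHABLE" not in versions


found = schema_versions(SAMPLE)
print(found)
print("schema agreement:", in_agreement(found))
```

Feeding it the real command output (for example, captured with `subprocess.run`) would show at a glance which node has drifted.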

The server log did not include any more info. I asked about this on IRC and
got the suggestion to run `nodetool resetlocalschema`. I ran it and it
completed, but `nodetool describecluster` now shows this node as having a
different schema version from the other two nodes. I still get the original
error in the server logs, and now also get:

[0000] 10 Nov 22:51:10.466 * pri=ERROR t=Thrift:12
at=CustomTThreadPoolServer.run Error occurred during processing of message.
java.lang.IllegalArgumentException: Unknown keyspace/cf pair
(system_auth.credentials)

Further `nodetool repair`s on the same node do complete, but only seem to
process the `system` keyspace (and don't do anything with it):

[2015-11-10 22:38:07,415] Nothing to repair for keyspace 'system'

I also tried running `nodetool repair` from another node in the cluster,
but that just seems to hang:

[2015-11-10 22:53:11,830] Starting repair command #7, repairing 768 ranges
for keyspace dbs8okvd7jcurj (parallelism=SEQUENTIAL, full=true)
[2015-11-10 22:53:12,943] Repair command #7 finished
[2015-11-10 22:53:12,958] Starting repair command #8, repairing 534 ranges
for keyspace context (parallelism=SEQUENTIAL, full=true)

How can I restore this cluster? And ideally, how can I figure out what went
wrong here in the first place?
