I'm currently the proud owner of an 8-node cluster that won't start up.

Yesterday a developer was doing very high-volume writes to our cluster via a
Hadoop job that read an HDFS file and ran six concurrent mappers on each of
the 8 nodes, using Hector to do the load, and it more or less killed
Cassandra.  The cluster was running 0.7.0, and three of the nodes died with
OutOfMemory errors before he realized something was awry and killed the job.
He then tried to get rid of the keyspace by dropping it in the CLI and got
the following error:

javax.management.InstanceAlreadyExistsException: org.apache.cassandra.db:type=ColumnFamilies,keyspace=devks,columnfamily=OriginCF

So he punted to me, and I decided to just try restarting the cluster in the
hopes that it would sort itself out.  The nodes that were still up died
gracefully with the stop-server command, no kill -9s required.  But when I
tried to start the nodes again, they all failed with stack traces.

My googling led me to this:
https://issues.apache.org/jira/browse/CASSANDRA-2197

So I upgraded to 0.7.2 and tried restarting.  Once again all the nodes
failed, this time with one of two different stack traces, both of which
occur immediately after an INFO message of the form:

INFO 12:06:26,979 Finished reading /path/to/commitlog/etc/CommitLog-NNNNNNNN.log

The stack traces are one of:

Exception encountered during startup.
java.io.IOError: java.io.EOFException
    at org.apache.cassandra.io.util.ColumnIterator.deserializeNext(ColumnSortedMap.java:246)
...

or

Exception encountered during startup.
java.lang.NullPointerException
    at org.apache.cassandra.db.Table.createReplicationStrategy(Table.java:318)
...
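
For anyone who wants to check for the same pattern in their own logs, here
is a minimal Python sketch that pairs each startup failure with the last
commit log segment reported as "Finished reading" before it.  It only
matches the two message strings quoted above; the function name is made up
for illustration, and it assumes the INFO message and its path sit on a
single line in the actual log file.

```python
import re
from typing import Iterable, List, Optional, Tuple

# Message fragments taken from the log excerpts quoted in this post.
FINISHED = re.compile(r"Finished reading\s+(\S+)")
FAILURE = "Exception encountered during startup"


def segments_before_failures(
    lines: Iterable[str],
) -> List[Tuple[Optional[str], str]]:
    """Pair each startup failure with the last commit log segment read.

    Returns a list of (segment_path, exception_line) tuples. segment_path
    is None if no "Finished reading" line was seen before the failure.
    """
    results = []
    last_segment = None
    it = iter(lines)
    for line in it:
        match = FINISHED.search(line)
        if match:
            # Remember the most recent segment that was fully read.
            last_segment = match.group(1)
        elif FAILURE in line:
            # The exception class is on the line after the failure banner.
            exception_line = next(it, "").strip()
            results.append((last_segment, exception_line))
    return results
```

To use it, open a node's log file and pass the file object as the iterable
of lines.  If every node reports the same segment, that points at the
replay of that particular commit log rather than at something specific to
one node.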

Fortunately, I have the luxury of clearing out the data in the cluster, but
I'd like a more elegant option than that.  Anybody have any suggestions?

Thanks,
Matt
