In my test cluster I managed to jam up a Cassandra server.  I figure the easy
& failsafe solution is to just boot a replacement node, but I thought I'd
spend a minute trying to either figure out what I did, or figure out how to
properly recover the node, before I lose my current state.

The symptom = on startup I get an exception:
ERROR 11:58:34,567 Exception encountered during startup.
java.lang.IndexOutOfBoundsException: 6
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:121)
        at org.apache.cassandra.db.marshal.TimeUUIDType.compareTimestampBytes(TimeUUIDType.java:56)
        at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:45)
        at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:29)
        at java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)
        at java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)
        at java.util.concurrent.ConcurrentSkipListMap.doPut(ConcurrentSkipListMap.java:864)
        at java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(ConcurrentSkipListMap.java:1893)
        at org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:216)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:130)
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
        at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380)
        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:253)
        at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:156)
        at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:173)
        at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
        at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)

Where things went wrong = I had been doing various testing and unit testing,
as this is my "proof of concept" cluster.  The unit tests in particular work
by cloning a keyspace as "keyspace_UUID" (to get a blank slate).  Because of
various bugs in my code and configuration, this left a fair number of junk
keyspaces behind by the time I got everything to pass.  So I wrote a script
to drop all of the test keyspaces (the script had worked in a single-node
environment, which was my first step before the cluster).  I think the CLI
doesn't wait for schema propagation, so the script confused the node I was
talking to: after it ran, that node's schema UUID no longer agreed with the
rest of the cluster ("describe cluster" in the CLI), and it wasn't fixing
itself.

"nodetool loadbalance" said it would do a decommission/bootstrap, which I
thought might give the bad node a kick in the pants, so I tried it.
Afterwards, I ran "nodetool ring" against all nodes: the problem node claimed
everything was "Up", but every other node listed the problem node as "?" and
everything else as "Up" (sadly, I either didn't check or can't remember what
"nodetool ring" said before the loadbalance).  So I shut down the problem
node.  But when I tried to restart it, I got the error you see above.
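
For what it's worth, what the drop script probably should have been doing is
something like the rough sketch below, written against the raw Thrift API
(the localhost:9160 endpoint, the framed transport, and the "keyspace_"
prefix are just assumptions from my setup): drop each test keyspace, then
poll describe_schema_versions() until every reachable node reports the same
schema UUID before issuing the next drop.

import java.util.List;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.KsDef;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class DropTestKeyspaces
{
    public static void main(String[] args) throws Exception
    {
        // Assumed endpoint and framed transport for the node the script talks to.
        TTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();

        // Drop every keyspace left behind by the unit tests ("keyspace_<UUID>").
        for (KsDef ks : client.describe_keyspaces())
        {
            if (ks.getName().startsWith("keyspace_"))
            {
                System.out.println("Dropping " + ks.getName());
                client.system_drop_keyspace(ks.getName());
                waitForSchemaAgreement(client);
            }
        }

        transport.close();
    }

    // Poll describe_schema_versions() until all reachable nodes report the same
    // schema UUID (the same thing "describe cluster" shows in the CLI).
    private static void waitForSchemaAgreement(Cassandra.Client client) throws Exception
    {
        while (true)
        {
            Map<String, List<String>> versions = client.describe_schema_versions();
            versions.remove("UNREACHABLE"); // ignore nodes that are down
            if (versions.size() == 1)
                return;
            Thread.sleep(1000);
        }
    }
}

My actual script just fired the drops back-to-back through the CLI, with
nothing like that wait in between.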

I'm not sure which of those was the worst/dumbest thing I did, but the node
is definitely unhappy now!
