Re: practice failure recovery

aaron morton Tue, 26 Apr 2011 14:10:11 -0700

In 0.7.X the CLI waits for the schema to agree before returning. You should 
see:


Waiting for schema agreement...
... schemas agree across the cluster

Or, if things fail:
The schema has not settled in %d seconds; further migrations are ill-advised 
until it does.%nVersions are %s%n
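
For reference, that failure text is a Java format string (the line break after "ill-advised" is only email wrapping). A minimal sketch of how it renders, assuming the %d is a timeout in seconds and the %s is some collection of schema versions; the class name, method name and sample values below are made up for illustration and are not taken from the Cassandra source:

```java
import java.util.Arrays;
import java.util.List;

public class SchemaMessageDemo {
    // Format string as quoted in the message above, rejoined onto one line.
    static final String NOT_SETTLED =
        "The schema has not settled in %d seconds; further migrations are "
        + "ill-advised until it does.%nVersions are %s%n";

    // Hypothetical helper: fills in the timeout and the versions seen.
    static String render(int seconds, List<String> versions) {
        // %n expands to the platform line separator, %s uses toString().
        return String.format(NOT_SETTLED, seconds, versions);
    }

    public static void main(String[] args) {
        // Placeholder values only; real schema versions are UUIDs.
        System.out.print(render(10, Arrays.asList("version-a", "version-b")));
    }
}
```

If you see more than one version listed after "Versions are", the nodes have not converged on a single schema.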

WRT the error, my first guess is that something in the schema has changed and 
it's upsetting the commit log replay. Given all the crazy, I'd go with the 
nuclear option. 
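
To illustrate the kind of failure in the trace (a guess at the mechanism, not a diagnosis): the exception comes from an indexed ByteBuffer read inside the TimeUUID comparator, and an indexed read past the end of a buffer throws IndexOutOfBoundsException. The sketch below is a hypothetical stand-in, not Cassandra's TimeUUIDType code; it only shows that reading byte 6 works on a 16-byte value and fails on a shorter one, which is what would happen if replayed column names were not UUID-sized.

```java
import java.nio.ByteBuffer;

public class ShortNameDemo {
    // Hypothetical stand-in for a comparator step that reads a fixed
    // offset inside a 16-byte UUID. Not the real Cassandra code.
    static int byteAt6(ByteBuffer name) {
        return name.get(name.position() + 6) & 0xFF;
    }

    // Returns true if the indexed read runs off the end of the buffer.
    static boolean failsOn(ByteBuffer name) {
        try {
            byteAt6(name);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ByteBuffer uuidSized = ByteBuffer.wrap(new byte[16]);   // 16 bytes, like a UUID
        ByteBuffer tooShort = ByteBuffer.wrap(new byte[4]);     // 4 bytes, not a UUID
        System.out.println("16-byte name fails: " + failsOn(uuidSized));
        System.out.println("4-byte name fails:  " + failsOn(tooShort));
    }
}
```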

Aaron
 
On 27 Apr 2011, at 07:11, William Oberman wrote:

> In my test cluster I managed to jam up a Cassandra server.  I figure the easy 
> & failsafe solution is to just boot a replacement node, but I thought I'd try 
> a minute to either figure out what I did, or try to figure out how to 
> properly recover it before I lose my current state.
> 
> The symptom = on startup I get an exception:
> ERROR 11:58:34,567 Exception encountered during startup.
> java.lang.IndexOutOfBoundsException: 6
>         at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:121)
>         at 
> org.apache.cassandra.db.marshal.TimeUUIDType.compareTimestampBytes(TimeUUIDType.java:56)
>         at 
> org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:45)
>         at 
> org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:29)
>         at 
> java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)
>         at 
> java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)
>         at 
> java.util.concurrent.ConcurrentSkipListMap.doPut(ConcurrentSkipListMap.java:864)
>         at 
> java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(ConcurrentSkipListMap.java:1893)
>         at 
> org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:216)
>         at 
> org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:130)
>         at 
> org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
>         at 
> org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:253)
>         at 
> org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:156)
>         at 
> org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:173)
>         at 
> org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
>         at 
> org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)
> 
> Where things went wrong = I had been doing various testing and unit testing, 
> as this is my "proof of concept" cluster.  The unit tests in particular work 
> by cloning a keyspace as "keyspace_UUID" (to get a blank slate).  Because of 
> various bugs in my code and configuration, this left a fair amount of crud 
> keyspaces by the time I got everything to pass.  So, I wrote a script to drop 
> all of the test keyspaces (the script had worked on a single node 
> environment, which was my first step before the cluster).  I think the CLI 
> doesn't wait for schema propagation, so the script confused the node I was 
> talking to, as after it ran the schema UUIDs of that node vs. the rest of the 
> cluster didn't agree ("describe cluster" in the CLI).  And, it wasn't fixing 
> itself.  "nodetool loadbalance" said it would do a decommission/bootstrap, 
> which I thought might give the bad node a kick in the pants, so I tried it.  
> Afterwards, I ran "nodetool ring" against all nodes and the problem node 
> claimed all was "UP", but everything else listed the problem node as "?" and 
> everything else as UP (sadly, I either didn't check or can't remember what 
> "nodetool ring" said before loadbalance).  So, I shut down the problem node.  
> But, when I tried to restart it, I got the error you see above.
> 
> Not sure what was the worst/dumbest thing I did, but it's definitely unhappy 
> now!
