Re: practice failure recovery

William Oberman Tue, 26 Apr 2011 14:33:20 -0700

Done and done.  I'm really loving how easy the nuclear option has been (it
was what I tested first).


will

On Tue, Apr 26, 2011 at 5:09 PM, aaron morton <aa...@thelastpickle.com> wrote:

> In 0.7.X the CLI waits for the schema to agree before returning; you should
> see...
>
> Waiting for schema agreement...
> ... schemas agree across the cluster
>
> Or if things fail
> The schema has not settled in %d seconds; further migrations are
> ill-advised until it does.%nVersions are %s%n
>
> WRT the error, my first guess is that something in the schema has changed
> and it's upsetting the commit log replay. Given all the crazy, I'd go with
> the nuclear option.
>
> Aaron
>
> On 27 Apr 2011, at 07:11, William Oberman wrote:
>
> > In my test cluster I managed to jam up a Cassandra server.  I figure the
> > easy & failsafe solution is to just boot a replacement node, but I thought
> > I'd take a minute to either figure out what I did, or figure out how to
> > properly recover it, before I lose my current state.
> >
> > The symptom = on startup I get an exception:
> > ERROR 11:58:34,567 Exception encountered during startup.
> > java.lang.IndexOutOfBoundsException: 6
> >         at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:121)
> >         at org.apache.cassandra.db.marshal.TimeUUIDType.compareTimestampBytes(TimeUUIDType.java:56)
> >         at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:45)
> >         at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:29)
> >         at java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)
> >         at java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)
> >         at java.util.concurrent.ConcurrentSkipListMap.doPut(ConcurrentSkipListMap.java:864)
> >         at java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(ConcurrentSkipListMap.java:1893)
> >         at org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:216)
> >         at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:130)
> >         at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
> >         at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380)
> >         at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:253)
> >         at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:156)
> >         at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:173)
> >         at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
> >         at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)
> >
> > Where things went wrong = I had been doing various testing and unit
> > testing, as this is my "proof of concept" cluster.  The unit tests in
> > particular work by cloning a keyspace as "keyspace_UUID" (to get a blank
> > slate).  Because of various bugs in my code and configuration, this left
> > a fair number of cruft keyspaces by the time I got everything to pass.
> > So, I wrote a script to drop all of the test keyspaces (the script had
> > worked on a single node environment, which was my first step before the
> > cluster).  I think the CLI doesn't wait for schema propagation, so the
> > script confused the node I was talking to: after it ran, the schema UUIDs
> > of that node vs. the rest of the cluster didn't agree ("describe cluster"
> > in the CLI).  And, it wasn't fixing itself.  "nodetool loadbalance" said
> > it would do a decommission/bootstrap, which I thought might give the bad
> > node a kick in the pants, so I tried it.  Afterwards, I ran "nodetool
> > ring" against all nodes and the problem node claimed all was "UP", but
> > everything else listed the problem node as "?" and everything else as UP
> > (sadly, I either didn't check or can't remember what "nodetool ring" said
> > before loadbalance).  So, I shut down the problem node.  But, when I
> > tried to restart it, I got the error you see above.
> >
> > Not sure what was the worst/dumbest thing I did, but it's definitely
> > unhappy now!
>
>
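
For anyone scripting bulk keyspace drops like the one described in the quoted
message: a minimal sketch of waiting for the schema to settle between drops,
roughly mirroring what the 0.7.X CLI is described as doing above. This is not
the CLI's actual code. Both callables are hypothetical stand-ins you would
supply yourself: fetch_versions should return a mapping of schema version to
the hosts reporting it (the same information "describe cluster" prints), and
drop_keyspace should issue the drop through whatever client you use.
Unreachable nodes are not handled here.

```python
import time


def wait_for_schema_agreement(fetch_versions, timeout=10.0, poll_interval=0.5,
                              sleep=time.sleep, clock=time.monotonic):
    """Poll until every node reports the same schema version.

    fetch_versions is a hypothetical callable returning
    {schema_version: [hosts]}. Returns True on agreement, or False if
    the timeout elapses first.
    """
    deadline = clock() + timeout
    while True:
        versions = fetch_versions()
        if len(versions) == 1:
            # Exactly one version across the cluster: schemas agree.
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)


def drop_keyspaces(names, drop_keyspace, fetch_versions, **wait_kwargs):
    """Drop keyspaces one at a time, stopping if the schema fails to settle."""
    dropped = []
    for name in names:
        drop_keyspace(name)  # hypothetical callable that issues the drop
        if not wait_for_schema_agreement(fetch_versions, **wait_kwargs):
            # Further migrations are ill-advised until the schema settles.
            raise RuntimeError(
                "schema did not settle after dropping %r" % (name,))
        dropped.append(name)
    return dropped
```

The point of the design is to issue one migration at a time and refuse to
continue while more than one schema version is visible, rather than firing
all the drops at a single node back to back.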


-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com
