Your logs seem to indicate you have some node flapping problems. The logs complain about nodes *.18 and *.17 going down and immediately coming back up, even before the decommission. Is that common throughout your logs? If that's the case, I would attribute seeing *.17 down after the decommission to that.
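If you want to quantify it, here is a rough sketch; I'm assuming the gossip state changes show up in your log with wording like "is now UP" / "is now dead" and that the log sits at the default /var/log/cassandra/system.log, so adjust both to match what you actually see:

    # Count gossip UP/DOWN transitions per node (log path and message
    # wording are assumptions; match them to your own log output)
    grep -iE "is now (up|dead|down)" /var/log/cassandra/system.log \
        | grep -oE "192\.168\.4\.[0-9]+" \
        | sort | uniq -c

If *.17 and *.18 rack up a lot of transitions even before the decommission, that points at a flapping problem rather than something the decommission itself caused.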
I also have a question about the log from the additional node. The log indicates that *.19 was not a part of the cluster until you ran the removetoken command. Can you verify that the ring output for that node was what you saw *before* you ran the removetoken command?

On Wed, Dec 15, 2010 at 8:34 AM, Nick Bailey <n...@riptano.com> wrote:

> Just realized the ring output is included in the logs for both of those
> nodes. Disregard my earlier request :).
>
>
> On Wed, Dec 15, 2010 at 8:27 AM, Nick Bailey <n...@riptano.com> wrote:
>
>> This is rc2, I am assuming?
>>
>> One thing about remove: the removetoken force command is meant to be run
>> on the node that originally started a remove and doesn't take a token
>> parameter. Not relevant to your problem though.
>>
>> Is this a test cluster, and have you tried to reproduce the error? I
>> would be interested to know what the ring command looks like on both
>> *.19 and *.17 after the decommission is run. I assume you were running
>> the ring command on another node? I'll look into the logs more and see
>> if anything jumps out.
>>
>> On Wed, Dec 15, 2010 at 6:37 AM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:
>>
>>> I am seeing very strange things when trying to decommission a node in
>>> my cluster (detailed logs attached). Here is a nodetool ring report
>>> *after* decommissioning of node 192.168.4.19 (as seen by any other,
>>> properly functioning node):
>>>
>>> 192.168.4.15   Up     Normal    49.9 GB    25.00%   42535295865117307932921825928971026431
>>> 192.168.4.20   Up     Normal    42.56 GB    8.33%   56713727820156410577229101238628035242
>>> 192.168.4.16   Up     Normal    29.17 GB   16.67%   85070591730234615865843651857942052863
>>> 192.168.4.19   Down   Leaving   54.11 GB   16.67%   113427455640312821154458202477256070484
>>> 192.168.4.17   Down   Normal    48.88 GB    8.33%   127605887595351923798765477786913079295
>>> 192.168.4.18   Up     Normal    59.44 GB   25.00%   170141183460469231731687303715884105726
>>> 192.168.4.12   Up     Normal    52.3 GB     0.00%   170141183460469231731687303715884105727
>>>
>>> What I am seeing is that after nodetool decommission completes on
>>> 192.168.4.19, the next node in the ring (192.168.4.17) ‘dies’ (see
>>> attached log; its nodetool ring report is quite different). By ‘dies’ I
>>> mean that it stops communicating with other nodes (but the Cassandra
>>> process is still running and, among other things, compaction
>>> continues). After restarting Cassandra on 192.168.4.17, the ring state
>>> gets stuck and the decommissioned node (192.168.4.19) does not get
>>> removed (at least from the nodetool ring report):
>>>
>>> 192.168.4.15   Up     Normal    49.9 GB    25.00%   42535295865117307932921825928971026431
>>> 192.168.4.20   Up     Normal    42.56 GB    8.33%   56713727820156410577229101238628035242
>>> 192.168.4.16   Up     Normal    29.17 GB   16.67%   85070591730234615865843651857942052863
>>> 192.168.4.19   Down   Leaving   54.11 GB   16.67%   113427455640312821154458202477256070484
>>> 192.168.4.17   Up     Normal    69.12 GB    8.33%   127605887595351923798765477786913079295
>>> 192.168.4.18   Up     Normal    58.88 GB   25.00%   170141183460469231731687303715884105726
>>> 192.168.4.12   Up     Normal    52.3 GB     0.00%   170141183460469231731687303715884105727
>>>
>>> Furthermore, when I try running “nodetool removetoken
>>> 113427455640312821154458202477256070484”, I get:
>>>
>>> Exception in thread "main" java.lang.UnsupportedOperationException: Node
>>> /192.168.4.19 is already being removed.
>>>         at org.apache.cassandra.service.StorageService.removeToken(StorageService.java:1731)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
>>>
>>> And when I try running “nodetool removetoken force
>>> 113427455640312821154458202477256070484”, I get:
>>>
>>> RemovalStatus: No token removals in process.
>>> Exception in thread "main" java.lang.NullPointerException
>>>         at org.apache.cassandra.service.StorageService.forceRemoveCompletion(StorageService.java:1703)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>         at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>         at java.lang.reflect.Method.invoke(Method.java:597)
>>>         at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)
>>>
>>> ?!?!?!?
>>>
>>> I think I have seen this type of behaviour once or twice before (I
>>> believe 0.7 b1 or later) but wrote it off as being caused by my
>>> misguided tinkering and/or other Cassandra bugs. This time around, I
>>> have done very little with JMX/CLI/nodetool and I can find no related
>>> Cassandra bugs.
>>>
>>> Help/suggestions?
>>>
>>> Dan Hendry
>>> (403) 660-2297
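One more note on the remove commands, since the force variant came up above. The intended usage looks roughly like this (the host below is just a placeholder; the point is that the force form takes no token and has to be run on whichever node issued the original removetoken):

    # Start the removal from any live node, passing the dead node's token
    nodetool -h 192.168.4.15 removetoken 113427455640312821154458202477256070484

    # If the removal gets stuck, run the force form on the same node that
    # started it, with no token argument
    nodetool -h 192.168.4.15 removetoken force

That would be consistent with the "RemovalStatus: No token removals in process" message from the force attempt, though as I said earlier it is a separate issue from the stuck ring.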