> This is RC2, I am assuming?
No, RC1. I had a number of problems when trying to upgrade to RC2; I am in the process of setting up a separate, clean cluster to see if the errors are reproducible.

> Your logs seem to indicate you have some node flapping problems. The logs
> complain about nodes *.18 and *.17 going down and immediately coming back
> up, even before the decommission. Is that common throughout your logs? If
> that's the case I would attribute seeing *.17 down after the decommission
> to that.

Yes, flapping has been a constant problem. Nodes will often go down for < 5 seconds and then come back up. I have not had time to put significant effort into determining why. At a system level, nothing is showing up in /var/log/messages, as I would expect if my TCP stack were somehow misconfigured. After running a quick test, I see there is significant jitter in inter-node ping times (although apparently no packet loss): average ping times of 0.2 ms, but roughly 1 in 10 around 5 ms and 1 in 50 up to 30 ms. Could this be causing Cassandra problems? The network jitter seems to be present only when Cassandra is under load. I should point out that all the nodes have gigabit Ethernet connections, and I believe we are using a Cisco switch, so I have a hard time believing bandwidth limits are being hit. Network troubleshooting is not really my area of expertise, so any suggestions you may have to diagnose the problem would be appreciated (a rough way to quantify the jitter is sketched further down in this thread).

A little beside the point: the main difference between this 'regular' flapping (by 'regular' I do not mean expected) and the current decommission issue is that the other node (*.17) did not come back up, even after the ~10 minutes I gave it before restarting.

> I also have a question about the log from the additional node. The log
> indicates that *.19 was not a part of the cluster until you ran the
> removetoken command. Can you verify that the ring output for that node was
> what you saw *before* you ran the removetoken command?

I am 100% sure the logs are from before I ran the removetoken command. I only thought to try it as I was about to send my email.

I need to decommission two more nodes. I will turn on debug logging when doing so to try and get some extra info.

Thanks for your help,
Dan

From: Nick Bailey [mailto:n...@riptano.com]
Sent: December-15-10 9:46
To: user@cassandra.apache.org
Subject: Re: Errors when decommissioning - 0.7 RC1

Your logs seem to indicate you have some node flapping problems. The logs complain about nodes *.18 and *.17 going down and immediately coming back up, even before the decommission. Is that common throughout your logs? If that's the case I would attribute seeing *.17 down after the decommission to that.

I also have a question about the log from the additional node. The log indicates that *.19 was not a part of the cluster until you ran the removetoken command. Can you verify that the ring output for that node was what you saw *before* you ran the removetoken command?

On Wed, Dec 15, 2010 at 8:34 AM, Nick Bailey <n...@riptano.com> wrote:

Just realized the ring output is included in the logs for both of those nodes. Disregard my earlier request :).

On Wed, Dec 15, 2010 at 8:27 AM, Nick Bailey <n...@riptano.com> wrote:

This is RC2, I am assuming?

One thing about remove: the removetoken force command is meant to be run on the node that originally started a remove, and it doesn't take a token parameter. Not relevant to your problem, though. Is this a test cluster, and have you tried to reproduce the error?
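For reference, that intended removetoken sequence would look something like the following. This is a sketch only; it assumes nodetool is on the PATH and the 0.7 default JMX port of 8080, and uses *.15 purely as an example host:

    # Start the removal from any live node, passing the dead node's token:
    nodetool -h 192.168.4.15 -p 8080 removetoken 113427455640312821154458202477256070484

    # If it stalls, run the force variant on the SAME node that started the
    # removal, with no token argument:
    nodetool -h 192.168.4.15 -p 8080 removetoken force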
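On the jitter question raised earlier in this thread, a crude way to quantify it is the sketch below. It assumes a Linux ping and bash; the target host and sample count are examples only:

    #!/bin/bash
    # Rough jitter check between two nodes: send 500 pings, extract the
    # latency of each reply, and report the tail of the distribution.
    HOST=192.168.4.17
    ping -c 500 -i 0.2 "$HOST" \
      | awk -F'time=' '/time=/ { print $2 + 0 }' \
      | sort -n \
      | awk '{ v[NR] = $1 }
             END {
               printf "p50 %.2f ms  p90 %.2f ms  p99 %.2f ms  max %.2f ms\n",
                      v[int(NR * 0.50)], v[int(NR * 0.90)], v[int(NR * 0.99)], v[NR]
             }'

If the tail really does reach tens of milliseconds under load, it is also worth checking netstat -s and the switch port counters for retransmits or drops. As a general 0.7 tuning note (an assumption on my part, not something established in this thread), raising phi_convict_threshold in cassandra.yaml above its default of 8 makes the gossip failure detector more tolerant of exactly this kind of jitter-induced flapping.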
I would be interested to know what the ring command looks like on both *.19 and *.17 after the decommission is run. I assume you were running the ring command on another node? I'll look into the logs more and see if anything jumps out.

On Wed, Dec 15, 2010 at 6:37 AM, Dan Hendry <dan.hendry.j...@gmail.com> wrote:

I am seeing very strange things when trying to decommission a node in my cluster (detailed logs attached). Here is a nodetool ring report *after* decommissioning of node 192.168.4.19 (as seen by any other, properly functioning node):

    Address       Status State   Load      Owns    Token
    192.168.4.15  Up     Normal  49.9 GB   25.00%  42535295865117307932921825928971026431
    192.168.4.20  Up     Normal  42.56 GB   8.33%  56713727820156410577229101238628035242
    192.168.4.16  Up     Normal  29.17 GB  16.67%  85070591730234615865843651857942052863
    192.168.4.19  Down   Leaving 54.11 GB  16.67%  113427455640312821154458202477256070484
    192.168.4.17  Down   Normal  48.88 GB   8.33%  127605887595351923798765477786913079295
    192.168.4.18  Up     Normal  59.44 GB  25.00%  170141183460469231731687303715884105726
    192.168.4.12  Up     Normal  52.3 GB    0.00%  170141183460469231731687303715884105727

What I am seeing is that after nodetool decommission completes on 192.168.4.19, the next node in the ring (192.168.4.17) 'dies' (see attached log; its nodetool ring report is quite different). By 'dies' I mean that it stops communicating with other nodes (but the Cassandra process is still running and, among other things, compaction continues). After restarting Cassandra on 192.168.4.17, the ring state gets stuck and the decommissioned node (192.168.4.19) does not get removed (at least from the nodetool ring report):

    Address       Status State   Load      Owns    Token
    192.168.4.15  Up     Normal  49.9 GB   25.00%  42535295865117307932921825928971026431
    192.168.4.20  Up     Normal  42.56 GB   8.33%  56713727820156410577229101238628035242
    192.168.4.16  Up     Normal  29.17 GB  16.67%  85070591730234615865843651857942052863
    192.168.4.19  Down   Leaving 54.11 GB  16.67%  113427455640312821154458202477256070484
    192.168.4.17  Up     Normal  69.12 GB   8.33%  127605887595351923798765477786913079295
    192.168.4.18  Up     Normal  58.88 GB  25.00%  170141183460469231731687303715884105726
    192.168.4.12  Up     Normal  52.3 GB    0.00%  170141183460469231731687303715884105727

Furthermore, when I try running "nodetool removetoken 113427455640312821154458202477256070484", I get:

    Exception in thread "main" java.lang.UnsupportedOperationException: Node /192.168.4.19 is already being removed.
            at org.apache.cassandra.service.StorageService.removeToken(StorageService.java:1731)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)

And when I try running "nodetool removetoken force 113427455640312821154458202477256070484", I get:

    RemovalStatus: No token removals in process.
    Exception in thread "main" java.lang.NullPointerException
            at org.apache.cassandra.service.StorageService.forceRemoveCompletion(StorageService.java:1703)
            at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            at java.lang.reflect.Method.invoke(Method.java:597)
            at com.sun.jmx.mbeanserver.StandardMBeanIntrospector.invokeM2(StandardMBeanIntrospector.java:93)

?!?!?!?
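Since each node can report a different view of the ring, comparing them side by side helps narrow down which nodes disagree (e.g. who still shows *.19 stuck in 'Leaving'). A minimal sketch, again assuming nodetool is on the PATH and the 0.7 default JMX port of 8080:

    #!/bin/bash
    # Print each node's view of the ring, one after another, so that
    # divergent views are easy to spot.
    for host in 192.168.4.12 192.168.4.15 192.168.4.16 192.168.4.17 \
                192.168.4.18 192.168.4.19 192.168.4.20; do
        echo "=== ring as seen by ${host} ==="
        nodetool -h "${host}" -p 8080 ring
    done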
I think I have seen this type of behaviour once or twice before (I believe on 0.7 beta 1 or later) but wrote it off as being caused by my misguided tinkering and/or other Cassandra bugs. This time around, I have done very little with JMX/CLI/nodetool, and I can find no related Cassandra bugs.

Help/suggestions?

Dan Hendry
(403) 660-2297