I have a 6-node cluster spread over 2 datacenters (3 nodes in each). I replaced all of the servers in the cluster (0.6.4) with new ones (0.6.5). My old cluster was unbalanced because I was using RandomPartitioner and had bootstrapped all the nodes without specifying their tokens.
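For reference, a balanced ring under RandomPartitioner is usually obtained by spacing the initial tokens evenly over the range 0 to 2**127. The following is only a minimal sketch of that calculation, assuming a single ring of six nodes and ignoring datacenter placement; the function name `balanced_tokens` is made up for illustration and this is not necessarily how the tokens in this cluster were chosen.

```python
# Evenly spaced initial tokens for RandomPartitioner.
# Assumption: the token space is 0 .. 2**127 and every node should own
# an equal slice of it. balanced_tokens is a hypothetical helper name.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return one token per node, evenly spaced around the ring."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    return [i * RING_SIZE // node_count for i in range(node_count)]


if __name__ == "__main__":
    # Six nodes, as in the cluster described above.
    for index, token in enumerate(balanced_tokens(6)):
        print("node %d: %d" % (index, token))
```

Each printed value would go into the corresponding node's initial token setting before it is bootstrapped.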
Since I wanted the cluster to be balanced, I first added all the new nodes one after the other (with the right tokens this time) and then ran decommission on all the old ones, one after the other. One of the decommissioned nodes began throwing "too many open files" errors while it was decommissioning, taking other nodes down with it. After the second try I decided to stop it and run removetoken on its token from one of the other nodes. After that everything went well, except that in the end one of the nodes looked unbalanced.

I decided to run repair on the cluster. What I got was totally unbalanced nodes with far more data than they were supposed to have: each node had 2x to 4x more data. I ran cleanup, and all of them except the one that was unbalanced to begin with went back to the size they were supposed to be. Now whenever I try to run cleanup on this node I get:

INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java (line 339) AntiCompacting ...
INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) GC for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 used; max is 6552551424
INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150) Pool Name Active Pending
INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) STREAM-STAGE 0 0
INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) RESPONSE-STAGE 0 0
INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156) ROW-READ-STAGE 8 717
INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) LB-OPERATIONS 0 0
INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) MISCELLANEOUS-POOL 0 0
INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) GMFD 0 2
INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) CONSISTENCY-MANAGER 0 1
INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) LB-TARGET 0 0
INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156) ROW-MUTATION-STAGE 0 0
INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) MESSAGE-STREAMING-POOL 0 0
INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) LOAD-BALANCER-STAGE 0 0
INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156) FLUSH-SORTER-POOL 0 0
INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) MEMTABLE-POST-FLUSHER 0 0
INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) AE-SERVICE-STAGE 0 0
INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) FLUSH-WRITER-POOL 0 0
INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) HINTED-HANDOFF-POOL 0 0
INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161) CompactionManager n/a 0
INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402 SSTableDeletingReference.java (line 104) Deleted ...
INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727 SSTableDeletingReference.java (line 104) Deleted ...
INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730 SSTableDeletingReference.java (line 104) Deleted ...
INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735 SSTableDeletingReference.java (line 104) Deleted ...

After that I saw an increase in the node's response time and in the number of ROW-READ-STAGE pending tasks. Since there was no indication that something was wrong or that the node was doing anything (in the logs, nodetool or JMX), the only thing I could do was restart the server.
I don't know if this is related, but every hour I see this error (I think it is the IP of the machine that I couldn't decommission properly):

INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) FatClient /X.X.X.X has been silent for 3600000ms, removing from gossip
ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip error
java.util.ConcurrentModificationException
        at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
        at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
        at org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node /X.X.X.X is now part of the cluster

Does anyone have any idea how I can clean up the problematic node?
Does anyone have any idea how I can get rid of the Gossip error?

Shimi