I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter).
I replaced all of the servers in the cluster (0.6.4) with new ones (0.6.5).
My old cluster was unbalanced since I was using Random Partitioner and I
bootstrapped all the nodes without specifying their tokens.

Since I wanted the the cluster to be balanced I first added all the new
nodes one after the other (with the right tokens this time) and then I run
decommission on all the old ones, one after the other.
One of the decommissioned nodes began throwing too many open files errors
while It was decommissioning taking other nodes with him. After the second
try I decided to stop it and run removetoken on his token from one of the
other nodes. After that everything went well except that in the end one of
the nodes looked unbalanced.

I decided to run repair on the cluster. What I got is totally unbalanced
nodes with way to much data then what is suppose to be. each node had x2-x4
more data.
I run cleanup and all of them except the one which was unbalanced to begin
with got back to the size they were suppose to be.
Now whenever I try to run cleanup on this node I get:

 INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java
(line 339) AntiCompacting ...
 INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) GC
for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 used;
max is 6552551424
 INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150)
Pool Name                    Active   Pending
 INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
STREAM-STAGE                      0         0
 INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156)
RESPONSE-STAGE                    0         0
 INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156)
ROW-READ-STAGE                    8       717
 INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
LB-OPERATIONS                     0         0
 INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
MISCELLANEOUS-POOL                0         0
 INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156)
GMFD                              0         2
 INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
CONSISTENCY-MANAGER               0         1
 INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156)
LB-TARGET                         0         0
 INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156)
ROW-MUTATION-STAGE                0         0
 INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
MESSAGE-STREAMING-POOL            0         0
 INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156)
LOAD-BALANCER-STAGE               0         0
 INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156)
FLUSH-SORTER-POOL                 0         0
 INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
MEMTABLE-POST-FLUSHER             0         0
 INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156)
AE-SERVICE-STAGE                  0         0
 INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
FLUSH-WRITER-POOL                 0         0
 INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156)
HINTED-HANDOFF-POOL               0         0
 INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161)
CompactionManager               n/a         0
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402
SSTableDeletingReference.java (line 104) Deleted ...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727
SSTableDeletingReference.java (line 104) Deleted ...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730
SSTableDeletingReference.java (line 104) Deleted ...
 INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735
SSTableDeletingReference.java (line 104) Deleted ...

and after that I saw an increase in the node response time and the number
ROW-READ-STAGE pending tasks. Since there was no indication that something
is wrong or that the node is doing anyuthing (logs ,nodetool and JMX), the
only thing that I could have done is to restart the server.

I don't know if this is related but every hour I see this error (I think it
is the IP of the machine that I couldn't decommission properly):

 INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) FatClient
/X.X.X.X has been silent for 3600000ms, removing from gossip
ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip error
java.util.ConcurrentModificationException
    at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
    at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
    at
org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
    at java.util.TimerThread.mainLoop(Timer.java:512)
    at java.util.TimerThread.run(Timer.java:462)
 INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node
/X.X.X.X is now part of the cluster

Does anyone have any idea how can I cleanup the problematic node?
Does anyone have any idea how can I get rid of the Gossip error?

Shimi

Reply via email to