On Mon, Sep 20, 2010 at 09:51, shimi <shim...@gmail.com> wrote:
> I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter).
> I replaced all of the servers in the cluster (0.6.4) with new ones (0.6.5).
> My old cluster was unbalanced since I was using Random Partitioner and I
> bootstrapped all the nodes without specifying their tokens.
>
> Since I wanted the cluster to be balanced I first added all the new
> nodes one after the other (with the right tokens this time) and then I ran
> decommission on all the old ones, one after the other.
> One of the decommissioned nodes began throwing too many open files errors
> while it was decommissioning, taking other nodes with it. After the second
> try I decided to stop it and run removetoken on its token from one of the
> other nodes. After that everything went well except that in the end one of
> the nodes looked unbalanced.
>
> I decided to run repair on the cluster. What I got is totally unbalanced
> nodes with far more data than they are supposed to have. Each node had
> x2-x4 more data.
> I ran cleanup and all of them except the one which was unbalanced to begin
> with got back to the size they were supposed to be.
> Now whenever I try to run cleanup on this node I get:
>
> INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java (line 339) AntiCompacting ...
> INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) GC for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 used; max is 6552551424
> INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150) Pool Name Active Pending
> INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) STREAM-STAGE 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) RESPONSE-STAGE 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156) ROW-READ-STAGE 8 717
> INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) LB-OPERATIONS 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) MISCELLANEOUS-POOL 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) GMFD 0 2
> INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) CONSISTENCY-MANAGER 0 1
> INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) LB-TARGET 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156) ROW-MUTATION-STAGE 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) MESSAGE-STREAMING-POOL 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) LOAD-BALANCER-STAGE 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156) FLUSH-SORTER-POOL 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) MEMTABLE-POST-FLUSHER 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) AE-SERVICE-STAGE 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) FLUSH-WRITER-POOL 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) HINTED-HANDOFF-POOL 0 0
> INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161) CompactionManager n/a 0
> INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402 SSTableDeletingReference.java (line 104) Deleted ...
> INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727 SSTableDeletingReference.java (line 104) Deleted ...
> INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730 SSTableDeletingReference.java (line 104) Deleted ...
> INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735 SSTableDeletingReference.java (line 104) Deleted ...
>
> and after that I saw an increase in the node response time and the number
> of ROW-READ-STAGE pending tasks. Since there was no indication that something
> was wrong or that the node was doing anything (logs, nodetool and JMX), the
> only thing I could do was restart the server.
>
> I don't know if this is related but every hour I see this error (I think it
> is the IP of the machine that I couldn't decommission properly):
>
> INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) FatClient /X.X.X.X has been silent for 3600000ms, removing from gossip
> ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip error
> java.util.ConcurrentModificationException
>     at java.util.Hashtable$Enumerator.next(Hashtable.java:1031)
>     at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383)
>     at org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93)
>     at java.util.TimerThread.mainLoop(Timer.java:512)
>     at java.util.TimerThread.run(Timer.java:462)
> INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node /X.X.X.X is now part of the cluster
>
> Does anyone have any idea how I can clean up the problematic node?

You may just need to be patient. Have you tried monitoring the
CompactionManager in JMX to see if it is doing things?

> Does anyone have any idea how can I get rid of the Gossip error?

This is CASSANDRA-1494. You can ignore it.

Gary.
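
For anyone wanting to script the JMX check suggested above instead of watching jconsole, here is a rough sketch in Java. It is not taken from the Cassandra source. The MBean name (org.apache.cassandra.db:type=CompactionManager), the attribute names (PendingTasks, ColumnFamilyInProgress, BytesCompacted, BytesTotalInProgress) and the default JMX port 8080 are assumptions about a 0.6.x node, so confirm them in jconsole first. The class name CompactionWatch and its helper methods are made up for illustration.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class CompactionWatch {
    // Percentage of bytes processed so far; 0 when nothing is in progress.
    static double percentDone(long bytesCompacted, long bytesTotal) {
        if (bytesTotal <= 0) {
            return 0.0;
        }
        return 100.0 * bytesCompacted / bytesTotal;
    }

    // JMX attributes may come back as null when no compaction is running.
    static long asLong(Object value) {
        return value == null ? 0L : ((Number) value).longValue();
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        String port = args.length > 1 ? args[1] : "8080"; // ASSUMED JMX port
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + host + ":" + port + "/jmxrmi");

        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // ASSUMED MBean and attribute names; verify them in jconsole.
            ObjectName cm = new ObjectName(
                    "org.apache.cassandra.db:type=CompactionManager");

            // Poll a few times; if the byte counter moves, the node is working.
            for (int i = 0; i < 10; i++) {
                long pending = asLong(mbs.getAttribute(cm, "PendingTasks"));
                Object cf = mbs.getAttribute(cm, "ColumnFamilyInProgress");
                long done = asLong(mbs.getAttribute(cm, "BytesCompacted"));
                long total = asLong(mbs.getAttribute(cm, "BytesTotalInProgress"));
                System.out.printf("pending=%d cf=%s %d/%d bytes (%.1f%%)%n",
                        pending, cf, done, total, percentDone(done, total));
                Thread.sleep(5000);
            }
        }
    }
}
```

Run it with the node's host and JMX port as arguments. If the byte counter keeps advancing between polls, the cleanup is still making progress and waiting is probably the right call. If it stays flat over many polls with tasks pending, that is worth reporting to the list.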