I was patient (although it is hard when you have millions of requests which are not served in time). I was waiting for a long time. There was nothing in the Logs and in JMX.
Shimi On Mon, Sep 20, 2010 at 6:12 PM, Gary Dusbabek <gdusba...@gmail.com> wrote: > On Mon, Sep 20, 2010 at 09:51, shimi <shim...@gmail.com> wrote: > > I have a cluster with 6 nodes on 2 datacenters (3 on each datacenter). > > I replaced all of the servers in the cluster (0.6.4) with new ones > (0.6.5). > > My old cluster was unbalanced since I was using Random Partitioner and I > > bootstrapped all the nodes without specifying their tokens. > > > > Since I wanted the the cluster to be balanced I first added all the new > > nodes one after the other (with the right tokens this time) and then I > run > > decommission on all the old ones, one after the other. > > One of the decommissioned nodes began throwing too many open files errors > > while It was decommissioning taking other nodes with him. After the > second > > try I decided to stop it and run removetoken on his token from one of the > > other nodes. After that everything went well except that in the end one > of > > the nodes looked unbalanced. > > > > I decided to run repair on the cluster. What I got is totally unbalanced > > nodes with way to much data then what is suppose to be. each node had > x2-x4 > > more data. > > I run cleanup and all of them except the one which was unbalanced to > begin > > with got back to the size they were suppose to be. > > Now whenever I try to run cleanup on this node I get: > > > > INFO [COMPACTION-POOL:1] 2010-09-20 12:04:23,069 CompactionManager.java > > (line 339) AntiCompacting ... > > INFO [GC inspection] 2010-09-20 12:05:37,600 GCInspector.java (line 129) > GC > > for ConcurrentMarkSweep: 1525 ms, 13641032 reclaimed leaving 767863520 > used; > > max is 6552551424 > > INFO [GC inspection] 2010-09-20 12:05:37,601 GCInspector.java (line 150) > > Pool Name Active Pending > > INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) > > STREAM-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,605 GCInspector.java (line 156) > > RESPONSE-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,606 GCInspector.java (line 156) > > ROW-READ-STAGE 8 717 > > INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) > > LB-OPERATIONS 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) > > MISCELLANEOUS-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,607 GCInspector.java (line 156) > > GMFD 0 2 > > INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) > > CONSISTENCY-MANAGER 0 1 > > INFO [GC inspection] 2010-09-20 12:05:37,608 GCInspector.java (line 156) > > LB-TARGET 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,609 GCInspector.java (line 156) > > ROW-MUTATION-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) > > MESSAGE-STREAMING-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,610 GCInspector.java (line 156) > > LOAD-BALANCER-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,611 GCInspector.java (line 156) > > FLUSH-SORTER-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) > > MEMTABLE-POST-FLUSHER 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,612 GCInspector.java (line 156) > > AE-SERVICE-STAGE 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) > > FLUSH-WRITER-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,613 GCInspector.java (line 156) > > HINTED-HANDOFF-POOL 0 0 > > INFO [GC inspection] 2010-09-20 12:05:37,616 GCInspector.java (line 161) > > CompactionManager n/a 0 > > INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,402 > > SSTableDeletingReference.java (line 104) Deleted ... > > INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,727 > > SSTableDeletingReference.java (line 104) Deleted ... > > INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,730 > > SSTableDeletingReference.java (line 104) Deleted ... > > INFO [SSTABLE-CLEANUP-TIMER] 2010-09-20 12:05:40,735 > > SSTableDeletingReference.java (line 104) Deleted ... > > > > and after that I saw an increase in the node response time and the number > > ROW-READ-STAGE pending tasks. Since there was no indication that > something > > is wrong or that the node is doing anyuthing (logs ,nodetool and JMX), > the > > only thing that I could have done is to restart the server. > > > > I don't know if this is related but every hour I see this error (I think > it > > is the IP of the machine that I couldn't decommission properly): > > > > INFO [Timer-0] 2010-09-20 13:56:11,406 Gossiper.java (line 402) > FatClient > > /X.X.X.X has been silent for 3600000ms, removing from gossip > > ERROR [Timer-0] 2010-09-20 13:56:11,421 Gossiper.java (line 99) Gossip > error > > java.util.ConcurrentModificationException > > at java.util.Hashtable$Enumerator.next(Hashtable.java:1031) > > at org.apache.cassandra.gms.Gossiper.doStatusCheck(Gossiper.java:383) > > at > > org.apache.cassandra.gms.Gossiper$GossipTimerTask.run(Gossiper.java:93) > > at java.util.TimerThread.mainLoop(Timer.java:512) > > at java.util.TimerThread.run(Timer.java:462) > > INFO [GMFD:1] 2010-09-20 13:56:43,251 Gossiper.java (line 586) Node > > /X.X.X.X is now part of the cluster > > > > Does anyone have any idea how can I cleanup the problematic node? > > You may just need to be patient. Have you tried monitoring the > CompactionManager in jmx to see if it is doing things? > > > Does anyone have any idea how can I get rid of the Gossip error? > > This is CASSANDRA-1494. You can ignore it. > > Gary. >