I ran into this too. I also tried -Dcassandra.load_ring_state=false, which did not help either. The way I got through this was to stop the entire cluster and then start the nodes one-by-one.
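That workaround can be sketched roughly as follows. The node addresses, ssh access, and the `service cassandra` init-script name are all assumptions about the environment; the actual remote commands are commented out since they are site-specific.

```shell
#!/bin/sh
# Sketch of the "stop everything, then start one-by-one" workaround.
# NODES, ssh access, and the init-script name are assumptions.
NODES="192.168.20.2 192.168.20.3"

# Stop Cassandra everywhere first, so no still-running peer can hand the
# phantom endpoint back to a restarting node via gossip.
for host in $NODES; do
    echo "stopping cassandra on $host"
    # ssh "$host" 'sudo service cassandra stop'
done

# Bring the nodes back one at a time, waiting for each to answer nodetool
# before starting the next.
for host in $NODES; do
    echo "starting cassandra on $host"
    # ssh "$host" 'sudo service cassandra start'
    # until nodetool -h "$host" ring >/dev/null 2>&1; do sleep 5; done
done
```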
I realize this is not a practical solution for everyone, but if you can afford to stop the cluster for a few minutes, it's worth a try.

On Aug 23, 2011, at 9:26 AM, aaron morton wrote:

> I'm running low on ideas for this one. Anyone else?
>
> If the phantom node is not listed in the ring, other nodes should not be storing hints for it. You can see which nodes they are storing hints for via JConsole.
>
> You can try a rolling restart passing the JVM opt -Dcassandra.load_ring_state=false. However, if the phantom node is being passed around in the gossip state it will probably just come back again.
>
> Cheers
>
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
>
> On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote:
>
>> Could this ghost node be causing my hints column family to grow to this size? I also crash after about 24 hours due to commit log growth taking up all the drive space. A manual nodetool flush keeps it under control though.
>>
>> Column Family: HintsColumnFamily
>> SSTable count: 6
>> Space used (live): 666480352
>> Space used (total): 666480352
>> Number of Keys (estimate): 768
>> Memtable Columns Count: 1043
>> Memtable Data Size: 461773
>> Memtable Switch Count: 3
>> Read Count: 38
>> Read Latency: 131.289 ms.
>> Write Count: 582108
>> Write Latency: 0.019 ms.
>> Pending Tasks: 0
>> Key cache capacity: 7
>> Key cache size: 6
>> Key cache hit rate: 0.8333333333333334
>> Row cache: disabled
>> Compacted row minimum size: 2816160
>> Compacted row maximum size: 386857368
>> Compacted row mean size: 120432714
>>
>> Is there a way for me to manually remove this dead node?
>>
>> -----Original Message-----
>> From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com]
>> Sent: Sunday, August 21, 2011 9:09 PM
>> To: user@cassandra.apache.org
>> Subject: RE: Completely removing a node from the cluster
>>
>> It's been at least 4 days now.
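The rolling restart Aaron suggests in the quoted message might look something like this in practice. The `cassandra-env.sh` path, the service commands, and the node list are assumptions; the real remote commands are commented out because they vary by install.

```shell
#!/bin/sh
# Sketch of a rolling restart with the ring-state flag from the quoted email.
# The flag makes a node ignore its locally saved ring state at startup and
# rebuild it from gossip. Paths and service commands are assumptions.
RING_FLAG="-Dcassandra.load_ring_state=false"
NODES="192.168.20.2 192.168.20.3"

for host in $NODES; do
    echo "restarting $host with $RING_FLAG"
    # One common way to inject the flag: append it to JVM_OPTS in the node's
    # cassandra-env.sh (path assumed), then restart:
    # ssh "$host" "echo 'JVM_OPTS=\"\$JVM_OPTS $RING_FLAG\"' >> /etc/cassandra/cassandra-env.sh"
    # ssh "$host" 'sudo service cassandra restart'
done
```

As the quoted message warns, if the phantom endpoint is still circulating in gossip, the other live nodes may simply re-teach it to each restarted node.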
>>
>> -----Original Message-----
>> From: aaron morton [mailto:aa...@thelastpickle.com]
>> Sent: Sunday, August 21, 2011 3:16 PM
>> To: user@cassandra.apache.org
>> Subject: Re: Completely removing a node from the cluster
>>
>> I see the mistake I made about ring: it gets the endpoint list from the same place, but uses the tokens to drive the whole process.
>>
>> I'm guessing here, don't have time to check all the code. But there is a 3 day timeout in the gossip system. Not sure if it applies in this case.
>>
>> Anyone know?
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote:
>>
>>> Both .2 and .3 report the same thing from the MBean: UnreachableNodes is an empty collection, and LiveNodes still lists all 3 nodes:
>>> 192.168.20.2
>>> 192.168.20.3
>>> 192.168.20.1
>>>
>>> The removetoken was done a few days ago, and I believe the remove was done from .2
>>>
>>> Here is what the ring output looks like; not sure why I get that token on the empty first line either:
>>> Address         DC          Rack   Status State   Load      Owns    Token
>>>                                                                     85070591730234615865843651857942052864
>>> 192.168.20.2    datacenter1 rack1  Up     Normal  79.53 GB  50.00%  0
>>> 192.168.20.3    datacenter1 rack1  Up     Normal  42.63 GB  50.00%  85070591730234615865843651857942052864
>>>
>>> Yes, both nodes show the same thing when doing a describe cluster: that .1 is unreachable.
>>>
>>> -----Original Message-----
>>> From: aaron morton [mailto:aa...@thelastpickle.com]
>>> Sent: Sunday, August 21, 2011 4:23 AM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Completely removing a node from the cluster
>>>
>>> Unreachable nodes either did not respond to the message or were known to be down and were not sent a message.
>>> The node lists for the ring command and describe cluster are obtained the same way, so it's a bit odd.
>>>
>>> Can you connect to JMX and have a look at the o.a.c.db.StorageService MBean? What do the LiveNodes and UnreachableNodes attributes say?
>>>
>>> Also, how long ago did you remove the token, and on which machine? Do both 20.2 and 20.3 think 20.1 is still around?
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote:
>>>
>>>> I'm on 0.8.4
>>>>
>>>> I have removed a dead node from the cluster using the nodetool removetoken command, and moved one of the remaining nodes to rebalance the tokens. Everything looks fine when I run nodetool ring now, as it only lists the remaining 2 nodes and they both look fine, each owning 50% of the tokens.
>>>>
>>>> However, I can still see it being considered part of the cluster from cassandra-cli (192.168.20.1 being the removed node), and I'm worried that the cluster is still queuing up hints for the node, or that it may cause other issues:
>>>>
>>>> Cluster Information:
>>>>    Snitch: org.apache.cassandra.locator.SimpleSnitch
>>>>    Partitioner: org.apache.cassandra.dht.RandomPartitioner
>>>>    Schema versions:
>>>>        dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
>>>>        UNREACHABLE: [192.168.20.1]
>>>>
>>>> Do I need to do something else to completely remove this node?
>>>>
>>>> Thanks,
>>>> Bryce
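For reference, the removal steps discussed across this thread can be sketched as below. The token value is a placeholder (the dead node's actual token is not given in the thread), and the availability of `removetoken status`/`force` and the exact HintedHandoffManager MBean operation on a given 0.8.x build are assumptions worth verifying against your version.

```shell
#!/bin/sh
# Sketch of a manual cleanup for the dead endpoint, run against a live node.
DEAD_NODE="192.168.20.1"
DEAD_TOKEN="<token-of-dead-node>"   # placeholder -- substitute the real token

echo "removing token $DEAD_TOKEN (previously owned by $DEAD_NODE)"
# nodetool -h 192.168.20.2 removetoken "$DEAD_TOKEN"
#
# If the removal never completes, later 0.8.x builds add (assumption --
# check your version):
# nodetool -h 192.168.20.2 removetoken status
# nodetool -h 192.168.20.2 removetoken force
#
# Hints already queued for the endpoint can then be dropped in JConsole via
# the org.apache.cassandra.db:type=HintedHandoffManager MBean's
# deleteHintsForEndpoint operation (operation name is an assumption for
# your exact version).
```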