I normally link to the DataStax article to avoid having to actually write those words :)
http://www.datastax.com/docs/0.8/troubleshooting/index#view-of-ring-differs-between-some-nodes

A
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23/08/2011, at 7:45 PM, Jonathan Colby wrote:

> I ran into this. I also tried load_ring_state=false, which also did not help.
> The way I got through this was to stop the entire cluster and start the
> nodes one by one.
>
> I realize this is not a practical solution for everyone, but if you can
> afford to stop the cluster for a few minutes, it's worth a try.
>
>
> On Aug 23, 2011, at 9:26 AM, aaron morton wrote:
>
>> I'm running low on ideas for this one. Anyone else?
>>
>> If the phantom node is not listed in the ring, other nodes should not be
>> storing hints for it. You can see which nodes they are storing hints for
>> via JConsole.
>>
>> You can try a rolling restart passing the JVM opt
>> -Dcassandra.load_ring_state=false. However, if the phantom node is being
>> passed around in the gossip state, it will probably just come back again.
>>
>> Cheers
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote:
>>
>>> Could this ghost node be causing my hints column family to grow to this
>>> size? I also crash after about 24 hours due to commit log growth taking
>>> up all the drive space. A manual nodetool flush keeps it under control,
>>> though.
>>>
>>> Column Family: HintsColumnFamily
>>> SSTable count: 6
>>> Space used (live): 666480352
>>> Space used (total): 666480352
>>> Number of Keys (estimate): 768
>>> Memtable Columns Count: 1043
>>> Memtable Data Size: 461773
>>> Memtable Switch Count: 3
>>> Read Count: 38
>>> Read Latency: 131.289 ms.
>>> Write Count: 582108
>>> Write Latency: 0.019 ms.
>>> Pending Tasks: 0
>>> Key cache capacity: 7
>>> Key cache size: 6
>>> Key cache hit rate: 0.8333333333333334
>>> Row cache: disabled
>>> Compacted row minimum size: 2816160
>>> Compacted row maximum size: 386857368
>>> Compacted row mean size: 120432714
>>>
>>> Is there a way for me to manually remove this dead node?
>>>
>>> -----Original Message-----
>>> From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com]
>>> Sent: Sunday, August 21, 2011 9:09 PM
>>> To: user@cassandra.apache.org
>>> Subject: RE: Completely removing a node from the cluster
>>>
>>> It's been at least 4 days now.
>>>
>>> -----Original Message-----
>>> From: aaron morton [mailto:aa...@thelastpickle.com]
>>> Sent: Sunday, August 21, 2011 3:16 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Completely removing a node from the cluster
>>>
>>> I see the mistake I made about ring; it gets the endpoint list from the
>>> same place but uses the tokens to drive the whole process.
>>>
>>> I'm guessing here, I don't have time to check all the code. But there is
>>> a 3-day timeout in the gossip system. Not sure if it applies in this case.
>>>
>>> Anyone know?
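The rolling restart suggested earlier in the thread, passing `-Dcassandra.load_ring_state=false`, might be scripted roughly as below. This is a sketch, not a definitive procedure: the node addresses, SSH access, and the `cassandra` service name are all assumptions, so the operative commands are left commented out.

```shell
# Hedged sketch of a rolling restart that discards the saved ring state.
# Node list and service name are assumptions for illustration.
NODES="192.168.20.2 192.168.20.3"

for node in $NODES; do
    echo "Restarting $node with -Dcassandra.load_ring_state=false"
    # ssh "$node" 'sudo service cassandra stop'
    # Start with the flag (e.g. appended to JVM_OPTS in cassandra-env.sh)
    # so the node rebuilds its view of the ring from gossip instead of
    # its saved-to-disk state:
    # ssh "$node" 'sudo service cassandra start'
    # Wait for the node to come back before moving to the next one, e.g.:
    # until nodetool -h "$node" ring > /dev/null 2>&1; do sleep 5; done
done
```

Restarting one node at a time keeps the cluster serving requests, which is why it was preferred over the full stop/start described above.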
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote:
>>>
>>>> Both .2 and .3 list the same from the MBean: UnreachableNodes is an
>>>> empty collection, and LiveNodes still lists all 3 nodes:
>>>> 192.168.20.2
>>>> 192.168.20.3
>>>> 192.168.20.1
>>>>
>>>> The removetoken was done a few days ago, and I believe the remove was
>>>> done from .2.
>>>>
>>>> Here is what the ring output looks like; not sure why I get that token
>>>> on the empty first line either:
>>>> Address         DC          Rack   Status State   Load      Owns    Token
>>>>                                                                     85070591730234615865843651857942052864
>>>> 192.168.20.2    datacenter1 rack1  Up     Normal  79.53 GB  50.00%  0
>>>> 192.168.20.3    datacenter1 rack1  Up     Normal  42.63 GB  50.00%  85070591730234615865843651857942052864
>>>>
>>>> Yes, both nodes show the same thing when doing a describe cluster: that
>>>> .1 is unreachable.
>>>>
>>>> -----Original Message-----
>>>> From: aaron morton [mailto:aa...@thelastpickle.com]
>>>> Sent: Sunday, August 21, 2011 4:23 AM
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: Completely removing a node from the cluster
>>>>
>>>> Unreachable nodes either did not respond to the message or were known
>>>> to be down and were not sent a message.
>>>> The node lists for the ring command and describe cluster are obtained
>>>> the same way, so it's a bit odd.
>>>>
>>>> Can you connect to JMX and have a look at the o.a.c.db.StorageService
>>>> MBean? What do the LiveNodes and UnreachableNodes attributes say?
>>>>
>>>> Also, how long ago did you remove the token, and on which machine? Do
>>>> both 20.2 and 20.3 think 20.1 is still around?
>>>>
>>>> Cheers
>>>>
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote:
>>>>
>>>>> I'm on 0.8.4.
>>>>>
>>>>> I have removed a dead node from the cluster using the nodetool
>>>>> removetoken command, and moved one of the remaining nodes to rebalance
>>>>> the tokens. Everything looks fine when I run nodetool ring now, as it
>>>>> only lists the remaining 2 nodes and they both look fine, owning 50%
>>>>> of the tokens.
>>>>>
>>>>> However, I can still see it being considered part of the cluster from
>>>>> the Cassandra CLI (192.168.20.1 being the removed node), and I'm
>>>>> worried that the cluster is still queuing up hints for the node, or
>>>>> any other issues it may cause:
>>>>>
>>>>> Cluster Information:
>>>>>    Snitch: org.apache.cassandra.locator.SimpleSnitch
>>>>>    Partitioner: org.apache.cassandra.dht.RandomPartitioner
>>>>>    Schema versions:
>>>>>         dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
>>>>>         UNREACHABLE: [192.168.20.1]
>>>>>
>>>>> Do I need to do something else to completely remove this node?
>>>>>
>>>>> Thanks,
>>>>> Bryce
>>>>
>>>
>>
>
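The core symptom in this thread is that `nodetool ring` and the CLI's `describe cluster` disagree about which endpoints exist. One way to pin down a phantom endpoint is to diff the two views: any IP mentioned by `describe cluster` but absent from the ring is a candidate. A minimal sketch, using the output captured in this thread rather than live `nodetool`/CLI calls (in practice you would substitute the real command output, e.g. `ring_output=$(nodetool -h 192.168.20.2 ring)` with the header lines stripped):

```shell
# Hedged sketch: find endpoints "describe cluster" knows about but
# "nodetool ring" does not. Sample output is taken from this thread.
ring_output='192.168.20.2    datacenter1 rack1  Up     Normal  79.53 GB  50.00%  0
192.168.20.3    datacenter1 rack1  Up     Normal  42.63 GB  50.00%  85070591730234615865843651857942052864'

describe_output='dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
UNREACHABLE: [192.168.20.1]'

# Endpoints the ring knows about (first column of each node row):
ring_nodes=$(echo "$ring_output" | awk '{print $1}')

# Every IP mentioned by describe cluster (schema owners and UNREACHABLE):
cluster_nodes=$(echo "$describe_output" | grep -oE '([0-9]+\.){3}[0-9]+' | sort -u)

# Anything in the cluster view but absent from the ring is a phantom candidate:
phantom=""
for ip in $cluster_nodes; do
    echo "$ring_nodes" | grep -qxF "$ip" || phantom="$phantom $ip"
done
echo "Phantom candidates:$phantom"
```

The same comparison could be made against the LiveNodes and UnreachableNodes attributes of the StorageService MBean mentioned above, since per the thread both commands draw from the same endpoint list.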