I ran into this. I also tried -Dcassandra.load_ring_state=false, which did not 
help. The way I got through it was to stop the entire cluster and start the 
nodes one by one.

I realize this is not a practical solution for everyone, but if you can afford 
to stop the cluster for a few minutes, it's worth a try.


On Aug 23, 2011, at 9:26 AM, aaron morton wrote:

> I'm running low on ideas for this one. Anyone else? 
> 
> If the phantom node is not listed in the ring, other nodes should not be 
> storing hints for it. You can see what nodes they are storing hints for via 
> JConsole. 
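> 
> Something along these lines (only a sketch, using the standard JMX client 
> API and assuming the default JMX port 7199) will print whatever hint-related 
> MBeans a node has registered, which is the same information JConsole exposes: 
> 
>     import javax.management.MBeanServerConnection;
>     import javax.management.ObjectName;
>     import javax.management.remote.JMXConnector;
>     import javax.management.remote.JMXConnectorFactory;
>     import javax.management.remote.JMXServiceURL;
> 
>     public class ListHintMBeans {
>         public static void main(String[] args) throws Exception {
>             String host = args.length > 0 ? args[0] : "192.168.20.2"; // node to inspect
>             JMXServiceURL url = new JMXServiceURL(
>                     "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi");
>             JMXConnector connector = JMXConnectorFactory.connect(url);
>             try {
>                 MBeanServerConnection mbs = connector.getMBeanServerConnection();
>                 // Walk every registered MBean and print the Cassandra hint-related ones
>                 for (ObjectName name : mbs.queryNames(null, null)) {
>                     String n = name.getCanonicalName().toLowerCase();
>                     if (n.contains("cassandra") && n.contains("hint")) {
>                         System.out.println(name);
>                     }
>                 }
>             } finally {
>                 connector.close();
>             }
>         }
>     }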
> 
> You can try a rolling restart, passing the JVM opt 
> -Dcassandra.load_ring_state=false. However, if the phantom node is being 
> passed around in the gossip state it will probably just come back again. 
> 
> Cheers
> 
> 
> -----------------
> Aaron Morton
> Freelance Cassandra Developer
> @aaronmorton
> http://www.thelastpickle.com
> 
> On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote:
> 
>> Could this ghost node be causing my hints column family to grow to this 
>> size? I also crash after about 24 hours due to commit log growth taking up 
>> all the drive space. A manual nodetool flush keeps it under control, though.
>> 
>> 
>>               Column Family: HintsColumnFamily
>>               SSTable count: 6
>>               Space used (live): 666480352
>>               Space used (total): 666480352
>>               Number of Keys (estimate): 768
>>               Memtable Columns Count: 1043
>>               Memtable Data Size: 461773
>>               Memtable Switch Count: 3
>>               Read Count: 38
>>               Read Latency: 131.289 ms.
>>               Write Count: 582108
>>               Write Latency: 0.019 ms.
>>               Pending Tasks: 0
>>               Key cache capacity: 7
>>               Key cache size: 6
>>               Key cache hit rate: 0.8333333333333334
>>               Row cache: disabled
>>               Compacted row minimum size: 2816160
>>               Compacted row maximum size: 386857368
>>               Compacted row mean size: 120432714
>> 
>> Is there a way for me to manually remove this dead node?
>> 
>> -----Original Message-----
>> From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com] 
>> Sent: Sunday, August 21, 2011 9:09 PM
>> To: user@cassandra.apache.org
>> Subject: RE: Completely removing a node from the cluster
>> 
>> It's been at least 4 days now.
>> 
>> -----Original Message-----
>> From: aaron morton [mailto:aa...@thelastpickle.com] 
>> Sent: Sunday, August 21, 2011 3:16 PM
>> To: user@cassandra.apache.org
>> Subject: Re: Completely removing a node from the cluster
>> 
>> I see the mistake I made about ring: it gets the endpoint list from the 
>> same place but uses the tokens to drive the whole process. 
>> 
>> I'm guessing here, as I don't have time to check all the code, but there is 
>> a 3-day timeout in the gossip system. Not sure if it applies in this case. 
>> 
>> Anyone know?
>> 
>> Cheers
>> 
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>> 
>> On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote:
>> 
>>> Both .2 and .3 report the same thing from the MBean: UnreachableNodes is 
>>> an empty collection, and LiveNodes still lists all 3 nodes:
>>> 192.168.20.2
>>> 192.168.20.3
>>> 192.168.20.1
>>> 
>>> The removetoken was done a few days ago, and I believe the remove was done 
>>> from .2.
>>> 
>>> Here is what the ring output looks like; not sure why I get that token on 
>>> the empty first line either:
>>> Address         DC          Rack        Status State   Load            Owns    Token
>>>                                                                                85070591730234615865843651857942052864
>>> 192.168.20.2    datacenter1 rack1       Up     Normal  79.53 GB        50.00%  0
>>> 192.168.20.3    datacenter1 rack1       Up     Normal  42.63 GB        50.00%  85070591730234615865843651857942052864
>>> 
>>> Yes, both nodes show the same thing when doing a describe cluster: that .1 
>>> is unreachable.
>>> 
>>> 
>>> -----Original Message-----
>>> From: aaron morton [mailto:aa...@thelastpickle.com] 
>>> Sent: Sunday, August 21, 2011 4:23 AM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Completely removing a node from the cluster
>>> 
>>> Unreachable nodes either did not respond to the message or were known to 
>>> be down and were not sent a message. 
>>> The way the node lists are obtained for the ring command and describe 
>>> cluster is the same, so it's a bit odd. 
>>> 
>>> Can you connect to JMX and have a look at the o.a.c.db.StorageService 
>>> MBean? What do the LiveNodes and UnreachableNodes attributes say? 
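>>> 
>>> If JConsole is not convenient, roughly the following reads those two 
>>> attributes over JMX. It is only a sketch: it assumes the default JMX port 
>>> 7199 and that the MBean is registered as 
>>> org.apache.cassandra.db:type=StorageService with list-of-address 
>>> attributes named LiveNodes and UnreachableNodes: 
>>> 
>>>     import javax.management.MBeanServerConnection;
>>>     import javax.management.ObjectName;
>>>     import javax.management.remote.JMXConnector;
>>>     import javax.management.remote.JMXConnectorFactory;
>>>     import javax.management.remote.JMXServiceURL;
>>> 
>>>     public class ShowRingState {
>>>         public static void main(String[] args) throws Exception {
>>>             String host = args.length > 0 ? args[0] : "192.168.20.2"; // node to ask
>>>             JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(
>>>                     "service:jmx:rmi:///jndi/rmi://" + host + ":7199/jmxrmi"));
>>>             try {
>>>                 MBeanServerConnection mbs = connector.getMBeanServerConnection();
>>>                 ObjectName ss = new ObjectName("org.apache.cassandra.db:type=StorageService");
>>>                 // Both attributes are lists of node addresses as strings
>>>                 System.out.println("LiveNodes:        " + mbs.getAttribute(ss, "LiveNodes"));
>>>                 System.out.println("UnreachableNodes: " + mbs.getAttribute(ss, "UnreachableNodes"));
>>>             } finally {
>>>                 connector.close();
>>>             }
>>>         }
>>>     }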
>>> 
>>> Also, how long ago did you remove the token, and on which machine? Do both 
>>> 20.2 and 20.3 think 20.1 is still around? 
>>> 
>>> Cheers
>>> 
>>> 
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>> 
>>> On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote:
>>> 
>>>> I'm on 0.8.4
>>>> 
>>>> I have removed a dead node from the cluster using the nodetool removetoken 
>>>> command, and moved one of the remaining nodes to rebalance the tokens. 
>>>> Everything looks fine when I run nodetool ring now, as it only lists the 
>>>> remaining 2 nodes and they both look fine, each owning 50% of the tokens.
>>>> 
>>>> However, I can still see it being treated as part of the cluster from 
>>>> cassandra-cli (192.168.20.1 is the removed node), and I'm worried that the 
>>>> cluster is still queuing up hints for it, or that it may cause other 
>>>> issues:
>>>> 
>>>> Cluster Information:
>>>> Snitch: org.apache.cassandra.locator.SimpleSnitch
>>>> Partitioner: org.apache.cassandra.dht.RandomPartitioner
>>>> Schema versions:
>>>>     dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
>>>>     UNREACHABLE: [192.168.20.1]
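>>>> 
>>>> That UNREACHABLE entry appears to come from the Thrift 
>>>> describe_schema_versions call, which groups hosts by the schema version 
>>>> they report. A rough sketch of pulling the same map directly, assuming the 
>>>> 0.8-era Thrift API with framed transport on the default port 9160: 
>>>> 
>>>>     import java.util.List;
>>>>     import java.util.Map;
>>>>     import org.apache.cassandra.thrift.Cassandra;
>>>>     import org.apache.thrift.protocol.TBinaryProtocol;
>>>>     import org.apache.thrift.transport.TFramedTransport;
>>>>     import org.apache.thrift.transport.TSocket;
>>>> 
>>>>     public class SchemaAgreement {
>>>>         public static void main(String[] args) throws Exception {
>>>>             // Framed transport is the 0.8 default for the Thrift interface
>>>>             TFramedTransport transport = new TFramedTransport(new TSocket("192.168.20.2", 9160));
>>>>             transport.open();
>>>>             try {
>>>>                 Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
>>>>                 // Maps each schema version (or "UNREACHABLE") to the hosts reporting it
>>>>                 Map<String, List<String>> versions = client.describe_schema_versions();
>>>>                 for (Map.Entry<String, List<String>> entry : versions.entrySet()) {
>>>>                     System.out.println(entry.getKey() + ": " + entry.getValue());
>>>>                 }
>>>>             } finally {
>>>>                 transport.close();
>>>>             }
>>>>         }
>>>>     }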
>>>> 
>>>> 
>>>> Do I need to do something else to completely remove this node?
>>>> 
>>>> Thanks,
>>>> Bryce
>>> 
>> 
> 
