I normally link to the DataStax article to avoid having to actually write those words :)
http://www.datastax.com/docs/0.8/troubleshooting/index#view-of-ring-differs-between-some-nodes

A
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com

On 23/08/2011, at 7:45 PM, Jonathan Colby wrote:

> I ran into this. I also tried load_ring_state=false, which also did not help.
> The way I got through this was to stop the entire cluster and start the
> nodes one by one.
>
> I realize this is not a practical solution for everyone, but if you can
> afford to stop the cluster for a few minutes, it's worth a try.
>
>
> On Aug 23, 2011, at 9:26 AM, aaron morton wrote:
>
>> I'm running low on ideas for this one. Anyone else?
>>
>> If the phantom node is not listed in the ring, other nodes should not be
>> storing hints for it. You can see which nodes they are storing hints for
>> via JConsole.
>>
>> You can try a rolling restart passing the JVM opt
>> -Dcassandra.load_ring_state=false. However, if the phantom node is being
>> passed around in the gossip state, it will probably just come back again.
>>
>> Cheers
>>
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 23/08/2011, at 3:49 PM, Bryce Godfrey wrote:
>>
>>> Could this ghost node be causing my hints column family to grow to this
>>> size? I also crash after about 24 hours due to commit log growth taking
>>> up all the drive space. A manual nodetool flush keeps it under control,
>>> though.
>>>
>>> Column Family: HintsColumnFamily
>>> SSTable count: 6
>>> Space used (live): 666480352
>>> Space used (total): 666480352
>>> Number of Keys (estimate): 768
>>> Memtable Columns Count: 1043
>>> Memtable Data Size: 461773
>>> Memtable Switch Count: 3
>>> Read Count: 38
>>> Read Latency: 131.289 ms.
>>> Write Count: 582108
>>> Write Latency: 0.019 ms.
>>> Pending Tasks: 0
>>> Key cache capacity: 7
>>> Key cache size: 6
>>> Key cache hit rate: 0.8333333333333334
>>> Row cache: disabled
>>> Compacted row minimum size: 2816160
>>> Compacted row maximum size: 386857368
>>> Compacted row mean size: 120432714
>>>
>>> Is there a way for me to manually remove this dead node?
>>>
>>> -----Original Message-----
>>> From: Bryce Godfrey [mailto:bryce.godf...@azaleos.com]
>>> Sent: Sunday, August 21, 2011 9:09 PM
>>> To: user@cassandra.apache.org
>>> Subject: RE: Completely removing a node from the cluster
>>>
>>> It's been at least 4 days now.
>>>
>>> -----Original Message-----
>>> From: aaron morton [mailto:aa...@thelastpickle.com]
>>> Sent: Sunday, August 21, 2011 3:16 PM
>>> To: user@cassandra.apache.org
>>> Subject: Re: Completely removing a node from the cluster
>>>
>>> I see the mistake I made about ring; it gets the endpoint list from the
>>> same place but uses the tokens to drive the whole process.
>>>
>>> I'm guessing here, I don't have time to check all the code. But there is
>>> a 3-day timeout in the gossip system. Not sure if it applies in this case.
>>>
>>> Anyone know?
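The rolling restart suggested earlier in the thread, passing `-Dcassandra.load_ring_state=false`, might be scripted roughly as below. This is a sketch, not a definitive procedure: the node addresses, SSH access, and the `cassandra` service name are all assumptions, so the operative commands are left commented out.

```shell
# Hedged sketch of a rolling restart that discards the saved ring state.
# Node list and service name are assumptions for illustration.
NODES="192.168.20.2 192.168.20.3"

for node in $NODES; do
    echo "Restarting $node with -Dcassandra.load_ring_state=false"
    # ssh "$node" 'sudo service cassandra stop'
    # Start with the flag (e.g. appended to JVM_OPTS in cassandra-env.sh)
    # so the node rebuilds its view of the ring from gossip instead of
    # its saved-to-disk state:
    # ssh "$node" 'sudo service cassandra start'
    # Wait for the node to come back before moving to the next one, e.g.:
    # until nodetool -h "$node" ring > /dev/null 2>&1; do sleep 5; done
done
```

Restarting one node at a time keeps the cluster serving requests, which is why it was preferred over the full stop/start described above.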
>>>
>>> Cheers
>>>
>>> -----------------
>>> Aaron Morton
>>> Freelance Cassandra Developer
>>> @aaronmorton
>>> http://www.thelastpickle.com
>>>
>>> On 22/08/2011, at 6:23 AM, Bryce Godfrey wrote:
>>>
>>>> Both .2 and .3 list the same from the MBean: UnreachableNodes is an
>>>> empty collection, and LiveNodes still lists all 3 nodes:
>>>> 192.168.20.2
>>>> 192.168.20.3
>>>> 192.168.20.1
>>>>
>>>> The removetoken was done a few days ago, and I believe the remove was
>>>> done from .2.
>>>>
>>>> Here is what the ring output looks like; not sure why I get that token
>>>> on the empty first line either:
>>>> Address         DC          Rack   Status State   Load      Owns    Token
>>>>                                                                     85070591730234615865843651857942052864
>>>> 192.168.20.2    datacenter1 rack1  Up     Normal  79.53 GB  50.00%  0
>>>> 192.168.20.3    datacenter1 rack1  Up     Normal  42.63 GB  50.00%  85070591730234615865843651857942052864
>>>>
>>>> Yes, both nodes show the same thing when doing a describe cluster: that
>>>> .1 is unreachable.
>>>>
>>>> -----Original Message-----
>>>> From: aaron morton [mailto:aa...@thelastpickle.com]
>>>> Sent: Sunday, August 21, 2011 4:23 AM
>>>> To: user@cassandra.apache.org
>>>> Subject: Re: Completely removing a node from the cluster
>>>>
>>>> Unreachable nodes either did not respond to the message or were known
>>>> to be down and were not sent a message.
>>>> The node lists for the ring command and describe cluster are obtained
>>>> the same way, so it's a bit odd.
>>>>
>>>> Can you connect to JMX and have a look at the o.a.c.db.StorageService
>>>> MBean? What do the LiveNodes and UnreachableNodes attributes say?
>>>>
>>>> Also, how long ago did you remove the token, and on which machine? Do
>>>> both 20.2 and 20.3 think 20.1 is still around?
>>>>
>>>> Cheers
>>>>
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 20/08/2011, at 9:48 AM, Bryce Godfrey wrote:
>>>>
>>>>> I'm on 0.8.4.
>>>>>
>>>>> I have removed a dead node from the cluster using the nodetool
>>>>> removetoken command, and moved one of the remaining nodes to rebalance
>>>>> the tokens. Everything looks fine when I run nodetool ring now, as it
>>>>> only lists the remaining 2 nodes and they both look fine, owning 50%
>>>>> of the tokens.
>>>>>
>>>>> However, I can still see it being considered part of the cluster from
>>>>> the Cassandra CLI (192.168.20.1 being the removed node), and I'm
>>>>> worried that the cluster is still queuing up hints for the node, or
>>>>> any other issues it may cause:
>>>>>
>>>>> Cluster Information:
>>>>>    Snitch: org.apache.cassandra.locator.SimpleSnitch
>>>>>    Partitioner: org.apache.cassandra.dht.RandomPartitioner
>>>>>    Schema versions:
>>>>>         dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
>>>>>         UNREACHABLE: [192.168.20.1]
>>>>>
>>>>> Do I need to do something else to completely remove this node?
>>>>>
>>>>> Thanks,
>>>>> Bryce
>>>>
>>>
>>
>
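The core symptom in this thread is that `nodetool ring` and the CLI's `describe cluster` disagree about which endpoints exist. One way to pin down a phantom endpoint is to diff the two views: any IP mentioned by `describe cluster` but absent from the ring is a candidate. A minimal sketch, using the output captured in this thread rather than live `nodetool`/CLI calls (in practice you would substitute the real command output, e.g. `ring_output=$(nodetool -h 192.168.20.2 ring)` with the header lines stripped):

```shell
# Hedged sketch: find endpoints "describe cluster" knows about but
# "nodetool ring" does not. Sample output is taken from this thread.
ring_output='192.168.20.2    datacenter1 rack1  Up     Normal  79.53 GB  50.00%  0
192.168.20.3    datacenter1 rack1  Up     Normal  42.63 GB  50.00%  85070591730234615865843651857942052864'

describe_output='dcc8f680-caa4-11e0-0000-553d4dced3ff: [192.168.20.2, 192.168.20.3]
UNREACHABLE: [192.168.20.1]'

# Endpoints the ring knows about (first column of each node row):
ring_nodes=$(echo "$ring_output" | awk '{print $1}')

# Every IP mentioned by describe cluster (schema owners and UNREACHABLE):
cluster_nodes=$(echo "$describe_output" | grep -oE '([0-9]+\.){3}[0-9]+' | sort -u)

# Anything in the cluster view but absent from the ring is a phantom candidate:
phantom=""
for ip in $cluster_nodes; do
    echo "$ring_nodes" | grep -qxF "$ip" || phantom="$phantom $ip"
done
echo "Phantom candidates:$phantom"
```

The same comparison could be made against the LiveNodes and UnreachableNodes attributes of the StorageService MBean mentioned above, since per the thread both commands draw from the same endpoint list.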