Hi Aaron - Thanks a lot for the great feedback. I'll try your suggestion of
removing it as an endpoint with JMX.
On , aaron morton <aa...@thelastpickle.com> wrote:
Off the top of my head, the simple way to stop invalid endpoint state
being passed around is a full cluster stop. Obviously that's not an option.
The problem is that if one node has the IP, it will share it around with the
others.
Out of interest, take a look at the oacdb.FailureDetector MBean's
getAllEndpointStates() function. That returns the endpoint state held by
the Gossiper. I think you should see the phantom IP listed in there.
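Once that text has been captured (e.g. copied out of jconsole), a few lines of script can check whether a given node still knows about the phantom endpoint. The dump format below is a made-up illustration modelled on the Gossiper's per-endpoint text output, not an exact transcript; substitute the real string returned by the MBean:

```python
# Sketch: scan a captured getAllEndpointStates() dump for a phantom IP.
# SAMPLE_DUMP is a made-up illustration of the per-endpoint text format;
# replace it with the real string returned by the FailureDetector MBean.
SAMPLE_DUMP = """\
/10.46.108.100
  STATUS:NORMAL,0
  LOAD:7.7E10
/10.46.108.102
  STATUS:NORMAL,148873535527910577765226390751398592512
  LOAD:2.1E11
"""

def endpoints(dump):
    """Endpoint addresses are the lines that start with '/'."""
    return [line.strip().lstrip("/")
            for line in dump.splitlines()
            if line.strip().startswith("/")]

phantom = "10.46.108.102"
print(phantom in endpoints(SAMPLE_DUMP))  # True: this node still gossips the phantom IP
```

Running that against a dump from each node would show which nodes still hold the stale entry.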
If it's only on some nodes, *perhaps* restarting the node with the JVM
option -Dcassandra.load_ring_state=false *may* help. That will stop the
node from loading its saved ring state and force it to get the ring state
via gossip. Again, if there are other nodes with the phantom IP it may just
pick it up again.
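If you try that, in the stock 0.7 packaging the flag can be added to the JVM options in conf/cassandra-env.sh for a single restart (the file location is an assumption; adjust for your install), and removed again afterwards:

```shell
# conf/cassandra-env.sh -- add for one restart only, then remove:
JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
```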
I'll do some digging and try to get back to you. This pops up from time
to time, and thinking out loud, I wonder if it would be possible to add a
new application state that purges an IP from the ring, e.g.
VersionedValue.STATUS_PURGED, that works with a TTL so it goes through X
number of gossip rounds and then disappears.
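As a toy illustration of that idea, here is a sketch of a purge marker that is carried for a fixed number of gossip rounds and then dropped. The names (STATUS_PURGED, the TTL constant) are hypothetical; no such state exists in Cassandra today:

```python
# Toy model of the proposed purge-with-TTL: a STATUS_PURGED marker
# (hypothetical name) survives a fixed number of gossip rounds so every
# node can see it, then the endpoint vanishes from the ring state.
PURGE_TTL_ROUNDS = 3

def gossip_round(ring):
    """One gossip round: decrement purge TTLs and drop expired entries."""
    nxt = {}
    for ip, (status, ttl) in ring.items():
        if status == "STATUS_PURGED":
            if ttl > 1:
                nxt[ip] = (status, ttl - 1)  # keep propagating the purge
            # else: TTL exhausted, the endpoint disappears
        else:
            nxt[ip] = (status, ttl)
    return nxt

ring = {
    "10.46.108.100": ("NORMAL", None),
    "10.46.108.102": ("STATUS_PURGED", PURGE_TTL_ROUNDS),  # the phantom node
}
for _ in range(PURGE_TTL_ROUNDS):
    ring = gossip_round(ring)
print(sorted(ring))  # -> ['10.46.108.100']
```

The point of the TTL is to give the purge enough rounds to reach every node before the entry stops being gossiped at all.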
Hope that helps.
-----------------
Aaron Morton
Freelance Cassandra Developer
@aaronmorton
http://www.thelastpickle.com
On 26 May 2011, at 19:58, Jonathan Colby wrote:
> @Aaron -
>
> Unfortunately I'm still seeing messages like: " is down", removing from
gossip, although not with the same frequency.
>
> And repair/move jobs don't seem to try to stream data to the removed
node anymore.
>
> Does anyone know how to totally purge any stored gossip/endpoint data on
nodes that were removed from the cluster? Or what might be happening here
otherwise?
>
>
> On May 26, 2011, at 9:10 AM, aaron morton wrote:
>
>> Cool. I was going to suggest that, but as you already had the move
running I thought it might be a little drastic.
>>
>> Did it show any progress? If the IP address is not responding there
should have been some sort of error.
>>
>> Cheers
>>
>> -----------------
>> Aaron Morton
>> Freelance Cassandra Developer
>> @aaronmorton
>> http://www.thelastpickle.com
>>
>> On 26 May 2011, at 15:28, jonathan.co...@gmail.com wrote:
>>
>>> Seems like it had something to do with stale endpoint information. I
did a rolling restart of the whole cluster and that seemed to trigger the
nodes to remove the node that was decommissioned.
>>>
>>> On , aaron morton <aa...@thelastpickle.com> wrote:
>>>> Is it showing progress? It may just be a problem with the
information printed out.
>>>>
>>>> Can you check from the other nodes in the cluster to see if they are
receiving the stream?
>>>>
>>>> Cheers
>>>>
>>>> -----------------
>>>> Aaron Morton
>>>> Freelance Cassandra Developer
>>>> @aaronmorton
>>>> http://www.thelastpickle.com
>>>>
>>>> On 26 May 2011, at 00:42, Jonathan Colby wrote:
>>>>
>>>>> I recently removed a node (with decommission) from our cluster.
>>>>>
>>>>> I added a couple of new nodes and am now trying to rebalance the
cluster using nodetool move.
>>>>>
>>>>> However, netstats shows that the node being "moved" is trying to
stream data to the node that I already decommissioned yesterday.
>>>>>
>>>>> The removed node was powered off, taken out of DNS, and its IP is not
even pingable. It was never a seed either.
>>>>>
>>>>> This is Cassandra 0.7.5 on 64-bit Linux. How do I tell the cluster
that this node is gone? Gossip should have detected this. The ring
command shows the correct cluster IPs.
>>>>>
>>>>> Here is a portion of netstats. 10.46.108.102 is the node which was
removed.
>>>>>
>>>>> Mode: Leaving: streaming data to other nodes
>>>>> Streaming to: /10.46.108.102
>>>>> /var/lib/cassandra/data/DFS/main-f-1064-Data.db/(4681027,5195491),(5195491,15308570),(15308570,15891710),(16336750,20558705),(20558705,29112203),(29112203,36279329),(36465942,36623223),(36740457,37227058),(37227058,42206994),(42206994,47380294),(47635053,47709813),(47709813,48353944),(48621287,49406499),(53330048,53571312),(53571312,54153922),(54153922,59857615),(59857615,61029910),(61029910,61871509),(62190800,62498605),(62824281,62964830),(63511604,64353114),(64353114,64760400),(65174702,65919771),(65919771,66435630),(81440029,81725949),(81725949,83313847),(83313847,83908709),(88983863,89237303),(89237303,89934199),(89934199,97
>>>>> ...................
>>>>> 5693491,14795861666),(14795861666,14796105318),(14796105318,14796366886),(14796699825,14803874941),(14803874941,14808898331),(14808898331,14811670699),(14811670699,14815125177),(14815125177,14819765003),(14820229433,14820858266)
>>>>> progress=280574376402/12434049900 - 2256%
>>>>> .....
>>>>>
>>>>> Note 10.46.108.102 is NOT part of the ring.
>>>>>
>>>>> Address        Status State   Load       Owns    Token
>>>>>                                                  148873535527910577765226390751398592512
>>>>> 10.46.108.100  Up     Normal  71.73 GB   12.50%  0
>>>>> 10.46.108.101  Up     Normal  109.69 GB  12.50%  21267647932558653966460912964485513216
>>>>> 10.47.108.100  Up     Leaving 281.95 GB  37.50%  85070591730234615865843651857942052863
>>>>> 10.47.108.102  Up     Normal  210.77 GB  0.00%   85070591730234615865843651857942052864
>>>>> 10.47.108.101  Up     Normal  289.59 GB  16.67%  113427455640312821154458202477256070484
>>>>> 10.46.108.103  Up     Normal  299.87 GB  8.33%   127605887595351923798765477786913079296
>>>>> 10.47.108.103  Up     Normal  94.99 GB   12.50%  148873535527910577765226390751398592511
>>>>> 10.46.108.104  Up     Normal  103.01 GB  0.00%   148873535527910577765226390751398592512