On Nov 10, 2013, at 8:03 PM, Sean Lutner <s...@rentul.net> wrote:

> On Nov 10, 2013, at 7:54 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> 
>> On 11 Nov 2013, at 11:44 am, Sean Lutner <s...@rentul.net> wrote:
>> 
>>> On Nov 10, 2013, at 6:27 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>> 
>>>> On 8 Nov 2013, at 12:59 pm, Sean Lutner <s...@rentul.net> wrote:
>>>> 
>>>>> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>> 
>>>>>> On 8 Nov 2013, at 4:45 am, Sean Lutner <s...@rentul.net> wrote:
>>>>>> 
>>>>>>> I have a confusing situation that I'm hoping to get help with. Last night, after configuring STONITH on my two-node cluster, I suddenly have a "ghost" node in my cluster. I'm looking to understand the best way to remove this node from the config.
>>>>>>> 
>>>>>>> I'm using the fence_ec2 device for STONITH. I dropped the script on each node, registered the device with stonith_admin -R -a fence_ec2 and confirmed the registration with both:
>>>>>>> 
>>>>>>> # stonith_admin -I
>>>>>>> # pcs stonith list
>>>>>>> 
>>>>>>> I then configured STONITH per the Clusters from Scratch doc:
>>>>>>> 
>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>>>>>> 
>>>>>>> Here are my commands:
>>>>>>> 
>>>>>>> # pcs cluster cib stonith_cfg
>>>>>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" timeout="150s" op start start-delay="30s" interval="0"
>>>>>>> # pcs -f stonith_cfg stonith
>>>>>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>>>>>> # pcs -f stonith_cfg property
>>>>>>> # pcs cluster push cib stonith_cfg
>>>>>>> 
>>>>>>> After that I saw that STONITH appears to be functioning, but a new node is listed in the pcs status output:
>>>>>> 
>>>>>> Do the EC2 instances have fixed IPs?
>>>>>> I didn't have much luck with EC2, because every time the instances came back up it was with a new name/address, which confused corosync and created situations like this.
>>>>> 
>>>>> The IPs persist across reboots as far as I can tell. I thought the problem was due to STONITH being enabled but not working, so I removed the stonith_id and disabled STONITH. After that I restarted pacemaker and cman on both nodes and things started as expected, but the ghost node is still there.
>>>>> 
>>>>> Someone else working on the cluster exported the CIB, removed the node and then imported the CIB. They used this process:
>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
>>>>> 
>>>>> Even after that, the ghost node is still there. Would pcs cluster cib > /tmp/cib-temp.xml, followed by pcs cluster push cib /tmp/cib-temp.xml after editing the node out of the config, work?
>>>> 
>>>> No. If it's coming back then pacemaker is holding it in one of its internal caches.
>>>> The only way to clear it out in your version is to restart pacemaker on the DC.
>>>> 
>>>> Actually... are you sure someone didn't just slip while editing cluster.conf? [...].1251 does not look like a valid IP :)
>>> 
>>> In the end this fixed it:
>>> 
>>> # pcs cluster cib > /tmp/cib-tmp.xml
>>> # vi /tmp/cib-tmp.xml   # remove bad node
>>> # pcs cluster push cib /tmp/cib-tmp.xml
>>> 
>>> Followed by restarting pacemaker and cman on both nodes. The ghost node disappeared, so it was cached as you mentioned.
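
For reference, a minimal sketch of how a bogus node entry like this can be tracked down before editing it out. The grep target is the mistyped name from this thread, and the commented-out crm_node call is an assumption that applies to newer Pacemaker releases, so check it against the installed version's man page:

    # Is the mistyped name in the cman membership config?
    grep -n 'ip-10-50-3-1251' /etc/cluster/cluster.conf

    # Or does it only exist in Pacemaker's view of the cluster?
    cibadmin --query --scope nodes
    cibadmin --query --scope status | grep 'ip-10-50-3-1251'

    # Newer Pacemaker releases can usually purge a stale peer from the
    # internal caches directly (verify the options on your version first):
    # crm_node --force --remove ip-10-50-3-1251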
>>> 
>>> I also tracked the bad IP down to non-printing characters in the initial command line while configuring the fence_ec2 stonith device. I'd put the command together from the GitHub README and some mailing list posts and laid it out in an external editor. Go me. :)
>>> 
>>>>>>> Version: 1.1.8-7.el6-394e906
>>>> 
>>>> There is now an update to 1.1.10 available for 6.4, which _may_ help in the future.
>>> 
>>> That's my next task. I believe I'm hitting the failure-timeout-not-clearing-failcount bug and want to upgrade to 1.1.10. Is it safe to yum update pacemaker after stopping the cluster? I see there is also an updated pcs in CentOS 6.4; should I update that as well?
>> 
>> Yes and yes.
>> 
>> You might want to check whether you're using any OCF resource agents that didn't make it into the first supported release, though.
>> 
>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
> 
> Thanks, I'll give that a read. All the resource agents are custom, so I'm thinking I'm okay (I'll back them up before upgrading).
> 
> One last question related to the fence_ec2 script: should crm_mon -VW show it running on both nodes or just one?
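
For reference, a minimal sketch of the node-by-node update sequence discussed above, assuming the cman and pacemaker init scripts used on CentOS 6.4; run it on one node at a time and let the cluster settle before moving on:

    # Stop the cluster stack on this node (Pacemaker first, then cman)
    service pacemaker stop
    service cman stop

    # Pull Pacemaker 1.1.10 and the matching pcs from the 6.4 update repos
    yum update pacemaker pcs

    # Bring the stack back up and confirm the node rejoins
    service cman start
    service pacemaker start
    pcs status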
I just went through the upgrade to Pacemaker 1.1.10 and pcs. After running the yum update for those I ran crm_verify, and I'm seeing errors related to my order and colocation constraints. Did the behavior of these change from 1.1.8 to 1.1.10?

# crm_verify -L -V
   error: unpack_order_template:      Invalid constraint 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or template named 'Varnish'
   error: unpack_order_template:      Invalid constraint 'order-Varnish-Varnishlog-mandatory': No resource or template named 'Varnish'
   error: unpack_order_template:      Invalid constraint 'order-Varnishlog-Varnishncsa-mandatory': No resource or template named 'Varnishlog'
   error: unpack_colocation_template: Invalid constraint 'colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY': No resource or template named 'Varnish'
   error: unpack_colocation_template: Invalid constraint 'colocation-Varnishlog-Varnish-INFINITY': No resource or template named 'Varnishlog'
   error: unpack_colocation_template: Invalid constraint 'colocation-Varnishncsa-Varnishlog-INFINITY': No resource or template named 'Varnishncsa'
Errors found during check: config not valid

The cluster doesn't start. I'd prefer to figure out how to fix this rather than roll back to 1.1.8. Any help is appreciated (a short diagnostic sketch follows after the quoted thread below). Thanks.

>>>>> I may have to go back to the drawing board on a fencing device for the nodes. Are there any other recommendations for a cluster on EC2 nodes?
>>>>> 
>>>>> Thanks very much
>>>>> 
>>>>>>> # pcs status
>>>>>>> Last updated: Thu Nov  7 17:41:21 2013
>>>>>>> Last change: Thu Nov  7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>>>>>> Stack: cman
>>>>>>> Current DC: ip-10-50-3-122 - partition with quorum
>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>> 3 Nodes configured, unknown expected votes
>>>>>>> 11 Resources configured.
>>>>>>> 
>>>>>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>>>>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>> 
>>>>>>> Full list of resources:
>>>>>>> 
>>>>>>>  ClusterEIP_54.215.143.166   (ocf::pacemaker:EIP):   Started ip-10-50-3-122
>>>>>>>  Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>>>>>      Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>      Stopped: [ EIP-AND-VARNISH:2 ]
>>>>>>>  ec2-fencing (stonith:fence_ec2):    Stopped
>>>>>>> 
>>>>>>> I have no idea where the node that is marked UNCLEAN came from, though it's clearly a typo of a proper cluster node name.
>>>>>>> 
>>>>>>> The only command I ran with the bad node ID was:
>>>>>>> 
>>>>>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node ip-10-50-3-1251
>>>>>>> 
>>>>>>> Is there any possible way that could have caused the node to be added?
>>>>>>> 
>>>>>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there is no such node (and thus no pcsd) that failed. Is there a way I can safely remove this ghost node from the cluster? I can provide logs from pacemaker or corosync as needed.
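
For reference, a minimal diagnostic sketch for the crm_verify errors quoted near the top of this message. The idea is to compare the IDs the order and colocation constraints reference (Varnish, Varnishlog, Varnishncsa) with the resource IDs actually present in the upgraded CIB; pcs output formats differ between versions, so treat the exact invocations as assumptions:

    # Resource IDs the cluster knows about after the upgrade
    pcs resource show
    cibadmin --query --scope resources | grep -i 'varnish'

    # Constraints and the IDs they reference
    pcs constraint
    cibadmin --query --scope constraints | grep -i 'varnish'

    # crm_verify can also be pointed at a saved copy of the CIB, which makes
    # it possible to test candidate fixes offline before pushing them
    pcs cluster cib > /tmp/cib-check.xml
    crm_verify --xml-file /tmp/cib-check.xml -V

If the missing names turn out to be members of a group or clone rather than top-level primitives, re-pointing the constraints at the containing group or clone ID is one avenue worth checking; that is a guess based on the resource layout shown earlier in the thread, not a confirmed diagnosis.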
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org