On Nov 10, 2013, at 8:03 PM, Sean Lutner <s...@rentul.net> wrote:

> On Nov 10, 2013, at 7:54 PM, Andrew Beekhof <and...@beekhof.net> wrote:
> 
>> On 11 Nov 2013, at 11:44 am, Sean Lutner <s...@rentul.net> wrote:
>> 
>>> On Nov 10, 2013, at 6:27 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>> 
>>>> On 8 Nov 2013, at 12:59 pm, Sean Lutner <s...@rentul.net> wrote:
>>>> 
>>>>> On Nov 7, 2013, at 8:34 PM, Andrew Beekhof <and...@beekhof.net> wrote:
>>>>> 
>>>>>> On 8 Nov 2013, at 4:45 am, Sean Lutner <s...@rentul.net> wrote:
>>>>>> 
>>>>>>> I have a confusing situation that I'm hoping to get help with. Last night, after configuring STONITH on my two-node cluster, I suddenly have a "ghost" node in my cluster. I'm looking to understand the best way to remove this node from the config.
>>>>>>> 
>>>>>>> I'm using the fence_ec2 device for STONITH. I dropped the script on each node, registered the device with stonith_admin -R -a fence_ec2 and confirmed the registration with both:
>>>>>>> 
>>>>>>> # stonith_admin -I
>>>>>>> # pcs stonith list
>>>>>>> 
>>>>>>> I then configured STONITH per the Clusters from Scratch doc:
>>>>>>> 
>>>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Clusters_from_Scratch/_example.html
>>>>>>> 
>>>>>>> Here are my commands:
>>>>>>> 
>>>>>>> # pcs cluster cib stonith_cfg
>>>>>>> # pcs -f stonith_cfg stonith create ec2-fencing fence_ec2 ec2-home="/opt/ec2-api-tools" pcmk_host_check="static-list" pcmk_host_list="ip-10-50-3-122 ip-10-50-3-251" op monitor interval="300s" timeout="150s" op start start-delay="30s" interval="0"
>>>>>>> # pcs -f stonith_cfg stonith
>>>>>>> # pcs -f stonith_cfg property set stonith-enabled=true
>>>>>>> # pcs -f stonith_cfg property
>>>>>>> # pcs cluster push cib stonith_cfg
>>>>>>> 
>>>>>>> After that I saw that STONITH appears to be functioning, but a new node is listed in the pcs status output:
>>>>>> 
>>>>>> Do the EC2 instances have fixed IPs?
>>>>>> I didn't have much luck with EC2, because every time the instances came back up it was with a new name/address, which confused corosync and created situations like this.
>>>>> 
>>>>> The IPs persist across reboots as far as I can tell. I thought the problem was due to STONITH being enabled but not working, so I removed the stonith_id and disabled STONITH. After that I restarted pacemaker and cman on both nodes and things started as expected, but the ghost node is still there.
>>>>> 
>>>>> Someone else working on the cluster exported the CIB, removed the node and then imported the CIB. They used this process:
>>>>> http://clusterlabs.org/doc/en-US/Pacemaker/1.0/html/Pacemaker_Explained/s-config-updates.html
>>>>> 
>>>>> Even after that, the ghost node is still there. Would pcs cluster cib > /tmp/cib-temp.xml, followed by pcs cluster push cib /tmp/cib-temp.xml after editing the node out of the config, work?
>>>> 
>>>> No. If it's coming back then pacemaker is holding it in one of its internal caches.
>>>> The only way to clear it out in your version is to restart pacemaker on the DC.
>>>> 
>>>> Actually... are you sure someone didn't just slip while editing cluster.conf? [...].1251 does not look like a valid IP :)
>>> 
>>> In the end this fixed it:
>>> 
>>> # pcs cluster cib > /tmp/cib-tmp.xml
>>> # vi /tmp/cib-tmp.xml   # remove bad node
>>> # pcs cluster push cib /tmp/cib-tmp.xml
>>> 
>>> Followed by restarting pacemaker and cman on both nodes. The ghost node disappeared, so it was cached as you mentioned.
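
For reference, a minimal sketch of how a bogus node entry like this can be tracked down before editing it out. The grep target is the mistyped name from this thread, and the commented-out crm_node call is an assumption that applies to newer Pacemaker releases, so check it against the installed version's man page:

    # Is the mistyped name in the cman membership config?
    grep -n 'ip-10-50-3-1251' /etc/cluster/cluster.conf

    # Or does it only exist in Pacemaker's view of the cluster?
    cibadmin --query --scope nodes
    cibadmin --query --scope status | grep 'ip-10-50-3-1251'

    # Newer Pacemaker releases can usually purge a stale peer from the
    # internal caches directly (verify the options on your version first):
    # crm_node --force --remove ip-10-50-3-1251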
>>> 
>>> I also tracked the bad IP down to non-printing characters in the initial command line while configuring the fence_ec2 stonith device. I'd put the command together from the GitHub README and some mailing list posts and laid it out in an external editor. Go me. :)
>>> 
>>>>>>> Version: 1.1.8-7.el6-394e906
>>>> 
>>>> There is now an update to 1.1.10 available for 6.4, which _may_ help in the future.
>>> 
>>> That's my next task. I believe I'm hitting the failure-timeout-not-clearing-failcount bug and want to upgrade to 1.1.10. Is it safe to yum update pacemaker after stopping the cluster? I see there is also an updated pcs in CentOS 6.4; should I update that as well?
>> 
>> Yes and yes.
>> 
>> You might want to check whether you're using any OCF resource agents that didn't make it into the first supported release, though.
>> 
>> http://blog.clusterlabs.org/blog/2013/pacemaker-and-rhel-6-dot-4/
> 
> Thanks, I'll give that a read. All the resource agents are custom, so I'm thinking I'm okay (I'll back them up before upgrading).
> 
> One last question related to the fence_ec2 script: should crm_mon -VW show it running on both nodes or just one?
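
For reference, a minimal sketch of the node-by-node update sequence discussed above, assuming the cman and pacemaker init scripts used on CentOS 6.4; run it on one node at a time and let the cluster settle before moving on:

    # Stop the cluster stack on this node (Pacemaker first, then cman)
    service pacemaker stop
    service cman stop

    # Pull Pacemaker 1.1.10 and the matching pcs from the 6.4 update repos
    yum update pacemaker pcs

    # Bring the stack back up and confirm the node rejoins
    service cman start
    service pacemaker start
    pcs status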
I just went through the upgrade to Pacemaker 1.1.10 and pcs. After running the yum update for those I ran crm_verify, and I'm seeing errors related to my order and colocation constraints. Did the behavior of these change from 1.1.8 to 1.1.10?

# crm_verify -L -V
   error: unpack_order_template:      Invalid constraint 'order-ClusterEIP_54.215.143.166-Varnish-mandatory': No resource or template named 'Varnish'
   error: unpack_order_template:      Invalid constraint 'order-Varnish-Varnishlog-mandatory': No resource or template named 'Varnish'
   error: unpack_order_template:      Invalid constraint 'order-Varnishlog-Varnishncsa-mandatory': No resource or template named 'Varnishlog'
   error: unpack_colocation_template: Invalid constraint 'colocation-Varnish-ClusterEIP_54.215.143.166-INFINITY': No resource or template named 'Varnish'
   error: unpack_colocation_template: Invalid constraint 'colocation-Varnishlog-Varnish-INFINITY': No resource or template named 'Varnishlog'
   error: unpack_colocation_template: Invalid constraint 'colocation-Varnishncsa-Varnishlog-INFINITY': No resource or template named 'Varnishncsa'
Errors found during check: config not valid

The cluster doesn't start. I'd prefer to figure out how to fix this rather than roll back to 1.1.8. Any help is appreciated (a short diagnostic sketch follows after the quoted thread below). Thanks.

>>>>> I may have to go back to the drawing board on a fencing device for the nodes. Are there any other recommendations for a cluster on EC2 nodes?
>>>>> 
>>>>> Thanks very much
>>>>> 
>>>>>>> # pcs status
>>>>>>> Last updated: Thu Nov  7 17:41:21 2013
>>>>>>> Last change: Thu Nov  7 04:29:06 2013 via cibadmin on ip-10-50-3-122
>>>>>>> Stack: cman
>>>>>>> Current DC: ip-10-50-3-122 - partition with quorum
>>>>>>> Version: 1.1.8-7.el6-394e906
>>>>>>> 3 Nodes configured, unknown expected votes
>>>>>>> 11 Resources configured.
>>>>>>> 
>>>>>>> Node ip-10-50-3-1251: UNCLEAN (offline)
>>>>>>> Online: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>> 
>>>>>>> Full list of resources:
>>>>>>> 
>>>>>>>  ClusterEIP_54.215.143.166   (ocf::pacemaker:EIP):   Started ip-10-50-3-122
>>>>>>>  Clone Set: EIP-AND-VARNISH-clone [EIP-AND-VARNISH]
>>>>>>>      Started: [ ip-10-50-3-122 ip-10-50-3-251 ]
>>>>>>>      Stopped: [ EIP-AND-VARNISH:2 ]
>>>>>>>  ec2-fencing (stonith:fence_ec2):    Stopped
>>>>>>> 
>>>>>>> I have no idea where the node that is marked UNCLEAN came from, though it's clearly a typo of a proper cluster node name.
>>>>>>> 
>>>>>>> The only command I ran with the bad node ID was:
>>>>>>> 
>>>>>>> # crm_resource --resource ClusterEIP_54.215.143.166 --cleanup --node ip-10-50-3-1251
>>>>>>> 
>>>>>>> Is there any possible way that could have caused the node to be added?
>>>>>>> 
>>>>>>> I tried running pcs cluster node remove ip-10-50-3-1251, but since there is no such node (and thus no pcsd) that failed. Is there a way I can safely remove this ghost node from the cluster? I can provide logs from pacemaker or corosync as needed.
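
For reference, a minimal diagnostic sketch for the crm_verify errors quoted near the top of this message. The idea is to compare the IDs the order and colocation constraints reference (Varnish, Varnishlog, Varnishncsa) with the resource IDs actually present in the upgraded CIB; pcs output formats differ between versions, so treat the exact invocations as assumptions:

    # Resource IDs the cluster knows about after the upgrade
    pcs resource show
    cibadmin --query --scope resources | grep -i 'varnish'

    # Constraints and the IDs they reference
    pcs constraint
    cibadmin --query --scope constraints | grep -i 'varnish'

    # crm_verify can also be pointed at a saved copy of the CIB, which makes
    # it possible to test candidate fixes offline before pushing them
    pcs cluster cib > /tmp/cib-check.xml
    crm_verify --xml-file /tmp/cib-check.xml -V

If the missing names turn out to be members of a group or clone rather than top-level primitives, re-pointing the constraints at the containing group or clone ID is one avenue worth checking; that is a guess based on the resource layout shown earlier in the thread, not a confirmed diagnosis.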
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org