Andrew Beekhof wrote:
>
> On 13 Nov 2013, at 11:49 am, Sean Lutner <s...@rentul.net> wrote:
>
>>
>>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof
>>> <and...@beekhof.net> wrote:
>>>
>>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <s...@rentul.net>
>>>> wrote:
>>>>
>>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof
>>>>> <and...@beekhof.net> wrote:
>>>>>
>>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <s...@rentul.net>
>>>>>> wrote:
>>>>>>
>>>>>> The folks testing the cluster I've been building have run
>>>>>> a script which blocks all traffic except SSH on one node
>>>>>> of the cluster for 15 seconds to mimic a network failure.
>>>>>> During this time, the network being "down" seems to cause
>>>>>> some odd behavior from pacemaker resulting in it dying.
>>>>>>
>>>>>> The cluster is two nodes and running four custom
>>>>>> resources on EC2 instances. The OS is CentOS 6.4 with the
>>>>>> config below:
>>>>>>
>>>>>> I've attached the /var/log/messages and
>>>>>> /var/log/cluster/corosync.log from the time period during
>>>>>> the test. I'm having some difficulty in piecing together
>>>>>> what happened and am hoping someone can shed some light
>>>>>> on the problem. Any indications why pacemaker is dying on
>>>>>> that node?
>>>>>
>>>>> Because corosync is dying underneath it:
>>>>>
>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: Corosync connection lost! Exiting.
>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: cib_ais_destroy: Exiting fast...
>>>>
>>>> Is that the expected behavior?
>>>
>>> It is expected behaviour when corosync dies. Ideally corosync
>>> wouldn't die though.
>>
>> What other debugging can I do to try to find out why corosync
>> died?
>
> There are various logging settings that may help. CC'ing Jan to see
> if he has any suggestions.
>
If corosync really died, the corosync-fplay output (taken right after
corosync's death) and a coredump are most useful.

Regards,
  Honza

>>
>> Thanks
>>
>>>
>>>> Is it because the DC was the other node?
>>>
>>> No.
>>>
>>>>
>>>> I did notice that there was an attempted fence operation but
>>>> it didn't look successful.
>>>>
>>>>>>
>>>>>> [root@ip-10-50-3-122 ~]# pcs config
>>>>>> Corosync Nodes:
>>>>>>
>>>>>> Pacemaker Nodes:
>>>>>> ip-10-50-3-122 ip-10-50-3-251
>>>>>>
>>>>>> Resources:
>>>>>> Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>>> Attributes: first_network_interface_id=eni-e4e0b68c
>>>>>> second_network_interface_id=eni-35f9af5d
>>>>>> first_private_ip=10.50.3.191 second_private_ip=10.50.3.91
>>>>>> eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
>>>>>> Operations: monitor interval=5s
>>>>>> Clone: EIP-AND-VARNISH-clone
>>>>>> Group: EIP-AND-VARNISH
>>>>>> Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>>> Operations: monitor interval=5s
>>>>>> Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>>> Operations: monitor interval=5s
>>>>>> Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>>> Operations: monitor interval=5s
>>>>>> Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>>> Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list
>>>>>> pcmk_host_list=HA01 HA02
>>>>>> Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>>>
>>>>>> Location Constraints:
>>>>>> Ordering Constraints:
>>>>>> ClusterEIP_54.215.143.166 then Varnish
>>>>>> Varnish then Varnishlog
>>>>>> Varnishlog then Varnishncsa
>>>>>> Colocation Constraints:
>>>>>> Varnish with ClusterEIP_54.215.143.166
>>>>>> Varnishlog with Varnish
>>>>>> Varnishncsa with Varnishlog
>>>>>>
>>>>>> Cluster Properties:
>>>>>> dc-version: 1.1.8-7.el6-394e906
>>>>>> cluster-infrastructure: cman
>>>>>> last-lrm-refresh: 1384196963
>>>>>> no-quorum-policy: ignore
>>>>>> stonith-enabled: true
>>>>>>
>>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>
>>>>>> Project Home: http://www.clusterlabs.org
>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>> Bugs: http://bugs.clusterlabs.org
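
For piecing together the timeline from corosync.log, as asked earlier in the thread, something like the following may help. It is only a minimal Python sketch: the line layout is inferred from the four log lines Andrew quoted above, and the names `LINE_RE`, `Entry`, `parse` and `first_error` are made up for illustration, not part of pacemaker or corosync. A real log may contain lines in other layouts, which this sketch simply skips.

```python
import re
from typing import Iterable, Iterator, NamedTuple, Optional

# Layout assumed from the excerpt quoted in this thread, e.g.
#   Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: <message>
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3} \d{2} \d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<pid>\d+)\]\s+(?P<host>\S+)\s+"
    r"(?P<daemon>[\w-]+):\s+(?P<level>\w+):\s+"
    r"(?P<func>\w+):\s+(?P<msg>.*)$"
)


class Entry(NamedTuple):
    stamp: str
    pid: int
    host: str
    daemon: str
    level: str
    func: str
    msg: str


def parse(lines: Iterable[str]) -> Iterator[Entry]:
    """Yield one Entry per line that matches the assumed layout."""
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # not in the assumed layout; skip it
        d = m.groupdict()
        yield Entry(d["stamp"], int(d["pid"]), d["host"], d["daemon"],
                    d["level"], d["func"], d["msg"])


def first_error(lines: Iterable[str]) -> Optional[Entry]:
    """Return the earliest error-level entry, or None if there is none."""
    for entry in parse(lines):
        if entry.level == "error":
            return entry
    return None


# The four lines quoted in the thread, used here as sample input.
SAMPLE = [
    "Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: "
    "Sending message 28 via cpg: FAILED (rc=2): Library error: "
    "Connection timed out (110)",
    "Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: "
    "Connection to the CPG API failed: 2",
    "Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: "
    "Corosync connection lost! Exiting.",
    "Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: "
    "cib_ais_destroy: Exiting fast...",
]

if __name__ == "__main__":
    hit = first_error(SAMPLE)
    if hit is not None:
        print(hit.stamp, hit.daemon, hit.func, hit.msg)
```

To run it against the real file, pass an open file handle instead of `SAMPLE`, e.g. `first_error(open("/var/log/cluster/corosync.log"))`. The timestamp of the first error gives a point from which to read backwards in /var/log/messages.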