On Nov 13, 2013, at 3:15 AM, Jan Friesse <jfrie...@redhat.com> wrote:
> Andrew Beekhof wrote:
>>
>> On 13 Nov 2013, at 11:49 am, Sean Lutner <s...@rentul.net> wrote:
>>
>>>> On Nov 12, 2013, at 7:33 PM, Andrew Beekhof
>>>> <and...@beekhof.net> wrote:
>>>>
>>>>> On 13 Nov 2013, at 11:22 am, Sean Lutner <s...@rentul.net>
>>>>> wrote:
>>>>>
>>>>>> On Nov 12, 2013, at 6:01 PM, Andrew Beekhof
>>>>>> <and...@beekhof.net> wrote:
>>>>>>
>>>>>>> On 13 Nov 2013, at 6:10 am, Sean Lutner <s...@rentul.net>
>>>>>>> wrote:
>>>>>>>
>>>>>>> The folks testing the cluster I've been building have run
>>>>>>> a script which blocks all traffic except SSH on one node
>>>>>>> of the cluster for 15 seconds to mimic a network failure.
>>>>>>> During this time, the network being "down" seems to cause
>>>>>>> some odd behavior from pacemaker resulting in it dying.
>>>>>>>
>>>>>>> The cluster is two nodes and running four custom
>>>>>>> resources on EC2 instances. The OS is CentOS 6.4 with the
>>>>>>> config below:
>>>>>>>
>>>>>>> I've attached the /var/log/messages and
>>>>>>> /var/log/cluster/corosync.log from the time period during
>>>>>>> the test. I'm having some difficulty in piecing together
>>>>>>> what happened and am hoping someone can shed some light
>>>>>>> on the problem. Any indications why pacemaker is dying on
>>>>>>> that node?
>>>>>>
>>>>>> Because corosync is dying underneath it:
>>>>>>
>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: error: cib_ais_destroy: Corosync connection lost! Exiting.
>>>>>> Nov 09 14:51:49 [942] ip-10-50-3-251 cib: info: terminate_cib: cib_ais_destroy: Exiting fast...
>>>>>
>>>>> Is that the expected behavior?
>>>>
>>>> It is expected behaviour when corosync dies. Ideally corosync
>>>> wouldn't die though.
>>>
>>> What other debugging can I do to try to find out why corosync
>>> died?
>>
>> There are various logging settings that may help. CC'ing Jan to see
>> if he has any suggestions.
>>
>
> If corosync really died, corosync-fplay output (right after corosync
> death) and a coredump are most useful.
>
> Regards,
> Honza

So the process to collect this would be:

- Run the test
- Watch the logs for corosync to die
- Run corosync-fplay and capture the output (will corosync-fplay > file.out suffice?)
- Capture a core dump from corosync

How do I capture the core dump? Is it something that has to be enabled in the /etc/corosync/corosync.conf file before running the tests? I've not done this in the past.

Thanks

>
>>>
>>> Thanks
>>>
>>>>
>>>>> Is it because the DC was the other node?
>>>>
>>>> No.
>>>>
>>>>>
>>>>> I did notice that there was an attempted fence operation but
>>>>> it didn't look successful.
>>>>>
>>>>>>>
>>>>>>> [root@ip-10-50-3-122 ~]# pcs config
>>>>>>> Corosync Nodes:
>>>>>>>
>>>>>>> Pacemaker Nodes:
>>>>>>> ip-10-50-3-122 ip-10-50-3-251
>>>>>>>
>>>>>>> Resources:
>>>>>>> Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
>>>>>>> Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
>>>>>>> Operations: monitor interval=5s
>>>>>>> Clone: EIP-AND-VARNISH-clone
>>>>>>> Group: EIP-AND-VARNISH
>>>>>>> Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
>>>>>>> Operations: monitor interval=5s
>>>>>>> Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
>>>>>>> Operations: monitor interval=5s
>>>>>>> Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
>>>>>>> Operations: monitor interval=5s
>>>>>>> Resource: ec2-fencing (type=fence_ec2 class=stonith)
>>>>>>> Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
>>>>>>> Operations: monitor start-delay=30s interval=0 timeout=150s
>>>>>>>
>>>>>>> Location Constraints:
>>>>>>> Ordering Constraints:
>>>>>>> ClusterEIP_54.215.143.166 then Varnish
>>>>>>> Varnish then Varnishlog
>>>>>>> Varnishlog then Varnishncsa
>>>>>>> Colocation Constraints:
>>>>>>> Varnish with ClusterEIP_54.215.143.166
>>>>>>> Varnishlog with Varnish
>>>>>>> Varnishncsa with Varnishlog
>>>>>>>
>>>>>>> Cluster Properties:
>>>>>>> dc-version: 1.1.8-7.el6-394e906
>>>>>>> cluster-infrastructure: cman
>>>>>>> last-lrm-refresh: 1384196963
>>>>>>> no-quorum-policy: ignore
>>>>>>> stonith-enabled: true
>>>>>>>
>>>>>>> <net-failure-messages-110913.out><net-failure-corosync-110913.out>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org
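
The collection steps proposed in this message (wait for corosync to die, then run corosync-fplay and save its output) could be scripted roughly as follows. This is only a sketch and not something posted in the thread: the helper names `is_running` and `capture_after_exit`, the polling approach, and the output file name are all assumptions made for illustration. It also does not enable core dumps, which normally depends on the core file size limit in effect when corosync was started.

```python
import subprocess
import time
from pathlib import Path


def is_running(check_cmd):
    """Return True if check_cmd exits with status 0.

    With a command such as ["pgrep", "-x", "corosync"], exit status 0
    means a matching process was found.
    """
    result = subprocess.run(
        check_cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


def capture_after_exit(check_cmd, dump_cmd, out_path,
                       poll_seconds=1.0, max_polls=None):
    """Poll until check_cmd fails, then run dump_cmd and save its output.

    Hypothetical helper: returns True once the output has been written,
    or False if max_polls checks passed while the process was still alive.
    """
    polls = 0
    while is_running(check_cmd):
        polls += 1
        if max_polls is not None and polls >= max_polls:
            return False
        time.sleep(poll_seconds)

    # The watched process is gone: capture stdout and stderr together,
    # the same data that "corosync-fplay > file.out 2>&1" would collect.
    result = subprocess.run(
        dump_cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT
    )
    Path(out_path).write_bytes(result.stdout)
    return True
```

On the affected node this might be started before the test as `capture_after_exit(["pgrep", "-x", "corosync"], ["corosync-fplay"], "corosync-fplay.out")`, so the flight data is saved as soon as the corosync process disappears.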