The folks testing the cluster I've been building ran a script that blocks all traffic except SSH on one node of the cluster for 15 seconds to mimic a network failure. While the network is "down", pacemaker on that node behaves oddly and ends up dying.
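For anyone wanting to reproduce this, here is a minimal sketch of what such a test could look like. It is not the testers' actual script: the use of iptables, the rule positions, and all function names are assumptions made for illustration. It defaults to a dry run that only lists the commands; running it for real needs root on the node.

```python
import subprocess
import time

SSH_PORT = 22          # assumed: SSH on the default port
OUTAGE_SECONDS = 15    # outage length from the test description


def block_rules(ssh_port=SSH_PORT):
    """iptables commands that allow SSH and drop everything else (assumed rules)."""
    port = str(ssh_port)
    return [
        ["iptables", "-I", "INPUT", "1", "-p", "tcp", "--dport", port, "-j", "ACCEPT"],
        ["iptables", "-I", "OUTPUT", "1", "-p", "tcp", "--sport", port, "-j", "ACCEPT"],
        # Inserted at position 2 so they sit right behind the SSH exceptions.
        ["iptables", "-I", "INPUT", "2", "-j", "DROP"],
        ["iptables", "-I", "OUTPUT", "2", "-j", "DROP"],
    ]


def unblock_rules(ssh_port=SSH_PORT):
    """Matching delete commands, in reverse order of insertion."""
    undo = []
    for rule in reversed(block_rules(ssh_port)):
        chain, spec = rule[2], rule[4:]   # drop "-I" and the rule number
        undo.append(["iptables", "-D", chain] + spec)
    return undo


def simulate_outage(seconds=OUTAGE_SECONDS, dry_run=True):
    """Block traffic, wait, then restore. Returns the commands in order."""
    executed = []

    def apply(commands):
        for cmd in commands:
            executed.append(cmd)
            if not dry_run:
                subprocess.check_call(cmd)

    apply(block_rules())
    try:
        if not dry_run:
            time.sleep(seconds)
    finally:
        # Always try to restore connectivity, even if interrupted.
        apply(unblock_rules())
    return executed


if __name__ == "__main__":
    for cmd in simulate_outage(dry_run=True):
        print(" ".join(cmd))
```

Calling `simulate_outage(dry_run=False)` as root would apply the rules, sleep 15 seconds, and remove them again.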
The cluster has two nodes and runs four custom resources on EC2 instances. The OS is CentOS 6.4, and the cluster config is below.

I've attached /var/log/messages and /var/log/cluster/corosync.log covering the time period of the test. I'm having some difficulty piecing together what happened and am hoping someone can shed some light on the problem. Any indications why pacemaker is dying on that node?

[root@ip-10-50-3-122 ~]# pcs config
Corosync Nodes:

Pacemaker Nodes:
 ip-10-50-3-122 ip-10-50-3-251

Resources:
 Resource: ClusterEIP_54.215.143.166 (provider=pacemaker type=EIP class=ocf)
  Attributes: first_network_interface_id=eni-e4e0b68c second_network_interface_id=eni-35f9af5d first_private_ip=10.50.3.191 second_private_ip=10.50.3.91 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
  Operations: monitor interval=5s
 Clone: EIP-AND-VARNISH-clone
  Group: EIP-AND-VARNISH
   Resource: Varnish (provider=redhat type=varnish.sh class=ocf)
    Operations: monitor interval=5s
   Resource: Varnishlog (provider=redhat type=varnishlog.sh class=ocf)
    Operations: monitor interval=5s
   Resource: Varnishncsa (provider=redhat type=varnishncsa.sh class=ocf)
    Operations: monitor interval=5s
 Resource: ec2-fencing (type=fence_ec2 class=stonith)
  Attributes: ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list pcmk_host_list=HA01 HA02
  Operations: monitor start-delay=30s interval=0 timeout=150s

Location Constraints:
Ordering Constraints:
  ClusterEIP_54.215.143.166 then Varnish
  Varnish then Varnishlog
  Varnishlog then Varnishncsa
Colocation Constraints:
  Varnish with ClusterEIP_54.215.143.166
  Varnishlog with Varnish
  Varnishncsa with Varnishlog

Cluster Properties:
 dc-version: 1.1.8-7.el6-394e906
 cluster-infrastructure: cman
 last-lrm-refresh: 1384196963
 no-quorum-policy: ignore
 stonith-enabled: true

Attachments:
 net-failure-messages-110913.out
 net-failure-corosync-110913.out
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org