Just to tie this off: the cluster has been stable since I reinstalled VMware Tools on both nodes, so the problem seems to have had nothing to do with Corosync or Pacemaker.

Regards,
Darren

On 7 February 2013 11:03, Darren Mansell <darren.mans...@gmail.com> wrote:

> Hi all.
>
> I've installed a Corosync/Pacemaker cluster of 2 nodes into a VMware ESX
> environment. The install uses Debian squeeze (6.0) with packages from
> squeeze-backports.
>
> These are package versions in use:
>
> corosync 1.4.2-1~bpo60+1
> pacemaker 1.1.7-1~bpo60+1
> ( + required packages and libs )
> ( I had to use backports to get the failure-timeout ability )
>
> I use these 2 nodes to run ldirectord and a VIP to load-balance a MS
> Exchange cluster and it works very well in the main. But about twice a day
> there are losses of quorum where the cluster will go split-brain then
> recover after about 30 seconds.
>
> I've already had to disable STONITH for this issue as it was causing long
> shoot-outs and taking a while to recover. Now with failure-timeouts and no
> STONITH it comes back fairly quickly.
>
> I've attached a hb_report from both nodes and put the cluster config
> below. Any ideas or thoughts would be most welcome.
>
> Many thanks.
> Darren
>
> crm configure show:
> node exlb01
> node exlb02
> primitive VIP1 ocf:heartbeat:IPaddr2 \
>         params lvs_support="true" ip="10.8.35.55" cidr_netmask="24" broadcast="10.8.35.255" \
>         op monitor interval="60" timeout="60" \
>         meta migration-threshold="2" failure-timeout="120"
> primitive ldirectord ocf:heartbeat:ldirectord \
>         params configfile="/etc/ha.d/ldirectord.cf" \
>         op monitor interval="60" timeout="60" \
>         meta migration-threshold="2" target-role="Started" failure-timeout="120"
> group lb VIP1 ldirectord \
>         meta target-role="Started"
> location l-lb-100 lb 100: exlb01
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         no-quorum-policy="ignore" \
>         stonith-enabled="false" \
>         last-lrm-refresh="1355878292" \
>         cluster-recheck-interval="60s"
>
> crm status:
> ============
> Last updated: Thu Feb 7 11:01:06 2013
> Last change: Wed Dec 19 01:32:40 2012
> Stack: openais
> Current DC: exlb02 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 2 Resources configured.
> ============
>
> Online: [ exlb02 exlb01 ]
>
>  Resource Group: lb
>      VIP1 (ocf::heartbeat:IPaddr2): Started exlb01
>      ldirectord (ocf::heartbeat:ldirectord): Started exlb01
>
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org