Every 15-18 minutes one of my resources gets stopped on one node and then is restarted shortly after.
In the DC log I can see the following error lines.

Dec 28 15:04:09 app01 pengine: [8618]: debug: clone_rsc_colocation_rh: Pairing resOCFS:1 with groupOcfs2Mgmt:0
Dec 28 15:04:09 app01 pengine: [8618]: debug: native_assign_node: Assigning app02 to resOCFS:1
Dec 28 15:04:09 app01 pengine: [8618]: ERROR: color_instance: Pre-allocation failed: got app02 instead of app01
Dec 28 15:04:09 app01 pengine: [8618]: info: native_deallocate: Deallocating resOCFS:1 from app02
Dec 28 15:04:09 app01 pengine: [8618]: debug: clone_rsc_colocation_rh: Pairing resOCFS:0 with groupOcfs2Mgmt:0
Dec 28 15:04:09 app01 pengine: [8618]: debug: native_assign_node: Assigning app02 to resOCFS:0
Dec 28 15:04:09 app01 pengine: [8618]: debug: clone_rsc_colocation_rh: Pairing resOCFS:1 with groupOcfs2Mgmt:1
Dec 28 15:04:09 app01 pengine: [8618]: debug: clone_rsc_colocation_rh: Pairing resOCFS:1 with groupOcfs2Mgmt:1
Dec 28 15:04:09 app01 pengine: [8618]: debug: native_assign_node: All nodes for resource resOCFS:1 are unavailable, unclean or shutting down (app01: 1, -1000000)
Dec 28 15:04:09 app01 pengine: [8618]: debug: native_assign_node: Could not allocate a node for resOCFS:1
Dec 28 15:04:09 app01 pengine: [8618]: info: native_color: Resource resOCFS:1 cannot run anywhere

This plays out before every stop event of OCFS. Here is the cib.

primitive VirtualIP0 ocf:heartbeat:IPaddr2 \
        params ip="10.121.12.30" \
        op monitor interval="10s" \
        meta target-role="Started"
primitive resDLM ocf:pacemaker:controld
primitive resDrbdShared0 ocf:linbit:drbd \
        params drbd_resource="shared0" \
        operations $id="resDrbd-operations" \
        op monitor interval="20" role="Master" timeout="20" notify="true" \
        op monitor interval="30" role="Slave" timeout="20" notify="true"
primitive resJboss lsb:jboss4 \
        op monitor interval="120s" timeout="150s" \
        op start interval="0" timeout="150s" \
        op stop interval="0" timeout="150s"
primitive resO2CB ocf:pacemaker:o2cb
primitive resOCFS ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/shared0" directory="/data" fstype="ocfs2" \
        op monitor interval="120s" timeout="40" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
group groupOcfs2Mgmt resDLM resO2CB
ms msDrbdShared0 resDrbdShared0 \
        meta resource-stickines="100" notify="true" interleave="true" master-max="2" target-role="Started"
clone cloneJboss resJboss \
        meta interleave="true" ordered="true" is-managed="false" target-role="Started"
clone cloneOCFS resOCFS \
        meta interleave="true" ordered="true" target-role="Started" is-managed="true"
clone cloneOcfs2Mgmt groupOcfs2Mgmt \
        meta interleave="true" target-role="Started"
location locVirtualIP0 VirtualIP0 9001: app01
colocation colDRBD inf: cloneOcfs2Mgmt msDrbdShared0:Master
colocation colOcfs2 inf: cloneOCFS cloneOcfs2Mgmt
order ordDRBD inf: msDrbdShared0:promote cloneOcfs2Mgmt:start
order ordOcfs2 inf: cloneOcfs2Mgmt:start cloneOCFS:start
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        last-lrm-refresh="1356702541"
rsc_defaults $id="rsc-options" \
        resource-stickiness="0"
op_defaults $id="op-options" \
        timeout="20s"

I first suspected wrong network name resolution, but /etc/hosts is correct and contains no duplicate names.

-- 
Hälsningar / Greetings
Stefan Midjich
[De omnibus dubitandum]
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org