I have a two-node cluster with no-quorum-policy=ignore; I call these nodes node-0 and node-1. In addition, I have two cluster resources in a group: an IP address and an OCF script.
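For reference, the resource group corresponds roughly to the following crm shell commands (illustrative only; the authoritative CIB XML is included further down, and all values here are taken from it):

    crm configure primitive FAILOVER-INTER ocf:heartbeat:IPaddr2 \
        params ip=10.20.7.190 nic=eth1 cidr_netmask=14 \
        op monitor interval=5s
    crm configure primitive GOL-HA ocf:redhat:script.sh \
        params name=gol-ha file=/etc/init.d/gol-ha \
        op monitor interval=60s
    crm configure group Group FAILOVER-INTER GOL-HA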
Normally these resources are active on node-0. However, when I bounce Pacemaker on node-1 (service pacemaker stop followed by service pacemaker start), the OCF resource also gets bounced on node-0, which is unexpected and is causing problems for my application. In the log messages I see that the monitor operation failed with "unknown error", leading to a "resource is active on 2 nodes" error, and the recovery procedure then bounces the OCF resource. But when I manually run the monitor action of my OCF script, the return value is always either OCF_SUCCESS (0) or OCF_NOT_RUNNING (7).
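To be precise, this is roughly how I run the monitor action by hand (a minimal sketch; the agent path follows from the standard OCF layout for the provider/type in the CIB below, OCF_ROOT=/usr/lib/ocf is the usual default, and the OCF_RESKEY_* variables mirror the configured instance attributes):

    # invoke the monitor action the same way the lrmd would
    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_name=gol-ha
    export OCF_RESKEY_file=/etc/init.d/gol-ha
    /usr/lib/ocf/resource.d/redhat/script.sh monitor
    echo $?   # prints 0 (OCF_SUCCESS) or 7 (OCF_NOT_RUNNING)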
I am using the following software versions:

    Pacemaker: 1.1.10
    Corosync: 1.4.1-15
    OS: CentOS 6.4

What am I doing wrong? Below I am including the CIB configuration and the corresponding log messages.

    <cib epoch="10" num_updates="94" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Jan 7 18:11:58 2014" update-origin="gol-5-7-0" update-client="cibadmin" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="gol-5-7-0">
      <configuration>
        <crm_config>
          <cluster_property_set id="cib-bootstrap-options">
            <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-1.el6_4.4-368c726"/>
            <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
            <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
            <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
            <nvpair id="cib-bootstrap-options-migration-threshold" name="migration-threshold" value="3"/>
          </cluster_property_set>
        </crm_config>
        <nodes>
          <node id="gol-5-7-6" uname="gol-5-7-6"/>
          <node id="gol-5-7-0" uname="gol-5-7-0"/>
        </nodes>
        <resources>
          <group id="Group">
            <primitive class="ocf" id="FAILOVER-INTER" provider="heartbeat" type="IPaddr2">
              <instance_attributes id="FAILOVER-INTER-instance_attributes">
                <nvpair id="FAILOVER-INTER-instance_attributes-ip" name="ip" value="10.20.7.190"/>
                <nvpair id="FAILOVER-INTER-instance_attributes-nic" name="nic" value="eth1"/>
                <nvpair id="FAILOVER-INTER-instance_attributes-cidr_netmask" name="cidr_netmask" value="14"/>
              </instance_attributes>
              <operations>
                <op id="FAILOVER-INTER-monitor-interval-5s" interval="5s" name="monitor"/>
              </operations>
            </primitive>
            <primitive class="ocf" id="GOL-HA" provider="redhat" type="script.sh">
              <instance_attributes id="GOL-HA-instance_attributes">
                <nvpair id="GOL-HA-instance_attributes-name" name="name" value="gol-ha"/>
                <nvpair id="GOL-HA-instance_attributes-file" name="file" value="/etc/init.d/gol-ha"/>
              </instance_attributes>
              <operations>
                <op id="GOL-HA-monitor-interval-60s" interval="60s" name="monitor"/>
              </operations>
            </primitive>
          </group>
        </resources>
        <constraints/>
        <rsc_defaults>
          <meta_attributes id="rsc_defaults-options">
            <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="100"/>
          </meta_attributes>
        </rsc_defaults>
      </configuration>

Corresponding log messages:

    Feb 04 11:27:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
    Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
    Feb 04 11:27:29 [45168] gol-5-7-0 crmd: notice: crm_update_peer_state: cman_event_callback: Node gol-5-7-6[2] - state is now member (was lost)
    Feb 04 11:27:29 corosync [CPG ] chosen downlist: sender r(0) ip(172.16.0.2) ; members(old:1 left:0)
    Feb 04 11:27:29 corosync [MAIN ] Completed service synchronization, ready to provide service.
    Feb 04 11:27:36 [45168] gol-5-7-0 crmd: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
    Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_local_callback: Sending full refresh (origin=crmd)
    Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-GOL-HA (5)
    Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-GOL-HA (1391444085)
    Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op: Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: process_pe_message: Calculated Transition 1825: /var/lib/pacemaker/pengine/pe-input-45.bz2
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 7: monitor FAILOVER-INTER_monitor_0 on gol-5-7-6
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 8: monitor GOL-HA_monitor_0 on gol-5-7-6
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: warning: status_from_rc: Action 8 (GOL-HA_monitor_0) on gol-5-7-6 failed (target: 7 vs. rc: 1): Error
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 6: probe_complete probe_complete on gol-5-7-6 - no waiting
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: run_graph: Transition 1825 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-45.bz2): Stopped
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op: Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op: Processing failed op monitor for GOL-HA on gol-5-7-6: unknown error (1)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: error: native_create_actions: Resource GOL-HA (ocf::script.sh) is active on 2 nodes attempting recovery
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: LogActions: Recover GOL-HA (Started gol-5-7-0)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: error: process_pe_message: Calculated Transition 1826: /var/lib/pacemaker/pengine/pe-error-3.bz2
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 10: stop GOL-HA_stop_0 on gol-5-7-0 (local)
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 3: stop GOL-HA_stop_0 on gol-5-7-6
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 7: probe_complete probe_complete on gol-5-7-6 - no waiting
    Feb 04 11:27:39 [45168] gol-5-7-0 crmd: notice: process_lrm_event: LRM operation GOL-HA_stop_0 (call=111, rc=0, cib-update=1953, confirmed=true) ok
    Feb 04 11:27:39 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 11: start GOL-HA_start_0 on gol-5-7-0 (local)
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: process_lrm_event: LRM operation GOL-HA_start_0 (call=115, rc=0, cib-update=1954, confirmed=true) ok
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 1: monitor GOL-HA_monitor_60000 on gol-5-7-0 (local)
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: process_lrm_event: LRM operation GOL-HA_monitor_60000 (call=118, rc=0, cib-update=1955, confirmed=false) ok
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: run_graph: Transition 1826 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
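In case it is relevant, the monitor action of my script.sh boils down to something like the following (a simplified, hypothetical sketch; the real agent wraps the init script passed in the file parameter, as configured in the CIB above):

    #!/bin/sh
    # Simplified monitor action: map the wrapped init script's
    # status result to OCF return codes.
    : "${OCF_RESKEY_file:=/etc/init.d/gol-ha}"

    case "$1" in
    monitor)
        if "$OCF_RESKEY_file" status >/dev/null 2>&1; then
            exit 0   # OCF_SUCCESS: the service is running
        else
            exit 7   # OCF_NOT_RUNNING: the service is cleanly stopped
        fi
        ;;
    esac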