I have a two-node cluster with no-quorum-policy=ignore; I call these nodes node-0 and node-1. In addition, I have two cluster resources in a group: an IP address and an OCF script.
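For reference, the resource group corresponds roughly to the following crm shell commands (illustrative only; the authoritative CIB XML is included further down, and all values here are taken from it):

    crm configure primitive FAILOVER-INTER ocf:heartbeat:IPaddr2 \
        params ip=10.20.7.190 nic=eth1 cidr_netmask=14 \
        op monitor interval=5s
    crm configure primitive GOL-HA ocf:redhat:script.sh \
        params name=gol-ha file=/etc/init.d/gol-ha \
        op monitor interval=60s
    crm configure group Group FAILOVER-INTER GOL-HA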
Normally these resources are active on node-0. However, when I bounce Pacemaker on node-1 (service pacemaker stop followed by service pacemaker start), the OCF resource also gets bounced on node-0, which is unexpected and is causing problems for my application. In the log messages I see that the monitor operation failed with "unknown error", leading to a "resource is active on 2 nodes" error, and the recovery procedure then bounces the OCF resource. But when I manually run the monitor action of my OCF script, the return value is always either OCF_SUCCESS (0) or OCF_NOT_RUNNING (7).
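To be precise, this is roughly how I run the monitor action by hand (a minimal sketch; the agent path follows from the standard OCF layout for the provider/type in the CIB below, OCF_ROOT=/usr/lib/ocf is the usual default, and the OCF_RESKEY_* variables mirror the configured instance attributes):

    # invoke the monitor action the same way the lrmd would
    export OCF_ROOT=/usr/lib/ocf
    export OCF_RESKEY_name=gol-ha
    export OCF_RESKEY_file=/etc/init.d/gol-ha
    /usr/lib/ocf/resource.d/redhat/script.sh monitor
    echo $?   # prints 0 (OCF_SUCCESS) or 7 (OCF_NOT_RUNNING)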
I am using the following software versions:

    Pacemaker: 1.1.10
    Corosync: 1.4.1-15
    OS: CentOS 6.4

What am I doing wrong? Below I am including the CIB configuration and the corresponding log messages.

    <cib epoch="10" num_updates="94" admin_epoch="0" validate-with="pacemaker-1.2" cib-last-written="Tue Jan 7 18:11:58 2014" update-origin="gol-5-7-0" update-client="cibadmin" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="gol-5-7-0">
      <configuration>
        <crm_config>
          <cluster_property_set id="cib-bootstrap-options">
            <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.10-1.el6_4.4-368c726"/>
            <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
            <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
            <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
            <nvpair id="cib-bootstrap-options-migration-threshold" name="migration-threshold" value="3"/>
          </cluster_property_set>
        </crm_config>
        <nodes>
          <node id="gol-5-7-6" uname="gol-5-7-6"/>
          <node id="gol-5-7-0" uname="gol-5-7-0"/>
        </nodes>
        <resources>
          <group id="Group">
            <primitive class="ocf" id="FAILOVER-INTER" provider="heartbeat" type="IPaddr2">
              <instance_attributes id="FAILOVER-INTER-instance_attributes">
                <nvpair id="FAILOVER-INTER-instance_attributes-ip" name="ip" value="10.20.7.190"/>
                <nvpair id="FAILOVER-INTER-instance_attributes-nic" name="nic" value="eth1"/>
                <nvpair id="FAILOVER-INTER-instance_attributes-cidr_netmask" name="cidr_netmask" value="14"/>
              </instance_attributes>
              <operations>
                <op id="FAILOVER-INTER-monitor-interval-5s" interval="5s" name="monitor"/>
              </operations>
            </primitive>
            <primitive class="ocf" id="GOL-HA" provider="redhat" type="script.sh">
              <instance_attributes id="GOL-HA-instance_attributes">
                <nvpair id="GOL-HA-instance_attributes-name" name="name" value="gol-ha"/>
                <nvpair id="GOL-HA-instance_attributes-file" name="file" value="/etc/init.d/gol-ha"/>
              </instance_attributes>
              <operations>
                <op id="GOL-HA-monitor-interval-60s" interval="60s" name="monitor"/>
              </operations>
            </primitive>
          </group>
        </resources>
        <constraints/>
        <rsc_defaults>
          <meta_attributes id="rsc_defaults-options">
            <nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="100"/>
          </meta_attributes>
        </rsc_defaults>
      </configuration>

Corresponding log messages:

    Feb 04 11:27:29 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
    Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
    Feb 04 11:27:29 corosync [QUORUM] Members[2]: 1 2
    Feb 04 11:27:29 [45168] gol-5-7-0 crmd: notice: crm_update_peer_state: cman_event_callback: Node gol-5-7-6[2] - state is now member (was lost)
    Feb 04 11:27:29 corosync [CPG ] chosen downlist: sender r(0) ip(172.16.0.2) ; members(old:1 left:0)
    Feb 04 11:27:29 corosync [MAIN ] Completed service synchronization, ready to provide service.
    Feb 04 11:27:36 [45168] gol-5-7-0 crmd: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=peer_update_callback ]
    Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_local_callback: Sending full refresh (origin=crmd)
    Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-GOL-HA (5)
    Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-GOL-HA (1391444085)
    Feb 04 11:27:38 [45166] gol-5-7-0 attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op: Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: process_pe_message: Calculated Transition 1825: /var/lib/pacemaker/pengine/pe-input-45.bz2
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 7: monitor FAILOVER-INTER_monitor_0 on gol-5-7-6
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 8: monitor GOL-HA_monitor_0 on gol-5-7-6
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: warning: status_from_rc: Action 8 (GOL-HA_monitor_0) on gol-5-7-6 failed (target: 7 vs. rc: 1): Error
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 6: probe_complete probe_complete on gol-5-7-6 - no waiting
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: run_graph: Transition 1825 (Complete=3, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-45.bz2): Stopped
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op: Processing failed op monitor for GOL-HA on gol-5-7-0: unknown error (1)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: warning: unpack_rsc_op: Processing failed op monitor for GOL-HA on gol-5-7-6: unknown error (1)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: error: native_create_actions: Resource GOL-HA (ocf::script.sh) is active on 2 nodes attempting recovery
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: notice: LogActions: Recover GOL-HA (Started gol-5-7-0)
    Feb 04 11:27:38 [45167] gol-5-7-0 pengine: error: process_pe_message: Calculated Transition 1826: /var/lib/pacemaker/pengine/pe-error-3.bz2
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 10: stop GOL-HA_stop_0 on gol-5-7-0 (local)
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 3: stop GOL-HA_stop_0 on gol-5-7-6
    Feb 04 11:27:38 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 7: probe_complete probe_complete on gol-5-7-6 - no waiting
    Feb 04 11:27:39 [45168] gol-5-7-0 crmd: notice: process_lrm_event: LRM operation GOL-HA_stop_0 (call=111, rc=0, cib-update=1953, confirmed=true) ok
    Feb 04 11:27:39 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 11: start GOL-HA_start_0 on gol-5-7-0 (local)
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: process_lrm_event: LRM operation GOL-HA_start_0 (call=115, rc=0, cib-update=1954, confirmed=true) ok
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: te_rsc_command: Initiating action 1: monitor GOL-HA_monitor_60000 on gol-5-7-0 (local)
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: process_lrm_event: LRM operation GOL-HA_monitor_60000 (call=118, rc=0, cib-update=1955, confirmed=false) ok
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: run_graph: Transition 1826 (Complete=10, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-3.bz2): Complete
    Feb 04 11:27:40 [45168] gol-5-7-0 crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
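In case it is relevant, the monitor action of my script.sh boils down to something like the following (a simplified, hypothetical sketch; the real agent wraps the init script passed in the file parameter, as configured in the CIB above):

    #!/bin/sh
    # Simplified monitor action: map the wrapped init script's
    # status result to OCF return codes.
    : "${OCF_RESKEY_file:=/etc/init.d/gol-ha}"

    case "$1" in
    monitor)
        if "$OCF_RESKEY_file" status >/dev/null 2>&1; then
            exit 0   # OCF_SUCCESS: the service is running
        else
            exit 7   # OCF_NOT_RUNNING: the service is cleanly stopped
        fi
        ;;
    esac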