On 22/05/2013, at 7:31 PM, John McCabe <j...@johnmccabe.net> wrote:

> Hi,
> I've been trying to get fence_rhevm (fence-agents-3.1.5-25.el6_4.2.x86_64)
> working within pacemaker (pacemaker-1.1.8-7.el6.x86_64) but am unable to get
> it to work as intended. Using fence_rhevm on the command line works as
> expected, as does stonith_admin, but from within pacemaker (triggered by
> deliberately killing corosync on the node to be fenced) I get:
>
> May 21 22:21:32 defiant corosync[1245]: [TOTEM ] A processor failed, forming new configuration.
> May 21 22:21:34 defiant corosync[1245]: [QUORUM] Members[1]: 1
> May 21 22:21:34 defiant corosync[1245]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 22:21:34 defiant kernel: dlm: closing connection to node 2
> May 21 22:21:34 defiant corosync[1245]: [CPG ] chosen downlist: sender r(0) ip(10.10.25.152) ; members(old:2 left:1)
> May 21 22:21:34 defiant corosync[1245]: [MAIN ] Completed service synchronization, ready to provide service.
> May 21 22:21:34 defiant crmd[1749]: notice: crm_update_peer_state: cman_event_callback: Node enterprise[2] - state is now lost
> May 21 22:21:34 defiant crmd[1749]: warning: match_down_event: No match for shutdown action on enterprise
> May 21 22:21:34 defiant crmd[1749]: notice: peer_update_callback: Stonith/shutdown of enterprise not matched
> May 21 22:21:34 defiant crmd[1749]: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> May 21 22:21:34 defiant fenced[1302]: fencing node enterprise
> May 21 22:21:34 defiant logger: fence_pcmk[2219]: Requesting Pacemaker fence enterprise (reset)
> May 21 22:21:34 defiant stonith_admin[2220]: notice: crm_log_args: Invoked: stonith_admin --reboot enterprise --tolerance 5s
> May 21 22:21:35 defiant attrd[1747]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> May 21 22:21:35 defiant attrd[1747]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> May 21 22:21:36 defiant pengine[1748]: notice: unpack_config: On loss of CCM Quorum: Ignore
> May 21 22:21:36 defiant pengine[1748]: notice: process_pe_message: Calculated Transition 64: /var/lib/pacemaker/pengine/pe-input-60.bz2
> May 21 22:21:36 defiant crmd[1749]: notice: run_graph: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
> May 21 22:21:36 defiant crmd[1749]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
> May 21 22:21:44 defiant logger: fence_pcmk[2219]: Call to fence enterprise (reset) failed with rc=255
> May 21 22:21:45 defiant fenced[1302]: fence enterprise dev 0.0 agent fence_pcmk result: error from agent
> May 21 22:21:45 defiant fenced[1302]: fence enterprise failed
> May 21 22:21:48 defiant fenced[1302]: fencing node enterprise
> May 21 22:21:48 defiant logger: fence_pcmk[2239]: Requesting Pacemaker fence enterprise (reset)
> May 21 22:21:48 defiant stonith_admin[2240]: notice: crm_log_args: Invoked: stonith_admin --reboot enterprise --tolerance 5s
> May 21 22:21:58 defiant logger: fence_pcmk[2239]: Call to fence enterprise (reset) failed with rc=255
> May 21 22:21:58 defiant fenced[1302]: fence enterprise dev 0.0 agent fence_pcmk result: error from agent
> May 21 22:21:58 defiant fenced[1302]: fence enterprise failed
> May 21 22:22:01 defiant fenced[1302]: fencing node enterprise
>
> and with corosync.log showing "warning: match_down_event: No match for
> shutdown action on enterprise", "notice: peer_update_callback:
> Stonith/shutdown of enterprise not matched":
>
> May 21 22:21:32 corosync [TOTEM ] A processor failed, forming new configuration.
> May 21 22:21:34 corosync [QUORUM] Members[1]: 1
> May 21 22:21:34 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
> May 21 22:21:34 [1749] defiant crmd: info: cman_event_callback: Membership 296: quorum retained
> May 21 22:21:34 [1744] defiant cib: info: pcmk_cpg_membership: Left[5.0] cib.2
> May 21 22:21:34 [1744] defiant cib: info: crm_update_peer_proc: pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
> May 21 22:21:34 [1744] defiant cib: info: pcmk_cpg_membership: Member[5.0] cib.1
> May 21 22:21:34 [1745] defiant stonith-ng: info: pcmk_cpg_membership: Left[5.0] stonith-ng.2
> May 21 22:21:34 [1745] defiant stonith-ng: info: crm_update_peer_proc: pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
> May 21 22:21:34 corosync [CPG ] chosen downlist: sender r(0) ip(10.10.25.152) ; members(old:2 left:1)
> May 21 22:21:34 corosync [MAIN ] Completed service synchronization, ready to provide service.
> May 21 22:21:34 [1745] defiant stonith-ng: info: pcmk_cpg_membership: Member[5.0] stonith-ng.1
> May 21 22:21:34 [1749] defiant crmd: notice: crm_update_peer_state: cman_event_callback: Node enterprise[2] - state is now lost
> May 21 22:21:34 [1749] defiant crmd: info: peer_update_callback: enterprise is now lost (was member)
> May 21 22:21:34 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/150, version=0.22.3): OK (rc=0)
> May 21 22:21:34 [1749] defiant crmd: info: pcmk_cpg_membership: Left[5.0] crmd.2
> May 21 22:21:34 [1749] defiant crmd: info: crm_update_peer_proc: pcmk_cpg_membership: Node enterprise[2] - corosync-cpg is now offline
> May 21 22:21:34 [1749] defiant crmd: info: peer_update_callback: Client enterprise/peer now has status [offline] (DC=true)
> May 21 22:21:34 [1749] defiant crmd: warning: match_down_event: No match for shutdown action on enterprise
> May 21 22:21:34 [1749] defiant crmd: notice: peer_update_callback: Stonith/shutdown of enterprise not matched
> May 21 22:21:34 [1749] defiant crmd: info: crm_update_peer_expected: peer_update_callback: Node enterprise[2] - expected state is now down
> May 21 22:21:34 [1749] defiant crmd: info: abort_transition_graph: peer_update_callback:211 - Triggered transition abort (complete=1) : Node failure
> May 21 22:21:34 [1749] defiant crmd: info: pcmk_cpg_membership: Member[5.0] crmd.1
> May 21 22:21:34 [1749] defiant crmd: notice: do_state_transition: State transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL origin=check_join_state ]
> May 21 22:21:34 [1749] defiant crmd: info: abort_transition_graph: do_te_invoke:163 - Triggered transition abort (complete=1) : Peer Halt
> May 21 22:21:34 [1749] defiant crmd: info: join_make_offer: Making join offers based on membership 296
> May 21 22:21:34 [1749] defiant crmd: info: do_dc_join_offer_all: join-7: Waiting on 1 outstanding join acks
> May 21 22:21:34 [1749] defiant crmd: info: update_dc: Set DC to defiant (3.0.7)
> May 21 22:21:34 [1749] defiant crmd: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
> May 21 22:21:34 [1749] defiant crmd: info: do_dc_join_finalize: join-7: Syncing the CIB from defiant to the rest of the cluster
> May 21 22:21:34 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/154, version=0.22.5): OK (rc=0)
> May 21 22:21:34 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/155, version=0.22.6): OK (rc=0)
> May 21 22:21:34 [1749] defiant crmd: info: stonith_action_create: Initiating action metadata for agent fence_rhevm (target=(null))
> May 21 22:21:35 [1749] defiant crmd: info: do_dc_join_ack: join-7: Updating node state to member for defiant
> May 21 22:21:35 [1749] defiant crmd: info: erase_status_tag: Deleting xpath: //node_state[@uname='defiant']/lrm
> May 21 22:21:35 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='defiant']/lrm (origin=local/crmd/156, version=0.22.7): OK (rc=0)
> May 21 22:21:35 [1749] defiant crmd: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
> May 21 22:21:35 [1749] defiant crmd: info: abort_transition_graph: do_te_invoke:156 - Triggered transition abort (complete=1) : Peer Cancelled
> May 21 22:21:35 [1747] defiant attrd: notice: attrd_local_callback: Sending full refresh (origin=crmd)
> May 21 22:21:35 [1747] defiant attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
> May 21 22:21:35 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/158, version=0.22.9): OK (rc=0)
> May 21 22:21:35 [1744] defiant cib: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/160, version=0.22.11): OK (rc=0)
> May 21 22:21:36 [1748] defiant pengine: info: unpack_config: Startup probes: enabled
> May 21 22:21:36 [1748] defiant pengine: notice: unpack_config: On loss of CCM Quorum: Ignore
> May 21 22:21:36 [1748] defiant pengine: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> May 21 22:21:36 [1748] defiant pengine: info: unpack_domains: Unpacking domains
> May 21 22:21:36 [1748] defiant pengine: info: determine_online_status_fencing: Node defiant is active
> May 21 22:21:36 [1748] defiant pengine: info: determine_online_status: Node defiant is online
> May 21 22:21:36 [1748] defiant pengine: info: native_print: st-rhevm (stonith:fence_rhevm): Started defiant
> May 21 22:21:36 [1748] defiant pengine: info: LogActions: Leave st-rhevm (Started defiant)
> May 21 22:21:36 [1748] defiant pengine: notice: process_pe_message: Calculated Transition 64: /var/lib/pacemaker/pengine/pe-input-60.bz2
> May 21 22:21:36 [1749] defiant crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
> May 21 22:21:36 [1749] defiant crmd: info: do_te_invoke: Processing graph 64 (ref=pe_calc-dc-1369171296-118) derived from /var/lib/pacemaker/pengine/pe-input-60.bz2
> May 21 22:21:36 [1749] defiant crmd: notice: run_graph: Transition 64 (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-60.bz2): Complete
> May 21 22:21:36 [1749] defiant crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
>
>
> I can get the node enterprise to fence as expected from the command line with:
>
> stonith_admin --reboot enterprise --tolerance 5s
>
> fence_rhevm -o reboot -a <hypervisor ip> -l <user>@<domain> -p <password> -n enterprise -z
>
> My config is as follows:
>
> cluster.conf -----------------------------------
>
> <?xml version="1.0"?>
> <cluster config_version="1" name="cluster">
>   <logging debug="off"/>
>   <clusternodes>
>     <clusternode name="defiant" nodeid="1">
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="defiant"/>
>         </method>
>       </fence>
>     </clusternode>
>     <clusternode name="enterprise" nodeid="2">
>       <fence>
>         <method name="pcmk-redirect">
>           <device name="pcmk" port="enterprise"/>
>         </method>
>       </fence>
>     </clusternode>
>   </clusternodes>
>   <fencedevices>
>     <fencedevice name="pcmk" agent="fence_pcmk"/>
>   </fencedevices>
>   <cman two_node="1" expected_votes="1">
>   </cman>
> </cluster>
>
> pacemaker cib ---------------------------------
>
> Stonith device created with:
>
> pcs stonith create st-rhevm fence_rhevm login="<user>@<domain>" passwd="<password>" ssl=1 ipaddr="<hypervisor ip>" verbose=1 debug="/tmp/debug.log"
>
> <cib epoch="18" num_updates="88" admin_epoch="0" validate-with="pacemaker-1.2" update-origin="defiant" update-client="cibadmin" cib-last-written="Tue May 21 07:55:31 2013" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="defiant">
>   <configuration>
>     <crm_config>
>       <cluster_property_set id="cib-bootstrap-options">
>         <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.el6-394e906"/>
>         <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
>         <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
>         <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
>       </cluster_property_set>
>     </crm_config>
>     <nodes>
>       <node id="defiant" uname="defiant"/>
>       <node id="enterprise" uname="enterprise"/>
>     </nodes>
>     <resources>
>       <primitive class="stonith" id="st-rhevm" type="fence_rhevm">
>         <instance_attributes id="st-rhevm-instance_attributes">
>           <nvpair id="st-rhevm-instance_attributes-login" name="login" value="<user>@<domain>"/>
>           <nvpair id="st-rhevm-instance_attributes-passwd" name="passwd" value="<password>"/>
>           <nvpair id="st-rhevm-instance_attributes-debug" name="debug" value="/tmp/debug.log"/>
>           <nvpair id="st-rhevm-instance_attributes-ssl" name="ssl" value="1"/>
>           <nvpair id="st-rhevm-instance_attributes-verbose" name="verbose" value="1"/>
>           <nvpair id="st-rhevm-instance_attributes-ipaddr" name="ipaddr" value="<hypervisor ip>"/>
>         </instance_attributes>
>       </primitive>
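
Your create command doesn't set ipport or any operation timeouts. Purely as a sketch (I haven't run this exact command myself; ipport and shell_timeout are parameters fence_rhevm already accepts, and 443 is what I use), recreating the device with those added would look roughly like:

    # sketch only - same pcs syntax as your own create command, with ipport/shell_timeout added
    pcs stonith create st-rhevm fence_rhevm login="<user>@<domain>" passwd="<password>" \
        ssl=1 ipaddr="<hypervisor ip>" ipport=443 shell_timeout=10 \
        verbose=1 debug="/tmp/debug.log"
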
Mine is:

<primitive id="Fencing" class="stonith" type="fence_rhevm">
  <instance_attributes id="Fencing-params">
    <nvpair id="Fencing-ipport" name="ipport" value="443"/>
    <nvpair id="Fencing-shell_timeout" name="shell_timeout" value="10"/>
    <nvpair id="Fencing-passwd" name="passwd" value="{pass}"/>
    <nvpair id="Fencing-ipaddr" name="ipaddr" value="{ip}"/>
    <nvpair id="Fencing-ssl" name="ssl" value="1"/>
    <nvpair id="Fencing-login" name="login" value="{user}@{domain}"/>
  </instance_attributes>
  <operations>
    <op id="Fencing-monitor-120s" interval="120s" name="monitor" timeout="120s"/>
    <op id="Fencing-stop-0" interval="0" name="stop" timeout="60s"/>
    <op id="Fencing-start-0" interval="0" name="start" timeout="60s"/>
  </operations>
</primitive>

Maybe ipport is important? Also, there was a RHEVM API change recently; I had to update the fence_rhevm agent before it would work again.

>     </resources>
>     <constraints/>
>   </configuration>
>   <status>
>     <node_state id="defiant" uname="defiant" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
>       <transient_attributes id="defiant">
>         <instance_attributes id="status-defiant">
>           <nvpair id="status-defiant-probe_complete" name="probe_complete" value="true"/>
>         </instance_attributes>
>       </transient_attributes>
>       <lrm id="defiant">
>         <lrm_resources>
>           <lrm_resource id="st-rhevm" type="fence_rhevm" class="stonith">
>             <lrm_rsc_op id="st-rhevm_last_0" operation_key="st-rhevm_start_0" operation="start" crm-debug-origin="build_active_RAs" crm_feature_set="3.0.7" transition-key="2:1:0:1e7972e8-6f9a-4325-b9c3-3d7e2950d996" transition-magic="0:0;2:1:0:1e7972e8-6f9a-4325-b9c3-3d7e2950d996" call-id="14" rc-code="0" op-status="0" interval="0" last-run="1369119332" last-rc-change="0" exec-time="232" queue-time="0" op-digest="3bc7e1ce413fe37998a289f77f85d159"/>
>           </lrm_resource>
>         </lrm_resources>
>       </lrm>
>     </node_state>
>     <node_state id="enterprise" uname="enterprise" in_ccm="true" crmd="online" crm-debug-origin="do_update_resource" join="member" expected="member">
>       <lrm id="enterprise">
>         <lrm_resources>
>           <lrm_resource id="st-rhevm" type="fence_rhevm" class="stonith">
>             <lrm_rsc_op id="st-rhevm_last_0" operation_key="st-rhevm_monitor_0" operation="monitor" crm-debug-origin="do_update_resource" crm_feature_set="3.0.7" transition-key="5:59:7:8170c498-f66b-4974-b3c0-c17eb45ba5cb" transition-magic="0:7;5:59:7:8170c498-f66b-4974-b3c0-c17eb45ba5cb" call-id="5" rc-code="7" op-status="0" interval="0" last-run="1369170800" last-rc-change="0" exec-time="4" queue-time="0" op-digest="3bc7e1ce413fe37998a289f77f85d159"/>
>           </lrm_resource>
>         </lrm_resources>
>       </lrm>
>       <transient_attributes id="enterprise">
>         <instance_attributes id="status-enterprise">
>           <nvpair id="status-enterprise-probe_complete" name="probe_complete" value="true"/>
>         </instance_attributes>
>       </transient_attributes>
>     </node_state>
>   </status>
> </cib>
>
>
> The debug log output from fence_rhevm doesn't appear to show pacemaker trying
> to request the reboot, only a vms command sent to the hypervisor, which
> responds with XML listing the VMs.
>
> I can't quite see why it's failing. Are you aware of any issues with
> fence_rhevm (fence-agents-3.1.5-25.el6_4.2.x86_64) not working with pacemaker
> (pacemaker-1.1.8-7.el6.x86_64) on RHEL6.4?
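
If you'd rather add ipport to the resource you already have than recreate it, crm_resource (shipped with pacemaker) should be able to set it in place. Untested sketch, assuming 443 is the right port for your RHEV-M manager:

    # untested sketch: set ipport on the existing st-rhevm stonith resource
    crm_resource --resource st-rhevm --set-parameter ipport --parameter-value 443

    # then re-run the same call fence_pcmk makes (per your logs) and watch /tmp/debug.log
    # to see whether a reboot request, rather than just the vms listing, reaches the hypervisor
    stonith_admin --reboot enterprise --tolerance 5s
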
>
> All the best,
> /John

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org