Hi all,

Has anybody experienced the following? We were simulating the failure of a resource stop with STONITH enabled. The node that failed to stop the resource requested to STONITH itself. However, the request was not processed by the peer node, which logged nothing at that time. Interestingly, this scenario works most of the time; this was the first failure, and we are not aware of anything having been modified on the system. Could anyone explain how such a situation could occur?

Excerpt of logs from the failing node (sp1):

pengine[1186]: 2009/11/10_10:09:07 notice: native_print: SP2-stonith (stonith:external/vsp-ipmi): Started sp2
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: SP1-stonith (stonith:external/vsp-ipmi): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: group_print: Resource Group: udom
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-drbd-udom0 (heartbeat:drbddisk): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-drbd-udom1 (heartbeat:drbddisk): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-drbd-udom2 (heartbeat:drbddisk): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-delay (ocf::heartbeat:Delay): Started sp1
pengine[1186]: 2009/11/10_10:09:07 notice: native_print: udom-vm (ocf::heartbeat:Xen): Started sp1

Initiated resource stop:

...
Xen[23346][24689]: 2009/11/10_10:09:38 INFO: Xen domain udom stopped.
crmd[966]: 2009/11/10_10:09:38 info: process_lrm_event: LRM operation udom-vm_stop_0 (call=27, rc=0, cib-update=121, confirmed=true) complete ok
...
crmd[966]: 2009/11/10_10:09:48 info: process_lrm_event: LRM operation udom-delay_stop_0 (call=30, rc=0, cib-update=124, confirmed=true) complete ok
...
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) 3: State change failed: (-12) Device is held open by someone
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) Command '
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) /sbin/drbdsetup
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr)
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) 3
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr)
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) secondary
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) ' terminated with exit code 11
lrmd[963]: 2009/11/10_10:09:48 info: RA output: (udom-drbd-udom2:stop:stderr) /sbin/drbdadm secondary udom2: exit code 11, mapping to 1
crmd[966]: 2009/11/10_10:09:48 info: process_lrm_event: LRM operation udom-drbd-udom2_stop_0 (call=32, rc=1, cib-update=125, confirmed=true) complete unknown error
cib[962]: 2009/11/10_10:09:48 info: cib_process_request: Operation complete: op cib_modify for section 'all' (origin=local/crmd/125): ok (rc=0)
crmd[966]: 2009/11/10_10:09:48 WARN: status_from_rc: Action 22 (udom-drbd-udom2_stop_0) on sp1 failed (target: 0 vs. rc: 1): Error
crmd[966]: 2009/11/10_10:09:48 WARN: update_failcount: Updating failcount for udom-drbd-udom2 on d3463b4e-6f4a-4ced-8b29-533c3f5c6744 after failed stop: rc=1 (update=INFINITY, time=1257865788)
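For completeness, this is roughly what could be checked on sp1 when the demote fails with "Device is held open by someone". It is only a sketch: /dev/drbd3 is a guess based on the "3:" minor number in the log above, and udom2 is the resource name taken from the drbdadm call.

  cat /proc/drbd               # state and role of every DRBD minor
  drbdadm role udom2           # current role of udom2 (still Primary if something holds it open)
  fuser -vm /dev/drbd3         # processes still holding the device open
  lsof /dev/drbd3              # same information in more detail
  xm list                      # confirm the Xen domain udom has really gone away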
Decision to STONITH itself:

pengine[1186]: 2009/11/10_10:10:48 WARN: native_color: Resource SP1-stonith cannot run anywhere
pengine[1186]: 2009/11/10_10:10:48 WARN: stage6: Scheduling Node sp1 for STONITH
crmd[966]: 2009/11/10_10:10:48 info: te_fence_node: Executing reboot fencing operation (36) on sp1 (timeout=60000)
stonithd[964]: 2009/11/10_10:10:48 info: client tengine [pid: 966] requests a STONITH operation RESET on node sp1
stonithd[964]: 2009/11/10_10:10:48 info: we can't manage sp1, broadcast request to other nodes
stonithd[964]: 2009/11/10_10:10:48 info: Broadcasting the message succeeded: require others to stonith node sp1.
stonithd[964]: 2009/11/10_10:11:48 ERROR: Failed to STONITH the node sp1: optype=RESET, op_result=TIMEOUT
...
crmd[966]: 2009/11/10_10:11:48 info: tengine_stonith_callback: call=-3, optype=1, node_name=sp1, result=2, node_list=, action=36:36:0:3f220573-b709-4243-bf70-5cc5b17c7a57
crmd[966]: 2009/11/10_10:11:48 ERROR: tengine_stonith_callback: Stonith of sp1 failed (2)... aborting transition.

And again:

pengine[1186]: 2009/11/10_10:11:48 WARN: stage6: Scheduling Node sp1 for STONITH
stonithd[964]: 2009/11/10_10:11:48 info: client tengine [pid: 966] requests a STONITH operation RESET on node sp1
crmd[966]: 2009/11/10_10:11:48 info: te_fence_node: Executing reboot fencing operation (35) on sp1 (timeout=60000)
stonithd[964]: 2009/11/10_10:12:48 ERROR: Failed to STONITH the node sp1: optype=RESET, op_result=TIMEOUT
crmd[966]: 2009/11/10_10:12:48 info: tengine_stonith_callback: call=-4, optype=1, node_name=sp1, result=2, node_list=, action=35:37:0:3f220573-b709-4243-bf70-5cc5b17c7a57
crmd[966]: 2009/11/10_10:12:48 ERROR: tengine_stonith_callback: Stonith of sp1 failed (2)... aborting transition.
crmd[966]: 2009/11/10_10:12:48 info: abort_transition_graph: tengine_stonith_callback:391 - Triggered transition abort (complete=0) : Stonith failed

Meanwhile, the logs of the peer node (sp2) show only this:

cib[4608]: 2009/11/10_10:08:59 info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
cib[4777]: 2009/11/10_10:11:21 info: write_cib_contents: Wrote version 0.67.0 of the CIB to disk (digest: 93bfe20b495b39ace698b42ce04b8e2c)
cib[4777]: 2009/11/10_10:11:21 info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
cib[4796]: 2009/11/10_10:13:42 info: write_cib_contents: Wrote version 0.68.0 of the CIB to disk (digest: 9162eeefc80fe9f83876e6324e58c2c7)
cib[4796]: 2009/11/10_10:13:42 info: retrieveCib: Reading cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /var/lib/heartbeat/crm/cib.xml.sig)
cib[3110]: 2009/11/10_10:15:35 info: cib_stats: Processed 11 operations (0.00us average, 0% utilization) in the last 10min

Simply no message reaching stonithd. Is stonith dependent on something on the system? The heartbeat connection between the nodes worked fine.

# rpm -qa | egrep 'heart|pacem'
libheartbeat2-2.99.2-6.1
heartbeat-common-2.99.2-6.1
heartbeat-resources-2.99.2-6.1
heartbeat-2.99.2-6.1
pacemaker-1.0.2-11.1
libpacemaker3-1.0.2-11.1

# uname -a
Linux sp1 2.6.18-128.7.1.el5xen #1 SMP Mon Aug 24 09:14:33 EDT 2009 x86_64 x86_64 x86_64 GNU/Linux

CentOS 5.3

Thanks,
Tino
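P.S. For reference, a rough sketch of how the external/vsp-ipmi plugin could be exercised by hand on the peer node with the stonith(8) CLI, to see whether it can manage sp1 at all. The param=value pairs below are placeholders for whatever vsp-ipmi actually expects:

  stonith -L                                                    # list the stonith plugin types installed
  stonith -t external/vsp-ipmi -n                               # print the parameter names the plugin expects
  stonith -t external/vsp-ipmi param=value ... -lS              # list manageable hosts and report device status
  stonith -t external/vsp-ipmi param=value ... -T reset sp1     # really reset sp1 (this reboots the node!)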