I can reproduce this behavior:
- On node02, which had no resources online, I killed all corosync processes with "killall -9 corosync" (exact commands below).
- Node02 was rebooted through STONITH.
- On node01, I can see the following lines in the message log, quoted between << and >> further down (the "stage6: Scheduling Node node02 for STONITH" line is where the fencing is scheduled).
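In other words, the whole sequence was nothing more than the following; the log filter on node01 is simply how I happened to watch it, not anything required:

  node02:~ # killall -9 corosync     # corosync dies uncleanly, no orderly shutdown of the cluster stack
  node01:~ # tail -f /var/log/messages | grep -E 'pengine|stonith-ng|crmd'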
For me it seems that node01 recognized that the cluster processes on node02 were not shut down properly. So the behavior in this case is to STONITH the node. Could this behavior be disabled? Which setting?

<<
...
Apr 15 08:30:32 node01 pengine: [6152]: notice: unpack_config: On loss of CCM Quorum: Ignore
Apr 15 08:30:32 node01 pengine: [6152]: WARN: pe_fence_node: Node node02 will be fenced because it is un-expectedly down
Apr 15 08:30:32 node01 pengine: [6152]: WARN: determine_online_status: Node node02 is unclean
...
Apr 15 08:30:32 node01 pengine: [6152]: WARN: custom_action: Action res_stonith_node01_stop_0 on node02 is unrunnable (offline)
Apr 15 08:30:32 node01 pengine: [6152]: WARN: custom_action: Marking node node02 unclean
Apr 15 08:30:32 node01 pengine: [6152]: WARN: stage6: Scheduling Node node02 for STONITH
...
ause=C_IPC_MESSAGE origin=handle_response ]
Apr 15 08:30:32 node01 crmd: [6153]: info: unpack_graph: Unpacked transition 4: 5 actions in 5 synapses
Apr 15 08:30:32 node01 crmd: [6153]: info: do_te_invoke: Processing graph 4 (ref=pe_calc-dc-1302849032-37) derived from /var/lib/pengine/pe-warn-7315.bz2
Apr 15 08:30:32 node01 crmd: [6153]: info: te_pseudo_action: Pseudo action 21 fired and confirmed
Apr 15 08:30:32 node01 crmd: [6153]: info: te_pseudo_action: Pseudo action 24 fired and confirmed
Apr 15 08:30:32 node01 crmd: [6153]: info: te_fence_node: Executing reboot fencing operation (26) on node02 (timeout=60000)
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: initiate_remote_stonith_op: Initiating remote operation reboot for node02: 8190cf2d-d876-45d1-8e4d-e620e19ca354
...
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_query: Query <stonith_command t="stonith-ng" st_async_id="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_op="st_query" st_callid="0" st_callopt="0" st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_target="node02" st_device_action="reboot" st_clientid="983fd169-277a-457d-9985-f30f4320542e" st_timeout="6000" src="node01" seq="1" />
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: can_fence_host_with_device: Refreshing port list for res_stonith_node02
Apr 15 08:30:32 node01 stonith-ng: [6148]: WARN: parse_host_line: Could not parse (0 0):
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: can_fence_host_with_device: res_stonith_node02 can fence node02: dynamic-list
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_query: Found 1 matching devices for 'node02'
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: call_remote_stonith: Requesting that node01 perform op reboot node02
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_fence: Exec <stonith_command t="stonith-ng" st_async_id="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_op="st_fence" st_callid="0" st_callopt="0" st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_target="node02" st_device_action="reboot" st_timeout="54000" src="node01" seq="3" />
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: can_fence_host_with_device: res_stonith_node02 can fence node02: dynamic-list
Apr 15 08:30:32 node01 stonith-ng: [6148]: info: stonith_fence: Found 1 matching devices for 'node02'
Apr 15 08:30:32 node01 pengine: [6152]: WARN: process_pe_message: Transition 4: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-7315.bz2
Apr 15 08:30:32 node01 external/ipmi[19297]: [19310]: debug: ipmitool output: Chassis Power Control: Reset
Apr 15 08:30:33 node01 stonith-ng: [6148]: info: log_operation: Operation 'reboot' [19292] for host 'node02' with device 'res_stonith_node02' returned: 0 (call 0 from (null))
Apr 15 08:30:33 node01 stonith-ng: [6148]: info: process_remote_stonith_exec: ExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/ipmi -T reset node02 success: node02 0 " src="node01" seq="4" />
Apr 15 08:30:33 node01 stonith-ng: [6148]: info: remote_op_done: Notifing clients of 8190cf2d-d876-45d1-8e4d-e620e19ca354 (reboot of node02 from 983fd169-277a-457d-9985-f30f4320542e by node01): 1, rc=0
Apr 15 08:30:33 node01 stonith-ng: [6148]: info: stonith_notify_client: Sending st_fence-notification to client 6153/5395a0da-71b3-4437-b284-f10a8470fce6
Apr 15 08:30:33 node01 crmd: [6153]: info: tengine_stonith_callback: StonithOp <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="reboot" st_remote_op="8190cf2d-d876-45d1-8e4d-e620e19ca354" st_callid="0" st_callopt="0" st_rc="0" st_output="Performing: stonith -t external/ipmi -T reset node02 success: node02 0 " src="node01" seq="4" state="1" st_target="node02" />
Apr 15 08:30:33 node01 crmd: [6153]: info: tengine_stonith_callback: Stonith operation 2/26:4:0:25562131-e2c3-4dd8-8be7-a2237e7ad015: OK (0)
Apr 15 08:30:33 node01 crmd: [6153]: info: tengine_stonith_callback: Stonith of node02 passed
Apr 15 08:30:33 node01 crmd: [6153]: info: send_stonith_update: Sending fencing update 85 for node02
Apr 15 08:30:33 node01 crmd: [6153]: notice: crmd_peer_update: Status update: Client node02/crmd now has status [offline] (DC=true)
Apr 15 08:30:33 node01 crmd: [6153]: info: check_join_state: crmd_peer_update: Membership changed since join started: 172 -> 176
...
>>
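Regarding my own question about "which setting": as far as I understand it there is no per-case knob for this; whether an unclean node gets fenced is only governed by the global stonith-enabled cluster property. So the only way I can see to switch the behavior off entirely would be something like the following with the crm shell (just a sketch of what I mean, not saying it is advisable on a two-node cluster):

  # disables fencing cluster-wide; unclean nodes would then no longer be rebooted
  crm configure property stonith-enabled=false

I would rather keep fencing enabled and just understand whether being fenced after "killall -9 corosync" is the intended reaction.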
OS: SLES11-SP1-HAE
Clusterglue: cluster-glue: 1.0.7 (3e3d209f9217f8e517ed1ab8bb2fdd576cc864be)
dc-version="1.1.5-5ce2879aa0d5f43d01629bc20edc6868a9352002"

Installed RPM's:
libpacemaker3-1.1.5-5.5.5
libopenais3-1.1.4-5.4.3
pacemaker-mgmt-2.0.0-0.5.5
cluster-glue-1.0.7-6.6.3
openais-1.1.4-5.4.3
pacemaker-1.1.5-5.5.5
pacemaker-mgmt-client-2.0.0-0.5.5

Thanks a lot.
Tom

2011/4/15 Andrew Beekhof <and...@beekhof.net>:
> Impossible to say without logs. Sounds strange though.
>
> On Fri, Apr 15, 2011 at 7:17 AM, Tom Tux <tomtu...@gmail.com> wrote:
>> Hi
>>
>> I have a two node cluster (stonith enabled). On one node I tried
>> stopping openais (/etc/init.d/openais stop), but this was hanging. So
>> I killed all running corosync processes (killall -9 corosync).
>> Afterward, I started openais on this node again (rcopenais start).
>> After a few seconds, this node was stonith'ed and went to reboot.
>>
>> My question hereby:
>> Is this a normal behavior? If yes, is it, because I killed the hanging
>> corosync-processes and after starting openais again, the cluster
>> recognized an unclean state on this node?
>>
>> Thanks a lot.
>> Tom