I have a customer running pacemaker/openais from the SLE11-HAE distribution. On occasion we have a stonith clone instance timeout. It becomes unrestartable and it is not generally recoverable unless we completely stop openais on both nodes and restart. I have an example from the logs below. Is this a known issue?

 >> This is the stonith operation that has the problem
Nov 15 23:08:35 dmf2 lrmd: [83870]: debug: rsc:stonith-sgi-ipmi:1:10: monitor
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: perform_ra_op: resetting scheduler 
class to SCHED_OTHER
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_signon: creating connection
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: sending out the signon msg.
Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 (pid=98416) succeeded to signon to stonithd.
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: signed on to stonithd.
Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 [pid: 98416] requests a resource operation monitor on stonith-sgi-ipmi:1 (external/sgi-ipmi)
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: waiting for the stonithRA reply msg.
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: a stonith RA operation queue to run, 
call_id=98417.
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_receive_ops_result: begin
Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_status: called.
Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_run_cmd: Calling '/usr/lib64/stonith/plugins/external/sgi-ipmi status'

 >> the stonith operation times out!!
Nov 15 23:08:50 dmf2 stonithd: [83868]: WARN: external_stonith-sgi-ipmi:1_monitor process (PID 98417) timed out (try 1). Killing with signal SIGTERM (15). Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: external_stonith-sgi-ipmi:1_monitor process (PID 98417) timed out (try 2). Killing with signal SIGKILL (9). Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: Managed external_stonith-sgi-ipmi:1_monitor process 98417 killed by signal 9 [SIGKILL - Kill, unblockable]. Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: Child process external_stonith-sgi-ipmi:1_monitor [98417] exited, its exit code: 0 when signo=9.
Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: A STONITH operation timed out.
Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's (external/sgi-ipmi) op monitor finished. op_result=2
Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 
(pid=98416) signed off
Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_lrm_event: LRM operation stonith-sgi-ipmi:1_monitor_15000 (call=10, rc=2, cib-update=118, confirmed=false) invalid parameter
  >> a timeout is interpreted as an invalid parameter
Nov 15 23:08:55 dmf2 crmd: [83873]: debug: te_update_diff: Processing diff (cib_modify): 0.453.14 -> 0.453.15 (S_IDLE) Nov 15 23:08:55 dmf2 crmd: [83873]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_graph_event: Action stonith-sgi-ipmi:1_monitor_15000/32 (0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21) initiated by a different transitioner Nov 15 23:08:55 dmf2 crmd: [83873]: info: abort_transition_graph: process_graph_event:456 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=stonith-sgi-ipmi:1_monitor_15000, magic=0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21, cib=0.453.15) : Foreign event

 >> updating the failcount and last failure time
Nov 15 23:09:09 dmf2 crmd: [83873]: WARN: update_failcount: Updating failcount for stonith-sgi-ipmi:1 on dmf2 after failed monitor: rc=2 (update=value++, time=1289880535) Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback: update message from crmd: fail-count-stonith-sgi-ipmi:1=value++ Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent update: fail-count-stonith-sgi-ipmi:1=value++ for dmf2 Nov 15 23:09:09 dmf2 attrd: [83871]: info: find_hash_entry: Creating hash entry for fail-count-stonith-sgi-ipmi:1 Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent update: last-failure-stonith-sgi-ipmi:1=1289880535 for dmf2 Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback: Supplied: value++, Current: (null), Stored: (null)
...

 >> the timeout is considered a hard error and it will not allow a restart!
Nov 15 23:09:15 dmf2 pengine: [83872]: debug: unpack_rsc_op: stonith-sgi-ipmi:1_monitor_15000 on dmf2 returned 2 (invalid parameter) instead of the expected value: 0 (ok) Nov 15 23:09:15 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard error - stonith-sgi-ipmi:1_monitor_15000 failed with rc=2: Preventing stonith-sgi-ipmi-set from re-starting on dmf2 Nov 15 23:09:37 dmf2 pengine: [83872]: debug: native_assign_node: All nodes for resource stonith-sgi-ipmi:1 are unavailable, unclean or shutting down (dmf1: 1, -1000000) Nov 15 23:09:37 dmf2 pengine: [83872]: WARN: native_color: Resource stonith-sgi-ipmi:1 cannot run anywhere Nov 15 23:09:37 dmf2 pengine: [83872]: debug: clone_color: Allocated 1 stonith-sgi-ipmi-set instances of a possible 2
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource CXFS  
(Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource 
Public-IP     (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource 
PDMO-IP       (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource 
Openvault     (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource DMF   
(Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource NFS   
(Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource stonith-sgi-ipmi:0 (Started dmf1)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Stop resource 
stonith-sgi-ipmi:1     (dmf2)
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: unpack_rsc_op: stonith-sgi-ipmi:1_monitor_15000 on dmf2 returned 2 (invalid parameter) instead of the expected value: 0 (ok) Nov 15 23:09:58 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard error - stonith-sgi-ipmi:1_monitor_15000 failed with rc=2: Preventing stonith-sgi-ipmi-set from re-starting on dmf2 Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op: Processing failed op stonith-sgi-ipmi:1_monitor_15000 on dmf2: invalid parameter (2) Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op: Processing failed op CXFS_monitor_0 on dmf2: unknown exec error (-2)
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: group_print:  Resource Group: 
dmfGroup
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: CXFS (ocf::sgi:cxfs): Started dmf2 Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: Public-IP (ocf::heartbeat:IPaddr2): Started dmf2 Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: PDMO-IP (ocf::heartbeat:IPaddr2): Started dmf2 Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: Openvault (ocf::sgi:openvault): Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print:      DMF   
(ocf::sgi:dmf): Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: NFS (ocf::heartbeat:nfsserver): Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: clone_print:  Clone Set: 
stonith-sgi-ipmi-set
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource 
stonith-sgi-ipmi:0 active on dmf1
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource 
stonith-sgi-ipmi:0 active on dmf1
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource 
stonith-sgi-ipmi:1 active on dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: stonith-sgi-ipmi:1 (stonith:external/sgi-ipmi): Started dmf2 FAILED
Nov 15 23:10:08 dmf2 pengine: [83872]: notice: short_print:      Started: [ 
dmf1 ]
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource stonith-sgi-ipmi:0: preferring current location (node=dmf1, weight=1) Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource CXFS: preferring current location (node=dmf2, weight=1) Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource Public-IP: preferring current location (node=dmf2, weight=1) Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource PDMO-IP: preferring current location (node=dmf2, weight=1) Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource Openvault: preferring current location (node=dmf2, weight=1) Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource DMF: preferring current location (node=dmf2, weight=1) Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource NFS: preferring current location (node=dmf2, weight=1) Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource stonith-sgi-ipmi:1: preferring current location (node=dmf2, weight=1)

 >> This implies the resource can still restart, but the operation is cancelled 
anyway
Nov 15 23:10:08 dmf2 pengine: [83872]: info: get_failcount: stonith-sgi-ipmi-set has failed 1 times on dmf2 Nov 15 23:10:08 dmf2 pengine: [83872]: notice: common_apply_stickiness: stonith-sgi-ipmi-set can fail 999999 more times on dmf2 before being forced off Nov 15 23:10:08 dmf2 pengine: [83872]: debug: native_assign_node: All nodes for resource stonith-sgi-ipmi:1 are unavailable, unclean or shutting down (dmf1: 1, -1000000) Nov 15 23:10:08 dmf2 lrmd: [83870]: info: cancel_op: operation monitor[10] on stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for client 83873, its parameters: CRM_meta_interval=[15000] nodelist=[dmf1;admin;admin;supermicro;10.7.8.201 dmf2;admin;admin;supermicro;10CRM_meta_on_fail=[restart] CRM_meta_timeout=[15000] CRM_meta_clone_max=[2] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] CRM_meta_name=[monitor] CRM_meta_clone=[1] cancelled
Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace:        
// A_FINALIZE_TIMER_STOP
Nov 15 23:10:08 dmf2 pengine: [83872]: WARN: native_color: Resource stonith-sgi-ipmi:1 cannot run anywhere
Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_cancel_op: operation 10 
cancelled
Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace:        
// A_TE_INVOKE
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: clone_color: Allocated 1 stonith-sgi-ipmi-set instances of a possible 2 Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_perform_op: add an operation operation stop[26] on stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for client 83873, its parameters: CRM_meta_timeout=[20000] CRM_meta_clone_max=[2] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] CRM_meta_clone=[1] to the operation list. Nov 15 23:10:09 dmf2 crmd: [83873]: info: te_rsc_command: Initiating action 3: stop stonith-sgi-ipmi:1_stop_0 on dmf2 (local)
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: sending out the signon msg.
Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Cancelling op 10 for stonith-sgi-ipmi:1 (stonith-sgi-ipmi:1:10) Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Op 10 for stonith-sgi-ipmi:1 (stonith-sgi-ipmi:1:10): cancelled
Nov 15 23:10:09 dmf2 pengine: [83872]: notice: LogActions: Stop resource 
stonith-sgi-ipmi:1     (dmf2)
Nov 15 23:10:09 dmf2 crmd: [83873]: info: do_lrm_rsc_op: Performing key=3:35:0:815c525b-e8f1-4bb4-a5eb-2229d1980cc1 op=stonith-sgi-ipmi:1_stop_0 ) Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 (pid=102943) succeeded to signon to stonithd.
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: signed on to stonithd.
Nov 15 23:10:09 dmf2 lrmd: [102943]: info: Try to stop STONITH resource <rsc_id=stonith-sgi-ipmi:1> : Device=external/sgi-ipmi
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: waiting for the stonithRA reply msg.
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 [pid: 102943] requests a resource operation stop on stonith-sgi-ipmi:1 (external/sgi-ipmi)
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: external_destroy: called.
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: a stonith RA operation queue to 
run, call_id=102944.
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: stonithd_receive_ops_result: begin
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: Child process external_stonith-sgi-ipmi:1_stop [102944] exited, its exit code: 0 when signo=0. Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's (external/sgi-ipmi) op stop finished. op_result=0
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 
(pid=102943) signed off
Nov 15 23:10:09 dmf2 lrmd: [83870]: info: Managed stonith-sgi-ipmi:1:stop process 102943 exited with return code 0. Nov 15 23:10:09 dmf2 crmd: [83873]: info: process_lrm_event: LRM operation stonith-sgi-ipmi:1_stop_0 (call=26, rc=0, cib-update=122, confirmed=true) ok Nov 15 23:10:09 dmf2 crmd: [83873]: info: match_graph_event: Action stonith-sgi-ipmi:1_stop_0 (3) confirmed on dmf2 (rc=0)
--

Ron Kerry         rke...@sgi.com

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker

Reply via email to