Hi, On Fri, Nov 19, 2010 at 02:19:02PM -0500, Ron Kerry wrote: > I have a customer running pacemaker/openais from the SLE11-HAE > distribution. On occasion we have a stonith clone instance timeout. > It becomes unrestartable and it is not generally recoverable unless > we completely stop openais on both nodes and restart. I have an > example from the logs below. Is this a known issue?
Didn't see this so far. Strange, this code has been there for quite some time. Can you please open a bugzilla. Thanks, Dejan > >> This is the stonith operation that has the problem > Nov 15 23:08:35 dmf2 lrmd: [83870]: debug: rsc:stonith-sgi-ipmi:1:10: monitor > Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: perform_ra_op: resetting scheduler > class to SCHED_OTHER > Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_signon: creating > connection > Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: sending out the signon msg. > Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client > STONITH_RA_EXEC_98416 (pid=98416) succeeded to signon to stonithd. > Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: signed on to stonithd. > Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client > STONITH_RA_EXEC_98416 [pid: 98416] requests a resource operation > monitor on stonith-sgi-ipmi:1 (external/sgi-ipmi) > Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: waiting for the stonithRA reply > msg. > Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: a stonith RA operation queue to > run, call_id=98417. > Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_receive_ops_result: begin > Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_status: called. > Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_run_cmd: > Calling '/usr/lib64/stonith/plugins/external/sgi-ipmi status' > > >> the stonith operation times out!! > Nov 15 23:08:50 dmf2 stonithd: [83868]: WARN: > external_stonith-sgi-ipmi:1_monitor process (PID 98417) timed out > (try 1). Killing with signal SIGTERM (15). > Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: > external_stonith-sgi-ipmi:1_monitor process (PID 98417) timed out > (try 2). Killing with signal SIGKILL (9). > Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: Managed > external_stonith-sgi-ipmi:1_monitor process 98417 killed by signal 9 > [SIGKILL - Kill, unblockable]. > Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: Child process > external_stonith-sgi-ipmi:1_monitor [98417] exited, its exit code: 0 > when signo=9. > Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: A STONITH operation timed out. > Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's > (external/sgi-ipmi) op monitor finished. op_result=2 > Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 > (pid=98416) signed off > Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_lrm_event: LRM > operation stonith-sgi-ipmi:1_monitor_15000 (call=10, rc=2, > cib-update=118, confirmed=false) invalid parameter > >> a timeout is interpreted as an invalid parameter > Nov 15 23:08:55 dmf2 crmd: [83873]: debug: te_update_diff: > Processing diff (cib_modify): 0.453.14 -> 0.453.15 (S_IDLE) > Nov 15 23:08:55 dmf2 crmd: [83873]: debug: get_xpath_object: No > match for //cib_update_result//diff-added//crm_config in > /notify/cib_update_result/diff > Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_graph_event: > Action stonith-sgi-ipmi:1_monitor_15000/32 > (0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21) initiated by a > different transitioner > Nov 15 23:08:55 dmf2 crmd: [83873]: info: abort_transition_graph: > process_graph_event:456 - Triggered transition abort (complete=1, > tag=lrm_rsc_op, id=stonith-sgi-ipmi:1_monitor_15000, > magic=0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21, cib=0.453.15) > : Foreign event > > >> updating the failcount and last failure time > Nov 15 23:09:09 dmf2 crmd: [83873]: WARN: update_failcount: Updating > failcount for stonith-sgi-ipmi:1 on dmf2 after failed monitor: rc=2 > (update=value++, time=1289880535) > Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback: > update message from crmd: fail-count-stonith-sgi-ipmi:1=value++ > Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent > update: fail-count-stonith-sgi-ipmi:1=value++ for dmf2 > Nov 15 23:09:09 dmf2 attrd: [83871]: info: find_hash_entry: Creating > hash entry for fail-count-stonith-sgi-ipmi:1 > Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent > update: last-failure-stonith-sgi-ipmi:1=1289880535 for dmf2 > Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback: > Supplied: value++, Current: (null), Stored: (null) > ... > > >> the timeout is considered a hard error and it will not allow a restart! > Nov 15 23:09:15 dmf2 pengine: [83872]: debug: unpack_rsc_op: > stonith-sgi-ipmi:1_monitor_15000 on dmf2 returned 2 (invalid > parameter) instead of the expected value: 0 (ok) > Nov 15 23:09:15 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard > error - stonith-sgi-ipmi:1_monitor_15000 failed with rc=2: > Preventing stonith-sgi-ipmi-set from re-starting on dmf2 > Nov 15 23:09:37 dmf2 pengine: [83872]: debug: native_assign_node: > All nodes for resource stonith-sgi-ipmi:1 are unavailable, unclean > or shutting down (dmf1: 1, -1000000) > Nov 15 23:09:37 dmf2 pengine: [83872]: WARN: native_color: Resource > stonith-sgi-ipmi:1 cannot run anywhere > Nov 15 23:09:37 dmf2 pengine: [83872]: debug: clone_color: Allocated > 1 stonith-sgi-ipmi-set instances of a possible 2 > Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource > CXFS (Started dmf2) > Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource > Public-IP (Started dmf2) > Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource > PDMO-IP (Started dmf2) > Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource > Openvault (Started dmf2) > Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource DMF > (Started dmf2) > Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource NFS > (Started dmf2) > Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave > resource stonith-sgi-ipmi:0 (Started dmf1) > Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Stop resource > stonith-sgi-ipmi:1 (dmf2) > Nov 15 23:09:58 dmf2 pengine: [83872]: debug: unpack_rsc_op: > stonith-sgi-ipmi:1_monitor_15000 on dmf2 returned 2 (invalid > parameter) instead of the expected value: 0 (ok) > Nov 15 23:09:58 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard > error - stonith-sgi-ipmi:1_monitor_15000 failed with rc=2: > Preventing stonith-sgi-ipmi-set from re-starting on dmf2 > Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op: > Processing failed op stonith-sgi-ipmi:1_monitor_15000 on dmf2: > invalid parameter (2) > Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op: > Processing failed op CXFS_monitor_0 on dmf2: unknown exec error (-2) > Nov 15 23:09:58 dmf2 pengine: [83872]: notice: group_print: Resource Group: > dmfGroup > Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: > CXFS (ocf::sgi:cxfs): Started dmf2 > Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: > Public-IP (ocf::heartbeat:IPaddr2): Started dmf2 > Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: > PDMO-IP (ocf::heartbeat:IPaddr2): Started dmf2 > Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: > Openvault (ocf::sgi:openvault): Started dmf2 > Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: DMF > (ocf::sgi:dmf): Started dmf2 > Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: > NFS (ocf::heartbeat:nfsserver): Started dmf2 > Nov 15 23:09:58 dmf2 pengine: [83872]: notice: clone_print: Clone Set: > stonith-sgi-ipmi-set > Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource > stonith-sgi-ipmi:0 active on dmf1 > Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource > stonith-sgi-ipmi:0 active on dmf1 > Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource > stonith-sgi-ipmi:1 active on dmf2 > Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: > stonith-sgi-ipmi:1 (stonith:external/sgi-ipmi): Started dmf2 > FAILED > Nov 15 23:10:08 dmf2 pengine: [83872]: notice: short_print: Started: [ > dmf1 ] > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: > common_apply_stickiness: Resource stonith-sgi-ipmi:0: preferring > current location (node=dmf1, weight=1) > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: > common_apply_stickiness: Resource CXFS: preferring current location > (node=dmf2, weight=1) > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: > common_apply_stickiness: Resource Public-IP: preferring current > location (node=dmf2, weight=1) > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: > common_apply_stickiness: Resource PDMO-IP: preferring current > location (node=dmf2, weight=1) > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: > common_apply_stickiness: Resource Openvault: preferring current > location (node=dmf2, weight=1) > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: > common_apply_stickiness: Resource DMF: preferring current location > (node=dmf2, weight=1) > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: > common_apply_stickiness: Resource NFS: preferring current location > (node=dmf2, weight=1) > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: > common_apply_stickiness: Resource stonith-sgi-ipmi:1: preferring > current location (node=dmf2, weight=1) > > >> This implies the resource can still restart, but the operation is > cancelled anyway > Nov 15 23:10:08 dmf2 pengine: [83872]: info: get_failcount: > stonith-sgi-ipmi-set has failed 1 times on dmf2 > Nov 15 23:10:08 dmf2 pengine: [83872]: notice: > common_apply_stickiness: stonith-sgi-ipmi-set can fail 999999 more > times on dmf2 before being forced off > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: native_assign_node: > All nodes for resource stonith-sgi-ipmi:1 are unavailable, unclean > or shutting down (dmf1: 1, -1000000) > Nov 15 23:10:08 dmf2 lrmd: [83870]: info: cancel_op: operation > monitor[10] on stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for > client 83873, its parameters: CRM_meta_interval=[15000] > nodelist=[dmf1;admin;admin;supermicro;10.7.8.201 > dmf2;admin;admin;supermicro;10CRM_meta_on_fail=[restart] > CRM_meta_timeout=[15000] CRM_meta_clone_max=[2] > crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] > CRM_meta_name=[monitor] CRM_meta_clone=[1] cancelled > Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace: > // A_FINALIZE_TIMER_STOP > Nov 15 23:10:08 dmf2 pengine: [83872]: WARN: native_color: Resource > stonith-sgi-ipmi:1 cannot run anywhere > Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_cancel_op: operation 10 > cancelled > Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace: > // A_TE_INVOKE > Nov 15 23:10:08 dmf2 pengine: [83872]: debug: clone_color: Allocated > 1 stonith-sgi-ipmi-set instances of a possible 2 > Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_perform_op: add an > operation operation stop[26] on > stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for client 83873, its > parameters: CRM_meta_timeout=[20000] CRM_meta_clone_max=[2] > crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false] > CRM_meta_clone=[1] to the operation list. > Nov 15 23:10:09 dmf2 crmd: [83873]: info: te_rsc_command: Initiating > action 3: stop stonith-sgi-ipmi:1_stop_0 on dmf2 (local) > Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: sending out the signon msg. > Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Cancelling op > 10 for stonith-sgi-ipmi:1 (stonith-sgi-ipmi:1:10) > Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Op 10 for > stonith-sgi-ipmi:1 (stonith-sgi-ipmi:1:10): cancelled > Nov 15 23:10:09 dmf2 pengine: [83872]: notice: LogActions: Stop resource > stonith-sgi-ipmi:1 (dmf2) > Nov 15 23:10:09 dmf2 crmd: [83873]: info: do_lrm_rsc_op: Performing > key=3:35:0:815c525b-e8f1-4bb4-a5eb-2229d1980cc1 > op=stonith-sgi-ipmi:1_stop_0 ) > Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client > STONITH_RA_EXEC_102943 (pid=102943) succeeded to signon to stonithd. > Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: signed on to stonithd. > Nov 15 23:10:09 dmf2 lrmd: [102943]: info: Try to stop STONITH > resource <rsc_id=stonith-sgi-ipmi:1> : Device=external/sgi-ipmi > Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: waiting for the stonithRA reply > msg. > Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client > STONITH_RA_EXEC_102943 [pid: 102943] requests a resource operation > stop on stonith-sgi-ipmi:1 (external/sgi-ipmi) > Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: external_destroy: called. > Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: a stonith RA operation queue to > run, call_id=102944. > Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: stonithd_receive_ops_result: begin > Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: Child process > external_stonith-sgi-ipmi:1_stop [102944] exited, its exit code: 0 > when signo=0. > Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's > (external/sgi-ipmi) op stop finished. op_result=0 > Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 > (pid=102943) signed off > Nov 15 23:10:09 dmf2 lrmd: [83870]: info: Managed > stonith-sgi-ipmi:1:stop process 102943 exited with return code 0. > Nov 15 23:10:09 dmf2 crmd: [83873]: info: process_lrm_event: LRM > operation stonith-sgi-ipmi:1_stop_0 (call=26, rc=0, cib-update=122, > confirmed=true) ok > Nov 15 23:10:09 dmf2 crmd: [83873]: info: match_graph_event: Action > stonith-sgi-ipmi:1_stop_0 (3) confirmed on dmf2 (rc=0) > -- > > Ron Kerry rke...@sgi.com > > _______________________________________________ > Pacemaker mailing list: Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > Project Home: http://www.clusterlabs.org > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf > Bugs: > http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker