22.03.2011 17:35, Phil Armstrong:
I have 4 primitive resources that I was in the process of migrating
from one node (lotus) to the other (aston). I was using the 'crm
resource migrate' command (4 times). Normally this process works just
fine, but in this instance, one of the resources (copan_shelf_07) got
'stuck', in that it didn't get started on the target node for about 15
minutes after the others (04, 05, 06), and then it seems it was kicked
off by crm_timer_popped. I'm hoping someone can tell me why
copan_shelf_07 took so long to be started. I have the complete
hb_report if that is required. Here are the basics:
I may be wrong about exactly why all this happened, but the core of
your problem is here:
Mar 22 07:45:12 aston crmd: [10962]: WARN: status_from_rc: Action 17
(copan_shelf_07_stop_0) on lotus failed (target: 0 vs. rc: 1): Error
A resource agent should never fail to stop a resource. In your case, the
stop action failed after 2 seconds, so it was not a timeout.
After a failed stop, the lotus node should have been fenced if you had
stonith enabled. If not, the resource should have become unmanaged.
But in your configuration (which you did not show), the resource was
stopped two more times, and the third stop eventually succeeded. Do you
have the on_fail='stop' parameter set for the action?
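After the cluster recheck timeout, pengine realized that the service was
stopped but should be started. In your log the third stop is confirmed
at 07:45:24, and the next pengine run is not until 08:00:25.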
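
To make the timing concrete, here is a rough check of the gap between
the successful stop and the next pengine run, using the two timestamps
from the syslog excerpt. The 15-minute value is an assumption: it is the
Pacemaker default for cluster-recheck-interval, and the posted
configuration does not show what this cluster actually uses.

```python
from datetime import datetime

# Timestamps taken from the syslog excerpt in this thread.
stop_confirmed = datetime(2011, 3, 22, 7, 45, 24)    # stop action 14, rc=0
next_pengine_run = datetime(2011, 3, 22, 8, 0, 25)   # pengine schedules the start

# Assumption: cluster-recheck-interval is left at its 15-minute default.
ASSUMED_RECHECK_SECONDS = 15 * 60

gap_seconds = int((next_pengine_run - stop_confirmed).total_seconds())
drift_seconds = gap_seconds - ASSUMED_RECHECK_SECONDS

print(f"gap between successful stop and next pengine run: {gap_seconds}s")
print(f"difference from the assumed recheck interval: {drift_seconds}s")
```

A gap this close to 15 minutes is consistent with the start being
triggered by the recheck timer rather than by the stop result itself.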
So, basically, you need to fix your Resource Agent.
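
As an illustration of what "fix" means here, below is a minimal sketch
of the stop logic an OCF agent is expected to follow: stopping an
already stopped resource is a success, and an error is returned only if
the resource is still running after the stop attempt. This is not the
copan_ov_client agent, which I have not seen. The is_running and do_stop
callables and the FakeShelf class are hypothetical stand-ins for
whatever the real agent does. Only the return codes (0 for success, 1
for a generic error) come from the OCF conventions.

```python
OCF_SUCCESS = 0      # OCF return code: action succeeded
OCF_ERR_GENERIC = 1  # OCF return code: generic failure (the rc=1 seen in the log)


def stop(is_running, do_stop):
    """Idempotent stop: succeed if the resource ends up stopped.

    is_running and do_stop are hypothetical callables supplied by the
    agent; they are not part of any Pacemaker API.
    """
    if not is_running():
        # Already stopped: this must count as success, not as an error.
        return OCF_SUCCESS
    do_stop()
    # Report failure only if the resource is still running afterwards.
    return OCF_SUCCESS if not is_running() else OCF_ERR_GENERIC


class FakeShelf:
    """Hypothetical in-memory resource used only to exercise stop()."""

    def __init__(self, running):
        self.running = running

    def is_running(self):
        return self.running

    def do_stop(self):
        self.running = False


shelf = FakeShelf(running=True)
first_rc = stop(shelf.is_running, shelf.do_stop)   # stops a running resource
second_rc = stop(shelf.is_running, shelf.do_stop)  # repeated stop is still a success
print(first_rc, second_rc)
```

The point of the second call is that the cluster may ask for a stop more
than once, as it did in your log, and every one of those calls has to
return 0 once the resource is down.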
--
Pavel Levshin
Here are the copan_shelf_07 syslog messages:
Mar 22 07:45:07 aston pengine: [10961]: notice: LogActions: Leave
resource copan_shelf_07 (Started lotus)
Mar 22 07:45:08 aston cib: [10958]: info: log_data_element: cib:diff:
+ <rsc_location id="cli-prefer-copan_shelf_07" rsc="copan_shelf_07"
__crm_diff_marker__="added:top" >
Mar 22 07:45:08 aston cib: [10958]: info: log_data_element: cib:diff:
+ <rule id="cli-prefer-rule-copan_shelf_07" score="INFINITY"
boolean-op="and" >
Mar 22 07:45:08 aston cib: [10958]: info: log_data_element: cib:diff:
+ <expression id="cli-prefer-expr-copan_shelf_07" attribute="#uname"
operation="eq" value="aston" type="string" />
Mar 22 07:45:10 aston pengine: [10961]: notice: native_print:
copan_shelf_07 (ocf::sgi:copan_ov_client): Started lotus
Mar 22 07:45:10 aston pengine: [10961]: notice: RecurringOp: Start
recurring monitor (30s) for copan_shelf_07 on aston
Mar 22 07:45:10 aston pengine: [10961]: notice: LogActions: Move
resource copan_shelf_07 (Started lotus -> aston)
Mar 22 07:45:10 aston pengine: [10961]: notice: native_print:
copan_shelf_07 (ocf::sgi:copan_ov_client): Started lotus
Mar 22 07:45:10 aston pengine: [10961]: notice: RecurringOp: Start
recurring monitor (30s) for copan_shelf_07 on aston
Mar 22 07:45:10 aston pengine: [10961]: notice: LogActions: Move
resource copan_shelf_07 (Started lotus -> aston)
Mar 22 07:45:10 aston crmd: [10962]: info: te_rsc_command: Initiating
action 17: stop copan_shelf_07_stop_0 on lotus
Mar 22 07:45:12 aston crmd: [10962]: WARN: status_from_rc: Action 17
(copan_shelf_07_stop_0) on lotus failed (target: 0 vs. rc: 1): Error
Mar 22 07:45:12 aston crmd: [10962]: WARN: update_failcount: Updating
failcount for copan_shelf_07 on lotus after failed stop: rc=1
(update=INFINITY, time=1300797912)
Mar 22 07:45:12 aston crmd: [10962]: info: abort_transition_graph:
match_graph_event:276 - Triggered transition abort (complete=0,
tag=lrm_rsc_op, id=copan_shelf_07_stop_0,
magic=0:1;17:509:0:cebe491c-0397-4ee8-bdd0-ae6ccd1e5c7d, cib=0.920.5)
: Event failed
Mar 22 07:45:12 aston crmd: [10962]: info: match_graph_event: Action
copan_shelf_07_stop_0 (17) confirmed on lotus (rc=4)
Mar 22 07:45:17 aston pengine: [10961]: WARN: unpack_rsc_op:
Processing failed op copan_shelf_07_stop_0 on lotus: unknown error (1)
Mar 22 07:45:17 aston pengine: [10961]: ERROR: unpack_rsc_op: Making
sure copan_shelf_07 doesn't come up again
Mar 22 07:45:17 aston pengine: [10961]: notice: native_print:
copan_shelf_07 (ocf::sgi:copan_ov_client): Started lotus FAILED
Mar 22 07:45:17 aston pengine: [10961]: info: get_failcount:
copan_shelf_07 has failed INFINITY times on lotus
Mar 22 07:45:17 aston pengine: [10961]: WARN: common_apply_stickiness:
Forcing copan_shelf_07 away from lotus after 1000000 failures (max=1)
Mar 22 07:45:17 aston pengine: [10961]: info: rsc_merge_weights:
cxfs-client-clone: Rolling back scores from copan_shelf_07
Mar 22 07:45:17 aston pengine: [10961]: info: native_color: Resource
copan_shelf_07 cannot run anywhere
Mar 22 07:45:17 aston pengine: [10961]: notice: LogActions: Stop
resource copan_shelf_07 (lotus)
Mar 22 07:45:17 aston pengine: [10961]: WARN: unpack_rsc_op:
Processing failed op copan_shelf_07_stop_0 on lotus: unknown error (1)
Mar 22 07:45:17 aston pengine: [10961]: ERROR: unpack_rsc_op: Making
sure copan_shelf_07 doesn't come up again
Mar 22 07:45:17 aston pengine: [10961]: notice: native_print:
copan_shelf_07 (ocf::sgi:copan_ov_client): Started lotus FAILED
Mar 22 07:45:17 aston pengine: [10961]: info: get_failcount:
copan_shelf_07 has failed INFINITY times on lotus
Mar 22 07:45:17 aston pengine: [10961]: WARN: common_apply_stickiness:
Forcing copan_shelf_07 away from lotus after 1000000 failures (max=1)
Mar 22 07:45:17 aston pengine: [10961]: info: rsc_merge_weights:
cxfs-client-clone: Rolling back scores from copan_shelf_07
Mar 22 07:45:17 aston pengine: [10961]: info: native_color: Resource
copan_shelf_07 cannot run anywhere
Mar 22 07:45:17 aston pengine: [10961]: notice: LogActions: Stop
resource copan_shelf_07 (lotus)
Mar 22 07:45:18 aston crmd: [10962]: info: te_rsc_command: Initiating
action 11: stop copan_shelf_07_stop_0 on lotus
Mar 22 07:45:19 aston crmd: [10962]: WARN: status_from_rc: Action 11
(copan_shelf_07_stop_0) on lotus failed (target: 0 vs. rc: 1): Error
Mar 22 07:45:19 aston crmd: [10962]: WARN: update_failcount: Updating
failcount for copan_shelf_07 on lotus after failed stop: rc=1
(update=INFINITY, time=1300797919)
Mar 22 07:45:19 aston crmd: [10962]: info: abort_transition_graph:
match_graph_event:276 - Triggered transition abort (complete=0,
tag=lrm_rsc_op, id=copan_shelf_07_stop_0,
magic=0:1;11:511:0:cebe491c-0397-4ee8-bdd0-ae6ccd1e5c7d, cib=0.920.16)
: Event failed
Mar 22 07:45:19 aston crmd: [10962]: info: match_graph_event: Action
copan_shelf_07_stop_0 (11) confirmed on lotus (rc=4)
Mar 22 07:45:22 aston pengine: [10961]: WARN: unpack_rsc_op:
Processing failed op copan_shelf_07_stop_0 on lotus: unknown error (1)
Mar 22 07:45:22 aston pengine: [10961]: ERROR: unpack_rsc_op: Making
sure copan_shelf_07 doesn't come up again
Mar 22 07:45:22 aston pengine: [10961]: notice: native_print:
copan_shelf_07 (ocf::sgi:copan_ov_client): Started lotus FAILED
Mar 22 07:45:22 aston pengine: [10961]: info: get_failcount:
copan_shelf_07 has failed INFINITY times on lotus
Mar 22 07:45:22 aston pengine: [10961]: WARN: common_apply_stickiness:
Forcing copan_shelf_07 away from lotus after 1000000 failures (max=1)
Mar 22 07:45:22 aston pengine: [10961]: info: rsc_merge_weights:
cxfs-client-clone: Rolling back scores from copan_shelf_07
Mar 22 07:45:22 aston pengine: [10961]: info: native_color: Resource
copan_shelf_07 cannot run anywhere
Mar 22 07:45:22 aston pengine: [10961]: notice: LogActions: Stop
resource copan_shelf_07 (lotus)
Mar 22 07:45:23 aston crmd: [10962]: info: te_rsc_command: Initiating
action 14: stop copan_shelf_07_stop_0 on lotus
Mar 22 07:45:24 aston crmd: [10962]: info: match_graph_event: Action
copan_shelf_07_stop_0 (14) confirmed on lotus (rc=0)
Mar 22 08:00:25 aston pengine: [10961]: notice: native_print:
copan_shelf_07 (ocf::sgi:copan_ov_client): Stopped
Mar 22 08:00:25 aston pengine: [10961]: info: get_failcount:
copan_shelf_07 has failed INFINITY times on lotus
Mar 22 08:00:25 aston pengine: [10961]: WARN: common_apply_stickiness:
Forcing copan_shelf_07 away from lotus after 1000000 failures (max=1)
Mar 22 08:00:25 aston pengine: [10961]: notice: RecurringOp: Start
recurring monitor (30s) for copan_shelf_07 on aston
Mar 22 08:00:25 aston pengine: [10961]: notice: LogActions: Start
copan_shelf_07 (aston)
Mar 22 08:00:25 aston crmd: [10962]: info: te_rsc_command: Initiating
action 16: start copan_shelf_07_start_0 on aston (local)
Mar 22 08:00:25 aston crmd: [10962]: info: do_lrm_rsc_op: Performing
key=16:513:0:cebe491c-0397-4ee8-bdd0-ae6ccd1e5c7d
op=copan_shelf_07_start_0 )
Mar 22 08:00:25 aston lrmd: [10959]: debug: on_msg_perform_op:2359:
copying parameters for rsc copan_shelf_07
Mar 22 08:00:25 aston lrmd: [10959]: debug: on_msg_perform_op: add an
operation operation start[142] on ocf::copan_ov_client::copan_shelf_07
for client 10962, its parameters: CRM_meta_name=[start]
crm_feature_set=[3.0.2] CRM_meta_on_fail=[restart]
CRM_meta_timeout=[60000] shelf_name=[c07] to the operation list.
Mar 22 08:00:25 aston lrmd: [10959]: info: rsc:copan_shelf_07:142: start
Thanks,
Phil
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker
Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker