unsubscribe

2015-04-08 8:22 GMT-07:00 <pacemaker-requ...@oss.clusterlabs.org>:
> Today's Topics:
>
>    1. Re: update cib after fence (Michael Schwartzkopff)
>    2. Cluster with two STONITH devices (Jorge Lopes)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Wed, 08 Apr 2015 16:53:38 +0200
> From: Michael Schwartzkopff <m...@sys4.de>
> To: The Pacemaker cluster resource manager <pacemaker@oss.clusterlabs.org>
> Subject: Re: [Pacemaker] update cib after fence
> Message-ID: <3983598.UMelJT2BHb@nb003>
> Content-Type: text/plain; charset="iso-8859-1"
>
> On Wednesday, 8 April 2015 at 15:03:48, philipp.achmuel...@arz.at wrote:
> > hi,
> >
> > how do I clean up the CIB on a node after an unexpected system halt?
> > The failed node still thinks it is running the VirtualDomain resource,
> > which is already running on another node in the cluster (successful
> > takeover):
> >
> > executing "pcs cluster start" -
> > ....
> > Apr 8 13:41:10 lnx0083a daemon:info lnx0083a VirtualDomain(lnx0106a)[20360]: INFO: Virtual domain lnx0106a currently has no state, retrying.
> > Apr 8 13:41:12 lnx0083a daemon:err|error lnx0083a VirtualDomain(lnx0106a)[20360]: ERROR: Virtual domain lnx0106a has no state during stop operation, bailing out.
> > Apr 8 13:41:12 lnx0083a daemon:info lnx0083a VirtualDomain(lnx0106a)[20360]: INFO: Issuing forced shutdown (destroy) request for domain lnx0106a.
> > Apr 8 13:41:12 lnx0083a daemon:err|error lnx0083a VirtualDomain(lnx0106a)[20360]: ERROR: forced stop failed
> > Apr 8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]: notice: operation_finished: lnx0106a_stop_0:20360:stderr [ error: failed to connect to the hypervisor error: end of file while reading data: : input/output error ]
> > Apr 8 13:41:12 lnx0083a daemon:notice lnx0083a lrmd[14230]: notice: operation_finished: lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced stop failed ]
> > Apr 8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]: notice: process_lrm_event: Operation lnx0106a_stop_0: unknown error (node=lnx0083a, call=131, rc=1, cib-update=43, confirmed=true)
> > Apr 8 13:41:12 lnx0083a daemon:notice lnx0083a crmd[14233]: notice: process_lrm_event: lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect to the hypervisor error: end of file while reading data: : input/output error\nocf-exit-reason:forced stop failed\n ]
> > Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed (target: 0 vs.
rc: 1): Error > > Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: > > warning: update_failcount: Updating failcount for lnx0106a on lnx0083a > > after failed stop: rc=1 (update=INFINITY, time=1428493272) > > Apr 8 13:41:12 lnx0083b daemon:notice lnx0083b crmd[18244]: notice: > > abort_transition_graph: Transition aborted by lnx0106a_stop_0 'modify' on > > lnx0083a: Event failed > > (magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, > > cib=1.499.624, source=match_graph_event:350, 0) > > Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: > > warning: update_failcount: Updating failcount for lnx0106a on lnx0083a > > after failed stop: rc=1 (update=INFINITY, time=1428493272) > > Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: > > warning: status_from_rc: Action 105 (lnx0106a_stop_0) on lnx0083a failed > > (target: 0 vs. rc: 1): Error > > Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: > > warning: update_failcount: Updating failcount for lnx0106a on lnx0083a > > after failed stop: rc=1 (update=INFINITY, time=1428493272) > > Apr 8 13:41:12 lnx0083b daemon:warn|warning lnx0083b crmd[18244]: > > warning: update_failcount: Updating failcount for lnx0106a on lnx0083a > > after failed stop: rc=1 (update=INFINITY, time=1428493272) > > Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: > > warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on > > lnx0083a: unknown error (1) > > Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: > > warning: unpack_rsc_op_failure: Processing failed op stop for lnx0106a on > > lnx0083a: unknown error (1) > > Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: > > warning: pe_fence_node: Node lnx0083a will be fenced because of resource > > failure(s) > > Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: > > warning: common_apply_stickiness: Forcing lnx0106a away from lnx0083a > > after 1000000 failures (max=3) > > Apr 8 13:41:17 lnx0083b daemon:warn|warning lnx0083b pengine[18243]: > > warning: stage6: Scheduling Node lnx0083a for STONITH > > Apr 8 13:41:17 lnx0083b daemon:notice lnx0083b pengine[18243]: notice: > > native_stop_constraints: Stop of failed resource lnx0106a is implicit > > after lnx0083a is fenced > > .... > > > > Node is fenced.. > > > > log from corosync.log: > > ... > > Apr 08 13:41:00 [14226] lnx0083a pacemakerd: notice: mcp_read_config: > > Configured corosync to accept connections from group 2035: OK (1) > > Apr 08 13:41:00 [14226] lnx0083a pacemakerd: notice: main: Starting > > Pacemaker 1.1.12 (Build: 4ed91da): agent-manpages ascii-docs ncurses > > libqb-logging libqb-ip > > c lha-fencing upstart nagios corosync-native atomic-attrd libesmtp acls > > .... 
> > Apr 08 13:16:04 [23690] lnx0083a cib: info: cib_perform_op: + > > > /cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id= > > 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']: > > @operation_key=lnx0106a_stop_0, @operation=stop, > > @transition-key=106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, > > @transition-magic=0:0;106:10167:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, > > @call-id=538, @last-run=1428491757, @last-rc-change=1428491757, > > @exec-time=7686 > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: write_attribute: > > Sent update 40 with 3 changes for fail-count-vm-lnx0106a, id=<n/a>, > > set=(null) > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: write_attribute: > > Sent update 45 with 3 changes for fail-count-lnx0106a, id=<n/a>, > > set=(null) > > Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_resource id="lnx0106a" type="VirtualDomain" > > class="ocf" provider="heartbeat"> > > Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_rsc_op id="lnx0106a_last_0" > > operation_key="lnx0106a_monitor_0" operation="monitor" > > crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9" > > transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0" > > last-run="1427965815" last-rc-change="1427965815" exec-time="8 > > Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_resource id="lnx0106a" type="VirtualDomain" > > class="ocf" provider="heartbeat"> > > Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_rsc_op id="lnx0106a_last_failure_0" > > operation_key="lnx0106a_migrate_to_0" operation="migrate_to" > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" > > transition-key="112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > transition-magic="0:1;112:8364:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > on_node="lnx0129a" call-id="444" rc-code="1" op-status="0" interval="0" > > last-run="1427973596" last-rc-change="1427 > > Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_rsc_op id="lnx0106a_last_0" > > operation_key="lnx0106a_stop_0" operation="stop" > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" > > transition-key="113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > transition-magic="0:0;113:9846:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > on_node="lnx0129a" call-id="546" rc-code="0" op-status="0" interval="0" > > last-run="1428403880" last-rc-change="1428403880" exec-time="2 > > Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_rsc_op id="lnx0106a_monitor_30000" > > operation_key="lnx0106a_monitor_30000" operation="monitor" > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" > > transition-key="47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > transition-magic="0:0;47:8337:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > on_node="lnx0129a" call-id="436" rc-code="0" op-status="0" > > interval="30000" last-rc-change="1427965985" exec-time="1312 > > Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_resource id="lnx0106a" type="VirtualDomain" > > class="ocf" provider="heartbeat"> > > Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_rsc_op id="lnx0106a_last_0" > > operation_key="lnx0106a_start_0" operation="start" > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" > > 
transition-key="110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > transition-magic="0:0;110:10168:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > on_node="lnx0129b" call-id="539" rc-code="0" op-status="0" interval="0" > > last-run="1428491780" last-rc-change="1428491780" exec-tim > > Apr 08 13:41:04 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_rsc_op id="lnx0106a_monitor_30000" > > operation_key="lnx0106a_monitor_30000" operation="monitor" > > crm-debug-origin="do_update_resource" crm_feature_set="3.0.9" > > transition-key="89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > transition-magic="0:0;89:10170:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > on_node="lnx0129b" call-id="540" rc-code="0" op-status="0" > > interval="30000" last-rc-change="1428491810" exec-time="12 > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: > attrd_cib_callback: > > Update 40 for fail-count-vm-lnx0106a: OK (0) > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: > attrd_cib_callback: > > Update 40 for fail-count-vm-lnx0106a[lnx0129a]=(null): OK (0) > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: > attrd_cib_callback: > > Update 40 for fail-count-vm-lnx0106a[lnx0129b]=(null): OK (0) > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: > attrd_cib_callback: > > Update 40 for fail-count-vm-lnx0106a[lnx0083b]=(null): OK (0) > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: > attrd_cib_callback: > > Update 45 for fail-count-lnx0106a: OK (0) > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: > attrd_cib_callback: > > Update 45 for fail-count-lnx0106a[lnx0129a]=(null): OK (0) > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: > attrd_cib_callback: > > Update 45 for fail-count-lnx0106a[lnx0129b]=(null): OK (0) > > Apr 08 13:41:04 [14231] lnx0083a attrd: info: > attrd_cib_callback: > > Update 45 for fail-count-lnx0106a[lnx0083b]=(null): OK (0) > > Apr 08 13:41:05 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_resource id="lnx0106a" > > type="VirtualDomain" class="ocf" provider="heartbeat"> > > Apr 08 13:41:05 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > <lrm_rsc_op id="lnx0106a_last_0" > > operation_key="lnx0106a_monitor_0" operation="monitor" > > crm-debug-origin="build_active_RAs" crm_feature_set="3.0.9" > > transition-key="7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > transition-magic="0:7;7:8297:7:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > on_node="lnx0083b" call-id="660" rc-code="7" op-status="0" interval="0" > > last-run="1427965815" last-rc-change="142796 > > Apr 08 13:41:07 [14230] lnx0083a lrmd: info: > > process_lrmd_get_rsc_info: Resource 'lnx0106a' not found (27 active > > resources) > > Apr 08 13:41:07 [14230] lnx0083a lrmd: info: > > process_lrmd_rsc_register: Added 'lnx0106a' to the rsc list (28 > > active resources) > > Apr 08 13:41:07 [14233] lnx0083a crmd: info: do_lrm_rsc_op: > > Performing key=65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc > > op=lnx0106a_monitor_0 > > Apr 08 13:41:08 [14233] lnx0083a crmd: notice: process_lrm_event: > > Operation lnx0106a_monitor_0: not running (node=lnx0083a, call=114, rc=7, > > cib-update=34, confirmed=true) > > Apr 08 13:41:08 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources: > <lrm_resource > > id="lnx0106a" type="VirtualDomain" class="ocf" provider="heartbeat"/> > > Apr 08 13:41:08 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > > <lrm_rsc_op > > id="lnx0106a_last_failure_0" operation_key="lnx0106a_monitor_0" > > operation="monitor" 
crm-debug-origin="do_update_resource" > > crm_feature_set="3.0.9" > > transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0" > > last-ru > > Apr 08 13:41:08 [14228] lnx0083a cib: info: cib_perform_op: > ++ > > > <lrm_rsc_op > > id="lnx0106a_last_0" operation_key="lnx0106a_monitor_0" > > operation="monitor" crm-debug-origin="do_update_resource" > > crm_feature_set="3.0.9" > > transition-key="65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > transition-magic="0:7;65:10177:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc" > > on_node="lnx0083a" call-id="114" rc-code="7" op-status="0" interval="0" > > last-run="14284 > > Apr 08 13:41:09 [14233] lnx0083a crmd: info: do_lrm_rsc_op: > > Performing key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc > > op=lnx0106a_stop_0 > > Apr 08 13:41:09 [14230] lnx0083a lrmd: info: log_execute: > > executing - rsc:lnx0106a action:stop call_id:131 > > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:09 INFO: Virtual domain > > lnx0106a currently has no state, retrying. > > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:10 INFO: Virtual domain > > lnx0106a currently has no state, retrying. > > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: Virtual domain > > lnx0106a has no state during stop operation, bailing out. > > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 INFO: Issuing forced > > shutdown (destroy) request for domain lnx0106a. > > VirtualDomain(lnx0106a)[20360]: 2015/04/08_13:41:12 ERROR: forced stop > > failed > > Apr 08 13:41:12 [14230] lnx0083a lrmd: notice: > operation_finished: > > lnx0106a_stop_0:20360:stderr [ error: failed to connect to the > > hypervisor error: end of file while reading data: : input/output error ] > > Apr 08 13:41:12 [14230] lnx0083a lrmd: notice: > operation_finished: > > lnx0106a_stop_0:20360:stderr [ ocf-exit-reason:forced stop failed > > ] > > Apr 08 13:41:12 [14230] lnx0083a lrmd: info: log_finished: > > finished - rsc:lnx0106a action:stop call_id:131 pid:20360 exit-code:1 > > exec-time:2609ms queue-time:0ms > > Apr 08 13:41:12 [14233] lnx0083a crmd: notice: process_lrm_event: > > Operation lnx0106a_stop_0: unknown error (node=lnx0083a, call=131, rc=1, > > cib-update=43, confirmed=true) > > Apr 08 13:41:12 [14233] lnx0083a crmd: notice: process_lrm_event: > > lnx0083a-lnx0106a_stop_0:131 [ error: failed to connect to the hypervisor > > error: end of file while reading data: : input/output > > error\nocf-exit-reason:forced stop failed\n ] > > Apr 08 13:41:12 [14228] lnx0083a cib: info: cib_perform_op: + > > > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id= > > 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_failure_0']: > > @operation_key=lnx0106a_stop_0, @operation=stop, > > @transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, > > @transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, > > @call-id=131, @rc-code=1, @last-run=1428493269, > > @last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop > > Apr 08 13:41:12 [14228] lnx0083a cib: info: cib_perform_op: + > > > /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id= > > 'lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']: > > @operation_key=lnx0106a_stop_0, @operation=stop, > > @transition-key=105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, > > 
@transition-magic=0:1;105:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, @call-id=131, @rc-code=1, @last-run=1428493269, @last-rc-change=1428493269, @exec-time=2609, @exit-reason=forced stop failed
> > Apr 08 13:41:12 [14231] lnx0083a attrd: info: attrd_peer_update: Setting fail-count-lnx0106a[lnx0083a]: (null) -> INFINITY from lnx0083b
> > Apr 08 13:41:12 [14231] lnx0083a attrd: info: attrd_peer_update: Setting last-failure-lnx0106a[lnx0083a]: (null) -> 1428493272 from lnx0083b
> > Apr 08 13:41:12 [14228] lnx0083a cib: info: cib_perform_op: ++ /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']: <nvpair id="status-1-fail-count-lnx0106a" name="fail-count-lnx0106a" value="INFINITY"/>
> > Apr 08 13:41:12 [14228] lnx0083a cib: info: cib_perform_op: ++ /cib/status/node_state[@id='1']/transient_attributes[@id='1']/instance_attributes[@id='status-1']: <nvpair id="status-1-last-failure-lnx0106a" name="last-failure-lnx0106a" value="1428493272"/>
> > Apr 08 13:41:17 [14228] lnx0083a cib: info: cib_perform_op: + /cib/status/node_state[@id='4']/lrm[@id='4']/lrm_resources/lrm_resource[@id='lnx0106a']/lrm_rsc_op[@id='lnx0106a_last_0']: @operation_key=lnx0106a_stop_0, @operation=stop, @transition-key=106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, @transition-magic=0:0;106:10179:0:f57c21e4-fd47-4fef-9d73-c7d8b204c9bc, @call-id=542, @last-run=1428493269, @last-rc-change=1428493269, @exec-time=7645
> > ...
> >
> > any ideas?
> >
> > thank you!
> > Philipp
>
> It will get the latest version of the CIB after it connects to the cluster
> again.
>
> The CIB has a timestamp, so every node can decide whether it has the latest
> version of the CIB or should fetch it from another node.
>
> Kind regards,
>
> Michael Schwartzkopff
>
> --
> [*] sys4 AG
>
> http://sys4.de, +49 (89) 30 90 46 64, +49 (162) 165 0044
> Franziskanerstraße 15, 81669 München
>
> Registered office: München, Amtsgericht München: HRB 199263
> Management board (Vorstand): Patrick Ben Koetter, Marc Schiffbauer
> Chairman of the supervisory board: Florian Kirstein
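Once the fenced node has rejoined and pulled the current CIB, the stale
operation history and the INFINITY fail count it reported for the resource
can also be cleared explicitly. A minimal sketch, assuming the crm_resource
and pcs tools that ship with Pacemaker 1.1.12, reusing the resource and node
names from the log above:

    # forget the failed stop recorded for lnx0106a on the fenced node
    crm_resource --cleanup --resource lnx0106a --node lnx0083a

    # roughly equivalent with pcs (cleans up on all nodes)
    pcs resource cleanup lnx0106a

The cleanup deletes the resource's operation history from the CIB; the
cluster then re-probes the resource and records where the VirtualDomain is
actually running.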
> ------------------------------
>
> Message: 2
> Date: Wed, 8 Apr 2015 16:20:50 +0100
> From: Jorge Lopes <jmclo...@gmail.com>
> To: pacemaker@oss.clusterlabs.org
> Subject: [Pacemaker] Cluster with two STONITH devices
> Message-ID: <CAASpg58BJLxNSt3JjyNqeuAEPX1bbgyGHTD_CeP=zxna1ey...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi all.
>
> I'm having difficulties orchestrating two STONITH devices in my cluster. I
> have been struggling with this for the past few days and I need some help,
> please.
>
> A simplified version of my cluster and its goals is as follows:
> - The cluster has two physical servers, each hosting two nodes (VMware
>   virtual machines): overall, there are 4 nodes in this simplified version.
> - There are two resource groups: group-cluster-a and group-cluster-b.
> - To achieve a good CPU balance across the physical servers, the cluster is
>   asymmetric, with one group running on one server and the other group
>   running on the other server.
> - If the VM of one host becomes unusable, its resources are started on its
>   sister VM deployed on the other physical host.
> - If one physical host becomes unusable, all resources are started on the
>   other physical host.
> - Two STONITH levels are used to fence the problematic nodes.
>
> The resources have the following behavior:
> - If the resource monitor detects a problem, Pacemaker tries to restart the
>   resource on the same node.
> - If that fails, STONITH takes place (vCenter reboots the VM) and Pacemaker
>   starts the resource on the sister VM on the other physical host.
> - If restarting the VM fails, I want to power off the physical server, and
>   Pacemaker will start all resources on the other physical host.
>
> The HA stack is:
> Ubuntu 14.04 (the node OS, a virtualized guest running on VMware ESXi 5.5)
> Pacemaker 1.1.12
> Corosync 2.3.4
> CRM 2.1.2
>
> The 4 nodes are:
> cluster-a-1
> cluster-a-2
> cluster-b-1
> cluster-b-2
>
> The relevant configuration is:
>
> property symmetric-cluster=false
> property stonith-enabled=true
> property no-quorum-policy=stop
>
> group group-cluster-a vip-cluster-a docker-web
> location loc-group-cluster-a-1 group-cluster-a inf: cluster-a-1
> location loc-group-cluster-a-2 group-cluster-a 500: cluster-a-2
>
> group group-cluster-b vip-cluster-b docker-srv
> location loc-group-cluster-b-1 group-cluster-b 500: cluster-b-1
> location loc-group-cluster-b-2 group-cluster-b inf: cluster-b-2
>
> # stonith vcenter definition for host 1
> # runs on either of the host2 VMs
> primitive stonith-vcenter-host1 stonith:external/vcenter \
>     params \
>         VI_SERVER="192.168.40.20" \
>         VI_CREDSTORE="/etc/vicredentials.xml" \
>         HOSTLIST="cluster-a-1=cluster-a-1;cluster-a-2=cluster-a-2" \
>         RESETPOWERON="1" \
>         priority="2" \
>         pcmk_host_check="static-list" \
>         pcmk_host_list="cluster-a-1 cluster-a-2" \
>     op monitor interval="60s"
>
> location loc1-stonith-vcenter-host1 stonith-vcenter-host1 500: cluster-b-1
> location loc2-stonith-vcenter-host1 stonith-vcenter-host1 501: cluster-b-2
>
> # stonith vcenter definition for host 2
> # runs on either of the host1 VMs
> primitive stonith-vcenter-host2 stonith:external/vcenter \
>     params \
>         VI_SERVER="192.168.40.21" \
>         VI_CREDSTORE="/etc/vicredentials.xml" \
>         HOSTLIST="cluster-b-1=cluster-b-1;cluster-b-2=cluster-b-2" \
>         RESETPOWERON="1" \
>         priority="2" \
>         pcmk_host_check="static-list" \
>         pcmk_host_list="cluster-b-1 cluster-b-2" \
>     op monitor interval="60s"
>
> location loc1-stonith-vcenter-host2 stonith-vcenter-host2 500: cluster-a-1
> location loc2-stonith-vcenter-host2 stonith-vcenter-host2 501: cluster-a-2
>
> # stonith IPMI definition for host 1 (DELL with iDRAC 7 enterprise
> # interface at 192.168.40.15); runs on either of the host2 VMs
> primitive stonith-ipmi-host1 stonith:external/ipmi \
>     params hostname="host1" ipaddr="192.168.40.15" userid="root" \
>         passwd="mypassword" interface="lanplus" \
>         priority="1" \
>         pcmk_host_check="static-list" \
>         pcmk_host_list="cluster-a-1 cluster-a-2" \
>     op start interval="0" timeout="60s" requires="nothing" \
>     op monitor interval="3600s" timeout="20s" requires="nothing"
>
> location loc1-stonith-ipmi-host1 stonith-ipmi-host1 500: cluster-b-1
> location loc2-stonith-ipmi-host1 stonith-ipmi-host1 501: cluster-b-2
>
> # stonith IPMI definition for host 2 (DELL with iDRAC 7 enterprise
> # interface at 192.168.40.16); runs on either of the host1 VMs
> primitive stonith-ipmi-host2 stonith:external/ipmi \
>     params hostname="host2" ipaddr="192.168.40.16" userid="root" \
>         passwd="mypassword" interface="lanplus" \
>         priority="1" \
>         pcmk_host_check="static-list" \
>         pcmk_host_list="cluster-b-1 cluster-b-2" \
>     op start interval="0" timeout="60s" requires="nothing" \
>     op monitor interval="3600s" timeout="20s" requires="nothing"
>
> location loc1-stonith-ipmi-host2 stonith-ipmi-host2 500: cluster-a-1
> location loc2-stonith-ipmi-host2 stonith-ipmi-host2 501: cluster-a-2
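A note on ordering the two devices: the configuration above mentions two
STONITH levels but does not define a fencing topology, and the priority
parameter on its own does not express "reboot the VM first, power off the
host only if that fails". Fencing levels are the documented mechanism for
that fallback: level 1 is attempted first, and the next level is used only
when the previous one fails. A sketch in crm shell syntax, assuming crmsh's
fencing_topology directive accepts per-node device lists in this form and
reusing the device names defined above:

    fencing_topology \
        cluster-a-1: stonith-vcenter-host1 stonith-ipmi-host1 \
        cluster-a-2: stonith-vcenter-host1 stonith-ipmi-host1 \
        cluster-b-1: stonith-vcenter-host2 stonith-ipmi-host2 \
        cluster-b-2: stonith-vcenter-host2 stonith-ipmi-host2

With such a topology in place, the IPMI devices should only fire after a
vCenter reboot attempt has already failed for that node.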
ipaddr="192.168.40.16" userid="root" > passwd="mypassword" interface="lanplus" \ > priority="1" \ > pcmk_host_check="static-list" \ > pcmk_host_list="cluster-b-1 cluster-b-2" \ > op start interval="0" timeout="60s" requires="nothing" \ > op monitor interval="3600s" timeout="20s" requires="nothing" > > location loc1-stonith-ipmi-host2 stonith-ipmi-host2 500: cluster-a-1 > location loc2-stonith-ipmi-host2 stonith-ipmi-host2 501: cluster-a-2 > > > What is working: > - When an error is detected in one resource, the resource restart in the > same node, as expected. > - With the STONITH external/ipmi resource *stopped*, a fail in one node > makes the vcenter rebooting it and the resources starts in the sister node. > > > What is not so good: > - When vcenter reboots one node, then the resource starts in the other node > as expected but then they return to the original node as soon as it becomes > online. This makes a bit of ping-pong and I think it is a consequence of > how the locations are defined. Any suggestion to avoid this? After the > resource was moved to another node, I would prefer that it stays there, > instead of returning it to the original node. I can think of playing with > the resource affinity scores - is this way it should be done? > > What is wrong: > Lets consider this scenario. > I have a set of resources provided by a docker agent. My test consists in > stopping the docker service in the node cluster-a-1, which makes the docker > agent to return OCF_ERR_INSTALLED to Pacemaker (this is a change I made in > the docker agent, when compared to the github repository version). With the > IPMI STONITH resource stopped, this leads to the node cluster-a-1 restart, > which is expected. > > But with the IPMI STONITH resource started, I notice an erratic behavior: > - Some times, the resources at the node cluster-a-1 are stopped and no > STONITH happens. Also, the resources are not moved to the node cluster-a-2. > In this situation, if I manually restart the node cluster-a-1 (virtual > machine restart), then the IPMI STONITH takes place and restarts the > corresponding physical server. > - Sometimes, the IPMI STONITH starts before the vCenter STONITH, which is > not expected because the vCenter STONITH has higher priority. > > I might have something wrong in my stonith definition, but I can't figure > what. > Any idea how to correct this? > > And how can I set external/ipmi to power off the physical host, instead of > rebooting it? > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20150408/b5f9cc1c/attachment.html > > > > ------------------------------ > > _______________________________________________ > Pacemaker mailing list > Pacemaker@oss.clusterlabs.org > http://oss.clusterlabs.org/mailman/listinfo/pacemaker > > > End of Pacemaker Digest, Vol 89, Issue 2 > **************************************** >
> ------------------------------
>
> End of Pacemaker Digest, Vol 89, Issue 2
> ****************************************

_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org