Hi all, I am trying to set up a cluster of virtual machine hosts, and while doing so, I came out with a very strange behaviour (I think it may be a bug) and I hope you can lend me a hand in debugging this. For testing the behaviour observed in my production environment, I set up 2 new simple machines with no location/colocation/order constraints, and changed the Xen resource agent with the dummy resource agent, and the behaviour was the same.
The scenario is the following: - The strategy is "utilization". - 2 nodes: vmHost1 and vmHost2, with 2 cores each handle 5 resources: DummyVM001-005, with resource-stickiness="INFINITY". What happens is the following: - If I start resources DummyVM001-004, everything is fine. 2 resources run on each of the nodes - Now, I start DummyVM005, but utilization is full, so it does not start (cool :) - Then , if I stop any of the running resources, everything goes smoothly and DummyVM005 starts up. That's cool too. * Here comes the strange part: I have full utilization and resource stickines INFINITY, so starting new resources shouldn't change anything in the cluster status BUT If any of the freshly restarted resources is alphabetically sorted before, the cluster stops the "last" resource alphabetically sorted running and starts the stopped one. I don't think this is the expected behaviour, please correct me if I wrong. Thank you kindly, Agustin status: ==== Online: [ vmHost1 vmHost2 ] DummyVM1 (ocf::pacemaker:Dummy): Started vmHost1 DummyVM2 (ocf::pacemaker:Dummy): Started vmHost1 DummyVM3 (ocf::pacemaker:Dummy): Started vmHost2 DummyVM5 (ocf::pacemaker:Dummy): Started vmHost2 crm(live)# resource start DummyVM4 My configuration: ============ crm(live)# configure show node vmHost1 \ utilization cores="2" node vmHost2 \ utilization cores="2" primitive DummyVM1 ocf:pacemaker:Dummy \ op monitor interval="60s" timeout="60s" \ op start on-fail="restart" interval="0" \ op stop on-fail="ignore" interval="0" \ utilization cores="1" \ meta is-managed="true" migration-threshold="2" primitive DummyVM2 ocf:pacemaker:Dummy \ op monitor interval="60s" timeout="60s" \ op start on-fail="restart" interval="0" \ op stop on-fail="ignore" interval="0" \ utilization cores="1" \ meta is-managed="true" migration-threshold="2" primitive DummyVM3 ocf:pacemaker:Dummy \ op monitor interval="60s" timeout="60s" \ op start on-fail="restart" interval="0" \ op stop on-fail="ignore" interval="0" \ utilization cores="1" \ meta is-managed="true" migration-threshold="2" primitive DummyVM4 ocf:pacemaker:Dummy \ op monitor interval="60s" timeout="60s" \ op start on-fail="restart" interval="0" \ op stop on-fail="ignore" interval="0" \ utilization cores="1" \ meta is-managed="true" migration-threshold="2" target-role="Started" primitive DummyVM5 ocf:pacemaker:Dummy \ op monitor interval="60s" timeout="60s" \ op start on-fail="restart" interval="0" \ op stop on-fail="ignore" interval="0" \ utilization cores="1" \ meta is-managed="true" migration-threshold="2" target-role="Started" property $id="cib-bootstrap-options" \ dc-version="1.1.5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \ cluster-infrastructure="openais" \ expected-quorum-votes="2" \ stonith-enabled="false" \ stop-all-resources="false" \ placement-strategy="utilization" \ no-quorum-policy="ignore" \ cluster-infrastructure="openais" \ stop-orphan-resources="true" \ stop-orphan-actions="true" \ last-lrm-refresh="1326975274" rsc_defaults $id="rsc-options" \ resource-stickiness="INFINITY" and the important part in /var/log/syslog : Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <cib admin_epoch="0" epoch="75" num_updates="4" > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <configuration > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <resources > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <primitive id="DummyVM4" > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <meta_attributes id="DummyVM4-meta_attributes" > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- <nvpair value="Stopped" id="DummyVM4-meta_attributes-target-role" /> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </meta_attributes> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </primitive> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </resources> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </configuration> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff- </cib> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <cib epoch="76" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2" crm_feature_set="3.0.5" have-quorum="1" cib-last-written="Thu Jan 19 12:35:47 2012" dc-uuid="vmHost1" > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <configuration > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <resources > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <primitive class="ocf" id="DummyVM4" provider="pacemaker" type="Dummy" > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <meta_attributes id="DummyVM4-meta_attributes" > Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ <nvpair id="DummyVM4-meta_attributes-target-role" name="target-role" value="Started" /> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </meta_attributes> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </primitive> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </resources> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </configuration> Jan 19 13:36:19 vmHost1 cib: [725]: info: cib:diff+ </cib> Jan 19 13:36:19 vmHost1 crmd: [729]: info: abort_transition_graph: te_update_diff:131 - Triggered transition abort (complete=1, tag=diff, id=(null), magic=NA, cib=0.76.1) : Non-status change Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ] Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition: All 2 cluster nodes are eligible to run resources. Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_pe_invoke: Query 191: Requesting the current CIB: S_POLICY_ENGINE Jan 19 13:36:19 vmHost1 cib: [725]: info: cib_process_request: Operation complete: op cib_replace for section resources (origin=local/cibadmin/2, version=0.76.1): ok (rc=0) Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_pe_invoke_callback: Invoking the PE: query=191, ref=pe_calc-dc-1326976579-119, seq=64, quorate=1 Jan 19 13:36:19 vmHost1 pengine: [728]: notice: unpack_config: On loss of CCM Quorum: Ignore Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print: DummyVM1#011(ocf::pacemaker:Dummy):#011Started vmHost1 Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print: DummyVM2#011(ocf::pacemaker:Dummy):#011Started vmHost1 Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print: DummyVM3#011(ocf::pacemaker:Dummy):#011Started vmHost2 Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print: DummyVM4#011(ocf::pacemaker:Dummy):#011Stopped Jan 19 13:36:19 vmHost1 pengine: [728]: notice: native_print: DummyVM5#011(ocf::pacemaker:Dummy):#011Started vmHost2 Jan 19 13:36:19 vmHost1 pengine: [728]: notice: RecurringOp: Start recurring monitor (60s) for DummyVM4 on vmHost2 Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Leave DummyVM1#011(Started vmHost1) Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Leave DummyVM2#011(Started vmHost1) Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Leave DummyVM3#011(Started vmHost2) Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Start DummyVM4#011(vmHost2) Jan 19 13:36:19 vmHost1 pengine: [728]: notice: LogActions: Stop DummyVM5#011(vmHost2) Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ] Jan 19 13:36:19 vmHost1 crmd: [729]: info: unpack_graph: Unpacked transition 33: 6 actions in 6 synapses Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_te_invoke: Processing graph 33 (ref=pe_calc-dc-1326976579-119) derived from /var/lib/pengine/pe-input-717.bz2 Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_rsc_command: Initiating action 19: stop DummyVM5_stop_0 on vmHost2 Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_pseudo_action: Pseudo action 6 fired and confirmed Jan 19 13:36:19 vmHost1 crmd: [729]: info: match_graph_event: Action DummyVM5_stop_0 (19) confirmed on vmHost2 (rc=0) Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_pseudo_action: Pseudo action 7 fired and confirmed Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_pseudo_action: Pseudo action 5 fired and confirmed Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_rsc_command: Initiating action 17: start DummyVM4_start_0 on vmHost2 Jan 19 13:36:19 vmHost1 crmd: [729]: info: match_graph_event: Action DummyVM4_start_0 (17) confirmed on vmHost2 (rc=0) Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_rsc_command: Initiating action 18: monitor DummyVM4_monitor_60000 on vmHost2 Jan 19 13:36:19 vmHost1 crmd: [729]: info: match_graph_event: Action DummyVM4_monitor_60000 (18) confirmed on vmHost2 (rc=0) Jan 19 13:36:19 vmHost1 crmd: [729]: info: run_graph: ==================================================== Jan 19 13:36:19 vmHost1 crmd: [729]: notice: run_graph: Transition 33 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pengine/pe-input-717.bz2): Complete Jan 19 13:36:19 vmHost1 crmd: [729]: info: te_graph_trigger: Transition 33 is now complete Jan 19 13:36:19 vmHost1 crmd: [729]: info: notify_crmd: Transition 33 status: done - <null> Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ] Jan 19 13:36:19 vmHost1 crmd: [729]: info: do_state_transition: Starting PEngine Recheck Timer -- "Death: Human beings make life so interesting. Do you know, that in a universe so full of wonders, they have managed to invent boredom. " Terry Pratcher - Hogfather _______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org