Hi,
I've set up a 3-node cluster (2 active nodes + 1 standby for quorum) running
cman+pacemaker.
Resources: "cxml-clone", a GFS2 filesystem (cloned, running on both nodes), and
"testVM", managed via heartbeat:VirtualDomain (the domain XML lives on the GFS2
filesystem, cLVM disk backend). I set up an ordering constraint: "cxml-clone"
is started before "testVM" (symmetrical, so according to the documentation they
should be stopped in reverse order).
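For reference, a rough sketch of the pcs commands corresponding to this setup
(pcs 0.9.x-style syntax; monitor ops and timeouts omitted, so not necessarily
the exact commands I ran - the actual CIB is attached below):

# clustered GFS2 filesystem, cloned so it runs on both active nodes
pcs resource create cxml ocf:heartbeat:Filesystem \
    device=/dev/cstore/cxml directory=/mnt fstype=gfs2
pcs resource clone cxml
# VM, with its domain XML on the shared GFS2 mount, live migration allowed
pcs resource create testVM ocf:heartbeat:VirtualDomain \
    config=/mnt/testvm.xml migration_transport=ssh allow-migrate=true
# ordering: filesystem before VM (symmetrical, so stop order is reversed)
pcs constraint order cxml then testVM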
Manual migration of the VM works fine (pcs resource move testVM
node-2/node-1): the live migration succeeds and the VM runs uninterrupted. But
when I try to reboot, or put into standby, the node running the VM, everything
falls apart: the migration fails and the node gets fenced.
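To make the two scenarios explicit (the standby command is from memory, the
exact invocation may differ):

# works: clean live migration, VM stays up
pcs resource move testVM node-1
# fails: migration aborts and the node gets fenced
pcs cluster standby node-2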
From the logs I can see that the "cxml" resource is stopped first (or at least
simultaneously, without waiting for the VM migration to complete), and the
migration then fails because the domain XML is no longer available:
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: Running stop for /dev/cstore/cxml on /mnt ]
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: Trying to unmount /mnt ]
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: unmounted /mnt successfully ]
Apr 10 14:03:20 node-2 crmd[2682]: notice: process_lrm_event: LRM operation cxml_stop_0 (call=77, rc=0, cib-update=37, confirmed=true) ok
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ 2013/04/10_14:03:20 INFO: testvm: Starting live migration to node-1 (using remote hypervisor URI qemu+ssh://node-1/system ). ]
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ error: Requested operation is not valid: domain is already active as 'testvm' ]
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ 2013/04/10_14:03:21 ERROR: testvm: live migration to qemu+ssh://node-1/system failed: 1 ]
Apr 10 14:03:21 node-2 crmd[2682]: notice: process_lrm_event: LRM operation testVM_migrate_to_0 (call=75, rc=1, cib-update=38, confirmed=true) unknown error
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_stop_0:3392 [ 2013/04/10_14:03:21 ERROR: Configuration file /mnt/testvm.xml does not exist or is not readable. ]
But wtf?! I've set up the constraint, and "testVM" should be stopped/migrated
first, not "cxml".
What is wrong with my configuration? Am I missing something?
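For what it's worth, the configured constraints and the transition pacemaker
computes can be inspected with something like this (flag spellings may differ
by version):

pcs constraint          # list configured constraints
crm_simulate -sL        # show scores / scheduled actions against the live CIB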
Logs and the CIB are attached.
--
Yuriy Demchenko
<cib epoch="33" num_updates="4" admin_epoch="0" validate-with="pacemaker-1.2" update-origin="node-2" update-client="crm_attribute" cib-last-written="Wed Apr 10 10:14:36 2013" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="vnode-3">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.el6-394e906"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
        <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="node-2" uname="node-2">
        <instance_attributes id="nodes-node-2"/>
      </node>
      <node id="vnode-3" uname="vnode-3">
        <instance_attributes id="nodes-vnode-3">
          <nvpair id="nodes-vnode-3-standby" name="standby" value="on"/>
        </instance_attributes>
      </node>
      <node id="node-1" uname="node-1"/>
    </nodes>
    <resources>
      <primitive class="stonith" id="node-1" type="fence_ipmilan">
        <instance_attributes id="node-1-instance_attributes">
          <nvpair id="node-1-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="node-1"/>
          <nvpair id="node-1-instance_attributes-ipaddr" name="ipaddr" value="192.168.220.11"/>
          <nvpair id="node-1-instance_attributes-login" name="login" value="ADMIN"/>
          <nvpair id="node-1-instance_attributes-passwd" name="passwd" value="ADMIN"/>
          <nvpair id="node-1-instance_attributes-timeout" name="timeout" value="20"/>
          <nvpair id="node-1-instance_attributes-auth" name="auth" value="password"/>
          <nvpair id="node-1-instance_attributes-method" name="method" value="cycle"/>
        </instance_attributes>
      </primitive>
      <primitive class="stonith" id="node-2" type="fence_ipmilan">
        <instance_attributes id="node-2-instance_attributes">
          <nvpair id="node-2-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="node-2"/>
          <nvpair id="node-2-instance_attributes-ipaddr" name="ipaddr" value="192.168.220.12"/>
          <nvpair id="node-2-instance_attributes-login" name="login" value="ADMIN"/>
          <nvpair id="node-2-instance_attributes-passwd" name="passwd" value="ADMIN"/>
          <nvpair id="node-2-instance_attributes-timeout" name="timeout" value="20"/>
          <nvpair id="node-2-instance_attributes-auth" name="auth" value="password"/>
          <nvpair id="node-2-instance_attributes-method" name="method" value="cycle"/>
        </instance_attributes>
      </primitive>
      <clone id="cxml-clone">
        <primitive class="ocf" id="cxml" provider="heartbeat" type="Filesystem">
          <instance_attributes id="cxml-instance_attributes">
            <nvpair id="cxml-instance_attributes-device" name="device" value="/dev/cstore/cxml"/>
            <nvpair id="cxml-instance_attributes-directory" name="directory" value="/mnt"/>
            <nvpair id="cxml-instance_attributes-fstype" name="fstype" value="gfs2"/>
          </instance_attributes>
          <meta_attributes id="cxml-meta_attributes"/>
        </primitive>
        <meta_attributes id="cxml-clone-meta"/>
      </clone>
      <primitive class="ocf" id="testVM" provider="heartbeat" type="VirtualDomain">
        <instance_attributes id="testVM-instance_attributes">
          <nvpair id="testVM-instance_attributes-config" name="config" value="/mnt/testvm.xml"/>
          <nvpair id="testVM-instance_attributes-migration_transport" name="migration_transport" value="ssh"/>
          <nvpair id="testVM-instance_attributes-allow-migrate" name="allow-migrate" value="true"/>
        </instance_attributes>
        <meta_attributes id="testVM-meta_attributes"/>
      </primitive>
    </resources>
    <constraints>
      <rsc_order first="cxml" id="order-cxml-testVM-mandatory" then="testVM"/>
    </constraints>
  </configuration>
</cib>
Apr 10 14:03:22 node-1 stonith-ng[2700]: notice: stonith_device_register: Added 'node-1' to the device list (2 active devices)
Apr 10 14:03:23 node-1 crmd[2704]: notice: process_lrm_event: LRM operation node-1_start_0 (call=30, rc=0, cib-update=15, confirmed=true) ok
Apr 10 14:03:23 node-1 lrmd[2701]: notice: operation_finished: testVM_migrate_from_0:3058 [ 2013/04/10_14:03:23 INFO: Domain name "testvm" saved to /var/run/heartbeat/rsctmp/VirtualDomain-testVM.state. ]
Apr 10 14:03:23 node-1 lrmd[2701]: notice: operation_finished: testVM_migrate_from_0:3058 [ 2013/04/10_14:03:23 INFO: testvm: live migration from node-2 succeeded. ]
Apr 10 14:03:23 node-1 crmd[2704]: notice: process_lrm_event: LRM operation testVM_migrate_from_0 (call=33, rc=0, cib-update=16, confirmed=true) ok
Apr 10 14:03:30 node-1 kernel: vlan1: port 2(vnet0) entering disabled state
Apr 10 14:03:30 node-1 kernel: device vnet0 left promiscuous mode
Apr 10 14:03:30 node-1 kernel: vlan1: port 2(vnet0) entering disabled state
Apr 10 14:03:31 node-1 lrmd[2701]: notice: operation_finished: testVM_stop_0:3093 [ 2013/04/10_14:03:23 INFO: Issuing graceful shutdown request for domain testvm. ]
Apr 10 14:03:31 node-1 lrmd[2701]: notice: operation_finished: testVM_stop_0:3093 [ 2013/04/10_14:03:31 ERROR: Virtual domain testvm has unknown status "in shutdown"! ]
Apr 10 14:03:31 node-1 lrmd[2701]: notice: operation_finished: testVM_stop_0:3093 [ 2013/04/10_14:03:31 INFO: Issuing forced shutdown (destroy) request for domain testvm. ]
Apr 10 14:03:31 node-1 lrmd[2701]: notice: operation_finished: testVM_stop_0:3093 [ Domain testvm destroyed ]
Apr 10 14:03:31 node-1 crmd[2704]: notice: process_lrm_event: LRM operation testVM_stop_0 (call=36, rc=0, cib-update=17, confirmed=true) ok
Apr 10 14:03:31 node-1 stonith-ng[2700]: notice: remote_op_done: Operation reboot of node-2 by vnode-3 for crmd.1868@vnode-3.2f74f7fe: OK
Apr 10 14:03:31 node-1 crmd[2704]: notice: tengine_stonith_notify: Peer node-2 was terminated (st_notify_fence) by vnode-3 for vnode-3: OK (ref=2f74f7fe-d335-4c5c-9764-2196d33aceb6) by client crmd.1868
Apr 10 14:03:31 node-1 crmd[2704]: notice: tengine_stonith_notify: Notified CMAN that 'node-2' is now fenced
Apr 10 14:03:31 node-1 cib[2699]: warning: cib_process_diff: Diff 0.32.19 -> 0.32.20 from vnode-3 not applied to 0.32.20: current "num_updates" is greater than required
Apr 10 14:03:31 node-1 kernel: device vnet0 entered promiscuous mode
Apr 10 14:03:31 node-1 kernel: vlan1: port 2(vnet0) entering forwarding state
Apr 10 14:03:32 node-1 lrmd[2701]: notice: operation_finished: testVM_start_0:3181 [ 2013/04/10_14:03:31 INFO: Domain name "testvm" saved to /var/run/heartbeat/rsctmp/VirtualDomain-testVM.state. ]
Apr 10 14:03:33 node-1 crmd[2704]: notice: process_lrm_event: LRM operation testVM_start_0 (call=39, rc=0, cib-update=18, confirmed=true) ok
Apr 10 14:03:33 node-1 crmd[2704]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_JOIN_OFFER cause=C_HA_MESSAGE origin=route_message ]
Apr 10 14:03:33 node-1 cib[2699]: notice: cib_process_diff: Diff 0.32.20 -> 0.32.21 from vnode-3 not applied to 0.32.20: Failed application of an update diff
Apr 10 14:03:35 node-1 crmd[2704]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Apr 10 14:03:35 node-1 attrd[2702]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 10 14:03:35 node-1 attrd[2702]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 10 14:03:41 node-1 kernel: kvm: 3218: cpu0 disabled perfctr wrmsr: 0xc1 data 0xabcd
Apr 10 14:03:48 node-1 corosync[1920]: [TOTEM ] A processor failed, forming new configuration.
Apr 10 14:04:01 node-1 corosync[1920]: [QUORUM] Members[2]: 1 3
Apr 10 14:04:01 node-1 corosync[1920]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 10 14:04:01 node-1 crmd[2704]: notice: crm_update_peer_state: cman_event_callback: Node node-2[2] - state is now lost
Apr 10 14:04:01 node-1 kernel: dlm: closing connection to node 2
Apr 10 14:04:01 node-1 corosync[1920]: [CPG ] chosen downlist: sender r(0) ip(192.168.220.21) ; members(old:3 left:1)
Apr 10 14:04:01 node-1 corosync[1920]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 10 14:04:02 node-1 crmd[2704]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_JOIN_OFFER cause=C_HA_MESSAGE origin=route_message ]
Apr 10 14:04:05 node-1 attrd[2702]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 10 14:04:05 node-1 crmd[2704]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Apr 10 14:04:05 node-1 attrd[2702]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 10 14:05:09 node-1 corosync[1920]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 10 14:05:09 node-1 corosync[1920]: [QUORUM] Members[3]: 1 2 3
Apr 10 14:05:09 node-1 corosync[1920]: [QUORUM] Members[3]: 1 2 3
Apr 10 14:05:09 node-1 crmd[2704]: notice: crm_update_peer_state: cman_event_callback: Node node-2[2] - state is now member
Apr 10 14:05:09 node-1 corosync[1920]: [CPG ] chosen downlist: sender r(0) ip(192.168.220.21) ; members(old:2 left:0)
Apr 10 14:05:09 node-1 corosync[1920]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 10 14:05:17 node-1 kernel: dlm: got connection from 2
Apr 10 14:05:22 node-1 cib[2699]: warning: cib_process_diff: Diff 0.32.0 -> 0.32.1 from node-2 not applied to 0.32.60: current "num_updates" is greater than required
Apr 10 14:05:22 node-1 crmd[2704]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_JOIN_OFFER cause=C_HA_MESSAGE origin=route_message ]
Apr 10 14:05:27 node-1 crmd[2704]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Apr 10 14:05:27 node-1 attrd[2702]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 10 14:05:27 node-1 attrd[2702]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 10 14:03:20 node-2 crmd[2682]: notice: process_lrm_event: LRM operation node-1_stop_0 (call=73, rc=0, cib-update=36, confirmed=true) ok
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: Running stop for /dev/cstore/cxml on /mnt ]
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: Trying to unmount /mnt ]
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: unmounted /mnt successfully ]
Apr 10 14:03:20 node-2 crmd[2682]: notice: process_lrm_event: LRM operation cxml_stop_0 (call=77, rc=0, cib-update=37, confirmed=true) ok
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ 2013/04/10_14:03:20 INFO: testvm: Starting live migration to node-1 (using remote hypervisor URI qemu+ssh://node-1/system ). ]
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ error: Requested operation is not valid: domain is already active as 'testvm' ]
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ 2013/04/10_14:03:21 ERROR: testvm: live migration to qemu+ssh://node-1/system failed: 1 ]
Apr 10 14:03:21 node-2 crmd[2682]: notice: process_lrm_event: LRM operation testVM_migrate_to_0 (call=75, rc=1, cib-update=38, confirmed=true) unknown error
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_stop_0:3392 [ 2013/04/10_14:03:21 ERROR: Configuration file /mnt/testvm.xml does not exist or is not readable. ]
Apr 10 14:03:21 node-2 crmd[2682]: notice: process_lrm_event: LRM operation testVM_stop_0 (call=82, rc=5, cib-update=39, confirmed=true) not installed
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_ais_dispatch: Update relayed from vnode-3
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-testVM (INFINITY)
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_perform_update: Sent update 55: fail-count-testVM=INFINITY
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_ais_dispatch: Update relayed from vnode-3
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-testVM (1365602610)
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_perform_update: Sent update 58: last-failure-testVM=1365602610
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_ais_dispatch: Update relayed from vnode-3
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-testVM (INFINITY)
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_perform_update: Sent update 61: fail-count-testVM=INFINITY
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_ais_dispatch: Update relayed from vnode-3
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-testVM (1365602610)
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_perform_update: Sent update 63: last-failure-testVM=1365602610
Apr 10 14:03:29 node-2 stonith-ng[2678]: notice: remote_op_done: Operation reboot of node-2 by vnode-3 for crmd.1868@vnode-3.2f74f7fe: OK
Apr 10 14:03:29 node-2 crmd[2682]: error: tengine_stonith_notify: We were alegedly just fenced by vnode-3 for vnode-3!
Apr 10 14:03:29 node-2 crmd[2682]: error: do_log: FSA: Input I_ERROR from tengine_stonith_notify() received in state S_NOT_DC
Apr 10 14:03:29 node-2 crmd[2682]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=tengine_stonith_notify ]
Apr 10 14:03:29 node-2 crmd[2682]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
Apr 10 14:03:29 node-2 crmd[2682]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Apr 10 14:03:29 node-2 crmd[2682]: error: verify_stopped: Resource testVM was active at shutdown. You may ignore this error if it is unmanaged.
Apr 10 14:03:29 node-2 crmd[2682]: notice: terminate_cs_connection: Disconnecting from Corosync
Apr 10 14:03:29 node-2 cib[2677]: warning: qb_ipcs_event_sendv: new_event_notification (2677-2682-15): Broken pipe (32)
Apr 10 14:03:29 node-2 crmd[2682]: error: do_exit: Could not recover from internal error
Apr 10 14:03:29 node-2 cib[2677]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Apr 10 14:03:29 node-2 pacemakerd[2669]: error: pcmk_child_exit: Child process crmd exited (pid=2682, rc=2)
Apr 10 14:03:29 node-2 pacemakerd[2669]: notice: pcmk_process_exit: Respawning failed child process: crmd
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Apr 10 14:03:29 node-2 crmd[3416]: notice: main: CRM Git Version: 394e906
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Apr 10 14:03:29 node-2 crmd[3416]: notice: cman_event_callback: Membership 68: quorum acquired
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_update_peer_state: cman_event_callback: Node node-1[1] - state is now member
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_update_peer_state: cman_event_callback: Node node-2[2] - state is now member
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_update_peer_state: cman_event_callback: Node vnode-3[3] - state is now member
Apr 10 14:03:29 node-2 crmd[3416]: notice: do_started: The local CRM is operational
Apr 10 14:03:29 node-2 cib[2677]: warning: cib_process_diff: Diff 0.32.19 -> 0.32.20 from vnode-3 not applied to 0.32.20: current "num_updates" is greater than required
Apr 10 14:03:31 node-2 cib[2677]: notice: cib_process_diff: Diff 0.32.20 -> 0.32.21 from vnode-3 not applied to 0.32.20: Failed application of an update diff
Broadcast message from root@node-2
(unknown) at 14:03 ...
The system is going down for halt NOW!
Apr 10 14:03:33 node-2 init: tty (/dev/tty1) main process (2708) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty2) main process (2710) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty3) main process (2712) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty4) main process (2714) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty5) main process (2720) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty6) main process (2722) killed by TERM signal