Hi,
I've set up a 3-node cluster (2 active nodes + 1 standby for quorum) running
cman+pacemaker.
Resources: "cxml-clone", a GFS2 filesystem (cloned, running on both nodes), and
"testVM", managed via heartbeat:VirtualDomain (the domain XML lives on the GFS2
filesystem, cLVM disk backend). I set up an ordering constraint: "cxml-clone"
is started before "testVM" (symmetrical, so according to the documentation they
should be stopped in reverse order).
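For reference, a rough sketch of the pcs commands corresponding to this setup
(pcs 0.9.x-style syntax; monitor ops and timeouts omitted, so not necessarily
the exact commands I ran - the actual CIB is attached below):

# clustered GFS2 filesystem, cloned so it runs on both active nodes
pcs resource create cxml ocf:heartbeat:Filesystem \
    device=/dev/cstore/cxml directory=/mnt fstype=gfs2
pcs resource clone cxml
# VM, with its domain XML on the shared GFS2 mount, live migration allowed
pcs resource create testVM ocf:heartbeat:VirtualDomain \
    config=/mnt/testvm.xml migration_transport=ssh allow-migrate=true
# ordering: filesystem before VM (symmetrical, so stop order is reversed)
pcs constraint order cxml then testVM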
Manual migration of the VM works fine (pcs resource move testVM
node-2/node-1): the live migration succeeds and the VM runs uninterrupted. But
when I try to reboot, or put into standby, the node running the VM, everything
falls apart: the migration fails and the node gets fenced.
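To make the two scenarios explicit (the standby command is from memory, the
exact invocation may differ):

# works: clean live migration, VM stays up
pcs resource move testVM node-1
# fails: migration aborts and the node gets fenced
pcs cluster standby node-2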
From the logs I can see that the "cxml" resource is stopped first (or at least
simultaneously, without waiting for the VM migration to complete), and the
migration then fails because the domain XML is no longer available:
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: Running stop for /dev/cstore/cxml on /mnt ]
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: Trying to unmount /mnt ]
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: unmounted /mnt successfully ]
Apr 10 14:03:20 node-2 crmd[2682]: notice: process_lrm_event: LRM operation cxml_stop_0 (call=77, rc=0, cib-update=37, confirmed=true) ok
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ 2013/04/10_14:03:20 INFO: testvm: Starting live migration to node-1 (using remote hypervisor URI qemu+ssh://node-1/system ). ]
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ error: Requested operation is not valid: domain is already active as 'testvm' ]
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ 2013/04/10_14:03:21 ERROR: testvm: live migration to qemu+ssh://node-1/system failed: 1 ]
Apr 10 14:03:21 node-2 crmd[2682]: notice: process_lrm_event: LRM operation testVM_migrate_to_0 (call=75, rc=1, cib-update=38, confirmed=true) unknown error
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_stop_0:3392 [ 2013/04/10_14:03:21 ERROR: Configuration file /mnt/testvm.xml does not exist or is not readable. ]
But wtf?! I've set up the constraint, and "testVM" should be stopped/migrated
first, not "cxml".
What is wrong with my configuration? Am I missing something?
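For what it's worth, the configured constraints and the transition pacemaker
computes can be inspected with something like this (flag spellings may differ
by version):

pcs constraint          # list configured constraints
crm_simulate -sL        # show scores / scheduled actions against the live CIB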
Logs and the CIB are attached.
--
Yuriy Demchenko
<cib epoch="33" num_updates="4" admin_epoch="0" validate-with="pacemaker-1.2" update-origin="node-2" update-client="crm_attribute" cib-last-written="Wed Apr 10 10:14:36 2013" crm_feature_set="3.0.7" have-quorum="1" dc-uuid="vnode-3">
  <configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
        <nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.8-7.el6-394e906"/>
        <nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="cman"/>
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="true"/>
        <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2"/>
      </cluster_property_set>
    </crm_config>
    <nodes>
      <node id="node-2" uname="node-2">
        <instance_attributes id="nodes-node-2"/>
      </node>
      <node id="vnode-3" uname="vnode-3">
        <instance_attributes id="nodes-vnode-3">
          <nvpair id="nodes-vnode-3-standby" name="standby" value="on"/>
        </instance_attributes>
      </node>
      <node id="node-1" uname="node-1"/>
    </nodes>
    <resources>
      <primitive class="stonith" id="node-1" type="fence_ipmilan">
        <instance_attributes id="node-1-instance_attributes">
          <nvpair id="node-1-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="node-1"/>
          <nvpair id="node-1-instance_attributes-ipaddr" name="ipaddr" value="192.168.220.11"/>
          <nvpair id="node-1-instance_attributes-login" name="login" value="ADMIN"/>
          <nvpair id="node-1-instance_attributes-passwd" name="passwd" value="ADMIN"/>
          <nvpair id="node-1-instance_attributes-timeout" name="timeout" value="20"/>
          <nvpair id="node-1-instance_attributes-auth" name="auth" value="password"/>
          <nvpair id="node-1-instance_attributes-method" name="method" value="cycle"/>
        </instance_attributes>
      </primitive>
      <primitive class="stonith" id="node-2" type="fence_ipmilan">
        <instance_attributes id="node-2-instance_attributes">
          <nvpair id="node-2-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="node-2"/>
          <nvpair id="node-2-instance_attributes-ipaddr" name="ipaddr" value="192.168.220.12"/>
          <nvpair id="node-2-instance_attributes-login" name="login" value="ADMIN"/>
          <nvpair id="node-2-instance_attributes-passwd" name="passwd" value="ADMIN"/>
          <nvpair id="node-2-instance_attributes-timeout" name="timeout" value="20"/>
          <nvpair id="node-2-instance_attributes-auth" name="auth" value="password"/>
          <nvpair id="node-2-instance_attributes-method" name="method" value="cycle"/>
        </instance_attributes>
      </primitive>
      <clone id="cxml-clone">
        <primitive class="ocf" id="cxml" provider="heartbeat" type="Filesystem">
          <instance_attributes id="cxml-instance_attributes">
            <nvpair id="cxml-instance_attributes-device" name="device" value="/dev/cstore/cxml"/>
            <nvpair id="cxml-instance_attributes-directory" name="directory" value="/mnt"/>
            <nvpair id="cxml-instance_attributes-fstype" name="fstype" value="gfs2"/>
          </instance_attributes>
          <meta_attributes id="cxml-meta_attributes"/>
        </primitive>
        <meta_attributes id="cxml-clone-meta"/>
      </clone>
      <primitive class="ocf" id="testVM" provider="heartbeat" type="VirtualDomain">
        <instance_attributes id="testVM-instance_attributes">
          <nvpair id="testVM-instance_attributes-config" name="config" value="/mnt/testvm.xml"/>
          <nvpair id="testVM-instance_attributes-migration_transport" name="migration_transport" value="ssh"/>
          <nvpair id="testVM-instance_attributes-allow-migrate" name="allow-migrate" value="true"/>
        </instance_attributes>
        <meta_attributes id="testVM-meta_attributes"/>
      </primitive>
    </resources>
    <constraints>
      <rsc_order first="cxml" id="order-cxml-testVM-mandatory" then="testVM"/>
    </constraints>
  </configuration>
</cib>
Apr 10 14:03:22 node-1 stonith-ng[2700]: notice: stonith_device_register: Added 'node-1' to the device list (2 active devices)
Apr 10 14:03:23 node-1 crmd[2704]: notice: process_lrm_event: LRM operation node-1_start_0 (call=30, rc=0, cib-update=15, confirmed=true) ok
Apr 10 14:03:23 node-1 lrmd[2701]: notice: operation_finished: testVM_migrate_from_0:3058 [ 2013/04/10_14:03:23 INFO: Domain name "testvm" saved to /var/run/heartbeat/rsctmp/VirtualDomain-testVM.state. ]
Apr 10 14:03:23 node-1 lrmd[2701]: notice: operation_finished: testVM_migrate_from_0:3058 [ 2013/04/10_14:03:23 INFO: testvm: live migration from node-2 succeeded. ]
Apr 10 14:03:23 node-1 crmd[2704]: notice: process_lrm_event: LRM operation testVM_migrate_from_0 (call=33, rc=0, cib-update=16, confirmed=true) ok
Apr 10 14:03:30 node-1 kernel: vlan1: port 2(vnet0) entering disabled state
Apr 10 14:03:30 node-1 kernel: device vnet0 left promiscuous mode
Apr 10 14:03:30 node-1 kernel: vlan1: port 2(vnet0) entering disabled state
Apr 10 14:03:31 node-1 lrmd[2701]: notice: operation_finished: testVM_stop_0:3093 [ 2013/04/10_14:03:23 INFO: Issuing graceful shutdown request for domain testvm. ]
Apr 10 14:03:31 node-1 lrmd[2701]: notice: operation_finished: testVM_stop_0:3093 [ 2013/04/10_14:03:31 ERROR: Virtual domain testvm has unknown status "in shutdown"! ]
Apr 10 14:03:31 node-1 lrmd[2701]: notice: operation_finished: testVM_stop_0:3093 [ 2013/04/10_14:03:31 INFO: Issuing forced shutdown (destroy) request for domain testvm. ]
Apr 10 14:03:31 node-1 lrmd[2701]: notice: operation_finished: testVM_stop_0:3093 [ Domain testvm destroyed ]
Apr 10 14:03:31 node-1 crmd[2704]: notice: process_lrm_event: LRM operation testVM_stop_0 (call=36, rc=0, cib-update=17, confirmed=true) ok
Apr 10 14:03:31 node-1 stonith-ng[2700]: notice: remote_op_done: Operation reboot of node-2 by vnode-3 for crmd.1868@vnode-3.2f74f7fe: OK
Apr 10 14:03:31 node-1 crmd[2704]: notice: tengine_stonith_notify: Peer node-2 was terminated (st_notify_fence) by vnode-3 for vnode-3: OK (ref=2f74f7fe-d335-4c5c-9764-2196d33aceb6) by client crmd.1868
Apr 10 14:03:31 node-1 crmd[2704]: notice: tengine_stonith_notify: Notified CMAN that 'node-2' is now fenced
Apr 10 14:03:31 node-1 cib[2699]: warning: cib_process_diff: Diff 0.32.19 -> 0.32.20 from vnode-3 not applied to 0.32.20: current "num_updates" is greater than required
Apr 10 14:03:31 node-1 kernel: device vnet0 entered promiscuous mode
Apr 10 14:03:31 node-1 kernel: vlan1: port 2(vnet0) entering forwarding state
Apr 10 14:03:32 node-1 lrmd[2701]: notice: operation_finished: testVM_start_0:3181 [ 2013/04/10_14:03:31 INFO: Domain name "testvm" saved to /var/run/heartbeat/rsctmp/VirtualDomain-testVM.state. ]
Apr 10 14:03:33 node-1 crmd[2704]: notice: process_lrm_event: LRM operation testVM_start_0 (call=39, rc=0, cib-update=18, confirmed=true) ok
Apr 10 14:03:33 node-1 crmd[2704]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_JOIN_OFFER cause=C_HA_MESSAGE origin=route_message ]
Apr 10 14:03:33 node-1 cib[2699]: notice: cib_process_diff: Diff 0.32.20 -> 0.32.21 from vnode-3 not applied to 0.32.20: Failed application of an update diff
Apr 10 14:03:35 node-1 crmd[2704]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Apr 10 14:03:35 node-1 attrd[2702]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 10 14:03:35 node-1 attrd[2702]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 10 14:03:41 node-1 kernel: kvm: 3218: cpu0 disabled perfctr wrmsr: 0xc1 data 0xabcd
Apr 10 14:03:48 node-1 corosync[1920]: [TOTEM ] A processor failed, forming new configuration.
Apr 10 14:04:01 node-1 corosync[1920]: [QUORUM] Members[2]: 1 3
Apr 10 14:04:01 node-1 corosync[1920]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 10 14:04:01 node-1 crmd[2704]: notice: crm_update_peer_state: cman_event_callback: Node node-2[2] - state is now lost
Apr 10 14:04:01 node-1 kernel: dlm: closing connection to node 2
Apr 10 14:04:01 node-1 corosync[1920]: [CPG ] chosen downlist: sender r(0) ip(192.168.220.21) ; members(old:3 left:1)
Apr 10 14:04:01 node-1 corosync[1920]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 10 14:04:02 node-1 crmd[2704]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_JOIN_OFFER cause=C_HA_MESSAGE origin=route_message ]
Apr 10 14:04:05 node-1 attrd[2702]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 10 14:04:05 node-1 crmd[2704]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Apr 10 14:04:05 node-1 attrd[2702]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 10 14:05:09 node-1 corosync[1920]: [TOTEM ] A processor joined or left the membership and a new membership was formed.
Apr 10 14:05:09 node-1 corosync[1920]: [QUORUM] Members[3]: 1 2 3
Apr 10 14:05:09 node-1 corosync[1920]: [QUORUM] Members[3]: 1 2 3
Apr 10 14:05:09 node-1 crmd[2704]: notice: crm_update_peer_state: cman_event_callback: Node node-2[2] - state is now member
Apr 10 14:05:09 node-1 corosync[1920]: [CPG ] chosen downlist: sender r(0) ip(192.168.220.21) ; members(old:2 left:0)
Apr 10 14:05:09 node-1 corosync[1920]: [MAIN ] Completed service synchronization, ready to provide service.
Apr 10 14:05:17 node-1 kernel: dlm: got connection from 2
Apr 10 14:05:22 node-1 cib[2699]: warning: cib_process_diff: Diff 0.32.0 -> 0.32.1 from node-2 not applied to 0.32.60: current "num_updates" is greater than required
Apr 10 14:05:22 node-1 crmd[2704]: notice: do_state_transition: State transition S_NOT_DC -> S_PENDING [ input=I_JOIN_OFFER cause=C_HA_MESSAGE origin=route_message ]
Apr 10 14:05:27 node-1 crmd[2704]: notice: do_state_transition: State transition S_PENDING -> S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE origin=do_cl_join_finalize_respond ]
Apr 10 14:05:27 node-1 attrd[2702]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
Apr 10 14:05:27 node-1 attrd[2702]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
Apr 10 14:03:20 node-2 crmd[2682]: notice: process_lrm_event: LRM operation node-1_stop_0 (call=73, rc=0, cib-update=36, confirmed=true) ok
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: Running stop for /dev/cstore/cxml on /mnt ]
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: Trying to unmount /mnt ]
Apr 10 14:03:20 node-2 lrmd[2679]: notice: operation_finished: cxml_stop_0:3282 [ 2013/04/10_14:03:20 INFO: unmounted /mnt successfully ]
Apr 10 14:03:20 node-2 crmd[2682]: notice: process_lrm_event: LRM operation cxml_stop_0 (call=77, rc=0, cib-update=37, confirmed=true) ok
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ 2013/04/10_14:03:20 INFO: testvm: Starting live migration to node-1 (using remote hypervisor URI qemu+ssh://node-1/system ). ]
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ error: Requested operation is not valid: domain is already active as 'testvm' ]
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_migrate_to_0:3281 [ 2013/04/10_14:03:21 ERROR: testvm: live migration to qemu+ssh://node-1/system failed: 1 ]
Apr 10 14:03:21 node-2 crmd[2682]: notice: process_lrm_event: LRM operation testVM_migrate_to_0 (call=75, rc=1, cib-update=38, confirmed=true) unknown error
Apr 10 14:03:21 node-2 lrmd[2679]: notice: operation_finished: testVM_stop_0:3392 [ 2013/04/10_14:03:21 ERROR: Configuration file /mnt/testvm.xml does not exist or is not readable. ]
Apr 10 14:03:21 node-2 crmd[2682]: notice: process_lrm_event: LRM operation testVM_stop_0 (call=82, rc=5, cib-update=39, confirmed=true) not installed
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_ais_dispatch: Update relayed from vnode-3
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-testVM (INFINITY)
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_perform_update: Sent update 55: fail-count-testVM=INFINITY
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_ais_dispatch: Update relayed from vnode-3
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-testVM (1365602610)
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_perform_update: Sent update 58: last-failure-testVM=1365602610
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_ais_dispatch: Update relayed from vnode-3
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-testVM (INFINITY)
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_perform_update: Sent update 61: fail-count-testVM=INFINITY
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_ais_dispatch: Update relayed from vnode-3
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-testVM (1365602610)
Apr 10 14:03:21 node-2 attrd[2680]: notice: attrd_perform_update: Sent update 63: last-failure-testVM=1365602610
Apr 10 14:03:29 node-2 stonith-ng[2678]: notice: remote_op_done: Operation reboot of node-2 by vnode-3 for crmd.1868@vnode-3.2f74f7fe: OK
Apr 10 14:03:29 node-2 crmd[2682]: error: tengine_stonith_notify: We were alegedly just fenced by vnode-3 for vnode-3!
Apr 10 14:03:29 node-2 crmd[2682]: error: do_log: FSA: Input I_ERROR from tengine_stonith_notify() received in state S_NOT_DC
Apr 10 14:03:29 node-2 crmd[2682]: notice: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_ERROR cause=C_FSA_INTERNAL origin=tengine_stonith_notify ]
Apr 10 14:03:29 node-2 crmd[2682]: error: do_recover: Action A_RECOVER (0000000001000000) not supported
Apr 10 14:03:29 node-2 crmd[2682]: error: do_log: FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
Apr 10 14:03:29 node-2 crmd[2682]: error: verify_stopped: Resource testVM was active at shutdown. You may ignore this error if it is unmanaged.
Apr 10 14:03:29 node-2 crmd[2682]: notice: terminate_cs_connection: Disconnecting from Corosync
Apr 10 14:03:29 node-2 cib[2677]: warning: qb_ipcs_event_sendv: new_event_notification (2677-2682-15): Broken pipe (32)
Apr 10 14:03:29 node-2 crmd[2682]: error: do_exit: Could not recover from internal error
Apr 10 14:03:29 node-2 cib[2677]: warning: do_local_notify: A-Sync reply to crmd failed: No message of desired type
Apr 10 14:03:29 node-2 pacemakerd[2669]: error: pcmk_child_exit: Child process crmd exited (pid=2682, rc=2)
Apr 10 14:03:29 node-2 pacemakerd[2669]: notice: pcmk_process_exit: Respawning failed child process: crmd
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log
Apr 10 14:03:29 node-2 crmd[3416]: notice: main: CRM Git Version: 394e906
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_cluster_connect: Connecting to cluster infrastructure: cman
Apr 10 14:03:29 node-2 crmd[3416]: notice: cman_event_callback: Membership 68: quorum acquired
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_update_peer_state: cman_event_callback: Node node-1[1] - state is now member
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_update_peer_state: cman_event_callback: Node node-2[2] - state is now member
Apr 10 14:03:29 node-2 crmd[3416]: notice: crm_update_peer_state: cman_event_callback: Node vnode-3[3] - state is now member
Apr 10 14:03:29 node-2 crmd[3416]: notice: do_started: The local CRM is operational
Apr 10 14:03:29 node-2 cib[2677]: warning: cib_process_diff: Diff 0.32.19 -> 0.32.20 from vnode-3 not applied to 0.32.20: current "num_updates" is greater than required
Apr 10 14:03:31 node-2 cib[2677]: notice: cib_process_diff: Diff 0.32.20 -> 0.32.21 from vnode-3 not applied to 0.32.20: Failed application of an update diff
Broadcast message from root@node-2
(unknown) at 14:03 ...
The system is going down for halt NOW!
Apr 10 14:03:33 node-2 init: tty (/dev/tty1) main process (2708) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty2) main process (2710) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty3) main process (2712) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty4) main process (2714) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty5) main process (2720) killed by TERM signal
Apr 10 14:03:33 node-2 init: tty (/dev/tty6) main process (2722) killed by TERM signal