Hi,
While testing the provided HOST_HOOK "host_error.rb" to get VM high
availability with RADOS/RBD as the block-device backend, several
questions came up that I was not able to resolve:
The configuration is mostly the default:
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -r",
    remote    = "no" ]
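For context, my understanding of this hook is that when a host enters the ERROR state, oned expands $ID to the failed host's ID and runs host_error.rb with -r, which deletes and resubmits the VMs that were on that host. A minimal sketch of the resulting invocation (the absolute path under /var/lib/one/remotes/hooks is an assumption on my side):

```shell
# Sketch of what oned effectively runs on the frontend when host 2
# (host1) enters ERROR: $ID is replaced with the host ID, and -r asks
# the hook to delete and resubmit the affected VMs.
# NOTE: the absolute hook path below is an assumption, not from oned.conf.
HOST_ID=2
echo "/var/lib/one/remotes/hooks/ft/host_error.rb $HOST_ID -r"
```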
And this was my test scenario:
Starting position:
* A VM running on host1, using RBD as block storage
* There is another host in the cluster: host2
* The VM is also able to run on host2 (tested with live migration)
1. Kill host1 (power off)
2. After a few minutes, oned detects that the host is down:
Mon Sep 9 17:14:10 2013 [InM][I]: Command execution fail: 'if [ -x
"/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 2
host1; else exit 42; fi'
Mon Sep 9 17:14:10 2013 [InM][I]: ssh: connect to host host1 port 22:
Connection timed out
Mon Sep 9 17:14:10 2013 [InM][I]: ExitCode: 255
Mon Sep 9 17:14:10 2013 [ONE][E]: Error monitoring Host host1 (2): -
3. ONE tries to remove the VM from host1:
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/tmp/one/vmm/kvm/cancel 'one-64' 'host1' 64 host1
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh:
connect to host host1 port 22: Connection timed out
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64
ExitSSHCode: 255
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error
connecting to host1
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute virtualization driver operation: cancel.
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/tmp/one/vnm/ovswitch/clean [...]
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh:
connect to host host1 port 22: Connection timed out
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64
ExitSSHCode: 255
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error
connecting to host1
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute network driver operation: clean.
[...]
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/lib/one/remotes/tm/ceph/delete
host1:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 delete:
Deleting /var/lib/one/datastores/0/64/disk.0
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 delete:
Command "rbd rm one/one-5-64-0" failed: ssh: connect to host host1 port
22: Connection timed out
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 Error
deleting one/one-5-64-0 in host1
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 ExitCode:
255
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute transfer manager driver operation: tm_delete.
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/lib/one/remotes/tm/shared/delete
host1:/var/lib/one//datastores/0/64 64 0
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 delete:
Deleting /var/lib/one/datastores/0/64
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 delete:
Command "rm -rf /var/lib/one/datastores/0/64" failed: ssh: connect to
host host1 port 22: Connection timed out
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 Error
deleting /var/lib/one/datastores/0/64
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 ExitCode:
255
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute transfer manager driver operation: tm_delete.
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: CLEANUP SUCCESS 64
4. ONE tries to redeploy the VM that was running on host1 onto host2,
but this fails because the RBD volume already exists.
Mon Sep 9 17:14:33 2013 [DiM][D]: Deploying VM 64
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 Command
execution fail: /var/lib/one/remotes/tm/ceph/clone quimby:one/one-5
uetli2:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG E 64 clone:
Command "rbd copy one/one-5 one/one-5-64-0" faileImage copy: 0%
complete...failed.
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 rbd: copy
failed: (17) File exists
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09
17:14:35.466472 7f81463a8780 -1 librbd: rbd image one-5-64-0 already
exists
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09
17:14:35.466500 7f81463a8780 -1 librbd: header creation failed
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG E 64 Error
cloning one/one-5 to one/one-5-64-0 in quimby
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 ExitCode: 1
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: TRANSFER FAILURE 64
Error cloning one/one-5 to one/one-5-64-0 in frontend
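The only way out of this that I can see is manual cleanup. The sketch below removes the stale clone named in the log above and resubmits the VM. Please note this is a guess on my part, not a documented recovery path: the pool "one", the image name "one-5-64-0", and the use of `onevm resubmit` are all taken or inferred from the log, and it assumes no client still maps the image. By default it only prints the commands; set RUN=1 to execute them on the frontend.

```shell
# Hypothetical manual recovery sketch, based on the names in the log
# above (pool "one", image "one-5-64-0", VM 64). Commands are only
# printed unless RUN=1 is set in the environment.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

run rbd rm one/one-5-64-0   # remove the stale clone blocking the redeploy
run onevm resubmit 64       # ask oned to schedule the VM again
```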
Now some questions:
* Why does ONE try to remove the VM from the failed host at all? This
seems pointless, since the host is down and no longer reachable.
* Why does the VM have to be recreated? The disk image lives on shared
storage (RBD) and should simply be started on another host, not recreated.
* The VM is now in the FAILED state. How is it supposed to be recovered?
Thanks for any clarification on this topic.
Cheers,
Tobias