Hi,

While testing the provided HOST_HOOK "host_error.rb" to get VM high availability with RADOS/RBD as the block device backend, I ran into several questions that I was not able to answer:

The configuration is very default-ish:
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -r",
    remote    = "no" ]

And this was my test scenario:

Starting position:
* A VM running on host1, using RBD as block storage
* There is another host in the cluster: host2
* The VM is also able to run on host2 (tested with live migration)

1. Kill host1 (power off)

2. After some minutes, oned discovers that the host is down:

Mon Sep  9 17:14:10 2013 [InM][I]: Command execution fail: 'if [ -x "/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 2 host1; else exit 42; fi'
Mon Sep  9 17:14:10 2013 [InM][I]: ssh: connect to host host1 port 22: Connection timed out
Mon Sep  9 17:14:10 2013 [InM][I]: ExitCode: 255
Mon Sep  9 17:14:10 2013 [ONE][E]: Error monitoring Host host1 (2): -


3. ONE tries to remove the VM from host1:

Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command execution fail: /var/tmp/one/vmm/kvm/cancel 'one-64' 'host1' 64 host1
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh: connect to host host1 port 22: Connection timed out
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ExitSSHCode: 255
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error connecting to host1
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to execute virtualization driver operation: cancel.
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command execution fail: /var/tmp/one/vnm/ovswitch/clean [...]
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh: connect to host host1 port 22: Connection timed out
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ExitSSHCode: 255
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error connecting to host1
Mon Sep  9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to execute network driver operation: clean.

[...]

Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Command execution fail: /var/lib/one/remotes/tm/ceph/delete host1:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 delete: Deleting /var/lib/one/datastores/0/64/disk.0
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 delete: Command "rbd rm one/one-5-64-0" failed: ssh: connect to host host1 port 22: Connection timed out
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 Error deleting one/one-5-64-0 in host1
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 ExitCode: 255
Mon Sep  9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Failed to execute transfer manager driver operation: tm_delete.
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Command execution fail: /var/lib/one/remotes/tm/shared/delete host1:/var/lib/one//datastores/0/64 64 0
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 delete: Deleting /var/lib/one/datastores/0/64
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 delete: Command "rm -rf /var/lib/one/datastores/0/64" failed: ssh: connect to host host1 port 22: Connection timed out
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 Error deleting /var/lib/one/datastores/0/64
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 ExitCode: 255
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Failed to execute transfer manager driver operation: tm_delete.
Mon Sep  9 17:14:26 2013 [VMM][D]: Message received: CLEANUP SUCCESS 64

4. ONE tries to deploy the VM which was running on host1 to host2, but fails because the RBD volume already exists.

Mon Sep  9 17:14:33 2013 [DiM][D]: Deploying VM 64
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 Command execution fail: /var/lib/one/remotes/tm/ceph/clone quimby:one/one-5 uetli2:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG E 64 clone: Command "rbd copy one/one-5 one/one-5-64-0" failed: Image copy: 0% complete...failed.
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 rbd: copy failed: (17) File exists
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09 17:14:35.466472 7f81463a8780 -1 librbd: rbd image one-5-64-0 already exists
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09 17:14:35.466500 7f81463a8780 -1 librbd: header creation failed
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG E 64 Error cloning one/one-5 to one/one-5-64-0 in quimby
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: LOG I 64 ExitCode: 1
Mon Sep  9 17:14:35 2013 [TM][D]: Message received: TRANSFER FAILURE 64 Error cloning one/one-5 to one/one-5-64-0 in frontend
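
The way steps 3 and 4 interact can be sketched as a toy model in Python. This is not OpenNebula code: Pool, run_on, delete_image and clone_image are hypothetical names made up for illustration. The only behaviour taken from the logs above is that the RBD delete is executed through the VM's host (host1), while the clone is not, and that copying onto an existing image name fails with error 17 (File exists).

```python
import errno


class UnreachableHost(Exception):
    """Raised when a command must run on a host that is powered off."""


class Pool:
    """Toy stand-in for an RBD pool: just a set of image names."""

    def __init__(self, images):
        self.images = set(images)


def run_on(host_up, action):
    """Run `action` only if the host it is dispatched to is reachable."""
    if not host_up:
        raise UnreachableHost("ssh: connection timed out")
    return action()


def delete_image(pool, name, host_up):
    """Model of the delete step: the removal is executed via the VM's host."""
    try:
        run_on(host_up, lambda: pool.images.discard(name))
        return True
    except UnreachableHost:
        return False  # the image stays in the pool


def clone_image(pool, src, dst):
    """Model of the clone step: copying onto an existing name fails."""
    if src not in pool.images:
        raise FileNotFoundError(errno.ENOENT, "No such image", src)
    if dst in pool.images:
        raise FileExistsError(errno.EEXIST, "File exists", dst)
    pool.images.add(dst)


# Scenario from the logs: host1 is down when the cleanup runs.
pool = Pool({"one/one-5", "one/one-5-64-0"})
deleted = delete_image(pool, "one/one-5-64-0", host_up=False)
try:
    clone_image(pool, "one/one-5", "one/one-5-64-0")
    outcome = "cloned"
except FileExistsError as exc:
    outcome = "clone failed: errno %d" % exc.errno
print(deleted, outcome)
```

With host_up=True the delete succeeds and the following clone goes through, so in this model the failure only appears when the cleanup has to pass through the dead host.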


Now some questions:
* Why does ONE try to remove the VM from the failed host? It makes no sense, because the host is down and no longer reachable.
* Why does the VM have to be recreated? The disk image lives on shared storage (RBD), so the VM should only need to be started on another host, not recreated.
* The VM now has the state "FAILED". How is the VM supposed to be recovered?

Thanks for every clarification on this topic.

Cheers,
Tobias
