Hi,
While testing the provided HOST_HOOK "host_error.rb" to get VM high
availability with RADOS/RBD as the block-device backend, several
questions came up that I was not able to resolve:
The configuration is mostly the default:
HOST_HOOK = [
    name      = "error",
    on        = "ERROR",
    command   = "ft/host_error.rb",
    arguments = "$ID -r",
    remote    = "no" ]
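For context, my understanding of this hook is that when a host enters the ERROR state, oned expands $ID to the failed host's ID and runs host_error.rb with -r, which deletes and resubmits the VMs that were on that host. A minimal sketch of the resulting invocation (the absolute path under /var/lib/one/remotes/hooks is an assumption on my side):

```shell
# Sketch of what oned effectively runs on the frontend when host 2
# (host1) enters ERROR: $ID is replaced with the host ID, and -r asks
# the hook to delete and resubmit the affected VMs.
# NOTE: the absolute hook path below is an assumption, not from oned.conf.
HOST_ID=2
echo "/var/lib/one/remotes/hooks/ft/host_error.rb $HOST_ID -r"
```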
And this was my test scenario:
Starting position:
* A VM running on host1, using RBD as block storage
* There is another host in the cluster: host2
* The VM is also able to run on host2 (tested with live migration)
1. Kill host1 (power off)
2. After a few minutes, oned detects that the host is down:
Mon Sep 9 17:14:10 2013 [InM][I]: Command execution fail: 'if [ -x
"/var/tmp/one/im/run_probes" ]; then /var/tmp/one/im/run_probes kvm 2
host1; else exit 42; fi'
Mon Sep 9 17:14:10 2013 [InM][I]: ssh: connect to host host1 port 22:
Connection timed out
Mon Sep 9 17:14:10 2013 [InM][I]: ExitCode: 255
Mon Sep 9 17:14:10 2013 [ONE][E]: Error monitoring Host host1 (2): -
3. ONE tries to remove the VM from host1:
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/tmp/one/vmm/kvm/cancel 'one-64' 'host1' 64 host1
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh:
connect to host host1 port 22: Connection timed out
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64
ExitSSHCode: 255
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error
connecting to host1
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute virtualization driver operation: cancel.
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/tmp/one/vnm/ovswitch/clean [...]
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 ssh:
connect to host host1 port 22: Connection timed out
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64
ExitSSHCode: 255
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG E 64 Error
connecting to host1
Mon Sep 9 17:14:15 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute network driver operation: clean.
[...]
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/lib/one/remotes/tm/ceph/delete
host1:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 delete:
Deleting /var/lib/one/datastores/0/64/disk.0
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 delete:
Command "rbd rm one/one-5-64-0" failed: ssh: connect to host host1 port
22: Connection timed out
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG E 64 Error
deleting one/one-5-64-0 in host1
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 ExitCode:
255
Mon Sep 9 17:14:21 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute transfer manager driver operation: tm_delete.
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Command
execution fail: /var/lib/one/remotes/tm/shared/delete
host1:/var/lib/one//datastores/0/64 64 0
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 delete:
Deleting /var/lib/one/datastores/0/64
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 delete:
Command "rm -rf /var/lib/one/datastores/0/64" failed: ssh: connect to
host host1 port 22: Connection timed out
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG E 64 Error
deleting /var/lib/one/datastores/0/64
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 ExitCode:
255
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: LOG I 64 Failed to
execute transfer manager driver operation: tm_delete.
Mon Sep 9 17:14:26 2013 [VMM][D]: Message received: CLEANUP SUCCESS 64
4. ONE tries to redeploy the VM that was running on host1 onto host2,
but this fails because the RBD volume already exists.
Mon Sep 9 17:14:33 2013 [DiM][D]: Deploying VM 64
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 Command
execution fail: /var/lib/one/remotes/tm/ceph/clone quimby:one/one-5
uetli2:/var/lib/one//datastores/0/64/disk.0 64 101
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG E 64 clone:
Command "rbd copy one/one-5 one/one-5-64-0" faileImage copy: 0%
complete...failed.
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 rbd: copy
failed: (17) File exists
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09
17:14:35.466472 7f81463a8780 -1 librbd: rbd image one-5-64-0 already
exists
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 2013-09-09
17:14:35.466500 7f81463a8780 -1 librbd: header creation failed
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG E 64 Error
cloning one/one-5 to one/one-5-64-0 in quimby
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: LOG I 64 ExitCode: 1
Mon Sep 9 17:14:35 2013 [TM][D]: Message received: TRANSFER FAILURE 64
Error cloning one/one-5 to one/one-5-64-0 in frontend
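The only way out of this that I can see is manual cleanup. The sketch below removes the stale clone named in the log above and resubmits the VM. Please note this is a guess on my part, not a documented recovery path: the pool "one", the image name "one-5-64-0", and the use of `onevm resubmit` are all taken or inferred from the log, and it assumes no client still maps the image. By default it only prints the commands; set RUN=1 to execute them on the frontend.

```shell
# Hypothetical manual recovery sketch, based on the names in the log
# above (pool "one", image "one-5-64-0", VM 64). Commands are only
# printed unless RUN=1 is set in the environment.
run() { if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi; }

run rbd rm one/one-5-64-0   # remove the stale clone blocking the redeploy
run onevm resubmit 64       # ask oned to schedule the VM again
```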
Now some questions:
* Why does ONE try to remove the VM from the failed host at all? This
seems pointless, since the host is down and no longer reachable.
* Why does the VM have to be recreated? The disk image lives on shared
storage (RBD) and should simply be started on another host, not recreated.
* The VM is now in the FAILED state. How is it supposed to be recovered?
Thanks for any clarification on this topic.
Cheers,
Tobias