Gary Kotton originally posted this bug against the VMware driver: https://bugs.launchpad.net/nova/+bug/1419785
I posted a proposed patch to fix this here: https://review.openstack.org/#/c/158269/1

However, Dan Smith pointed out that the bug can actually be triggered against any driver in a manner not addressed by the above patch alone. I have confirmed this against a libvirt setup as follows:

1. Create some instances
2. Shut down n-cpu
3. Change the hostname
4. Restart n-cpu

Nova compute will delete all instances in libvirt, but continue to report them as ACTIVE and Running.

There are 2 parts to this issue:

1. _destroy_evacuated_instances() should do a better job of sanity checking before performing such a drastic action.
2. The underlying issue is the definition and use of instance.host, instance.node, compute_node.host and compute_node.hypervisor_hostname.

(1) is belt and braces. It's very important, but I want to focus on (2) here.

You'll immediately notice some inconsistent naming here, so to clarify:

* instance.host == compute_node.host == Nova compute's 'host' value.
* instance.node == compute_node.hypervisor_hostname == an identifier which represents a hypervisor.

Architecturally, I'd argue that these mean:

* Host: a Nova communication endpoint for a hypervisor.
* Hypervisor: the physical location of a VM.

Note that in the above case the libvirt driver changed the hypervisor identifier even though the hypervisor had not changed; only its communication endpoint had.

I propose the following:

* ComputeNode describes 1 hypervisor.
* ComputeNode maps 1 hypervisor to 1 compute host.
* A ComputeNode is identified by a hypervisor_id.
* hypervisor_id represents the physical location of running VMs, independent of a compute host.

This renames compute_node.hypervisor_hostname to compute_node.hypervisor_id, which resolves some confusion: it asserts that the identity of the hypervisor is tied to the data describing its VMs, not to the host which is currently running it. In fact, for the VMware and Ironic drivers it has never been a hostname.

VMware[1] and Ironic don't require any changes here. Other drivers will need to be modified so that get_available_nodes() returns a persistent value rather than just the hostname. A reasonable default implementation would be to write a uuid to a file which lives with the VM data and return its contents (sketched below). If the hypervisor has a native concept of a globally unique identifier, that should be used instead.

ComputeNode.hypervisor_id is unique. The hypervisor is unique (there is physically only 1 of it), so it does not make sense to have multiple representations of it and its associated resources.

An instance's location is its hypervisor, wherever that may be, so Instance.host could be removed. This isn't strictly necessary, but it is redundant because the communication endpoint is available via ComputeNode. If we wanted to support changing a communication endpoint at some point, it would also make that operation trivial. Thinking blue sky, it would also open the future possibility of multiple communication endpoints for a single hypervisor.

There is a data migration issue associated with changing a driver's reported hypervisor id. The bug linked below fudges it, but if we were doing it for all drivers I believe it could be handled efficiently by passing the instance list already collected by ComputeManager.init_host to the driver at startup.
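As an illustration of the persistent get_available_nodes() value suggested above, here is a minimal sketch of what a default implementation might look like. This is not existing Nova code: the helper name (_get_persistent_node_id), the file name and the use of CONF.instances_path in the usage comment are assumptions made purely for the example.

    import os
    import uuid


    def _get_persistent_node_id(node_id_path):
        """Return a stable hypervisor id, creating one on first use.

        The id is stored in a file which lives alongside the VM data, so it
        survives a hostname change on the compute host.
        """
        if os.path.exists(node_id_path):
            with open(node_id_path) as f:
                return f.read().strip()

        node_id = str(uuid.uuid4())
        with open(node_id_path, 'w') as f:
            f.write(node_id)
        return node_id

    # A driver would then report this value from get_available_nodes()
    # instead of the hostname, e.g. (hypothetical):
    #
    #     def get_available_nodes(self, refresh=False):
    #         return [_get_persistent_node_id(
    #             os.path.join(CONF.instances_path, 'node_id'))]

A driver whose hypervisor has a native globally unique identifier (as VMware and Ironic effectively already do) would return that instead and skip the file entirely.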
My proposed patch above fixes a potentially severe issue for users of the VMware and Ironic drivers. In conjunction with a move to a persistent hypervisor id for other drivers, it also fixes the related issue described above across the board.

I would like to go forward with my proposed fix as it has an immediate benefit, and I'm happy to work on the persistent hypervisor id for other drivers.

Matt

[1] Modulo bugs: https://review.openstack.org/#/c/159481/

--
Matthew Booth
Red Hat Engineering, Virtualisation Team

Phone: +442070094448 (UK)
GPG ID:  D33C3490
GPG FPR: 3733 612D 2D05 5458 8A8A 1600 3441 EA19 D33C 3490