On 03/29/2015 09:26 PM, Mike Dorman wrote:
Hi all,

I’m curious about how people deal with failures of compute nodes, as in total
failure when the box is gone for good.  (Mainly care about KVM HV, but also
interested in more general cases as well.)

The particular situation we’re looking at: how end users could identify or be
notified of VMs that no longer exist, because their hypervisor is dead.  As I
understand it, Nova will still believe VMs are running, and really has no way to
know anything has changed (other than that the nova-compute instance has dropped
off).

I understand failure detection is a tricky thing.  But it seems like there must
be something a little better than this.

This is a timely question. I was wondering whether it might make sense to upstream one of the changes we've made locally.

We have an external entity monitoring the health of compute nodes. When one of them goes down, we automatically take action on the instances that had been running on it.

Normally nova won't let you evacuate an instance until the compute node is detected as "down". That detection typically takes 60 seconds, whereas our software knows the compute node is gone within a few seconds.

The change we made was to patch nova so that the health monitor can explicitly tell nova that the node is to be considered "down", which lets instances be evacuated without delay. When the external monitoring entity detects that the compute node is back, it tells nova the node may be considered "up" again; nova then treats it as "up" only if its own check agrees.
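
As a rough illustration of the logic described above (this is not the actual patch, and none of these names are nova APIs), the sketch below models a service record with a heartbeat timestamp and a monitor-controlled flag. The class `ComputeService`, the `forced_down` field, the two `monitor_reports_*` helpers, and the 60-second constant are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed detection window, matching the "60 seconds" mentioned above.
SERVICE_DOWN_TIME = 60.0


@dataclass
class ComputeService:
    """Hypothetical stand-in for nova's record of one compute node."""
    host: str
    last_heartbeat: float       # timestamp of the node's last check-in
    forced_down: bool = False   # set and cleared by the external monitor

    def heartbeat_ok(self, now: float) -> bool:
        # Nova's own view: the node is alive if it checked in recently.
        return (now - self.last_heartbeat) <= SERVICE_DOWN_TIME

    def is_up(self, now: float) -> bool:
        # An explicit "down" from the monitor overrides a fresh heartbeat,
        # so evacuation does not have to wait for the timeout.
        if self.forced_down:
            return False
        # Once the flag is cleared, the node is "up" only if nova agrees.
        return self.heartbeat_ok(now)


def monitor_reports_down(svc: ComputeService) -> None:
    """The monitor saw the node fail: mark it down immediately."""
    svc.forced_down = True


def monitor_reports_up(svc: ComputeService) -> None:
    """The monitor saw the node return: hand the decision back to nova."""
    svc.forced_down = False


svc = ComputeService(host="compute-1", last_heartbeat=100.0)
state_before = svc.is_up(now=105.0)          # heartbeat is 5 s old
monitor_reports_down(svc)
state_forced = svc.is_up(now=105.0)          # down despite fresh heartbeat
monitor_reports_up(svc)
state_cleared_fresh = svc.is_up(now=105.0)   # flag cleared, heartbeat fresh
state_cleared_stale = svc.is_up(now=200.0)   # flag cleared, heartbeat stale
```

The point of the design is that the monitor can only make nova more pessimistic: clearing the flag never marks a node "up" by itself, it just returns the decision to the heartbeat check.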

Is this ability to tell nova that a compute node is "down" something that would be of interest to others?

Chris

_______________________________________________
OpenStack-operators mailing list
[email protected]
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-operators
