Re: [Openstack-operators] What to do when a compute node dies?

Chris Friesen Mon, 30 Mar 2015 07:46:14 -0700

On 03/29/2015 09:26 PM, Mike Dorman wrote:

> Hi all,
>
> I’m curious about how people deal with failures of compute nodes, as in total
> failure when the box is gone for good.  (Mainly care about KVM HV, but also
> interested in more general cases as well.)
>
> The particular situation we’re looking at: how end users could identify or be
> notified of VMs that no longer exist, because their hypervisor is dead.  As I
> understand it, Nova will still believe VMs are running, and really has no way to
> know anything has changed (other than the nova-compute instance has dropped
> off.)
>
> I understand failure detection is a tricky thing.  But it seems like there must
> be something a little better than this.

This is a timely question...I was wondering if it might make sense to upstream
one of the changes we've made locally.

We have an external entity monitoring the health of compute nodes. When one of
them goes down we automatically take action regarding the instances that had
been running on it.

Normally nova won't let you evacuate an instance until the compute node is
detected as "down", but that typically takes 60 sec and our software knows the
compute node is gone within a few seconds.

The change we made was to patch nova to allow the health monitor to explicitly
tell nova that the node is to be considered "down" (so that instances can be
evacuated without delay). When the external monitoring entity detects that the
compute node is back, it tells nova the node may be considered "up" (if nova
agrees that it's "up").
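
To make the intended behaviour concrete, here is a minimal, self-contained
sketch of the state logic described above. It is not the actual patch and does
not call any nova API: the class ComputeService, the helpers mark_down, mark_up
and can_evacuate, and the constant SERVICE_DOWN_TIME are all hypothetical names
made up for illustration. The 60-second value is only the typical figure
mentioned above.

```python
from dataclasses import dataclass

# Assumption: seconds without a heartbeat before a node counts as "down".
# This mirrors the "60 sec typically" figure from the message.
SERVICE_DOWN_TIME = 60.0


@dataclass
class ComputeService:
    """Hypothetical stand-in for nova's view of one compute node."""
    host: str
    last_heartbeat: float      # timestamp of the last heartbeat seen
    forced_down: bool = False  # set by the external health monitor

    def is_up(self, now: float) -> bool:
        # An explicit "down" from the monitor overrides the heartbeat check.
        if self.forced_down:
            return False
        # Otherwise fall back to the normal heartbeat timeout.
        return (now - self.last_heartbeat) <= SERVICE_DOWN_TIME


def mark_down(service: ComputeService) -> None:
    """Monitor says the node is gone: treat it as down immediately."""
    service.forced_down = True


def mark_up(service: ComputeService) -> None:
    """Monitor says the node is back: clear the override only.

    The node counts as "up" again only if the heartbeat check agrees.
    """
    service.forced_down = False


def can_evacuate(service: ComputeService, now: float) -> bool:
    """Evacuation is allowed only when the node is considered down."""
    return not service.is_up(now)
```

With this logic, a node whose last heartbeat was 5 seconds ago cannot be
evacuated until the monitor calls mark_down, after which evacuation is allowed
straight away. Calling mark_up later only removes the override; if the
heartbeat is still stale, the node stays "down".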

Is this ability to tell nova that a compute node is "down" something that would
be of interest to others?


Chris

