Re: [Openstack-operators] What to do when a compute node dies?

Jay Pipes Mon, 30 Mar 2015 16:00:31 -0700

On 03/30/2015 06:42 PM, Chris Friesen wrote:

On 03/30/2015 02:47 PM, Jay Pipes wrote:

On 03/30/2015 10:42 AM, Chris Friesen wrote:

On 03/29/2015 09:26 PM, Mike Dorman wrote:

Hi all,


I’m curious about how people deal with failures of compute
nodes, as in total failure when the box is gone for good.
(Mainly care about KVM HV, but also interested in more general
cases as well.)

The particular situation we’re looking at: how end users could
identify or be notified of VMs that no longer exist, because
their hypervisor is dead.  As I understand it, Nova will still
believe VMs are running, and really has no way to know anything
has changed (other than the nova-compute instance has dropped
off.)

I understand failure detection is a tricky thing.  But it
seems like there must be something a little better than this.


This is a timely question...I was wondering if it might make
sense to upstream one of the changes we've made locally.

We have an external entity monitoring the health of compute
nodes. When one of them goes down we automatically take action
regarding the instances that had been running on it.

Normally nova won't let you evacuate an instance until the
compute node is detected as "down", but that takes 60 sec
typically and our software knows the compute node is gone within
a few seconds.


Any external monitoring solution that detects the compute node is
"down" could issue a call to `nova evacuate $HOST`.

The question I have for you is what does your software consider as
a "downed" node? Is it some heartbeat-type stuff in network
connectivity? A watchdog in KVM? Some proactive monitoring of disk
or memory faults? Some combination? Something entirely different?
:)


Combination of the above.  A local entity monitors "critical stuff"
on the compute node, and heartbeats with a control node via one or
more network links.
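To make the description above concrete, here is a minimal sketch of such a monitor on the control-node side. It is not the poster's actual software: the class name `NodeMonitor`, the 3-second timeout, and the rule "down only when every link is silent, or when the local checker reports a critical fault" are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class NodeMonitor:
    """Hypothetical control-node view of one compute node's health."""

    timeout: float = 3.0  # assumed: seconds of silence before a link counts as dead
    last_seen: Dict[str, float] = field(default_factory=dict)  # link name -> timestamp
    critical_ok: bool = True  # last status reported by the node's local checker

    def record_heartbeat(self, link: str, now: float, critical_ok: bool = True) -> None:
        # Each heartbeat carries the local checker's verdict on "critical stuff".
        self.last_seen[link] = now
        self.critical_ok = critical_ok

    def is_down(self, now: float) -> bool:
        # A reported critical fault marks the node down immediately.
        if not self.critical_ok:
            return True
        # Otherwise the node is down only when every link has been silent
        # for longer than the timeout. A node that has never sent a
        # heartbeat has no live links and is therefore treated as down.
        return all(now - seen > self.timeout for seen in self.last_seen.values())
```

Requiring all links to be silent avoids declaring a node dead because of a single failed network path; a stricter or looser policy would be equally plausible.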

OK.

The change we made was to patch nova to allow the health monitor
to explicitly tell nova that the node is to be considered "down"
(so that instances can be evacuated without delay).


Why was it necessary to modify Nova for this? The external
monitoring script could easily do: `nova service-disable $HOST
nova-compute` and that immediately takes the compute node out of
service and enables evacuation.
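As a rough sketch of what such an external monitoring hook might look like, the snippet below wraps the command quoted above. The function names and the injectable `runner` parameter are hypothetical; only the `nova service-disable <host> nova-compute` invocation comes from the thread, and it assumes the `nova` client is installed with credentials in the environment.

```python
import subprocess
from typing import Callable, List


def service_disable_cmd(host: str, binary: str = "nova-compute") -> List[str]:
    """Build the argv for `nova service-disable $HOST nova-compute`."""
    return ["nova", "service-disable", host, binary]


def disable_compute_service(host: str, runner: Callable = subprocess.run):
    """Run the command; `runner` is injectable so the call can be tested
    without a real nova client (hypothetical design choice)."""
    # check=True makes subprocess.run raise CalledProcessError on failure,
    # so the monitor notices if the service could not be disabled.
    return runner(service_disable_cmd(host), check=True)
```

A monitor would call `disable_compute_service("compute-01")` once it decides the host is gone.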


Disabling the service is not sufficient.  compute.api.API.evacuate()
throws an exception if servicegroup.api.API.service_is_up(service)
is true.
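The behaviour described in the paragraph above can be illustrated with a toy model in which the administrative "disabled" flag and the heartbeat-based "up" state are tracked separately. This is not nova's code: the `Service` class, the `ServiceStillUp` exception, and the simplified `evacuate()` are hypothetical, the 60-second threshold is the figure mentioned earlier in the thread, and the real behaviour may depend on the servicegroup driver in use.

```python
from dataclasses import dataclass

SERVICE_DOWN_TIME = 60.0  # seconds; the "60 sec" figure quoted in the thread


class ServiceStillUp(Exception):
    """Hypothetical error raised when evacuation is refused."""


@dataclass
class Service:
    host: str
    disabled: bool = False       # administrative flag (enabled/disabled)
    last_heartbeat: float = 0.0  # timestamp of the last check-in


def service_is_up(service: Service, now: float) -> bool:
    # Liveness depends only on heartbeat age, not on the disabled flag.
    return now - service.last_heartbeat <= SERVICE_DOWN_TIME


def evacuate(service: Service, now: float) -> str:
    # Refuse to evacuate while the service is still considered up,
    # even if it has been administratively disabled.
    if service_is_up(service, now):
        raise ServiceStillUp(f"{service.host} is still considered up")
    return f"evacuating instances from {service.host}"
```

In this model, a host that was disabled 10 seconds after its last heartbeat still fails the check; evacuation only succeeds once the heartbeat is more than 60 seconds old, or once something forces `service_is_up()` to return false.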

servicegroup.api.service_is_up() returns whether the service has been
disabled in the database (when using the DB servicegroup driver). Which
is what `nova service-disable $HOST nova-compute` does.

When the external monitoring entity detects that the compute node
is back, it tells nova the node may be considered "up" (if nova
agrees that it's "up").


You mean `nova service-disable $HOST nova-compute`?

Is this ability to tell nova that a compute node is "down"
something that would be of interest to others?


Unless I'm mistaken, `nova service-disable $HOST nova-compute`
already exists and does this?


No, what we have is basically a way to cause
servicegroup.api.API.service_is_up() to return false. That causes
the correct status to be displayed in the "State" column in the
output of "nova service-list" and allows evacuation to proceed.


That's exactly what `nova service-disable $HOST nova-compute` does.

What servicegroup driver are you using?

Best,
-jay
