On Wed, Mar 19, 2014 at 12:08:30PM -0400, Zane Bitter wrote:
> On 19/03/14 02:07, Chris Friesen wrote:
> > On 03/18/2014 11:18 AM, Zane Bitter wrote:
> > > On 18/03/14 12:42, Steven Dake wrote:
> > > > You should be able to use the HARestarter resource and
> > > > functionality to do healthchecking of a vm.
> > >
> > > HARestarter is actually pretty problematic, both in a "causes major
> > > architectural headaches for Heat and will probably be deprecated very
> > > soon" sense and a "may do very unexpected things to your resources"
> > > sense. I wouldn't recommend it.
> >
> > Could you elaborate? What unexpected things might it do? And what are
> > the alternatives?
>
> First of all, despite the name, it doesn't just restart but actually
> deletes the server that it's monitoring and recreates an entirely
> new one. It also deletes any resources which directly or indirectly
> depend on the server being monitored and recreates them too.
>
> The alternative is to use Ceilometer alarms and/or some external
> monitoring system and implement recovery yourself, since the
> strategy you want depends on both your application and the type of
> failure.
>
> Another avenue being explored in Heat is to have a general way of
> bringing a stack back into line with its template:
> https://blueprints.launchpad.net/heat/+spec/stack-convergence
>
> cheers,
> Zane.
Thanks, Zane. Though I wasn't able to make the HA sample template work in
my environment (primarily due to some CloudWatch token authentication
failures), I did get some hands-on experience of how 'HARestarter'
actually does the VM 'restart' work. From HARestarter's perspective, a VM
is just a resource that can be recreated. This is simple and effective,
but too brutal a way to 'restart' VM servers. ;)

What I am trying to do is to achieve a certain level of HA for VMs, which
are treated as black boxes. When something bad happens, some VM health
monitoring system can quickly detect and report it to Heat, so that Heat
can decide, based on a user-specified policy, to:

1) reboot or rebuild the VM with the same identity, or
2) evacuate (i.e. remote-restart) it on another host, or
3) migrate it to another host.

The recovery actions above, for Heat, are just invocations of Nova APIs.
But I am not suggesting that VM failures should be handled in Nova
directly. IMHO, this level of orchestration should go into Heat. To avoid
messing up data consistency or the network setup, some fencing operations
need to be done first -- blueprints on this are either under review or
being implemented in Cinder and Neutron.

I don't think it is a good idea to rely on some external monitoring
system for VM failure detection. It means additional steps to set up,
additional software to upgrade, an additional chapter in the Operator's
Guide, etc. We are evaluating whether Ceilometer can do a good job here.

Regarding the stack convergence work, it is a good starting point. If I
may suggest something, I'd like to see a separation between the two
efforts:

- the robustness of the Heat engine itself, including API retry ...
- the health monitoring of the stack created by Heat, which can be done
  either via active status polling or reactive event handling

In this context, a cluster of VMs can be monitored as a single entity in
the stack.
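For illustration, the three recovery actions above boil down to a thin
dispatch over the Nova server API. Below is a minimal sketch using
python-novaclient method names as I understand them (reboot, rebuild,
evacuate, migrate); the `recover` helper and its policy strings are
hypothetical names of my own, not anything that exists in Heat or Nova
today, and this is untested against a real cloud:

```python
def recover(nova, server_id, policy, image_id=None, target_host=None):
    """Dispatch one recovery action for a failed server (sketch only).

    nova   -- a novaclient Client instance (assumed interface)
    policy -- one of 'reboot', 'rebuild', 'evacuate', 'migrate'
    Returns the name of the action taken, for logging.
    """
    server = nova.servers.get(server_id)
    if policy == 'reboot':
        # In-place restart; the VM keeps the same identity.
        nova.servers.reboot(server, reboot_type='HARD')
    elif policy == 'rebuild':
        # Re-image the server while keeping its ID and addresses.
        nova.servers.rebuild(server, image_id)
    elif policy == 'evacuate':
        # Remote-restart on another host (for the host-down case).
        nova.servers.evacuate(server, host=target_host)
    elif policy == 'migrate':
        # Move the server off a degraded host.
        nova.servers.migrate(server)
    else:
        raise ValueError('unknown recovery policy: %s' % policy)
    return policy
```

In a real deployment the `nova` handle would be built from
novaclient with proper credentials, and the fencing operations mentioned
above would have to complete before an evacuate or migrate is issued.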
When a stack convergence check is performed, such a cluster (say, with 2
members) can report, for example:

- Green: supposed to have 2 members (Servers) running, and both are
  active now.
- Yellow: supposed to have 2 members, but one is reported as down;
  trying to recover it now.
- Red: both Servers seem to have been hijacked by aliens; some action is
  needed now -- rebuild me (the cluster) if necessary.

It would be good to have stack convergence do per-resource-type
monitoring/recovery actions.

Just some random thoughts for discussion.

Regards,
Qiming

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev