All,

Currently OpenStack does not have a built-in HA mechanism for tenant instances that could restore virtual machines after a host failure. OpenStack assumes that every application is designed for failure, can handle the loss of an instance, and will self-remediate, but that is rarely the case for the very large enterprise application ecosystem. Many existing enterprise applications are stateful and assume that the physical infrastructure is always on.
Even the OpenStack controller services themselves do not handle failure gracefully. When these applications were virtualized, they were moved to platforms that provided very high SLAs for each virtual machine, so the applications did not have to be rewritten as the IT team moved them from physical to virtual. While these apps cannot benefit from methods like automatic scale-out, their owners will benefit greatly from the self-service capabilities they receive by using the OpenStack control plane.

I'd like to suggest expanding the Heat convergence mechanism to enable self-remediation of virtual machines and other Heat resources.

Convergence specs: https://review.openstack.org/#/c/95907/

The basic flow would look like this:

1. Nova detects host failure and posts a notification
The Nova service_group API implements a host health monitor. We will use it as the notification source when a host goes down. AFAIK there are some issues with it, and we might need to fix them. We need a host-health notification source with low latency and good reliability (when we get a host-down notification, we must be 100% sure that the host is actually down).

2. Nova sends notifications about affected resources
Nova generates a list of affected resources (VMs, for example) and notifies that they are down.

3. Convergence listens for resource-health notifications
It schedules a rebuild of the affected resources, for example the VMs on the given host.

4. We introduce different, configurable methods for resource rescue
A client might want to cover different resources with different levels of SLA. For example, an HTTP edge server may be fault tolerant, and all we want is to recreate it on a different node and add it to the LBaaS pool to regain quorum, while a DB server has to be evacuated.

5. We call nova evacuate if the server is configured to use it
By evacuate I mean nova evacuate --on-shared-storage, so in fact we'll boot up the same VM (from the existing disk) and keep its addresses, data and so on.
This will allow pet servers to minimize downtime caused by host failure. We might stumble upon the fencing problem in this case. Nova already has some form of safeguard implemented (it deletes evacuated instances when the host comes back up). We might want to add a more reliable form of fencing (storage locking?) to Nova in the future.

6. Heat makes sure that all the needed configuration is applied
Volumes attached, processes running, and so on.

In short, what we'll need from Nova is a 100% reliable host-health monitor and an equally reliable rebuild/evacuate mechanism with fencing and a scheduler. In Heat we need a scalable and reliable event listener and an engine to decide which action to perform in a given situation.

Regards,
Michał "inc0" Jastrzębski
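
P.S. To make the per-resource rescue decision (steps 2 to 5) a bit more concrete, here is a minimal sketch in Python of how a listener might turn a host-down notification into a list of rescue actions. Everything in it is hypothetical: RescueMethod, Resource, affected_resources and plan_rescue are names made up for illustration and are not existing Heat or Nova APIs, and the actual calls (nova evacuate, recreating a resource) are deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum


class RescueMethod(Enum):
    # Hypothetical policy names, not real Heat properties.
    RECREATE = "recreate"   # build a fresh copy on another node
    EVACUATE = "evacuate"   # boot the same VM again from its existing disk


@dataclass(frozen=True)
class Resource:
    name: str
    host: str
    rescue: RescueMethod


def affected_resources(resources, failed_host):
    """Step 2: list the resources that were running on the failed host."""
    return [r for r in resources if r.host == failed_host]


def plan_rescue(resources, failed_host):
    """Steps 3-5: choose an action for each affected resource.

    Returns (action, resource_name) tuples; a real engine would
    schedule these actions instead of returning them.
    """
    plan = []
    for res in affected_resources(resources, failed_host):
        if res.rescue is RescueMethod.EVACUATE:
            plan.append(("evacuate", res.name))
        else:
            plan.append(("recreate", res.name))
    return plan


# Assumed example stack: two fault-tolerant edge servers and one pet DB server.
stack = [
    Resource("http-edge-1", "compute-1", RescueMethod.RECREATE),
    Resource("db-1", "compute-1", RescueMethod.EVACUATE),
    Resource("http-edge-2", "compute-2", RescueMethod.RECREATE),
]
print(plan_rescue(stack, "compute-1"))
```

For this sample stack, a failure of compute-1 yields a recreate for http-edge-1 and an evacuate for db-1, while http-edge-2 on compute-2 is left alone. Fencing and the reliability of the host-down signal are outside the scope of this sketch.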
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev