cc'ing Intel and Ericsson engineers who are interested in a similar plan...
On Mon, 2014-04-28 at 15:33 +0100, John Garbutt wrote:
> On 28 April 2014 13:30, Jiangying (Jenny) <jenny.jiangy...@huawei.com> wrote:
> > Nova now can detect host unreachable. But it fails to make out host
> > isolation, host dead and nova compute service down. When host unreachable is
> > reported, users have to find out the exact state by himself and then take
> > the appropriate measure to recover. Therefore we’d like to improve the host
> > detection for nova.
> >
> > Currently the service group API factors out the host detection and makes it
> > a set of abstract internal APIs with a pluggable backend implementation. The
> > backend we designed is as follows:
> >
> > A detection central agent is introduced. When a member joins into the
> > service group, the member host starts to send network heartbeat to the
> > central agent and writes timestamp in shared storage periodically. When the
> > central agent stops receiving the network heartbeats from a member, it pings
> > the member and checks the storage heartbeat before declaring the host to
> > have failed.
> >
> > -------------------------------------------------------------------------------------------------------------
> > network heartbeat | network ping | storage heartbeat | state               | reason
> > ------------------|--------------|-------------------|---------------------|-----------------------------------------
> > OK                | -            | -                 | Running             | -
> > Not OK            | Not OK       | Not OK            | Dead                | hardware failure/abnormal host shut down
> > Not OK            | OK           | Not OK            | Service unreachable | service process crashed
> > Not OK            | Not OK       | OK                | Isolated            | network unreachable
> > -------------------------------------------------------------------------------------------------------------
> >
> > Based on the state recognition table, nova can discern the exact host state
> > and assign the reasons.
> >
> > Thoughts?
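
As an aside, the state recognition table above can be read as a small
lookup function. The sketch below is only illustrative: the names
HostState and classify_host are hypothetical and are not part of Nova or
the service group API, and returning "Unknown" for combinations the
table does not list (for example ping OK and storage heartbeat OK) is an
assumption, since the proposal does not say how those should be handled.

```python
from enum import Enum


class HostState(Enum):
    RUNNING = "Running"
    DEAD = "Dead"
    SERVICE_UNREACHABLE = "Service unreachable"
    ISOLATED = "Isolated"
    UNKNOWN = "Unknown"  # assumption: combination not listed in the table


# Rows of the table for the case where the network heartbeat is Not OK,
# keyed by (network ping OK, storage heartbeat OK).
_FALLBACK_STATES = {
    (False, False): HostState.DEAD,                # hardware failure/abnormal shut down
    (True, False): HostState.SERVICE_UNREACHABLE,  # service process crashed
    (False, True): HostState.ISOLATED,             # network unreachable
}


def classify_host(heartbeat_ok, ping_ok, storage_ok):
    """Map the three checks to a host state, following the table."""
    if heartbeat_ok:
        # First row: ping and storage heartbeat are not consulted at all.
        return HostState.RUNNING
    return _FALLBACK_STATES.get((ping_ok, storage_ok), HostState.UNKNOWN)
```

For example, classify_host(False, False, True) gives HostState.ISOLATED,
matching the last row of the table.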
>
> I don't think Nova should try to include functionality that
> re-implements other good monitoring tools (Nagios, etc)

Agreed.

> Having said that, having a new service group API that uses information
> from external tools to decide if a host is dead or not, and describes
> why, is maybe worth considering.

Also agreed. FYI, related blueprint from Ericsson:

https://review.openstack.org/#/c/87978/

I am -1 on the above blueprint not because I don't see the value in
having nic state play a part in service group management, but because I
don't see a reason to have the resource tracker (which manages resource
usage, not state) or scheduler implement agent state checks.

Best,
-jay

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev