Hi,

I had a lively discussion yesterday with the OpenStack Nova cores about reset server state. It started with how to do it in one API call for all VMs on a host (hypervisor), as discussed in DOCTOR-78. But then it came down to the question of why we actually want reset server state in the first place. It is not something you need to do when you force down a host. If we want a notification about the affected VMs, and further an alarm, then that is another thing. If we want that kind of notification, it is something we should write a spec for. What we should not do is reset the state to error for each VM on a host when the error was not in the VM but at host level. (Yes, before you ask: Nova can leave the state of a working VM unchanged when the host is down. You do not touch the VM state unless you want to do something for the VM, or unless the VM was actually the one having the error. And you do not want to do something for the VM itself in every scenario; sometimes you are just happy that it comes up again on the same host when the host comes back.)
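To illustrate the difference in API usage, here is a minimal sketch that only builds request descriptions and sends nothing. The paths and bodies reflect my understanding of the Nova compute API (force-down as introduced in microversion 2.11, and the os-resetState server action); the helper names, host name and VM ids are hypothetical.

```python
import json


def force_down_request(host):
    """One host-level call; VM states are left untouched.

    Assumption: PUT /os-services/force-down as of Nova microversion 2.11.
    """
    return {
        "method": "PUT",
        "path": "/os-services/force-down",
        "body": {"host": host, "binary": "nova-compute", "forced_down": True},
    }


def reset_state_requests(server_ids):
    """One call per VM, each marking a VM as error although the fault is on the host."""
    return [
        {
            "method": "POST",
            "path": f"/servers/{server_id}/action",
            "body": {"os-resetState": {"state": "error"}},
        }
        for server_id in server_ids
    ]


# Hypothetical host and VM ids, for illustration only.
host_call = force_down_request("compute-1")
per_vm_calls = reset_state_requests(["vm-a", "vm-b", "vm-c"])

print(json.dumps(host_call))
print(len(per_vm_calls), "additional calls that change VM state")
```

The host-level call is enough to tell Nova the host is down; the per-VM calls exist only to provoke state-change notifications.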
Again I come to realize what I said long ago, before we had anything: it will not be possible to make alarms correctly by changing state in Nova and other controllers and then triggering the alarm from the notification about those state changes. That will never give us what we want for the alarms, although otherwise we surely need to correct the states. Even for things where we get a notification triggered by a state change, we will not have the information needed in the alarm, and surely we should not call APIs in vain just to get an alarm (like reset server state). We want tenant/VNFM-specific alarms that tell which of its VMs (virtual resources) are affected by the fault, and the cause (and surely also alarms about physical faults, which will not be consumed by the tenant/VNFM, and the other fields needed by the ETSI spec). The only way to get this correct for each kind of fault that can appear is to form all the alarms (the notification used to form the alarm) in the Inspector (Congress or Vitrage). It is the only place that has all the information needed in the different scenarios, so it can make this right, and it has the minimum delay, which is crucial in Telco fault management.

Also, if we are looking to have OPNFV used in production, and one would need to be OPNFV compliant, it means we need to make things right. I strongly suggest that, while the way we make the alarm today is a great step we have achieved as a proof of concept (changing states and having the alarm in under 1 second), we take the next steps towards a conceptually correct way to achieve this and have correct alarms.

Br,
Tomi
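As a rough illustration of forming the alarms in the Inspector, the sketch below groups the VMs on a faulty host by tenant and builds one alarm per tenant plus one physical alarm, without changing any VM state. It is not Congress or Vitrage code; the Vm type, the form_alarms function and the alarm fields are all hypothetical, and a real alarm would need the fields required by the ETSI spec.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Vm:
    # Hypothetical inventory record the Inspector is assumed to hold.
    vm_id: str
    tenant_id: str
    host: str


def form_alarms(fault_host, cause, inventory):
    """Build alarms for a host fault directly from the Inspector's own inventory."""
    affected = defaultdict(list)
    for vm in inventory:
        if vm.host == fault_host:
            affected[vm.tenant_id].append(vm.vm_id)

    # Physical alarm: not meant to be consumed by the tenant/VNFM.
    alarms = [{"scope": "physical", "resource": fault_host, "cause": cause}]

    # One alarm per tenant, listing only that tenant's affected VMs.
    for tenant_id, vm_ids in sorted(affected.items()):
        alarms.append({
            "scope": "tenant",
            "tenant_id": tenant_id,
            "affected_vms": sorted(vm_ids),
            "cause": cause,
        })
    return alarms


# Hypothetical inventory, for illustration only.
inventory = [
    Vm("vm-a", "tenant-1", "compute-1"),
    Vm("vm-b", "tenant-2", "compute-1"),
    Vm("vm-c", "tenant-1", "compute-2"),
]
alarms = form_alarms("compute-1", "host down", inventory)
for alarm in alarms:
    print(alarm)
```

Each tenant sees only its own affected VMs together with the cause, and no API call is made just to provoke a notification.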
_______________________________________________ opnfv-tech-discuss mailing list opnfv-tech-discuss@lists.opnfv.org https://lists.opnfv.org/mailman/listinfo/opnfv-tech-discuss