There is a long-standing issue with KVM HA; see bug CLOUDSTACK-3535.
Basically, HA is not triggered if the KVM agent is stopped, whether normally or
abnormally. HA is triggered only if the network between the mgt server and the
KVM host is disconnected and the network between KVM hosts in the same cluster
is also disconnected.
Here is how KVM HA works after the fix for CLOUDSTACK-3535:
1. If the agent is stopped, it sends a shutdown request to the mgt server. The
mgt server marks the host as disconnected but still keeps the host in the
pingmap. Code is in AgentManagerImpl->AgentHandler->ProcessRequest->
disconnectWithoutInvestigation
2. After ping.interval, the mgt server finds that the host's ping has timed out
and starts an HA investigation for the host. Code is in AgentMonitor->run->
disconnectWithInvestigation
3. The mgt server calls all the available Investigators to investigate the
status of the host.
     The following investigators are called for a KVM host:
        UserVmDomRInvestigator->isAgentAlive sends a PingTestCommand to the
host's neighbor. PingTestCommand pings the host's private IP address. If the
ping succeeds, the host is up; otherwise, the host's state is unknown. So this
investigator can only detect that a host is up.
        KVMInvestigator, which is newly added, sends a CheckOnHostCommand to
the host's neighbor. CheckOnHostCommand checks the heartbeat of the host (the
heartbeat is stored on shared primary storage). Ideally, it can detect whether
the host is down or up.

     With UserVmDomRInvestigator and KVMInvestigator combined, the mgt server
should be able to find out the status of the host. But there is one case where
these two investigators can report the wrong status:
          The host is in a network partition while the KVM agent is down (thus
the heartbeat has stopped).
4. After the investigators report the status of the host, if the host is down,
the mgt server starts HA for the VMs created on the host.
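
To make steps 3 and 4 concrete, here is a minimal Java sketch of how the two
investigators could be combined. It is not the actual CloudStack code: the
names HostStatus, HostView and InvestigationSketch are hypothetical, and the
rule that the first investigator with a definite answer wins is an assumption
made for illustration.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for illustration; not CloudStack's real classes.
enum HostStatus { UP, DOWN }

final class HostView {
    final boolean pingReachable;   // can a neighbor ping the host's private IP?
    final boolean heartbeatFresh;  // is the heartbeat on primary storage still updated?

    HostView(boolean pingReachable, boolean heartbeatFresh) {
        this.pingReachable = pingReachable;
        this.heartbeatFresh = heartbeatFresh;
    }
}

final class InvestigationSketch {
    // Ping-based check: can only confirm UP; null means "unknown".
    static HostStatus pingInvestigator(HostView h) {
        return h.pingReachable ? HostStatus.UP : null;
    }

    // Heartbeat-based check: reports UP or DOWN from the shared-storage heartbeat.
    static HostStatus heartbeatInvestigator(HostView h) {
        return h.heartbeatFresh ? HostStatus.UP : HostStatus.DOWN;
    }

    // Investigators in the order they are described in step 3.
    static final List<Function<HostView, HostStatus>> INVESTIGATORS = List.of(
            InvestigationSketch::pingInvestigator,
            InvestigationSketch::heartbeatInvestigator);

    // Assumption: the first investigator that returns a definite status wins.
    static HostStatus investigate(HostView h) {
        for (Function<HostView, HostStatus> investigator : INVESTIGATORS) {
            HostStatus status = investigator.apply(h);
            if (status != null) {
                return status;
            }
        }
        return null; // nobody could decide
    }
}
```

With this sketch, a host whose agent was stopped but which still answers ping
is reported as UP, so no HA is started. A host that is partitioned and whose
agent is down gives investigate(new HostView(false, false)) == DOWN, even
though the host itself may still be running. That is the problem case from
step 3.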


Improvement:
     Per the suggestion from Lennert den Teuling, we had better use IPMI to
detect host status. It is more reliable than ping and heartbeat, as IPMI has
its own network and is less likely to be affected by a network partition.
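
As a rough idea of what an IPMI-based check could look like, here is a small
Java sketch built around the ipmitool command "chassis power status". Nothing
like this exists in the fix for CLOUDSTACK-3535; the names PowerState and
IpmiPowerSketch and the way the BMC address and user are passed in are
assumptions for illustration only. The -E option makes ipmitool read the
password from the IPMI_PASSWORD environment variable.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical names for illustration; not part of CloudStack.
enum PowerState { ON, OFF, UNKNOWN }

final class IpmiPowerSketch {
    // Builds the ipmitool invocation that queries the chassis power state
    // over the out-of-band (BMC) network.
    static List<String> buildCommand(String bmcAddress, String user) {
        return List.of("ipmitool", "-I", "lanplus",
                "-H", bmcAddress, "-U", user, "-E",
                "chassis", "power", "status");
    }

    // Parses ipmitool output such as "Chassis Power is on".
    // Anything unexpected (errors, timeouts) is treated as UNKNOWN.
    static PowerState parsePowerStatus(String output) {
        if (output == null) {
            return PowerState.UNKNOWN;
        }
        String text = output.trim().toLowerCase(Locale.ROOT);
        if (text.contains("chassis power is on")) {
            return PowerState.ON;
        }
        if (text.contains("chassis power is off")) {
            return PowerState.OFF;
        }
        return PowerState.UNKNOWN;
    }
}
```

The command list can be handed to java.lang.ProcessBuilder on the mgt server.
Note that OFF is the useful answer here: a powered-off host is certainly down,
while ON only says the chassis has power, not that the host OS or the agent is
healthy.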

             
