Indeed, HA is very tricky, as you note. In the generic case, where the management server (MS) cannot communicate with the agent, nothing can be concluded about the host's state, so the MS does nothing. I dug this up and posted it to the wiki: https://cwiki.apache.org/confluence/x/dwn8AQ
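To make the "do nothing" behaviour concrete, here is a minimal sketch of the three-valued decision seen in the logs quoted below, where an investigator returns null for "I don't know". This is not CloudStack code: the names `HostState`, `investigate`, `ha_action` and `unreachable` are hypothetical, and the real investigators are Java classes with their own interfaces.

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class HostState(Enum):
    UP = "Up"
    DOWN = "Down"


# Hypothetical investigator signature: takes a host address and returns
# UP, DOWN, or None when it cannot tell ("I don't know").
Investigator = Callable[[str], Optional[HostState]]


def investigate(host: str, investigators: Iterable[Investigator]) -> Optional[HostState]:
    """Return the first definite verdict, or None if every investigator is unsure."""
    for check in investigators:
        verdict = check(host)
        if verdict is not None:
            return verdict
    return None


def ha_action(host: str, investigators: Iterable[Investigator]) -> str:
    """Map the investigation result to an action; an unknown state triggers nothing."""
    state = investigate(host, investigators)
    if state is HostState.DOWN:
        return "restart HA-enabled VMs elsewhere"
    if state is HostState.UP:
        return "leave VMs alone"
    # Without fencing, restarting VMs here could run them in two places at once.
    return "do nothing"


def unreachable(host: str) -> Optional[HostState]:
    # Stand-in for a failed ping: the host may be dead, or only its
    # management NIC may be unplugged, so no verdict is possible.
    return None


if __name__ == "__main__":
    print(ha_action("10.0.100.51", [unreachable, unreachable]))  # prints "do nothing"
```

The point of the sketch is that a host is only treated as down on a definite verdict; silence from every investigator is deliberately not read as "down".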
On 7/15/13 1:20 PM, "Marcus Sorensen" <shadow...@gmail.com> wrote:
>I don't know much about HA in regards to management server/agent
>connectivity, but it seems to me like this is perilous ground. If a
>host loses connection with the management server, it seems to me that
>the management server doesn't have the resources to determine whether
>it should start HA-enabled VMs elsewhere. You could very well end up
>with VMs running in two or three places at once, corrupting them, just
>because a host failed to check in. Maybe the agent was stopped (that
>happens all the time). The management server has no fencing
>capability, hence the messages "I don't know, doing nothing", are the
>correct thing to do. That doesn't seem like it's KVM specific,
>however.
>
>I'm very interested in hearing the details on how this HA was intended
>to work, or how it might be working on other platforms. One solution
>may be to leverage the secondary storage to create locks for VMs, then
>again, when VMs can run without the agent it seems prone to deadlock
>(how does another node take over when another host has the lock, but
>the host seems down, but is actually running the vm?).
>
>On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <paul.an...@shapeblue.com>
>wrote:
>> I bumped this from the user list as we've just come across the same
>>issue.
>>
>> CloudStack does not react or even change host status when contact is
>>lost with a KVM host.
>>
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>>(AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning
>>null ('I don't know')
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator]
>>(AgentTaskPool-1:null) could not reach agent, could not reach agent's
>>host, returning that we don't have enough information
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>Moving on.
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>Moving on.
>> 2013-07-13 17:53:56,695 WARN [agent.manager.AgentManagerImpl]
>>(AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>
>> HA for KVM is almost useless.
>>
>> I suggest this is a blocker for any release until fixed.
>>
>>
>> Regards,
>>
>> Paul Angus
>> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
>> paul.an...@shapeblue.com
>>
>> -----Original Message-----
>> From: Koushik Das [mailto:koushik....@citrix.com]
>> Sent: 12 July 2013 12:21
>> To: us...@cloudstack.apache.org
>> Subject: RE: cs 4.1 host disconnected status
>>
>> I looked at the logs and none of the existing investigators are able to
>>determine that the host is down. I am not sure if there is a clean way
>>to identify if a host is down in case of KVM. Consider the following
>>cases:
>>
>> 1. Host is actually shut down
>> 2. Management NIC of the host is unplugged from the network but host is
>>up and running
>>
>> There is no clean way to distinguish these cases. CloudStack should
>>only mark the host as down in the first case. But not sure how one would
>>achieve this.
>>
>> -Koushik
>>
>>> -----Original Message-----
>>> From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
>>> Sent: Friday, July 12, 2013 2:39 PM
>>> To: us...@cloudstack.apache.org
>>> Subject: Re: cs 4.1 host disconnected status
>>>
>>> I've simulated crash again and here is the log:
>>> http://thesuki.org/temp/cs.log.txt
>>> I stripped out of there GET requests with api keys.
>>> Server was switched off at 8:36
>>>
>>> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
>>><koushik....@citrix.com>wrote:
>>>
>>> > Looks like the KVM investigator is not able to determine the state
>>> > of the agent. Can you share the full log?
>>> >
>>> > > -----Original Message-----
>>> > > From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
>>> > > Sent: Thursday, July 11, 2013 7:47 PM
>>> > > To: users
>>> > > Subject: cs 4.1 host disconnected status
>>> > >
>>> > > Hi all.
>>> > >
>>> > > I use the following environment: CS 4.1, KVM, Centos 6.4
>>> > > (management+node1+node2), OpenIndiana NFS server as primary and
>>> > > secondary storage.
>>> > > and I have the following problem:
>>> > > If I switch one hypervisor node off via ipmi (simulate server
>>> > > crash), it
>>> > never
>>> > > goes to Disconnected status in management. Accordingly, ha-enabled
>>> > > VMs are not restarted on another hypervisor node, because it
>>> > > believes that disconnected node is still online.
>>> > >
>>> > >
>>> > > I get following in management server logs:
>>> > >
>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
>>> > > (AgentManager-Handler-13:null) Seq 19-1133189098:
>>>Processing:
>>> > > { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10,
>>> > > [{"Answer":{"result":false,"details": "Unable to ping
>>>computing host,
>>> > > exiting","wait":0}}] }
>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
>>> > > (AgentTaskPool-1:null) Seq 19-1133189098: Received: { Ans: ,
>>>MgmtId:
>>> > > 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>>> > > (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged,
>>> > > returning
>>> > null
>>> > > ('I don't know')
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator]
>>> > > (AgentTaskPool-1:null) could not reach agent, could not reach
>>>agent's
>>> > > host, returning that we don't have enough information
>>> > > 2013-07-11 10:19:16,153 DEBUG
>>> > > [cloud.ha.HighAvailabilityManagerImpl]
>>> > > (AgentTaskPool-1:null) null unable to determine the state of the
>>>host.
>>> > > Moving on.
>>> > > 2013-07-11 10:19:16,153 DEBUG
>>> > > [cloud.ha.HighAvailabilityManagerImpl]
>>> > > (AgentTaskPool-1:null) null unable to determine the state of the
>>>host.
>>> > > Moving on.
>>> > > 2013-07-11 10:19:16,153 WARN [agent.manager.AgentManagerImpl]
>>> > > (AgentTaskPool-1:null) Agent state cannot be determined,
>>>do
>>> > > nothing
>>> > >
>>> > >
>>> > > If I power on dead node, it goes to state "Connecting" and then
>>>"Up"
>>> > > in management interface.
>>> > >
>>> > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null)
>>> > > Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null)
>>> > > Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null)
>>> > > Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status]
>>> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
>>> > > Enabled, Agent event = AgentConnected, Host id = 12, name =
>>> > > ad112.colobridge.net]
>>> > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status]
>>> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
>>> > > = ad112.colobridge.net; old status = Up; event = AgentConnected;
>>> > > new
>>> > status
>>> > > = Connecting; old update count = 1285; new update count = 1286]
>>> > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status]
>>> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
>>> > > Enabled, Agent event = Ready, Host id = 12, name =
>>> > > ad112.colobridge.net]
>>> > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status]
>>> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
>>> > > = ad112.colobridge.net; old status = Connecting; event = Ready;
>>> > > new
>>> > status =
>>> > > Up; old update count = 1286; new update count = 1287]
>>> > >
>>> > >
>>> > > If I restart cloud-management service, dead node goes to state
>>> > > "Disconnected" in management interface.
>>> > > (there is nothing special in logs in this case)
>>> > >
>>> > > If I do nothing, dead node could stay in "Up" state forever (I
>>> > > waited
>>> > for
>>> > > 12 hours) in management interface, throwing into logs "Agent state
>>> > > cannot be determined, do nothing"
>>> > >
>>> > > Would appreciate if someone could help/suggest how to deal with
>>> > > this problem.
>>> > >
>>> > > --
>>> > > Regards,
>>> > > Valery
>>> > >
>>> > > http://protocol.by/slayer
>>> >
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Valery
>>>
>>> http://protocol.by/slayer
>> This email and any attachments to it may be confidential and are
>>intended solely for the use of the individual to whom it is addressed.
>>Any views or opinions expressed are solely those of the author and do
>>not necessarily represent those of Shape Blue Ltd or related companies.
>>If you are not the intended recipient of this email, you must neither
>>take any action based upon its contents, nor copy or show it to anyone.
>>Please contact the sender if you believe you have received this email in
>>error. Shape Blue Ltd is a company incorporated in England & Wales.
>>ShapeBlue Services India LLP is operated under license from Shape Blue
>>Ltd. ShapeBlue is a registered trademark.