Indeed, HA is very tricky, as you note. In the generic case where the
management server (MS) cannot communicate with the agent, nothing can be
concluded, and the MS does nothing.
I dug this up and posted it to the wiki:
https://cwiki.apache.org/confluence/x/dwn8AQ
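
For what it's worth, the behaviour visible in the logs below can be pictured
roughly like this. It is only an illustrative Python sketch, not the actual
CloudStack code: the names `investigate` and `decide` and the returned strings
are made up for the example. The one thing taken from the logs is the idea
that an investigator may answer "I don't know" (null), and that the MS does
nothing when no investigator gives a definite answer.

```python
from typing import Callable, Iterable, Optional

# Hypothetical tri-state verdict: True = host is alive, False = host is down,
# None = "I don't know" (the investigator could not reach a conclusion).
Investigator = Callable[[str], Optional[bool]]


def investigate(host: str, investigators: Iterable[Investigator]) -> Optional[bool]:
    """Return the first definite verdict, or None if nobody can tell."""
    for check in investigators:
        verdict = check(host)
        if verdict is not None:
            return verdict
    return None


def decide(host: str, investigators: Iterable[Investigator]) -> str:
    """Map the verdict to an action; an unknown state means no action."""
    verdict = investigate(host, investigators)
    if verdict is None:
        return "do nothing"
    return "keep host Up" if verdict else "mark host Down and restart HA VMs"


def ping_unreachable(host: str) -> Optional[bool]:
    # Stand-in for an investigator that cannot reach the host at all.
    return None


print(decide("10.0.100.51", [ping_unreachable]))  # prints "do nothing"
```

The point of the third state is that "cannot ping" is not the same as "down":
restarting VMs is only safe after a definite "down" verdict.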


On 7/15/13 1:20 PM, "Marcus Sorensen" <shadow...@gmail.com> wrote:

>I don't know much about HA in regards to management server/agent
>connectivity, but it seems to me like this is perilous ground.  If a
>host loses connection with the management server, it seems to me that
>the management server doesn't have the resources to determine whether
>it should start HA-enabled VMs elsewhere. You could very well end up
>with VMs running in two or three places at once, corrupting them, just
>because a host failed to check in. Maybe the agent was stopped (that
>happens all the time). The management server has no fencing
>capability, hence the "I don't know, doing nothing" messages are the
>correct behavior. That doesn't seem like it's KVM-specific,
>however.
>
>I'm very interested in hearing the details on how this HA was intended
>to work, or how it might be working on other platforms.  One solution
>may be to leverage the secondary storage to create locks for VMs. Then
>again, when VMs can run without the agent, that seems prone to deadlock
>(how does another node take over when a host holds the lock and
>seems down, but is actually still running the VM?).
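
Inline, on the lock idea: one common way around the "holder looks dead but is
still running the VM" question is a lease that expires unless it is renewed,
combined with the host stopping its own VMs when it can no longer renew. Below
is a minimal Python sketch under that assumption. `LeaseTable` and its
in-memory dict are hypothetical stand-ins for lock files on secondary storage,
and nothing here reflects how CloudStack actually implements HA.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    owner: str
    expires_at: float


class LeaseTable:
    """In-memory stand-in for per-VM lock files on shared storage (assumption)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._leases: Dict[str, Lease] = {}

    def acquire_or_renew(self, vm: str, host: str, now: Optional[float] = None) -> bool:
        """Grant the lease if it is free, already held by `host`, or expired."""
        now = time.monotonic() if now is None else now
        lease = self._leases.get(vm)
        if lease is None or lease.owner == host or lease.expires_at <= now:
            self._leases[vm] = Lease(host, now + self.ttl)
            return True
        return False


table = LeaseTable(ttl=10)
print(table.acquire_or_renew("vm1", "hostA", now=0))   # prints True
print(table.acquire_or_renew("vm1", "hostB", now=5))   # prints False
print(table.acquire_or_renew("vm1", "hostB", now=10))  # prints True
```

This is only safe if a host that fails to renew stops the VM before its lease
expires; otherwise the takeover at expiry can still produce two running
copies. A real version would also need a clock all hosts agree on, which
`time.monotonic()` on a single machine does not provide.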
>
>On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <paul.an...@shapeblue.com>
>wrote:
>> I bumped this from the user list as we've just come across the same
>>issue.
>>
>> CloudStack does not react or even change host status when contact is
>>lost with a KVM host.
>>
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>>(AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning
>>null ('I don't know')
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator]
>>(AgentTaskPool-1:null) could not reach agent, could not reach agent's
>>host, returning that we don't have enough information
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>Moving on.
>> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>(AgentTaskPool-1:null) null unable to determine the state of the host.
>>Moving on.
>> 2013-07-13 17:53:56,695 WARN  [agent.manager.AgentManagerImpl]
>>(AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>
>> HA for KVM is almost useless.
>>
>> I suggest this be treated as a blocker for any release until fixed.
>>
>>
>> Regards,
>>
>> Paul Angus
>> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
>> paul.an...@shapeblue.com
>>
>> -----Original Message-----
>> From: Koushik Das [mailto:koushik....@citrix.com]
>> Sent: 12 July 2013 12:21
>> To: us...@cloudstack.apache.org
>> Subject: RE: cs 4.1 host disconnected status
>>
>> I looked at the logs and none of the existing investigators are able to
>>determine that the host is down. I am not sure if there is a clean way
>>to identify if a host is down in case of KVM. Consider the following
>>cases:
>>
>> 1. Host is actually shut down
>> 2. Management NIC of the host is unplugged from the network but host is
>>up and running
>>
>> There is no clean way to distinguish these cases. CloudStack should
>>only mark the host as down in the first case, but I am not sure how one
>>would achieve this.
>>
>> -Koushik
>>
>>> -----Original Message-----
>>> From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
>>> Sent: Friday, July 12, 2013 2:39 PM
>>> To: us...@cloudstack.apache.org
>>> Subject: Re: cs 4.1 host disconnected status
>>>
>>> I've simulated the crash again and here is the log:
>>> http://thesuki.org/temp/cs.log.txt
>>> I stripped the GET requests containing API keys out of it.
>>> The server was switched off at 8:36.
>>>
>>> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
>>> <koushik....@citrix.com> wrote:
>>>
>>> > Looks like the KVM investigator is not able to determine the state
>>> > of the agent. Can you share the full log?
>>> >
>>> > > -----Original Message-----
>>> > > From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
>>> > > Sent: Thursday, July 11, 2013 7:47 PM
>>> > > To: users
>>> > > Subject: cs 4.1 host disconnected status
>>> > >
>>> > > Hi all.
>>> > >
>>> > > I use the following environment: CS 4.1, KVM, CentOS 6.4
>>> > > (management+node1+node2), OpenIndiana NFS server as primary and
>>> > > secondary storage,
>>> > > and I have the following problem:
>>> > > If I switch one hypervisor node off via IPMI (to simulate a server
>>> > > crash), it never goes to Disconnected status in management.
>>> > > Accordingly, HA-enabled VMs are not restarted on another hypervisor
>>> > > node, because management believes that the switched-off node is
>>> > > still online.
>>> > >
>>> > >
>>> > > I get the following in the management server logs:
>>> > >
>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
>>> > > (AgentManager-Handler-13:null) Seq 19-1133189098: Processing:
>>> > >  { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10,
>>> > > [{"Answer":{"result":false,"details": "Unable to ping computing host,
>>> > > exiting","wait":0}}] }
>>> > > 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
>>> > > (AgentTaskPool-1:null) Seq 19-1133189098: Received:  { Ans: , MgmtId:
>>> > > 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl]
>>> > > (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged,
>>> > > returning null ('I don't know')
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator]
>>> > > (AgentTaskPool-1:null) could not reach agent, could not reach agent's
>>> > > host, returning that we don't have enough information
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>> > > (AgentTaskPool-1:null) null unable to determine the state of the host.
>>> > > Moving on.
>>> > > 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>>> > > (AgentTaskPool-1:null) null unable to determine the state of the host.
>>> > > Moving on.
>>> > > 2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl]
>>> > > (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>>> > >
>>> > >
>>> > > If I power on the dead node, it goes to state "Connecting" and then
>>> > > "Up" in the management interface.
>>> > >
>>> > > 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null)
>>> > > Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null)
>>> > > Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null)
>>> > > Ping timeout for host 12, do invstigation
>>> > > 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status]
>>> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
>>> > > Enabled, Agent event = AgentConnected, Host id = 12, name =
>>> > > ad112.colobridge.net]
>>> > > 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status]
>>> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
>>> > > = ad112.colobridge.net; old status = Up; event = AgentConnected;
>>> > > new status = Connecting; old update count = 1285; new update
>>> > > count = 1286]
>>> > > 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status]
>>> > > (AgentConnectTaskPool-5:null) Transition:[Resource state =
>>> > > Enabled, Agent event = Ready, Host id = 12, name =
>>> > > ad112.colobridge.net]
>>> > > 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status]
>>> > > (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
>>> > > = ad112.colobridge.net; old status = Connecting; event = Ready;
>>> > > new status = Up; old update count = 1286; new update count = 1287]
>>> > >
>>> > >
>>> > > If I restart the cloud-management service, the dead node goes to
>>> > > state "Disconnected" in the management interface
>>> > > (there is nothing special in the logs in this case).
>>> > >
>>> > > If I do nothing, the dead node can stay in the "Up" state forever (I
>>> > > waited for 12 hours) in the management interface, while the logs
>>> > > keep showing "Agent state cannot be determined, do nothing".
>>> > >
>>> > > I would appreciate it if someone could help or suggest how to deal
>>> > > with this problem.
>>> > >
>>> > > --
>>> > > Regards,
>>> > > Valery
>>> > >
>>> > > http://protocol.by/slayer
>>> >
>>>
>>>
>>>
>>> --
>>> Regards,
>>> Valery
>>>
>>> http://protocol.by/slayer
