Re: [URGENT] KVM HA - (FW: cs 4.1 host disconnected status)

Marcus Sorensen Mon, 15 Jul 2013 07:44:48 -0700

For OpenStack, look at the current state of "evacuate".

http://www.mirantis.com/blog/cloud-prizefight-vmware-vs-openstack/


"there is no official support for VM-level HA in OpenStack—it was initially
planned for the Folsom release but was later dropped/postponed. There is
currently an incubation project called Evacuate that is adding support for
VM-level HA to OpenStack."

On Jul 15, 2013 7:25 AM, "Shanker Balan" <[email protected]>
wrote:

>  On 15-Jul-2013, at 12:03 PM, Chiradeep Vittal <
> [email protected]> wrote:
>
> A robust solution would probably involve Apache Zookeeper (using Curator
> perhaps) to perform robust distributed locking and/or leader election.
>
>
>
>  Just curious - Any idea as to how OpenStack deals with a failed KVM host
> in a cluster?
>
>
>
> On 7/15/13 3:51 PM, "Chiradeep Vittal" <[email protected]>
> wrote:
>
> Indeed HA is very tricky as you note. In the generic case where the MS
> cannot communicate with the agent, nothing can be concluded and the MS
> does nothing.
> I dug this up and posted it to the wiki
> https://cwiki.apache.org/confluence/x/dwn8AQ
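
The "nothing can be concluded" behaviour can be pictured as a chain of checks that each answer up, down, or "I don't know", where only a definite answer triggers an action. What follows is a minimal sketch under that assumption. The function names (`investigate`, `decide`, `unreachable`, `confirmed_down`) and the action strings are hypothetical and are not CloudStack's classes or API.

```python
from typing import Callable, Iterable, Optional

# Each investigator returns True (host is up), False (host is down),
# or None ("I don't know"). All names here are illustrative only.
Investigator = Callable[[str], Optional[bool]]


def investigate(host: str, investigators: Iterable[Investigator]) -> Optional[bool]:
    """Return the first definite answer, or None if no check can tell."""
    for check in investigators:
        answer = check(host)
        if answer is not None:
            return answer
    return None


def decide(host: str, investigators: Iterable[Investigator]) -> str:
    """Map the investigation result to an action for the management server."""
    state = investigate(host, investigators)
    if state is None:
        return "do nothing"  # nothing can be concluded, so no HA restart
    return "keep running" if state else "restart HA VMs elsewhere"


def unreachable(host: str) -> Optional[bool]:
    """Stand-in for a check that could not reach the host at all."""
    return None


def confirmed_down(host: str) -> Optional[bool]:
    """Stand-in for a check that proved the host is powered off."""
    return False
```

With only `unreachable` checks in the chain, `decide` returns "do nothing", which mirrors the log messages in this thread. A restart is only chosen once some check can return a definite `False`.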
>
>
> On 7/15/13 1:20 PM, "Marcus Sorensen" <[email protected]> wrote:
>
> I don't know much about HA in regards to management server/agent
> connectivity, but it seems to me like this is perilous ground.  If a
> host loses connection with the management server, it seems to me that
> the management server doesn't have the resources to determine whether
> it should start HA-enabled VMs elsewhere. You could very well end up
> with VMs running in two or three places at once, corrupting them, just
> because a host failed to check in. Maybe the agent was stopped (that
> happens all the time). The management server has no fencing
> capability, hence the "I don't know, doing nothing" messages are the
> correct behavior. That doesn't seem to be KVM-specific,
> however.
>
> I'm very interested in hearing the details on how this HA was intended
> to work, or how it might be working on other platforms.  One solution
> may be to leverage the secondary storage to create locks for VMs. Then
> again, since VMs can run without the agent, that seems prone to deadlock
> (how does another node take over when a host holds the lock and
> appears to be down, but is actually still running the VM?).
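
One common way to avoid that deadlock is to make the lock a lease that expires unless renewed, and to attach a fencing token that increases with every new grant. The sketch below is an in-memory model only: `VmLockStore`, `Lease`, and the `ttl` parameter are hypothetical names, and a real deployment would need the store to live on shared infrastructure. It also assumes that the storage layer (or the host itself) refuses to act on a stale token, which is the part that actually prevents a VM from running in two places.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    owner: str
    expires_at: float
    token: int  # fencing token, increases with every new grant


class VmLockStore:
    """In-memory stand-in for a shared lock store (hypothetical)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self._leases: Dict[str, Lease] = {}
        self._next_token = 1

    def acquire(self, vm: str, host: str, now: float) -> Optional[int]:
        """Grant or renew a lease; return its token, or None if another host holds it."""
        lease = self._leases.get(vm)
        if lease and lease.owner == host and lease.expires_at > now:
            lease.expires_at = now + self.ttl  # renewal keeps the same token
            return lease.token
        if lease and lease.expires_at > now:
            return None  # another host holds an unexpired lease
        token = self._next_token
        self._next_token += 1
        self._leases[vm] = Lease(host, now + self.ttl, token)
        return token

    def is_current(self, vm: str, token: int) -> bool:
        """Check a token before acting: only the most recent grant is valid."""
        lease = self._leases.get(vm)
        return lease is not None and lease.token == token
```

A host that cannot renew its lease has to stop the VM on its own before the lease runs out. Otherwise the takeover is only safe if every write path checks `is_current` first.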
>
> On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <[email protected]>
> wrote:
>
> I bumped this from the user list as we've just come across the same
> issue.
>
> CloudStack does not react or even change host status when contact is
> lost with a KVM host.
>
> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> (AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning
> null ('I don't know')
> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-1:null) could not reach agent, could not reach agent's
> host, returning that we don't have enough information
> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-1:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-1:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-13 17:53:56,695 WARN  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>
> HA for KVM is almost useless.
>
> I suggest this be a blocker for any release until fixed.
>
>
> Regards,
>
> Paul Angus
> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
> [email protected]
>
> -----Original Message-----
> From: Koushik Das [mailto:[email protected]]
> Sent: 12 July 2013 12:21
> To: [email protected]
> Subject: RE: cs 4.1 host disconnected status
>
> I looked at the logs and none of the existing investigators are able to
> determine that the host is down. I am not sure if there is a clean way
> to identify if a host is down in case of KVM. Consider the following
> cases:
>
> 1. Host is actually shut down
> 2. Management NIC of the host is unplugged from the network, but the
> host is up and running
>
> There is no clean way to distinguish these cases. CloudStack should
> only mark the host as down in the first case, but I am not sure how one
> would achieve this.
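
One way to tell the two cases apart is a second signal that does not travel over the management network, for example a timestamp the host periodically writes to shared storage. The sketch below assumes such a heartbeat exists. `classify_host`, `last_heartbeat`, and the 60-second `stale_after` default are hypothetical and are not taken from CloudStack.

```python
from typing import Optional


def classify_host(agent_reachable: bool,
                  last_heartbeat: Optional[float],
                  now: float,
                  stale_after: float = 60.0) -> str:
    """Classify a host from agent reachability plus a storage heartbeat.

    last_heartbeat is the time of the newest heartbeat seen on shared
    storage, or None if no heartbeat was ever recorded.
    """
    if agent_reachable:
        return "up"
    if last_heartbeat is None:
        return "unknown"  # no second signal, so nothing can be concluded
    if now - last_heartbeat <= stale_after:
        return "alive-but-unreachable"  # case 2: management NIC unplugged
    return "down"  # case 1: heartbeat went stale along with the agent
```

A stale heartbeat can also mean the host lost its storage path rather than power, so "down" here is only as trustworthy as the assumption that a host without storage cannot keep its VMs writing.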
>
> -Koushik
>
> -----Original Message-----
> From: Valery Ciareszka [mailto:[email protected]]
> Sent: Friday, July 12, 2013 2:39 PM
> To: [email protected]
> Subject: Re: cs 4.1 host disconnected status
>
> I've simulated the crash again and here is the log:
> http://thesuki.org/temp/cs.log.txt
> I stripped out the GET requests containing API keys.
> The server was switched off at 8:36.
>
> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
> <[email protected]> wrote:
>
> Looks like the KVM investigator is not able to determine the state
> of the agent. Can you share the full log?
>
> -----Original Message-----
> From: Valery Ciareszka [mailto:[email protected]]
> Sent: Thursday, July 11, 2013 7:47 PM
> To: users
> Subject: cs 4.1 host disconnected status
>
> Hi all.
>
> I use the following environment: CS 4.1, KVM, Centos 6.4
> (management+node1+node2), OpenIndiana NFS server as primary and
> secondary storage.
> and I have the following problem:
> If I switch one hypervisor node off via ipmi (simulate server
> crash), it never goes to Disconnected status in management.
> Accordingly, ha-enabled VMs are not restarted on another hypervisor
> node, because it believes that disconnected node is still online.
>
>
> I get following in management server logs:
>
> 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
> (AgentManager-Handler-13:null) Seq 19-1133189098: Processing:
> { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10,
> [{"Answer":{"result":false,"details":"Unable to ping computing host,
> exiting","wait":0}}] }
> 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request]
> (AgentTaskPool-1:null) Seq 19-1133189098: Received:  { Ans: , MgmtId:
> 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning
> null ('I don't know')
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator]
> (AgentTaskPool-1:null) could not reach agent, could not reach agent's
> host, returning that we don't have enough information
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-1:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> (AgentTaskPool-1:null) null unable to determine the state of the host.
> Moving on.
> 2013-07-11 10:19:16,153 WARN  [agent.manager.AgentManagerImpl]
> (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>
>
> If I power on dead node, it goes to state "Connecting" and then "Up"
> in management interface.
>
> 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null)
> Ping timeout for host 12, do invstigation
> 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null)
> Ping timeout for host 12, do invstigation
> 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null)
> Ping timeout for host 12, do invstigation
> 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status]
> (AgentConnectTaskPool-5:null) Transition:[Resource state =
> Enabled, Agent event = AgentConnected, Host id = 12, name =
> ad112.colobridge.net]
> 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status]
> (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
> = ad112.colobridge.net; old status = Up; event = AgentConnected;
> new status = Connecting; old update count = 1285; new update count =
> 1286]
> 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status]
> (AgentConnectTaskPool-5:null) Transition:[Resource state =
> Enabled, Agent event = Ready, Host id = 12, name =
> ad112.colobridge.net]
> 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status]
> (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name
> = ad112.colobridge.net; old status = Connecting; event = Ready;
> new status = Up; old update count = 1286; new update count = 1287]
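
The status updates in this log read like a small state machine keyed on (old status, event). The sketch below encodes only the two transitions that actually appear in the log. `TRANSITIONS` and `next_status` are hypothetical names, and leaving the status unchanged for unlisted pairs is an assumption made for illustration, not CloudStack's documented behaviour.

```python
from typing import Dict, Tuple

# Only the two transitions visible in the log excerpt are listed here.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("Up", "AgentConnected"): "Connecting",
    ("Connecting", "Ready"): "Up",
}


def next_status(status: str, event: str) -> str:
    """Return the new status, or the old one if the pair is not in the table."""
    return TRANSITIONS.get((status, event), status)
```

Note that the first logged update has "old status = Up" at the moment the agent reconnects, so the host was never moved out of "Up" while it was powered off.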
>
>
> If I restart cloud-management service, dead node goes to state
> "Disconnected" in management interface.
> (there is nothing special in logs in this case)
>
> If I do nothing,  dead node could stay in "Up" state forever (I
> waited for 12 hours) in management interface, throwing into logs
> "Agent state cannot be determined, do nothing"
>
> Would appreciate if someone could help/suggest how to deal with
> this problem.
>
> --
> Regards,
> Valery
>
> http://protocol.by/slayer
>
>
> This email and any attachments to it may be confidential and are
> intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do
> not necessarily represent those of Shape Blue Ltd or related companies.
> If you are not the intended recipient of this email, you must neither
> take any action based upon its contents, nor copy or show it to anyone.
> Please contact the sender if you believe you have received this email in
> error. Shape Blue Ltd is a company incorporated in England & Wales.
> ShapeBlue Services India LLP is operated under license from Shape Blue
> Ltd. ShapeBlue is a registered trademark.
>
>
>
>
>
> --
> Shanker Balan
> Managing Consultant
>
>
>
>  M: +91 98860 60539
>  [email protected] | www.shapeblue.com | Twitter:@shapeblue
>  ShapeBlue India, 22nd floor, Unit 2201A, World Trade Centre, Bangalore -
> 560 055
>
>
