For OpenStack, look at the current state of "evacuate":
http://www.mirantis.com/blog/cloud-prizefight-vmware-vs-openstack/

"there is no official support for VM-level HA in OpenStack - it was initially planned for the Folsom release but was later dropped/postponed. There is currently an incubation project called Evacuate that is adding support for VM-level HA to OpenStack."

On Jul 15, 2013 7:25 AM, "Shanker Balan" <shanker.ba...@shapeblue.com> wrote:

> On 15-Jul-2013, at 12:03 PM, Chiradeep Vittal
> <chiradeep.vit...@citrix.com> wrote:
>
> A robust solution would probably involve Apache ZooKeeper (using Curator
> perhaps) to perform robust distributed locking and/or leader election.
>
> Just curious - any idea as to how OpenStack deals with a failed KVM host
> in a cluster?
>
> On 7/15/13 3:51 PM, "Chiradeep Vittal" <chiradeep.vit...@citrix.com>
> wrote:
>
> Indeed HA is very tricky as you note. In the generic case where the MS
> cannot communicate with the agent, nothing can be concluded and the MS
> does nothing.
> I dug this up and posted it to the wiki
> https://cwiki.apache.org/confluence/x/dwn8AQ
>
> On 7/15/13 1:20 PM, "Marcus Sorensen" <shadow...@gmail.com> wrote:
>
> I don't know much about HA in regards to management server/agent
> connectivity, but it seems to me like this is perilous ground. If a
> host loses connection with the management server, it seems to me that
> the management server doesn't have the resources to determine whether
> it should start HA-enabled VMs elsewhere. You could very well end up
> with VMs running in two or three places at once, corrupting them, just
> because a host failed to check in. Maybe the agent was stopped (that
> happens all the time). The management server has no fencing
> capability, hence the messages "I don't know, doing nothing" are the
> correct thing to do. That doesn't seem like it's KVM specific,
> however.
>
> I'm very interested in hearing the details on how this HA was intended
> to work, or how it might be working on other platforms. One solution
> may be to leverage the secondary storage to create locks for VMs, then
> again, when VMs can run without the agent it seems prone to deadlock
> (how does another node take over when another host has the lock, but
> the host seems down, but is actually running the VM?).
>
> On Mon, Jul 15, 2013 at 1:31 AM, Paul Angus <paul.an...@shapeblue.com>
> wrote:
>
> I bumped this from the user list as we've just come across the same
> issue.
>
> CloudStack does not react or even change host status when contact is
> lost with a KVM host.
>
> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (10.0.100.51) cannot be pinged, returning null ('I don't know')
> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
> 2013-07-13 17:53:56,695 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
> 2013-07-13 17:53:56,695 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>
> HA for KVM is almost useless.
>
> I suggest this is a blocker for any release until fixed.
>
> Regards,
>
> Paul Angus
> S: +44 20 3603 0540 | M: +447711418784 | T: CloudyAngus
> paul.an...@shapeblue.com
>
> -----Original Message-----
> From: Koushik Das [mailto:koushik....@citrix.com]
> Sent: 12 July 2013 12:21
> To: us...@cloudstack.apache.org
> Subject: RE: cs 4.1 host disconnected status
>
> I looked at the logs and none of the existing investigators are able to
> determine that the host is down. I am not sure if there is a clean way
> to identify if a host is down in case of KVM. Consider the following
> cases:
>
> 1. Host is actually shut down
> 2. Management NIC of the host is unplugged from the network but host is
> up and running
>
> There is no clean way to distinguish these cases. CloudStack should
> only mark the host as down in the first case. But not sure how one would
> achieve this.
>
> -Koushik
>
> -----Original Message-----
> From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
> Sent: Friday, July 12, 2013 2:39 PM
> To: us...@cloudstack.apache.org
> Subject: Re: cs 4.1 host disconnected status
>
> I've simulated the crash again and here is the log:
> http://thesuki.org/temp/cs.log.txt
> I stripped the GET requests with API keys out of it.
> The server was switched off at 8:36.
>
> On Fri, Jul 12, 2013 at 11:17 AM, Koushik Das
> <koushik....@citrix.com> wrote:
>
> Looks like the KVM investigator is not able to determine the state
> of the agent. Can you share the full log?
>
> -----Original Message-----
> From: Valery Ciareszka [mailto:valery.teres...@gmail.com]
> Sent: Thursday, July 11, 2013 7:47 PM
> To: users
> Subject: cs 4.1 host disconnected status
>
> Hi all.
>
> I use the following environment: CS 4.1, KVM, CentOS 6.4
> (management+node1+node2), OpenIndiana NFS server as primary and
> secondary storage.
> I have the following problem:
> If I switch one hypervisor node off via IPMI (simulate server
> crash), it never goes to Disconnected status in management.
> Accordingly, HA-enabled VMs are not restarted on another hypervisor
> node, because the management server believes that the disconnected
> node is still online.
>
> I get the following in the management server logs:
>
> 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentManager-Handler-13:null) Seq 19-1133189098: Processing: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, [{"Answer":{"result":false,"details": "Unable to ping computing host, exiting","wait":0}}] }
> 2013-07-11 10:19:16,153 DEBUG [agent.transport.Request] (AgentTaskPool-1:null) Seq 19-1133189098: Received: { Ans: , MgmtId: 161603152803976, via: 19, Ver: v1, Flags: 10, { Answer } }
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-1:null) host (172.16.20.241) cannot be pinged, returning null ('I don't know')
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-1:null) could not reach agent, could not reach agent's host, returning that we don't have enough information
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
> 2013-07-11 10:19:16,153 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-1:null) null unable to determine the state of the host. Moving on.
> 2013-07-11 10:19:16,153 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-1:null) Agent state cannot be determined, do nothing
>
> If I power on the dead node, it goes to state "Connecting" and then
> "Up" in the management interface.
>
> 2013-07-11 13:57:24,311 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
> 2013-07-11 13:58:24,315 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
> 2013-07-11 13:59:24,320 DEBUG [cloud.host.Status] (Thread-6:null) Ping timeout for host 12, do invstigation
> 2013-07-11 13:59:57,239 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = AgentConnected, Host id = 12, name = ad112.colobridge.net]
> 2013-07-11 13:59:57,264 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Up; event = AgentConnected; new status = Connecting; old update count = 1285; new update count = 1286]
> 2013-07-11 14:00:50,611 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Transition:[Resource state = Enabled, Agent event = Ready, Host id = 12, name = ad112.colobridge.net]
> 2013-07-11 14:00:50,633 DEBUG [cloud.host.Status] (AgentConnectTaskPool-5:null) Agent status update: [id = 12; name = ad112.colobridge.net; old status = Connecting; event = Ready; new status = Up; old update count = 1286; new update count = 1287]
>
> If I restart the cloud-management service, the dead node goes to state
> "Disconnected" in the management interface.
> (there is nothing special in the logs in this case)
>
> If I do nothing, the dead node can stay in "Up" state forever (I
> waited for 12 hours) in the management interface, throwing into the
> logs "Agent state cannot be determined, do nothing"
>
> Would appreciate it if someone could help/suggest how to deal with
> this problem.
>
> --
> Regards,
> Valery
>
> http://protocol.by/slayer
>
> This email and any attachments to it may be confidential and are
> intended solely for the use of the individual to whom it is addressed.
> Any views or opinions expressed are solely those of the author and do
> not necessarily represent those of Shape Blue Ltd or related companies.
> If you are not the intended recipient of this email, you must neither
> take any action based upon its contents, nor copy or show it to anyone.
> Please contact the sender if you believe you have received this email in
> error. Shape Blue Ltd is a company incorporated in England & Wales.
> ShapeBlue Services India LLP is operated under license from Shape Blue
> Ltd. ShapeBlue is a registered trademark.
>
> --
> Shanker Balan
> Managing Consultant
>
> M: +91 98860 60539
> shanker.ba...@shapeblue.com | www.shapeblue.com | Twitter:@shapeblue
> ShapeBlue India, 22nd floor, Unit 2201A, World Trade Centre, Bangalore -
> 560 055
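
For reference, the investigator behaviour discussed in this thread can be sketched roughly as follows. This is only an illustration in Python, not CloudStack's actual code: the names `HostState`, `investigate`, `should_restart_vms`, `ping_from_management` and `out_of_band_power_check` are all hypothetical, and the out-of-band check is an assumed stand-in for some fencing-style confirmation that the thread says the management server currently lacks. The point is that each check answers "up", "down" or "I don't know", and HA-enabled VMs are only restarted elsewhere on a definite "down".

```python
from enum import Enum
from typing import Callable, Iterable, Optional


class HostState(Enum):
    UP = "Up"
    DOWN = "Down"


# An investigator returns UP, DOWN, or None when it cannot tell
# (the "I don't know" case reported in the logs in this thread).
Investigator = Callable[[str], Optional[HostState]]


def investigate(host: str, investigators: Iterable[Investigator]) -> Optional[HostState]:
    """Return the first definite answer, or None if no investigator can tell."""
    for check in investigators:
        state = check(host)
        if state is not None:
            return state
    return None


def should_restart_vms(state: Optional[HostState]) -> bool:
    """Restart HA-enabled VMs elsewhere only when the host is confirmed down."""
    return state is HostState.DOWN


# Hypothetical investigators, for illustration only.
def ping_from_management(host: str) -> Optional[HostState]:
    # An unanswered ping proves nothing: the host may be powered off, or only
    # its management NIC may be unplugged while its VMs keep running.
    return None


def out_of_band_power_check(host: str) -> Optional[HostState]:
    # Assumed stand-in for a fencing-style check that can confirm power-off.
    confirmed_powered_off = {"10.0.100.51"}
    return HostState.DOWN if host in confirmed_powered_off else None


if __name__ == "__main__":
    host = "10.0.100.51"
    print(should_restart_vms(investigate(host, [ping_from_management])))
    print(should_restart_vms(
        investigate(host, [ping_from_management, out_of_band_power_check])))
```

With only the ping-style check, the result stays undetermined and nothing is restarted, which matches the "do nothing" behaviour in the logs. Adding a check that can positively confirm the host is down is what would allow a safe restart without risking the same VM running in two places.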