The CLOUDSTACK-3535 bug report looks like it describes the problem perfectly. What else can we add?

On Wed, Jul 24, 2013 at 7:20 PM, Chip Childers <[email protected]> wrote:
> This sucks.
>
> Can one of the folks on this thread please open a bug with as much
> information as possible? I'd like to make sure that someone picks up the
> issue and gets it resolved for the next release.
>
> On Wed, Jul 24, 2013 at 7:26 PM, Bryan Whitehead <[email protected]> wrote:
>
>> This same thing happened to me - but it was a Power-Supply that died
>> on a box. All my templates have HA turned on.
>>
>> All the VM's (including 1 system-router-vm) were shown as "Running"
>> and the host itself was simply marked "Disconnected". When I tried to
>> shutdown the VM's to start them again I got errors about not being
>> able to communicate with the agent. I tried restarting the management
>> server but that didn't change anything.
>>
>> Getting the router working again was extremely annoying. After
>> changing it to Stopped it kept trying to start it again on the dead
>> host. I marked it destroyed then restarted the network with the force
>> option. That fixed it. After I hacked the DB to get all my VM's not
>> running with state Running to Stopped, then I was able to start all
>> the VM's that were down on the bad host.
>>
>> Anyway, The time between host death and me finding out was about 4
>> days - as these were on managed servers of a customer and their
>> monitoring of each host wasn't working. They were pretty unhappy. :(
>>
>> Other notes: this is KVM with sharedmountpoint on a gluster mount.
>> After host got back online gluster rsynced about 200GB of data - I
>> migrated VM's to the host at the same time as normal. I've had a
>> similar things happen with 3.0.2 install of cloudstack and everything
>> seamlessly restarted.
>> Disappointing this happened with 4.1
>>
>> On Wed, Jul 24, 2013 at 9:23 AM, Indra Pramana <[email protected]> wrote:
>> > Dear Chip, Geoff and all,
>> >
>> > I scrutinized the management server's logs during the time when I shutdown
>> > the host and the time when I turned the host back on.
>> >
>> > This is the management server's logs when the host is being shut down:
>> >
>> > http://pastebin.com/4wfV830Z
>> >
>> > During the time, I noted that there are quite a lot of "Sending Disconnect
>> > to listener" messages, which implies that the management server try to
>> > notify other listeners that the host is going down. However, subsequently I
>> > didn't see any messages on the logs showing that the management server is
>> > trying to activate the HA capability to start the affected VMs on another
>> > available host.
>> >
>> > This is the management server's logs when the host is being turned back on:
>> >
>> > http://pastebin.com/JrLJxbXH
>> >
>> > When the agent is reconnected, then CloudStack marked the affected VMs as
>> > stopped from previously running:
>> >
>> > ===
>> > 2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (AgentConnectTaskPool-7:null) Found 5 VMs for host 34
>> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
>> > realState = Stopped
>> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
>> > realState = Stopped
>> > 2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM does not require investigation so I'm
>> > marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
>> > 2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping
>> > with event: StopRequestedvm's original host id: 28
>> > new host id: 34 host id
>> > before state transition: 34
>> > ===
>> >
>> > Then the HA starts to kick in.
>> >
>> > ===
>> > 2013-07-24 23:04:57,955 INFO [cloud.ha.HighAvailabilityManagerImpl]
>> > (HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
>> > 2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM state transitted from :Running to Stopping
>> > with event: StopRequestedvm's original host id: 28 new host id: 34 host id
>> > before state transition: 34
>> > 2013-07-24 23:04:57,960 DEBUG [agent.transport.Request]
>> > (AgentConnectTaskPool-7:null) Seq 34-105644038: Sending { Cmd , MgmtId:
>> > 161342671900, via: 34, Ver: v1, Flags: 100111,
>> > [{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
>> > 2013-07-24 23:04:57,968 INFO [cloud.ha.HighAvailabilityManagerImpl]
>> > (HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
>> > 2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl]
>> > (HA-Worker-1:work-307) VM state transitted from :Stopped to Starting with
>> > event: StartRequestedvm's original host id: 28 new host id: null host id
>> > before state transition: null
>> > 2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Successfully transitioned to start state for
>> > VM[User|Ubuntu-12-04-2-64bit] reservation id =
>> > b56364ef-90d8-443f-a348-7660fda48d34
>> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and podId: 6
>> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null, hosts: null
>> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Root volume is ready, need to place VM in volume's
>> > cluster
>> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing deployment
>> > plan to use this pool's dcId: 6 , podId: 6 , and clusterId: 6
>> > ===
>> >
>> > My question is why HA only kicks in when the host is turned back on? By
>> > right it should kick in soon after the host is shut down and marked as
>> > "Disconnected".
>> >
>> > Any insights on the possible solutions to this problem is highly
>> > appreciated.
>> >
>> > Looking forward to your reply, thank you.
>> >
>> > Cheers.
>> >
>> > On Thu, Jul 25, 2013 at 12:00 AM, Indra Pramana <[email protected]> wrote:
>> >
>> >> Hi Chip,
>> >>
>> >> Yes, "Offer HA" is set to "Yes" on all my compute offerings.
>> >>
>> >> Hi Geoff,
>> >>
>> >> Yes, I am using KVM. Is this a known issue and is there any solution to
>> >> this problem?
>> >>
>> >> Looking forward to your reply, thank you.
>> >>
>> >> Cheers.
>> >>
>> >> On Wed, Jul 24, 2013 at 11:38 PM, Geoff Higginbottom <[email protected]> wrote:
>> >>
>> >>> Is it running on KVM, we are seeing some real issue with HA simply not
>> >>> working on KVM.
>> >>>
>> >>> Regards
>> >>>
>> >>> Geoff Higginbottom
>> >>>
>> >>> D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581
>> >>>
>> >>> [email protected]
>> >>>
>> >>> -----Original Message-----
>> >>> From: Chip Childers [mailto:[email protected]]
>> >>> Sent: 24 July 2013 16:37
>> >>> To: <[email protected]>
>> >>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
>> >>>
>> >>> Did you enable HA for your compute offering?
>> >>>
>> >>> On Jul 24, 2013, at 11:25 AM, Indra Pramana <[email protected]> wrote:
>> >>>
>> >>> > Dear all,
>> >>> >
>> >>> > I tried to shutdown one of my hypervisor hosts to simulate a server
>> >>> > failure, and the HA is not working, all the VMs on the affected host
>> >>> > is not started on another available host.
>> >>> >
>> >>> > I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD for
>> >>> > primary storage.
>> >>> >
>> >>> > My issue is similar to what is being described here:
>> >>> >
>> >>> > https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>> >>> >
>> >>> > Except that on my case, the host is indeed marked as "Disconnected"
>> >>> > but there is no attempt from CloudStack to try starting the VMs on
>> >>> > another host. I can't provide logs since there's nothing on the logs
>> >>> > which suggest that CloudStack tries to activate the HA and start the
>> >>> > affected VMs on another host.
>> >>> >
>> >>> > Anyone has similar experience? Anyone knows if the above bug has been
>> >>> > resolved?
>> >>> >
>> >>> > Looking forward to your reply, thank you.
>> >>> >
>> >>> > Cheers.
>> >>> This email and any attachments to it may be confidential and are intended
>> >>> solely for the use of the individual to whom it is addressed. Any views or
>> >>> opinions expressed are solely those of the author and do not necessarily
>> >>> represent those of Shape Blue Ltd or related companies. If you are not the
>> >>> intended recipient of this email, you must neither take any action based
>> >>> upon its contents, nor copy or show it to anyone. Please contact the sender
>> >>> if you believe you have received this email in error. Shape Blue Ltd is a
>> >>> company incorporated in England & Wales. ShapeBlue Services India LLP is
>> >>> operated under license from Shape Blue Ltd. ShapeBlue is a registered
>> >>> trademark.
