Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts

Bryan Whitehead Wed, 24 Jul 2013 19:28:18 -0700

The CLOUDSTACK-3535 bug seems to describe the problem perfectly.
What else can we add?


On Wed, Jul 24, 2013 at 7:20 PM, Chip Childers
<[email protected]> wrote:
> This sucks.
>
> Can one of the folks on this thread please open a bug with as much
> information as possible?  I'd like to make sure that someone picks up the
> issue and gets it resolved for the next release.
>
>
>
> On Wed, Jul 24, 2013 at 7:26 PM, Bryan Whitehead <[email protected]> wrote:
>
>> The same thing happened to me, but in my case it was a power supply that
>> died on a box. All my templates have HA turned on.
>>
>> All the VMs (including 1 system router VM) were shown as "Running"
>> and the host itself was simply marked "Disconnected". When I tried to
>> shut down the VMs to start them again, I got errors about not being
>> able to communicate with the agent. I tried restarting the management
>> server, but that didn't change anything.
>>
>> Getting the router working again was extremely annoying. After I
>> changed it to Stopped, CloudStack kept trying to start it again on the
>> dead host. I marked it destroyed, then restarted the network with the
>> force option. That fixed it. After I hacked the DB to change all my
>> VMs that were not actually running from state Running to Stopped, I
>> was able to start all the VMs that had been down on the bad host.
>>
>> Anyway, the time between the host dying and my finding out was about 4
>> days, as these were a customer's managed servers and their monitoring
>> of each host wasn't working. They were pretty unhappy. :(
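A check run against the management server, independent of per-host monitoring, might have flagged the dead box much sooner than 4 days. A minimal Python sketch, assuming you already fetch a `listHosts` API response as JSON (authentication and request signing are left out); the field names `listhostsresponse`, `host`, `name`, `state` and `type`, and the "Routing" type for hypervisor hosts, are assumptions to verify against your CloudStack version, and the host names are hypothetical.

```python
import json


def hypervisor_hosts_not_up(response_text):
    """Return (name, state) for each hypervisor host whose state is not "Up".

    Assumes a CloudStack listHosts response in JSON; hypervisor hosts are
    assumed to have type "Routing" (entries without a type are kept).
    """
    body = json.loads(response_text)
    hosts = body.get("listhostsresponse", {}).get("host", [])
    return [
        (h.get("name"), h.get("state"))
        for h in hosts
        if h.get("type", "Routing") == "Routing" and h.get("state") != "Up"
    ]


# Hypothetical response: one healthy host, one that lost its power supply.
hosts_json = """{"listhostsresponse": {"count": 2, "host": [
  {"name": "kvm-01", "type": "Routing", "state": "Up"},
  {"name": "kvm-02", "type": "Routing", "state": "Disconnected"}]}}"""

print(hypervisor_hosts_not_up(hosts_json))  # [('kvm-02', 'Disconnected')]
```

Run periodically (cron or an existing monitoring system) and alert whenever the returned list is non-empty.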
>>
>> Other notes: this is KVM with sharedmountpoint on a gluster mount.
>> After the host got back online, gluster resynced about 200GB of data; I
>> migrated VMs to the host at the same time as normal. I've had similar
>> things happen with a 3.0.2 install of CloudStack and everything
>> restarted seamlessly. It's disappointing that this happened with 4.1.
>>
>> On Wed, Jul 24, 2013 at 9:23 AM, Indra Pramana <[email protected]> wrote:
>> > Dear Chip, Geoff and all,
>> >
>> > I scrutinized the management server's logs from the time I shut down
>> > the host to the time I turned it back on.
>> >
>> > This is the management server's log from when the host was being shut down:
>> >
>> > http://pastebin.com/4wfV830Z
>> >
>> > During that time, I noticed quite a lot of "Sending Disconnect to
>> > listener" messages, which suggests that the management server tries to
>> > notify other listeners that the host is going down. However, I didn't
>> > subsequently see any messages in the logs showing that the management
>> > server was trying to activate HA to start the affected VMs on another
>> > available host.
>> >
>> > This is the management server's log from when the host was being turned
>> > back on:
>> >
>> > http://pastebin.com/JrLJxbXH
>> >
>> > When the agent reconnected, CloudStack changed the state of the affected
>> > VMs from Running to Stopped:
>> >
>> > ===
>> > 2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (AgentConnectTaskPool-7:null) Found 5 VMs for host 34
>> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
>> > realState = Stopped
>> > 2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and
>> > realState = Stopped
>> > 2013-07-24 23:04:57,408 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM does not require investigation so I'm
>> > marking it as Stopped: VM[User|Ubuntu-12-04-2-64bit]
>> > 2013-07-24 23:04:57,450 DEBUG [cloud.capacity.CapacityManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM state transitted from :Running to
>> > Stopping with event: StopRequestedvm's original host id: 28 new host id:
>> > 34 host id before state transition: 34
>> > ===
>> >
>> > Then HA starts to kick in:
>> >
>> > ===
>> > 2013-07-24 23:04:57,955 INFO  [cloud.ha.HighAvailabilityManagerImpl]
>> > (HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]
>> > 2013-07-24 23:04:57,956 DEBUG [cloud.capacity.CapacityManagerImpl]
>> > (AgentConnectTaskPool-7:null) VM state transitted from :Running to
>> > Stopping with event: StopRequestedvm's original host id: 28 new host id:
>> > 34 host id before state transition: 34
>> > 2013-07-24 23:04:57,960 DEBUG [agent.transport.Request]
>> > (AgentConnectTaskPool-7:null) Seq 34-105644038: Sending  { Cmd , MgmtId:
>> > 161342671900, via: 34, Ver: v1, Flags: 100111,
>> > [{"StopCommand":{"isProxy":false,"vmName":"i-2-281-VM","wait":0}}] }
>> > 2013-07-24 23:04:57,968 INFO  [cloud.ha.HighAvailabilityManagerImpl]
>> > (HA-Worker-1:work-307) HA on VM[User|Ubuntu-12-04-2-64bit]
>> > 2013-07-24 23:04:57,984 DEBUG [cloud.capacity.CapacityManagerImpl]
>> > (HA-Worker-1:work-307) VM state transitted from :Stopped to Starting with
>> > event: StartRequestedvm's original host id: 28 new host id: null host id
>> > before state transition: null
>> > 2013-07-24 23:04:57,984 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Successfully transitioned to start state for
>> > VM[User|Ubuntu-12-04-2-64bit] reservation id =
>> > b56364ef-90d8-443f-a348-7660fda48d34
>> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Trying to deploy VM, vm has dcId: 6 and podId: 6
>> > 2013-07-24 23:04:58,025 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Deploy avoids pods: null, clusters: null, hosts:
>> > null
>> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Root volume is ready, need to place VM in volume's
>> > cluster
>> > 2013-07-24 23:04:58,031 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>> > (HA-Worker-1:work-307) Vol[295|vm=273|ROOT] is READY, changing deployment
>> > plan to use this pool's dcId: 6 , podId: 6 , and clusterId: 6
>> > ===
>> >
>> > My question is: why does HA only kick in when the host is turned back on?
>> > It should kick in soon after the host is shut down and marked as
>> > "Disconnected".
>> >
>> > Any insights into possible solutions to this problem would be highly
>> > appreciated.
>> >
>> > Looking forward to your reply, thank you.
>> >
>> > Cheers.
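Indra's question comes down to two facts in the management-server log: which VMs CloudStack still recorded as Running while the agent reported them Stopped, and when the first HA work item was actually processed. A minimal Python sketch that extracts both, assuming each log entry sits on a single line in the layout seen in the excerpts above (the mail client wrapped them here); the regular expressions are inferred from those excerpts, not from CloudStack documentation. Comparing the `first_ha_work` timestamp with the time the host went down shows how long HA waited.

```python
import re
from datetime import datetime

# Layout inferred from the excerpts: "<date> <time>,<ms> <LEVEL> [<category>] (<thread>) <message>"
LOG_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+(\w+)\s+\[([^\]]+)\]\s+\(([^)]*)\)\s+(.*)$"
)
MISMATCH_RE = re.compile(r"VM (\S+): cs state = (\w+) and realState = (\w+)")


def parse(line):
    """Return (timestamp, level, category, thread, message), or None if the line doesn't match."""
    m = LOG_RE.match(line)
    if not m:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S,%f")
    return ts, m.group(2), m.group(3), m.group(4), m.group(5)


def state_mismatches(lines):
    """Map VM name -> (CloudStack state, real state) where the two differ."""
    found = {}
    for line in lines:
        entry = parse(line)
        if entry is None:
            continue
        m = MISMATCH_RE.search(entry[4])
        if m and m.group(2) != m.group(3):
            found[m.group(1)] = (m.group(2), m.group(3))
    return found


def first_ha_work(lines):
    """Timestamp of the first "Processing HAWork" entry, or None if HA never ran."""
    for line in lines:
        entry = parse(line)
        if entry and "Processing HAWork" in entry[4]:
            return entry[0]
    return None


log_lines = [
    "2013-07-24 23:04:57,406 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-7:null) Found 5 VMs for host 34",
    "2013-07-24 23:04:57,408 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentConnectTaskPool-7:null) VM i-2-273-VM: cs state = Running and realState = Stopped",
    "2013-07-24 23:04:57,955 INFO  [cloud.ha.HighAvailabilityManagerImpl] (HA-Worker-1:work-307) Processing HAWork[307-HA-273-Stopped-Scheduled]",
]

print(state_mismatches(log_lines))  # {'i-2-273-VM': ('Running', 'Stopped')}
print(first_ha_work(log_lines))     # 2013-07-24 23:04:57.955000
```

In Indra's case the first HA work item appears only after the agent reconnects, which is exactly the delay being asked about.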
>> >
>> >
>> >
>> > On Thu, Jul 25, 2013 at 12:00 AM, Indra Pramana <[email protected]> wrote:
>> >
>> >> Hi Chip,
>> >>
>> >> Yes, "Offer HA" is set to "Yes" on all my compute offerings.
>> >>
>> >> Hi Geoff,
>> >>
>> >> Yes, I am using KVM. Is this a known issue and is there any solution to
>> >> this problem?
>> >>
>> >> Looking forward to your reply, thank you.
>> >>
>> >> Cheers.
>> >>
>> >>
>> >>
>> >> On Wed, Jul 24, 2013 at 11:38 PM, Geoff Higginbottom <
>> >> [email protected]> wrote:
>> >>
>> >>> Is it running on KVM? We are seeing some real issues with HA simply not
>> >>> working on KVM.
>> >>>
>> >>> Regards
>> >>>
>> >>> Geoff Higginbottom
>> >>>
>> >>> D: +44 20 3603 0542 | S: +44 20 3603 0540 | M: +447968161581
>> >>>
>> >>> [email protected]
>> >>>
>> >>> -----Original Message-----
>> >>> From: Chip Childers [mailto:[email protected]]
>> >>> Sent: 24 July 2013 16:37
>> >>> To: <[email protected]>
>> >>> Subject: Re: HA not working - CloudStack 4.1.0 and KVM hypervisor hosts
>> >>>
>> >>> Did you enable HA for your compute offering?
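Chip's question can also be checked without the UI. A minimal Python sketch, assuming a `listServiceOfferings` response fetched as JSON; the wrapper key `listserviceofferingsresponse`, the `serviceoffering` list and the `offerha` flag (assumed to be the API counterpart of the "Offer HA" setting) are field names to verify against your CloudStack version, and the offering names are hypothetical.

```python
import json


def offerings_without_ha(response_text):
    """Return the names of compute offerings that do not offer HA.

    Assumes a CloudStack listServiceOfferings response in JSON, with the
    HA setting exposed as a boolean "offerha" field (missing means False).
    """
    body = json.loads(response_text)
    offerings = body.get("listserviceofferingsresponse", {}).get("serviceoffering", [])
    return [o.get("name") for o in offerings if not o.get("offerha", False)]


# Hypothetical response with one HA-enabled and one non-HA offering.
offerings_json = """{"listserviceofferingsresponse": {"count": 2, "serviceoffering": [
  {"name": "medium-ha", "offerha": true},
  {"name": "small-no-ha", "offerha": false}]}}"""

print(offerings_without_ha(offerings_json))  # ['small-no-ha']
```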
>> >>>
>> >>> On Jul 24, 2013, at 11:25 AM, Indra Pramana <[email protected]> wrote:
>> >>>
>> >>> > Dear all,
>> >>> >
>> >>> > I tried to shut down one of my hypervisor hosts to simulate a server
>> >>> > failure, and HA is not working: none of the VMs on the affected host
>> >>> > are started on another available host.
>> >>> >
>> >>> > I am using CloudStack 4.1.0 with KVM hypervisors and Ceph RBD for
>> >>> > primary storage.
>> >>> >
>> >>> > My issue is similar to what is being described here:
>> >>> >
>> >>> > https://issues.apache.org/jira/browse/CLOUDSTACK-3535
>> >>> >
>> >>> > The difference is that in my case, the host is indeed marked as
>> >>> > "Disconnected", but CloudStack makes no attempt to start the VMs on
>> >>> > another host. I can't provide logs, since nothing in the logs
>> >>> > suggests that CloudStack tries to activate HA and start the affected
>> >>> > VMs on another host.
>> >>> >
>> >>> > Has anyone had a similar experience? Does anyone know whether the
>> >>> > above bug has been resolved?
>> >>> >
>> >>> > Looking forward to your reply, thank you.
>> >>> >
>> >>> > Cheers.
>> >>>
>> >>
>> >>
>>
>>
