Ilya,
Can you check in the MS logs whether more than one thread tries to process 
connect for the same host simultaneously? In the case of XS I had seen that 
the VM states could be determined incorrectly (during VM sync) if more than 
one thread processes connect to the same host at the same time. The fix for 
CLOUDSTACK-4636 can be added to the 4.1.x branch, but only if the root cause 
turns out to be the same.
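
A rough way to check this from the logs (a sketch only; the sample lines below 
are copied from the trace later in this thread, and in practice you would read 
management-server.log itself, whose path may vary by install) is to count the 
distinct AgentTaskPool threads that produced lines mentioning the same VM:

```python
import re
from collections import defaultdict

# Sample lines copied from the trace in this thread; in practice, read
# the real management-server.log instead of this string.
SAMPLE_LOG = """\
2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource] (AgentTaskPool-1:null) Detecting a new state but couldn't find a old state so adding it to the changes: i-2-262-acs-docs-fc17
2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl] (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running and realState = Stopped
"""

def threads_per_vm(log_text):
    """Map each VM name seen in the log to the set of AgentTaskPool
    threads that logged a line mentioning it."""
    seen = defaultdict(set)
    for line in log_text.splitlines():
        thread = re.search(r"\((AgentTaskPool-\d+):", line)
        vm = re.search(r"\b(i-\d+-\d+-[\w-]+)", line)
        if thread and vm:
            seen[vm.group(1)].add(thread.group(1))
    return seen

if __name__ == "__main__":
    for vm, threads in sorted(threads_per_vm(SAMPLE_LOG).items()):
        # More than one thread touching the same VM during sync would
        # point at the race described above.
        flag = "SUSPECT" if len(threads) > 1 else "ok"
        print(f"{vm}: {sorted(threads)} [{flag}]")
```

In the excerpt above only AgentTaskPool-1 appears, so it prints "ok"; run the 
same scan over the full log around the restart window to see whether multiple 
threads handled the same host.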

-Koushik

On 26-Sep-2013, at 9:18 PM, "Musayev, Ilya" <imusa...@webmd.net> wrote:

> ++Koushik
> 
> Thanks Sowmya
> 
> Koushik 
> 
> Any reason why this cannot be applied to ACS 4.1? Since I'm the Release 
> Manager for the 4.1.x branch, I will work on testing and releasing 4.1.2.
> 
> Thanks
> ilya
> 
>> -----Original Message-----
>> From: Sowmya Krishnan [mailto:sowmya.krish...@citrix.com]
>> Sent: Thursday, September 26, 2013 3:03 AM
>> To: dev@cloudstack.apache.org
>> Subject: RE: [MAJOR][BUG] ACS powers off some VMs in vSphere - when MS
>> service is restarted
>> 
>> Ilya,
>> Is this similar to this issue:
>> https://issues.apache.org/jira/browse/CLOUDSTACK-4636
>> The fix for this has gone into 4.2-forward and master. As stated in the bug,
>> this could happen if more than one thread tries to process the same host
>> after an MS restart.
>> 
>>> -----Original Message-----
>>> From: Musayev, Ilya [mailto:imusa...@webmd.net]
>>> Sent: Thursday, September 26, 2013 3:43 AM
>>> To: dev@cloudstack.apache.org
>>> Subject: RE: [MAJOR][BUG] ACS powers off some VMs in vSphere - when
>> MS
>>> service is restarted
>>> 
>>> Sorry if the error log is a bit hard to read; as an example, please
>>> track the VM name i-2-262-acs-docs-fc17.
>>> 
>>> 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource]
>>> (AgentTaskPool-1:null) Detecting a new state but couldn't find a old
>>> state so adding it to the changes: i-2-262-acs-docs-fc17
>>> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>> (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running
>>> and realState = Stopped
>>> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>> (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running
>>> and realState = Stopped
>>> 2013-09-25 14:35:53,614 INFO  [cloud.ha.HighAvailabilityManagerImpl]
>>> (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
>>> 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request]
>>> (AgentTaskPool-
>>> 1:null) Seq 11-1418264581: Sending  { Cmd , MgmtId: 345049078181, via:
>>> 11,
>>> Ver: v1, Flags: 100101,
>>> [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-
>>> acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-
>>> 2-
>>> 278-demo01t-ops-
>>> 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-
>>> appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-28
>>> 3-
>>> vmbld01l-ops-
>> 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"
>>> i-2-
>>> 285-ossec01l-ops-08","wait":0}}] }
>>> 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request]
>>> (AgentTaskPool-
>>> 1:null) Seq 11-1418264581: Executing:  { Cmd , MgmtId: 345049078181,
>>> via: 11,
>>> Ver: v1, Flags: 100101,
>>> [{"StopCommand":{"isProxy":false,"vmName":"i-2-262-
>>> acs-docs-fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-
>>> 2-
>>> 278-demo01t-ops-
>>> 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-281-acs-
>>> appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-28
>>> 3-
>>> vmbld01l-ops-
>> 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"
>>> i-2-
>>> 285-ossec01l-ops-08","wait":0}}] }
>>> 2013-09-25 14:35:53,702 INFO  [vmware.resource.VmwareResource]
>>> (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource
>>> StopCommand:
>>> {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}
>>> 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-
>>> 3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on
>>> host
>>> 2013-09-25 14:35:54,753 INFO  [vmware.resource.VmwareResource]
>>> (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM
>>> i-2-262-acs-docs-fc17 is already in stopped state
>>> 
>>>> -----Original Message-----
>>>> From: Musayev, Ilya [mailto:imusa...@webmd.net]
>>>> Sent: Wednesday, September 25, 2013 6:08 PM
>>>> To: dev@cloudstack.apache.org
>>>> Subject: [MAJOR][BUG] ACS powers off some VMs in vSphere - when MS
>>>> service is restarted
>>>> 
>>>> Hi All,
>>>> 
>>>> I'd like to raise awareness of an issue with ACS 4.1.1 (I assume
>>>> other versions are affected as well, since I could not find any
>>>> changes in the latest code that would suggest otherwise).
>>>> 
>>>> On MS server start/restart, it checks the state of all VMs; if for
>>>> some reason a VM's state is either not found or reported as Stopped,
>>>> even when that is inaccurate, the VM will *power off*.
>>>> 
>>>> In today's occurrence, half of my plant went down because ACS
>>>> invoked StopCommand on the VMs that either had no state or for some
>>>> reason had agentState == Stopped.
>>>> 
>>>> The details of this issue are in
>>>> https://issues.apache.org/jira/browse/CLOUDSTACK-4740
>>>> 
>>>> The error in the logs is:
>>>> 
>>>> 2013-09-25 14:35:49,928 DEBUG [vmware.resource.VmwareResource]
>>>> (AgentTaskPool-1:null) Detecting a new state but couldn't find a old
>>>> state so adding it to the changes: i-2-262-acs-docs-fc17
>>>> 2013-09-25 14:35:51,213 DEBUG [agent.transport.Request]
>>>> (AgentTaskPool-
>>>> 1:null) Seq -1--1: Startup request from directly connected host: {
>>>> Cmd ,
>>>> MgmtId: -1, via: -1, Ver: v1, Flags: 11,
>>>> 
>> [{"cpus":16,"speed":2199,"memory":68683468800,"dom0MinMemory":0,"po
>>>> olSync":false,"vms":{"i-8-270-CLOUD411":{"state":"Running"},"r-15-
>>>> CLOUD41-OLD":{"state":"Stopped"},"v-260-
>>>> CLOUD411":{"state":"Running"},"i-2-283-vmbld01l-ops-
>>>> 08":{"state":"Running"},"i-2-104-ACS41VM":{"state":"Running"},"--s-1
>>>> -
>>>> CLOUD41-OLD":{"state":"Running"},"i-27-280-
>>>> CLOUD411":{"state":"Running"},"i-2-285-ossec01l-ops-
>>>> 08":{"state":"Running"},"i-2-262-acs-docs-fc17":{"state":"Stopped"},
>>>> "i
>>>> -24-
>>>> 265-test3":{"state":"Running"},"cloud01l-ops-
>>>> 08.portal.webmd.com":{"state":"Running"},"i-2-278-demo01t-ops-
>>>> 08":{"state":"Running"},"s-63-CLOUD411":{"state":"Running"},"r-66-
>>>> CLOUD411":{"state":"Running"},"i-2-281-acs-
>>>> 
>> appliance":{"state":"Running"}},"caps":"hvm","hypervisorType":"VMware","
>>>> hostDetails":{"com.cloud.network.Networks.RouterPrivateIpStrategy":"
>>>> Dc
>>>> Gl
>>>> obal","NativeHA":"true"},"hypervisorVersion":"5.0","type":"Routing",
>>>> "d
>>>> ataC
>>>> enter":"2","pod":"2","cluster":"3","guid":"HostSystem:host-19716@vc0
>>>> 0q
>>>> -
>>>> ops-08.portal.webmd.com","name":"vmha62d-ops-
>>>> 08.portal.webmd.com","version":"4.1.1-
>>>> 
>> SNAPSHOT","privateIpAddress":"172.25.243.31","privateMacAddress":"68
>>>> :b
>>>> 5
>>>> 
>> :99:73:0b:c2","privateNetmask":"255.255.255.0","storageIpAddress":"172.25.
>>>> 243.31","storageNetmask":"255.255.255.0","storageMacAddress":"68:b5:
>>>> 99
>>>> :7
>>>> 3:0b:c2","wait":0},{"totalSize":0,"poolInfo":{"uuid":"72c8aedb-58c4-
>>>> 45
>>>> 69-
>>>> ac51-adc5af770bf6","host":"vmha62d-ops-
>>>> 08.portal.webmd.com","localPath":"","hostPath":"datastore-
>>>> 
>> 19718","poolType":"LVM","capacityBytes":141465485312,"availableBytes":
>>>> 14
>>>> 
>> 0383354880},"resourceType":"STORAGE_POOL","hostDetails":{},"type":"S
>>>> to
>>>> r
>>>> age","dataCenter":"2","pod":"2","cluster":"3","guid":"72c8aedb-58c4-
>>>> 45
>>>> 69-
>>>> ac51-adc5af770bf6","name":"72c8aedb-58c4-4569-ac51-
>>>> adc5af770bf6","wait":0}] }
>>>> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>> (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running
>>>> and realState = Stopped
>>>> 2013-09-25 14:35:53,614 DEBUG [cloud.vm.VirtualMachineManagerImpl]
>>>> (AgentTaskPool-1:null) VM i-2-262-acs-docs-fc17: cs state = Running
>>>> and realState = Stopped
>>>> 2013-09-25 14:35:53,614 INFO [cloud.ha.HighAvailabilityManagerImpl]
>>>> (AgentTaskPool-1:null) Skip HA for VMware VM i-2-262-acs-docs-fc17
>>>> 2013-09-25 14:35:53,694 DEBUG [agent.transport.Request]
>>>> (AgentTaskPool-
>>>> 1:null) Seq 11-1418264581: Sending { Cmd , MgmtId: 345049078181, via:
>>>> 11,
>>>> Ver: v1, Flags: 100101,
>>>> [{"StopCommand":{"isProxy":false,"vmName":"i-2-
>>>> 262-acs-docs-
>>>> fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-
>>>> demo01t-ops-
>> 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":
>>>> "i
>>>> -
>>>> 2-281-acs-
>>>> appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-
>>>> 28
>>>> 3-
>>>> vmbld01l-ops-
>> 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"
>>>> i-
>>>> 2-285-ossec01l-ops-08","wait":0}}] }
>>>> 2013-09-25 14:35:53,695 DEBUG [agent.transport.Request]
>>>> (AgentTaskPool-
>>>> 1:null) Seq 11-1418264581: Executing: { Cmd , MgmtId: 345049078181,
>>>> via: 11,
>>>> Ver: v1, Flags: 100101,
>>>> [{"StopCommand":{"isProxy":false,"vmName":"i-2-
>>>> 262-acs-docs-
>>>> fc17","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-278-
>>>> demo01t-ops-
>> 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":
>>>> "i
>>>> -
>>>> 2-281-acs-
>>>> appliance","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"i-2-
>>>> 28
>>>> 3-
>>>> vmbld01l-ops-
>> 08","wait":0}},{"StopCommand":{"isProxy":false,"vmName":"
>>>> i-
>>>> 2-285-ossec01l-ops-08","wait":0}}] }
>>>> 2013-09-25 14:35:53,702 INFO [vmware.resource.VmwareResource]
>>>> (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) Executing resource
>>>> StopCommand:
>>>> {"isProxy":false,"vmName":"i-2-262-acs-docs-fc17","wait":0}
>>>> 2013-09-25 14:35:53,703 DEBUG [vmware.mo.HostMO] (DirectAgent-
>>>> 3:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on
>>>> host
>>>> 2013-09-25 14:35:54,753 INFO [vmware.resource.VmwareResource]
>>>> (DirectAgent-3:vmha62d-ops-08.portal.webmd.com) VM i-2-262-acs-
>> docs-
>>>> fc17 is already in stopped state
>>>> 2013-09-25 14:37:50,264 DEBUG [agent.transport.Request]
>>>> (DirectAgent-
>>>> 3:null) Seq 11-1418264581: Processing: { Ans: , MgmtId:
>>>> 345049078181,
>>>> via: 11,
>>>> Ver: v1, Flags: 100,
>>>> [{"StopAnswer":{"vncPort":0,"result":true,"details":"VM
>>>> i-2-262-acs-docs-fc17 is already in stopped
>>>> state","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details":"
>>>> Stop
>>>> VM i-2-278-demo01t-ops-08
>>>> Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details"
>>>> :"Stop
>>>> VM i-2-281-acs-appliance
>>>> Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details"
>>>> :"Stop
>>>> VM i-2-283-vmbld01l-ops-08
>>>> Succeed","wait":0}},{"StopAnswer":{"vncPort":0,"result":true,"details"
>>>> :"Stop VM i-2-285-ossec01l-ops-08 Succeed","wait":0}}] }
>>>> 2013-09-25 14:38:14,528 DEBUG [vmware.mo.HostMO] (DirectAgent-
>>>> 20:vmha62d-ops-08.portal.webmd.com) find VM i-2-262-acs-docs-fc17 on
>>>> host
>>>> 
>>>> I wonder why this was not noted earlier.
>>>> 
>>>> The code in question is below (around line 2365) and comes from
>>>> engine/orchestration/src/com/cloud/vm/VirtualMachineManagerImpl.java
>>>> on the master branch:
>>>> 
>>>> if (serverState == State.Running) {
>>>>     // Our records showed that it should be running, so let's restart it.
>>>>     _haMgr.scheduleRestart(vm, false);
>>>> } else if (serverState == State.Stopping) {
>>>>     _haMgr.scheduleStop(vm, hostId, WorkType.ForceStop);
>>>>     s_logger.debug("Scheduling a check stop for VM in stopping mode: " + vm);
>>>> } else if (serverState == State.Starting) {
>>>>     s_logger.debug("Ignoring VM in starting mode: " + vm.getInstanceName());
>>>>     _haMgr.scheduleRestart(vm, false);
>>>> }
>>>> command = cleanup(vm);
>>>> 
>>>> If there is a quick fix, please let me know. For now, I think I can
>>>> just comment out "command = cleanup(vm)", but I'm not certain what
>>>> the repercussions would be.
>>>> 
>>>> As always any feedback, is appreciated.
>>>> 
>>>> Thanks
>>>> ilya
>>>> 
>>>> 
>>> 
>> 
> 
> 
