Hi Wido,

Good day to you, and thanks for your reply. Nice to hear from you again. :)

So is this caused by a bug in 4.2 that has been resolved in newer versions of
ACS? Is there any specific information on the bug, e.g. a bug ID and a
description of the fix? Is there a way I can resolve the problem without
having to upgrade?
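
In the meantime, in case it helps anyone else hitting this: as a stop-gap I
am considering a small watchdog on each KVM host that restarts the agent when
it is stuck in the "Cannot connect because we still have N commands in
progress" reconnect loop visible in the quoted agent.log. This is only a
hypothetical sketch -- the log path, the restart threshold and the demo log
are my assumptions, and the actual restart command is left commented out:

```shell
#!/bin/sh
# Hypothetical stop-gap watchdog (sketch, not production code): restart the
# CloudStack agent when it is stuck in the "Cannot connect because we still
# have N commands in progress" reconnect loop. Paths and the threshold are
# assumptions -- adjust for your hosts.

check_agent() {
    # Count stuck-reconnect messages in the tail of the agent log and decide
    # whether a restart is warranted.
    log=$1
    threshold=${2:-5}
    count=$(tail -n 200 "$log" 2>/dev/null | grep -c "Cannot connect because we still have")
    if [ "$count" -ge "$threshold" ]; then
        echo "agent stuck ($count stuck-reconnect lines) - would restart cloudstack-agent"
        # service cloudstack-agent restart   # enable on a real host
    else
        echo "agent ok ($count stuck-reconnect lines)"
    fi
}

# Demo against a sample log; the real path is typically
# /var/log/cloudstack/agent/agent.log (an assumption, check your install).
sample=$(mktemp)
for i in 1 2 3 4 5 6; do
    echo "2016-02-05 03:22:25,332 INFO [cloud.agent.Agent] Cannot connect because we still have 3 commands in progress." >> "$sample"
done
check_agent "$sample" 5
rm -f "$sample"
```

Run from cron every minute or so, something like this would at least remove
the need to notice the stuck agent and restart it by hand.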

Is there any documentation I can follow on how to upgrade from 4.2 to 4.8?
Will this be fairly straightforward, or will it involve many steps? We are
running a production environment and don't have a staging/test environment
to play with.
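
From what I can gather (an assumption on my part, please correct me if I'm
wrong), the upgrade is package-based and the Release Notes of the target
version carry a step-by-step upgrade section. On a RHEL/CentOS management
server I imagine the rough shape is something like the sketch below -- the
exact repo setup, package names and any intermediate version requirements
need to be checked against the official docs before touching production:

```shell
# Rough sketch only -- follow the official CloudStack 4.8 Release Notes
# upgrade instructions; package names and steps below are assumptions.

# 1. Back up the database first (user/db names are placeholders):
mysqldump -u cloud -p cloud > cloudstack-db-backup.sql

# 2. Stop the management server:
service cloudstack-management stop

# 3. Point the package repo at the 4.8 release, then upgrade:
yum upgrade cloudstack-management

# 4. Restart; the server migrates the database schema on first start:
service cloudstack-management start

# 5. On each KVM host, upgrade and restart the agent:
yum upgrade cloudstack-agent
service cloudstack-agent restart
```

Given we have no staging environment, I'd take the database backup and a
snapshot of the management server before attempting any of this.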

Looking forward to your reply, thank you.

Cheers.

On Sat, Feb 6, 2016 at 3:48 PM, Wido den Hollander <w...@widodh.nl> wrote:

> Hi,
>
> > Op 5 februari 2016 om 17:24 schreef Indra Pramana <in...@sg.or.id>:
> >
> >
> > Dear all,
> >
> > We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD for primary
> > storage. In the past one week, many of our KVM host agents would often be
> > disconnected from the management server, causing the VMs to go down
> because
> > of HA work. While we used to have host disconnection in the past,
> normally
> > it would only affect just one host, but this time round, when the problem
> > happens, it would happen on multiple hosts, up to 4-5 hosts at the same
> > time.
> >
>
> Any reason to still run 4.2? I've seen this happen as well and I haven't
> seen
> this with recent versions of ACS.
>
> Could you maybe upgrade to 4.8?
>
> Wido
>
> > Nothing much I can find on both the management-server.log and agent.log,
> > with no significant warn, error or exceptions logged before the
> > disconnection. Here are the sample logs from the agent:
> >
> > ===
> > 2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null) Ping
> > Interval has gone past 300000.  Attempting to reconnect.
> > 2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null)
> > Clearing watch list: 2
> > 2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection]
> > (Agent-Selector:null) Closing socket
> > Socket[addr=/*.*.3.3,port=8250,localport=50489]
> > 2016-02-05 03:20:33,825 INFO  [cloud.agent.Agent] (UgentTask-7:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:20:38,826 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> > Reconnecting...
> > 2016-02-05 03:20:38,829 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > Connecting to *.*.3.3:8250
> > 2016-02-05 03:20:38,925 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > SSL: Handshake done
> > 2016-02-05 03:20:38,926 INFO  [utils.nio.NioClient] (Agent-Selector:null)
> > Connected to *.*.3.3:8250
> > 2016-02-05 03:20:43,926 INFO  [cloud.agent.Agent] (UgentTask-7:null)
> > Connected to the server
> > ===
> >
> > Sometimes, the Cloudstack agent will not be able to re-connect unless if
> we
> > stop and start the agent again manually:
> >
> > ===
> > 2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null) Ping
> > Interval has gone past 300000.  Attempting to reconnect.
> > 2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null)
> > Clearing watch list: 2
> > 2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection]
> > (Agent-Selector:null) Closing socket
> > Socket[addr=/*.*.3.3,port=8250,localport=46231]
> > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:35,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:35,334 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:40,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:40,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:00,338 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:00,339 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:05,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > 2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
> > (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
> > 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
> > (AgentShutdownThread:null) Sending shutdown to management server
> > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
> > connection to the server. Dealing with the remaining commands...
> > 2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null)
> Cannot
> > connect because we still have 3 commands in progress.
> > ===
> >
> > During the time when the problem happens, I can confirm that the CPU and
> > memory load on the server is healthy and not overloaded. I understand
> that
> > such problem can happen when the hypervisor host is overloaded, however
> > this is not the case -- especially since when such problem happens now,
> it
> > will strike up to 4-5 hypervisor hosts at the same time.
> >
> > On the management server, here are the typical logs:
> >
> > ===
> > 2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Investigating why host 72 has disconnected with
> > event AgentDisconnected
> > 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) checking if agent (72) is alive
> > 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
> > (AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
> > 161342671900, via: 72, Ver: v1, Flags: 100011
> > , [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
> > 2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
> > (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
> > [Host:72] is unreachable: Host 72: C
> > hannel is closed
> > 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
> > (AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
> > 2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72:
> Channel
> > is closed
> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > (AgentTaskPool-10:null) SimpleInvestigator unable to determine the state
> of
> > the host.  Moving on.
> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > (AgentTaskPool-10:null) XenServerInvestigator unable to determine the
> state
> > of the host.  Moving on
> > .
> > 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > (AgentTaskPool-10:null) checking if agent (72) is alive
> > 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > (AgentTaskPool-10:null) sending ping from (34) to agent's host ip address
> > (*.*.3.25)
> > 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
> > (AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
> > 161342671900, via: 34, Ver: v1, Flags: 100011
> > ,
> >
> [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}]
> > }
> > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> > (AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
> > MgmtId: 161342671900, via: 34, Ver: v1, F
> > lags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
> > 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
> > (AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
> > 161342671900, via: 34, Ver: v1, Flags: 10,
> >  { Answer } }
> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
> > (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
> > returning that host is up
> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
> > (AgentTaskPool-10:null) ping from (34) to agent's host ip address
> > (*.*.3.25) successful, returning th
> > at agent is disconnected
> > 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
> > (AgentTaskPool-10:null) null was able to determine host 72 is in
> > Disconnected
> > 2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) The state determined is Disconnected
> > 2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Agent is disconnected but the host is still up:
> > 72-**-kvm-05
> > 2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
> > (AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6
> //
> > clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72),
> > availability zone: *****-******-01, pod: *****-******-Pod-01
> > 2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Host 72 is disconnecting with event
> > AgentDisconnected
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) The next status of agent 72is Alert, current
> status
> > is Up
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Deregistering link for 72 with state Alert
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Remove Agent : 72
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
> > (AgentTaskPool-10:null) Processing Disconnect.
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
> > (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
> > com.cloud.network.security.SecurityGroupListener
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Sending Disconnect to listener:
> >
> com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Sending Disconnect to listener:
> >
> com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
> > 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
> > (AgentTaskPool-10:null) Sending Disconnect to listener:
> > com.cloud.storage.secondary.SecondaryStorageListener
> > ===
> >
> > Can anyone advise how can I start troubleshooting to find the root cause
> of
> > the problem? So that I can take whatever measures to prevent similar
> > problem from happening.
> >
> > Any advice is greatly appreciated.
> >
> > Looking forward to your reply, thank you.
> >
> > Cheers.
> >
> > -ip-
>
