Dear all,

We are using CloudStack 4.2.0 with the KVM hypervisor and Ceph RBD for
primary storage. Over the past week, many of our KVM host agents have
frequently been disconnected from the management server, causing VMs to go
down as a result of HA work. We have had host disconnections in the past,
but they normally affected only a single host; this time round, when the
problem happens, it hits multiple hosts, up to 4-5 hosts at the same time.

I cannot find much in either management-server.log or agent.log; no
significant warnings, errors or exceptions are logged before the
disconnection. Here are sample logs from the agent:

===
2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null) Ping
Interval has gone past 300000.  Attempting to reconnect.
2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null)
Clearing watch list: 2
2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection]
(Agent-Selector:null) Closing socket
Socket[addr=/*.*.3.3,port=8250,localport=50489]
2016-02-05 03:20:33,825 INFO  [cloud.agent.Agent] (UgentTask-7:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:20:38,826 INFO  [cloud.agent.Agent] (UgentTask-7:null)
Reconnecting...
2016-02-05 03:20:38,829 INFO  [utils.nio.NioClient] (Agent-Selector:null)
Connecting to *.*.3.3:8250
2016-02-05 03:20:38,925 INFO  [utils.nio.NioClient] (Agent-Selector:null)
SSL: Handshake done
2016-02-05 03:20:38,926 INFO  [utils.nio.NioClient] (Agent-Selector:null)
Connected to *.*.3.3:8250
2016-02-05 03:20:43,926 INFO  [cloud.agent.Agent] (UgentTask-7:null)
Connected to the server
===

Sometimes, the CloudStack agent is unable to reconnect unless we manually
stop and start the agent again:

===
2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null) Ping
Interval has gone past 300000.  Attempting to reconnect.
2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null)
Clearing watch list: 2
2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection]
(Agent-Selector:null) Closing socket
Socket[addr=/*.*.3.3,port=8250,localport=46231]
2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:30,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:35,333 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:35,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:40,334 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:40,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:45,335 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:50,336 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:22:55,337 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:00,338 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:00,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:05,339 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:05,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:10,340 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:15,341 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
2016-02-05 03:23:16,945 INFO  [cloud.agent.Agent]
(AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent]
(AgentShutdownThread:null) Sending shutdown to management server
2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Lost
connection to the server. Dealing with the remaining commands...
2016-02-05 03:23:17,949 INFO  [cloud.agent.Agent] (UgentTask-6:null) Cannot
connect because we still have 3 commands in progress.
===
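To get a rough measure of how long each host stays stuck in this reconnect loop before we restart the agent, I have been using a small script along these lines. It only assumes the timestamp format and the "Cannot connect because we still have ... commands in progress" message shown in the log above:

```python
import re
from datetime import datetime

# Matches the agent.log timestamp plus the "stuck" message from the log
# excerpt above, e.g.:
# 2016-02-05 03:22:25,332 INFO  [cloud.agent.Agent] (UgentTask-6:null)
#   Cannot connect because we still have 3 commands in progress.
STUCK_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+ .*"
    r"Cannot connect because we still have (\d+) commands in progress"
)

def stuck_window(lines):
    """Return (first_seen, last_seen, occurrences) for the reconnect
    loop, or None if the message never appears in the given lines."""
    first = last = None
    count = 0
    for line in lines:
        m = STUCK_RE.match(line)
        if not m:
            continue
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        if first is None:
            first = ts
        last = ts
        count += 1
    return (first, last, count) if count else None
```

I run it over the agent log file (in our installation the agent logs to /var/log/cloudstack/agent/agent.log) by passing the open file handle to stuck_window() and printing (last - first).total_seconds().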

When the problem happens, I can confirm that the CPU and memory load on
the servers is healthy and they are not overloaded. I understand that this
kind of problem can happen when the hypervisor host is overloaded, but
that is not the case here, especially since the problem now strikes up to
4-5 hypervisor hosts at the same time.

On the management server, here are the typical logs:

===
2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Investigating why host 72 has disconnected with
event AgentDisconnected
2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) checking if agent (72) is alive
2016-02-05 03:20:28,833 DEBUG [agent.transport.Request]
(AgentTaskPool-10:null) Seq 72-215557373: Sending  { Cmd , MgmtId:
161342671900, via: 72, Ver: v1, Flags: 100011,
[{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
2016-02-05 03:20:28,833 INFO  [agent.manager.AgentAttache]
(AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource
[Host:72] is unreachable: Host 72: Channel is closed
2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache]
(AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
2016-02-05 03:20:28,833 WARN  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72: Channel
is closed
2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(AgentTaskPool-10:null) SimpleInvestigator unable to determine the state of
the host.  Moving on.
2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(AgentTaskPool-10:null) XenServerInvestigator unable to determine the state
of the host.  Moving on.
2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator]
(AgentTaskPool-10:null) checking if agent (72) is alive
2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator]
(AgentTaskPool-10:null) sending ping from (34) to agent's host ip address
(*.*.3.25)
2016-02-05 03:20:28,843 DEBUG [agent.transport.Request]
(AgentTaskPool-10:null) Seq 34-815540048: Sending  { Cmd , MgmtId:
161342671900, via: 34, Ver: v1, Flags: 100011,
[{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}] }
2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
(AgentManager-Handler-14:null) Seq 34-815540048: Processing:  { Ans: ,
MgmtId: 161342671900, via: 34, Ver: v1, Flags: 10,
[{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
2016-02-05 03:20:29,024 DEBUG [agent.transport.Request]
(AgentTaskPool-10:null) Seq 34-815540048: Received:  { Ans: , MgmtId:
161342671900, via: 34, Ver: v1, Flags: 10, { Answer } }
2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl]
(AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged,
returning that host is up
2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator]
(AgentTaskPool-10:null) ping from (34) to agent's host ip address
(*.*.3.25) successful, returning that agent is disconnected
2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(AgentTaskPool-10:null) null was able to determine host 72 is in
Disconnected
2016-02-05 03:20:29,024 INFO  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) The state determined is Disconnected
2016-02-05 03:20:29,024 WARN  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Agent is disconnected but the host is still up:
72-**-kvm-05
2016-02-05 03:20:29,025 WARN  [apache.cloudstack.alerts]
(AgentTaskPool-10:null)  alertType:: 7 // dataCenterId:: 6 // podId:: 6 //
clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72),
availability zone: *****-******-01, pod: *****-******-Pod-01
2016-02-05 03:20:29,086 INFO  [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Host 72 is disconnecting with event
AgentDisconnected
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) The next status of agent 72is Alert, current status
is Up
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Deregistering link for 72 with state Alert
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Remove Agent : 72
2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache]
(AgentTaskPool-10:null) Processing Disconnect.
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache]
(AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class
com.cloud.network.security.SecurityGroupListener
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Sending Disconnect to listener:
com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Sending Disconnect to listener:
com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Sending Disconnect to listener:
com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl]
(AgentTaskPool-10:null) Sending Disconnect to listener:
com.cloud.storage.secondary.SecondaryStorageListener
===
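To confirm that the disconnections really are simultaneous across hosts (rather than staggered), I extract the "Investigating why host N has disconnected" lines from management-server.log and bucket the host ids by minute. This is a rough sketch that only assumes the message format shown above; the minute-level grouping and the threshold of 2 hosts are my own choices:

```python
import re
from collections import defaultdict

# Matches the management-server.log line shown above, e.g.:
# 2016-02-05 03:20:28,828 INFO  [agent.manager.AgentManagerImpl]
#   (AgentTaskPool-10:null) Investigating why host 72 has disconnected ...
# Group 1 is the timestamp truncated to the minute, group 2 the host id.
DISC_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}):\d{2},\d+ .*"
    r"Investigating why host (\d+) has disconnected"
)

def disconnects_by_minute(lines):
    """Map 'YYYY-MM-DD HH:MM' -> set of host ids that disconnected
    during that minute."""
    buckets = defaultdict(set)
    for line in lines:
        m = DISC_RE.match(line)
        if m:
            buckets[m.group(1)].add(int(m.group(2)))
    return buckets

def simultaneous(lines, threshold=2):
    """Minutes in which `threshold` or more hosts disconnected together."""
    return {minute: hosts
            for minute, hosts in disconnects_by_minute(lines).items()
            if len(hosts) >= threshold}
```

In my case, minutes where 4-5 host ids land in the same bucket would support the suspicion that the cause is on the management server or network side rather than on any individual hypervisor.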

Can anyone advise how I can start troubleshooting to find the root cause
of this problem, so that I can take measures to prevent it from happening
again?

Any advice is greatly appreciated.

Looking forward to your reply, thank you.

Cheers.

-ip-
