Dear all, We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD for primary storage. In the past one week, many of our KVM host agents would often be disconnected from the management server, causing the VMs to go down because of HA work. While we used to have host disconnection in the past, normally it would only affect just one host, but this time round, when the problem happens, it would happen on multiple hosts, up to 4-5 hosts at the same time.
Nothing much I can find on both the management-server.log and agent.log, with no significant warn, error or exceptions logged before the disconnection. Here are the sample logs from the agent: === 2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null) Ping Interval has gone past 300000. Attempting to reconnect. 2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null) Clearing watch list: 2 2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection] (Agent-Selector:null) Closing socket Socket[addr=/*.*.3.3,port=8250,localport=50489] 2016-02-05 03:20:33,825 INFO [cloud.agent.Agent] (UgentTask-7:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:20:38,826 INFO [cloud.agent.Agent] (UgentTask-7:null) Reconnecting... 2016-02-05 03:20:38,829 INFO [utils.nio.NioClient] (Agent-Selector:null) Connecting to *.*.3.3:8250 2016-02-05 03:20:38,925 INFO [utils.nio.NioClient] (Agent-Selector:null) SSL: Handshake done 2016-02-05 03:20:38,926 INFO [utils.nio.NioClient] (Agent-Selector:null) Connected to *.*.3.3:8250 2016-02-05 03:20:43,926 INFO [cloud.agent.Agent] (UgentTask-7:null) Connected to the server === Sometimes, the Cloudstack agent will not be able to re-connect unless if we stop and start the agent again manually: === 2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null) Ping Interval has gone past 300000. Attempting to reconnect. 2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null) Clearing watch list: 2 2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection] (Agent-Selector:null) Closing socket Socket[addr=/*.*.3.3,port=8250,localport=46231] 2016-02-05 03:22:25,332 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:22:25,332 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:22:30,333 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:22:30,333 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:22:35,333 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:22:35,334 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:22:40,334 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:22:40,335 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:22:45,335 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:22:45,335 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:22:50,336 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:22:50,336 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:22:55,337 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:22:55,337 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:23:00,338 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:23:00,339 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:23:05,339 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:23:05,340 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:23:10,340 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:23:10,340 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:23:15,341 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:23:15,341 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. 2016-02-05 03:23:16,945 INFO [cloud.agent.Agent] (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) Sending shutdown to management server 2016-02-05 03:23:17,949 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands... 2016-02-05 03:23:17,949 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress. === During the time when the problem happens, I can confirm that the CPU and memory load on the server is healthy and not overloaded. I understand that such problem can happen when the hypervisor host is overloaded, however this is not the case -- especially since when such problem happens now, it will strike up to 4-5 hypervisor hosts at the same time. On the management server, here are the typical logs: === 2016-02-05 03:20:28,828 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Investigating why host 72 has disconnected with event AgentDisconnected 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) checking if agent (72) is alive 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request] (AgentTaskPool-10:null) Seq 72-215557373: Sending { Cmd , MgmtId: 161342671900, via: 72, Ver: v1, Flags: 100011 , [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] } 2016-02-05 03:20:28,833 INFO [agent.manager.AgentAttache] (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource [Host:72] is unreachable: Host 72: C hannel is closed 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-10:null) Seq 72-215557373: Cancelling. 2016-02-05 03:20:28,833 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72: Channel is closed 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-10:null) SimpleInvestigator unable to determine the state of the host. Moving on. 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-10:null) XenServerInvestigator unable to determine the state of the host. Moving on . 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-10:null) checking if agent (72) is alive 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-10:null) sending ping from (34) to agent's host ip address (*.*.3.25) 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request] (AgentTaskPool-10:null) Seq 34-815540048: Sending { Cmd , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100011 , [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}] } 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request] (AgentManager-Handler-14:null) Seq 34-815540048: Processing: { Ans: , MgmtId: 161342671900, via: 34, Ver: v1, F lags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] } 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request] (AgentTaskPool-10:null) Seq 34-815540048: Received: { Ans: , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 10, { Answer } } 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged, returning that host is up 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-10:null) ping from (34) to agent's host ip address (*.*.3.25) successful, returning th at agent is disconnected 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-10:null) null was able to determine host 72 is in Disconnected 2016-02-05 03:20:29,024 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) The state determined is Disconnected 2016-02-05 03:20:29,024 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Agent is disconnected but the host is still up: 72-**-kvm-05 2016-02-05 03:20:29,025 WARN [apache.cloudstack.alerts] (AgentTaskPool-10:null) alertType:: 7 // dataCenterId:: 6 // podId:: 6 // clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72), availability zone: *****-******-01, pod: *****-******-Pod-01 2016-02-05 03:20:29,086 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Host 72 is disconnecting with event AgentDisconnected 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) The next status of agent 72is Alert, current status is Up 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Deregistering link for 72 with state Alert 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Remove Agent : 72 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache] (AgentTaskPool-10:null) Processing Disconnect. 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class com.cloud.network.security.SecurityGroupListener 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Sending Disconnect to listener: com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Sending Disconnect to listener: com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Sending Disconnect to listener: com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Sending Disconnect to listener: com.cloud.storage.secondary.SecondaryStorageListener === Can anyone advise how can I start troubleshooting to find the root cause of the problem? So that I can take whatever measures to prevent similar problem from happening. Any advice is greatly appreciated. Looking forward to your reply, thank you. Cheers. -ip-