Hi,

> On 5 February 2016 at 17:24, Indra Pramana <in...@sg.or.id> wrote:
>
> Dear all,
>
> We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD for primary
> storage. In the past week, many of our KVM host agents have frequently
> been disconnected from the management server, causing the VMs to go down
> because of HA work. While we have had host disconnections in the past,
> they normally affected only one host; this time round, when the problem
> happens, it hits multiple hosts, up to 4-5 at the same time.
Any reason to still run 4.2? I've seen this happen as well, and I haven't
seen it with recent versions of ACS. Could you maybe upgrade to 4.8?

> There is not much I can find in either the management-server.log or the
> agent.log: no significant WARN, ERROR or exception entries are logged
> before the disconnection. Here are sample logs from the agent:
>
> ===
> 2016-02-05 03:20:28,820 ERROR [cloud.agent.Agent] (UgentTask-7:null) Ping Interval has gone past 300000. Attempting to reconnect.
> 2016-02-05 03:20:28,825 DEBUG [cloud.agent.Agent] (UgentTask-7:null) Clearing watch list: 2
> 2016-02-05 03:20:28,825 DEBUG [utils.nio.NioConnection] (Agent-Selector:null) Closing socket Socket[addr=/*.*.3.3,port=8250,localport=50489]
> 2016-02-05 03:20:33,825 INFO [cloud.agent.Agent] (UgentTask-7:null) Lost connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:20:38,826 INFO [cloud.agent.Agent] (UgentTask-7:null) Reconnecting...
> 2016-02-05 03:20:38,829 INFO [utils.nio.NioClient] (Agent-Selector:null) Connecting to *.*.3.3:8250
> 2016-02-05 03:20:38,925 INFO [utils.nio.NioClient] (Agent-Selector:null) SSL: Handshake done
> 2016-02-05 03:20:38,926 INFO [utils.nio.NioClient] (Agent-Selector:null) Connected to *.*.3.3:8250
> 2016-02-05 03:20:43,926 INFO [cloud.agent.Agent] (UgentTask-7:null) Connected to the server
> ===
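For what it's worth: "Ping Interval has gone past 300000" means the agent
has not completed a ping to the management server within its allowed window
(300,000 ms here), so it tears the channel down and reconnects. I believe
that window is derived from the ping interval the management server hands
the agent at connect time, controlled by the ping.interval global setting
(with ping.timeout as the server-side multiplier before an agent is
considered dead). A quick sanity check of what your installation runs with,
assuming a configured CloudMonkey:

===
# show the ping settings the management server is using
cloudmonkey list configurations name=ping.interval
cloudmonkey list configurations name=ping.timeout
===

If those are at their defaults, the window itself is not the problem and
the real question is why the pings stop getting through.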
> Sometimes the CloudStack agent is not able to reconnect unless we stop
> and start the agent again manually:
>
> ===
> 2016-02-05 03:22:20,330 ERROR [cloud.agent.Agent] (UgentTask-6:null) Ping Interval has gone past 300000. Attempting to reconnect.
> 2016-02-05 03:22:20,331 DEBUG [cloud.agent.Agent] (UgentTask-6:null) Clearing watch list: 2
> 2016-02-05 03:22:20,353 DEBUG [utils.nio.NioConnection] (Agent-Selector:null) Closing socket Socket[addr=/*.*.3.3,port=8250,localport=46231]
> 2016-02-05 03:22:25,332 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:25,332 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress.
> 2016-02-05 03:22:30,333 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:22:30,333 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress.
> [... the same two lines repeat every five seconds until 03:23:15,341 ...]
> 2016-02-05 03:23:16,945 INFO [cloud.agent.Agent] (AgentShutdownThread:null) Stopping the agent: Reason = sig.kill
> 2016-02-05 03:23:16,946 DEBUG [cloud.agent.Agent] (AgentShutdownThread:null) Sending shutdown to management server
> 2016-02-05 03:23:17,949 INFO [cloud.agent.Agent] (UgentTask-6:null) Lost connection to the server. Dealing with the remaining commands...
> 2016-02-05 03:23:17,949 INFO [cloud.agent.Agent] (UgentTask-6:null) Cannot connect because we still have 3 commands in progress.
> ===
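"Cannot connect because we still have 3 commands in progress" is the agent
refusing to open a new connection while it still counts outstanding
requests against the old one; in that state it can spin forever, which
matches your experience that only a manual restart helps. A minimal
recovery on a stuck host, assuming the standard 4.2-era packages and log
locations:

===
# restart the agent on the stuck hypervisor
service cloudstack-agent restart

# then watch it re-register with the management server
tail -f /var/log/cloudstack/agent/agent.log
===

That is only a workaround, of course; it does not explain why the commands
got stuck in the first place.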
> When the problem happens, I can confirm that the CPU and memory load on
> the servers is healthy and they are not overloaded. I understand that
> such a problem can happen when the hypervisor host is overloaded, but
> this is not the case -- especially since the problem now strikes up to
> 4-5 hypervisor hosts at the same time.
>
> On the management server, these are the typical logs:
>
> ===
> 2016-02-05 03:20:28,828 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Investigating why host 72 has disconnected with event AgentDisconnected
> 2016-02-05 03:20:28,828 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) checking if agent (72) is alive
> 2016-02-05 03:20:28,833 DEBUG [agent.transport.Request] (AgentTaskPool-10:null) Seq 72-215557373: Sending { Cmd , MgmtId: 161342671900, via: 72, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.CheckHealthCommand":{"wait":50}}] }
> 2016-02-05 03:20:28,833 INFO [agent.manager.AgentAttache] (AgentTaskPool-10:null) Seq 72-215557373: Unable to send due to Resource [Host:72] is unreachable: Host 72: Channel is closed
> 2016-02-05 03:20:28,833 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-10:null) Seq 72-215557373: Cancelling.
> 2016-02-05 03:20:28,833 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Resource [Host:72] is unreachable: Host 72: Channel is closed
> 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-10:null) SimpleInvestigator unable to determine the state of the host. Moving on.
> 2016-02-05 03:20:28,836 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-10:null) XenServerInvestigator unable to determine the state of the host. Moving on.
> 2016-02-05 03:20:28,836 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-10:null) checking if agent (72) is alive
> 2016-02-05 03:20:28,840 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-10:null) sending ping from (34) to agent's host ip address (*.*.3.25)
> 2016-02-05 03:20:28,843 DEBUG [agent.transport.Request] (AgentTaskPool-10:null) Seq 34-815540048: Sending { Cmd , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.PingTestCommand":{"_computingHostIp":"*.*.3.25","wait":20}}] }
> 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request] (AgentManager-Handler-14:null) Seq 34-815540048: Processing: { Ans: , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 10, [{"com.cloud.agent.api.Answer":{"result":true,"wait":0}}] }
> 2016-02-05 03:20:29,024 DEBUG [agent.transport.Request] (AgentTaskPool-10:null) Seq 34-815540048: Received: { Ans: , MgmtId: 161342671900, via: 34, Ver: v1, Flags: 10, { Answer } }
> 2016-02-05 03:20:29,024 DEBUG [cloud.ha.AbstractInvestigatorImpl] (AgentTaskPool-10:null) host (*.*.3.25) has been successfully pinged, returning that host is up
> 2016-02-05 03:20:29,024 DEBUG [cloud.ha.UserVmDomRInvestigator] (AgentTaskPool-10:null) ping from (34) to agent's host ip address (*.*.3.25) successful, returning that agent is disconnected
> 2016-02-05 03:20:29,024 DEBUG [cloud.ha.HighAvailabilityManagerImpl] (AgentTaskPool-10:null) null was able to determine host 72 is in Disconnected
> 2016-02-05 03:20:29,024 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) The state determined is Disconnected
> 2016-02-05 03:20:29,024 WARN [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Agent is disconnected but the host is still up: 72-**-kvm-05
> 2016-02-05 03:20:29,025 WARN [apache.cloudstack.alerts] (AgentTaskPool-10:null) alertType:: 7 // dataCenterId:: 6 // podId:: 6 // clusterId:: null // message:: Host disconnected, name: **-kvm-05 (id:72), availability zone: *****-******-01, pod: *****-******-Pod-01
> 2016-02-05 03:20:29,086 INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Host 72 is disconnecting with event AgentDisconnected
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) The next status of agent 72is Alert, current status is Up
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Deregistering link for 72 with state Alert
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Remove Agent : 72
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.ConnectedAgentAttache] (AgentTaskPool-10:null) Processing Disconnect.
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentAttache] (AgentTaskPool-10:null) Seq 72-215547906: Sending disconnect to class com.cloud.network.security.SecurityGroupListener
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Sending Disconnect to listener: com.cloud.hypervisor.xen.discoverer.XcpServerDiscoverer_EnhancerByCloudStack_8f0e0536
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Sending Disconnect to listener: com.cloud.deploy.DeploymentPlanningManagerImpl_EnhancerByCloudStack_55d28fac
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Sending Disconnect to listener: com.cloud.network.NetworkManagerImpl_EnhancerByCloudStack_6763a13f
> 2016-02-05 03:20:29,089 DEBUG [agent.manager.AgentManagerImpl] (AgentTaskPool-10:null) Sending Disconnect to listener: com.cloud.storage.secondary.SecondaryStorageListener
> ===
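What that log shows is the HA investigator chain working as designed: the
direct CheckHealthCommand to host 72 fails because the agent channel is
closed, but the PingTestCommand routed via host 34 succeeds, so the
management server concludes "Agent is disconnected but the host is still
up" and moves the host to Alert. That points at a network or agent problem
rather than a host failure. When several hosts flap at once, something
like this could give a quick overview of which ones are currently affected
(a hypothetical CloudMonkey invocation against the listHosts API; adjust to
your own tooling):

===
cloudmonkey list hosts type=Routing state=Alert filter=id,name,state
===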
> Can anyone advise how I can start troubleshooting to find the root cause
> of the problem, so that I can take measures to prevent it from happening
> again?
>
> Any advice is greatly appreciated.
>
> Looking forward to your reply, thank you.
>
> Cheers.
>
> -ip-
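Given that the hosts themselves stay up and pingable while the agent
channel drops on 4-5 of them at once, I would concentrate on the network
path between the agents and port 8250 on the management server: a switch,
firewall or connection-tracking device that silently drops long-lived idle
TCP sessions would produce exactly this pattern. A starting point on an
affected hypervisor, assuming standard tooling and with <mgmt-server-ip>
filled in for your environment:

===
# verify the agent can reach the management server port at all
nc -zv <mgmt-server-ip> 8250

# watch the agent <-> management server traffic around a disconnect,
# looking for RSTs, retransmissions, or one side simply going quiet
tcpdump -i any host <mgmt-server-ip> and port 8250
===

None of that replaces the advice above, though: a lot of work has gone
into the agent connection handling since 4.2.0, so the upgrade is worth
planning regardless.

Wido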