Following up on this, I just found the following errors on my management server. They are very odd, as they are resolved within the same second. My settings are ping.interval = 5 and ping.timeout (multiplier) = 2.
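
As a rough sanity check on those settings, here is a minimal Python sketch of the timeout window they imply. It assumes that ping.timeout is a plain multiplier applied to ping.interval (taken to be in seconds) and that an agent counts as "behind on ping" once its last ping is older than that product. The function names are hypothetical and this is not CloudStack's actual code.

```python
def ping_timeout_seconds(ping_interval: float, ping_timeout_multiplier: float) -> float:
    """Window in seconds after which an agent would count as behind on ping.

    Assumption: the window is simply interval * multiplier.
    """
    return ping_interval * ping_timeout_multiplier


def is_behind_on_ping(last_ping_age_seconds: float,
                      ping_interval: float,
                      ping_timeout_multiplier: float) -> bool:
    """True if the last ping is older than the assumed timeout window."""
    window = ping_timeout_seconds(ping_interval, ping_timeout_multiplier)
    return last_ping_age_seconds > window


if __name__ == "__main__":
    # Values from this message: ping.interval = 5, ping.timeout (multiplier) = 2
    print(ping_timeout_seconds(5, 2))
    print(is_behind_on_ping(11, 5, 2))
```

If that assumption holds, the window is only 10 seconds, so a couple of delayed pings could be enough to trigger the PingTimeout investigation even when the agents are up.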

Thanks again,
Marty

Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentMonitor] (Thread-6:) Found the following agents behind on ping: [40, 27, 37, 38, 29, 39]
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) Investigating why host 40 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-8:) Investigating why host 27 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-4:) Investigating why host 37 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-5:) Investigating why host 38 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-16:) Investigating why host 29 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-9:) Investigating why host 39 has disconnected with event PingTimeout
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-5:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-5:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-4:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-4:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-8:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-8:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-15:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-16:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-16:) Agent is determined to be up and running
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-9:) The state determined is Up
Aug 17 19:04:24 discovery jsvc.exec[31061]: INFO [agent.manager.AgentManagerImpl] (AgentTaskPool-9:) Agent is determined to be up and running

On Sat, Aug 17, 2013 at 6:58 PM, Marty Sweet <msweet....@gmail.com> wrote:

> Hi Guys,
>
> I have just had a VMHost randomly disconnect in production and
> subsequently take down some VMs.
> I have attached the logs (happened to be running agent trace on this
> node), but it would seem that the agent (or management?) waited 25 seconds
> before erroring, and then the cloudstack agent froze until 1800.
> I assume the agent syslog stack traces were caused by force closes of VMs,
> no other nodes were affected during this time period.
>
> While the host was in disconnect mode, I could connect to a VM which was
> running on that host, although Cloudstack was already reporting that it was
> down.
> Would it be a good idea to ping VMs (their allocated IPs) before
> attempting to start them on other nodes - especially in a HA setup?
>
> If someone could look at the logs and let me know if there is something
> obvious it would be most appreciated, I have included the management bond
> for reference that the link didn't go down.
>
> Thanks in advance,
> Marty
>