[ https://issues.apache.org/jira/browse/CLOUDSTACK-8883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14901633#comment-14901633 ]
ASF GitHub Bot commented on CLOUDSTACK-8883:
--------------------------------------------

GitHub user borisroman opened a pull request:

    https://github.com/apache/cloudstack/pull/863

    [BLOCKER][4.6]CLOUDSTACK-8883: Resolved connect/reconnect issue.

    Hi!

    @wilderrodrigues by implementing Callable you switched a couple of methods and fields. I switched them some more!

    The Agent wouldn't reconnect for two reasons.

    Problem 1: Selector was blocking.
    In the while loop at [1], _selector.select(); blocked when the connection was lost. This means that _isStartup = false; at [2] was never executed. Therefore the call to isStartup() at [3] always returned true, resulting in an infinite loop.

    Resolution 1: Move the call to cleanUp() [4] before the check whether isStartup() has turned false. cleanUp() will close() the _selector, which results in _isStartup being set to false.

    Problem 2: _isStartup & _isRunning were set to true even when init() threw a ConnectException.
    The exception was caught, but only logged. No action was taken, so _isStartup & _isRunning were still set to true and the Agent thought it had connected successfully when it hadn't.

    Resolution 2: Add a return to the catch statement [5]. This way _isStartup & _isRunning aren't set to true.

    Steps to test:
    1. Deploy ACS.
    2. Try all combinations of stopping/starting management server/agent.
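The two resolutions can be illustrated with a minimal Java sketch. This is not the CloudStack code: the class ReconnectSketch and its methods start, init and closeUnblocksSelect are hypothetical names chosen for illustration, and a local flag stands in for _isStartup. It only assumes the standard java.nio behaviour that closing a Selector wakes a thread blocked in select().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.Selector;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-in for the connection logic; not the CloudStack source.
class ReconnectSketch {

    // Problem 2 / Resolution 2: returns the final value of the startup flag.
    static boolean start(boolean serverDown) {
        boolean isStartup = false;
        try {
            init(serverDown);
        } catch (ConnectException e) {
            // Returning here keeps the flag false when the connect failed.
            return isStartup;
        }
        isStartup = true;
        return isStartup;
    }

    // Simulated init(): fails the way a refused connection would.
    static void init(boolean serverDown) throws ConnectException {
        if (serverDown) {
            throw new ConnectException("no server listening");
        }
    }

    // Problem 1 / Resolution 1: returns the flag value after the selector is closed.
    static boolean closeUnblocksSelect() {
        AtomicBoolean isStartup = new AtomicBoolean(true);
        try {
            Selector selector = Selector.open();
            Thread loop = new Thread(() -> {
                try {
                    // No channels are registered, so this blocks until
                    // the selector is woken up or closed.
                    selector.select();
                } catch (IOException | ClosedSelectorException e) {
                    // The selector was closed before or during select().
                } finally {
                    // Only reachable once select() stops blocking.
                    isStartup.set(false);
                }
            });
            loop.start();
            Thread.sleep(200);   // give the loop thread time to block
            selector.close();    // plays the role of cleanUp()
            loop.join(5000);
            return isStartup.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("flag after failed init: " + start(true));
        System.out.println("flag after successful init: " + start(false));
        System.out.println("flag after selector close: " + closeUnblocksSelect());
    }
}
```

In the sketch, waiting on the flag before closing the selector would never finish, which mirrors why cleanUp() has to run before the isStartup() check.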
    [1] https://github.com/borisroman/cloudstack/blob/b34f86c8d55a1cfc057585eab4db0fa2d98a7b3e/utils/src/main/java/com/cloud/utils/nio/NioConnection.java#L128
    [2] https://github.com/borisroman/cloudstack/blob/b34f86c8d55a1cfc057585eab4db0fa2d98a7b3e/utils/src/main/java/com/cloud/utils/nio/NioConnection.java#L176
    [3] https://github.com/borisroman/cloudstack/blob/b34f86c8d55a1cfc057585eab4db0fa2d98a7b3e/agent/src/com/cloud/agent/Agent.java#L404
    [4] https://github.com/borisroman/cloudstack/blob/b34f86c8d55a1cfc057585eab4db0fa2d98a7b3e/agent/src/com/cloud/agent/Agent.java#L399
    [5] https://github.com/borisroman/cloudstack/blob/b34f86c8d55a1cfc057585eab4db0fa2d98a7b3e/utils/src/main/java/com/cloud/utils/nio/NioConnection.java#L91

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/borisroman/cloudstack CLOUDSTACK-8883

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/cloudstack/pull/863.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #863

----
commit 9693b97c2147b3fdb9579a1ebb33597cd3bf1d11
Author: Boris Schrijver <bo...@pcextreme.nl>
Date:   2015-09-21T14:54:56Z

    Call cleanUp() before looping isStartup().

commit b34f86c8d55a1cfc057585eab4db0fa2d98a7b3e
Author: Boris Schrijver <bo...@pcextreme.nl>
Date:   2015-09-21T22:38:16Z

    Added return statement to stop start() if there has been an ConnectException.

----

> [Blocker] KVM host goes into disconnected state when MS is restarted
> --------------------------------------------------------------------
>
> Key: CLOUDSTACK-8883
> URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8883
> Project: CloudStack
> Issue Type: Bug
> Security Level: Public(Anyone can view this level - this is the default.)
> Affects Versions: 4.6.0
> Reporter: Raja Pullela
> Assignee: Boris Schrijver
> Priority: Blocker
> Fix For: 4.6.0
>
>
> steps to reproduce:
> - restart MS
> - see the KVM host status
> Expected
> - Agent should reconnect
> Actual
> - Host stays in disconnected state and Agent does not reconnect
> Apparently a recent commit broke this, and BVTs for KVM are all failing because Hosts go into a disconnected state and the SSVM/CPVMs don't come up.
> Current Agent Log - during the MS restart
> 2015-09-18 07:05:37,301 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-5:null) Asking libvirt to refresh storage pool c8bd627f-101f-3215-8545-72f7ce50f2c6
> 2015-09-18 07:06:37,452 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) Trye pool c8bd627f-101f-3215-8545-72f7ce50f2c6 from libvirt
> 2015-09-18 07:06:37,469 INFO [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) Askesh storage pool c8bd627f-101f-3215-8545-72f7ce50f2c6
> 2015-09-18 07:07:32,417 INFO [cloud.agent.Agent] (Agent-Handler-5:null) Lost connection to the server. Dealing with the remaining commands...
> Previously Agent used to reconnect -
> 2015-09-18 12:15:11,902 INFO [cloud.agent.Agent] (Agent-Handler-2:null) Reconnecting...
> 2015-09-18 12:15:11,903 INFO [utils.nio.NioClient] (Agent-Selector:null) Connecting to 10.147.28.47:8250
> 2015-09-18 12:15:11,904 WARN [utils.nio.NioConnection] (Agent-Selector:null) Unable to connect to remote: is there a server running on port 8250

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)