[JIRA] [master-slave] (JENKINS-23852) slave builds repeatedly broken by "hudson.remoting.ChannelClosedException: channel is already closed"

[email protected] (JIRA) Tue, 09 Sep 2014 01:48:07 -0700

I have been seeing similar problems to this. In my situation one of my slaves is a Windows 7 desktop computer that is set to suspend after a period of inactivity. I've had this running for a few years without problem and the slave has always reconnected successfully when the desktop is woken up.

However since the NIO selector thread has been introduced the slaves have ended up being offline for prolonged periods of time and require a manual restart of the Jenkins slave service to bring them back online (see JENKINS-24272). One other symptom that I have observed during testing of this is the server rejecting connections similar to that reported here namely:

Jul 16, 2014 9:07:06 PM hudson.remoting.jnlp.Main$CuiListener error
SEVERE: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
java.lang.Exception: The server rejected the connection: XXX is already connected to this master. Rejecting this connection.
at hudson.remoting.Engine.onConnectionRejected(Engine.java:304)
at hudson.remoting.Engine.run(Engine.java:276)

Having investigated I'm seeing that the slave has decided that the connection is broken and has tore down its TCP connection to the master but the master never sees the TCP RST for this so it continues to retry the TCP packets for 10 or so minutes until the TCP connection times out and then the master tears down the connection.

The really strange this is that this behaviour seems to be different between NIO and threaded methods on the master. With threaded the TCP RST is sent by W7 and kills the connection before the slave reconnects but in NIO no RST is ever sent. That happens down at the network packet level so it shouldn't really be affected by NIO but I wonder if it a slight different configuration in the network stack (non-blocking vs blocking) that causes the different behaviour.

I also suspect that this may be related to the change to re-start JNLP slaves using a new JVM rather than just restarting in the same process. It may be that the Windows TCP stack just immediately forgets the old TCP socket when the old JVM process exits and hence ignores retries rather than RSTing the connection.

I'm planning on waiting for the current crop of JNLP slave fixes to get integrated and will then experiment a bit further to see if I can work out what is stopping the NIO selector getting notified that the slave has closed the TCP connection.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira

--
You received this message because you are subscribed to the Google Groups "Jenkins Issues" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
For more options, visit https://groups.google.com/d/optout.

[JIRA] [master-slave] (JENKINS-23852) slave builds repeatedly broken by "hudson.remoting.ChannelClosedException: channel is already closed"

Reply via email to