How many slaves do you have? It is rather easy to saturate a server with a small number of ssh-slaves based slaves.
For example, on an AWS m3.large class machine, 10 ssh-slaves concurrently building jobs as chatty as the mock-load-builder job type is the most you can push. If you use JNLP slaves, you can get close to 60 concurrent builds before the system starts falling over.

The CloudBees NIO ssh-slaves plugin (part of the enterprise offering) has a different performance characteristic. In my most recent tests I was able to get up to 120 concurrent builds without affecting the Jenkins UI. (I had only set up that number of slaves. It likely can go further, though m3.large is not beefy enough.) What was affected, though, were build times: the builds were 2-3 times slower due to back-pressure effects causing the builds to block on STDOUT.

If anyone else is interested, we will be releasing our scalability test harness. (Actually, I will be ripping the bottom out of the acceptance test framework and putting the scalability harness in its place, but the harness is also useful for scalability testing.) We will also be publishing our findings.

The other thing to watch is how your entropy pool is holding up. The default random source in Linux typically gets exhausted quite quickly. That can cause your ssh slaves to fail ping tests and time out or block. I think the package you want to install is haveged. That, or switch Java to /dev/urandom.

Note: I am currently not recommending any specific slave connector; there are trade-offs with each type of connector. I will be writing up a blog post in the near future discussing the various trade-offs.

Standard ssh-slaves degrades poorly... This is great if you want to know when you have reached your limit.

NIO ssh-slaves degrades gracefully. I need to determine where it starts degrading relative to standard ssh-slaves, but if UI responsiveness is more important than build times then this has advantages (though you need to be a paying CloudBees customer).

JNLP scales the highest without affecting build times, but degrades fastest, is a poor fit for on-demand connection/retention strategies, and does not offer the same transport encryption security as the ssh- versions.

Those are just the brief high-level measures.

On Monday, 5 May 2014, Charles Chan <charles.wh.c...@gmail.com> wrote:

> Hello,
>
> One of the issue we have recently been experiencing with Jenkins is that the
> slaves (node) would go offline for no apparent reason and would not reconnect
> automatically.
> When slaves appear as offline, we tried to launch/reconnect the slave
> manually but it does not work either. However, we are able to SSH into the
> machine using PuTTy.
> The only workaround is to restart the Jenkins server, until the problem
> surfaces again. (Typically in a week.)
>
> Instance Information
> --------------------
> Jenkins Server: 1.562
> SSH Credentials Plugin: 1.6.1
> SSH Slaves Plugin 1.6
>
> Thread dump of slave node:
> {dump}
> "Channel reader thread: qa-linbuild-02" prio=5 WAITING
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
> com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
> com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
> com.trilead.ssh2.Session.<init>(Session.java:41)
> com.trilead.ssh2.Connection.openSession(Connection.java:1129)
> com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
> com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
> hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
> hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
> hudson.remoting.Channel.terminate(Channel.java:819)
> hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
>
> "Channel reader thread: qa-linbuild-03" prio=5 WAITING
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
> com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
> com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
> com.trilead.ssh2.Session.<init>(Session.java:41)
> com.trilead.ssh2.Connection.openSession(Connection.java:1129)
> com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
> com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
> hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
> hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
> hudson.remoting.Channel.terminate(Channel.java:819)
> hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
> {dump}
>
> Also concerning is the number of threads is in the BLOCKED (126!).
> Doesn't seem normal as there are no BLOCKED threads after the server is
> restarted.
> {dump}
> // 118 instances
> "Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
> hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
> hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:542)
> jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
>
> // 8 instances
> "Computer.threadPoolForRemoting [#2922]" daemon prio=5 BLOCKED
> hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)
> hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
> jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
> java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> java.util.concurrent.FutureTask.run(FutureTask.java:138)
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> java.lang.Thread.run(Thread.java:662)
> {dump}
>
> Looking forward to any ideas or suggestions.
>
> Thank you.
> Charles Chan
>
> --
> You received this message because you are subscribed to the Google Groups
> "Jenkins Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to jenkinsci-users+unsubscr...@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
--
Sent from my phone
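
P.S. For anyone who wants to look at the entropy pool on the Jenkins master before installing haveged, here is a rough sketch in Python. It only reads the kernel's entropy estimate from /proc/sys/kernel/random/entropy_avail (Linux only). The threshold of 200 bits and the function names are my own assumptions for illustration, not anything Jenkins or the kernel defines, so tune them for your hosts.

```python
from pathlib import Path

# Linux exposes the kernel's entropy estimate (in bits) in this file.
ENTROPY_FILE = Path("/proc/sys/kernel/random/entropy_avail")

# Assumed threshold for illustration only; not a Jenkins or kernel constant.
LOW_WATER_MARK = 200


def read_entropy(path=ENTROPY_FILE):
    """Return the available entropy in bits, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # File missing (non-Linux host) or unexpected contents.
        return None


def entropy_status(bits, low_water_mark=LOW_WATER_MARK):
    """Classify a reading as 'unknown', 'low', or 'ok'."""
    if bits is None:
        return "unknown"
    return "low" if bits < low_water_mark else "ok"


if __name__ == "__main__":
    bits = read_entropy()
    print(f"entropy_avail={bits} status={entropy_status(bits)}")
```

Run it a few times while slaves are connecting. If the status is regularly "low", that would be consistent with the ping failures described above. As far as I know, the usual way to switch Java to /dev/urandom is to start the JVM with -Djava.security.egd=file:/dev/./urandom, but please verify that against your own JVM version.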