Hello Stephen,

Thank you for the informative reply. I look forward to your blog post!

To answer your question, we have about two dozen standard SSH Linux 
slaves and about 10 JNLP Windows slaves to support various platforms and 
configurations.

Based on the build history, we sometimes have up to 10 jobs running 
concurrently. That is not 24x7, only about once every 2 hours, and the queue 
is empty most of the time. I would characterize the system as lightly 
loaded.

From your reply, I am even more concerned about the disproportionately high 
number of blocked threads (120) compared to offline slaves (2 at the 
time), as it sounds like the ratio should be closer to 1:1. Also, do you know 
whether the standard SSH connector times out and reconnects, or does it 
block indefinitely? I am not sure whether each reconnect attempt spawns 
new blocked threads.
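
The dump is at least consistent with one stuck thread holding a lock that every later attempt queues behind: the "Channel reader thread" is WAITING inside afterDisconnect (SSHLauncher.java:1160), while the pool threads are BLOCKED on entry to afterDisconnect (line 1152) and launch (line 639). Below is a minimal Java sketch of that pattern. It is not the plugin's code. It assumes both methods are guarded by the same per-launcher monitor, which the dump suggests but does not prove, and launcherLock, channelSignal and countBlocked are made-up names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;

class BlockedThreadsSketch {

    /**
     * Starts one "holder" thread that takes launcherLock and then waits on a
     * different object with no timeout, plus `attempts` threads that each try
     * to take launcherLock. Returns how many of them were seen BLOCKED.
     */
    static int countBlocked(int attempts) {
        final Object launcherLock = new Object();  // hypothetical: the launcher's monitor
        final Object channelSignal = new Object(); // hypothetical: the SSH channel state
        final CountDownLatch holderInside = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (launcherLock) {
                holderInside.countDown();
                synchronized (channelSignal) {
                    try {
                        // wait() releases channelSignal only; launcherLock stays held.
                        while (true) {
                            channelSignal.wait();
                        }
                    } catch (InterruptedException e) {
                        // Interrupted: leave both blocks and release launcherLock.
                    }
                }
            }
        }, "holder");
        holder.setDaemon(true);
        holder.start();

        try {
            holderInside.await();

            // Each "reconnect attempt" is a new thread that needs the same lock.
            List<Thread> waiters = new ArrayList<>();
            for (int i = 0; i < attempts; i++) {
                Thread t = new Thread(() -> {
                    synchronized (launcherLock) {
                        // Not reached while the holder is parked.
                    }
                }, "reconnect-" + i);
                t.setDaemon(true);
                t.start();
                waiters.add(t);
            }

            // Poll (for up to 5 seconds) until every waiter reports BLOCKED.
            int blocked = 0;
            long deadline = System.currentTimeMillis() + 5000;
            while (System.currentTimeMillis() < deadline) {
                blocked = 0;
                for (Thread t : waiters) {
                    if (t.getState() == Thread.State.BLOCKED) {
                        blocked++;
                    }
                }
                if (blocked == attempts) {
                    break;
                }
                Thread.sleep(10);
            }

            holder.interrupt(); // let the holder go so the waiters can finish
            for (Thread t : waiters) {
                t.join(5000);
            }
            return blocked;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while measuring", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("blocked threads: " + countBlocked(20));
    }
}
```

If this is what is happening, the BLOCKED count would grow with the number of reconnect attempts rather than with the number of offline slaves, which might explain 120 versus 2.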

Let me know if there is any other information which could prove to be 
useful.
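
On the entropy suggestion quoted below, the kernel's current estimate can be read on a Linux master from /proc/sys/kernel/random/entropy_avail. The following is only a sketch of such a check, with made-up class and method names. The usual way to point Java at /dev/urandom is, as far as I know, the JVM option -Djava.security.egd=file:/dev/./urandom, but please treat that as an assumption to verify for your JVM version.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class EntropyCheck {

    // Linux exposes the kernel's entropy estimate (in bits) in this file.
    static final Path ENTROPY_AVAIL =
            Paths.get("/proc/sys/kernel/random/entropy_avail");

    /** Parses the file content, which is a decimal number plus a newline. */
    static int parseEntropy(String raw) {
        return Integer.parseInt(raw.trim());
    }

    /** Returns the current estimate, or -1 if it cannot be read (e.g. not Linux). */
    static int readEntropy() {
        try {
            byte[] bytes = Files.readAllBytes(ENTROPY_AVAIL);
            return parseEntropy(new String(bytes, StandardCharsets.US_ASCII));
        } catch (IOException | NumberFormatException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int bits = readEntropy();
        System.out.println(bits < 0
                ? "entropy estimate unavailable"
                : "entropy_avail = " + bits + " bits");
    }
}
```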

Charles

On Monday, May 5, 2014 12:42:23 PM UTC-7, Stephen Connolly wrote:
>
>
> How many slaves do you have?
>
> It is rather easy to saturate a server with a small number of ssh-slaves 
> based slaves.
>
> For example, on an AWS m3.large class machine, 10 ssh-slaves concurrently 
> building jobs as chatty as the mock-load-builder job type is the most you 
> can push.
>
> If you use JNLP slaves, you can get close to 60 concurrent builds before 
> the system starts falling over.
>
> The CloudBees NIO ssh-slaves plugin (part of the enterprise offering) has 
> a different performance characteristic. In my most recent tests I was able 
> to get up to 120 concurrent builds without affecting the Jenkins UI (I 
> had only set up that number of slaves; it can likely go further, 
> though m3.large is not beefy enough). What was affected, though, were build 
> times: the builds were 2-3 times slower due to back-pressure effects 
> causing the builds to block on STDOUT.
>
> If anyone else is interested, we will be releasing our scalability test 
> harness (actually I will be ripping the bottom out of the acceptance test 
> framework and putting the scalability harness in its place... But the 
> harness is also useful for scalability testing). We will also be publishing 
> our findings.
>
> The other thing to watch is how your entropy pool is holding up. The 
> default random source in Linux typically gets exhausted quite quickly. That 
> can cause your ssh slaves to fail ping tests and time out or block.
>
> I think the package you want to install is haveged
>
> That or switch java to /dev/urandom
>
> Note: I am currently not recommending any specific slave connector, there 
> are trade-offs with each type of connector. I will be writing up a blog 
> post in the near future discussing the various trade-offs.
>
> Standard ssh-slaves degrades poorly... This is great if you want to know 
> when you have reached your limit
>
> NIO ssh-slaves degrades gracefully, I need to determine where it starts 
> degrading relative to standard ssh-slaves, but if UI responsiveness is more 
> important than build times then this has advantages (though you need to be 
> a paying cloudbees customer)
>
> JNLP scales the highest without affecting build times, but degrades 
> fastest, is a poor fit for on-demand connection/retention strategies and 
> does not offer the same transport encryption security as the ssh- versions
>
> Those are just the brief high-level measures
>
> On Monday, 5 May 2014, Charles Chan <charles...@gmail.com> wrote:
>
>> Hello,
>>
>> One of the issues we have recently been experiencing with Jenkins is that the 
>> slaves (nodes) go offline for no apparent reason and do not 
>> reconnect automatically.
>> When slaves appear offline, we try to launch/reconnect them 
>> manually, but that does not work either. However, we are able to SSH into the 
>> machine using PuTTY.
>>
>> The only workaround is to restart the Jenkins server, which helps until the 
>> problem surfaces again (typically in a week).
>>
>> Instance Information
>> --------------------
>> Jenkins Server:            1.562
>> SSH Credentials Plugin:    1.6.1
>> SSH Slaves Plugin:         1.6
>>
>> Thread dump of slave node:
>> {dump}
>> "Channel reader thread: qa-linbuild-02" prio=5 WAITING
>>      java.lang.Object.wait(Native Method)
>>      java.lang.Object.wait(Object.java:485)
>>      com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
>>      com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
>>      com.trilead.ssh2.Session.<init>(Session.java:41)
>>      com.trilead.ssh2.Connection.openSession(Connection.java:1129)
>>      com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
>>      com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
>>      hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
>>      hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
>>      hudson.remoting.Channel.terminate(Channel.java:819)
>>      hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
>>
>> "Channel reader thread: qa-linbuild-03" prio=5 WAITING
>>      java.lang.Object.wait(Native Method)
>>      java.lang.Object.wait(Object.java:485)
>>      com.trilead.ssh2.channel.ChannelManager.waitUntilChannelOpen(ChannelManager.java:109)
>>      com.trilead.ssh2.channel.ChannelManager.openSessionChannel(ChannelManager.java:583)
>>      com.trilead.ssh2.Session.<init>(Session.java:41)
>>      com.trilead.ssh2.Connection.openSession(Connection.java:1129)
>>      com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:99)
>>      com.trilead.ssh2.SFTPv3Client.<init>(SFTPv3Client.java:119)
>>      hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1160)
>>      hudson.slaves.SlaveComputer$2.onClosed(SlaveComputer.java:437)
>>      hudson.remoting.Channel.terminate(Channel.java:819)
>>      hudson.remoting.SynchronousCommandTransport$ReaderThread.run(SynchronousCommandTransport.java:76)
>> {dump}
>>
>> Also concerning is the number of threads in the BLOCKED state (126!). 
>> This doesn't seem normal, as there are no BLOCKED threads after the server is 
>> restarted.
>>
>> {dump}
>> // 118 instances
>> "Computer.threadPoolForRemoting [#26]" daemon prio=5 BLOCKED
>>      hudson.plugins.sshslaves.SSHLauncher.afterDisconnect(SSHLauncher.java:1152)
>>      hudson.slaves.SlaveComputer$3.run(SlaveComputer.java:542)
>>      jenkins.util.ContextResettingExecutorService$1.run(ContextResettingExecutorService.java:28)
>>      java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>>      java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>      java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>      java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>      java.lang.Thread.run(Thread.java:662)
>>
>> // 8 instances
>> "Computer.threadPoolForRemoting [#2922]" daemon prio=5 BLOCKED
>>      hudson.plugins.sshslaves.SSHLauncher.launch(SSHLauncher.java:639)
>>      hudson.slaves.SlaveComputer$1.call(SlaveComputer.java:222)
>>      jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
>>      java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>>      java.util.concurrent.FutureTask.run(FutureTask.java:138)
>>      java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>>      java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>>      java.lang.Thread.run(Thread.java:662)
>> {dump}
>>
>> Looking forward to any ideas or suggestions.
>>
>> Thank you.
>> Charles Chan
>>
>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Jenkins Users" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to jenkinsci-users+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
> -- 
> Sent from my phone
>

