I have a set of 12-hour builds that run across 45 nodes on 3 machines (4 if you 
count the master).

All the machines run Red Hat Enterprise Linux.
All the communication is via ssh (both job launch and node startup).

Here is the problem I am trying to track down:

Sometimes, the job finishes, and the node immediately (within a few minutes) 
updates its status with the master and is ready for the next job.

Sometimes, however, it will take the node hours to realize the job is finished 
and update.  Of my 45 nodes, 10 are currently in this state.

The job itself is a parameterized job; the actual build is this shell fragment:

#!/bin/sh
source ~/.bashrc
echo "Build Starting..."
$CVSHOME/build/scripts/armada/galleons/allIntegration.sh
echo "Build Finished"
exit 0

There are no post-build actions.
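
To tell whether the delay happens before or after the script itself returns, the fragment could be instrumented with timestamps and made to pass the build's real exit status through. This is only a sketch of that idea, not what currently runs: log_ts and run_build are hypothetical helper names, and the UTC timestamp format is an arbitrary choice.

```shell
#!/bin/sh
# Sketch only: log_ts and run_build are hypothetical helpers,
# not part of the job as it is configured today.

# Print a message prefixed with a UTC timestamp.
log_ts() {
    echo "$(date -u '+%Y-%m-%dT%H:%M:%SZ') $*"
}

# Run the given command, log when it starts and finishes,
# and return the command's own exit status.
run_build() {
    log_ts "Build Starting..."
    "$@"
    status=$?
    log_ts "Build Finished (exit status $status)"
    return "$status"
}

# Intended invocation inside the job (left commented out here):
# run_build "$CVSHOME/build/scripts/armada/galleons/allIntegration.sh"
```

Comparing the "Build Finished" timestamp in the console log with the time the node next reports in to the master would show how long the gap really is on the slow nodes.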


So the questions I have are:

1. What is the polling cycle on the node monitoring the job, and is it
   configurable?

2. Is there a way to get more information out of the node than just
   pinging systeminfo on the main Jenkins?

3. Where in the Jenkins code base is the node management code?


This is the thread dump for one of the stuck nodes (http://jenkins/node1/systeminfo):
Thread Dump
Channel reader thread: channel

"Channel reader thread: channel" Id=9 Group=main RUNNABLE (in native)
                at java.io.FileInputStream.readBytes(Native Method)
                at java.io.FileInputStream.read(FileInputStream.java:199)
                at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
                at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
                -  locked java.io.BufferedInputStream@2486ae
                at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249)
                at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542)
                at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
                at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
                at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
                at hudson.remoting.Channel$ReaderThread.run(Channel.java:1030)


main

"main" Id=1 Group=main WAITING on hudson.remoting.Channel@a17083
                at java.lang.Object.wait(Native Method)
                -  waiting on hudson.remoting.Channel@a17083
                at java.lang.Object.wait(Object.java:485)
                at hudson.remoting.Channel.join(Channel.java:766)
                at hudson.remoting.Launcher.main(Launcher.java:420)
                at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:366)
                at hudson.remoting.Launcher.run(Launcher.java:206)
                at hudson.remoting.Launcher.main(Launcher.java:168)


Ping thread for channel hudson.remoting.Channel@a17083:channel

"Ping thread for channel hudson.remoting.Channel@a17083:channel" Id=10 Group=main TIMED_WAITING
                at java.lang.Thread.sleep(Native Method)
                at hudson.remoting.PingThread.run(PingThread.java:86)


pool-1-thread-666

"pool-1-thread-666" Id=719 Group=main RUNNABLE
                at sun.management.ThreadImpl.dumpThreads0(Native Method)
                at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:374)
                at hudson.Functions.getThreadInfos(Functions.java:872)
                at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:93)
                at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:89)
                at hudson.remoting.UserRequest.perform(UserRequest.java:118)
                at hudson.remoting.UserRequest.perform(UserRequest.java:48)
                at hudson.remoting.Request$2.run(Request.java:287)
                at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
                at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
                at java.util.concurrent.FutureTask.run(FutureTask.java:138)
                at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
                at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
                at java.lang.Thread.run(Thread.java:619)

                Number of locked synchronizers = 1
                - java.util.concurrent.locks.ReentrantLock$NonfairSync@1630de2


Finalizer

"Finalizer" Id=3 Group=system WAITING on java.lang.ref.ReferenceQueue$Lock@64514
                at java.lang.Object.wait(Native Method)
                -  waiting on java.lang.ref.ReferenceQueue$Lock@64514
                at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
                at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
                at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)


Reference Handler

"Reference Handler" Id=2 Group=system WAITING on java.lang.ref.Reference$Lock@1a12930
                at java.lang.Object.wait(Native Method)
                -  waiting on java.lang.ref.Reference$Lock@1a12930
                at java.lang.Object.wait(Object.java:485)
                at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)


Signal Dispatcher

"Signal Dispatcher" Id=4 Group=system RUNNABLE

Thank you,

-Clark.
