I have a set of 12-hour builds that run across 45 nodes on 3 machines (4 if you count the master).
All the machines are Red Hat Enterprise. All the communication is via ssh (both job launch and node startup).

Here is the problem I am trying to track down: Sometimes the job finishes and the node immediately (within a few minutes) updates its status with the master and is ready for the next job. Sometimes, however, it takes the node hours to realize the job is finished and update. Of my 45 nodes, 10 are currently in this state.

The job itself is a parameterized job; the actual build is this shell fragment:

    #!/bin/sh
    source ~/.bashrc
    echo "Build Starting..."
    $CVSHOME/build/scripts/armada/galleons/allIntegration.sh
    echo "Build Finished"
    exit 0

There are no post-build actions.

So the questions I have are:

1. What is the polling cycle on the node monitoring the job, and is it configurable?
2. Is there a way to get more information out of the node than just pinging systeminfo on the main Jenkins?
3. Where in the Jenkins code base is the node management code?

This is the thread dump for one of them (http://jenkins/node1/systeminfo):

Thread Dump

Channel reader thread: channel

    "Channel reader thread: channel" Id=9 Group=main RUNNABLE (in native)
        at java.io.FileInputStream.readBytes(Native Method)
        at java.io.FileInputStream.read(FileInputStream.java:199)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        - locked java.io.BufferedInputStream@2486ae
        at java.io.ObjectInputStream$PeekInputStream.peek(ObjectInputStream.java:2249)
        at java.io.ObjectInputStream$BlockDataInputStream.peek(ObjectInputStream.java:2542)
        at java.io.ObjectInputStream$BlockDataInputStream.peekByte(ObjectInputStream.java:2552)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1297)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at hudson.remoting.Channel$ReaderThread.run(Channel.java:1030)

main

    "main" Id=1 Group=main WAITING on hudson.remoting.Channel@a17083
        at java.lang.Object.wait(Native Method)
        - waiting on hudson.remoting.Channel@a17083
        at java.lang.Object.wait(Object.java:485)
        at hudson.remoting.Channel.join(Channel.java:766)
        at hudson.remoting.Launcher.main(Launcher.java:420)
        at hudson.remoting.Launcher.runWithStdinStdout(Launcher.java:366)
        at hudson.remoting.Launcher.run(Launcher.java:206)
        at hudson.remoting.Launcher.main(Launcher.java:168)

Ping thread for channel hudson.remoting.Channel@a17083:channel

    "Ping thread for channel hudson.remoting.Channel@a17083:channel" Id=10 Group=main TIMED_WAITING
        at java.lang.Thread.sleep(Native Method)
        at hudson.remoting.PingThread.run(PingThread.java:86)

pool-1-thread-666

    "pool-1-thread-666" Id=719 Group=main RUNNABLE
        at sun.management.ThreadImpl.dumpThreads0(Native Method)
        at sun.management.ThreadImpl.dumpAllThreads(ThreadImpl.java:374)
        at hudson.Functions.getThreadInfos(Functions.java:872)
        at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:93)
        at hudson.util.RemotingDiagnostics$GetThreadDump.call(RemotingDiagnostics.java:89)
        at hudson.remoting.UserRequest.perform(UserRequest.java:118)
        at hudson.remoting.UserRequest.perform(UserRequest.java:48)
        at hudson.remoting.Request$2.run(Request.java:287)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
        at java.util.concurrent.FutureTask.run(FutureTask.java:138)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
        at java.lang.Thread.run(Thread.java:619)

        Number of locked synchronizers = 1
        - java.util.concurrent.locks.ReentrantLock$NonfairSync@1630de2

Finalizer

    "Finalizer" Id=3 Group=system WAITING on java.lang.ref.ReferenceQueue$Lock@64514
        at java.lang.Object.wait(Native Method)
        - waiting on java.lang.ref.ReferenceQueue$Lock@64514
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
        at java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
        at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:159)

Reference Handler

    "Reference Handler" Id=2 Group=system WAITING on java.lang.ref.Reference$Lock@1a12930
        at java.lang.Object.wait(Native Method)
        - waiting on java.lang.ref.Reference$Lock@1a12930
        at java.lang.Object.wait(Object.java:485)
        at java.lang.ref.Reference$ReferenceHandler.run(Reference.java:116)

Signal Dispatcher

    "Signal Dispatcher" Id=4 Group=system RUNNABLE

Thank you,
-Clark
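One way to see how long the gap really is would be to timestamp the build step, so the console log records when the script itself exits and that can be compared with when the node reports back to the master. The following is only a sketch, assuming a POSIX shell; `run_logged` is a hypothetical helper name, not anything Jenkins provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of Jenkins): run a command and print
# timestamped start/finish lines, so the console log records when the
# command itself returned and with what exit status.
run_logged() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] starting: $*"
    "$@"
    status=$?
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] finished (exit status $status): $*"
    return "$status"
}

# Intended use in the build step (path taken from the build fragment):
#   run_logged "$CVSHOME/build/scripts/armada/galleons/allIntegration.sh"
```

Comparing the "finished" timestamp in the console output with the time the node shows as idle on the master would indicate whether the delay is in the build itself or in the status update.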