So if NUMTHREADS has not been set/changed and defaults to 5, how do I 
account for the 44 (lightweight) processes associated with the STAX JVM?
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
root      2565  2514  2565  0   44 Feb17 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2566  0   44 Feb17 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2567  0   44 Feb17 ?        01:33:06 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2568  0   44 Feb17 ?        01:33:29 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2569  0   44 Feb17 ?        01:33:11 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2570  0   44 Feb17 ?        01:33:12 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2571  0   44 Feb17 ?        00:56:29 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2572  0   44 Feb17 ?        00:00:09 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2573  0   44 Feb17 ?        00:00:19 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2574  0   44 Feb17 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2575  0   44 Feb17 ?        00:01:37 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2576  0   44 Feb17 ?        00:01:57 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2577  0   44 Feb17 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2578  0   44 Feb17 ?        00:17:43 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2579  0   44 Feb17 ?        00:08:50 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2580  0   44 Feb17 ?        00:11:52 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2581  0   44 Feb17 ?        00:00:26 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2624  0   44 Feb17 ?        00:28:11 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2625  0   44 Feb17 ?        00:28:47 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2626  0   44 Feb17 ?        00:28:21 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2627  0   44 Feb17 ?        00:28:59 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2628  0   44 Feb17 ?        00:28:47 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  3484  0   44 Feb17 ?        00:22:59 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  7326  0   44 Feb17 ?        00:34:02 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 14097  0   44 Feb18 ?        00:33:18 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 20376 59   44 Feb18 ?        19-11:57:34 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 20466  0   44 Feb18 ?        00:05:43 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 20474  0   44 Feb18 ?        00:02:59 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 20487  0   44 Feb18 ?        00:06:03 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 20577  0   44 Feb18 ?        00:01:22 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 14154 59   44 Mar05 ?        10-14:40:07 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 14974  0   44 Mar05 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 29079  0   44 Mar08 ?        00:00:01 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 29081  0   44 Mar08 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 29082  0   44 Mar08 ?        00:19:12 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 29083  0   44 Mar08 ?        00:11:20 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 29084  0   44 Mar08 ?        00:24:29 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 29085  0   44 Mar08 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 29086  0   44 Mar08 ?        00:06:55 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 29095  0   44 Mar08 ?        00:22:46 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514 29098  0   44 Mar08 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  1893  0   44 16:12 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2059  0   44 16:24 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX
root      2565  2514  2110  0   44 16:29 ?        00:00:00 
/usr/java/jdk1.6.0_16/bin/java -Xmx384m 
com.ibm.staf.service.STAFServiceHelper STAX

I do note that these seem to have been started in several separate 
batches, i.e. most on Feb 17 when I last restarted STAX, but others a day 
later, some on March 5 and 8, and then today.  And those last 3 would seem 
to naturally correlate with 3 currently running jobs:
Job ID Job Name                                  Start Date-Time   Function
------ ----------------------------------------- ----------------- --------
11314  rep_slave_fail_inject.xml-rowrep          20100323-16:12:15 rep_tpcb
11317  rejoin_cluster_bug560-1021.xml-snowball10 20100323-16:24:24 main
11318  bug1195_conn_memleak-797.xml-snowball4    20100323-16:29:44 main


Is it possible that a STAX thread could go off to la-la land if it is 
unable to deliver its process completion message?  (This occurs to me 
based on http://staf.sourceforge.net/current/STAFFAQ.htm#d0e2032 .)
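
Tallying against the thread list you gave below, I would only expect 
something like

  1 (main) + 1 (STAXThreadQueue) + 1 (STAXTimedEventQueue)
    + 5 (NUMTHREADS default) + 3 (STAFQueueMonitor, one per running job) = 11

STAX-specific threads, so presumably the remaining LWPs are JVM-internal 
threads (GC, JIT compiler, finalizer, etc.) plus JSTAF/connection threads, 
though 30-odd of those still seems like a lot.

To see what the two busiest LWPs (20376 and 14154, each with days of 
accumulated CPU time above) actually are, I figure I can convert the LWP 
to hex and match it against the nid= field of a jstack thread dump.  A 
rough, untested sketch (jstack path and PID just taken from the ps output 
above):

# an LWP in hex is the "nid=0x..." value jstack prints for each thread
printf '0x%x 0x%x\n' 20376 14154        # -> 0x4f98 0x374a
/usr/java/jdk1.6.0_16/bin/jstack 2565 | grep -E 'nid=0x(4f98|374a)'

If those turn out to be STAFQueueMonitor threads for jobs that no longer 
show up in STAX LIST JOBS, that would fit the stuck-thread theory.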



Sharon Lucas wrote:
> 
> The STAX service has the following threads:
> 
> - Main thread for the STAX service:  Handles any requests submitted to 
> the STAX service
> 
> - Thread for the STAXThreadQueue:  Maintains a list of STAX-Threads 
> waiting for execution and schedules their execution on the limited 
> number of physical threads (e.g. set by the
> NUMTHREADS parameter when configuring the STAX service, or 5 by default).
> 
> - Thread for the STAXTimedEventQueue:  Checks for when a <timer> expires.
> 
> - Each STAXJob currently running has a STAFQueueMonitor thread.  This 
> thread has a loop that continually submits a GET WAIT request to the 
> QUEUE service to get any messages (STAF/Process/End and 
> STAF/RequestComplete) for this STAX job.
> Note:  I made a change (that I referred to in my previous email) in the 
> STAXJob::STAFQueueMonitorThread to get multiple messages off the queue at 
> once (instead of 1 at a time).  This change is in the next release of 
> STAX, planned to be released at the end of March 2010 (i.e. next week).
> 
> --------------------------------------------------------------
> Sharon Lucas
> IBM Austin,   luc...@us.ibm.com
> (512) 286-7313 or Tieline 363-7313
> 
> 
> 
> *Nathan Parrish <nparr...@clustrix.com>*
> 03/23/2010 03:09 PM
> To: Sharon Lucas/Austin/i...@ibmus
> cc: staf-users@lists.sourceforge.net
> Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
> 
> Name       Library    Executable
> ---------- ---------- ------------------------------------------------------
> DELAY      <Internal> <None>
> DIAG       <Internal> <None>
> ECHO       <Internal> <None>
> EVENT      JSTAF      /usr/local/staf/services/stax/STAFEvent.jar
> FS         <Internal> <None>
> HANDLE     <Internal> <None>
> HELP       <Internal> <None>
> LIFECYCLE  <Internal> <None>
> LOG        STAFLog    <None>
> MISC       <Internal> <None>
> PING       <Internal> <None>
> PROCESS    <Internal> <None>
> QUEUE      <Internal> <None>
> RESPOOL    STAFPool   <None>
> SEM        <Internal> <None>
> SERVICE    <Internal> <None>
> SHUTDOWN   <Internal> <None>
> STAX       JSTAF      /usr/local/staf/services/stax/STAX.jar
> TESTSTAFIA JSTAF      /usr/local/staf/lib/services/teststafia/TestSTAFia.jar
> TRACE      <Internal> <None>
> TRUST      <Internal> <None>
> VAR        <Internal> <None>
> 
> I retract my earlier suggestion that this might be unrelated to STAX.
> When I connect to the STAX JVM with gdb, I see the CPU load drop to
> negligible levels.  So I'm guessing that indeed something within STAX
> (i.e. a rogue job which is not shown in the job list?) is continually
> pelting STAFProc with some request(s), driving the CPU load.
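
One thing I plan to try next, based on the tracing suggestion from the 
older part of this thread below, is to turn on request tracing for a 
minute and see where the flood of QUEUE GET WAIT requests is coming from.  
A rough, untested sketch (trace file path is arbitrary):

STAF local TRACE SET DESTINATION TO FILE /tmp/STAFProc.trc
STAF local TRACE ENABLE TRACEPOINTS "ServiceRequest ServiceComplete"
sleep 60
STAF local TRACE DISABLE TRACEPOINTS "ServiceRequest ServiceComplete"
grep -c QUEUE /tmp/STAFProc.trc      # how much of the traffic is QUEUE?
STAF local HANDLE LIST HANDLES       # try to match the requester to a handle/name

If the requester turns out to be the handle of a STAX job that no longer 
shows up in STAX LIST JOBS, that would point at a stuck/orphaned job.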
> 
> Sharon Lucas wrote:
>  >
>  > What services are registered on this machine?  What is the output of:  
>  > STAF local SERVICE LIST
>  >
>  > --------------------------------------------------------------
>  > Sharon Lucas
>  > IBM Austin,   luc...@us.ibm.com
>  > (512) 286-7313 or Tieline 363-7313
>  >
>  >
>  >
>  > *Nathan Parrish <nparr...@clustrix.com>*
>  > 03/23/2010 02:31 PM
>  > To: Sharon Lucas/Austin/i...@ibmus
>  > cc: staf-users@lists.sourceforge.net
>  > Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
>  >
>  > Looking at the STAX processes/threads, I see two constantly doing
>  > something like:
>  > futex(0x2aaacc943f60, FUTEX_WAKE, 1)    = 0
>  > socket(PF_FILE, SOCK_STREAM, 0)         = 15
>  > fcntl(15, F_GETFD)                      = 0
>  > fcntl(15, F_SETFD, FD_CLOEXEC)          = 0
>  > fcntl(15, F_GETFL)                      = 0x2 (flags O_RDWR)
>  > fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK)   = 0
>  > connect(15, {sa_family=AF_FILE, path="/tmp/STAFIPC_STAF"}, 110) = 0
>  > fcntl(15, F_GETFL)                      = 0x802 (flags O_RDWR|O_NONBLOCK)
>  > fcntl(15, F_SETFL, O_RDWR)              = 0
>  > sendto(15, "\0\0\0\0\2\0\0\0", 8, 0, NULL, 0) = 8
>  > recvfrom(15, "\0\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
>  > sendto(15,
>  > "\0\0\0\0\5\n\0\0d\351\0\0\5\0\0\0\5\0\0\0\10\0\0\0localQUE"..., 42,
>  > 0, NULL, 0) = 42
>  > recvfrom(15, "\5\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
>  > recvfrom(15, ">\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
>  > recvfrom(15, "HandleManager::updateTimestamp()"..., 62, MSG_NOSIGNAL,
>  > NULL, NULL
>  > ) = 62
>  > close(15)
>  >
>  > I think this is the "QUEUE GET" traffic you described in your last email.
>  > However, I don't see any jobs when I STAX LIST JOBS on this machine,
>  > so where are these coming from?
>  >
>  > Nathan Parrish wrote:
>  >  > Hi Sharon,
>  >  >
>  >  > Back to this problem again...  CPU load on our STAX machine is sitting
>  >  > around 3, and has been well above 1 since early February.  What's
>  >  > interesting to note this time is that we apparently restarted STAX for
>  >  > some other reason on February 17, and yet load was plateaued just 
> south
>  >  > of 2 from February 3-ish.  It then increased to the current peak of ~3
>  >  > midway through the first week of March.  This suggests to me that 
> while
>  >  > STAX may be contributing, there seems to be some component of this
>  >  > problem involving STAF itself.
>  >  >
>  >  > In fact, I'm beginning to think there are several discrete issues 
> which
>  >  > are combining to push the CPU load up to where it begins to cause
>  >  > problems.  Possibly, one or more of these issues are related to the
>  >  > network problems I mentioned earlier this month on the STAF/STAX 
> forums;
>  >  > that is, I wonder if there might be some areas where some STAF or STAX
>  >  > threads get into a bad (CPU-eating) state if they miss a packet or 4.
>  >  >
>  >  > Presently I've managed to eliminate all long-running STAX jobs, 
> and yet
>  >  > I'm still seeing significant spew from my strace -fp on STAFProc.  It
>  >  > seems there are several patterns among the forked processes (of which 
> there
>  >  > are 300), and I'd like your help understanding what might be happening
>  >  > with each.
>  >  >
>  >  > Thread Behavior Pattern #1 (pid 2516; which is only 2 greater than
>  >  > STAFProc pid 2514, so I'm guessing this is a listener thread?):
>  >  > fcntl(16, F_GETFD)                      = 0
>  >  > fcntl(16, F_SETFD, FD_CLOEXEC)          = 0
>  >  > futex(0x2aaab005c504, FUTEX_CMP_REQUEUE, 1, 2147483647, 
> 0x2aaab005c4d8,
>  >  > 0) = 1
>  >  > futex(0x2aaab005c4d8, FUTEX_WAKE, 1)    = 1
>  >  > accept(8, {sa_family=AF_FILE, pa...@}, [12300248949435924482]) = 14
>  >  > fcntl(14, F_GETFD)                      = 0
>  >  > fcntl(14, F_SETFD, FD_CLOEXEC)          = 0
>  >  > futex(0x2aaab005c504, FUTEX_CMP_REQUEUE, 1, 2147483647, 
> 0x2aaab005c4d8,
>  >  > 0) = 1
>  >  > futex(0x2aaab005c4d8, FUTEX_WAKE, 1)    = 1
>  >  > accept(8, {sa_family=AF_FILE, pa...@}, [12300248949435924482]) = 16
>  >  > fcntl(16, F_GETFD)                      = 0
>  >  > fcntl(16, F_SETFD, FD_CLOEXEC)          = 0
>  >  > futex(0x2aaab005c504, FUTEX_CMP_REQUEUE, 1, 2147483647, 
> 0x2aaab005c4d8,
>  >  > 0) = 1
>  >  > futex(0x2aaab005c4d8, FUTEX_WAKE, 1)    = 1
>  >  > accept(8, {sa_family=AF_FILE, pa...@}, [12300248949435924482]) = 14
>  >  > fcntl(14, F_GETFD)                      = 0
>  >  > fcntl(14, F_SETFD, FD_CLOEXEC)          = 0
>  >  > futex(0x2aaab005c504, FUTEX_CMP_REQUEUE, 1, 2147483647, 
> 0x2aaab005c4d8,
>  >  > 0) = 1
>  >  >
>  >  > Thread Behavior Pattern #2 (pid 4228):
>  >  > futex(0x2aab9c000020, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource
>  >  > temporarily unavailable)
>  >  > futex(0x2aab9c000020, FUTEX_WAKE, 1)    = 0
>  >  > close(17)                               = 0
>  >  > futex(0x89ffa4, FUTEX_WAIT, 589133383, NULL) = 0
>  >  > futex(0x89ff78, FUTEX_WAKE, 1)          = 0
>  >  > select(10, [9], NULL, NULL, {120, 0})   = 1 (in [9], left {120, 0})
>  >  > recvfrom(9, "\0\0\0\0\2\0\0\0", 8, MSG_NOSIGNAL, NULL, NULL) = 8
>  >  > select(10, NULL, [9], NULL, {120, 0})   = 1 (out [9], left {120, 0})
>  >  > sendto(9, "\0\0\0\0", 4, 0, NULL, 0)    = 4
>  >  > select(10, [9], NULL, NULL, {120, 0})   = 1 (in [9], left {120, 0})
>  >  > recvfrom(9, "\0\0\0\0\5\n\0\0\345\r\0\0\5\0\0\0\5\0\0\0\10\0\0\0", 24,
>  >  > MSG_NOSIGNAL, NULL, NULL) = 24
>  >  > select(10, [9], NULL, NULL, {120, 0})   = 1 (in [9], left {120, 0})
>  >  > recvfrom(9, "localQUEUEGET WAIT", 18, MSG_NOSIGNAL, NULL, NULL) = 18
>  >  > select(10, NULL, [9], NULL, {120, 0})   = 1 (out [9], left {120, 0})
>  >  > sendto(9, "\5\0\0\0", 4, 0, NULL, 0)    = 4
>  >  > select(10, NULL, [9], NULL, {120, 0})   = 1 (out [9], left {120, 0})
>  >  > sendto(9, "=\0\0\0", 4, 0, NULL, 0)     = 4
>  >  > select(10, NULL, [9], NULL, {120, 0})   = 1 (out [9], left {120, 0})
>  >  > sendto(9, "HandleManager::updateTimestamp()"..., 61, 0, NULL, 0) = 61
>  >  > close(9)                                = 0
>  >  > futex(0x89ffa4, FUTEX_WAIT, 589133385, NULL) = 0
>  >  > futex(0x89ff78, FUTEX_WAKE, 1)          = 0
>  >  > select(18, [17], NULL, NULL, {120, 0})  = 1 (in [17], left {120, 0})
>  >  > recvfrom(17, "\0\0\0\0\2\0\0\0", 8, MSG_NOSIGNAL, NULL, NULL) = 8
>  >  > select(18, NULL, [17], NULL, {120, 0})  = 1 (out [17], left {120, 0})
>  >  > sendto(17, "\0\0\0\0", 4, 0, NULL, 0)   = 4
>  >  > select(18, [17], NULL, NULL, {120, 0})  = 1 (in [17], left {120, 0})
>  >  > recvfrom(17, 
> "\0\0\0\0\5\n\0\0\345\r\0\0\5\0\0\0\5\0\0\0\10\0\0\0", 24,
>  >  > MSG_NOSIGNAL, NULL, NULL) = 24
>  >  > select(18, [17], NULL, NULL, {120, 0})  = 1 (in [17], left {120, 0})
>  >  > recvfrom(17, "localQUEUEGET WAIT", 18, MSG_NOSIGNAL, NULL, NULL) = 18
>  >  > select(18, NULL, [17], NULL, {120, 0})  = 1 (out [17], left {120, 0})
>  >  > sendto(17, "\5\0\0\0", 4, 0, NULL, 0)   = 4
>  >  > select(18, NULL, [17], NULL, {120, 0})  = 1 (out [17], left {120, 0})
>  >  > sendto(17, "=\0\0\0", 4, 0, NULL, 0)    = 4
>  >  > select(18, NULL, [17], NULL, {120, 0})  = 1 (out [17], left {120, 0})
>  >  > sendto(17, "HandleManager::updateTimestamp()"..., 61, 0, NULL, 0) = 61
>  >  > close(17)
>  >  > futex(0x322814c960, FUTEX_WAKE, 1)      = 0
>  >  > futex(0x89ffa4, FUTEX_WAIT, 589133389, NULL
>  >  > [sits here quietly for some time]
>  >  >
>  >  > Thread Behavior Pattern #3 (pid 14664):
>  >  > recvfrom(16, "get system var testenvs/snowball"..., 62, MSG_NOSIGNAL,
>  >  > NULL, NULL) = 62
>  >  > select(17, [16], NULL, NULL, {120, 0})  = 1 (in [16], left {120, 0})
>  >  > recvfrom(16, "\4\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
>  >  > select(17, [16], NULL, NULL, {120, 0})  = 1 (in [16], left {120, 0})
>  >  > recvfrom(16, "none", 4, MSG_NOSIGNAL, NULL, NULL) = 4
>  >  > select(17, [16], NULL, NULL, {120, 0})  = 1 (in [16], left {120, 0})
>  >  > recvfrom(16, "\t\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
>  >  > select(17, [16], NULL, NULL, {120, 0})  = 1 (in [16], left {120, 0})
>  >  > recvfrom(16, "anonymous", 9, MSG_NOSIGNAL, NULL, NULL) = 9
>  >  > select(17, [16], NULL, NULL, {120, 0})  = 1 (in [16], left {120, 0})
>  >  > recvfrom(16, "\0\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
>  >  > select(17, [16], NULL, NULL, {120, 0})  = 1 (in [16], left {120, 0})
>  >  > recvfrom(16, "\33\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
>  >  > select(17, [16], NULL, NULL, {120, 0})  = 1 (in [16], left {120, 0})
>  >  > recvfrom(16, "devnode4.colo.sproutsys.com", 27, MSG_NOSIGNAL, NULL,
>  >  > NULL) = 27
>  >  > This appears to be the VAR service, which we use extensively.  It wakes
>  >  > up and does something every few seconds, which seems quite reasonable.
>  >  >
>  >  >
>  >  >
>  >  > What it seems like, from this distance, is that one of many PIDs 
> will be
>  >  > acting like #2, in what appears to be 'competition' with #1 (from 
> a very
>  >  > naive/cargo cult perspective; on further thought this is an I/O 
> thread,
>  >  > right?).  If I strace specifically the #2-type PID, it will appear
>  >  > to be mostly idle.  This could be some sort of strace anomaly, or more
>  >  > likely my misapprehension of the threading behavior of STAF.  But I'm
>  >  > led back to my working theory that some or many STAFProc threads 
> are off
>  >  > the ranch somewhere, causing this CPU churn.  Beyond stracing these
>  >  > threads/procs, what can I do to determine what these threads are 
> or why
>  >  > they are in this state?
>  >  >
>  >  > Hmm, gdb perhaps?
>  >  > OK, so my theory about 2516 seems to be validated with:
>  >  >    301 Thread 1081932112 (LWP 2516)  0x0000003228a0ce2b in accept ()
>  >  > from /lib64/libpthread.so.0
>  >  >    300 Thread 1086028112 (LWP 2517)  0x0000003228a0ce2b in accept ()
>  >  > from /lib64/libpthread.so.0
>  >  >
>  >  > everything else is in:
>  >  >    2 Thread 46913072695632 (LWP 28913)  0x0000003228a0a486 in
>  >  > pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>  >  >
>  >  > except for:
>  >  >    302 Thread 1077836112 (LWP 2515)  0x0000003228a0a677 in
>  >  > pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
>  >  >
>  >  > I don't feel terribly enlightened by this, unfortunately.  Other
>  >  > suggestions?
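
(Maybe attaching gdb and grabbing all the backtraces at once would help 
here, rather than restarting STAFProc under gdb.  A rough sketch, not yet 
tried against the live STAFProc:

gdb -p 2514
(gdb) thread apply all bt
(gdb) detach
(gdb) quit

Taking two such snapshots a few seconds apart and diffing them might show 
which of the ~300 threads are actually doing work.)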
>  >  >
>  >  > Thanks again,
>  >  > Nathan
>  >  >
>  >  > Sharon Lucas wrote:
>  >  >> Ok, then that shows that the high CPU usage was related to STAX.
>  >  >>
>  >  >> The strace output you provided shows lots of "GET WAIT" requests
>  >  >> submitted to the QUEUE service on the local STAX machine.  Each STAX
>  > job
>  >  >> has a STAFQueueMonitor class that in a loop continuously checks for
>  >  >> messages on the STAX job handle's queue (e.g. STAF command completion
>  >  >> messages and process completion messages) by submitting a QUEUE GET
>  > WAIT
>  >  >> request.  If there are no messages, the QUEUE GET WAIT request waits
>  >  >> indefinitely for a message to appear on its queue (and this 
> doesn't use
>  >  >> up CPU).  But, in your case, one or more STAX jobs were constantly
>  >  >> getting messages on its queue and processing them.  
>  >  >>
>  >  >> In STAF V3.3.3, we added support for getting multiple messages off a
>  >  >> handle's queue by adding the ALL and FIRST options to the QUEUE
>  >  >> service's GET/PEEK requests as getting multiple messages off a queue
>  > can
>  >  >> improve performance (over getting a single message at a time off a
>  > queue
>  >  >> multiple times).  So, I'll look into changing STAX to do a QUEUE GET
>  > ALL
>  >  >> WAIT request and when I get this done, I could send an updated 
> STAX.jar
>  >  >> file to you and perhaps you could try it out to see if it improves
>  >  >> performance and decreases CPU utilization.
>  >  >>
>  >  >> But, if you have some "abusive" STAX jobs that do CPU-intensive things
>  >  >> in a loop, then those STAX jobs should be fixed.  For example,
>  > if in
>  >  >> a STAX job someone uses a <loop> element to constantly submit a STAF
>  >  >> service request (like a PING request) via a <stafcmd>, there should
>  > also
>  >  >> be a delay in the loop between STAF service requests so that it isn't
>  >  >> using up all the CPU.  Anyone can write a STAX job that eats up 
> all the
>  >  >> CPU, just like anyone can write a program in any language that 
> eats up
>  >  >> all the CPU.
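
Revisiting the point above about putting a delay in polling loops: for our 
own job authors, I take this to mean something like the following 
throttled loop, where a DELAY request keeps the loop from spinning (just a 
sketch; the PING is a stand-in for whatever is actually being polled):

<loop>
  <sequence>
    <stafcmd>
      <location>'local'</location>
      <service>'PING'</service>
      <request>'PING'</request>
    </stafcmd>
    <!-- throttle: wait 5 seconds between iterations -->
    <stafcmd>
      <location>'local'</location>
      <service>'DELAY'</service>
      <request>'DELAY 5000'</request>
    </stafcmd>
  </sequence>
</loop>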
>  >  >>
>  >  >> --------------------------------------------------------------
>  >  >> Sharon Lucas
>  >  >> IBM Austin,   luc...@us.ibm.com
>  >  >> (512) 286-7313 or Tieline 363-7313
>  >  >>
>  >  >>
>  >  >>
>  >  >> *Nathan Parrish <nparr...@clustrix.com>*
>  >  >> 11/03/2009 07:39 PM
>  >  >> To: Sharon Lucas/Austin/i...@ibmus
>  >  >> cc: staf-users@lists.sourceforge.net
>  >  >> Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
>  >  >>
>  >  >> After removing/adding the STAX service (and taking the opportunity to
>  >  >> switch to a newer JVM), the CPU load has dropped.
>  >  >>
>  >  >> Before the service restart, strace did show STAFProc as being awfully
>  >  >> busy, compared to what I see now.  Now I see it stopping every second or
>  >  >> so, whereas before it spewed at a rate of 5-20 lines a second.  Here is
>  >  >> a sampling of output, not that I expect it will be particularly
>  >  >> revealing: http://pastebin.com/m769fe708
>  >  >>
>  >  >> Thanks again for your help in digging into this matter; perhaps the
>  >  >> newer JVM will have fixed the issue, though I suspect it's our tendency
>  >  >> towards abusive STAX jobs, and we'll see this crop up again.
>  >  >>
>  >  >> Nathan
>  >  >>
>  >  >> ----- "Sharon Lucas" <luc...@us.ibm.com> wrote:
>  >  >>
>  >  >>  > Right. STAF does not currently ever "reap" threads. If at one
>  > time 270
>  >  >>  > threads were created, then that's how many threads will remain
>  >  >>  > created. We have considered adding the ability to reap 
> threads, but
>  >  >>  > that has not been implemented.
>  >  >>  >
>  >  >>  > However, this should not affect CPU usage. But it does affect memory
>  >  >>  > usage, as when a thread is created, a default thread stack of 
> a size
>  >  >>  > dependent on the operating system is created. On Linux 
> machines, STAF
>  >  >>  > overrides the default thread stack size to 4M if it's > 4M (as 
> it can
>  >  >>  > be 8M or 10M) in STAF V3.2.3 or later. Note that you can further
>  >  >>  > override it by setting the STAF_THREAD_STACK_SIZE environment
>  > variable
>  >  >>  > before starting STAFProc. For example, setting
>  >  >>  > STAF_THREAD_STACK_SIZE=4194304 specifies 4M, as the value specified is
>  >  >>  > the thread stack size in bytes.
>  >  >>  >
>  >  >>  > --------------------------------------------------------------
>  >  >>  > Sharon Lucas
>  >  >>  > IBM Austin, luc...@us.ibm.com
>  >  >>  > (512) 286-7313 or Tieline 363-7313
>  >  >>  >
>  >  >>  >
>  >  >>  >
>  >  >>  > Nathan Parrish <nparr...@clustrix.com>
>  >  >>  > 11/03/2009 03:29 PM
>  >  >>  > To: Sharon Lucas/Austin/i...@ibmus
>  >  >>  > cc: staf-users@lists.sourceforge.net
>  >  >>  > Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
>  >  >>  >
>  >  >>  >
>  >  >>  > Right, I understand this. My question is how 'thread growth' works.
>  >  >>  > Presumably the delta controls some behavior regarding how many new
>  >  >>  > threads we spin off when the system detects that we are running
>  >  >>  > low/out of threads.
>  >  >>  > Also, are threads ever reaped? Or, if my system was incredibly busy at
>  >  >>  > some time in the past, such that 270 threads were really all going at
>  >  >>  > once, will I have all those threads sitting around even if they are no
>  >  >>  > longer needed? Is it possible that the overhead of this many idle
>  >  >>  > threads could be causing my CPU churn?
>  >  >>  >
>  >  >>  > thanks,
>  >  >>  > nathan
>  >  >>  >
>  >  >>  > ----- "Sharon Lucas" <luc...@us.ibm.com> wrote:
>  >  >>  >
>  >  >>  > > No, the STAF MISC service simply shows the initial threads 
> and the
>  >  >>  > > thread growth delta. It doesn't show the number of threads that
>  > STAF
>  >  >>  > > has created.
>  >  >>  > >
>  >  >>  > > --------------------------------------------------------------
>  >  >>  > > Sharon Lucas
>  >  >>  > > IBM Austin, luc...@us.ibm.com
>  >  >>  > > (512) 286-7313 or Tieline 363-7313
>  >  >>  > >
>  >  >>  > >
>  >  >>  > >
>  >  >>  > > Nathan Parrish <nparr...@clustrix.com>
>  >  >>  > > 11/03/2009 02:43 PM
>  >  >>  > > To: Sharon Lucas/Austin/i...@ibmus
>  >  >>  > > cc: staf-users@lists.sourceforge.net
>  >  >>  > > Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
>  >  >>  > >
>  >  >>  > > Our longevity test has finally finished, so I'm able to poke 
> around
>  >  >>  > a
>  >  >>  > > bit deeper, and can look at doing things like restarting the 
> STAX
>  >  >>  > > service, or STAFProc itself later this afternoon.
>  >  >>  > >
>  >  >>  > > I enabled tracing, and I see traffic on the order of maybe 5-20
>  >  >>  > > queries a second, mostly the VAR service (90+% gets), and some
>  >  >>  > > semaphore requests as well. I'll get breaks in traffic as 
> long as 3
>  >  >>  > > seconds...
>  >  >>  > >
>  >  >>  > > Short of doing gdb, I tried strace -fp on the process, and noticed
>  >  >>  > > that it was quite busy, and was also dealing with a very large number
>  >  >>  > > of child processes. I believe Linux does some weird stuff mixing what
>  >  >>  > > constitutes a process vs. a thread; I don't see all these processes
>  >  >>  > > with ps, but pstree -p does show them, something like ~270 (not
>  >  >>  > > counting the java procs/threads which are also underneath the main
>  >  >>  > > STAFProc PID). Looking on another machine (which is not running STAX,
>  >  >>  > > or servicing tons of variable requests, etc.), it has 82. My desktop,
>  >  >>  > > which does run STAX for testing purposes, looks to have 90.
>  >  >>  > >
>  >  >>  > > From the MISC settings:
>  >  >>  > > Initial Threads : 10
>  >  >>  > > Thread Growth Delta : 1
>  >  >>  > >
>  >  >>  > > Does this suggest that some threads have gotten spun off into
>  >  >>  > > never-never land and other threads have been created as a 
> result?
>  >  >>  > >
>  >  >>  > >
>  >  >>  > >
>  >  >>  > > ----- "Sharon Lucas" <luc...@us.ibm.com> wrote:
>  >  >>  > > > Unfortunately, I didn't really get many clues as to why 
> STAFProc
>  >  >>  > is
>  >  >>  > > > using up so much CPU from the information you provided. 
> Does the
>  >  >>  > CPU
>  >  >>  > > > usage for STAFProc constantly stay very high (e.g. above 
> 100%)?
>  >  >>  > > >
>  >  >>  > > > What are your STAX jobs doing? Can you give a description 
> of some
>  >  >>  > of
>  >  >>  > > > the STAF service requests that they are submitting.
>  >  >>  > > >
>  >  >>  > > > Since STAFProc was started 21 days ago on your STAX service
>  >  >>  > machine,
>  >  >>  > > > 1,832,518,328 STAF service requests have been submitted to 
> this
>  >  >>  > > > machine. That's a lot of STAF service requests. Do you 
> have any
>  >  >>  > > > "rogue" STAX jobs that are constantly submitting STAF service
>  >  >>  > > requests
>  >  >>  > > > in a loop (without a good reason)? What STAF service 
> requests are
>  >  >>  > > > being submitted the most?
>  >  >>  > > >
>  >  >>  > > > Yes, you can enable STAF tracing to see if that gives any 
> clues
>  >  >>  > > about
>  >  >>  > > > what's driving the CPU load (though note that enabling STAF
>  >  >>  > tracing
>  >  >>  > > > may slow things down a little). To see what STAF service 
> requests
>  >  >>  > > are
>  >  >>  > > > being submitted and when each STAF service request 
> completes, you
>  >  >>  > > > could enable trace points ServiceRequest, ServiceComplete, and
>  >  >>  > > > RemoteRequests. Note that this will generate tons of trace 
> output
>  >  >>  > > > since lots of STAF service requests are being submitted to 
> this
>  >  >>  > > > machine, so you'll also want to redirect STAFProc's trace 
> output
>  >  >>  > to
>  >  >>  > > a
>  >  >>  > > > file in a location where there is lots of available disk
>  > space and
>  >  >>  > > > monitor the size of this file. You may also want to enable the
>  >  >>  > > Warning
>  >  >>  > > > tracepoint so that any warning messages are also logged. For
>  >  >>  > example:
>  >  >>  > > >
>  >  >>  > > > STAF staxMachine TRACE SET DESTINATION TO FILE
>  >  >>  > > > /usr/local/staf/STAFProc.trc
>  >  >>  > > > STAF staxMachine TRACE ENABLE TRACEPOINTS "ServiceRequest
>  >  >>  > > > ServiceComplete RemoteRequests Warning"
>  >  >>  > > >
>  >  >>  > > > See section "8.18 Trace Service" in the STAF User's Guide for
>  > more
>  >  >>  > > > information.
>  >  >>  > > >
>  >  >>  > > > I don't know if the trace output will help us in 
> determining why
>  >  >>  > > > STAFProc is using up so much CPU. Maybe there is some 
> thread in
>  >  >>  > > > STAFProc that is in a bad state and constantly looping for 
> some
>  >  >>  > > > unknown reason (like the STAFProcessMonitor thread that 
> monitors
>  >  >>  > for
>  >  >>  > > > processes to complete). The only way I know of to check this
>  > would
>  >  >>  > > be
>  >  >>  > > > if STAFProc was started via a debugger like gdb. Then, once
>  > it got
>  >  >>  > > in
>  >  >>  > > > this "bad" state of high CPU usage, you could break in via
>  > gdb and
>  >  >>  > > > list threads (info threads) and change to each thread 
> (thread n)
>  >  >>  > and
>  >  >>  > > > check the backtraces (bt) for each thread to see what they are
>  >  >>  > > doing.
>  >  >>  > > > For example:
>  >  >>  > > >
>  >  >>  > > > Use gdb to debug STAF locally as follows:
>  >  >>  > > > 1. gdb STAFProc
>  >  >>  > > > 2. run
>  >  >>  > > > 3. Recreate the problem.
>  >  >>  > > >
>  >  >>  > > > Various commands that you might need while using gdb are:
>  >  >>  > > >
>  >  >>  > > > • help
>  >  >>  > > > • help tracepoints
>  >  >>  > > > • help stack
>  >  >>  > > > • info threads
>  >  >>  > > > • thread n
>  >  >>  > > > • bt
>  >  >>  > > >
>  >  >>  > > > Of course, this will require that STAF be shut down and then
>  >  >>  > > restarted
>  >  >>  > > > using gdb so you may not be able to do that now while your 
> long
>  >  >>  > runs
>  >  >>  > > > are still running.
>  >  >>  > > >
>  >  >>  > > > There's no reason that I know of not to use a 1.6.0 JVM 
> with the
>  >  >>  > > STAX
>  >  >>  > > > service so you can try that instead of upgrading to a more 
> recent
>  >  >>  > > > 1.5.0 JVM.
>  >  >>  > > >
>  >  >>  > > > You may want to increase the STAX service's 
> MaxFileCacheSize from
>  >  >>  > 20
>  >  >>  > > > to something like 50. This won't "fix" the problem of STAFProc
>  >  >>  > using
>  >  >>  > > > 100%+ CPU, but whenever a STAX job is executed, it first 
> needs to
>  >  >>  > be
>  >  >>  > > > XML parsed and this is a very CPU-intensive process (so if
>  > the CPU
>  >  >>  > > > usage is already high, it will take longer for STAX to parse a
>  >  >>  > STAX
>  >  >>  > > > job before execution of the STAX job begins). So, if you are
>  >  >>  > running
>  >  >>  > > > the same STAX job file more than once, the first time STAX
>  > needs to
>  >  >>  > > > parse it, but then it will cache it so that if the exact same
>  > STAX
>  >  >>  > > job
>  >  >>  > > > is submitted to be executed again, it doesn't have to be
>  > re-parsed
>  >  >>  > > (if
>  >  >>  > > > it's still in the STAX file cache). So, that's why I 
> recommended
>  >  >>  > > > increasing the STAX service's MaxFileCacheSize. It can be
>  >  >>  > increased
>  >  >>  > > > dynamically as follows:
>  >  >>  > > >
>  >  >>  > > > STAF staxMachine STAX SET MAXFILECACHESIZE 50
>  >  >>  > > >
>  >  >>  > > > Note that this setting only applies to this instance of 
> the STAX
>  >  >>  > > > service. If you shutdown and restarted STAFProc, it would no
>  >  >>  > longer
>  >  >>  > > > apply. You would want to add this setting when registering the
>  >  >>  > STAX
>  >  >>  > > > service in your STAF.cfg file to make it "permanent".
>  >  >>  > > >
>  >  >>  > > > You may also want to increase the maximum heap size for 
> the STAX
>  >  >>  > JVM
>  >  >>  > > > from 384m to something like 512m by specifying OPTION 
> J2=-Xmx512m
>  >  >>  > > > (instead of OPTION J2=-Xmx384m).
>  >  >>  > > >
>  >  >>  > > > SERVICE STAX LIBRARY JSTAF EXECUTE
>  >  >>  > > > /usr/local/staf/services/stax/STAX.jar \
>  >  >>  > > > OPTION JVM=/usr/java/jdk1.6.0_16/bin/java OPTION 
> JVMName=STAX \
>  >  >>  > > > OPTION J2=-Xmx512m PARMS "MAXFILECACHESIZE 50"
>  >  >>  > > >
>  >  >>  > > > --------------------------------------------------------------
>  >  >>  > > > Sharon Lucas
>  >  >>  > > > IBM Austin, luc...@us.ibm.com
>  >  >>  > > > (512) 286-7313 or Tieline 363-7313
>  >  >>
>  >  >
>  >  >
>  > 
>  >
> 

