so if NUMTHREADS has not been set/changed and defaults to 5, how do I account for the 44 (lightweight) processes associated with the STAX JVM?

UID PID PPID LWP C NLWP STIME TTY TIME CMD
root 2565 2514 2565 0 44 Feb17 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2566 0 44 Feb17 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2567 0 44 Feb17 ? 01:33:06 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2568 0 44 Feb17 ? 01:33:29 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2569 0 44 Feb17 ? 01:33:11 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2570 0 44 Feb17 ? 01:33:12 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2571 0 44 Feb17 ? 00:56:29 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2572 0 44 Feb17 ? 00:00:09 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2573 0 44 Feb17 ? 00:00:19 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2574 0 44 Feb17 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2575 0 44 Feb17 ? 00:01:37 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2576 0 44 Feb17 ? 00:01:57 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2577 0 44 Feb17 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2578 0 44 Feb17 ? 00:17:43 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2579 0 44 Feb17 ? 00:08:50 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2580 0 44 Feb17 ? 00:11:52 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2581 0 44 Feb17 ? 00:00:26 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2624 0 44 Feb17 ? 00:28:11 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2625 0 44 Feb17 ? 00:28:47 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2626 0 44 Feb17 ? 00:28:21 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2627 0 44 Feb17 ? 00:28:59 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2628 0 44 Feb17 ? 00:28:47 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 3484 0 44 Feb17 ? 00:22:59 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 7326 0 44 Feb17 ? 00:34:02 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 14097 0 44 Feb18 ? 00:33:18 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 20376 59 44 Feb18 ? 19-11:57:34 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 20466 0 44 Feb18 ? 00:05:43 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 20474 0 44 Feb18 ? 00:02:59 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 20487 0 44 Feb18 ? 00:06:03 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 20577 0 44 Feb18 ? 00:01:22 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 14154 59 44 Mar05 ? 10-14:40:07 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 14974 0 44 Mar05 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 29079 0 44 Mar08 ? 00:00:01 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 29081 0 44 Mar08 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 29082 0 44 Mar08 ? 00:19:12 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 29083 0 44 Mar08 ? 00:11:20 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 29084 0 44 Mar08 ? 00:24:29 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 29085 0 44 Mar08 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 29086 0 44 Mar08 ? 00:06:55 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 29095 0 44 Mar08 ? 00:22:46 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 29098 0 44 Mar08 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 1893 0 44 16:12 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2059 0 44 16:24 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
root 2565 2514 2110 0 44 16:29 ? 00:00:00 /usr/java/jdk1.6.0_16/bin/java -Xmx384m com.ibm.staf.service.STAFServiceHelper STAX
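
as a quick sanity check on that count, it looks like the thread count for the STAX JVM can be polled directly, to see whether it is stable or still creeping up. a rough sketch, assuming stock Linux procps ps/top, with 2565 being the STAX JVM pid from the listing above:

    # current number of lightweight processes (threads) in the STAX JVM
    ps -o nlwp= -p 2565

    # live per-thread view of the same JVM; the TIME+ / %CPU columns show which LWPs are hot
    top -H -p 2565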
I do note that these seem to have been started in several separate batches, i.e. most on Feb 17 when I last restarted STAX, but others a day later, some on March 5 and 8, and then today. those last 3 would seem to naturally correlate with 3 currently running jobs:

Job ID Job Name                                  Start Date-Time   Function
------ ----------------------------------------- ----------------- --------
11314  rep_slave_fail_inject.xml-rowrep          20100323-16:12:15 rep_tpcb
11317  rejoin_cluster_bug560-1021.xml-snowball10 20100323-16:24:24 main
11318  bug1195_conn_memleak-797.xml-snowball4    20100323-16:29:44 main

is it possible that a STAX thread could go off to la-la land if it is unable to deliver its process completion message? (this occurs to me based on http://staf.sourceforge.net/current/STAFFAQ.htm#d0e2032)

Sharon Lucas wrote:
>
> The STAX service has the following threads:
>
> - Main thread for the STAX service: Handles any requests submitted to the STAX service
>
> - Thread for the STAXThreadQueue: Maintains a list of STAX-Threads waiting for execution and schedules their execution on the limited number of physical threads (e.g. set by the NUMTHREADS parameter when configuring the STAX service, or 5 by default).
>
> - Thread for the STAXTimedEventQueue: Checks for when a <timer> expires.
>
> - Each STAXJob currently running has a STAFQueueMonitor thread. This thread has a loop that continually submits a GET WAIT request to the QUEUE service to get any messages (STAF/Process/End and STAF/RequestComplete) for this STAX job.
>
> Note: I made a change (that I referred to in my previous email) in the STAXJob::STAFQueueMonitorThread to get multiple messages off the queue at once (instead of 1 at a time). This change is in the next release of STAX, planned to be released at the end of March 2010 (i.e. next week).
>
> --------------------------------------------------------------
> Sharon Lucas
> IBM Austin, luc...@us.ibm.com
> (512) 286-7313 or Tieline 363-7313
>
>
> *Nathan Parrish <nparr...@clustrix.com>*
> 03/23/2010 03:09 PM
>
> To: Sharon Lucas/Austin/i...@ibmus
> cc: staf-users@lists.sourceforge.net
> Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
>
> Name       Library    Executable
> ---------- ---------- ------------------------------------------------------
> DELAY      <Internal> <None>
> DIAG       <Internal> <None>
> ECHO       <Internal> <None>
> EVENT      JSTAF      /usr/local/staf/services/stax/STAFEvent.jar
> FS         <Internal> <None>
> HANDLE     <Internal> <None>
> HELP       <Internal> <None>
> LIFECYCLE  <Internal> <None>
> LOG        STAFLog    <None>
> MISC       <Internal> <None>
> PING       <Internal> <None>
> PROCESS    <Internal> <None>
> QUEUE      <Internal> <None>
> RESPOOL    STAFPool   <None>
> SEM        <Internal> <None>
> SERVICE    <Internal> <None>
> SHUTDOWN   <Internal> <None>
> STAX       JSTAF      /usr/local/staf/services/stax/STAX.jar
> TESTSTAFIA JSTAF      /usr/local/staf/lib/services/teststafia/TestSTAFia.jar
> TRACE      <Internal> <None>
> TRUST      <Internal> <None>
> VAR        <Internal> <None>
>
> I retract my earlier suggestion that this might be unrelated to STAX. when I connect to the STAX JVM with gdb, I see the CPU load drop to negligible levels. so I'm guessing that indeed something within STAX (i.e. a rogue job which is not shown in job list?) is continually pelting STAFProc with some request(s), driving the CPU load.
>
> Sharon Lucas wrote:
> >
> > What services are registered on this machine?
> > What is the output of:
> >
> > STAF local SERVICE LIST
> >
> > --------------------------------------------------------------
> > Sharon Lucas
> > IBM Austin, luc...@us.ibm.com
> > (512) 286-7313 or Tieline 363-7313
> >
> >
> > *Nathan Parrish <nparr...@clustrix.com>*
> > 03/23/2010 02:31 PM
> >
> > To: Sharon Lucas/Austin/i...@ibmus
> > cc: staf-users@lists.sourceforge.net
> > Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
> >
> > looking at the STAX processes/threads I see two constantly doing something like:
> >
> > futex(0x2aaacc943f60, FUTEX_WAKE, 1) = 0
> > socket(PF_FILE, SOCK_STREAM, 0) = 15
> > fcntl(15, F_GETFD) = 0
> > fcntl(15, F_SETFD, FD_CLOEXEC) = 0
> > fcntl(15, F_GETFL) = 0x2 (flags O_RDWR)
> > fcntl(15, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> > connect(15, {sa_family=AF_FILE, path="/tmp/STAFIPC_STAF"}, 110) = 0
> > fcntl(15, F_GETFL) = 0x802 (flags O_RDWR|O_NONBLOCK)
> > fcntl(15, F_SETFL, O_RDWR) = 0
> > sendto(15, "\0\0\0\0\2\0\0\0", 8, 0, NULL, 0) = 8
> > recvfrom(15, "\0\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
> > sendto(15, "\0\0\0\0\5\n\0\0d\351\0\0\5\0\0\0\5\0\0\0\10\0\0\0localQUE"..., 42, 0, NULL, 0) = 42
> > recvfrom(15, "\5\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
> > recvfrom(15, ">\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
> > recvfrom(15, "HandleManager::updateTimestamp()"..., 62, MSG_NOSIGNAL, NULL, NULL) = 62
> > close(15)
> >
> > I think this is the "QUEUE GET" traffic you described in your last email. however, I don't see any jobs when I STAX LIST JOBS on this machine, so where are these coming from?
> >
> > Nathan Parrish wrote:
> > > Hi Sharon,
> > >
> > > Back to this problem again... CPU load on our STAX machine is sitting around 3, and has been well above 1 since early February. What's interesting to note this time is that we apparently restarted STAX for some other reason on February 17, and yet load had plateaued just south of 2 from February 3-ish. It then increased to the current peak of ~3 midway through the first week of March. This suggests to me that while STAX may be contributing, there seems to be some component of this problem involving STAF itself.
> > >
> > > In fact, I'm beginning to think there are several discrete issues which are combining to push the CPU load up to where it begins to cause problems. Possibly, one or more of these issues are related to the network problems I mentioned earlier this month on the STAF/STAX forums; that is, I wonder if there might be some areas where STAF or STAX threads get into a bad (CPU-eating) state if they miss a packet or 4.
> > >
> > > Presently I've managed to eliminate all long-running STAX jobs, and yet I'm still seeing significant spew from my strace -fp on STAFProc. It seems there are several patterns among the forked processes (of which there are 300), and I'd like your help understanding what might be happening with each.
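
(one way to narrow down which of those ~300 LWPs is actually burning CPU, rather than strace-ing the whole tree, would be to watch per-thread CPU and then trace only the hot LWP with timestamps. a rough sketch, assuming stock Linux top/strace, using the STAFProc pid 2514 and the pattern #2 pid 4228 quoted below:

    # per-thread view of STAFProc; the TIME+ / %CPU columns show which LWPs are actually busy
    top -H -p 2514

    # then attach to a single suspect LWP, with wall-clock timestamps and per-syscall timing
    strace -tt -T -p 4228

an LWP that is genuinely spinning should show a steady stream of calls here, whereas an strace artifact would mostly sit parked in a blocking call.)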
> > > Thread Behavior Pattern #1 (pid 2516; which is only 2 greater than STAFProc pid 2514, so I'm guessing this is a listener thread?):
> > >
> > > fcntl(16, F_GETFD) = 0
> > > fcntl(16, F_SETFD, FD_CLOEXEC) = 0
> > > futex(0x2aaab005c504, FUTEX_CMP_REQUEUE, 1, 2147483647, 0x2aaab005c4d8, 0) = 1
> > > futex(0x2aaab005c4d8, FUTEX_WAKE, 1) = 1
> > > accept(8, {sa_family=AF_FILE, pa...@}, [12300248949435924482]) = 14
> > > fcntl(14, F_GETFD) = 0
> > > fcntl(14, F_SETFD, FD_CLOEXEC) = 0
> > > futex(0x2aaab005c504, FUTEX_CMP_REQUEUE, 1, 2147483647, 0x2aaab005c4d8, 0) = 1
> > > futex(0x2aaab005c4d8, FUTEX_WAKE, 1) = 1
> > > accept(8, {sa_family=AF_FILE, pa...@}, [12300248949435924482]) = 16
> > > fcntl(16, F_GETFD) = 0
> > > fcntl(16, F_SETFD, FD_CLOEXEC) = 0
> > > futex(0x2aaab005c504, FUTEX_CMP_REQUEUE, 1, 2147483647, 0x2aaab005c4d8, 0) = 1
> > > futex(0x2aaab005c4d8, FUTEX_WAKE, 1) = 1
> > > accept(8, {sa_family=AF_FILE, pa...@}, [12300248949435924482]) = 14
> > > fcntl(14, F_GETFD) = 0
> > > fcntl(14, F_SETFD, FD_CLOEXEC) = 0
> > > futex(0x2aaab005c504, FUTEX_CMP_REQUEUE, 1, 2147483647, 0x2aaab005c4d8, 0) = 1
> > >
> > > Thread Behavior Pattern #2 (pid 4228):
> > >
> > > futex(0x2aab9c000020, FUTEX_WAIT, 2, NULL) = -1 EAGAIN (Resource temporarily unavailable)
> > > futex(0x2aab9c000020, FUTEX_WAKE, 1) = 0
> > > close(17) = 0
> > > futex(0x89ffa4, FUTEX_WAIT, 589133383, NULL) = 0
> > > futex(0x89ff78, FUTEX_WAKE, 1) = 0
> > > select(10, [9], NULL, NULL, {120, 0}) = 1 (in [9], left {120, 0})
> > > recvfrom(9, "\0\0\0\0\2\0\0\0", 8, MSG_NOSIGNAL, NULL, NULL) = 8
> > > select(10, NULL, [9], NULL, {120, 0}) = 1 (out [9], left {120, 0})
> > > sendto(9, "\0\0\0\0", 4, 0, NULL, 0) = 4
> > > select(10, [9], NULL, NULL, {120, 0}) = 1 (in [9], left {120, 0})
> > > recvfrom(9, "\0\0\0\0\5\n\0\0\345\r\0\0\5\0\0\0\5\0\0\0\10\0\0\0", 24, MSG_NOSIGNAL, NULL, NULL) = 24
> > > select(10, [9], NULL, NULL, {120, 0}) = 1 (in [9], left {120, 0})
> > > recvfrom(9, "localQUEUEGET WAIT", 18, MSG_NOSIGNAL, NULL, NULL) = 18
> > > select(10, NULL, [9], NULL, {120, 0}) = 1 (out [9], left {120, 0})
> > > sendto(9, "\5\0\0\0", 4, 0, NULL, 0) = 4
> > > select(10, NULL, [9], NULL, {120, 0}) = 1 (out [9], left {120, 0})
> > > sendto(9, "=\0\0\0", 4, 0, NULL, 0) = 4
> > > select(10, NULL, [9], NULL, {120, 0}) = 1 (out [9], left {120, 0})
> > > sendto(9, "HandleManager::updateTimestamp()"..., 61, 0, NULL, 0) = 61
> > > close(9) = 0
> > > futex(0x89ffa4, FUTEX_WAIT, 589133385, NULL) = 0
> > > futex(0x89ff78, FUTEX_WAKE, 1) = 0
> > > select(18, [17], NULL, NULL, {120, 0}) = 1 (in [17], left {120, 0})
> > > recvfrom(17, "\0\0\0\0\2\0\0\0", 8, MSG_NOSIGNAL, NULL, NULL) = 8
> > > select(18, NULL, [17], NULL, {120, 0}) = 1 (out [17], left {120, 0})
> > > sendto(17, "\0\0\0\0", 4, 0, NULL, 0) = 4
> > > select(18, [17], NULL, NULL, {120, 0}) = 1 (in [17], left {120, 0})
> > > recvfrom(17, "\0\0\0\0\5\n\0\0\345\r\0\0\5\0\0\0\5\0\0\0\10\0\0\0", 24, MSG_NOSIGNAL, NULL, NULL) = 24
> > > select(18, [17], NULL, NULL, {120, 0}) = 1 (in [17], left {120, 0})
> > > recvfrom(17, "localQUEUEGET WAIT", 18, MSG_NOSIGNAL, NULL, NULL) = 18
> > > select(18, NULL, [17], NULL, {120, 0}) = 1 (out [17], left {120, 0})
> > > sendto(17, "\5\0\0\0", 4, 0, NULL, 0) = 4
> > > select(18, NULL, [17], NULL, {120, 0}) = 1 (out [17], left {120, 0})
> > > sendto(17, "=\0\0\0", 4, 0, NULL, 0) = 4
> > > select(18, NULL, [17], NULL, {120, 0}) = 1 (out [17], left {120, 0})
> > > sendto(17, "HandleManager::updateTimestamp()"..., 61, 0, NULL, 0) = 61
> > > close(17)
> > > futex(0x322814c960, FUTEX_WAKE, 1) = 0
> > > futex(0x89ffa4, FUTEX_WAIT, 589133389, NULL
> > > [sits here quietly for some time]
> > >
> > > Thread Behavior Pattern #3 (pid 14664):
> > >
> > > recvfrom(16, "get system var testenvs/snowball"..., 62, MSG_NOSIGNAL, NULL, NULL) = 62
> > > select(17, [16], NULL, NULL, {120, 0}) = 1 (in [16], left {120, 0})
> > > recvfrom(16, "\4\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
> > > select(17, [16], NULL, NULL, {120, 0}) = 1 (in [16], left {120, 0})
> > > recvfrom(16, "none", 4, MSG_NOSIGNAL, NULL, NULL) = 4
> > > select(17, [16], NULL, NULL, {120, 0}) = 1 (in [16], left {120, 0})
> > > recvfrom(16, "\t\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
> > > select(17, [16], NULL, NULL, {120, 0}) = 1 (in [16], left {120, 0})
> > > recvfrom(16, "anonymous", 9, MSG_NOSIGNAL, NULL, NULL) = 9
> > > select(17, [16], NULL, NULL, {120, 0}) = 1 (in [16], left {120, 0})
> > > recvfrom(16, "\0\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
> > > select(17, [16], NULL, NULL, {120, 0}) = 1 (in [16], left {120, 0})
> > > recvfrom(16, "\33\0\0\0", 4, MSG_NOSIGNAL, NULL, NULL) = 4
> > > select(17, [16], NULL, NULL, {120, 0}) = 1 (in [16], left {120, 0})
> > > recvfrom(16, "devnode4.colo.sproutsys.com", 27, MSG_NOSIGNAL, NULL, NULL) = 27
> > >
> > > this appears to be the VAR service, which we use extensively. it wakes up and does something every few seconds, which seems quite reasonable.
> > >
> > > What it seems like, from this distance, is that one of many PIDs will be acting like #2, in what appears to be 'competition' with #1 (from a very naive/cargo cult perspective; on further thought this is an I/O thread, right?). If I strace specifically the #2-type PID, it will appear to be mostly idle. This could be some sort of strace anomaly, or more likely my misapprehension of the threading behavior of STAF. But I'm led back to my working theory that some or many STAFProc threads are off the ranch somewhere, causing this CPU churn. Beyond stracing these threads/procs, what can I do to determine what these threads are or why they are in this state?
> > >
> > > Hmm, gdb perhaps? ok, so my theory about 2516 seems to be validated with:
> > >
> > > 301 Thread 1081932112 (LWP 2516) 0x0000003228a0ce2b in accept () from /lib64/libpthread.so.0
> > > 300 Thread 1086028112 (LWP 2517) 0x0000003228a0ce2b in accept () from /lib64/libpthread.so.0
> > >
> > > everything else is in:
> > > 2 Thread 46913072695632 (LWP 28913) 0x0000003228a0a486 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > >
> > > except for:
> > > 302 Thread 1077836112 (LWP 2515) 0x0000003228a0a677 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> > >
> > > I don't feel terribly enlightened by this, unfortunately. Other suggestions?
> > >
> > > Thanks again,
> > > Nathan
> > >
> > > Sharon Lucas wrote:
> > >> Ok, then that shows that the high CPU usage was related to STAX.
> > >>
> > >> The strace output you provided shows lots of "GET WAIT" requests submitted to the QUEUE service on the local STAX machine. Each STAX job has a STAFQueueMonitor class that in a loop continuously checks for messages on the STAX job handle's queue (e.g. STAF command completion messages and process completion messages) by submitting a QUEUE GET WAIT request.
> > >> If there are no messages, the QUEUE GET WAIT request waits indefinitely for a message to appear on its queue (and this doesn't use up CPU). But, in your case, one or more STAX jobs were constantly getting messages on their queues and processing them.
> > >>
> > >> In STAF V3.3.3, we added support for getting multiple messages off a handle's queue by adding the ALL and FIRST options to the QUEUE service's GET/PEEK requests, as getting multiple messages off a queue can improve performance (over getting a single message at a time off a queue multiple times). So, I'll look into changing STAX to do a QUEUE GET ALL WAIT request and when I get this done, I could send an updated STAX.jar file to you and perhaps you could try it out to see if it improves performance and decreases CPU utilization.
> > >>
> > >> But, if you have some "abusive" STAX jobs that do CPU-intensive things in a loop, then those STAX jobs should be fixed. For example, if in a STAX job someone uses a <loop> element to constantly submit a STAF service request (like a PING request) via a <stafcmd>, there should also be a delay in the loop between STAF service requests so that it isn't using up all the CPU. Anyone can write a STAX job that eats up all the CPU, just like anyone can write a program in any language that eats up all the CPU.
> > >>
> > >> --------------------------------------------------------------
> > >> Sharon Lucas
> > >> IBM Austin, luc...@us.ibm.com
> > >> (512) 286-7313 or Tieline 363-7313
> > >>
> > >>
> > >> *Nathan Parrish <nparr...@clustrix.com>*
> > >> 11/03/2009 07:39 PM
> > >>
> > >> To: Sharon Lucas/Austin/i...@ibmus
> > >> cc: staf-users@lists.sourceforge.net
> > >> Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
> > >>
> > >> after removing/adding the STAX service (and taking the opportunity to switch to a newer JVM), the CPU load has dropped.
> > >>
> > >> before the service restart, strace did show STAFProc as being awfully busy, compared to what I see now. now I see it stopping every second or so, whereas before it spewed at a rate of 5-20 lines a second. here is a sampling of output, not that I expect it will be particularly revealing: http://pastebin.com/m769fe708
> > >>
> > >> thanks again for your help in digging into this matter; perhaps the newer JVM will have fixed the issue, though I suspect it's our tendency towards abusive stax jobs, and we'll see this crop up again.
> > >>
> > >> nathan
> > >>
> > >> ----- "Sharon Lucas" <luc...@us.ibm.com> wrote:
> > >>
> > >> > Right. STAF does not currently ever "reap" threads. If at one time 270 threads were created, then that's how many threads will remain created. We have considered adding the ability to reap threads, but that has not been implemented.
> > >> >
> > >> > However, this should not affect CPU usage. But it does affect memory usage, as when a thread is created, a default thread stack of a size dependent on the operating system is created. On Linux machines, STAF overrides the default thread stack size to 4M if it's > 4M (as it can be 8M or 10M) in STAF V3.2.3 or later.
> > >> > Note that you can further override it by setting the STAF_THREAD_STACK_SIZE environment variable before starting STAFProc. For example, setting STAF_THREAD_STACK_SIZE=4194304 specifies 4M as the value specified is the thread stack size in kilobytes.
> > >> >
> > >> > --------------------------------------------------------------
> > >> > Sharon Lucas
> > >> > IBM Austin, luc...@us.ibm.com
> > >> > (512) 286-7313 or Tieline 363-7313
> > >> >
> > >> >
> > >> > Nathan Parrish <nparr...@clustrix.com>
> > >> > 11/03/2009 03:29 PM
> > >> >
> > >> > To: Sharon Lucas/Austin/i...@ibmus
> > >> > cc: staf-users@lists.sourceforge.net
> > >> > Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
> > >> >
> > >> > right, I understand this. my question is how 'thread growth' works. presumably the delta controls some behavior regarding how many new threads we spin off when the system detects that we are running low/out of threads.
> > >> > also, are threads ever reaped? or if my system was incredibly busy at some time in the past, such that 270 threads were really all going at once, will I have all those threads sitting around even if they are no longer needed? is it possible that the overhead of this many idle threads could be causing my CPU churn?
> > >> >
> > >> > thanks,
> > >> > nathan
> > >> >
> > >> > ----- "Sharon Lucas" <luc...@us.ibm.com> wrote:
> > >> >
> > >> > > No, the STAF MISC service simply shows the initial threads and the thread growth delta. It doesn't show the number of threads that STAF has created.
> > >> > >
> > >> > > --------------------------------------------------------------
> > >> > > Sharon Lucas
> > >> > > IBM Austin, luc...@us.ibm.com
> > >> > > (512) 286-7313 or Tieline 363-7313
> > >> > >
> > >> > >
> > >> > > Nathan Parrish <nparr...@clustrix.com>
> > >> > > 11/03/2009 02:43 PM
> > >> > >
> > >> > > To: Sharon Lucas/Austin/i...@ibmus
> > >> > > cc: staf-users@lists.sourceforge.net
> > >> > > Subject: Re: [staf-users] high CPU utilization by STAFProc and STAX JVM
> > >> > >
> > >> > > Our longevity test has finally finished, so I'm able to poke around a bit deeper, and can look at doing things like restarting the STAX service, or STAFProc itself, later this afternoon.
> > >> > >
> > >> > > I enabled tracing, and I see traffic on the order of maybe 5-20 queries a second, mostly the VAR service (90+% gets), and some semaphore requests as well. I'll get breaks in traffic as long as 3 seconds...
> > >> > >
> > >> > > short of doing gdb, I tried strace -fp on the process, and noticed that it was quite busy, and was also dealing with a very large number of child processes. I believe linux does some weird stuff mixing what constitutes a process vs. a thread; I don't see all these processes with ps, but pstree -p does show them, something like ~270 (not counting the java procs/threads which are also underneath the main STAFProc PID). looking on another machine (which is not running STAX, or servicing tons of variable requests, etc.), it has 82.
> > >> > > my desktop, which does run STAX for testing purposes, looks to have 90.
> > >> > >
> > >> > > from misc settings:
> > >> > > Initial Threads : 10
> > >> > > Thread Growth Delta : 1
> > >> > >
> > >> > > does this suggest that some threads have gotten spun off into never-never land and other threads have been created as a result?
> > >> > >
> > >> > > ----- "Sharon Lucas" <luc...@us.ibm.com> wrote:
> > >> > > > Unfortunately, I didn't really get many clues as to why STAFProc is using up so much CPU from the information you provided. Does the CPU usage for STAFProc constantly stay very high (e.g. above 100%)?
> > >> > > >
> > >> > > > What are your STAX jobs doing? Can you give a description of some of the STAF service requests that they are submitting?
> > >> > > >
> > >> > > > Since STAFProc was started 21 days ago on your STAX service machine, 1,832,518,328 STAF service requests have been submitted to this machine. That's a lot of STAF service requests. Do you have any "rogue" STAX jobs that are constantly submitting STAF service requests in a loop (without a good reason)? What STAF service requests are being submitted the most?
> > >> > > >
> > >> > > > Yes, you can enable STAF tracing to see if that gives any clues about what's driving the CPU load (though note that enabling STAF tracing may slow things down a little). To see what STAF service requests are being submitted and when each STAF service request completes, you could enable trace points ServiceRequest, ServiceComplete, and RemoteRequests. Note that this will generate tons of trace output since lots of STAF service requests are being submitted to this machine, so you'll also want to redirect STAFProc's trace output to a file in a location where there is lots of available disk space and monitor the size of this file. You may also want to enable the Warning tracepoint so that any warning messages are also logged. For example:
> > >> > > >
> > >> > > > STAF staxMachine TRACE SET DESTINATION TO FILE /usr/local/staf/STAFProc.trc
> > >> > > > STAF staxMachine TRACE ENABLE TRACEPOINTS "ServiceRequest ServiceComplete RemoteRequests Warning"
> > >> > > >
> > >> > > > See section "8.18 Trace Service" in the STAF User's Guide for more information.
> > >> > > >
> > >> > > > I don't know if the trace output will help us in determining why STAFProc is using up so much CPU. Maybe there is some thread in STAFProc that is in a bad state and constantly looping for some unknown reason (like the STAFProcessMonitor thread that monitors for processes to complete). The only way I know of to check this would be if STAFProc was started via a debugger like gdb. Then, once it got in this "bad" state of high CPU usage, you could break in via gdb and list threads (info threads) and change to each thread (thread n) and check the backtraces (bt) for each thread to see what they are doing.
> > >> > > > For example:
> > >> > > >
> > >> > > > Use gdb to debug STAF locally as follows:
> > >> > > > 1. gdb STAFProc
> > >> > > > 2. run
> > >> > > > 3. Recreate the problem.
> > >> > > >
> > >> > > > Various commands that you might need while using gdb are:
> > >> > > >
> > >> > > > • help
> > >> > > > • help tracepoints
> > >> > > > • help stack
> > >> > > > • info threads
> > >> > > > • thread n
> > >> > > > • bt
> > >> > > >
> > >> > > > Of course, this will require that STAF be shut down and then restarted using gdb, so you may not be able to do that now while your long runs are still running.
> > >> > > >
> > >> > > > There's no reason that I know of not to use a 1.6.0 JVM with the STAX service, so you can try that instead of upgrading to a more recent 1.5.0 JVM.
> > >> > > >
> > >> > > > You may want to increase the STAX service's MaxFileCacheSize from 20 to something like 50. This won't "fix" the problem of STAFProc using 100%+ CPU, but whenever a STAX job is executed, it first needs to be XML parsed and this is a very CPU-intensive process (so if the CPU usage is already high, it will take longer for STAX to parse a STAX job before execution of the STAX job begins). So, if you are running the same STAX job file more than once, the first time STAX needs to parse it, but then it will cache it so that if the exact same STAX job is submitted to be executed again, it doesn't have to be re-parsed (if it's still in the STAX file cache). So, that's why I recommended increasing the STAX service's MaxFileCacheSize. It can be increased dynamically as follows:
> > >> > > >
> > >> > > > STAF staxMachine STAX SET MAXFILECACHESIZE 50
> > >> > > >
> > >> > > > Note that this setting only applies to this instance of the STAX service. If you shut down and restarted STAFProc, it would no longer apply. You would want to add this setting when registering the STAX service in your STAF.cfg file to make it "permanent".
> > >> > > >
> > >> > > > You may also want to increase the maximum heap size for the STAX JVM from 384m to something like 512m by specifying OPTION J2=-Xmx512m (instead of OPTION J2=-Xmx384m).
> > >> > > >
> > >> > > > SERVICE STAX LIBRARY JSTAF EXECUTE /usr/local/staf/services/stax/STAX.jar \
> > >> > > >     OPTION JVM=/usr/java/jdk1.6.0_16/bin/java OPTION JVMName=STAX \
> > >> > > >     OPTION J2=-Xmx384m PARMS "MAXFILECACHESIZE 50"
> > >> > > >
> > >> > > > --------------------------------------------------------------
> > >> > > > Sharon Lucas
> > >> > > > IBM Austin, luc...@us.ibm.com
> > >> > > > (512) 286-7313 or Tieline 363-7313
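
following up on the gdb suggestion above: rather than restarting STAFProc under gdb, it should also be possible to attach to the already-running process and dump every thread's backtrace in one shot. a rough sketch, assuming gdb can attach on this box (run as root on the STAX machine):

    # grab a backtrace of every STAFProc thread, then detach
    gdb -p $(pidof STAFProc) -batch -ex 'set pagination off' -ex 'thread apply all bt' > stafproc-threads.txt 2>&1

the same command pointed at the STAX JVM pid (2565 above) would show where its 44 LWPs are parked, though jstack from the JDK may give more readable Java stacks for that one.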