On Sunday, 6 November 2005 22:13, Yedidyah Bar-David wrote:
> On Sun, Nov 06, 2005 at 01:05:39PM +0200, Oded Arbel wrote:
> > Hi list.
> >
> > I have a problem with a P4 (hyper-threaded) powered server. It
> > constantly has a load average of 2.something, while looking with
> > top I don't see any process actually taking all that CPU resource.
> >
> > The server is mostly used to run the Nagios monitor and some Java
> > daemons. Tomcat is running and takes about 1.2GB of virtual memory,
> > which is about 60% of all memory, but it sees absolutely no usage
> > and uses less than 5% of real memory. Two other Java services and
> > MySQL together grab another 600MB of virtual memory, and everything
> > else is mostly scripts that use negligible amounts of VIRT, RES and
> > CPU.
>
> Maybe one of the scripts/daemons has a loop of quite short delays?
> Testing this isn't very easy - you can either strace some of the
> suspects or try something like syscalltrack.

Thanks

The server doesn't run a lot of processes (or shouldn't anyway). I
removed everything I didn't absolutely need and straced all the other
non-kernel processes, and found nothing interesting.

Then I started removing processes until I got to the culprit - the Java
program that implements the services provided by the server.
Of course, I did the testing during off-peak hours so that there would
be no disruption of service to our client. At that time there was
absolutely no activity on any of the services, so the only thing the
Java program was supposed to do was call wait() (a Java thread
synchronization call) every second. I verified this by stracing the
Java process; here is the output:

futex(0x4d907b60, FUTEX_WAIT, 233, {0, 265545000}) = -1 ETIMEDOUT (Connection timed out)
futex(0x805d33c, FUTEX_WAKE, 1)         = 0
gettimeofday({1131445296, 417683}, NULL) = 0
clock_gettime(0, {1131445296, 417799000}) = 0
futex(0x4d907b60, FUTEX_WAIT, 234, {0, 499884000}) = -1 ETIMEDOUT (Connection timed out)
futex(0x805d33c, FUTEX_WAKE, 1)         = 0
gettimeofday({1131445296, 918529}, NULL) = 0
clock_gettime(0, {1131445296, 918646000}) = 0
futex(0x4d907b60, FUTEX_WAIT, 235, {0, 499883000}) = -1 ETIMEDOUT (Connection timed out)
futex(0x805d33c, FUTEX_WAKE, 1)         = 0
gettimeofday({1131445297, 419424}, NULL) = 0
clock_gettime(0, {1131445297, 419540000}) = 0
futex(0x4d907b60, FUTEX_WAIT, 236, {0, 499884000}) = -1 ETIMEDOUT (Connection timed out)
futex(0x805d33c, FUTEX_WAKE, 1)         = 0
gettimeofday({1131445297, 920319}, NULL) = 0
clock_gettime(0, {1131445297, 920436000}) = 0
...
and so on and so forth
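
As a quick sanity check on the trace above, the gaps between the
gettimeofday() stamps can be computed directly. This is only a rough
sketch in Python (not part of the Java program); the names TRACE, STAMP
and wakeup_intervals_us are made up for illustration. It suggests the
traced thread wakes up roughly twice a second, which by itself should
cost next to no CPU.

```python
import re

# gettimeofday() lines copied from the strace output above.
TRACE = """\
gettimeofday({1131445296, 417683}, NULL) = 0
gettimeofday({1131445296, 918529}, NULL) = 0
gettimeofday({1131445297, 419424}, NULL) = 0
gettimeofday({1131445297, 920319}, NULL) = 0
"""

# Captures the seconds and microseconds of each gettimeofday() result.
STAMP = re.compile(r"gettimeofday\(\{(\d+), (\d+)\}")


def wakeup_intervals_us(trace):
    """Return the gaps between consecutive stamps, in microseconds."""
    stamps = [int(sec) * 1_000_000 + int(usec)
              for sec, usec in STAMP.findall(trace)]
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


intervals = wakeup_intervals_us(TRACE)
print(intervals)  # [500846, 500895, 500895]
```

Each gap is about 0.5 seconds, matching the {0, 4998xx000} timeouts
passed to FUTEX_WAIT.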

Nonetheless, stopping the program resulted in an immediate drop of CPU
usage to just about 0%, and starting it again put it back at 100%.

Someone suggested that these are two separate issues:

1) procps is broken and misreports CPU usage of processes.
- I'm currently using procps 3.1.15 on a Mandrake 10.0 official vanilla
kernel 2.6.3-7mdk-p3-smp-64GB. I looked through the procps changelog
and didn't find anything that sounds related between the
above-mentioned version and the current one.

2) Java is eating up all CPU power.
I'm using Sun's J2RE 1.4.2_06 packaged by Mandriva. From the strace of
the process I can't see how it can consume all the CPU. Another Java
program running on the same machine, which runs a much smaller
subset of the exact same code as the offending process - and with a
similar strace - does not show this behavior.
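
One way to check both suspicions without going through procps is to
read the per-thread CPU counters straight from /proc. The following is
a minimal Python sketch, assuming the kernel exposes
/proc/<pid>/task/<tid>/stat (2.6 kernels with NPTL threads should); the
function names are hypothetical, not part of any existing tool.

```python
import os


def parse_stat_cpu_ticks(stat_line):
    """Return (comm, utime, stime) from one /proc/<pid>/task/<tid>/stat line."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the LAST closing parenthesis.
    head, _, tail = stat_line.rpartition(")")
    comm = head.split("(", 1)[1]
    fields = tail.split()
    # tail starts at field 3 (state); utime and stime are fields 14 and 15.
    utime, stime = int(fields[11]), int(fields[12])
    return comm, utime, stime


def thread_cpu_ticks(pid):
    """Map each thread id of `pid` to its (utime, stime) tick counters."""
    result = {}
    task_dir = "/proc/%d/task" % pid
    for tid in os.listdir(task_dir):
        try:
            with open(os.path.join(task_dir, tid, "stat")) as f:
                _, utime, stime = parse_stat_cpu_ticks(f.read())
        except OSError:
            continue  # thread exited between listdir() and open()
        result[int(tid)] = (utime, stime)
    return result
```

Calling thread_cpu_ticks() on the Java process twice, a few seconds
apart, and comparing the counters would show which thread (if any) is
accumulating ticks. A thread gaining about os.sysconf("SC_CLK_TCK")
ticks per second is using a full CPU. Note that strace -p without -f
follows only the thread it attaches to, so a busy thread elsewhere in
the JVM would not appear in a trace like the one above.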

I suspect the futex() calls in the above trace - AFAIK futex stands for
"fast userspace mutex", but I don't understand enough about them to
guess why the process behaves that way.
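
For what it's worth, a timed wait that nobody wakes up should be cheap.
Below is a rough Python analogue of a wait-with-timeout loop (not the
Java code in question; idle_wait_loop is a made-up name). On Linux,
timed waits in threading libraries typically end up in
futex(FUTEX_WAIT) with a timeout, and ETIMEDOUT is simply the normal
result when the timeout expires first ("Connection timed out" is just
the generic error text for that code).

```python
import threading
import time


def idle_wait_loop(iterations=5, timeout=0.1):
    """Wait on a condition nobody signals; return (wall_seconds, cpu_seconds)."""
    cond = threading.Condition()
    wall_start = time.monotonic()
    cpu_start = time.process_time()
    with cond:
        for _ in range(iterations):
            # Nothing ever notifies, so each call sleeps until the timeout
            # expires, like the FUTEX_WAIT ... ETIMEDOUT lines in the trace.
            cond.wait(timeout)
    return time.monotonic() - wall_start, time.process_time() - cpu_start


wall, cpu = idle_wait_loop()
print("wall %.3fs, cpu %.3fs" % (wall, cpu))
```

The CPU time should come out as a small fraction of the wall time. If
that also holds for the traced Java thread, the futex() calls shown are
unlikely to be where the CPU goes.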

-- 
Oded

::..
"The truth is more important than the facts."
        -- Frank Lloyd Wright
