Mads Toftum wrote:
On Sun, Mar 08, 2009 at 04:48:26PM -0500, Bob Friesenhahn wrote:
Defunct processes are due to the parent process not doing a wait(3C) or
waitpid(3C) call for the process ID of the child. Unless the parent
process has the signal handling for SIGCHLD set to SIG_IGN, each child
process remains in the process table until the parent process has invoked
waitpid(3C) to obtain its exit status.
A large number of defunct processes indicates either that the parent
process went away (e.g. crashed) or that the parent process is not properly
designed/implemented to execute waitpid(3C) for each of the child processes
that it starts. If the parent process goes away, then the child process
becomes owned by 'init' (PID 1).
I should have added that the parent doesn't die; it sticks around and
usually gets around to cleaning up pretty fast.
After I killed the parent, it took about 2 minutes to get rid of the
defuncts, and after starting it up again it took a couple of minutes to
build back up to about 50 running procs and 700 defuncts.
Broken code? Possibly, but unfortunately not my own.
Here are some plausible hypotheses for you to consider ...
Perhaps all those stat() calls are to a relatively slow filesystem,
such as NFS? If so, I'd imagine the rate at which the server code
can dutifully clean up and recycle each completed thread is being
throttled by that serialized work, per Amdahl's Law. If the
location of those .file files is a parameter, you might want to try
putting them in /tmp (or /var/tmp if persistence across reboots is
required).
The rate at which threads can be recycled also depends on the
network latency between the server and its clients. A simple
close() operation on a socket takes at least one round trip.
Failure to consider this is the single most common cause of
things that "worked well in pre-production stress-testing" with
LAN latencies failing to scale on the real-world Internet.
I've seen getcwd(3) calls be painful over NFS. They use lots
of stat() calls to tell the program what it should already know --
where its last successful chdir(2) left it! When it's an issue, the
pain will be proportional to the depth of the cwd. You might
check nfsstat for signs.
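
If getcwd(3) does turn out to be the problem, one possible workaround
is for the program to remember where its last chdir(2) left it. The
sketch below is an assumption about how that could look, not how the
server in question works; cached_chdir and cached_cwd are hypothetical
names, and the cache is only correct if every directory change in the
process goes through cached_chdir().

```c
#include <limits.h>
#include <stddef.h>
#include <unistd.h>

#ifndef PATH_MAX
#define PATH_MAX 4096           /* fallback if the system does not define it */
#endif

static char cwd_cache[PATH_MAX];
static int cwd_cache_valid = 0;

/*
 * Change directory and refresh the cached path with a single getcwd()
 * call. Returns 0 on success, -1 if chdir() fails (the cache is then
 * left untouched, because the working directory did not change).
 */
int cached_chdir(const char *path)
{
    if (chdir(path) != 0)
        return -1;
    cwd_cache_valid = (getcwd(cwd_cache, sizeof cwd_cache) != NULL);
    return 0;
}

/*
 * Return the cached working directory, calling getcwd() only when the
 * cache has not been filled yet. Returns NULL if the path is unavailable.
 */
const char *cached_cwd(void)
{
    if (!cwd_cache_valid) {
        if (getcwd(cwd_cache, sizeof cwd_cache) == NULL)
            return NULL;
        cwd_cache_valid = 1;
    }
    return cwd_cache;
}
```

This trades one getcwd() per directory change for none per lookup. It is
not thread-safe as written, and it will go stale if some library calls
chdir() or fchdir() behind the program's back.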
Finally, perhaps filesystem namespace operations are slow due
to DNLC (directory name lookup cache) issues, such as the DNLC
being disabled? kstat has good counters for such things.
HTH,
-- Bob
vh
Mads Toftum
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org