Philip Beevers wrote:
David McDaniel wrote:
Not sure if this is really the right board for this, but here goes. In
a performance-critical, highly available application, when a
misbehaving process core dumps, the creation of the core file
(gigabytes in size) puts a lot of pressure on the ability to restart
and recover in a timely fashion.
In my experience, this is down to memory pressure or simply the
additional IO load of dumping out such a large core. I've seen
particularly slow core dumping when the system has to swap pages back in
simply to write them out to the core file! Worse still, there's a
reasonable chance such a large core file will run you out of disk space.
Our application deals with this by stopping the dumping of core files
entirely - we do this through ulimit (setting the maximum core file
size to 0), but it can also be done with coreadm. We then have
application code which catches the signals that would cause a core
dump, prints out the stack (using printstack(3C)) and exits; obviously
you have to be very careful to use only functions which are
async-signal-safe in such a handler.
This removes the ability to poke around in the entrails of the core
file, but does give you the key piece of information - where the process
was when it crashed.
This isn't perfect - it would also be worth looking at what coreadm
can give you. For example, I think you can simulate what we do - much
more simply and reliably - by using coreadm to specify that only the
stack (or perhaps stack and heap, to give you the option of poking
around the entrails after the crash) should be dumped to a file. Our
current approach evolved before coreadm was around, and I've not got
round to revisiting it.
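For the record, a sketch of what that coreadm(1M) invocation might look
like (Solaris-only; the content tokens are from coreadm's documented
set, and `$$` here just stands in for the target process ID):

```shell
# Limit per-process core file content to the stack only:
coreadm -P stack $$

# ...or stack plus heap, to allow postmortem poking around:
coreadm -P stack+heap $$

# Show the current settings for the process:
coreadm $$
```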
I would think the perceptible performance problem is due to
flooding the disk w/ write requests; the read requests for your
next page fault end up stuck behind this flood of writes.
Besides running ZFS, which implements a rather clever IO
scheduler in the filesystem to avoid exactly this sort of
read starvation, the use of coreadm to put core files onto
disks or NFS servers that will cope w/ a flood of IO is
a good idea. This would also help diagnose the
actual cause of the problem.
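A sketch of that redirection with coreadm (Solaris-only; /var/cores is
an example path - in practice it would be a dedicated ZFS dataset or
NFS mount that can absorb the write flood):

```shell
# Enable global core dumps and name them by program (%f) and PID (%p):
coreadm -e global
coreadm -g /var/cores/core.%f.%p
```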
In general, we strongly encourage ISVs not to disable core
dumping, as it makes finding that once-every-six-months crash
very difficult indeed.
- Bart
--
Bart Smaalders Solaris Kernel Performance
[EMAIL PROTECTED] http://blogs.sun.com/barts
_______________________________________________
perf-discuss mailing list
perf-discuss@opensolaris.org