Philip Beevers wrote:
David McDaniel wrote:

Not sure if this is really the right board for this, but here goes. In a performance-critical, highly available application, when a misbehaving process core dumps, the creation of the core file (gigabytes in size) puts a lot of pressure on the ability to restart and recover in a timely fashion.
In my experience, this is down to memory pressure or simply the additional IO load of dumping out such a large core. I've seen particularly slow core dumping when the system has to swap pages back in simply to write them out to the core file! Worse still, there's a reasonable chance such a large core file will run you out of disk space.

Our application deals with this by stopping the dumping of core files entirely - we do this through ulimit (setting the max core file size to 0), but it can also be done with coreadm. We then have application code which catches the signals that cause core dumps, prints out the stack (using printstack(3C)) and exits; obviously you have to be very careful to use only functions which are async-signal-safe in such a handler. This removes the ability to poke around in the entrails of the core file, but does give you the key piece of information - where the process was when it crashed.
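In outline, the handler looks something like this - a minimal
sketch rather than our production code (the function names and the
exact signal list are illustrative, and real code needs more care
around threads and error handling):

#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <ucontext.h>	/* printstack(3C) is declared here on Solaris */
#include <unistd.h>

static void
crash_handler(int sig)
{
	/*
	 * printstack(3C) and _exit(2) are async-signal-safe; don't be
	 * tempted to call printf(3C), malloc(3C) or the like in here.
	 */
	(void) printstack(STDERR_FILENO);
	_exit(128 + sig);
}

static void
install_handlers(void)
{
	static const int sigs[] = { SIGSEGV, SIGBUS, SIGILL, SIGFPE, SIGABRT };
	struct rlimit rl = { 0, 0 };	/* rlim_cur = rlim_max = 0 */
	struct sigaction sa;
	unsigned int i;

	/* The equivalent of "ulimit -c 0": no core file is written. */
	(void) setrlimit(RLIMIT_CORE, &rl);

	(void) memset(&sa, 0, sizeof (sa));
	sa.sa_handler = crash_handler;
	(void) sigemptyset(&sa.sa_mask);
	sa.sa_flags = SA_RESETHAND;	/* don't recurse if the handler faults */

	for (i = 0; i < sizeof (sigs) / sizeof (sigs[0]); i++)
		(void) sigaction(sigs[i], &sa, NULL);
}

int
main(void)
{
	install_handlers();
	abort();	/* demo: trigger the handler via SIGABRT */
	return (0);
}

SA_RESETHAND is there so that if the handler itself faults, the
second fault just kills the process with the default disposition
instead of looping forever.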

This isn't perfect - it would also be worth looking at what coreadm can give you. For example, I think you can simulate what we do - much more simply and reliably - by using coreadm to specify that just the stack (or perhaps the stack and heap, to give you the option of poking around in the entrails after the crash) should be dumped to a file. Our current approach evolved before coreadm was around, and I haven't got round to revisiting it.
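If memory serves, the content-selection options arrived with
Solaris 10; something like this from the shell that launches the
application should do it (untested on my part - the tokens are
from coreadm(1M), and the per-process settings are inherited by
child processes):

coreadm -P stack $$          # dump only the stack
coreadm -P stack+heap $$     # or stack plus heap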



I would think the perceptible performance problem is due to
flooding the disk w/ write requests; the read requests for your
next page fault end up stuck behind this flood of writes.

Besides running ZFS, which implements a rather clever IO
scheduler in the filesystem to avoid exactly this sort of
read starvation, the use of coreadm to put core files onto
disks or NFS servers that will cope w/ a flood of IO is
a good idea.  This would also help diagnose the
actual cause of the problem.
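For example, something along these lines as root (where /cores
stands in for whatever dedicated filesystem or NFS mount you've
set aside for the purpose):

coreadm -g /cores/core.%f.%p -e global

%f expands to the executable name and %p to the pid, so all cores
land in one predictable, well-provisioned place rather than the
crashing process's current directory. You can then use
"coreadm -d process" if you also want to turn off the per-process
core in the cwd.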

In general, we strongly encourage ISVs not to disable core
dumping, as it makes finding that once-every-six-months crash
very difficult indeed.

- Bart

--
Bart Smaalders                  Solaris Kernel Performance
[EMAIL PROTECTED]               http://blogs.sun.com/barts