Thanks for the observation(s).
  In this case, although the I/O load is transiently high while the
dumping is taking place, I/O contention is not a factor since the apps
have their entire working set pinned in memory, i.e. there is no paging.
It's pretty clear that CPU contention is the culprit: the dumping
itself is running at priority 60, thus starving the others for the
duration.
  I'd like to use DTrace to get more insight, but I can't figure out
where to start since I can't find the place(s) in the kernel where the
dumping is actually taking place.
-d

> -----Original Message-----
> From: Bart Smaalders [mailto:[EMAIL PROTECTED] 
> Sent: Tuesday, April 18, 2006 8:35 PM
> To: Philip Beevers
> Cc: David McDaniel (damcdani); perf-discuss@opensolaris.org
> Subject: Re: [perf-discuss] Means to reduce core-dumping 
> impact on performance critical system?
> 
> Philip Beevers wrote:
> > David McDaniel wrote:
> > 
> >> Not sure if this is really the right board for this, but 
> here goes. 
> >> In a performance critical, highly available application, when a 
> >> misbehaving process core dumps the creation of the corefile 
> >> (gigabytes in size) puts a lot of pressure on the ability 
> to restart 
> >> and recover in a timely fashion.
> >>  
> >>
> > In my experience, this is down to memory pressure or simply the 
> > additional IO load of dumping out such a large core. I've seen 
> > particularly slow core dumping when the system has to swap 
> pages back 
> > in simply to write them out to the core file! Worse still, 
> there's a 
> > reasonable chance such a large core file will run you out 
> of disk space.
> > 
> > Our application deals with this by stopping the dumping of 
> core files 
> > entirely - we do this through ulimit (setting max core file size to 
> > 0), but it can also be done with coreadm. We then have application 
> > code which catches the signals causing core dumps, prints out the 
> > stack (using printstack(3C)) and exits; obviously you have 
> to be very 
> > careful to only use functions which are async signal safe 
> in such a handler.
> > This removes the ability to poke around in the entrails of the core 
> > file, but does give you the key piece of information - where the 
> > process was when it crashed.
> > 
> > This isn't perfect - it would also be worth looking at what coreadm 
> > can give you. For example, I think you can simulate what we 
> do - just 
> > much more simply and reliably - by using coreadm to just 
> specify that 
> > the stack (or perhaps stack and heap, to give you the 
> option of poking 
> > around the entrails after the crash) should be dumped to a 
> file. Our 
> > current approach evolved before coreadm was around, and 
> I've not got 
> > round to revisiting it.
> > 
> > 
> 
> I would think the perceptible performance problem is due to 
> flooding the disk w/ write requests; the read requests for 
> your next page fault end up stuck behind this flood of writes.
> 
> Besides running ZFS, which implements a rather clever IO 
> scheduler in the filesystem to avoid exactly this sort of 
> read starvation, the use of coreadm to put core files onto 
> disks or NFS servers that will cope w/ a flood of IO is a 
> good idea.  This would also help diagnose the actual cause of 
> the problem.
> 
> In general, we strongly encourage ISVs not to disable core 
> dumping as it makes finding that once every 6 month crash 
> very difficult indeed.
> 
> - Bart
> 
> -- 
> Bart Smaalders                        Solaris Kernel Performance
> [EMAIL PROTECTED]             http://blogs.sun.com/barts
> 
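For reference, here is a minimal sketch of the handler approach Philip
describes above: set the core file size limit to zero (the programmatic
equivalent of ulimit -c 0), catch the fatal signals, and keep the
handler to async-signal-safe calls such as write(2), printstack(3C) and
_exit(2). The particular signal list and exit-status convention are
illustrative, not taken from any actual application code.

    /*
     * Sketch: disable core files and log a symbolic stack trace on a
     * fatal signal instead of dumping core.  Solaris printstack(3C)
     * writes the calling thread's stack to the given file descriptor.
     */
    #include <signal.h>
    #include <ucontext.h>           /* printstack(3C) */
    #include <unistd.h>
    #include <sys/resource.h>

    static void
    fatal_handler(int sig)
    {
            static const char msg[] = "fatal signal, stack follows:\n";

            /* only async-signal-safe calls in here */
            (void) write(STDERR_FILENO, msg, sizeof (msg) - 1);
            (void) printstack(STDERR_FILENO);
            _exit(128 + sig);               /* illustrative convention */
    }

    int
    main(void)
    {
            struct rlimit rl = { 0, 0 };    /* core file size limit = 0 */
            struct sigaction sa;
            int sigs[] = { SIGSEGV, SIGBUS, SIGILL, SIGFPE, SIGABRT };
            unsigned i;

            (void) setrlimit(RLIMIT_CORE, &rl);

            sa.sa_handler = fatal_handler;
            sa.sa_flags = 0;
            (void) sigemptyset(&sa.sa_mask);
            for (i = 0; i < sizeof (sigs) / sizeof (sigs[0]); i++)
                    (void) sigaction(sigs[i], &sa, NULL);

            /* ... application code ... */
            return (0);
    }

As noted above, coreadm can achieve much the same thing (e.g. limiting
what gets dumped) without application changes, so this is only one way
to get the "where did it crash" information cheaply.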