Wietse Venema wrote:
The 15-minute distance suggests that the system was already in
trouble long before the qmgr voluntarily exited with status 1.

True. The problems only occurred once the load began blocking processes for extended periods. I had to run one of the monitors, which writes the output of "top" and "free" to a file, at nice -5 to get any useful results at the highest loads.

When a system is totally hosed, it is unfortunate but understandable
that the syslog datagram with the error message gets lost.

Yes, I understand. I also didn't trust the timestamps much while the system was under such high load.

I recommend that you update the monitoring process to identify the
process that is gobbling up all the system resources.

The above-mentioned monitor actually did that. Once it managed to produce useful output at nice -5, the offending (java) process showed 1500+% CPU usage in top. That is interesting because the server has "only" 4 CPU cores (and 32 GB of RAM).
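
For anyone wanting something similar, here is a rough sketch of that kind of monitor. It is not the script I actually ran; it assumes a Linux procps "ps" that accepts "-eo pid,pcpu,comm", and the names parse_ps, snapshot, log_snapshot and the log path are made up for illustration. Note that ps reports CPU time averaged over the lifetime of the process, so the numbers will not match the per-interval figures that top shows.

```python
import subprocess
import time


def parse_ps(output, limit=5):
    """Return the `limit` busiest processes as (cpu_percent, pid, command)."""
    rows = []
    for line in output.strip().splitlines()[1:]:  # skip the header line
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue
        pid, cpu, command = fields
        try:
            rows.append((float(cpu), int(pid), command.strip()))
        except ValueError:
            continue  # ignore lines that do not parse as numbers
    rows.sort(reverse=True)  # highest CPU percentage first
    return rows[:limit]


def snapshot(limit=5):
    """Run ps once and return the busiest processes (assumes procps ps)."""
    result = subprocess.run(
        ["ps", "-eo", "pid,pcpu,comm"],
        capture_output=True, text=True, check=True,
    )
    return parse_ps(result.stdout, limit)


def log_snapshot(path="monitor.log", limit=5):
    """Append one timestamped snapshot to `path` (hypothetical log file)."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    with open(path, "a") as fh:
        for cpu, pid, command in snapshot(limit):
            fh.write(f"{stamp} pid={pid} cpu={cpu:.1f}% {command}\n")
```

Calling log_snapshot() from a cron job, at a priority that still gets it scheduled under load, would give one short record per run to compare against the mail log.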

When I managed systems I had a "watcher" (*) cron job that would
(*) ftp://ftp.isc.org/usenet/comp.sources.unix/volume11/watcher/

Thanks, I'm checking it out. Compiling it on my Debian system will be a bit of a challenge.

Greetings,
Jeroen
