Jeroen van Aart:
> Wietse Venema wrote:
> > Jeroen van Aart:
> >> Yes I did:
> >> egrep '(warning|error|fatal|panic):' /var/log/mail.* | grep qmgr
> >> gunzip -c /var/log/mail.*.*.gz | egrep '(warning|error|fatal|panic):' |
> >> grep qmgr
> >
> > How much time is between the LAST qmgr[9582] logfile record BEFORE
> > the master warning? If the distance is large, then you have a
>
> Aug 15 02:41:20 prod101 postfix/qmgr[9582]: EF86A239E8B:
> from=<exam...@example.com>, size=8105, nrcpt=1 (queue active)
>
> Aug 15 02:55:06 prod101 postfix/master[9402]: warning: process
> /usr/lib/postfix/qmgr pid 9582 exit status 1
>
> About 14 minutes, I assume that's not a long time?

Actually, that is pretty long by today's standards. My tiny server
has 60-100 qmgr logfile records per hour during the night, twice
that during the day. This machine has only two users.

The 15-minute distance suggests that the system was already in
trouble long before the qmgr voluntarily exited with status 1.

When a system is totally hosed, it is unfortunate but understandable
that the syslog datagram with the error message gets lost.

I recommend that you update the monitoring process to identify the
process that is gobbling up all the system resources.

When I managed systems I had a "watcher" (*) cron job that would
alert me about unexpected changes in a daemon's CPU usage, memory
size, number of process instances, and about unexpected changes in
the total system load, disk usage, and more.

	Wietse

(*) ftp://ftp.isc.org/usenet/comp.sources.unix/volume11/watcher/
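
The gap between the last qmgr record and the master warning can also
be measured mechanically instead of by eye. What follows is only a
minimal sketch in Python, not part of Postfix or of the watcher
program. The function names parse_syslog_time and gap_before_warning
are made up for illustration, and the code assumes classic syslog
timestamps ("Aug 15 02:41:20"), an English/C locale for month names,
one record per line (unwrapped, unlike the quoted mail above), and
records that do not cross a year boundary.

```python
from datetime import datetime

def parse_syslog_time(line, year=2000):
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Classic syslog stamps carry no year, so an arbitrary (leap) year
    is assumed; only differences between stamps are meaningful.
    """
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def gap_before_warning(lines, pid):
    """Return the time between the last qmgr[pid] record and the
    master warning about that pid, or None if either is missing."""
    last_qmgr = None
    for line in lines:
        if f"postfix/qmgr[{pid}]:" in line:
            # Remember the most recent record logged by this qmgr.
            last_qmgr = parse_syslog_time(line)
        elif ("postfix/master[" in line
              and "warning: process" in line
              and f"pid {pid} " in line):
            if last_qmgr is None:
                return None
            return parse_syslog_time(line) - last_qmgr
    return None

if __name__ == "__main__":
    # The two records quoted in this thread, joined onto one line each.
    sample = [
        "Aug 15 02:41:20 prod101 postfix/qmgr[9582]: EF86A239E8B: "
        "from=<exam...@example.com>, size=8105, nrcpt=1 (queue active)",
        "Aug 15 02:55:06 prod101 postfix/master[9402]: warning: process "
        "/usr/lib/postfix/qmgr pid 9582 exit status 1",
    ]
    print(gap_before_warning(sample, 9582))  # prints 0:13:46
```

For the two records in this thread the result is 13 minutes and 46
seconds, which matches the "about 14 minutes" estimate. On a real
system the lines would come from the mail logfile rather than from a
hard-coded list.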