Jeroen van Aart:
> Wietse Venema wrote:
> > Jeroen van Aart:
> >> Yes I did:
> >> egrep '(warning|error|fatal|panic):' /var/log/mail.* | grep qmgr
> >> gunzip -c /var/log/mail.*.*.gz | egrep '(warning|error|fatal|panic):' | 
> >> grep qmgr
> > 
> > How much time is between the LAST qmgr[9582] logfile record BEFORE    
> > the master warning? If the distance is large, then you have a
> 
> Aug 15 02:41:20 prod101 postfix/qmgr[9582]: EF86A239E8B: 
> from=<exam...@example.com>, size=8105, nrcpt=1 (queue active)
> 
> Aug 15 02:55:06 prod101 postfix/master[9402]: warning: process 
> /usr/lib/postfix/qmgr pid 9582 exit status 1
> 
> About 14 minutes, I assume that's not a long time?

Actually, that is pretty long by today's standards.

My tiny server has 60-100 qmgr logfile records per hour during the
night, twice that during the day. This machine has only two users.

The gap of roughly 14 minutes suggests that the system was already in
trouble long before the qmgr voluntarily exited with status 1.
When a system is totally hosed, it is unfortunate but understandable
that the syslog datagram with the error message gets lost.
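
For reference, the gap between the two records quoted above can be
computed rather than eyeballed. This is only a sketch: the function
name syslog_gap is made up for illustration, and because classic
syslog timestamps carry no year, the year argument is an arbitrary
placeholder. It also assumes an English/C locale for the month name.

```python
from datetime import datetime

def syslog_gap(earlier, later, year=2000):
    """Return the time between two classic syslog timestamps.

    The year is an assumption: syslog lines such as 'Aug 15 02:41:20'
    do not include one, so both stamps are placed in the same year.
    """
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {earlier}", fmt)
    end = datetime.strptime(f"{year} {later}", fmt)
    return end - start

# Last qmgr[9582] record versus the master warning from the quoted logs.
gap = syslog_gap("Aug 15 02:41:20", "Aug 15 02:55:06")
print(gap)  # 0:13:46
```

On a mail server that normally logs a qmgr record every minute or so,
a silence of that length is itself a sign of trouble.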

I recommend that you update the monitoring process to identify the
process that is gobbling up all the system resources. 

When I managed systems I had a "watcher" (*) cron job that would
alert me about unexpected changes in a daemon's CPU usage, memory
size, number of process instances, and about unexpected changes
in the total system load, disk usage, and more.
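
The general idea can be sketched in a few lines. This is not the
watcher program itself, only a rough illustration of the approach:
keep the previous measurements, take new ones, and report whatever
changed by more than some factor. The function names, the metric
names, and the factor of 2 are all assumptions made for the example.

```python
import os
import shutil

def system_snapshot(path="/"):
    """Collect a few system-wide numbers (Unix only for getloadavg)."""
    load1, _, _ = os.getloadavg()
    usage = shutil.disk_usage(path)
    return {
        "load1": load1,
        "disk_used_pct": 100.0 * usage.used / usage.total,
    }

def unexpected_changes(previous, current, max_ratio=2.0):
    """Return (name, old, new) for metrics that changed by more than max_ratio.

    Metrics with no previous value are skipped, since there is no
    baseline to compare against yet.
    """
    alerts = []
    for name, new in sorted(current.items()):
        old = previous.get(name)
        if old is None:
            continue
        low, high = sorted((old, new))
        # A jump from zero to anything positive counts as a change.
        if high > 0 and (low <= 0 or high / low > max_ratio):
            alerts.append((name, old, new))
    return alerts

# Hypothetical numbers from two consecutive cron runs.
previous = {"qmgr_rss_kb": 4000, "load1": 0.5, "smtpd_count": 3}
current = {"qmgr_rss_kb": 4100, "load1": 6.0, "smtpd_count": 3}
print(unexpected_changes(previous, current))
```

A cron job would save each snapshot to a file, compare it with the
one from the previous run, and mail the differences. Per-daemon CPU,
memory, and instance counts would have to come from something like
ps, which the sketch leaves out.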

        Wietse

(*) ftp://ftp.isc.org/usenet/comp.sources.unix/volume11/watcher/
