Wietse Venema wrote:
Jeroen van Aart:
Yes I did:
egrep '(warning|error|fatal|panic):' /var/log/mail.* | grep qmgr
gunzip -c /var/log/mail.*.*.gz | egrep '(warning|error|fatal|panic):' |
grep qmgr
How much time is between the LAST qmgr[9582] logfile record BEFORE
the master warning? If the distance is large, then you have a
Aug 15 02:41:20 prod101 postfix/qmgr[9582]: EF86A239E8B:
from=<exam...@example.com>, size=8105, nrcpt=1 (queue active)
Aug 15 02:55:06 prod101 postfix/master[9402]: warning: process
/usr/lib/postfix/qmgr pid 9582 exit status 1
About 14 minutes. I assume that's not a long time?
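The gap between the two records quoted above can be checked with a few lines of Python. This is only a sketch: the function name `gap_seconds` is made up for illustration, and because syslog timestamps carry no year, an arbitrary placeholder year is added so the strings parse.

```python
from datetime import datetime

# Syslog timestamps have no year. PLACEHOLDER_YEAR is an arbitrary assumption
# used only so the strings parse; it does not affect the difference.
PLACEHOLDER_YEAR = "2000"
FMT = "%Y %b %d %H:%M:%S"


def gap_seconds(earlier, later):
    """Return the number of seconds between two syslog-style timestamps."""
    t1 = datetime.strptime(f"{PLACEHOLDER_YEAR} {earlier}", FMT)
    t2 = datetime.strptime(f"{PLACEHOLDER_YEAR} {later}", FMT)
    return int((t2 - t1).total_seconds())


# Timestamps copied from the qmgr[9582] record and the master warning.
gap = gap_seconds("Aug 15 02:41:20", "Aug 15 02:55:06")
print(f"{gap // 60} min {gap % 60} s")  # prints "13 min 46 s"
```

Both timestamps must fall in the same year for this to be meaningful, and month names are parsed according to the current locale.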
We have since moved a few services over and restarted a long-running
process. This solved most issues, though occasionally Postfix would
still become unresponsive without actually quitting.
mis-configured Postfix / syslog setup, which is unfortunately common.
To fix logging DO NOT chroot the qmgr in master.cf.
It's never been chrooted.
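For anyone who wants to verify this, the chroot setting is the fifth column of the service's line in master.cf. Below is a rough sketch of such a check. The helper name `chroot_setting` and the sample text are hypothetical; the sample follows the usual master.cf column layout and is not this server's actual file.

```python
def chroot_setting(master_cf_text, service="qmgr"):
    """Return the chroot column ('y', 'n' or '-') for a service, or None."""
    for line in master_cf_text.splitlines():
        # Skip blank lines, comments, and continuation lines, which start
        # with whitespace and belong to the preceding service entry.
        if not line.strip() or line.startswith("#") or line[0].isspace():
            continue
        fields = line.split()
        # Columns: service type private unpriv chroot wakeup maxproc command
        if len(fields) >= 8 and fields[0] == service:
            return fields[4]
    return None


# Hypothetical sample in the usual master.cf layout, for illustration only.
sample = """\
# service type  private unpriv  chroot  wakeup  maxproc command + args
qmgr      unix  n       -       n       300     1       qmgr
"""

print(chroot_setting(sample))  # prints "n"
```

A result of `n` means the service is not chrooted. A `-` means the built-in default applies, which as far as I know is `y` before Postfix 3.0 and `n` from 3.0 on, so on older installs a `-` would still need changing to `n`.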
Once the logging is fixed, we can find out WHY the qmgr exits with
status 1.
Other log entries show processes that segfaulted at times. Judging from
all the symptoms, I now believe there is a memory problem that acted up
due to a combination of a long-running process and other services
causing a regular, but normal, spike in load. If so, there is nothing
really that can be done on the Postfix side.
I recommended a thorough overnight check with memtest86(+). Of course
this means downtime, so I doubt it's gonna happen anytime soon. So we'll
have to wait until the next (stupid) problem.
Thanks for the help. I learned quite a bit about Postfix in the process.
Jeroen