Wietse Venema wrote:
Jeroen van Aart:
Yes I did:
egrep '(warning|error|fatal|panic):' /var/log/mail.* | grep qmgr
gunzip -c /var/log/mail.*.*.gz | egrep '(warning|error|fatal|panic):' | grep qmgr

How much time is between the LAST qmgr[9582] logfile record BEFORE the master warning? If the distance is large, then you have a

Aug 15 02:41:20 prod101 postfix/qmgr[9582]: EF86A239E8B: from=<exam...@example.com>, size=8105, nrcpt=1 (queue active)

Aug 15 02:55:06 prod101 postfix/master[9402]: warning: process /usr/lib/postfix/qmgr pid 9582 exit status 1

About 14 minutes. I assume that's not a long time?
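For anyone wanting to measure that gap without doing the arithmetic by hand, here is a rough sketch. The function name qmgr_gap is made up for illustration and is not part of Postfix. It assumes traditional syslog timestamps ("Aug 15 02:41:20") and that both records fall on the same calendar day, so it will give wrong answers across midnight.

```shell
#!/bin/sh
# Hypothetical helper (not a Postfix tool): print the number of seconds
# between the LAST qmgr[PID] log record and the master warning that
# reports that PID's exit status.
# Assumptions: traditional syslog timestamps with HH:MM:SS in field 3,
# and both records logged on the same calendar day.
qmgr_gap() {
    pid="$1"; shift
    awk -v pid="$pid" '
        # Convert HH:MM:SS to seconds since midnight.
        function secs(t,  p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }

        # Remember the time of every qmgr[PID] record until the warning is seen.
        !found && index($0, "postfix/qmgr[" pid "]:") { last = secs($3); seen = 1 }

        # The first master warning that mentions this PID ends the search.
        !found && index($0, "postfix/master[") && index($0, "pid " pid " exit status") {
            warn = secs($3); found = 1
        }

        END { if (found && seen) print warn - last; else exit 1 }
    ' "$@"
}
```

Usage would be something like "qmgr_gap 9582 /var/log/mail.log" (the file name is only an example), or pipe the output of gunzip -c into it for rotated logs. For the two records quoted above it reports 826 seconds, which is the roughly 14 minutes mentioned.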

We have since moved a few services over and restarted a long-running process. That solved most of the issues, although there were still occasional incidents where postfix became unresponsive without actually exiting.

mis-configured Postfix / syslog setup, which is unfortunately common.
To fix logging DO NOT chroot the qmgr in master.cf.

It's never been chrooted.

Once the logging is fixed, we can find out WHY the qmgr exits with
status 1.

Other log entries show processes that segfaulted at times. Judging from all the symptoms, I now believe there is a memory problem that was triggered by the combination of a long-running process and other services causing a regular, but normal, spike in load. If so, there is not much that can be done on the postfix side.

I recommended a thorough overnight check with memtest86(+). Of course this means downtime, so I doubt it's gonna happen anytime soon. So we'll have to wait until the next (stupid) problem.

Thanks for the help and I learned quite a bit about postfix in the process.

Jeroen
