Hey postfix-users, we are currently analyzing very strange postfix behavior which I can only describe as lockup or freeze. Honestly we reached our abilities to gather more info and to find the root cause of this issue.
You are my last hope obi wan ... eh Wietse .... --- Setup / Configuration --- We are running three VMware ESX 4.1 guests (multiple VMware hosts) with Debian Lenny amd64. Each machine has 8 virtual CPUs and 4GB of RAM. We are using the Debian lenny postfix package (2.5.5) and now moved up to 2.7.1 which we backported (problems remained unchanged). Each instance of postfix recieves about 200-300 mails/minute. The mail addresses of the incoming emails are checked using a relay_recipient_map (proxymap + mysql) are rewritten (domain is changed) and then relayed using a transport map towards a multi-A record containing two IPs. --- Lockup Issues --- The setup and configuration works like a charm for hours at a time and all of a sudden it stops working leading to two issues (not at the same time): 1) First issue was that suddenly smtp stopped delivering email to that mutli-A record. We noticed a few thousand emails in the active queue (I guess all emails where in the active queue by that time). We rule out problems with the destination servers since the remaining postfix instances still delivered mail during that time. Even the last submission of email from the now locked up postfix finished without issue. There are just no more tries to reach the destination. Postfix also stops to deliver to both destination IPs at the same time. There was no logging anymore, but from anvil giving us some statistics about connection rates. 2) The second issue, occurring more often is that smtpd stopped responding or doing anything acutally. Sometimes the number of processes went to max (500) at other occurences it just stayed at like 30-50 but symptoms where still the same. Looking at netstat all tcp connections to the smtpd processes went away after some time.We believe that clients trying to deliver email to us did disconnect due to reaching their timeout. Here is a sniplet of log from before and after the issue: --- connect before and after lockup --- Oct 29 12:57:35 mailserver postfix/smtpd[12457]: connect from newsletter.xxx.de[xxx.162.xxx.28] Oct 29 13:29:02 mailserver postfix/smtpd[12457]: 826C611FA8: client=newsletter.xxx.de[xxx.162.xxx.28] Oct 29 13:29:03 mailserver postfix/smtpd[12457]: lost connection after RCPT from newsletter.xxx.de[xxx.162.xxx.28] Oct 29 13:29:03 mailserver postfix/smtpd[12457]: disconnect from newsletter.xxx.de[xxx.162.xxx.28] --- / connect before and after lockup --- There is not a single line of log from smtpd processes while being in the locked up state. At 12:57:35 there is a last connect from a client, no more log happens until we run "postfix reload" (! a reload is sufficient) about half an hour later. Then the last few lines show that the smtpd process is "alive" again talking to the now already disconnected client. Also the master process is still there but not accepting anymore connections until we issue "postfix reload". Here is an strace of the master process: --- strace on master-pid while locked up --- 10648 alarm(333) = 307 10648 epoll_wait(13, --- / strace on master-pid while locked up --- Another strange thing is, that some of the last log entries contain "UNKNOWN" instead of the IP: --- smtpd connects towards UNKNOWN --- Oct 29 13:05:02 mailserver postfix/smtpd[14196]: connect from unknown[unknown] Oct 29 13:05:02 mailserver postfix/smtpd[14196]: NOQUEUE: reject: CONNECT from unknown[unknown]: 550 5.7.1 Client host rejected: cannot find your reverse hostname, [unknown]; proto=SMTP Oct 29 13:05:02 mailserver postfix/smtpd[14196]: too many errors after CONNECT from unknown[unknown] Oct 29 13:05:02 mailserver postfix/smtpd[14196]: disconnect from unknown[unknown] --- / smtpd connects towards UNKNOWN --- Anvil also gives statistics about this obscure client-IP "UNKNOWN": --- anvil statistics for UNKNOWN --- Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max connection rate 92/60s for (smtp:unknown) at Oct 29 13:05:02 Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max connection count 1 for (smtp:unknown) at Oct 29 13:05:02 Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max cache size 413 at Oct 29 13:05:02 --- / anvil statistics for UNKNOWN --- During the lock up we still see the *:25 listening socket of the master process. When we tried to connect ourselves we just waited forever for a connection or even a SMTP prompt. New incoming connections are in state "SYN_RECV": --- netstat while locked up--- Proto Recv-Q Send-Q Local Address Foreign Address State tcp 0 0 x.x.x.143:25 xx.0.xx.217:58556 SYN_RECV tcp 0 0 x.x.x.143:25 xx.115.xx.166:46355 SYN_RECV tcp 0 0 x.x.x.143:25 xx.63.xx.197:37216 SYN_RECV . . . ---/ netstat while locked up--- Again a "postfix reload" does fix the problem (for a few hours) and postfix delivers the > 10,000 emails in the active queue in a few seconds towards the two IPs from the DNS round robin and everything goes back to normal for a few hours. I'm hoping for suggestions on what to look for or check, TIA Christian