Hey postfix-users,

we are currently analyzing very strange postfix behavior which I can
only describe as lockup or freeze.
Honestly we reached our abilities to gather more info and to find the
root cause of this issue.

You are my last hope obi wan ... eh Wietse ....


--- Setup / Configuration ---
We are running three VMware ESX 4.1 guests (multiple VMware hosts) with
Debian Lenny amd64.
Each machine has 8 virtual CPUs and 4GB of RAM. We are using the Debian
lenny postfix package (2.5.5) and now moved up to 2.7.1 which we
backported (problems remained unchanged).
Each instance of postfix recieves about 200-300 mails/minute. The mail
addresses of the incoming emails are checked using a relay_recipient_map
(proxymap + mysql) are rewritten (domain is changed) and then relayed
using a transport map towards a multi-A record containing two IPs.


--- Lockup Issues ---
The setup and configuration works like a charm for hours at a time and
all of a sudden it stops working leading to two issues (not at the same
time):

1) First issue was that suddenly smtp stopped delivering email to that
mutli-A record. We noticed a few thousand emails in the active queue (I
guess all emails where in the active queue by that time). We rule out
problems with the destination servers since the remaining postfix
instances still delivered mail during that time. Even the last
submission of email from the now locked up postfix finished without
issue. There are just no more tries to reach the destination. Postfix
also stops to deliver to both destination IPs at the same time. There
was no logging anymore, but from anvil giving us some statistics about
connection rates.

2) The second issue, occurring more often is that smtpd stopped
responding or doing anything acutally. Sometimes the number of processes
went to max (500) at other occurences it just stayed at like 30-50 but
symptoms where still the same.

Looking at netstat all tcp connections to the smtpd processes went away
after some time.We believe that clients trying to deliver email to us
did disconnect due to reaching their timeout.

Here is a sniplet of log from before and after the issue:
--- connect before and after lockup ---
Oct 29 12:57:35 mailserver postfix/smtpd[12457]: connect from
newsletter.xxx.de[xxx.162.xxx.28]
Oct 29 13:29:02 mailserver postfix/smtpd[12457]: 826C611FA8:
client=newsletter.xxx.de[xxx.162.xxx.28]
Oct 29 13:29:03 mailserver postfix/smtpd[12457]: lost connection after
RCPT from newsletter.xxx.de[xxx.162.xxx.28]
Oct 29 13:29:03 mailserver postfix/smtpd[12457]: disconnect from
newsletter.xxx.de[xxx.162.xxx.28]
--- / connect before and after lockup ---
 There is not a single line of log from smtpd processes while being in
the locked up state. At 12:57:35 there is a last connect from a client,
no more log happens until we run "postfix reload" (! a reload is
sufficient) about half an hour later. Then the last few lines show that
the smtpd process is "alive" again talking to the now already
disconnected client.

Also the master process is still there but not accepting anymore
connections until we issue "postfix reload". Here is an strace of the
master process:
--- strace on master-pid while locked up ---
10648 alarm(333)                        = 307
10648 epoll_wait(13,
--- / strace on master-pid while locked up ---


Another strange thing is, that some of the last log entries contain
"UNKNOWN" instead of the IP:
--- smtpd connects towards UNKNOWN ---
Oct 29 13:05:02 mailserver postfix/smtpd[14196]: connect from
unknown[unknown]
Oct 29 13:05:02 mailserver postfix/smtpd[14196]: NOQUEUE: reject:
CONNECT from unknown[unknown]: 550 5.7.1 Client host rejected: cannot
find your reverse hostname, [unknown]; proto=SMTP
Oct 29 13:05:02 mailserver postfix/smtpd[14196]: too many errors after
CONNECT from unknown[unknown]
Oct 29 13:05:02 mailserver postfix/smtpd[14196]: disconnect from
unknown[unknown]
--- / smtpd connects towards UNKNOWN ---

Anvil also gives statistics about this obscure client-IP "UNKNOWN":
--- anvil statistics for UNKNOWN ---
Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max
connection rate 92/60s for (smtp:unknown) at Oct 29 13:05:02
Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max
connection count 1 for (smtp:unknown) at Oct 29 13:05:02
Oct 29 13:14:57 mailserver postfix/anvil[9312]: statistics: max cache
size 413 at Oct 29 13:05:02
--- / anvil statistics for UNKNOWN ---

During the lock up we still see the *:25 listening socket of the master
process. When we tried to connect ourselves we just waited forever for a
connection or even a SMTP prompt.

New incoming connections are in state "SYN_RECV":
--- netstat while locked up---
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 x.x.x.143:25           xx.0.xx.217:58556      SYN_RECV
tcp        0      0 x.x.x.143:25           xx.115.xx.166:46355   SYN_RECV
tcp        0      0 x.x.x.143:25           xx.63.xx.197:37216     SYN_RECV
.
.
.
---/ netstat while locked up---


Again a "postfix reload" does fix the problem (for a few hours) and
postfix delivers the > 10,000 emails in the active queue in a few
seconds towards the two IPs from the DNS round robin and everything goes
back to normal for a few hours.



I'm hoping for suggestions on what to look for or check,
TIA

Christian

Reply via email to