On 23 Nov 2017, at 5:28 (-0500), Niclas Rautenhaus wrote:
The symptoms:
I am not yet sure whether all incoming mails are affected or not, but
at least sometimes I get the following entry in my mailq (and
respectively the mail.log):
5086760443 35318 Mon Nov 13 16:04:20 u...@externaldomain.tld
(lost connection with 192.168.N.NN[192.168.N.NN] while sending end of
data -- message may be sent more than once)
bcc-address@bcc-domain.local
The most likely cause is that your appliance or the network between it
and Postfix is behaving badly. Postfix sent the '<CRLF>.<CRLF>' that
signals the end of message data. It should then have received a '250'
reply from the appliance and followed up with either a QUIT command to
initiate the shutdown of the connection or an RSET command and another
queued message. Instead, the connection was closed. The fact that the
problem is happening so late in the process of sending a message
eliminates Postfix configuration as a critical cause.
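For reference, the end-of-data marker is only five bytes, and a receiver
has to find it no matter how the TCP stream happens to be chunked. Here
is a minimal Python sketch of that idea. The names (DataReader,
naive_sees_end) are made up for illustration, and this is not the
appliance's or Postfix's actual code; it also ignores the corner case of
an empty message body.

```python
END_OF_DATA = b"\r\n.\r\n"  # <CRLF>.<CRLF>


def naive_sees_end(chunk: bytes) -> bool:
    """Broken approach: looks for the marker inside a single read only."""
    return END_OF_DATA in chunk


class DataReader:
    """Buffers DATA-phase bytes so the marker is found across reads."""

    def __init__(self) -> None:
        self._buf = bytearray()
        self.done = False

    def feed(self, chunk: bytes) -> bool:
        # A marker that ends in this chunk can start at most 4 bytes
        # before it, so only that tail of the old buffer is rescanned.
        start = max(0, len(self._buf) - (len(END_OF_DATA) - 1))
        self._buf += chunk
        if self._buf.find(END_OF_DATA, start) != -1:
            self.done = True
        return self.done


if __name__ == "__main__":
    # The marker straddles two reads: "\r\n" ends the first, ".\r\n"
    # arrives in the second.
    chunks = [b"Subject: hi\r\n\r\nbody\r\n", b".\r\n"]
    reader = DataReader()
    for chunk in chunks:
        print(naive_sees_end(chunk), reader.feed(chunk))
```

With the marker straddling the two reads, the per-chunk check never
matches, while DataReader.feed() reports completion on the second chunk.
A receiver that behaves like the naive check would sit waiting until it
times out.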
The next steps for troubleshooting this would be:
1. Find ALL of the log lines related to one message that is stuck.
2. If there's a long delay between when Postfix connects to the
appliance and the failure, the likely proximate cause of the failure is
that the appliance is not detecting the end-of-data signal, and
eventually just drops the connection due to a timeout. This can happen
for a few different reasons, all involving how the message data and the
end-of-data signal are packaged in IP packets. Sometimes this is a
problem on the receiving end, with software which can't recognize the
end-of-data signal if it is split between two packets. Sometimes it is a
Path MTU issue, where something between the sender and receiver can't
handle a too-large packet but also can't get ICMP NEED-FRAG messages
back to the sender. To differentiate between those issues, you probably
need a network capture (using something like tcpdump or Wireshark) of a
session that fails. However, if it is simply a Path MTU issue, you may
be able to reduce the MTU on the Postfix server (e.g. from the usual
1500 down to 1460 or even 1400) and make the problem entirely vanish. If
it is a problem with how the appliance handles an end-of-data signal
split between packets, try to get a refund for the device: it's junk.
3. If the failure is fast (<10s) then the problem could be an unwise
(and SMTP non-compliant) optimization of session closure by the
appliance that the Postfix server sees as a simple dropped connection.
Here again, a network capture of a session that fails would help
identify the problem (such as a final packet with the 250 reply to the
end-of-data signal and with the FIN flag set).
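The steps above can be roughed out in a script. The sketch below is
Python and rests on assumptions: the sample log lines are invented for
illustration (only the queue ID, the size and the error text come from
the report), the function names are hypothetical, and it relies on the
fourth field of the delays=a/b/c/d log attribute being the message
transmission time, as documented for Postfix 2.3 and later. The 10
second threshold is the rough figure from step 3, not a hard rule.

```python
import re

QUEUE_ID = "5086760443"  # queue ID from the mailq output in the report

# Invented log lines, for illustration only; real entries will differ.
SAMPLE_LOG = [
    "Nov 13 15:59:19 mx postfix/qmgr[900]: 5086760443: "
    "from=<sender@example.org>, size=35318, nrcpt=1 (queue active)",
    "Nov 13 16:00:02 mx postfix/smtp[1201]: 4F2A1C0DE1: "
    "to=<other@example.org>, relay=192.0.2.10[192.0.2.10]:25, "
    "delay=0.4, delays=0.1/0/0.1/0.2, dsn=2.0.0, status=sent (250 OK)",
    "Nov 13 16:04:20 mx postfix/smtp[1234]: 5086760443: "
    "to=<user@example.org>, relay=192.0.2.10[192.0.2.10]:25, "
    "delay=301, delays=0.1/0/0.9/300, dsn=4.4.2, status=deferred "
    "(lost connection with 192.0.2.10[192.0.2.10] while sending end of "
    "data -- message may be sent more than once)",
]

DELAYS = re.compile(r"delays=([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)")


def lines_for(queue_id, lines):
    """Step 1: keep every log line that belongs to one queue ID."""
    needle = f": {queue_id}: "
    return [line for line in lines if needle in line]


def transmission_seconds(line):
    """Return the 4th delays= field (transmission time), or None."""
    match = DELAYS.search(line)
    return float(match.group(4)) if match else None


def triage(seconds, fast_limit=10.0):
    """Steps 2 and 3: sort a failure into 'slow', 'fast' or 'unknown'."""
    if seconds is None:
        return "unknown"
    return "fast" if seconds < fast_limit else "slow"


if __name__ == "__main__":
    for line in lines_for(QUEUE_ID, SAMPLE_LOG):
        if "lost connection" in line:
            print(triage(transmission_seconds(line)))
```

A "slow" result points toward the timeout scenario in step 2, and a
"fast" one toward the premature close in step 3. Either way, the script
only narrows things down; the network capture is still what confirms
the cause.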
--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Currently Seeking Steady Work: https://linkedin.com/in/billcole