On 23 Nov 2017, at 5:28 (-0500), Niclas Rautenhaus wrote:

The symptoms:
I am not yet sure whether all incoming mails are affected or not, but at least sometimes I get the following entry in my mailq (and respectively in the mail.log):
5086760443    35318 Mon Nov 13 16:04:20  u...@externaldomain.tld
(lost connection with 192.168.N.NN[192.168.N.NN] while sending end of data -- message may be sent more than once)
                                         bcc-address@bcc-domain.local

The most likely cause is that your appliance or the network between it and Postfix is behaving badly. Postfix sent the '<CRLF>.<CRLF>' that signals the end of message data. It should then have received a '250' reply from the appliance and followed up with either a QUIT command to initiate the shutdown of the connection or an RSET command before sending another queued message. Instead, the connection was closed. The fact that the problem is happening so late in the process of sending a message eliminates Postfix configuration as a critical cause.
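
To make the end-of-data signal concrete: it is just five bytes in a TCP stream, and TCP makes no promise about where one read ends and the next begins. The following is a minimal Python sketch, not code from Postfix or from any real appliance, of a receiver that finds the signal regardless of how the stream is chunked. The names EndOfDataDetector and naive_sees_end are hypothetical and exist only for this illustration.

```python
END_OF_DATA = b"\r\n.\r\n"  # <CRLF>.<CRLF>


class EndOfDataDetector:
    """Hypothetical helper: spot <CRLF>.<CRLF> in a stream read in arbitrary chunks."""

    def __init__(self) -> None:
        # The CRLF that ends the DATA command counts as the first CRLF of the
        # terminator when there is no message text at all, so start with it.
        self._tail = b"\r\n"

    def feed(self, chunk: bytes) -> bool:
        """Return True once the terminator has been seen in the bytes fed so far."""
        window = self._tail + chunk
        if END_OF_DATA in window:
            return True
        # Keep the last 4 bytes so a terminator split across reads is still found.
        self._tail = window[-(len(END_OF_DATA) - 1):]
        return False


def naive_sees_end(chunks: list[bytes]) -> bool:
    """A broken approach: looks at each chunk in isolation."""
    return any(END_OF_DATA in chunk for chunk in chunks)


def detector_sees_end(chunks: list[bytes]) -> bool:
    """Run the stateful detector over the same chunks."""
    detector = EndOfDataDetector()
    return any(detector.feed(chunk) for chunk in chunks)
```

With the chunks [b"body line\r\n.", b"\r\n"], the per-chunk check never sees the terminator, while the stateful detector does. A receiver that behaves like the per-chunk check would sit waiting for more data until it times out.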

The next steps for troubleshooting this would be:

1. Find ALL of the log lines related to one message that is stuck.

2. If there's a long delay between when Postfix connects to the appliance and the failure, the likely proximate cause of the failure is that the appliance is not detecting the end-of-data signal, and eventually just drops the connection due to a timeout. This can happen for a few different reasons, all involving how the message data and the end-of-data signal are packaged in IP packets. Sometimes this is a problem on the receiving end, with software which can't recognize the end-of-data signal if it is split between two packets. Sometimes it is a Path MTU issue, where something between the sender and receiver can't handle a too-large packet but also can't get ICMP NEED-FRAG messages back to the sender. To differentiate between those issues, you probably need a network capture (using something like tcpdump or Wireshark) of a session that fails. However, if it is simply a Path MTU issue, you may be able to reduce the MTU on the Postfix server (e.g. from the usual 1500 down to 1460 or even 1400) and make the problem vanish entirely. If it is a problem with how the appliance handles an end-of-data signal split between packets, try to get a refund for the device: it's junk.

3. If the failure is fast (<10s) then the problem could be an unwise (and SMTP non-compliant) optimization of session closure by the appliance, which the Postfix server sees as a simple dropped connection. Here again, a network capture of a session that fails would help identify the problem (such as a final packet carrying the 250 reply to the end-of-data signal with the FIN flag set).
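
Steps 1 to 3 can be roughed out in a few lines of Python. This is only a sketch under stated assumptions: that the log uses traditional syslog timestamps such as "Nov 13 16:04:20" with English month names, and that the queue ID appears in each relevant line followed by a colon. The function names are hypothetical, and which two lines actually bracket the delivery attempt is something to judge from your own log.

```python
import re
from datetime import datetime

# Assumption: traditional syslog timestamp at the start of the line, no year.
_STAMP = re.compile(r"^([A-Za-z]{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})")


def lines_for_queue_id(lines, queue_id):
    """Return every log line that mentions the given queue ID as ' ID:'."""
    needle = f" {queue_id}:"
    return [line.rstrip("\n") for line in lines if needle in line]


def _parse_stamp(line):
    """Parse the leading syslog timestamp; raise ValueError if there is none."""
    match = _STAMP.match(line)
    if match is None:
        raise ValueError(f"no syslog timestamp in: {line!r}")
    # The log has no year, so a placeholder leap year is supplied for parsing.
    return datetime.strptime("2000 " + match.group(1), "%Y %b %d %H:%M:%S")


def seconds_between(first_line, last_line):
    """Elapsed seconds between two log lines (ignores year rollover)."""
    return (_parse_stamp(last_line) - _parse_stamp(first_line)).total_seconds()
```

For example, lines_for_queue_id(open("/var/log/mail.log"), "5086760443") collects the lines for the queue ID shown in the mailq output above (the log path is an assumption about your system), and seconds_between() on the lines around the failed attempt tells you whether you are looking at the slow case in step 2 or the fast (<10s) case in step 3.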


--
Bill Cole
b...@scconsult.com or billc...@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Currently Seeking Steady Work: https://linkedin.com/in/billcole
