I have recently been debugging some corner cases in sendmail's use of
OpenSSL's SSL_shutdown call (I ask your forgiveness), and now that I
seem to have it right there, I have decided to look at other mailers
for similar issues.

A discussion with the OpenSSL folks on how to properly shut down a
connection, along with my revised sendmail patch (which has been
submitted to the sendmail maintainers), is at
https://github.com/openssl/openssl/issues/13976

Looking at postfix/src/tls/tls_bio_ops.c, I see the following potential
problems.  Note that I do not run Postfix, nor have I tested a Postfix
patch, so I may not have this _exactly_ right.

What I think happens on a call to tls_bio for an SSL_shutdown:
 - Enter the for(;;) loop.
 - hsfunc (here SSL_shutdown) is called for the first time and returns status = 0.
 - SSL_get_error returns err = SSL_ERROR_ZERO_RETURN.
 - In the switch, case SSL_ERROR_ZERO_RETURN is taken.
 - It falls through to case SSL_ERROR_NONE.
 - It falls through to case SSL_ERROR_SYSCALL.
 - It returns up the stack.

I suspect the calling function is then going to close the connection,
potentially before the 2-way handshake with the peer has completed.
This will result in an unclean shutdown.

1) For SSL_shutdown case SSL_ERROR_ZERO_RETURN should pause and retry.

   As documented by OpenSSL, to complete a 2-way handshake SSL_shutdown
   should be called the first time to initiate the handshake, and will
   return 0.  It should then be retried while SSL_get_error reports
   SSL_ERROR_WANT_READ, until it finally returns 1 to indicate a
   successful 2-way shutdown.

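   I believe that, at minimum, case SSL_ERROR_ZERO_RETURN should do
   something like this:

   case SSL_ERROR_ZERO_RETURN:
      /* Waiting for the 2-way shutdown handshake: pause, then retry. */
      if (hsfunc == SSL_shutdown) {
         usleep(100);   /* 100 microseconds; declared in <unistd.h> */
         continue;      /* restarts the enclosing for(;;) loop */
      }
      /* FALLTHROUGH to the existing handling */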

2) SSL_ERROR_SYSCALL has a special case where it does not indicate a
   failed system call.

   See https://www.openssl.org/docs/man1.1.1/man3/SSL_get_error.html
   in the BUGS section, which states:

   The SSL_ERROR_SYSCALL with errno value of 0 indicates unexpected EOF from
   the peer. This will be properly reported as SSL_ERROR_SSL with reason code
   SSL_R_UNEXPECTED_EOF_WHILE_READING in the OpenSSL 3.0 release because
   it is truly a TLS protocol error to terminate the connection without a
   SSL_shutdown().

   Postfix does not seem to have this special case; instead it assumes
   every SSL_ERROR_SYSCALL is in fact a system call error.  I suspect
   this is largely a cosmetic bug where it reports the wrong error
   message in logs/debugging.

3) Postfix will busy-wait on a 2-way shutdown.

   If a fix like I describe in point #1 were put in place, Postfix would
   then busy wait on a 2-way shutdown.

   With TLS 1.2+ the shutdown sequence has been changed to be a 2-way
   shutdown handshake.  A request is sent to shut down TLS, and the
   far end must respond with an acknowledgement.  In OpenSSL this is
   accomplished by a first call to SSL_shutdown to send the request,
   which returns 0, and then a subsequent call to SSL_shutdown that
   returns 1 to indicate the reply was received.  While waiting for the
   reply, SSL_shutdown returns -1 and SSL_get_error reports
   SSL_ERROR_WANT_READ, meaning it wants to read from the far end.

   When Postfix receives SSL_ERROR_WANT_READ, it checks for a timeout
   (this is good; don't wait forever), but then it simply breaks out of
   the switch and goes back into the for(;;) loop.  The effect is to
   busy-wait for the reply from the far end.  Given that this takes a
   network round trip, e.g. 150 ms from the US to Europe, that could be
   an expensive busy-wait loop.

   I recommend that, at least for SSL_shutdown, SSL_ERROR_WANT_READ
   result in a short delay.  I am testing a usleep of 100 microseconds
   (0.1 ms) in sendmail so that this is not a busy-wait loop.  Some
   small delay may be appropriate for SSL_ERROR_WANT_READ from other
   functions as well, for similar reasons.

A warning for anyone testing modifications on an Internet-facing mail
server: I found a case in sendmail where, when acting as a client, it
does not perform a clean TLS shutdown.  I also believe point #1 means
that Postfix may not be cleanly shutting down all connections.  As a
result, you will quite likely see a significant number of unclean
shutdowns from other mailers.  Currently I am seeing only about 44%
clean shutdowns from remote Internet mailers; the rest are unclean.
That's why I'm on a quest here to see if I can get all mailers to
handle all of the cases correctly and make the Internet a better place.

-- 
Leo Bicknell - bickn...@ufp.org
PGP keys at http://www.ufp.org/~bicknell/
