I have a client/server application secured by certificates on both
ends using OpenSSL 0.9.7c on RedHat 9.  Client and server exchange
messages consisting of lines of ASCII text using BIO_puts() and
BIO_gets(). I include a call to BIO_flush() after each BIO_puts() in
order to ensure that the entire message gets flushed to the other
side.  The conversation between client and server is transaction-
oriented in that the client makes a request to the server and the
server sends back a response.  In some cases the server might issue
one or more queries back to the client before giving its final response.
In all cases, the client and the server strictly alternate in sending
messages.

I have a version of the client that reads transactions from a file
and issues them to the server in order.  If I re-arrange the transaction
order, I can reliably cause the client side to either hang in a
BIO_flush() call or run to completion.  At the point of the hang, the
server is waiting for a line in BIO_gets().  The client has returned
from sending a line with BIO_puts() and is wedged in BIO_flush().  If
I arrange for NULL crypto to be used, a network trace shows that the
client's final message has not been sent.  Interestingly, the last
packet is a TCP ACK from the client containing no data.  I don't know
enough about TCP or SSL to tell if this is significant.  When the
client hangs, if I leave it alone for a few minutes it will time
out and die with an "Alarm Clock" error.  (I'm apparently not handling
SIGALRM.)  The server then wakes up, notices that the client has gone
away, and listens for more client connections.  In this particular
setup, client and server are on the same machine communicating over
the loopback interface.
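
As a sketch of what handling SIGALRM could look like (an assumption about a possible approach, not something the client currently does), a handler installed with sigaction() and without SA_RESTART only sets a flag. A blocked read() or write() then fails with EINTR instead of the process dying with the default "Alarm clock" action. The names alarm_fired, on_alarm() and install_alarm_handler() are hypothetical:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

/* Set by the handler; checked by the main code after an interrupted call. */
static volatile sig_atomic_t alarm_fired = 0;

static void on_alarm(int signo)
{
    (void)signo;
    alarm_fired = 1;    /* only async-signal-safe work belongs here */
}

/* Hypothetical helper: returns 0 on success, -1 on failure (as sigaction). */
int install_alarm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;    /* no SA_RESTART: blocked system calls return EINTR */
    return sigaction(SIGALRM, &sa, NULL);
}
```

This would not explain the hang. It would only let the client notice the timeout and log where it was before exiting.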

I've been working on this software for some time now, and this
particular setup of RH9 and OpenSSL 0.9.7c has been stable since
December.  This is the first time I've seen the software behave this
way.  Since I've been continuously modifying the software on both
ends, I'm reasonably certain that the problem is in my code somewhere.
I'm darned if I can put my finger on the error, however.

If I attach a debugger to the hung client, it claims to be in
int_malloc() in /lib/tls/libc.so.6.  This was called by malloc()
in the same library and gdb sees nothing above this on the stack.
Very strange.  I've strace'd the compiler and linker while building
both client and server, and they're clearly picking up my installed
copy of 0.9.7c in /usr/local rather than the RedHat version in /usr/lib.
The SSL libraries are linked statically, so there's no confusion at
runtime about which library to link.

I'm wondering if anyone can suggest what sorts of things might cause
the behavior I'm seeing?  Is there a way to ask a chain of BIOs about
their state of health?  I'm querying the number of bytes waiting in
the write buffer just before calling BIO_flush().  In all cases,
including just before the final hang, there are bytes waiting to be
flushed.  No error is ever reported until the final hang.

Is this an interesting enough problem? Anybody have any ideas?

Paul Allen
--
Boeing Phantom Works                   \ Paul L. Allen, (425) 865-3297
Math & Computing Technology              \ [EMAIL PROTECTED]
POB 3707 M/S 7L-40, Seattle, WA 98124-2207 \ Prototype Systems Group

______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
User Support Mailing List                    [EMAIL PROTECTED]
Automated List Manager                           [EMAIL PROTECTED]