I have a client/server application secured by certificates on both ends, using OpenSSL 0.9.7c on RedHat 9. Client and server exchange messages consisting of lines of ASCII text using BIO_puts() and BIO_gets(). I call BIO_flush() after each BIO_puts() to ensure that the entire message is pushed out to the other side. The conversation is transaction-oriented: the client makes a request to the server and the server sends back a response. In some cases the server issues one or more queries back to the client before giving its final response. In all cases, the client and the server strictly alternate in sending messages.
I have a version of the client that reads transactions from a file and issues them to the server in order. Depending on how I arrange the transaction order, I can reliably make the client either hang in a BIO_flush() call or run to completion. At the point of the hang, the server is waiting for a line in BIO_gets(). The client has returned from sending a line with BIO_puts() and is wedged in BIO_flush(). If I arrange for NULL crypto to be used, a network trace shows that the client's final message has not been sent. Interestingly, the last packet is a TCP ACK from the client containing no data. I don't know enough about TCP or SSL to tell if this is significant. If I leave the hung client alone for a few minutes, it times out and dies with an "Alarm Clock" error. (I'm apparently not handling SIGALRM.) The server then wakes up, notices that the client has gone away, and listens for more client connections. In this particular setup, client and server are on the same machine communicating over the loopback interface.
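On the "Alarm Clock" death: the default action for SIGALRM is to terminate the process, so installing a handler would at least let the client log where it was when the timer fired. A minimal sketch using plain POSIX calls follows. The names timed_out, on_alarm() and install_alarm_handler() are hypothetical, and the sketch assumes the timeout comes from an alarm() or similar timer set somewhere in the client.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Set by the handler; sig_atomic_t is safe to write from a signal handler. */
static volatile sig_atomic_t timed_out = 0;

static void on_alarm(int signo)
{
    (void)signo;
    timed_out = 1;
}

/* Install the handler without SA_RESTART, so that a blocked read() or
 * write() returns -1 with errno == EINTR instead of being restarted.
 * Returns 0 on success, -1 on failure (as sigaction() does). */
static int install_alarm_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;                    /* deliberately no SA_RESTART */
    return sigaction(SIGALRM, &sa, NULL);
}
```

With this in place the main loop can test timed_out after a failed I/O call and report which transaction was in flight, rather than dying silently.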
I've been working on this software for some time now, and this particular setup of RH9 and OpenSSL 0.9.7c has been stable since December. This is the first time I've seen the software behave this way. Since I've been continuously modifying the software on both ends, I'm reasonably certain that the problem is in my code somewhere. I'm darned if I can put my finger on the error, however.
If I attach a debugger to the hung client, it claims to be in int_malloc() in /lib/tls/libc.so.6. This was called by malloc() in the same library and gdb sees nothing above this on the stack. Very strange. I've strace'd the compiler and linker while building both client and server, and they're clearly picking up my installed copy of 0.9.7c in /usr/local rather than the RedHat version in /usr/lib. The SSL libraries are linked statically, so there's no confusion at runtime about which library to link.
Can anyone suggest what sorts of things might cause the behavior I'm seeing? Is there a way to ask a chain of BIOs about their state of health? I'm querying the number of bytes waiting in the write buffer just before calling BIO_flush(). In all cases, including just before the final hang, there are bytes waiting to be flushed. No error is ever reported before the final hang.
Is this an interesting enough problem? Anybody have any ideas?
Paul Allen

--
Boeing Phantom Works                       \ Paul L. Allen, (425) 865-3297
Math & Computing Technology                \ [EMAIL PROTECTED]
POB 3707 M/S 7L-40, Seattle, WA 98124-2207 \ Prototype Systems Group
______________________________________________________________________
OpenSSL Project                                 http://www.openssl.org
User Support Mailing List                            [EMAIL PROTECTED]
Automated List Manager                               [EMAIL PROTECTED]