Hi List,

I have a very obscure problem with an application that uses O_NONBLOCK but 
still blocks. Over the course of a year of running with hundreds of thousands 
of clients, a worker thread has frozen twice, both times in the last month. 
It's a long story, but I'm pretty sure it's not a deadlock or a spinning event 
loop, primarily because the application recovers after about 20 minutes, when 
a client errors out with ETIMEDOUT. Notably, those 20 minutes match the 
timeout described in the tcp man page [1].

It really looks like a non-blocking socket is still blocking. I found someone 
with a similar problem ([2]), but their understanding of SSL_MODE_AUTO_RETRY 
does not match the documentation.

So, is there indeed any way an application that has SSL_MODE_AUTO_RETRY on 
(which is the default since 1.1.1) can block on a non-blocking socket? Looking 
at the source code, I don't see any calls to fcntl() that remove O_NONBLOCK.
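
To rule this out at runtime rather than by reading source, a check along these 
lines could be logged just before each I/O call. This is only a minimal sketch: 
the helper name fd_is_nonblocking() is made up for illustration, and it uses 
nothing beyond POSIX fcntl().

```c
#include <fcntl.h>

/* Hypothetical helper: reports whether O_NONBLOCK is currently set on fd.
 * Returns 1 if set, 0 if not set, -1 if fcntl() itself fails. */
static int fd_is_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    return (flags & O_NONBLOCK) ? 1 : 0;
}
```

If this ever returned 0 on a socket that was set up as non-blocking, something 
would have cleared the flag after setup.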

My IO method is SSL_read() and SSL_write() with an SSL object given to 
SSL_set_fd().

The only change I make to the default SSL modes is setting 
SSL_MODE_ACCEPT_MOVING_WRITE_BUFFER.

There are two primary deployments of this application, one with OpenSSL 1.1.1 
and one with 3.0.0. Only 1.1.1 has shown this problem, but it may be a 
coincidence.

Side question: is it a problem to call SSL_set_fd() before using fcntl() to 
set O_NONBLOCK on the fd? I ask because the docs say "The BIO and hence the SSL 
engine inherit the behaviour of fd. If fd is non-blocking, the ssl will also 
have non-blocking behaviour.". 'Inherit' may be the key word here; I'm not sure 
at what point that happens.
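
For clarity, the conservative ordering would look like the sketch below. 
set_nonblocking() is a hypothetical helper written for this example, using 
only POSIX fcntl(). Whether the reverse order (SSL_set_fd() first, fcntl() 
afterwards) is equally safe is exactly the open question.

```c
#include <fcntl.h>

/* Hypothetical helper: sets O_NONBLOCK on fd while preserving the other
 * file status flags. Returns 0 on success, -1 if fcntl() fails. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;  /* already non-blocking, nothing to do */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Conservative ordering (sketch): make the fd non-blocking first, and only
 * then hand it to OpenSSL:
 *
 *     if (set_nonblocking(fd) == 0 && SSL_set_fd(ssl, fd) == 1) {
 *         ... proceed with SSL_read() / SSL_write() ...
 *     }
 */
```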

Regards,

Wiebe Cazemier



[1] https://man7.org/linux/man-pages/man7/tcp.7.html
[2] https://github.com/alanxz/rabbitmq-c/issues/586
