Hi everybody! Three months ago one of my PHP applications started to hang webserver processes. The schema of the application is as follows (it's a webservice):
-Main server, with LINUX, Apache and PHP, handles the requests from the clients. -Once the client is authenticated, and the request is valid, it relays the request to one of the four slave servers (let's call them nodes). -One of the nodes uses PHP4 (works fine), and the other three use LINUX, Apache and PHP 5.1.6 with the following SSL configuration: OpenSSL support => enabled OpenSSL Version => OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008 -The nodes send the request to a remote webservice, and reply it back to the main server. -The main server processes the response and sends back the response to the client. The problem is that, eventually, the PHP5 nodes ends with blocked (sleeping) processes from Apache. Finally the system runs out of resources because all the blocked-sleeping processes have a ESTABLISHED socket. First I noticed that, at least in that version of PHP, the blocking sockets can have a send and recive timeout. Then I switched to non blocking sockets and select. Now the application works fine in the upper level of the communication with the other webservices, but I noticed that it still ends with sleeping processes. I "straced" the Apache and this is what I found (the bound IP and the remote webservice IP are not real; the second one is from google.com): Block 1 - 5503 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 20 5503 bind(20, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("10.6.0.11")}, 16) = 0 5503 fcntl64(20, F_GETFL) = 0x2 (flags O_RDWR) 5503 fcntl64(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0 5503 connect(20, {sa_family=AF_INET, sin_port=htons(443), sin_addr=inet_addr("74.125.127.100")}, 16) = -1 EINPROGRESS (Operation now in progress) 5503 poll([{fd=20, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1370327) = 1 ([{fd=20, revents=POLLOUT}]) Block 2 - 5503 getsockopt(20, SOL_SOCKET, SO_ERROR, [0], [4]) = 0 5503 fcntl64(20, F_SETFL, O_RDWR) = 0 5503 time(NULL) = 1251911098 5503 brk(0xb9e39000) = 0xb9e39000 5503 time(NULL) = 1251911098 5503 write(20, "\200d\1\3\1\0K\0\0\0\20\0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"..., 102) = 102 5503 read(20, "\26\3\1\0J\2\0", 7) = 7 5503 time(NULL) = 1251911098 5503 time(NULL) = 1251911098 5503 read(20, "\0F\3\1J\236\245\273^\v\375\357<\205\223\334n}\226k\232\\\311\300\21Q\256\267\304\257o#"..., 72) = 72 5503 read(20, "\26\3\1\6\252", 5) = 5 5503 read(20, "\v\0\6\246\0\6\243\0\3\2400\202\3\2340\202\3\5\240\3\2\1\2\2\4<\260Eh0\r\6"..., 1706) = 1376 5503 read(20, " 2 CA1\r0\v\6\3U\4\3\23\4CRL10+\6\3U\35\20\4$0\"\200"..., 330) = 330 5503 read(20, "\26\3\1\1^", 5) = 5 5503 read(20, "\r\0\1v\3\...@\1p\0m0k1\v0\t\6\3u\4\6\23\2es1\0170\r\6"..., 350) = 350 5503 write(20, "\26\3\1\0\7\v\0\0\3\0\0\0\26\3\1\0\206\20\0\0\202\0\200\215\30 \tT\312\313\255\210"..., 210) = 210 5503 read(20, "\24\3\1\0\1", 5) = 5 5503 read(20, "\1", 1) = 1 5503 read(20, "\26\3\1\0000", 5) = 5 5503 read(20, "\364\35k\17\25\224\314]|\263Y\307z\310\223\314\210\235h\224t\345\305\373\267\241\377\t\1s\345\250"..., 48) = 48 Block 3 - 5503 fcntl64(20, F_GETFL) = 0x2 (flags O_RDWR) 5503 fcntl64(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0 Block 4 - 5503 select(21, [], [20], [], {44320, 0}) = 1 (out [20], left {44320, 0}) 5503 gettimeofday({1251911098, 885029}, NULL) = 0 5503 write(13, "[Wed Sep 02 19:04:58 2009] [erro"..., 161) = 161 5503 write(20, "\27\3\1\2\240\10\260\325k\v\37{\251\355t_R\274\312\16\354r\315\1\277\204b\342xU\30L"..., 677) = 677 5503 select(21, [20], [], [], {44320, 0} <unfinished ...> 5503 <... select resumed> ) = 1 (in [20], left {44319, 828000}) 5503 read(20, "\27\3\1\0\240", 5) = 5 5503 read(20, "\304\310\273s\370\304e\335\273\1\346\376N\234\376\226\276\0\fG9T\313\235\2&\31\340B=\f\""..., 160) = 160 5503 select(21, [20], [], [], {44320, 0}) = 1 (in [20], left {44320, 0}) 5503 read(20, "\27\3\1\4P", 5) = 5 5503 read(20, "\333`\341\212},\215\3476\370\325\273\"\260\340\234\365\324\324\10a\25v\327\266'+\340\34\1\317\364"..., 1104) = 1104 Block 5 - 5503 write(20, "\25\3\1\0 \305\10\236\5\10\226JS\330\365\204\21\230\246\362\247R\332\220:\210o\2\333\10.\27"..., 37) = 37 5503 brk(0xb9e2d000) = 0xb9e2d000 5503 close(20) = 0 Block 6 - 5503 brk(0xb9e2b000) = 0xb9e2b000 5503 chdir("/") = 0 5503 setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 5503 writev(19, [{"\27\3\1\0\260n?\301\204\263\0341yi\204\v\273\332\335\263\277\335M\t\223\364$\266\214\212c\362"..., 714}], 1) = 714 5503 gettimeofday({1251911099, 58549}, NULL) = 0 5503 write(14, "10.6.0.1 - noone [02/Sep"..., 97) = 97 5503 gettimeofday({1251911099, 58696}, NULL) = 0 5503 times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 369262016 5503 gettimeofday({1251911099, 58799}, NULL) = 0 5503 writev(19, [{"\25\3\1\0 MJi\215B\321Q\2657E\30cw\341n\264o\215\17\237{ej\342n\222\202"..., 37}], 1) = 37 5503 brk(0xb9e2a000) = 0xb9e2a000 5503 shutdown(19, 1 /* send */) = 0 5503 poll([{fd=19, events=POLLIN}], 1, 2000) = 1 ([{fd=19, revents=POLLIN}]) 5503 read(19, "\25\3\1\0 \230\220\315\210n\342zc\201\367H\r'xPt\277\243\346\227\320g\2223;U\352"..., 512) = 37 5503 gettimeofday({1251911099, 59317}, NULL) = 0 5503 gettimeofday({1251911099, 59372}, NULL) = 0 5503 poll([{fd=19, events=POLLIN}], 1, 2000) = 1 ([{fd=19, revents=POLLIN|POLLERR|POLLHUP}]) 5503 read(19, 0xbf81c504, 512) = -1 ECONNRESET (Connection reset by peer) 5503 close(19) = 0 5503 read(7, 0xbf81c713, 1) = -1 EAGAIN (Resource temporarily unavailable) 5503 close(18) = 0 5503 close(8) = 0 5503 close(7) = 0 5503 exit_group(0) = ? As you can see I devided in 6 blocks this piece of strace output. In the first block, the socket it's created, its bound to a network interface, and it's connected to the remote host; fine. But in the block 2 the socket it's switched back to blocking and then starts the SSL communication. In the block 3 the socket it's switched to non blocking (this is the first time that our application have chenged the socket options, because we don't have the stream until the stream_socket_client function returns). Then, in the block 4, the PHP that I wrote starts to work and the send / receive takes place. In the block 5 there is a write without select... why? Well, I think that is another "feature" of the underlaying socket communications, related to the fclose of the stream. At last, the application ends in the block 6 sending the result back to the main server and closing resources. After saying all of this, my question is... why the SSL implementation uses potentially blocking code? The only work around that I found (inside PHP/Apache) for facing this issue is setting a max execution time of the apache processes because the PHP max_execution_time doesn't count the sleeping time. I can't test right now this code in a latest version of PHP, but I'm afraid that the strace output will be the same. Regards, Eloy Bote.