Hi everybody!

Three months ago one of my PHP applications started to hang webserver
processes. The schema of the application is as follows (it's a webservice):

-Main server, with LINUX, Apache and PHP, handles the requests from the
clients.
-Once the client is authenticated, and the request is valid, it relays the
request to one of the four slave servers (let's call them nodes).
-One of the nodes uses PHP4 (works fine), and the other three use LINUX,
Apache and PHP 5.1.6 with the following SSL configuration:
OpenSSL support => enabled
OpenSSL Version => OpenSSL 0.9.8e-fips-rhel5 01 Jul 2008

-The nodes send the request to a remote webservice, and reply it back to the
main server.
-The main server processes the response and sends back the response to the
client.


The problem is that, eventually, the PHP5 nodes ends with blocked (sleeping)
processes from Apache. Finally the system runs out of resources because all
the blocked-sleeping processes have a ESTABLISHED socket.
First I noticed that, at least in that version of PHP, the blocking sockets
can have a send and recive timeout. Then I switched to non blocking sockets
and select. Now the application works fine in the upper level of the
communication with the other webservices, but I noticed that it still ends
with sleeping processes. I "straced" the Apache and this is what I found
(the bound IP and the remote webservice IP are not real; the second one is
from google.com):

Block 1 -

5503  socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 20
5503  bind(20, {sa_family=AF_INET, sin_port=htons(0),
sin_addr=inet_addr("10.6.0.11")}, 16) = 0
5503  fcntl64(20, F_GETFL)              = 0x2 (flags O_RDWR)
5503  fcntl64(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0
5503  connect(20, {sa_family=AF_INET, sin_port=htons(443),
sin_addr=inet_addr("74.125.127.100")}, 16) = -1 EINPROGRESS (Operation now
in progress)
5503  poll([{fd=20, events=POLLIN|POLLOUT|POLLERR|POLLHUP}], 1, 1370327) = 1
([{fd=20, revents=POLLOUT}])

Block 2 -

5503  getsockopt(20, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
5503  fcntl64(20, F_SETFL, O_RDWR)      = 0
5503  time(NULL)                        = 1251911098
5503  brk(0xb9e39000)                   = 0xb9e39000
5503  time(NULL)                        = 1251911098
5503  write(20,
"\200d\1\3\1\0K\0\0\0\20\0\0009\0\0008\0\0005\0\0\26\0\0\23\0\0\n\7\0\300"...,
102) = 102
5503  read(20, "\26\3\1\0J\2\0", 7)     = 7
5503  time(NULL)                        = 1251911098
5503  time(NULL)                        = 1251911098
5503  read(20,
"\0F\3\1J\236\245\273^\v\375\357<\205\223\334n}\226k\232\\\311\300\21Q\256\267\304\257o#"...,
72) = 72
5503  read(20, "\26\3\1\6\252", 5)      = 5
5503  read(20,
"\v\0\6\246\0\6\243\0\3\2400\202\3\2340\202\3\5\240\3\2\1\2\2\4<\260Eh0\r\6"...,
1706) = 1376
5503  read(20, " 2 CA1\r0\v\6\3U\4\3\23\4CRL10+\6\3U\35\20\4$0\"\200"...,
330) = 330
5503  read(20, "\26\3\1\1^", 5)         = 5
5503  read(20, "\r\0\1v\3\...@\1p\0m0k1\v0\t\6\3u\4\6\23\2es1\0170\r\6"...,
350) = 350
5503  write(20,
"\26\3\1\0\7\v\0\0\3\0\0\0\26\3\1\0\206\20\0\0\202\0\200\215\30
\tT\312\313\255\210"..., 210) = 210
5503  read(20, "\24\3\1\0\1", 5)        = 5
5503  read(20, "\1", 1)                 = 1
5503  read(20, "\26\3\1\0000", 5)       = 5
5503  read(20,
"\364\35k\17\25\224\314]|\263Y\307z\310\223\314\210\235h\224t\345\305\373\267\241\377\t\1s\345\250"...,
48) = 48

Block 3 -

5503  fcntl64(20, F_GETFL)              = 0x2 (flags O_RDWR)
5503  fcntl64(20, F_SETFL, O_RDWR|O_NONBLOCK) = 0

Block 4 -

5503  select(21, [], [20], [], {44320, 0}) = 1 (out [20], left {44320, 0})
5503  gettimeofday({1251911098, 885029}, NULL) = 0
5503  write(13, "[Wed Sep 02 19:04:58 2009] [erro"..., 161) = 161
5503  write(20,
"\27\3\1\2\240\10\260\325k\v\37{\251\355t_R\274\312\16\354r\315\1\277\204b\342xU\30L"...,
677) = 677
5503  select(21, [20], [], [], {44320, 0} <unfinished ...>
5503  <... select resumed> )            = 1 (in [20], left {44319, 828000})
5503  read(20, "\27\3\1\0\240", 5)      = 5
5503  read(20,
"\304\310\273s\370\304e\335\273\1\346\376N\234\376\226\276\0\fG9T\313\235\2&\31\340B=\f\""...,
160) = 160
5503  select(21, [20], [], [], {44320, 0}) = 1 (in [20], left {44320, 0})
5503  read(20, "\27\3\1\4P", 5)         = 5
5503  read(20,
"\333`\341\212},\215\3476\370\325\273\"\260\340\234\365\324\324\10a\25v\327\266'+\340\34\1\317\364"...,
1104) = 1104

Block 5 -

5503  write(20, "\25\3\1\0
\305\10\236\5\10\226JS\330\365\204\21\230\246\362\247R\332\220:\210o\2\333\10.\27"...,
37) = 37
5503  brk(0xb9e2d000)                   = 0xb9e2d000
5503  close(20)                         = 0

Block 6 -

5503  brk(0xb9e2b000)                   = 0xb9e2b000
5503  chdir("/")                        = 0
5503  setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) =
0
5503  writev(19,
[{"\27\3\1\0\260n?\301\204\263\0341yi\204\v\273\332\335\263\277\335M\t\223\364$\266\214\212c\362"...,
714}], 1) = 714
5503  gettimeofday({1251911099, 58549}, NULL) = 0
5503  write(14, "10.6.0.1 - noone [02/Sep"..., 97) = 97
5503  gettimeofday({1251911099, 58696}, NULL) = 0
5503  times({tms_utime=1, tms_stime=0, tms_cutime=0, tms_cstime=0}) =
369262016
5503  gettimeofday({1251911099, 58799}, NULL) = 0
5503  writev(19, [{"\25\3\1\0
MJi\215B\321Q\2657E\30cw\341n\264o\215\17\237{ej\342n\222\202"..., 37}], 1)
= 37
5503  brk(0xb9e2a000)                   = 0xb9e2a000
5503  shutdown(19, 1 /* send */)        = 0
5503  poll([{fd=19, events=POLLIN}], 1, 2000) = 1 ([{fd=19,
revents=POLLIN}])
5503  read(19, "\25\3\1\0
\230\220\315\210n\342zc\201\367H\r'xPt\277\243\346\227\320g\2223;U\352"...,
512) = 37
5503  gettimeofday({1251911099, 59317}, NULL) = 0
5503  gettimeofday({1251911099, 59372}, NULL) = 0
5503  poll([{fd=19, events=POLLIN}], 1, 2000) = 1 ([{fd=19,
revents=POLLIN|POLLERR|POLLHUP}])
5503  read(19, 0xbf81c504, 512)         = -1 ECONNRESET (Connection reset by
peer)
5503  close(19)                         = 0
5503  read(7, 0xbf81c713, 1)            = -1 EAGAIN (Resource temporarily
unavailable)
5503  close(18)                         = 0
5503  close(8)                          = 0
5503  close(7)                          = 0
5503  exit_group(0)                     = ?

As you can see I devided in 6 blocks this piece of strace output. In the
first block, the socket it's created, its bound to a network interface, and
it's connected to the remote host; fine. But in the block 2 the socket it's
switched back to blocking and then starts the SSL communication. In the
block 3 the socket it's switched to non blocking (this is the first time
that our application have chenged the socket options, because we don't have
the stream until the stream_socket_client function returns). Then, in the
block 4, the PHP that I wrote starts to work and the send / receive takes
place. In the block 5 there is a write without select... why? Well, I think
that is another "feature" of the underlaying socket communications, related
to the fclose of the stream. At last, the application ends in the block 6
sending the result back to the main server and closing resources.

After saying all of this, my question is... why the SSL implementation uses
potentially blocking code? The only work around that I found (inside
PHP/Apache) for facing this issue is setting a max execution time of the
apache processes because the PHP max_execution_time doesn't count the
sleeping time. I can't test right now this code in a latest version of PHP,
but I'm afraid that the strace output will be the same.


Regards,

Eloy Bote.

Reply via email to