In message <[EMAIL PROTECTED]>, Graham Barr writes:

>Also why does this happen only every few hours ? There is a lot of
>data going through these connections maybe the timer for SO_RCVTIMEO
>is not being reset.
>
>But then we have another server, with a similar number of clients and
>data throughput, but it does not suffer from this problem.

I suspect that the server seeing this problem has a client that
occasionally disappears from the network, or for whatever reason
fails to respond to any packets for a long time (something like 5
or 10 minutes). I've seen blocking TCP writes return ETIMEDOUT when
the network between the client and the server goes down. In the
non-blocking case I think the following can happen:

        1) Client is connected to server.
        2) Network goes down, or client is turned off.
        3) Server performs non-blocking write() on socket
        4) Server uses poll/select/kevent waiting for data from socket
        5) The write operation times out because no acknowledgements
           have been received. This occurs after TCP_MAXRXTSHIFT
           retransmits; so->so_error is set to ETIMEDOUT and the
           connection is shut down. (I haven't read the code very
           carefully, so the details could be wrong.)
        6) select/poll/kevent notes the EOF condition, and says that
           the descriptor is ready to read.
        7) read() returns the real error, which is ETIMEDOUT.

I guess this should possibly be documented in read(2), but in
practice there are numerous network errors that can be returned
from read(). Normal practice in single-process servers is to
treat any unknown error from read(), write(), etc. as fatal
only to that client rather than to the whole server.
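
One way to express that policy is sketched below. It assumes a POSIX
system; the enum and the function classify_read_result are hypothetical
names invented for this example, and the choice of which errno values
count as transient is an assumption rather than anything taken from the
FreeBSD sources.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical per-client decision after a read() call. */
enum client_action {
	CLIENT_KEEP,	/* data arrived, keep serving this client */
	CLIENT_RETRY,	/* transient condition, try again later */
	CLIENT_DROP	/* close this client only, server keeps running */
};

/*
 * n is the return value of read(); err is the errno value saved right
 * after the call (only meaningful when n == -1).
 */
static enum client_action
classify_read_result(ssize_t n, int err)
{
	if (n > 0)
		return CLIENT_KEEP;
	if (n == 0)
		return CLIENT_DROP;	/* orderly EOF from the peer */

	switch (err) {
	case EINTR:
	case EAGAIN:
#if defined(EWOULDBLOCK) && EWOULDBLOCK != EAGAIN
	case EWOULDBLOCK:
#endif
		return CLIENT_RETRY;
	default:
		/*
		 * ETIMEDOUT, ECONNRESET, EHOSTUNREACH and anything else
		 * we do not recognise: fatal to this client, not to the
		 * whole server.
		 */
		return CLIENT_DROP;
	}
}
```

With something like this the main loop never needs an exhaustive list of
network errors; an unexpected ETIMEDOUT simply ends up in the default
case and costs one connection.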

Ian

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-hackers" in the body of the message