On Sat, 2015-07-11 at 15:38 +0930, Jack Burton wrote:
> It hasn't happened here in a few days now so I don't have a log extract
> on hand to share (but can post one next time it happens).
Okay, the issue returned this afternoon and the httpd debug output certainly sheds more light on the problem.

This time we didn't see either the TLS or buffer event errors anywhere near the time at which httpd stopped responding to requests. Instead, we're getting "server_accept: deferring connections". According to the comments in server.c, that means we're running out of file descriptors.

That struck me as odd, as our traffic generally isn't anywhere near high enough to expect that, so I checked the traffic at the time; there was indeed a spike, although it didn't seem high enough to cause issues. Peak load was 48 requests in the one second before httpd stopped responding. All 48 of those requests were to the trivial http server, whose config is just:

    listen on $int_addr port 80
    block return 303 "https://portal.tvir.acscomp.net"

(yes, I know that hostname doesn't resolve publicly -- but it does when using the resolver assigned by dhcp on the semi-public [but not Internet-facing] network on which our httpd listens)

As an aside, I didn't see in the debug output any requests during that final second [although there were two a couple of seconds later] to the target https server "portal" (which is served by the same instance of httpd) -- but I guess it's possible that all 48 clients either didn't act on the 303 or already had its target in their caches (the environment is a residential building for tertiary students, so the user base is fairly static at this time of year, so it seems well within the realms of possibility that all 48 had / on portal cached).

Debug output at the time httpd stopped responding reads (after 47 other requests to the trivial http server, all timestamped 16:08:54):

redir 192.168.137.160 - - [12/Jul/2015:16:08:54 +0930] "GET /personal HTTP/1.1" 303 0
server redir, client 119933 (505 active), 192.168.137.160:40521 -> 192.168.137.1, https://portal.tvir.acscomp.net (303 See Other)
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119935 (505 active), 192.168.137.160:45643 -> 192.168.137.1, done
server redir, client 119934 (504 active), 192.168.137.160:40526 -> 192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119936 (505 active), 192.168.137.160:47925 -> 192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119938 (505 active), 192.168.137.160:40528 -> 192.168.137.1, done
server redir, client 119937 (504 active), 192.168.137.160:40527 -> 192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119940 (505 active), 192.168.137.160:37213 -> 192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.99 - - [12/Jul/2015:16:08:56 +0930] "GET / HTTP/1.1" 200 0
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.112 - - [12/Jul/2015:16:08:57 +0930] "GET / HTTP/1.1" 200 0
server_accept: deferring connections

Then nothing but "server_accept: deferring connections" over and over again.
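For context on what "deferring" means here: as far as I can tell from server.c, when accept() fails with ENFILE or EMFILE, httpd takes the listening socket out of the event loop and arms a short timer to re-add it later. Roughly this pattern (paraphrased from my reading, names from memory -- not the literal source):

    /*
     * Sketch of the deferral path in server_accept(): on fd
     * exhaustion, stop polling the listener and retry after a
     * short pause instead of letting libevent spin.
     */
    s = accept(fd, (struct sockaddr *)&ss, &slen);
    if (s == -1) {
            if (errno == ENFILE || errno == EMFILE) {
                    struct timeval evtpause = { 1, 0 };

                    event_del(&srv->srv_ev);                /* stop accepting */
                    evtimer_add(&srv->srv_evt, &evtpause);  /* retry in ~1s */
                    log_debug("%s: deferring connections", __func__);
            }
            return;
    }

So the endless stream of "deferring connections" lines means the retry keeps firing and accept() keeps failing, i.e. the descriptors are never coming back.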
It is possible I simply failed to provision sufficient capacity -- which could easily be fixed by adding a login class for www with a higher limit on open fds (a sketch of what I mean is below) -- but I fear that might just be hiding the problem rather than addressing it: exhausting a 512 fd limit with a peak load of only 48 req/sec (and an average load of 2 req/sec) just doesn't feel right, especially when that peak load is all 303s generated internally by httpd, each of which takes only a tiny fraction of a second to process. Note too that the "(505 active)" counts in the extract above are already right up against that 512 limit, which is far more than the 48-request spike alone could account for.

I notice in the source that server_close_http() is responsible for freeing session-specific fds, and that it's called from server_close(), which is also responsible for generating the "..., done" debug messages and decrementing the active client count. We're only seeing those "..., done" messages in the debug output for a small proportion of completed HTTP sessions, and the active client count continues to grow (falling only occasionally), even when there is much less HTTP traffic. It seems as if some HTTP sessions get their fds freed on completion while others don't (I've sketched my reading of the teardown path below) ... but I can't find anything in the source to support that conjecture. Could someone who's more familiar with httpd than I am offer a clue please?
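For completeness, the capacity bump I was considering is just a login class along these lines in /etc/login.conf (limits illustrative only -- I haven't settled on values, and I'd still want to understand why 512 gets exhausted first):

    www:\
            :openfiles-cur=1024:\
            :openfiles-max=1024:\
            :tc=daemon:

with the www user moved into that class (e.g. usermod -L www www) before restarting httpd.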
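And to make the conjecture concrete, here's the teardown path as I currently understand it (paraphrased pseudocode from my reading of server.c, not the literal source -- happy to be corrected):

    /*
     * What I believe should happen exactly once for every
     * finished session (names approximate):
     */
    void
    server_close(struct client *clt, const char *msg)
    {
            server_close_http(clt);   /* free session-specific fds */
            /* ... tear down events/bufferevents, close the socket ... */
            server_clients--;         /* the "(N active)" count */
            log_debug("server %s, client %d (%d active), ..., %s",
                srv_conf->name, clt->clt_id, server_clients, msg);
                                      /* the "..., done" line when msg is "done" */
            free(clt);
    }

If that reading is right, every completed session should produce one such line and drop the active count by one; since most of ours don't, it looks as though some exit paths never reach server_close() at all -- but as I said, I can't find such a path in the source.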