On Sat, 2015-07-11 at 15:38 +0930, Jack Burton wrote:
> It hasn't happened here in a few days now so I don't have a log extract
> on hand to share (but can post one next time it happens).
Okay, the issue returned this afternoon and the httpd debug output certainly sheds more light on the problem.

This time we didn't see either the TLS or buffer event errors anywhere near the time at which httpd stopped responding to requests. Instead, we're getting "server_accept: deferring connections". According to the comments in server.c, that means we're running out of file descriptors.

That struck me as odd, as our traffic generally isn't anywhere near high enough to expect that, so I checked the traffic at the time; there was indeed a spike, although it didn't seem high enough to cause issues. Peak load was 48 requests in the one second before httpd stopped responding. All 48 of those requests were to the trivial http server, whose config is just:

    listen on $int_addr port 80
    block return 303 "https://portal.tvir.acscomp.net"

(yes, I know that hostname doesn't resolve publicly -- but it does when using the resolver assigned by dhcp on the semi-public [but not Internet-facing] network on which our httpd listens)

As an aside, I didn't see in the debug output any requests during that final second [although there were two a couple of seconds later] to the target https server "portal" (which is served by the same instance of httpd) -- but I guess it's possible that all 48 clients either didn't act on the 303 or already had its target in their caches (the environment is a residential building for tertiary students, so the user base is fairly static at this time of year, so it seems well within the realms of possibility that all 48 had / on portal cached).

Debug output at the time httpd stopped responding reads (after 47 other requests to the trivial http server, all timestamped 16:08:54):

redir 192.168.137.160 - - [12/Jul/2015:16:08:54 +0930] "GET /personal HTTP/1.1" 303 0
server redir, client 119933 (505 active), 192.168.137.160:40521 -> 192.168.137.1, https://portal.tvir.acscomp.net (303 See Other)
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119935 (505 active), 192.168.137.160:45643 -> 192.168.137.1, done
server redir, client 119934 (504 active), 192.168.137.160:40526 -> 192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119936 (505 active), 192.168.137.160:47925 -> 192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119938 (505 active), 192.168.137.160:40528 -> 192.168.137.1, done
server redir, client 119937 (504 active), 192.168.137.160:40527 -> 192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server redir, client 119940 (505 active), 192.168.137.160:37213 -> 192.168.137.1, done
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.99 - - [12/Jul/2015:16:08:56 +0930] "GET / HTTP/1.1" 200 0
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
server_accept: deferring connections
portal.tvir.acscomp.net 192.168.137.112 - - [12/Jul/2015:16:08:57 +0930] "GET / HTTP/1.1" 200 0
server_accept: deferring connections

Then nothing but "server_accept: deferring connections" over and over again.
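For context on what "deferring" means here: as far as I can tell from server.c, when accept() fails with ENFILE or EMFILE, httpd takes the listening socket out of the event loop and arms a short timer to re-add it later. Roughly this pattern (paraphrased from my reading, names from memory -- not the literal source):

    /*
     * Sketch of the deferral path in server_accept(): on fd
     * exhaustion, stop polling the listener and retry after a
     * short pause instead of letting libevent spin.
     */
    s = accept(fd, (struct sockaddr *)&ss, &slen);
    if (s == -1) {
            if (errno == ENFILE || errno == EMFILE) {
                    struct timeval evtpause = { 1, 0 };

                    event_del(&srv->srv_ev);                /* stop accepting */
                    evtimer_add(&srv->srv_evt, &evtpause);  /* retry in ~1s */
                    log_debug("%s: deferring connections", __func__);
            }
            return;
    }

So the endless stream of "deferring connections" lines means the retry keeps firing and accept() keeps failing, i.e. the descriptors are never coming back.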
It is possible I simply failed to provision sufficient capacity -- which could easily be fixed by adding a login class for www with a higher limit on open fds (a sketch of what I mean is below) -- but I fear that might just be hiding the problem rather than addressing it: exhausting a 512 fd limit with a peak load of only 48 req/sec (and an average load of 2 req/sec) just doesn't feel right, especially when that peak load is all 303s generated internally by httpd, each of which takes only a tiny fraction of a second to process. Note too that the "(505 active)" counts in the extract above are already right up against that 512 limit, which is far more than the 48-request spike alone could account for.

I notice in the source that server_close_http() is responsible for freeing session-specific fds, and that it's called from server_close(), which is also responsible for generating the "..., done" debug messages and decrementing the active client count. We're only seeing those "..., done" messages in the debug output for a small proportion of completed HTTP sessions, and the active client count continues to grow (falling only occasionally), even when there is much less HTTP traffic. It seems as if some HTTP sessions get their fds freed on completion while others don't (I've sketched my reading of the teardown path below) ... but I can't find anything in the source to support that conjecture. Could someone who's more familiar with httpd than I am offer a clue please?
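For completeness, the capacity bump I was considering is just a login class along these lines in /etc/login.conf (limits illustrative only -- I haven't settled on values, and I'd still want to understand why 512 gets exhausted first):

    www:\
            :openfiles-cur=1024:\
            :openfiles-max=1024:\
            :tc=daemon:

with the www user moved into that class (e.g. usermod -L www www) before restarting httpd.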
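And to make the conjecture concrete, here's the teardown path as I currently understand it (paraphrased pseudocode from my reading of server.c, not the literal source -- happy to be corrected):

    /*
     * What I believe should happen exactly once for every
     * finished session (names approximate):
     */
    void
    server_close(struct client *clt, const char *msg)
    {
            server_close_http(clt);   /* free session-specific fds */
            /* ... tear down events/bufferevents, close the socket ... */
            server_clients--;         /* the "(N active)" count */
            log_debug("server %s, client %d (%d active), ..., %s",
                srv_conf->name, clt->clt_id, server_clients, msg);
                                      /* the "..., done" line when msg is "done" */
            free(clt);
    }

If that reading is right, every completed session should produce one such line and drop the active count by one; since most of ours don't, it looks as though some exit paths never reach server_close() at all -- but as I said, I can't find such a path in the source.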