On Mon, 2015-07-13 at 11:02 +0200, Tor Houghton wrote:
> On Sun, Jul 12, 2015 at 07:56:37PM +0930, Jack Burton wrote:
> >
> > It is possible I simply failed to provision sufficient capacity --
> > which could easily be fixed by adding a login class for www with a
> > higher limit on open fds -- but I fear that might just be hiding the
> > problem rather than addressing it: exhausting a 512 fd limit with
> > peak load of only 48 req/sec (and average load of 2 req/sec) just
> > doesn't feel right (especially when that peak load is all 303s
> > generated internally by httpd, which each take only a tiny fraction
> > of a second to process).
>
> I don't pretend to know httpd (at all), but I'm wondering, what should
> fstat(1) say, over time, for the httpd processes?

Thanks Tor -- that was exactly the clue I needed to isolate the problem.

I wrote a short script to parse the output of fstat -p for each running
httpd process (we're running with prefork 8, so I didn't fancy doing it
by hand) and report, for each client IP with an open socket, the
timestamp of its last request in the relevant access log (or 'missing'
if there is no entry in the current access log).

I ran it roughly 4 hours after the last log rotation and found only 34
matches out of 73 open sockets. We don't run anything here that would
take anywhere near 4 hours to return a response, so the 39 that didn't
match entries in any of the current access logs were clearly where I
needed to look.

All 39 related to "admin" -- the one HTTPS server that I hadn't spent
any time looking into (since it accounts for only 0.02% of httpd's load
here, it didn't occur to me that such a tiny thing could be bringing
httpd to its knees ... famous last words).

admin talks to a custom FastCGI daemon, which is most likely the
culprit -- I'll debug it tomorrow. "portal" (the other HTTPS server)
also talks to a (different) custom FastCGI daemon, but it carries orders
of magnitude more traffic and didn't have any stale sockets -- so
clearly our problem is at the other end of admin's FastCGI socket (not
with httpd itself).

Sorry for the noise.

Ted -- similarly, you may want to look into whatever is at the other end
of your "server1"'s FastCGI socket. If your issue is the same as ours,
that's likely where you'll find the cause.
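
For anyone wanting to repeat the check, a rough Python sketch of the
fstat/access-log cross-check described above follows. It is not the
original script. It assumes (unverified here) that fstat prints
connected TCP sockets as "local:port <-- peer:port" at the end of the
line, that clients are IPv4, and that each access log line carries the
client IP as its second whitespace-separated field plus a timestamp in
square brackets. The function names and the ip_field parameter are
hypothetical; adjust them to whatever your fstat and log format
actually produce.

```python
import re

# Assumed fstat socket line ending: "<local>:<port> <-- <peer>:<port>" (IPv4 only).
PEER_RE = re.compile(r"<--\s+(\d{1,3}(?:\.\d{1,3}){3}):\d+\s*$")
# Assumed access log timestamp: the first "[...]" group on the line.
STAMP_RE = re.compile(r"\[([^\]]+)\]")


def peer_ips(fstat_lines):
    """Return the peer IP of every line that looks like a connected socket."""
    ips = []
    for line in fstat_lines:
        match = PEER_RE.search(line)
        if match:
            ips.append(match.group(1))
    return ips


def last_seen(log_lines, ip_field=1):
    """Map client IP -> timestamp of its last entry in the access log.

    ip_field is the (assumed) index of the client IP column.
    """
    seen = {}
    for line in log_lines:
        fields = line.split()
        stamp = STAMP_RE.search(line)
        if len(fields) > ip_field and stamp:
            seen[fields[ip_field]] = stamp.group(1)  # later lines overwrite earlier
    return seen


def report(fstat_lines, log_lines, ip_field=1):
    """Pair each open socket's peer IP with its last log timestamp or 'missing'."""
    seen = last_seen(log_lines, ip_field)
    return [(ip, seen.get(ip, "missing")) for ip in peer_ips(fstat_lines)]
```

Feed report() the captured output of fstat -p for each httpd pid and the
lines of the current access log. Sockets reported as 'missing' long after
the last log rotation are the ones worth chasing.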