On Thu, 2015-07-09 at 11:59 +0200, Tor Houghton wrote: 
> On Wed, Jul 08, 2015 at 10:04:27PM -0500, Theodore Wynnychenko wrote:
> > 
> > [snip]
> >
> > server https://server2.tldn.com, client 2067 (63 active), 10.0.28.254:60330 -> 10.0.28.130:443, buffer event error
> > [..]
> > server https://server2.tldn.com, client 2068 (63 active), 10.0.28.254:52350 -> 10.0.28.130:443, buffer event error
> 
> I'm going to "me too" on this one (have not been until now, as I thought
> perhaps it was due to my setup, and therefore off-topic).

Likewise, seeing the same behaviour here on 5.7-stable -- so the
problem is not confined to -current.

Fairly small & simple httpd setup here, httpd configured with 3 server
stanzas: 2 HTTPS-only (both using FastCGI) plus one trivial HTTP-only
(just a block return 303 pointing to one of the HTTPS servers). Quite a
light load too (averaging 178k requests/day -- about 2/sec).

Frequency of problem varies wildly -- sometimes occurs after only an
hour or two since last httpd restart and at other times httpd will last
for up to 4 days before it stops responding to requests. Variation in
volume of requests appears to have no effect on frequency of recurrence
either.

On every occasion, httpd continues to respond correctly to signals
(httpd restarts are always clean), just not to HTTP[S] requests.

On at least one occasion, the http socket continued to respond correctly
to requests, whilst the two https ones stopped responding. On other
occasions, all 3 stopped responding at around the same time.

When a socket stops responding, it still accepts connections, but httpd
neither logs the requests (at least when not in debug mode) nor responds
to them (i.e. I can open a TCP session to the listening socket and send
it a request, but nothing comes back after the initial ACK).
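
For anyone wanting to automate that check, here is a minimal sketch in
Python. The function name `probe`, the three result labels and the
5-second timeout are my own choices, not anything from httpd; TLS
certificate verification is switched off on the assumption that the
probe only needs to know whether anything comes back at all.

```python
import socket
import ssl


def probe(host, port, use_tls=False, timeout=5.0):
    """Return 'unreachable', 'silent' or 'responding' for a listener."""
    try:
        # Fails if the connection is refused or the connect times out.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    try:
        if use_tls:
            # Assumption: we only care about liveness, so skip verification.
            ctx = ssl.create_default_context()
            ctx.check_hostname = False
            ctx.verify_mode = ssl.CERT_NONE
            sock = ctx.wrap_socket(sock, server_hostname=host)
        request = "HEAD / HTTP/1.0\r\nHost: {}\r\n\r\n".format(host)
        sock.sendall(request.encode("ascii"))
        data = sock.recv(1024)
        # An empty read means the peer closed without sending anything.
        return "responding" if data else "silent"
    except socket.timeout:
        # Connection accepted, but no reply (or no TLS handshake) in time:
        # the symptom described above.
        return "silent"
    except OSError:
        # Reset or TLS failure after connecting: also no usable reply.
        return "silent"
    finally:
        sock.close()
```

Something like `probe("192.168.137.1", 443, use_tls=True)` run from cron
would then flag the hang as "silent" rather than "unreachable", which
distinguishes it from httpd simply not running.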

It hasn't happened here in a few days now so I don't have a log extract
on hand to share (but can post one next time it happens).

From memory, we previously saw TLS accept failures in the logs, as
reported by the original poster, but only well before the sockets
stopped responding, not at the time, so I had also assumed they were
unrelated. On previous occasions, running tcpdump on both user-facing
interfaces (and on pflog0, to rule out an error in our pf.conf) whilst
httpd was not responding to requests revealed nothing new.

I have tried watching the debug output a couple of times before, but it
rapidly becomes unwieldy even with our modest load (especially over a
remote ssh session, as both uplinks at that site are nearing capacity),
given how long the problem can take to manifest. On each occasion I gave
up after a few hours without the problem occurring.

Am now running httpd -dvvv with stdout/err redirected to a temporary log
file (probably should have done that in the first place).
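
A rough sketch of the same redirection done from a Python wrapper, in
case that is easier to supervise than a shell redirect. The helper name
`start_logged` and the log path in the usage note are hypothetical; the
only httpd flags assumed are the -d and -v already mentioned above.

```python
import subprocess


def start_logged(argv, log_path):
    """Start argv with stdout and stderr both appended to log_path."""
    log = open(log_path, "ab")
    try:
        # stderr is merged into stdout so the log keeps a single ordering.
        return subprocess.Popen(argv, stdout=log, stderr=subprocess.STDOUT)
    finally:
        # The child holds its own copy of the descriptor; ours can go.
        log.close()
```

Usage would be along the lines of
`start_logged(["httpd", "-dvvv"], "/tmp/httpd-debug.log")`, run as root,
keeping the returned process object around to signal it later.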

We are already seeing (after less than a minute) entries in the debug
logs similar to those reported by Theodore, for example:

* On an HTTPS server (using FastCGI):
server portal, client 305 (14 active), 192.168.137.161:52224 ->
192.168.137.1:443, buffer event error

and

* On the trivial HTTP server (using just a block return 303):
server redir, client 132 (11 active), 192.168.137.100:61081 ->
192.168.137.1, buffer event timeout

However, the original problem (httpd stops responding to requests) is
*not* occurring at present.
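
Since these entries show up whether or not httpd is hung, it may help to
count them per server and compare periods before and after a hang. A
minimal sketch follows; the pattern is inferred only from the excerpts
quoted in this thread (not from the httpd source), it assumes each entry
sits on one unwrapped line in the log file, and the names `LINE_RE` and
`count_buffer_events` are mine.

```python
import re
from collections import Counter

# Pattern guessed from the log excerpts in this thread; an assumption,
# not a documented httpd log format.
LINE_RE = re.compile(
    r"server (?P<server>[^,]+), client \d+ \(\d+ active\), "
    r"\S+ -> [^,\s]+, buffer event (?P<kind>\w+)"
)


def count_buffer_events(lines):
    """Count 'buffer event <kind>' entries per (server, kind) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            counts[(match.group("server"), match.group("kind"))] += 1
    return counts
```

Feeding it an open log file (for example
`count_buffer_events(open("/tmp/httpd-debug.log"))`, path hypothetical)
gives a tally keyed by server name and event kind.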

Will post debug log extract & httpd.conf next time the problem recurs
(should be within the next few days).
