On our production systems we did hit the "relay_connect: no connection in
flight" on a so regular bases that I had to make it non-fatal with the
result of leaking sockets.
After more investigation I found the problem to be a race against
connecting to the backend servers. In short:
- relay_read_http() will open a connection if following conditions are met
cre->dir == RELAY_DIR_REQUEST && cre->toread <= 0 && cre->dst->bev == NULL
- relay_connect() does not initialize con->se_out.bev (which is also
cre->dst->bev). Instead this is deferred to relay_connected()
- if a event happens that calls relay_read_http() while connecting to the
backend then relay_connect() will be called again. Result is the panic
since the count gets out of sync.
The following diff solves this issue by adding an extra flag to
ctl_relay_event to know if a relay is already connected (or the connect is
pending). relay_close() will then clean the flag when closing the session.
I decided to use a flag since the EMFILE || ENFILE case is hard to detect
otherwise.
Running with this on production with no visible issues at the moment.
I think it would make sense to restructure the http proxy code more and
introduce a proper state machine but that is a much bigger and complex
issue, so lets fix the bug first.
OK?
--
:wq Claudio
Index: relay.c
===================================================================
RCS file: /cvs/src/usr.sbin/relayd/relay.c,v
retrieving revision 1.194
diff -u -p -r1.194 relay.c
--- relay.c 18 May 2015 16:57:20 -0000 1.194
+++ relay.c 21 May 2015 08:15:05 -0000
@@ -1496,6 +1498,12 @@ relay_connect(struct rsession *con)
struct timeval evtpause = { 1, 0 };
int bnds = -1, ret;
+ /* relay_connect should only be called once per relay */
+ if (con->se_out.connected) {
+ log_debug("%s: connect already called once", __func__);
+ return (0);
+ }
+
/* Connection is already established but session not active */
if ((rlay->rl_conf.flags & F_TLSINSPECT) && con->se_out.s != -1) {
if (con->se_out.ssl == NULL) {
@@ -1555,6 +1565,8 @@ relay_connect(struct rsession *con)
event_del(&rlay->rl_ev);
evtimer_add(&con->se_inflightevt, &evtpause);
evtimer_add(&rlay->rl_evt, &evtpause);
+ /* this connect is pending */
+ con->se_out.connected = 1;
return (0);
} else {
if (con->se_retry) {
@@ -1572,6 +1584,7 @@ relay_connect(struct rsession *con)
}
}
+ con->se_out.connected = 1;
relay_inflight--;
DPRINTF("%s: inflight decremented, now %d",__func__,
relay_inflight);
@@ -1673,6 +1686,7 @@ relay_close(struct rsession *con, const
event_add(&rlay->rl_ev, NULL);
}
}
+ con->se_out.connected = 0;
if (con->se_out.buf != NULL)
free(con->se_out.buf);
Index: relay_http.c
===================================================================
RCS file: /cvs/src/usr.sbin/relayd/relay_http.c,v
retrieving revision 1.47
diff -u -p -r1.47 relay_http.c
--- relay_http.c 22 May 2015 01:34:13 -0000 1.47
+++ relay_http.c 22 May 2015 13:40:38 -0000
@@ -419,7 +419,7 @@ relay_read_http(struct bufferevent *bev,
relay_reset_http(cre);
done:
if (cre->dir == RELAY_DIR_REQUEST && cre->toread <= 0 &&
- cre->dst->bev == NULL) {
+ !cre->dst->connected) {
if (rlay->rl_conf.fwdmode == FWD_TRANS) {
relay_bindanyreq(con, 0, IPPROTO_TCP);
return;
Index: relayd.h
===================================================================
RCS file: /cvs/src/usr.sbin/relayd/relayd.h,v
retrieving revision 1.209
diff -u -p -r1.209 relayd.h
--- relayd.h 2 May 2015 13:15:24 -0000 1.209
+++ relayd.h 21 May 2015 06:55:44 -0000
@@ -200,6 +200,7 @@ struct ctl_relay_event {
int line;
int done;
int timedout;
+ int connected;
enum direction dir;
u_int8_t *buf;