On 04/09/2014 12:41 PM, graham sanderson wrote:
Michael, it is not that the connections are being dropped, it is that
the connections are not being dropped.
Thanks for the clarification.
These server-side sockets are ESTABLISHED, even though the client
connection on the other side of the network device is long gone. This
may well be an issue with the network device (it seems to be valiantly
trying to keep the connection alive).
Have you tested whether they *ever* time out on their own, or do they
just keep sticking around forever? (Maybe after 432000 sec (120 hours),
which is the default for nf_conntrack_tcp_timeout_established?) Trying
out all the usage scenarios is really the way to track it down - directly
on the switch, behind/in front of the firewall, on/off the VPN.
That said, enabling KEEPALIVE on the server side would not be a bad
idea. At least then the OS on the server would eventually (probably
after 2 hours of inactivity) send keepalive probes to the client. At
that point hopefully something interesting would happen, perhaps
causing an error and destroying the server-side socket. (Note that
KEEPALIVE is also good for preventing idle connections from being
dropped by other network devices along the way.)
Tuning net.ipv4.tcp_keepalive_* could be helpful, if you find they only
time out after 2 hours, which is the default.
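To make that concrete, here's a rough Python sketch (standalone code,
not anything from c* - the port and timing values are just examples) of
what enabling keepalive on an accepted server-side socket looks like on
Linux, including per-socket overrides of the net.ipv4.tcp_keepalive_*
defaults:

    import socket

    # Illustrative standalone server; 9042 is used only as an example port.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 9042))
    srv.listen(128)

    conn, addr = srv.accept()

    # Enable TCP keepalive on the accepted (server-side) connection.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # On Linux these override the kernel-wide defaults per socket
    # (net.ipv4.tcp_keepalive_time=7200, _intvl=75, _probes=9):
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 300)  # idle seconds before the first probe
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 30)  # seconds between probes
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 5)     # unanswered probes before the socket errors out

If the client really is gone, the probes eventually go unanswered, the
socket gets an error, and the server side is torn down - which is
exactly the cleanup that's missing here.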
rpc_keepalive on the server sets keepalive on the server-side sockets
for Thrift, and is true by default.
There doesn't seem to be an equivalent setting for the native protocol.
Note this isn't a huge issue for us; the connections can be cleaned up
by a rolling restart, and this particular case is not production but
development/testing against alpha by people working remotely over VPN -
so it may well be the VPN's fault in this case… that said (and maybe
this is a dev list question), it seems like the option to set keepalive
should exist.
Yeah, but I agree you shouldn't have to restart to clean up connections
- that's why I think it is lower in the network stack, and that a bit of
troubleshooting and tuning might be helpful. That setting sounds like a
good Jira request - keepalive may be the default, I'm not sure. :)
--
Michael
On Apr 9, 2014, at 12:25 PM, Michael Shuler <mich...@pbandjelly.org>
wrote:
On 04/09/2014 11:39 AM, graham sanderson wrote:
Thanks, but I would think that just sets keepalive from the
client end; I'm talking about the server end… this is one of
those issues where there is something (e.g. a switch, firewall, or VPN)
in between the client and the server, and we get left with
orphaned ESTABLISHED connections to the server when the client is
gone.
There would be no server setting for any service, not just c*, that
would correct mis-configured connection-assassinating network gear
between the client and server. Fix the gear to allow persistent
connections.
Digging through the various timeouts in c*.yaml didn't lead me to a
simple answer for something tunable, but I think this may be more of a
basic networking issue. I believe it's up to the client to keep
the connection open, as Duy indicated. I don't think c* will
arbitrarily sever connections - but something that disconnects the
client may happen. In that case, the TCP connection on the server
should drop to TIME_WAIT. Is this what you are seeing in `netstat
-a` on the server - a bunch of TIME_WAIT connections hanging
around? Those should eventually be recycled, but that's tunable in
the network stack if they are being generated at a high rate.
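If it helps, here is a quick sketch (assuming Python and the iproute2
`ss` tool are available on the server) that tallies TCP connection
states, so you can see at a glance whether the stuck sockets are
ESTABLISHED or TIME_WAIT:

    import subprocess
    from collections import Counter

    # Tally TCP socket states; `ss -tan` prints the state in the first
    # column (with `netstat -ant` the state is the last column instead).
    def tcp_state_counts():
        out = subprocess.check_output(["ss", "-tan"]).decode()
        lines = out.splitlines()[1:]  # skip the header row
        return Counter(line.split()[0] for line in lines if line.strip())

    if __name__ == "__main__":
        for state, count in tcp_state_counts().most_common():
            print("%-12s %d" % (state, count))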
-- Michael