Hi,
On 27/02/14 11:32, Julien Charbon wrote:
On 07/11/13 14:55, Julien Charbon wrote:
On Mon, 04 Nov 2013 22:21:04 +0100, Julien Charbon
<jchar...@verisign.com> wrote:
I have put technical and how-to-repeat details in below PR:
kern/183659: TCP stack lock contention with short-lived connections
http://www.freebsd.org/cgi/query-pr.cgi?pr=183659
We are currently working on this performance improvement effort; it
will impact only the TCP locking strategy not the TCP stack logic
itself. We will share on freebsd-net the patches we made for
reviewing and improvement propositions; anyway this change might also
require enough eyeballs to avoid tricky race conditions introduction
in TCP stack.
[...]
2. We studied an another lock contention related to INP_INFO when TCP
connections in TIME_WAIT state are cleaned-up. [...]
First a follow-up on TCP performance improvements and BSDCan 2014
discussions:
- First, changes related to TCP lock connection and TIME_WAIT list
have been reviewed, fixed, improved and accepted in HEAD:
http://svnweb.freebsd.org/base?view=revision&revision=264321
http://svnweb.freebsd.org/base?view=revision&revision=264342
http://svnweb.freebsd.org/base?view=revision&revision=264351
http://svnweb.freebsd.org/base?view=revision&revision=264356
Thanks to testers, reviewers and committers (jhb, rwatson, rrs,
adrian, glebius, gnn, bde, jmg, rrs, etc.).
- Second, it has been discussed in BSDCan that it would nice to also
share Verisign patches in public github. This is done here:
https://github.com/verisign/freebsd
We currently maintain two branches:
- One that follows 10.0-RELENG branch:
https://github.com/verisign/freebsd/commits/share/10.0-next
- One that follows HEAD:
https://github.com/verisign/freebsd/commits/share/head-next
These branches are cloned from on the handy github FreeBSD repository
mirror:
https://github.com/freebsd/freebsd
We pushed our stable enough TCP related patches in this repo.
- Third, joined a patch (tcp-scale-syn-v1.patch) we would propose for
review. Obviously, it is also available here:
https://github.com/verisign/freebsd/commit/38679f5f66b470d069d635bfc8d9ec6c39eb787b
This patch is based (like most of our TCP changes) on Robert Watson
major change to decompose inpcbinfo lock (aka ipi_lock or INP_INFO:)
Decompose the current single inpcbinfo lock into two locks:
http://svnweb.freebsd.org/base?view=revision&revision=222488
https://github.com/freebsd/freebsd/commit/fdfdadb612a4b077d62094c7d4aa65de3524cf62
In particular on this comment:
"
- The TCP syncache still relies on the pcbinfo lock, something that we
may want to revisit.
"
Basically this change prevents to take the TCP ipi_lock lock when a
SYN targeting a listening socket is received. We expected a need to add
another syncache mutex to take over ipi_lock role, but surprisingly we
don't find any requirements for a new mutex (and it doesn't mean that
this requirement doesn't exist).
This patch is quite stable under our workload, improves only a bit the
short-lived TCP connection rate (~+2%), however it can become useful
under SYN flood attack pressure.
We still have to test more thoroughly this patch in two contexts:
- VNET and,
- syncache.hashsize sets very low relative to TCP workload
As usual, tests, reviews and comments more than welcome.
Thanks.
--
Julien
diff --git a/sys/netinet/tcp_input.c b/sys/netinet/tcp_input.c
index f9a751c..4b6f41f 100644
--- a/sys/netinet/tcp_input.c
+++ b/sys/netinet/tcp_input.c
@@ -588,6 +588,7 @@ tcp_input(struct mbuf *m, int off0)
#endif /* INET6 */
struct tcpopt to; /* options in this segment */
char *s = NULL; /* address and port logging */
+ int needlock;
int ti_locked;
#define TI_UNLOCKED 1
#define TI_WLOCKED 2
@@ -770,12 +771,12 @@ tcp_input(struct mbuf *m, int off0)
/*
* Locate pcb for segment; if we're likely to add or remove a
- * connection then first acquire pcbinfo lock. There are two cases
+ * connection then first acquire pcbinfo lock. There are three cases
* where we might discover later we need a write lock despite the
* flags: ACKs moving a connection out of the syncache, and ACKs for
- * a connection in TIMEWAIT.
+ * a connection in TIMEWAIT, SYNs for a non-listening socket.
*/
- if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0) {
+ if ((thflags & (TH_FIN | TH_RST)) != 0) {
INP_INFO_WLOCK(&V_tcbinfo);
ti_locked = TI_WLOCKED;
} else
@@ -1004,10 +1005,18 @@ tcp_input(struct mbuf *m, int off0)
* now be in TIMEWAIT.
*/
#ifdef INVARIANTS
- if ((thflags & (TH_SYN | TH_FIN | TH_RST)) != 0)
+ if ((thflags & (TH_FIN | TH_RST)) != 0)
INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
#endif
- if (tp->t_state != TCPS_ESTABLISHED) {
+ needlock = 1;
+ if (tp->t_state == TCPS_ESTABLISHED) {
+ if ((thflags & TH_SYN) == 0)
+ needlock = 0;
+ } else {
+ if (tp->t_state == TCPS_LISTEN && (thflags & TH_SYN))
+ needlock = 0;
+ }
+ if (needlock) {
if (ti_locked == TI_UNLOCKED) {
if (INP_INFO_TRY_WLOCK(&V_tcbinfo) == 0) {
in_pcbref(inp);
@@ -1048,17 +1057,13 @@ tcp_input(struct mbuf *m, int off0)
/*
* When the socket is accepting connections (the INPCB is in LISTEN
* state) we look into the SYN cache if this is a new connection
- * attempt or the completion of a previous one. Because listen
- * sockets are never in TCPS_ESTABLISHED, the V_tcbinfo lock will be
- * held in this case.
+ * attempt or the completion of a previous one.
*/
if (so->so_options & SO_ACCEPTCONN) {
struct in_conninfo inc;
KASSERT(tp->t_state == TCPS_LISTEN, ("%s: so accepting but "
"tp not listening", __func__));
- INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
-
bzero(&inc, sizeof(inc));
#ifdef INET6
if (isipv6) {
@@ -1081,6 +1086,8 @@ tcp_input(struct mbuf *m, int off0)
* socket appended to the listen queue in SYN_RECEIVED state.
*/
if ((thflags & (TH_RST|TH_ACK|TH_SYN)) == TH_ACK) {
+
+ INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
/*
* Parse the TCP options here because
* syncookies need access to the reflected
@@ -1361,8 +1368,12 @@ tcp_input(struct mbuf *m, int off0)
syncache_add(&inc, &to, th, inp, &so, m, NULL, NULL);
/*
* Entry added to syncache and mbuf consumed.
- * Everything already unlocked by syncache_add().
+ * Nothing is unlocked by syncache_add().
*/
+ if (ti_locked == TI_WLOCKED) {
+ INP_INFO_WUNLOCK(&V_tcbinfo);
+ ti_locked = TI_UNLOCKED;
+ }
INP_INFO_UNLOCK_ASSERT(&V_tcbinfo);
return;
} else if (tp->t_state == TCPS_LISTEN) {
diff --git a/sys/netinet/tcp_syncache.c b/sys/netinet/tcp_syncache.c
index beb6749..9b981d3 100644
--- a/sys/netinet/tcp_syncache.c
+++ b/sys/netinet/tcp_syncache.c
@@ -1119,7 +1119,6 @@ syncache_add(struct in_conninfo *inc, struct tcpopt *to,
struct tcphdr *th,
struct syncache scs;
struct ucred *cred;
- INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
INP_WLOCK_ASSERT(inp); /* listen socket */
KASSERT((th->th_flags & (TH_RST|TH_ACK|TH_SYN)) == TH_SYN,
("%s: unexpected tcp flags", __func__));
@@ -1150,13 +1149,11 @@ syncache_add(struct in_conninfo *inc, struct tcpopt
*to, struct tcphdr *th,
#ifdef MAC
if (mac_syncache_init(&maclabel) != 0) {
INP_WUNLOCK(inp);
- INP_INFO_WUNLOCK(&V_tcbinfo);
goto done;
} else
mac_syncache_create(maclabel, inp);
#endif
INP_WUNLOCK(inp);
- INP_INFO_WUNLOCK(&V_tcbinfo);
/*
* Remember the IP options, if any.
_______________________________________________
freebsd-net@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-net
To unsubscribe, send any mail to "freebsd-net-unsubscr...@freebsd.org"