Garrett Wollman wrote:
> <<On Mon, 11 Mar 2013 21:25:45 -0400 (EDT), Rick Macklem
> <rmack...@uoguelph.ca> said:
> 
> > To be honest, I'd consider seeing a lot of non-empty receive queues
> > for TCP connections to the NFS server to be an indication that it is
> > near/at its load limit. (Sure, if you do netstat a lot, you will
> > occasionally see a non-empty queue here or there, but I would not
> > expect to see a lot of them non-empty a lot of the time.) If that is
> > the case, then the question becomes "what is the bottleneck?". Below
> > I suggest getting rid of the DRC in case it is the bottleneck for
> > your server.
> 
> The problem is not the DRC in "normal" operation, but the DRC when it
> gets into the livelocked state. I think we've talked about a number
> of solutions to the livelock problem, but I haven't managed to
> implement or test these ideas yet. I have a duplicate server up now,
> so I hope to do some testing this week.
> 
> In normal operation, the server is mostly idle, and the nfsd threads
> that aren't themselves idle are sleeping deep in ZFS waiting for
> something to happen on disk. When the arrival rate exceeds the rate
> at which requests are cleared from the DRC, *all* of the nfsd threads
> will spin, either waiting for the DRC mutex or walking the DRC finding
> that there is nothing that can be released yet. *That* is the
> livelock condition -- the spinning that takes over all nfsd threads is
> what causes the receive buffers to build up, and the large queues then
> maintain the livelocked condition -- and that is why it clears
> *immediately* when the DRC size is increased. (It's possible to
> reproduce this condition on a loaded server by simply reducing the
> tcphighwater to less than the current size.) Unfortunately, I'm at
> the NFSRVCACHE_FLOODLEVEL limit right now (64k), so there is no room for
> further increases until I recompile the kernel. It's probably a bug
> that the sysctl definition in drc3.patch doesn't check the new value
> against this limit.
> 
> Note that I'm currently running 64 nfsd threads on a 12-core
> (24-thread) system. In the livelocked condition, as you would expect,
> the system goes to 100% CPU utilization and the load average peaks out
> at 64, while goodput goes to nearly nil.
> 
Ok, I think I finally understand what you are referring to by your livelock.
Basically, you are at the tcphighwater mark and the trim passes don't
succeed in freeing up many cache entries, so every nfsd thread ends up
trying to trim the cache for every RPC it handles, and that slows the
server right down.
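
To keep every thread from piling into that trim work, the patch gates
nfsrc_trimcache() so that only one thread does it at a time. Condensed
from the patch below (a sketch, not the complete function):

	static void
	nfsrc_trimcache(u_int64_t sockref, struct socket *so)
	{
		static int onethread = 0;

		/*
		 * Let one thread in; everyone else returns immediately
		 * and goes back to servicing RPCs instead of spinning
		 * on the cache mutex.
		 */
		if (atomic_cmpset_acq_int(&onethread, 0, 1) == 0)
			return;
		/* ... UDP and TCP trim passes go here ... */
		atomic_store_rel_int(&onethread, 0);
	}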

I suspect it is the cached entries from dismounted clients that are
filling up the cache (you did mention clients using amd at some point
in the discussion, which implies frequent mounts/dismounts).
I'm guessing that the tcp cache timeout needs to be made a lot smaller
for your case.

> > For either A or B, I'd suggest that you disable the DRC for TCP
> > connections (email if you need a patch for that), which will have
> > a couple of effects:
> 
> I would like to see your patch, since it's more likely to be correct
> than one I might dream up.
> 
> The alternative solution is twofold: first, nfsrv_trimcache() needs to
> do something to ensure forward progress, even when that means dropping
> something that hasn't timed out yet, and second, the server code needs
> to ensure that nfsrv_trimcache() is only executing on one thread at a
> time. An easy way to do the first part would be to maintain an LRU
> queue for TCP in addition to the UDP LRU, and just blow away the first
> N (>NCPU) entries on the queue if, after checking all the TCP replies,
> the DRC is still larger than the limit. The second part is just an
> atomic_cmpset_int().
> 
I've attached a patch that makes assorted changes. I didn't use an LRU list,
since that would mean a single mutex for everything to contend on, but I
added a second pass to the nfsrc_trimcache() function that frees old
entries. (Approximate LRU, using a histogram of timeout values to select a
timeout value that frees enough of the oldest ones.)
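
In outline, that second pass works like this (condensed from the patch
below; a sketch of the idea, not a drop-in function):

	int i, j, k, time_histo[10];
	time_t thisstamp;

	/*
	 * Pass 1: while sweeping the hash chains, bucket each entry by
	 * how far in the future its timeout lies, from 0 up to
	 * nfsrc_tcptimeout seconds.
	 */
	j = rp->rc_timestamp - tcp_lasttrim;
	if (j >= nfsrc_tcptimeout)
		j = nfsrc_tcptimeout - 1;
	if (j < 0)
		j = 0;
	time_histo[(j * 10 / nfsrc_tcptimeout) % 10]++;

	/*
	 * Pass 2: if the cache is still within 20% of the high water
	 * mark, find the smallest set of oldest buckets holding at
	 * least 20% of the entries, derive a shorter effective timeout
	 * from that, and free everything older than it.
	 */
	j = nfsrc_tcphighwater / 5;	/* 20% of it */
	k = 0;
	for (i = 0; i < 8; i++) {
		k += time_histo[i];
		if (k > j)
			break;
	}
	k = nfsrc_tcptimeout * (i + 1) / 10;
	if (k < 1)
		k = 1;
	thisstamp = tcp_lasttrim + k;
	/* ... then free entries with thisstamp > rp->rc_timestamp ... */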

Basically, this patch:
- allows setting of the tcp timeout via vfs.nfsd.tcpcachetimeo
  (I'd suggest you go down to a few minutes instead of 12hrs)
- allows TCP caching to be disabled by setting vfs.nfsd.cachetcp=0
- does the two things you describe above to try to avoid the livelock,
  although not quite using an LRU list
- increases the hash table size to 500 (still a compile time setting)
  (feel free to make it even bigger)
- sets nfsrc_floodlevel to at least nfsrc_tcphighwater, so you can
  grow vfs.nfsd.tcphighwater as big as you dare
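
For example, with the patch applied, a starting point might be something
like this (untested values, just to illustrate the knobs):

	sysctl vfs.nfsd.tcpcachetimeo=300	# a few minutes instead of 12hrs
	sysctl vfs.nfsd.cachetcp=0		# or disable TCP caching entirely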

The patch includes a lot of drc2.patch and drc3.patch, so don't try to
apply it to an already-patched kernel. Hopefully it will apply cleanly
to vanilla sources.

The patch has been minimally tested.

If you'd rather not apply the patch, you can change NFSRVCACHE_TCPTIMEOUT
and set the variable nfsrc_tcpnonidempotent to 0 to get a couple of
the changes. (You'll have to recompile the kernel for these changes to
take effect.)
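
For reference, the spots to change in the vanilla sources would be
something like this (the TCPTIMEOUT definition is from memory, so check
your tree):

	/* fs/nfs/nfsrvcache.h: shorten the TCP timeout (stock is 12hrs) */
	#define	NFSRVCACHE_TCPTIMEOUT	(12 * 3600)	/* -> e.g. (5 * 60) */

	/* fs/nfsserver/nfs_nfsdcache.c: disable the DRC for TCP */
	static int nfsrc_tcpnonidempotent = 1;	/* -> change to 0 */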

Good luck with it, rick

> -GAWollman
--- fs/nfsserver/nfs_nfsdcache.c.orig	2013-01-07 09:04:13.000000000 -0500
+++ fs/nfsserver/nfs_nfsdcache.c	2013-03-12 22:42:05.000000000 -0400
@@ -160,12 +160,31 @@ __FBSDID("$FreeBSD: projects/nfsv4-packr
 #include <fs/nfs/nfsport.h>
 
 extern struct nfsstats newnfsstats;
-NFSCACHEMUTEX;
+extern struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE];
+extern struct mtx nfsrc_udpmtx;
 int nfsrc_floodlevel = NFSRVCACHE_FLOODLEVEL, nfsrc_tcpsavedreplies = 0;
 #endif	/* !APPLEKEXT */
 
-static int nfsrc_tcpnonidempotent = 1;
-static int nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER, nfsrc_udpcachesize = 0;
+SYSCTL_DECL(_vfs_nfsd);
+
+static u_int	nfsrc_tcphighwater = 0;
+SYSCTL_UINT(_vfs_nfsd, OID_AUTO, tcphighwater, CTLFLAG_RW,
+    &nfsrc_tcphighwater, 0,
+    "High water mark for TCP cache entries");
+static u_int	nfsrc_udphighwater = NFSRVCACHE_UDPHIGHWATER;
+SYSCTL_UINT(_vfs_nfsd, OID_AUTO, udphighwater, CTLFLAG_RW,
+    &nfsrc_udphighwater, 0,
+    "High water mark for UDP cache entries");
+static u_int	nfsrc_tcptimeout = NFSRVCACHE_TCPTIMEOUT;
+SYSCTL_UINT(_vfs_nfsd, OID_AUTO, tcpcachetimeo, CTLFLAG_RW,
+    &nfsrc_tcptimeout, 0,
+    "Timeout for TCP entries in the DRC");
+static u_int nfsrc_tcpnonidempotent = 1;
+SYSCTL_UINT(_vfs_nfsd, OID_AUTO, cachetcp, CTLFLAG_RW,
+    &nfsrc_tcpnonidempotent, 0,
+    "Enable the DRC for NFS over TCP");
+
+static int nfsrc_udpcachesize = 0;
 static TAILQ_HEAD(, nfsrvcache) nfsrvudplru;
 static struct nfsrvhashhead nfsrvhashtbl[NFSRVCACHE_HASHSIZE],
     nfsrvudphashtbl[NFSRVCACHE_HASHSIZE];
@@ -197,10 +216,11 @@ static int newnfsv2_procid[NFS_V3NPROCS]
 	NFSV2PROC_NOOP,
 };
 
+#define	nfsrc_hash(xid)	(((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE)
 #define	NFSRCUDPHASH(xid) \
-	(&nfsrvudphashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvudphashtbl[nfsrc_hash(xid)])
 #define	NFSRCHASH(xid) \
-	(&nfsrvhashtbl[((xid) + ((xid) >> 24)) % NFSRVCACHE_HASHSIZE])
+	(&nfsrvhashtbl[nfsrc_hash(xid)])
 #define	TRUE	1
 #define	FALSE	0
 #define	NFSRVCACHE_CHECKLEN	100
@@ -251,6 +271,18 @@ static int nfsrc_getlenandcksum(mbuf_t m
 static void nfsrc_marksametcpconn(u_int64_t);
 
 /*
+ * Return the correct mutex for this cache entry.
+ */
+static __inline struct mtx *
+nfsrc_cachemutex(struct nfsrvcache *rp)
+{
+
+	if ((rp->rc_flag & RC_UDP) != 0)
+		return (&nfsrc_udpmtx);
+	return (&nfsrc_tcpmtx[nfsrc_hash(rp->rc_xid)]);
+}
+
+/*
  * Initialize the server request cache list
  */
 APPLESTATIC void
@@ -325,10 +357,12 @@ nfsrc_getudp(struct nfsrv_descript *nd, 
 	struct sockaddr_in6 *saddr6;
 	struct nfsrvhashhead *hp;
 	int ret = 0;
+	struct mtx *mutex;
 
+	mutex = nfsrc_cachemutex(newrp);
 	hp = NFSRCUDPHASH(newrp->rc_xid);
 loop:
-	NFSLOCKCACHE();
+	mtx_lock(mutex);
 	LIST_FOREACH(rp, hp, rc_hash) {
 	    if (newrp->rc_xid == rp->rc_xid &&
 		newrp->rc_proc == rp->rc_proc &&
@@ -336,8 +370,8 @@ loop:
 		nfsaddr_match(NETFAMILY(rp), &rp->rc_haddr, nd->nd_nam)) {
 			if ((rp->rc_flag & RC_LOCKED) != 0) {
 				rp->rc_flag |= RC_WANTED;
-				(void)mtx_sleep(rp, NFSCACHEMUTEXPTR,
-				    (PZERO - 1) | PDROP, "nfsrc", 10 * hz);
+				(void)mtx_sleep(rp, mutex, (PZERO - 1) | PDROP,
+				    "nfsrc", 10 * hz);
 				goto loop;
 			}
 			if (rp->rc_flag == 0)
@@ -347,14 +381,14 @@ loop:
 			TAILQ_INSERT_TAIL(&nfsrvudplru, rp, rc_lru);
 			if (rp->rc_flag & RC_INPROG) {
 				newnfsstats.srvcache_inproghits++;
-				NFSUNLOCKCACHE();
+				mtx_unlock(mutex);
 				ret = RC_DROPIT;
 			} else if (rp->rc_flag & RC_REPSTATUS) {
 				/*
 				 * V2 only.
 				 */
 				newnfsstats.srvcache_nonidemdonehits++;
-				NFSUNLOCKCACHE();
+				mtx_unlock(mutex);
 				nfsrvd_rephead(nd);
 				*(nd->nd_errp) = rp->rc_status;
 				ret = RC_REPLY;
@@ -362,7 +396,7 @@ loop:
 					NFSRVCACHE_UDPTIMEOUT;
 			} else if (rp->rc_flag & RC_REPMBUF) {
 				newnfsstats.srvcache_nonidemdonehits++;
-				NFSUNLOCKCACHE();
+				mtx_unlock(mutex);
 				nd->nd_mreq = m_copym(rp->rc_reply, 0,
 					M_COPYALL, M_WAITOK);
 				ret = RC_REPLY;
@@ -392,7 +426,7 @@ loop:
 	}
 	LIST_INSERT_HEAD(hp, newrp, rc_hash);
 	TAILQ_INSERT_TAIL(&nfsrvudplru, newrp, rc_lru);
-	NFSUNLOCKCACHE();
+	mtx_unlock(mutex);
 	nd->nd_rp = newrp;
 	ret = RC_DOIT;
 
@@ -410,12 +444,16 @@ nfsrvd_updatecache(struct nfsrv_descript
 	struct nfsrvcache *rp;
 	struct nfsrvcache *retrp = NULL;
 	mbuf_t m;
+	struct mtx *mutex;
 
+	if (nfsrc_tcphighwater > nfsrc_floodlevel)
+		nfsrc_floodlevel = nfsrc_tcphighwater;
 	rp = nd->nd_rp;
 	if (!rp)
 		panic("nfsrvd_updatecache null rp");
 	nd->nd_rp = NULL;
-	NFSLOCKCACHE();
+	mutex = nfsrc_cachemutex(rp);
+	mtx_lock(mutex);
 	nfsrc_lock(rp);
 	if (!(rp->rc_flag & RC_INPROG))
 		panic("nfsrvd_updatecache not inprog");
@@ -430,7 +468,7 @@ nfsrvd_updatecache(struct nfsrv_descript
 	 */
 	if (nd->nd_repstat == NFSERR_REPLYFROMCACHE) {
 		newnfsstats.srvcache_nonidemdonehits++;
-		NFSUNLOCKCACHE();
+		mtx_unlock(mutex);
 		nd->nd_repstat = 0;
 		if (nd->nd_mreq)
 			mbuf_freem(nd->nd_mreq);
@@ -438,7 +476,7 @@ nfsrvd_updatecache(struct nfsrv_descript
 			panic("reply from cache");
 		nd->nd_mreq = m_copym(rp->rc_reply, 0,
 		    M_COPYALL, M_WAITOK);
-		rp->rc_timestamp = NFSD_MONOSEC + NFSRVCACHE_TCPTIMEOUT;
+		rp->rc_timestamp = NFSD_MONOSEC + nfsrc_tcptimeout;
 		nfsrc_unlock(rp);
 		goto out;
 	}
@@ -463,21 +501,21 @@ nfsrvd_updatecache(struct nfsrv_descript
 		    nfsv2_repstat[newnfsv2_procid[nd->nd_procnum]]) {
 			rp->rc_status = nd->nd_repstat;
 			rp->rc_flag |= RC_REPSTATUS;
-			NFSUNLOCKCACHE();
+			mtx_unlock(mutex);
 		} else {
 			if (!(rp->rc_flag & RC_UDP)) {
-			    nfsrc_tcpsavedreplies++;
+			    atomic_add_int(&nfsrc_tcpsavedreplies, 1);
 			    if (nfsrc_tcpsavedreplies >
 				newnfsstats.srvcache_tcppeak)
 				newnfsstats.srvcache_tcppeak =
 				    nfsrc_tcpsavedreplies;
 			}
-			NFSUNLOCKCACHE();
-			m = m_copym(nd->nd_mreq, 0, M_COPYALL, M_WAITOK);
-			NFSLOCKCACHE();
+			mtx_unlock(mutex);
+			m = m_copym(nd->nd_mreq, 0, M_COPYALL, M_WAIT);
+			mtx_lock(mutex);
 			rp->rc_reply = m;
 			rp->rc_flag |= RC_REPMBUF;
-			NFSUNLOCKCACHE();
+			mtx_unlock(mutex);
 		}
 		if (rp->rc_flag & RC_UDP) {
 			rp->rc_timestamp = NFSD_MONOSEC +
@@ -485,7 +523,7 @@ nfsrvd_updatecache(struct nfsrv_descript
 			nfsrc_unlock(rp);
 		} else {
 			rp->rc_timestamp = NFSD_MONOSEC +
-			    NFSRVCACHE_TCPTIMEOUT;
+			    nfsrc_tcptimeout;
 			if (rp->rc_refcnt > 0)
 				nfsrc_unlock(rp);
 			else
@@ -493,7 +531,7 @@ nfsrvd_updatecache(struct nfsrv_descript
 		}
 	} else {
 		nfsrc_freecache(rp);
-		NFSUNLOCKCACHE();
+		mtx_unlock(mutex);
 	}
 
 out:
@@ -509,14 +547,16 @@ out:
 APPLESTATIC void
 nfsrvd_delcache(struct nfsrvcache *rp)
 {
+	struct mtx *mutex;
 
+	mutex = nfsrc_cachemutex(rp);
 	if (!(rp->rc_flag & RC_INPROG))
 		panic("nfsrvd_delcache not in prog");
-	NFSLOCKCACHE();
+	mtx_lock(mutex);
 	rp->rc_flag &= ~RC_INPROG;
 	if (rp->rc_refcnt == 0 && !(rp->rc_flag & RC_LOCKED))
 		nfsrc_freecache(rp);
-	NFSUNLOCKCACHE();
+	mtx_unlock(mutex);
 }
 
 /*
@@ -528,7 +568,9 @@ APPLESTATIC void
 nfsrvd_sentcache(struct nfsrvcache *rp, struct socket *so, int err)
 {
 	tcp_seq tmp_seq;
+	struct mtx *mutex;
 
+	mutex = nfsrc_cachemutex(rp);
 	if (!(rp->rc_flag & RC_LOCKED))
 		panic("nfsrvd_sentcache not locked");
 	if (!err) {
@@ -537,10 +579,10 @@ nfsrvd_sentcache(struct nfsrvcache *rp, 
 		     so->so_proto->pr_protocol != IPPROTO_TCP)
 			panic("nfs sent cache");
 		if (nfsrv_getsockseqnum(so, &tmp_seq)) {
-			NFSLOCKCACHE();
+			mtx_lock(mutex);
 			rp->rc_tcpseq = tmp_seq;
 			rp->rc_flag |= RC_TCPSEQ;
-			NFSUNLOCKCACHE();
+			mtx_unlock(mutex);
 		}
 	}
 	nfsrc_unlock(rp);
@@ -559,11 +601,13 @@ nfsrc_gettcp(struct nfsrv_descript *nd, 
 	struct nfsrvcache *hitrp;
 	struct nfsrvhashhead *hp, nfsrc_templist;
 	int hit, ret = 0;
+	struct mtx *mutex;
 
+	mutex = nfsrc_cachemutex(newrp);
 	hp = NFSRCHASH(newrp->rc_xid);
 	newrp->rc_reqlen = nfsrc_getlenandcksum(nd->nd_mrep, &newrp->rc_cksum);
 tryagain:
-	NFSLOCKCACHE();
+	mtx_lock(mutex);
 	hit = 1;
 	LIST_INIT(&nfsrc_templist);
 	/*
@@ -621,8 +665,8 @@ tryagain:
 		rp = hitrp;
 		if ((rp->rc_flag & RC_LOCKED) != 0) {
 			rp->rc_flag |= RC_WANTED;
-			(void)mtx_sleep(rp, NFSCACHEMUTEXPTR,
-			    (PZERO - 1) | PDROP, "nfsrc", 10 * hz);
+			(void)mtx_sleep(rp, mutex, (PZERO - 1) | PDROP,
+			    "nfsrc", 10 * hz);
 			goto tryagain;
 		}
 		if (rp->rc_flag == 0)
@@ -630,7 +674,7 @@ tryagain:
 		rp->rc_flag |= RC_LOCKED;
 		if (rp->rc_flag & RC_INPROG) {
 			newnfsstats.srvcache_inproghits++;
-			NFSUNLOCKCACHE();
+			mtx_unlock(mutex);
 			if (newrp->rc_sockref == rp->rc_sockref)
 				nfsrc_marksametcpconn(rp->rc_sockref);
 			ret = RC_DROPIT;
@@ -639,24 +683,24 @@ tryagain:
 			 * V2 only.
 			 */
 			newnfsstats.srvcache_nonidemdonehits++;
-			NFSUNLOCKCACHE();
+			mtx_unlock(mutex);
 			if (newrp->rc_sockref == rp->rc_sockref)
 				nfsrc_marksametcpconn(rp->rc_sockref);
 			ret = RC_REPLY;
 			nfsrvd_rephead(nd);
 			*(nd->nd_errp) = rp->rc_status;
 			rp->rc_timestamp = NFSD_MONOSEC +
-				NFSRVCACHE_TCPTIMEOUT;
+				nfsrc_tcptimeout;
 		} else if (rp->rc_flag & RC_REPMBUF) {
 			newnfsstats.srvcache_nonidemdonehits++;
-			NFSUNLOCKCACHE();
+			mtx_unlock(mutex);
 			if (newrp->rc_sockref == rp->rc_sockref)
 				nfsrc_marksametcpconn(rp->rc_sockref);
 			ret = RC_REPLY;
 			nd->nd_mreq = m_copym(rp->rc_reply, 0,
 				M_COPYALL, M_WAITOK);
 			rp->rc_timestamp = NFSD_MONOSEC +
-				NFSRVCACHE_TCPTIMEOUT;
+				nfsrc_tcptimeout;
 		} else {
 			panic("nfs tcp cache1");
 		}
@@ -674,7 +718,7 @@ tryagain:
 	newrp->rc_cachetime = NFSD_MONOSEC;
 	newrp->rc_flag |= RC_INPROG;
 	LIST_INSERT_HEAD(hp, newrp, rc_hash);
-	NFSUNLOCKCACHE();
+	mtx_unlock(mutex);
 	nd->nd_rp = newrp;
 	ret = RC_DOIT;
 
@@ -685,16 +729,17 @@ out:
 
 /*
  * Lock a cache entry.
- * Also puts a mutex lock on the cache list.
  */
 static void
 nfsrc_lock(struct nfsrvcache *rp)
 {
-	NFSCACHELOCKREQUIRED();
+	struct mtx *mutex;
+
+	mutex = nfsrc_cachemutex(rp);
+	mtx_assert(mutex, MA_OWNED);
 	while ((rp->rc_flag & RC_LOCKED) != 0) {
 		rp->rc_flag |= RC_WANTED;
-		(void)mtx_sleep(rp, NFSCACHEMUTEXPTR, PZERO - 1,
-		    "nfsrc", 0);
+		(void)mtx_sleep(rp, mutex, PZERO - 1, "nfsrc", 0);
 	}
 	rp->rc_flag |= RC_LOCKED;
 }
@@ -705,11 +750,13 @@ nfsrc_lock(struct nfsrvcache *rp)
 static void
 nfsrc_unlock(struct nfsrvcache *rp)
 {
+	struct mtx *mutex;
 
-	NFSLOCKCACHE();
+	mutex = nfsrc_cachemutex(rp);
+	mtx_lock(mutex);
 	rp->rc_flag &= ~RC_LOCKED;
 	nfsrc_wanted(rp);
-	NFSUNLOCKCACHE();
+	mtx_unlock(mutex);
 }
 
 /*
@@ -732,7 +779,6 @@ static void
 nfsrc_freecache(struct nfsrvcache *rp)
 {
 
-	NFSCACHELOCKREQUIRED();
 	LIST_REMOVE(rp, rc_hash);
 	if (rp->rc_flag & RC_UDP) {
 		TAILQ_REMOVE(&nfsrvudplru, rp, rc_lru);
@@ -742,7 +788,7 @@ nfsrc_freecache(struct nfsrvcache *rp)
 	if (rp->rc_flag & RC_REPMBUF) {
 		mbuf_freem(rp->rc_reply);
 		if (!(rp->rc_flag & RC_UDP))
-			nfsrc_tcpsavedreplies--;
+			atomic_add_int(&nfsrc_tcpsavedreplies, -1);
 	}
 	FREE((caddr_t)rp, M_NFSRVCACHE);
 	newnfsstats.srvcache_size--;
@@ -757,20 +803,22 @@ nfsrvd_cleancache(void)
 	struct nfsrvcache *rp, *nextrp;
 	int i;
 
-	NFSLOCKCACHE();
 	for (i = 0; i < NFSRVCACHE_HASHSIZE; i++) {
+		mtx_lock(&nfsrc_tcpmtx[i]);
 		LIST_FOREACH_SAFE(rp, &nfsrvhashtbl[i], rc_hash, nextrp) {
 			nfsrc_freecache(rp);
 		}
+		mtx_unlock(&nfsrc_tcpmtx[i]);
 	}
+	mtx_lock(&nfsrc_udpmtx);
 	for (i = 0; i < NFSRVCACHE_HASHSIZE; i++) {
 		LIST_FOREACH_SAFE(rp, &nfsrvudphashtbl[i], rc_hash, nextrp) {
 			nfsrc_freecache(rp);
 		}
 	}
 	newnfsstats.srvcache_size = 0;
+	mtx_unlock(&nfsrc_udpmtx);
 	nfsrc_tcpsavedreplies = 0;
-	NFSUNLOCKCACHE();
 }
 
 /*
@@ -780,28 +828,97 @@ static void
 nfsrc_trimcache(u_int64_t sockref, struct socket *so)
 {
 	struct nfsrvcache *rp, *nextrp;
-	int i;
+	int i, j, k, time_histo[10];
+	time_t thisstamp;
+	static time_t udp_lasttrim = 0, tcp_lasttrim = 0;
+	static int onethread = 0;
 
-	NFSLOCKCACHE();
-	TAILQ_FOREACH_SAFE(rp, &nfsrvudplru, rc_lru, nextrp) {
-		if (!(rp->rc_flag & (RC_INPROG|RC_LOCKED|RC_WANTED))
-		     && rp->rc_refcnt == 0
-		     && ((rp->rc_flag & RC_REFCNT) ||
-			 NFSD_MONOSEC > rp->rc_timestamp ||
-			 nfsrc_udpcachesize > nfsrc_udphighwater))
-			nfsrc_freecache(rp);
-	}
-	for (i = 0; i < NFSRVCACHE_HASHSIZE; i++) {
-		LIST_FOREACH_SAFE(rp, &nfsrvhashtbl[i], rc_hash, nextrp) {
+	if (atomic_cmpset_acq_int(&onethread, 0, 1) == 0)
+		return;
+	if (NFSD_MONOSEC != udp_lasttrim ||
+	    nfsrc_udpcachesize >= (nfsrc_udphighwater +
+	    nfsrc_udphighwater / 2)) {
+		mtx_lock(&nfsrc_udpmtx);
+		udp_lasttrim = NFSD_MONOSEC;
+		TAILQ_FOREACH_SAFE(rp, &nfsrvudplru, rc_lru, nextrp) {
 			if (!(rp->rc_flag & (RC_INPROG|RC_LOCKED|RC_WANTED))
 			     && rp->rc_refcnt == 0
 			     && ((rp->rc_flag & RC_REFCNT) ||
-				 NFSD_MONOSEC > rp->rc_timestamp ||
-				 nfsrc_activesocket(rp, sockref, so)))
+				 udp_lasttrim > rp->rc_timestamp ||
+				 nfsrc_udpcachesize > nfsrc_udphighwater))
 				nfsrc_freecache(rp);
 		}
+		mtx_unlock(&nfsrc_udpmtx);
+	}
+	if (NFSD_MONOSEC != tcp_lasttrim ||
+	    nfsrc_tcpsavedreplies >= nfsrc_tcphighwater) {
+		for (i = 0; i < 10; i++)
+			time_histo[i] = 0;
+		for (i = 0; i < NFSRVCACHE_HASHSIZE; i++) {
+			mtx_lock(&nfsrc_tcpmtx[i]);
+			if (i == 0)
+				tcp_lasttrim = NFSD_MONOSEC;
+			LIST_FOREACH_SAFE(rp, &nfsrvhashtbl[i], rc_hash,
+			    nextrp) {
+				if (!(rp->rc_flag &
+				     (RC_INPROG|RC_LOCKED|RC_WANTED))
+				     && rp->rc_refcnt == 0) {
+					/*
+					 * The timestamps range from roughly the
+					 * present (tcp_lasttrim) to the present
+					 * + nfsrc_tcptimeout. Generate a simple
+					 * histogram of where the timeouts fall.
+					 */
+					j = rp->rc_timestamp - tcp_lasttrim;
+					if (j >= nfsrc_tcptimeout)
+						j = nfsrc_tcptimeout - 1;
+					if (j < 0)
+						j = 0;
+					j = (j * 10 / nfsrc_tcptimeout) % 10;
+					time_histo[j]++;
+					if ((rp->rc_flag & RC_REFCNT) ||
+					    tcp_lasttrim > rp->rc_timestamp ||
+					    nfsrc_activesocket(rp, sockref, so))
+						nfsrc_freecache(rp);
+				}
+			}
+			mtx_unlock(&nfsrc_tcpmtx[i]);
+		}
+		j = nfsrc_tcphighwater / 5;	/* 20% of it */
+		if (j > 0 && (nfsrc_tcpsavedreplies + j) > nfsrc_tcphighwater) {
+			/*
+			 * Trim some more with a smaller timeout of as little
+			 * as 20% of nfsrc_tcptimeout to try and get below
+			 * 80% of the nfsrc_tcphighwater.
+			 */
+			k = 0;
+			for (i = 0; i < 8; i++) {
+				k += time_histo[i];
+				if (k > j)
+					break;
+			}
+			k = nfsrc_tcptimeout * (i + 1) / 10;
+			if (k < 1)
+				k = 1;
+			thisstamp = tcp_lasttrim + k;
+			for (i = 0; i < NFSRVCACHE_HASHSIZE; i++) {
+				mtx_lock(&nfsrc_tcpmtx[i]);
+				LIST_FOREACH_SAFE(rp, &nfsrvhashtbl[i], rc_hash,
+				    nextrp) {
+					if (!(rp->rc_flag &
+					     (RC_INPROG|RC_LOCKED|RC_WANTED))
+					     && rp->rc_refcnt == 0
+					     && ((rp->rc_flag & RC_REFCNT) ||
+						 thisstamp > rp->rc_timestamp ||
+						 nfsrc_activesocket(rp, sockref,
+						    so)))
+						nfsrc_freecache(rp);
+				}
+				mtx_unlock(&nfsrc_tcpmtx[i]);
+			}
+		}
 	}
-	NFSUNLOCKCACHE();
+	atomic_store_rel_int(&onethread, 0);
 }
 
 /*
@@ -810,12 +927,14 @@ nfsrc_trimcache(u_int64_t sockref, struc
 APPLESTATIC void
 nfsrvd_refcache(struct nfsrvcache *rp)
 {
+	struct mtx *mutex;
 
-	NFSLOCKCACHE();
+	mutex = nfsrc_cachemutex(rp);
+	mtx_lock(mutex);
 	if (rp->rc_refcnt < 0)
 		panic("nfs cache refcnt");
 	rp->rc_refcnt++;
-	NFSUNLOCKCACHE();
+	mtx_unlock(mutex);
 }
 
 /*
@@ -824,14 +943,16 @@ nfsrvd_refcache(struct nfsrvcache *rp)
 APPLESTATIC void
 nfsrvd_derefcache(struct nfsrvcache *rp)
 {
+	struct mtx *mutex;
 
-	NFSLOCKCACHE();
+	mutex = nfsrc_cachemutex(rp);
+	mtx_lock(mutex);
 	if (rp->rc_refcnt <= 0)
 		panic("nfs cache derefcnt");
 	rp->rc_refcnt--;
 	if (rp->rc_refcnt == 0 && !(rp->rc_flag & (RC_LOCKED | RC_INPROG)))
 		nfsrc_freecache(rp);
-	NFSUNLOCKCACHE();
+	mtx_unlock(mutex);
 }
 
 /*
--- fs/nfsserver/nfs_nfsdport.c.orig	2013-03-02 18:19:34.000000000 -0500
+++ fs/nfsserver/nfs_nfsdport.c	2013-03-12 17:51:31.000000000 -0400
@@ -61,7 +61,8 @@ extern struct nfsv4lock nfsd_suspend_loc
 extern struct nfssessionhash nfssessionhash[NFSSESSIONHASHSIZE];
 struct vfsoptlist nfsv4root_opt, nfsv4root_newopt;
 NFSDLOCKMUTEX;
-struct mtx nfs_cache_mutex;
+struct mtx nfsrc_tcpmtx[NFSRVCACHE_HASHSIZE];
+struct mtx nfsrc_udpmtx;
 struct mtx nfs_v4root_mutex;
 struct nfsrvfh nfs_rootfh, nfs_pubfh;
 int nfs_pubfhset = 0, nfs_rootfhset = 0;
@@ -3305,7 +3306,10 @@ nfsd_modevent(module_t mod, int type, vo
 		if (loaded)
 			goto out;
 		newnfs_portinit();
-		mtx_init(&nfs_cache_mutex, "nfs_cache_mutex", NULL, MTX_DEF);
+		for (i = 0; i < NFSRVCACHE_HASHSIZE; i++)
+			mtx_init(&nfsrc_tcpmtx[i], "nfs_tcpcache_mutex", NULL,
+			    MTX_DEF);
+		mtx_init(&nfsrc_udpmtx, "nfs_udpcache_mutex", NULL, MTX_DEF);
 		mtx_init(&nfs_v4root_mutex, "nfs_v4root_mutex", NULL, MTX_DEF);
 		mtx_init(&nfsv4root_mnt.mnt_mtx, "struct mount mtx", NULL,
 		    MTX_DEF);
@@ -3352,7 +3356,9 @@ nfsd_modevent(module_t mod, int type, vo
 			svcpool_destroy(nfsrvd_pool);
 
 		/* and get rid of the locks */
-		mtx_destroy(&nfs_cache_mutex);
+		for (i = 0; i < NFSRVCACHE_HASHSIZE; i++)
+			mtx_destroy(&nfsrc_tcpmtx[i]);
+		mtx_destroy(&nfsrc_udpmtx);
 		mtx_destroy(&nfs_v4root_mutex);
 		mtx_destroy(&nfsv4root_mnt.mnt_mtx);
 		for (i = 0; i < NFSSESSIONHASHSIZE; i++)
--- fs/nfs/nfsport.h.orig	2013-03-02 18:35:13.000000000 -0500
+++ fs/nfs/nfsport.h	2013-03-12 17:51:31.000000000 -0400
@@ -609,11 +609,6 @@ void nfsrvd_rcv(struct socket *, void *,
 #define	NFSREQSPINLOCK		extern struct mtx nfs_req_mutex
 #define	NFSLOCKREQ()		mtx_lock(&nfs_req_mutex)
 #define	NFSUNLOCKREQ()		mtx_unlock(&nfs_req_mutex)
-#define	NFSCACHEMUTEX		extern struct mtx nfs_cache_mutex
-#define	NFSCACHEMUTEXPTR	(&nfs_cache_mutex)
-#define	NFSLOCKCACHE()		mtx_lock(&nfs_cache_mutex)
-#define	NFSUNLOCKCACHE()	mtx_unlock(&nfs_cache_mutex)
-#define	NFSCACHELOCKREQUIRED()	mtx_assert(&nfs_cache_mutex, MA_OWNED)
 #define	NFSSOCKMUTEX		extern struct mtx nfs_slock_mutex
 #define	NFSSOCKMUTEXPTR		(&nfs_slock_mutex)
 #define	NFSLOCKSOCK()		mtx_lock(&nfs_slock_mutex)
--- fs/nfs/nfsrvcache.h.orig	2013-01-07 09:04:15.000000000 -0500
+++ fs/nfs/nfsrvcache.h	2013-03-12 18:02:42.000000000 -0400
@@ -41,7 +41,7 @@
 #define	NFSRVCACHE_MAX_SIZE	2048
 #define	NFSRVCACHE_MIN_SIZE	  64
 
-#define	NFSRVCACHE_HASHSIZE	20
+#define	NFSRVCACHE_HASHSIZE	500
 
 struct nfsrvcache {
 	LIST_ENTRY(nfsrvcache) rc_hash;		/* Hash chain */