On Oct 9, 2012, at 5:12 PM, Nikolay Denev <nde...@gmail.com> wrote:

> On Oct 4, 2012, at 12:36 AM, Rick Macklem <rmack...@uoguelph.ca> wrote:
> 
>> Garrett Wollman wrote:
>>> <<On Wed, 3 Oct 2012 09:21:06 -0400 (EDT), Rick Macklem <rmack...@uoguelph.ca> said:
>>> 
>>>>> Simple: just use a separate mutex for each list that a cache entry is on, rather than a global lock for everything. This would reduce the mutex contention, but I'm not sure how significantly, since I don't have the means to measure it yet.
>>>>> 
>>>> Well, since the cache trimming is removing entries from the lists, I don't see how that can be done with a global lock for list updates?
>>> 
>>> Well, the global lock is what we have now, but the cache trimming process only looks at one list at a time, so not locking the list that isn't being iterated over probably wouldn't hurt, unless there's some mechanism (that I didn't see) for entries to move from one list to another. Note that I'm considering each hash bucket a separate "list". (One issue to worry about in that case would be cache-line contention in the array of hash buckets; perhaps NFSRVCACHE_HASHSIZE ought to be increased to reduce that.)
>>> 
>> Yea, a separate mutex for each hash list might help. There is also the LRU list that all entries end up on, which gets used by the trimming code. (I think? I wrote this stuff about 8 years ago, so I haven't looked at it in a while.)
>> 
>> Also, increasing the hash table size is probably a good idea, especially if you reduce how aggressively the cache is trimmed.
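As an aside, here is roughly what the per-bucket locking discussed above could look like, written as a minimal userland model with pthread mutexes standing in for the kernel's mutexes. All names, sizes, and fields here are hypothetical; this is only a sketch of the idea, not the actual sys/fs/nfsserver code.

/*
 * Sketch of a duplicate-request-cache hash table where every bucket has
 * its own mutex, so requests that hash to different buckets never contend.
 * Names and sizes are made up for illustration only.
 */
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/queue.h>

#define DRC_HASHSIZE 500                  /* a larger table also shortens the chains */

struct drc_entry {
        LIST_ENTRY(drc_entry) de_hash;    /* per-bucket chain linkage */
        uint32_t              de_xid;     /* RPC transaction id */
        /* ... cached reply, timestamps, state ... */
};

struct drc_bucket {
        pthread_mutex_t        db_lock;   /* protects db_head only */
        LIST_HEAD(, drc_entry) db_head;
};

static struct drc_bucket drc_table[DRC_HASHSIZE];

static void
drc_init(void)
{
        for (size_t i = 0; i < DRC_HASHSIZE; i++) {
                pthread_mutex_init(&drc_table[i].db_lock, NULL);
                LIST_INIT(&drc_table[i].db_head);
        }
}

/* Look up a request: only the one bucket it hashes to is locked. */
static struct drc_entry *
drc_lookup(uint32_t xid)
{
        struct drc_bucket *b = &drc_table[xid % DRC_HASHSIZE];
        struct drc_entry *e, *hit = NULL;

        pthread_mutex_lock(&b->db_lock);
        LIST_FOREACH(e, &b->db_head, de_hash) {
                if (e->de_xid == xid) {
                        hit = e;
                        break;
                }
        }
        pthread_mutex_unlock(&b->db_lock);
        return (hit);
}

int
main(void)
{
        drc_init();
        return (drc_lookup(42) == NULL ? 0 : 1);
}

What the sketch deliberately leaves out is the LRU list mentioned above: if entries also live on a global LRU list for trimming, that list still needs its own lock (or has to become per-bucket as well), so per-bucket hash locks alone don't remove all of the shared state.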
>> 
>>>> Only doing it once/sec would result in a very large cache when bursts of traffic arrive.
>>> 
>>> My servers have 96 GB of memory so that's not a big deal for me.
>>> 
>> This code was originally "production tested" on a server with 1Gbyte, so times have changed a bit;-)
>> 
>>>> I'm not sure I see why doing it as a separate thread will improve things. There are N nfsd threads already (N can be bumped up to 256 if you wish) and having a bunch more "cache trimming threads" would just increase contention, wouldn't it?
>>> 
>>> Only one cache-trimming thread. The cache trim holds the (global) mutex for much longer than any individual nfsd service thread has any need to, and having N threads doing that in parallel is why it's so heavily contended. If there's only one thread doing the trim, then the nfsd service threads aren't spending time contending on the mutex (it will be held less frequently and for shorter periods).
>>> 
>> I think the little drc2.patch, which will keep the nfsd threads from acquiring the mutex and doing the trimming most of the time, might be sufficient. I still don't see why a separate trimming thread will be an advantage. I'd also be worried that the one cache trimming thread won't get the job done soon enough.
>> 
>> When I did production testing on a 1Gbyte server that saw a peak load of about 100 RPCs/sec, it was necessary to trim aggressively. (Although I'd be tempted to say that a server with 1Gbyte is no longer relevant, I recall someone recently trying to run FreeBSD on an i486, although I doubt they wanted to run the nfsd on it.)
>> 
>>>> The only negative effect I can think of w.r.t. having the nfsd threads doing it would be a (I believe negligible) increase in RPC response times (the time the nfsd thread spends trimming the cache). As noted, I think this time would be negligible compared to disk I/O and network transit times in the total RPC response time?
>>> 
>>> With adaptive mutexes, many CPUs, lots of in-memory cache, and 10G network connectivity, spinning on a contended mutex takes a significant amount of CPU time. (For the current design of the NFS server, it may actually be a win to turn off adaptive mutexes -- I should give that a try once I'm able to do more testing.)
>>> 
>> Have fun with it. Let me know when you have what you think is a good patch.
>> 
>> rick
>> 
>>> -GAWollman
> 
> My quest for IOPS over NFS continues :)
> So far I'm not able to achieve more than about 3000 8K read requests per second over NFS, while the same reads done locally on the server give much more.
> And this is all from a file that is completely in the ARC cache, so no disk I/O is involved.
> 
> I snatched a sample DTrace script from the net [ http://utcc.utoronto.ca/~cks/space/blog/solaris/DTraceQuantizationNotes ]
> and modified it for our new NFS server:
> 
> #!/usr/sbin/dtrace -qs
> 
> fbt:kernel:nfsrvd_*:entry
> {
>         self->ts = timestamp;
>         @counts[probefunc] = count();
> }
> 
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 /
> {
>         this->delta = (timestamp - self->ts) / 1000000;
> }
> 
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 && this->delta > 100 /
> {
>         @slow[probefunc, "ms"] = lquantize(this->delta, 100, 500, 50);
> }
> 
> fbt:kernel:nfsrvd_*:return
> / self->ts > 0 /
> {
>         @dist[probefunc, "ms"] = quantize(this->delta);
>         self->ts = 0;
> }
> 
> END
> {
>         printf("\n");
>         printa("function %-20s %@10d\n", @counts);
>         printf("\n");
>         printa("function %s(), time in %s:%@d\n", @dist);
>         printf("\n");
>         printa("function %s(), time in %s for >= 100 ms:%@d\n", @slow);
> }
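One thing worth keeping in mind when reading the output below: the cache functions run on every single RPC, once on the way in and once on the way out, which is why the nfsrvd_getcache() and nfsrvd_updatecache() counts match nfsrvd_dorpc() exactly. A much simplified model of that per-RPC path (everything here is a stand-in, not the actual kernel code) would be:

#include <pthread.h>
#include <stdio.h>

/* Stand-in for the shared lock the discussion above is about. */
static pthread_mutex_t drc_mtx = PTHREAD_MUTEX_INITIALIZER;

enum drc_status { DRC_MISS, DRC_HIT };

/* Model of the cache lookup done on the way in (once per RPC). */
static enum drc_status
model_getcache(unsigned int xid)
{
        (void)xid;
        pthread_mutex_lock(&drc_mtx);   /* lookup, and possibly trimming, under the lock */
        pthread_mutex_unlock(&drc_mtx);
        return (DRC_MISS);
}

/* Model of the cache update done on the way out (once per RPC). */
static void
model_updatecache(unsigned int xid)
{
        (void)xid;
        pthread_mutex_lock(&drc_mtx);   /* insert/refresh the cached reply */
        pthread_mutex_unlock(&drc_mtx);
}

/* One RPC: cache lookup -> service the request -> cache update. */
static void
model_dorpc(unsigned int xid)
{
        if (model_getcache(xid) == DRC_HIT)
                return;                 /* retransmission: resend the cached reply */
        /* ... decode arguments, do the read/write, build the reply ... */
        model_updatecache(xid);
}

int
main(void)
{
        for (unsigned int xid = 0; xid < 4; xid++)
                model_dorpc(xid);
        printf("each RPC touched the cache lock twice\n");
        return (0);
}

So even though each individual call is cheap, the shared lock inside those two functions is taken on the order of 800,000 times in the sampled interval, which is consistent with the contention being discussed above.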
> And here's a sample output from one or two minutes during a run of Oracle's ORION benchmark tool from a Linux machine, on a 32G file on an NFS mount over 10G Ethernet:
> 
> [16:01]root@goliath:/home/ndenev# ./nfsrvd.d
> ^C
> 
> function nfsrvd_access                4
> function nfsrvd_statfs               10
> function nfsrvd_getattr              14
> function nfsrvd_commit               76
> function nfsrvd_sentcache        110048
> function nfsrvd_write            110048
> function nfsrvd_read             283648
> function nfsrvd_dorpc            393800
> function nfsrvd_getcache         393800
> function nfsrvd_rephead          393800
> function nfsrvd_updatecache      393800
> 
> function nfsrvd_access(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 4
>              1 |                                         0
> 
> function nfsrvd_statfs(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 10
>              1 |                                         0
> 
> function nfsrvd_getattr(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 14
>              1 |                                         0
> 
> function nfsrvd_sentcache(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110048
>              1 |                                         0
> 
> function nfsrvd_rephead(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>              1 |                                         0
> 
> function nfsrvd_updatecache(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393800
>              1 |                                         0
> 
> function nfsrvd_getcache(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 393798
>              1 |                                         1
>              2 |                                         0
>              4 |                                         1
>              8 |                                         0
> 
> function nfsrvd_write(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 110039
>              1 |                                         5
>              2 |                                         4
>              4 |                                         0
> 
> function nfsrvd_read(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 283622
>              1 |                                         19
>              2 |                                         3
>              4 |                                         2
>              8 |                                         0
>             16 |                                         1
>             32 |                                         0
>             64 |                                         0
>            128 |                                         0
>            256 |                                         1
>            512 |                                         0
> 
> function nfsrvd_commit(), time in ms:
>          value  ------------- Distribution ------------- count
>             -1 |                                         0
>              0 |@@@@@@@@@@@@@@@@@@@@@@@                  44
>              1 |@@@@@@@                                  14
>              2 |                                         0
>              4 |@                                        1
>              8 |@                                        1
>             16 |                                         0
>             32 |@@@@@@@                                  14
>             64 |@                                        2
>            128 |                                         0
> 
> function nfsrvd_commit(), time in ms for >= 100 ms:
>          value  ------------- Distribution ------------- count
>          < 100 |                                         0
>            100 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>            150 |                                         0
> 
> function nfsrvd_read(), time in ms for >= 100 ms:
>          value  ------------- Distribution ------------- count
>            250 |                                         0
>            300 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ 1
>            350 |                                         0
> 
> Looks like the NFS server cache functions are quite fast, but they are called extremely frequently.
> 
> I hope someone can find this information useful.
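Since the cache functions are fast individually but run on every RPC, the gating Rick describes for drc2.patch (keeping the nfsd threads from acquiring the mutex and doing the trimming most of the time) seems like the key idea. I haven't seen that patch, so the following is only a guess at its general shape, in the same userland style as the earlier sketch: trimming is skipped entirely unless an unlocked estimate of the cache size is above a high-water mark.

/*
 * Guess at a "trim only when clearly needed" scheme: nfsd threads do an
 * unlocked (racy but safe) size check and only take the lock and trim
 * when the cache has grown past a high-water mark.  Names and numbers
 * are made up; this is not drc2.patch.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define DRC_HIGHWATER   20000           /* hypothetical trigger */
#define DRC_LOWWATER    15000           /* hypothetical target after a trim */

static _Atomic unsigned int drc_count;  /* entries currently cached */
static pthread_mutex_t      drc_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Called at the end of every RPC, i.e. hundreds of thousands of times. */
static void
drc_maybe_trim(void)
{
        /* Cheap, lock-free check: the common case returns immediately. */
        if (atomic_load_explicit(&drc_count, memory_order_relaxed) <=
            DRC_HIGHWATER)
                return;

        pthread_mutex_lock(&drc_mtx);
        /* Re-check under the lock; another thread may have trimmed already. */
        while (atomic_load_explicit(&drc_count, memory_order_relaxed) >
            DRC_LOWWATER) {
                /* ... drop the oldest entry from the LRU list ... */
                atomic_fetch_sub_explicit(&drc_count, 1,
                    memory_order_relaxed);
        }
        pthread_mutex_unlock(&drc_mtx);
}

int
main(void)
{
        atomic_store(&drc_count, DRC_HIGHWATER + 100);
        drc_maybe_trim();
        printf("entries after trim: %u\n", atomic_load(&drc_count));
        return (0);
}

The attraction is that the common case costs a single relaxed load, so the ~400,000 nfsrvd_getcache()/nfsrvd_updatecache() calls per run would almost never enter the trim path, while a burst still gets trimmed promptly by whichever nfsd thread notices it first, which also speaks to the worry above about a single dedicated trimming thread falling behind.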
Here's another quick one:

#!/usr/sbin/dtrace -qs

#pragma D option quiet

fbt:kernel:nfsrvd_*:entry
{
        self->trace = 1;
}

fbt:kernel:nfsrvd_*:return
/ self->trace /
{
        @calls[probefunc] = count();
        /* Reset the flag so the thread-local variable doesn't leak. */
        self->trace = 0;
}

tick-1sec
{
        printf("%40s | %s\n", "function", "calls per second");
        printa("%40s | %10@d\n", @calls);
        clear(@calls);
        printf("\n");
}

It shows the number of calls per second to the nfsrvd_* functions.