We have encountered serious SMP performance and scalability problems that we've tracked back to lstat/namei calls. I've written a quick benchmark with a pair of tests to isolate and measure the problem. Both tests use a tree of directories: the top-level directory contains five subdirectories a, b, c, d, and e. Each of those contains five subdirectories a, b, c, d, and e, and so on. That gives 1 directory at level one, 5 at level two, 25 at level three, 125 at level four, 625 at level five, and 3125 at level six.
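For illustration only, a tree with this layout could be created as follows. This is a minimal sketch in Python, not the benchmark's actual setup code (which is not included in this message); build_tree, dirs_per_level, NAMES, and DEPTH are hypothetical names.

```python
import itertools
import os

NAMES = "abcde"  # the five subdirectory names used at every level
DEPTH = 5        # number of levels below the top-level directory


def build_tree(root, depth=DEPTH):
    """Create every bottom-level directory under root.

    os.makedirs creates the intermediate directories as needed, so
    iterating over the leaf paths is enough to build the whole tree.
    """
    for parts in itertools.product(NAMES, repeat=depth):
        os.makedirs(os.path.join(root, *parts), exist_ok=True)


def dirs_per_level(depth=DEPTH):
    """Directory count per level, starting with the single top-level directory."""
    return [len(NAMES) ** level for level in range(depth + 1)]
```

Calling build_tree("/tmp/lstat") would produce paths such as /tmp/lstat/a/b/c/d/e, and dirs_per_level() gives the per-level counts listed above (3906 directories in total).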

In the "realpath" test, a random path is constructed at the bottom of the tree (e.g. /tmp/lstat/a/b/c/d/e) and realpath() is called on that, provoking an lstat() call on each component of the path from the top of the tree down. This is to simulate a mix of high-contention and low-contention lstat() calls.

In the "lstat" test, lstat is called directly on a path at the bottom of the tree. Since there are 3125 files, this simulates relatively low-contention lstat() calls.
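The two per-iteration operations might look like the sketch below. It is written in Python for brevity and is an assumption about the shape of the tests, not the actual benchmark code. Note that Python's os.path.realpath is a Python-level reimplementation rather than the libc realpath(), so it only approximates the system call pattern described above. The function names are hypothetical.

```python
import os
import random

NAMES = "abcde"


def random_leaf_path(root, rng=random, depth=5):
    """Pick one of the 5**depth bottom-level paths uniformly at random."""
    return os.path.join(root, *(rng.choice(NAMES) for _ in range(depth)))


def realpath_once(root, rng=random):
    # Resolving the path visits each component in turn, so the shared
    # upper levels of the tree are touched on every call.
    return os.path.realpath(random_leaf_path(root, rng))


def lstat_once(root, rng=random):
    # One lstat() call on a randomly chosen bottom-level path.
    return os.lstat(random_leaf_path(root, rng))
```

Both functions assume the tree already exists under root; lstat_once raises FileNotFoundError otherwise.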

In both cases, the test repeats as many times as possible for 60 seconds. Each test is run simultaneously by multiple processes, with progressively doubling concurrency from 1 to 512.
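A driver along these lines could run one test at each concurrency level. This is a hedged sketch using fork() and pipes on a POSIX system; run_workers and CONCURRENCY_LEVELS are hypothetical names, and the real benchmark may be structured differently.

```python
import os
import time

# Concurrency doubles from 1 up to 512.
CONCURRENCY_LEVELS = [2 ** i for i in range(10)]


def run_workers(work, concurrency, seconds):
    """Fork `concurrency` children that each call work() until time is up.

    Returns the list of per-worker iteration counts.
    """
    children = []
    for _ in range(concurrency):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child process
            status = 1
            try:
                os.close(read_fd)
                count = 0
                deadline = time.monotonic() + seconds
                while time.monotonic() < deadline:
                    work()
                    count += 1
                os.write(write_fd, str(count).encode())
                status = 0
            finally:
                os._exit(status)  # never fall back into the parent's code
        os.close(write_fd)
        children.append((pid, read_fd))

    counts = []
    for pid, read_fd in children:
        data = os.read(read_fd, 64)  # blocks until the child reports or exits
        os.close(read_fd)
        os.waitpid(pid, 0)
        counts.append(int(data))
    return counts
```

For example, run_workers(lambda: os.lstat("/tmp/lstat/a/b/c/d/e"), 4, 60) would correspond to one 60-second run at concurrency 4. The clock check on every iteration adds some overhead of its own, which a real benchmark might amortize over batches of calls.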

What I found was that everything is fine at concurrency 2 (total throughput matches concurrency 1), probably indicating that the benchmark is pegged on some other resource limit. At concurrency 4, realpath drops to 31.8% of concurrency 1. At concurrency 8, performance is down to 18.3%. Meanwhile, CPU load goes to 80-90% system CPU. I've confirmed via ktrace and rusage that the CPU usage is all system time, and that lstat() is the *only* system call in the test (realpath() is called with an absolute path).
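The percentages and per-second rates follow directly from the raw totals reported at the end of this message. The sketch below reproduces them for the realpath test; summarize is a hypothetical helper, and the use of integer truncation is an assumption that happens to match the reported figures.

```python
DURATION = 60  # seconds per run

# realpath totals copied from the results at the end of this message,
# keyed by concurrency level
REALPATH_TOTALS = {1: 1409069, 4: 448693, 8: 258281}


def summarize(total, workers, baseline, duration=DURATION):
    """Derive the reported figures from one run's raw total."""
    per_sec = total // duration  # truncated, not rounded
    return {
        "percent": round(100.0 * total / baseline, 1),
        "per_sec": per_sec,
        "per_sec_per_worker": per_sec // workers,
    }


baseline = REALPATH_TOTALS[1]
for workers, total in sorted(REALPATH_TOTALS.items()):
    print(workers, summarize(total, workers, baseline))
```

At concurrency 4 this gives 31.8%, 7478 calls/sec, and 1869 calls/sec/worker, matching the reported results.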

I then reran the 32-process test on 1-7 cores, and found that performance peaks at 2 cores and drops sharply from there. Eight cores run *fifteen* times slower than two cores.
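The core-count comparison can be checked against the totals at the end of this message. The sketch below assumes that the eight-core figure is the realpath total from the main concurrency-32 run, since the per-core reruns only cover 1-7 cores.

```python
# Total realpath calls in 60 seconds at concurrency 32, keyed by core count.
# Entries 1-7 come from the per-core reruns; the entry for 8 is assumed to be
# the realpath total from the main concurrency-32 run.
TOTALS_BY_CORES = {
    1: 1289527,
    2: 1753625,
    3: 1197896,
    4: 631293,
    5: 227814,
    6: 153550,
    7: 136013,
    8: 116982,
}

# Core count with the highest throughput.
best_cores = max(TOTALS_BY_CORES, key=TOTALS_BY_CORES.get)

# How many times slower eight cores are than the best configuration.
slowdown = TOTALS_BY_CORES[best_cores] / TOTALS_BY_CORES[8]

print(best_cores, round(slowdown, 2))
```

The peak is at 2 cores, and the ratio of the 2-core total to the 8-core total comes out just under 15.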

The full test results are at the bottom of this message.

This is on 6.3-RELEASE-p4 with vfs.lookup_shared=1.

I believe this is the same issue that was previously discussed as "2 x quad-core system is slower that 2 x dual core on FreeBSD" archived here:

http://lists.freebsd.org/pipermail/freebsd-stable/2007-November/038441.html

In that post, Kris Kennaway wrote:
> It is hard to say for certain without a direct profile comparison of the
> workload, but it is probably due to lockmgr contention. lockmgr is used
> for various locking operations to do with VFS data structures.  It is
> known to have poor performance and scale very badly.

At this point, what I've got is one of those synthetic benchmarks, but it matches our production problems exactly, with one most unfortunate difference: the production processes need a whole lot more RAM. When this manifests, they backlog and the server death-spirals through swap.

I've chased my way up the kernel source to kern_lstat(), where a shared lock is obtained, and then on to namei, where vfs.lookup_shared comes into play. But unfortunately, I don't understand lockmgr, I don't know how the macros and flags I see here relate to it, I can't figure out what happened to the changes that Attilio Rao was working on, and there didn't seem to be much other hope at the time.

This is becoming a huge problem for us. Is there anything at all that can be done, or any news? In the case linked above, improvement was made by changing a PHP setting that isn't applicable in our case.

Thanks,
Jeff

Concurrency 1

        realpath
                Total = 1409069 (100%)
                Total/Sec = 23484
                Total/Sec/Worker = 23484

        lstat
                Total = 6828763 (100%)
                Total/Sec = 113812
                Total/Sec/Worker = 113812

Concurrency 2

        realpath
                Total = 1450489 (100%)
                Total/Sec = 24174
                Total/Sec/Worker = 12087

        lstat
                Total = 6891417 (100.9%)
                Total/Sec = 114856
                Total/Sec/Worker = 57428


Concurrency 4

        realpath
                Total = 448693 (31.8%)
                Total/Sec = 7478
                Total/Sec/Worker = 1869

        lstat
                Total = 3047933 (44.6%)
                Total/Sec = 50798
                Total/Sec/Worker = 12699

Concurrency 8

        realpath
                Total = 258281 (18.3%)
                Total/Sec = 4304
                Total/Sec/Worker = 538

        lstat
                Total = 1688728 (24.7%)
                Total/Sec = 28145
                Total/Sec/Worker = 3518

Concurrency 16

        realpath
                Total = 179150 (12.7%)
                Total/Sec = 2985
                Total/Sec/Worker = 186

        lstat
                Total = 966558 (14.1%)
                Total/Sec = 16109
                Total/Sec/Worker = 1006

Concurrency 32

        realpath
                Total = 116982 (8.3%)
                Total/Sec = 1949
                Total/Sec/Worker = 60

        lstat
                Total = 644703 (9.4%)
                Total/Sec = 10745
                Total/Sec/Worker = 335

Concurrency 64

        realpath
                Total = 112050 (7.9%)
                Total/Sec = 1867
                Total/Sec/Worker = 29

        lstat
                Total = 572798 (8.3%)
                Total/Sec = 9546
                Total/Sec/Worker = 149


Concurrency 128

        realpath
                Total = 111544 (7.9%)
                Total/Sec = 1859
                Total/Sec/Worker = 14

        lstat
                Total = 570800 (8.3%)
                Total/Sec = 9513
                Total/Sec/Worker = 74


Concurrency 256

        realpath
                Total = 96461 (6.8%)
                Total/Sec = 1607
                Total/Sec/Worker = 6

        lstat
                Total = 580679 (8.5%)
                Total/Sec = 9677
                Total/Sec/Worker = 37


Concurrency 512

        realpath
                Total = 91224 (6.4%)
                Total/Sec = 1520
                Total/Sec/Worker = 2

        lstat
                Total = 498342 (7.2%)
                Total/Sec = 8305
                Total/Sec/Worker = 16

realpath Concurrency 32 - 1 Core

Total = 1289527
Total/Sec = 21492
Total/Sec/Worker = 671

realpath Concurrency 32 - 2 Core

Total = 1753625
Total/Sec = 29227
Total/Sec/Worker = 913

realpath Concurrency 32 - 3 Core

Total = 1197896
Total/Sec = 19964
Total/Sec/Worker = 623

realpath Concurrency 32 - 4 Core

Total = 631293
Total/Sec = 10521
Total/Sec/Worker = 328

realpath Concurrency 32 - 5 Core

Total = 227814
Total/Sec = 3796
Total/Sec/Worker = 118

realpath Concurrency 32 - 6 Core

Total = 153550
Total/Sec = 2559
Total/Sec/Worker = 79

realpath Concurrency 32 - 7 Core

Total = 136013
Total/Sec = 2266
Total/Sec/Worker = 70


_______________________________________________
freebsd-hackers@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-hackers
