I've got a Thumper doing nothing but serving NFS. It's running B43 with zil_disable set. The system is being consumed in waves, but by what I don't know. Notice the vmstat output:
 3 0 0 25693580 2586268 0  0 0 0 0 0 0   0   0   0   0   926  91   703 0  25 75
21 0 0 25693580 2586268 0  0 0 0 0 0 0   0   0  13  14  1720  21  1105 0  92  8
20 0 0 25693580 2586268 0  0 0 0 0 0 0   0   0  17  18  2538  70   834 0 100  0
25 0 0 25693580 2586268 0  0 0 0 0 0 0   0   0   0   0   745  18   179 0 100  0
37 0 0 25693552 2586240 0  0 0 0 0 0 0   0   0   7   7  1152  52   313 0 100  0
16 0 0 25693592 2586280 0  0 0 0 0 0 0   0   0  15  13  1543  52   767 0 100  0
17 0 0 25693592 2586280 0  0 0 0 0 0 0   0   0   2   2   890  72   192 0 100  0
27 0 0 25693572 2586260 0  0 0 0 0 0 0   0   0  15  15  3271  19  3103 0  98  2
 0 0 0 25693456 2586144 0 11 0 0 0 0 0   0   0 281 249 34335 242 37289 0  46 54
 0 0 0 25693448 2586136 0  2 0 0 0 0 0   0   0   0   0  2470 103  2900 0  27 73
 0 0 0 25693448 2586136 0  0 0 0 0 0 0   0   0   0   0  1062 105   822 0  26 74
 0 0 0 25693448 2586136 0  0 0 0 0 0 0   0   0   0   0  1076  91   857 0  25 75
 0 0 0 25693448 2586136 0  0 0 0 0 0 0   0   0   0   0   917 126   674 0  25 75

These spikes of sys load come in waves like this. While there are close to a hundred systems mounting NFS shares on the Thumper, the amount of traffic is really low, nothing to justify this: we're talking less than 10MB/s. NFS is pathetically slow. We're using NFSv3 over TCP, shared via ZFS sharenfs, on a 3Gbps aggregation (3 x 1Gbps). I've been slamming my head against this problem for days and can't make headway. I'll post some of my notes below. Any thoughts or ideas are welcome!

benr.

===

Step 1 was to disable any ZFS features that might consume large amounts of CPU:

# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous

These changes had no effect.

Next was to consider that perhaps NFS was doing name lookups when it shouldn't. Indeed, "dns" was specified in /etc/nsswitch.conf, which won't work given that no DNS servers are accessible from the storage or private networks; but again, no improvement. In the process I removed dns from nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client service in SMF.

Turning back to CPU usage, we can see the activity is all SYStem time and comes in waves:

[private:/tmp] root# sar 1 100

SunOS private.thumper1 5.11 snv_43 i86pc    12/07/2006

10:38:05    %usr    %sys    %wio   %idle
10:38:06       0      27       0      73
10:38:07       0      27       0      73
10:38:09       0      27       0      73
10:38:10       1      26       0      73
10:38:11       0      26       0      74
10:38:12       0      26       0      74
10:38:13       0      24       0      76
10:38:14       0       6       0      94
10:38:15       0       7       0      93
10:38:22       0      99       0       1   <--
10:38:23       0      94       0       6   <--
10:38:24       0      28       0      72
10:38:25       0      27       0      73
10:38:26       0      27       0      73
10:38:27       0      27       0      73
10:38:28       0      27       0      73
10:38:29       1      30       0      69
10:38:30       0      27       0      73
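Since the load is all sys time, sampling kernel stacks during a spike should show exactly where the CPU is going. A rough sketch using the DTrace profile provider (the sample rate and interval are arbitrary):

# Sample all CPUs at 997Hz for 30s. arg0 is the kernel PC, non-zero
# only when a sample lands in the kernel, so this counts sys time.
# The 20 hottest kernel stacks print on exit.
dtrace -n 'profile-997 /arg0/ { @[stack()] = count(); }
           tick-30s { trunc(@, 20); exit(0); }'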
And so we consider whether there is a pattern to the frequency. The following is sar output for the lines in which %sys is above 90%:

10:40:04    %usr    %sys    %wio   %idle    Delta
10:40:11       0      97       0       3
10:40:45       0      98       0       2    34 seconds
10:41:02       0      94       0       6    17 seconds
10:41:26       0     100       0       0    24 seconds
10:42:00       0     100       0       0    34 seconds
10:42:25    (end of sample)                 >25 seconds

Looking at the congestion in the run queue:

[private:/tmp] root# sar -q 5 100

10:45:43  runq-sz  %runocc  swpq-sz  %swpocc
10:45:51     27.0       85      0.0        0
10:45:57      1.0       20      0.0        0
10:46:02      2.0       60      0.0        0
10:46:13     19.8       99      0.0        0
10:46:23     17.7       99      0.0        0
10:46:34     24.4       99      0.0        0
10:46:41     22.1       97      0.0        0
10:46:48     13.0       96      0.0        0
10:46:55     25.3      102      0.0        0

Looking at the per-CPU breakdown:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   324  224    0    0    0  154    0     0    0 100   0   0
  1    0   0    0   114    0  226    0   10   13    0    86    0   1   0  99
  2    0   0    0   162  138  149    0    5    4    0     0    0   1   0  99
  3    0   0    0    55    6   46    0    4    3    0     0    0   1   0  99
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   310  210   34    0   17  171    7     5    0 100   0   0
  1    0   0    0   152    1  200    0   17   26    5    59    1  65   0  34
  2    0   0    0   271  197  175    1   13   20    2     0    0  66   0  34
  3    0   0    0   120    6   76    0    6   14    2     0    0  66   0  34
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    1   318  204   46    1   19   22    5    18    0 100   0   0
  1    0   0    0   112    0   40    0   12   15    3     4    0 100   0   0
  2    0   0    0   195   93   31    1   11   27    3     0    0 100   0   0
  3    0   0    0   108    1   20    0    9   31    3     2    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   569  204  809    3  133  432    1    21    0  99   0   1
  1    0   0    0   502    0 1159    1  201  453    1     8    0  97   0   3
  2    0   0    0  2305 2197  147    6   51  279    0     0    0 100   0   0
  3    0   0    0   242    2  433   10   75  301    1     8    0 100   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   371  227  110    0   31 1470    0    80    0  99   0   1
  1    0   0    3  1034    0 2079    0  102  241    0     9    0  10   0  90
  2    0   0    0  1892 1718 1558    3  129  214    0    36    0  11   0  89
  3    0   0    0   303    6  549    0   50  329    0     5    0   8   0  92
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    1   312  205   24    0   10   39    4     4    0 100   0   0
  1    0   0    1   134    0   99    0   10   20    3     6    0  90   0  10
  2    0   0    0   187   90   40    0    8   15    4    10    0  90   0  10
  3    0   0    1   100    1   19    0    8    9    4     6    0  90   0  10
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    2   528  206  916    8  161  312    1    39    0  99   0   1
  1    0   0    1   206    1  350    6  120  147    1    68    0 100   0   0
  2   83   0    0  1929 1816  192    9   87  176    0    50    0 100   0   0
  3    0   0    0   142    4   66   21   12  460    1    17    1  99   0   0
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   360  228  163   11   25  743    0    13    0 100   0   0
  1    0   0    0  1587    0 4126   13  520 1435    0   122    0  24   0  76
  2    3   0    0  7785 7669 1499   29  225  933    0    48    0  27   0  73
  3    1   0    0  1856    6 4683    0  467 1912    0    37    0  23   0  77
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   323  223    0    0    0  244    0     0    0 100   0   0
  1    0   0    0   141    0  277    0   14   11    0     1    0   0   0 100
  2    0   0    0   307  259  312    0   16   15    0     3    0   1   0  99
  3    0   0    0   113    6  165    0    5    5    0    91    0   0   0 100
CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0    0   324  224    0    0    0  203    0     0    0 100   0   0
  1    0   0    0   143    0  283    0   11    2    0    13    0   1   0  99
  2    0   0    0   312  278  298    0    8   11    0     3    0   1   0  99
  3    0   0    0   119    6  175    0    3    5    0    69    0   1   0  99
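Those smtx figures (spins on kernel mutexes) jumping into the hundreds and thousands during the busy intervals suggest lock contention rather than useful work. lockstat can attribute the spins to specific locks and callers; something like the following (a fuller run appears at the end of these notes):

# Watch kernel lock contention events for 10s and report the top 10
# (-C selects contention events, -D 10 limits the output; lockstat(1M)).
lockstat -C -D 10 sleep 10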
Looking at prstat we can see the number of nfsd threads is low, much lower than the 1024 limit:

[private:/tmp] root# prstat
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 22643 daemon   2228K 1776K sleep   60  -20   2:55:52  55% nfsd/18

Interestingly, using prstat -mL to monitor thread latency, we see that a handful of threads are the culprits consuming massive CPU:

[private:/tmp] root# prstat -mL
   PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
 22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 416   1   0   0 nfsd/1506
 22643 daemon   0.0  75 0.0 0.0 0.0 0.0  25 0.0 415   0   0   0 nfsd/1563
 22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 417   0   0   0 nfsd/1554
 22643 daemon   0.0  74 0.0 0.0 0.0 0.0  26 0.0 419   0   0   0 nfsd/1551
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0  26  74 418   0   0   0 nfsd/1553
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1536
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1555
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1539
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1562
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1545
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1559
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419   1   0   0 nfsd/1541
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1546
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417   0   0   0 nfsd/1543
 22643 daemon   0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418   0   0   0 nfsd/1560
Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86

Above, SLP is the percentage of time the thread has spent sleeping, LAT the percentage of time spent waiting for CPU, VCX the number of voluntary context switches, and ICX the number of involuntary context switches.

Using pstack to try to determine what these threads are doing, we see:

[private:/tmp] root# pstack 22643/1693
22643:  /usr/lib/nfs/nfsd
----------------- lwp# 1693 / thread# 1693 --------------------
 fee589b7 nfssys   (e, feba0fc8)
 fee5839f _thr_setup (fed4d400) + 51
 fee585f0 _lwp_start (fed4d400, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1708
22643:  /usr/lib/nfs/nfsd
----------------- lwp# 1708 --------------------------------
 fee589b7 nfssys   (e, febb0fc8)
 fee5839f _thr_setup (fed44c00) + 51
 fee585f0 _lwp_start (fed44c00, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1664
22643:  /usr/lib/nfs/nfsd
----------------- lwp# 1664 / thread# 1664 --------------------
 fee589b7 nfssys   (e, fec90fc8)
 fee5839f _thr_setup (fed43800) + 51
 fee585f0 _lwp_start (fed43800, 0, 0, 0, 0, 0)

Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I get an idea of what ZFS is doing:

[private:/] root# dtrace -s zvop_times.d
dtrace: script 'zvop_times.d' matched 66 probes
^C
CPU     ID                    FUNCTION:NAME
  1      2                             :END

ZFS COUNT
  zfs_getsecattr                4
  zfs_space                    70
  zfs_rename                  111
  zfs_readdir                 284
  zfs_read                    367
  zfs_mkdir                   670
  zfs_setattr                1054
  zfs_frlock                 1562
  zfs_putpage                3916
  zfs_write                  4110
  zfs_create                 4178
  zfs_remove                 7794
  zfs_fid                   14960
  zfs_inactive              17637
  zfs_access                20809
  zfs_fsync                 29668
  zfs_lookup                31273
  zfs_getattr              175457

ZFS AVG TIME
  zfs_fsync                  2337
  zfs_getattr                2735
  zfs_access                 2774
  zfs_fid                    2948
  zfs_inactive               4081
  zfs_lookup                 6067
  zfs_frlock                 6750
  zfs_putpage                6840
  zfs_mkdir                  9259
  zfs_getsecattr            10940
  zfs_setattr               15364
  zfs_space                 25214
  zfs_readdir               25726
  zfs_remove                26117
  zfs_write                 31942
  zfs_rename                60504
  zfs_read                 144449
  zfs_create             71215587

ZFS SUM TIME
  zfs_getsecattr            43762
  zfs_space               1765026
  zfs_mkdir               6204101
  zfs_rename              6715987
  zfs_readdir             7306233
  zfs_frlock             10544902
  zfs_setattr            16194261
  zfs_putpage            26787465
  zfs_fid                44111892
  zfs_read               53013073
  zfs_access             57740468
  zfs_fsync              69354886
  zfs_inactive           71990292
  zfs_write             131282275
  zfs_lookup            189748320
  zfs_remove            203558835
  zfs_getattr           479998059
  zfs_create         297538724997
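The standout is zfs_create: 4178 calls averaging about 71ms each, roughly 297 of the ~299 seconds of total VOP time in the sample. A follow-up sketch to get its latency distribution (fbt probe names assumed to match the functions zvop_times.d instruments):

# Latency distribution for zfs_create, in ns (untested sketch; assumes
# fbt probes on the zfs kernel module, as zvop_times.d uses).
dtrace -n 'fbt:zfs:zfs_create:entry  { self->ts = timestamp; }
           fbt:zfs:zfs_create:return /self->ts/ {
                   @["zfs_create (ns)"] = quantize(timestamp - self->ts);
                   self->ts = 0;
           }'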
[private:/] root# zfs get all joyous
NAME    PROPERTY       VALUE                  SOURCE
joyous  type           filesystem             -
joyous  creation       Fri Jul 28 17:08 2006  -
joyous  used           255G                   -
joyous  available      15.7T                  -
joyous  referenced     81.7K                  -
joyous  compressratio  1.21x                  -
joyous  mounted        yes                    -
joyous  quota          none                   default
joyous  reservation    none                   default
joyous  recordsize     128K                   default
joyous  mountpoint     /joyous                default
joyous  sharenfs       off                    default
joyous  checksum       off                    local
joyous  compression    off                    local
joyous  atime          off                    local
joyous  devices        on                     default
joyous  exec           on                     default
joyous  setuid         on                     default
joyous  readonly       off                    default
joyous  zoned          off                    default
joyous  snapdir        hidden                 default
joyous  aclmode        groupmask              default
joyous  aclinherit     secure                 default

So what exactly is nfsd doing???

[private:/] root# time dtrace -n 'syscall:::entry /execname == "nfsd"/ { @[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 225 probes
^C
  open                   3
  close                  5
  lwp_mutex_timedlock    7
  lwp_mutex_wakeup       7
  pollsys                7
  putmsg                 8
  getmsg                11
  fcntl                 12
  lwp_park              30
  mmap                  41
  munmap                42
  fxstat                47
  ioctl                 58
  lwp_continue         100
  lwp_create           100
  setcontext           100
  lwp_kill             101
  lwp_exit             110
  schedctl             110
  lwp_sigmask          138
  nfs                  203

real    3m22.464s
user    0m0.206s
sys     0m0.222s

And the answer is: not much of anything. Where is this CPU disappearing to!!! Looking at the stacks:

[private:/] root# time dtrace -n 'syscall::nfs:entry /execname == "nfsd"/ { @[ustack()] = count(); }'
dtrace: description 'syscall::nfs:entry ' matched 1 probe
^C
  libc.so.1`_nfssys+0x7
  libc.so.1`_thr_setup+0x51
  libc.so.1`_lwp_start
   22

Not very helpful... let's try this:

[private:/tmp] root# truss -c -p 22643
^C
syscall               seconds   calls  errors
close                    .000       1
ioctl                    .000       2
lwp_park                 .000       1
lwp_unpark               .000       1
getmsg                   .000       1
putmsg                   .000       1
setustack                .000      30
nfssys                106.146      56
mmap                     .000      20
munmap                   .000       3
fxstat                   .000       4
lwp_create               .001      30
lwp_exit                 .000      12
lwp_continue             .000      30
lwp_kill                 .000      13      13
lwp_sigmask              .000      14
pollsys                  .000       1
schedctl                 .000      12
                     --------  ------  ------
sys totals:           106.149     232      13
usr time:                .001
elapsed:               47.230

Better. Clearly nfssys is a problem, but why?

[private:/] root# dtrace -n 'io:genunix::start { @[pid, execname] = count(); }'
dtrace: description 'io:genunix::start ' matched 3 probes
^C
        3  fsflush        1
      344  syslogd        2
    22643  nfsd          59
        0  sched      30346

The fucking scheduler is doing more IO than nfsd! Let's try iotop (http://www.brendangregg.com/DTrace/iotop). Below are 10-second intervals; DISKTIME is in microseconds (1000000 = 1s):

2006 Dec  7 12:06:47,  load: 8.08,  disk_r:   1408 KB,  disk_w:  14482 KB

  UID    PID   PPID CMD        DEVICE  MAJ  MIN D     DISKTIME
    0      3      0 fsflush    sd6      54  389 W          242
    0      3      0 fsflush    sd6      54  384 W          251
    1  22643      1 nfsd       sd17     54 1088 R         2733
    1  22643      1 nfsd       sd14     54  896 R         4420
    1  22643      1 nfsd       sd13     54  832 R         5296
    1  22643      1 nfsd       sd18     54 1152 R         7103
    1  22643      1 nfsd       sd47     54 3008 R         7240
    1  22643      1 nfsd       sd31     54 1984 R        14827
    1  22643      1 nfsd       sd32     54 2048 R        16653
    1  22643      1 nfsd       sd45     54 2880 R        17768
    1  22643      1 nfsd       sd30     54 1920 R        17962
    1  22643      1 nfsd       sd22     54 1408 R        18687
    1  22643      1 nfsd       sd8      54  512 R        19693
    1  22643      1 nfsd       sd33     54 2112 R        20659
    1  22643      1 nfsd       sd19     54 1216 R        22933
    1  22643      1 nfsd       sd43     54 2752 R        23602
    1  22643      1 nfsd       sd7      54  448 R        27458
    1  22643      1 nfsd       sd10     54  640 R        28715
    1  22643      1 nfsd       sd49     54 3136 R        29640
    1  22643      1 nfsd       sd5      54  320 R        46201
    1  22643      1 nfsd       sd11     54  704 R        46775
    1  22643      1 nfsd       sd24     54 1536 R        60011  <-- 0.06s
    0      0      0 sched      sd16     54 1024 W       140682
    0      0      0 sched      sd47     54 3008 W       142082
    0      0      0 sched      sd18     54 1152 W       151720
    0      0      0 sched      sd13     54  832 W       157107
    0      0      0 sched      sd14     54  896 W       159044
    0      0      0 sched      sd17     54 1088 W       164386
    0      0      0 sched      sd35     54 2240 W       167119
    0      0      0 sched      sd34     54 2176 W       168552
    0      0      0 sched      sd37     54 2368 W       171872
    0      0      0 sched      sd40     54 2560 W       193907
    0      0      0 sched      sd25     54 1600 W       195662
    0      0      0 sched      sd33     54 2112 W       196480

Disk response times actually seem really good. iosnoop (http://www.brendangregg.com/DTrace/iosnoop) shows that sched is doing a lot of writing, mostly 1k writes; testing on aeon suggests this is normal.
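One note on sched: pid 0 is where ZFS pool writes get charged, since transaction group syncs are issued from kernel threads, so sched dominating the write side of iotop is expected. It might be worth checking whether the sys-time waves line up with txg syncs; a sketch (function name assumed from the OpenSolaris source):

# Print the duration of each ZFS transaction group sync; if the sys
# spikes track these, the waves are txg flushes (spa_sync assumed to
# be the sync entry point).
dtrace -n 'fbt:zfs:spa_sync:entry  { self->ts = timestamp; }
           fbt:zfs:spa_sync:return /self->ts/ {
                   printf("txg sync: %d ms", (timestamp - self->ts) / 1000000);
                   self->ts = 0;
           }'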
truss -D also points to this nfssys slowness:

[private:/tmp] root# truss -D -p `pgrep nfsd`
...
/2:      0.0001 lwp_create(0xFED71D20, LWP_DETACHED|LWP_SUSPENDED, 0xFED71F48) = 3266
/2:      0.0001 lwp_continue(3266)                             = 0
/3266:  260128.5978 lwp_create()  (returning as new lwp ...)   = 0
/3266:   0.0034 setustack(0xFED4BC60)
/3266:          nfssys(14, 0xFEC40FC8)  (sleeping...)
/3266:  43.1653 nfssys(14, 0xFEC40FC8)                         = 0
/3266:   0.0074 schedctl()                                     = 0xFEF9F020
/3266:   0.0002 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF]
/3266:   0.0002 lwp_kill(3268, SIG#0)                          Err#3 ESRCH
/3266:   0.0002 munmap(0xFEC20000, 4096)                       = 0
/3266:   0.0002 lwp_exit()

Locks:

[private:/tmp] root# lockstat -P sleep 10

Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec)

Count indv cuml rcnt     spin Lock                 Caller
-------------------------------------------------------------------------------
   32 100% 100% 0.00 59963092 0xffffffffb282a1c8   dmu_object_alloc+0x75
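Only 32 events, but each spin averages ~60ms, and dmu_object_alloc sits squarely in the zfs_create path, which is consistent with the pathological create times above. For completeness, lockstat can also act as a kernel profiler to confirm where the sys time lands; the usual idiom (flags per lockstat(1M)):

# Interrupt-based kernel profiling for 30s: -I profiles instead of
# tracing locks, -k coalesces PCs within functions, -W aggregates by
# caller only, -D 20 shows the top 20 entries.
lockstat -kIW -D 20 sleep 30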