Hey Ben - I need more time to look at this and connect some dots,
but real quick....
Some nfsstat data that we could use to correlate with the local server
activity would be interesting. zfs_create() seems to be the heavy hitter,
but a periodic kernel profile (especially if we can catch a 97% SYS
period) would help:
#lockstat -i997 -Ik -s 10 sleep 60
Alternatively:
#dtrace -n 'profile-997hz / arg0 != 0 / { @s[stack()]=count(); }'
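For the nfsstat piece, even a crude periodic snapshot would do; a rough
sketch (server-side stats every 10 seconds, timestamped so they can be
lined up against sar):
#while :; do date; nfsstat -s; sleep 10; done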
It would also be interesting to see what the zfs_create()'s are doing.
Perhaps a quick:
#dtrace -n 'zfs_create:entry { printf("ZFS Create: %s\n", stringof(args[0]->v_path)); }'
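Or, rather than printing every call, aggregate on the directory path
(a sketch built on the same probe; the predicate just guards against a
NULL v_path):
#dtrace -n 'zfs_create:entry /args[0]->v_path != NULL/ { @[stringof(args[0]->v_path)] = count(); }'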
It would also be interesting to see the network stats. Grab Brendan's
nicstat and collect some samples....
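Assuming nicstat is installed and in the PATH, something as simple as
5-second samples would show whether any single link in the aggregation
is the bottleneck:
#nicstat 5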
Your reference to low traffic is in bandwidth terms, which, as you
indicate, is really, really low. But the data, at least up to this point,
suggests the workload is not data/bandwidth intensive, but attribute
intensive. Note again that zfs_create() is the heavy ZFS function, along
with zfs_getattr(). Perhaps it's the attribute-intensive nature of the
load that is at the root of this.
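To confirm the attribute-heavy theory from the NFS side as well, counting
the NFSv3 server entry points should line up with the ZFS numbers (a
sketch; this assumes the rfs3_* naming of the in-kernel NFSv3 server
routines):
#dtrace -n 'fbt::rfs3_*:entry { @[probefunc] = count(); }'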
I can spend more time on this tomorrow (traveling today).
Thanks,
/jim
Ben Rockwood wrote:
I've got a Thumper doing nothing but serving NFS. It's using B43 with
zil_disabled. The system is being consumed in waves, but by what I don't know.
Notice vmstat (the first column is the run queue; the last three are us/sy/id):
3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75
These spikes of sys load come in waves like this. While there are close to a
hundred systems mounting NFS shares on the Thumper, the amount of traffic is
really low. Nothing to justify this. We're talking less than 10MB/s.
NFS is pathetically slow. We're using NFSv3 over TCP, shared via ZFS
sharenfs, on a 3Gbps aggregation (3*1Gbps).
I've been slamming my head against this problem for days and can't make
headway. I'll post some of my notes below. Any thoughts or ideas are welcome!
benr.
===
Step 1 was to disable any ZFS features that might consume large amounts of CPU:
# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous
These changes had no effect.
Next was to consider that perhaps NFS was doing name lookups when it
shouldn't. Indeed, "dns" was specified in /etc/nsswitch.conf, which won't
work given that no DNS servers are accessible from the storage or private
networks; but again, no improvement. In the process I removed dns from
nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client
service in SMF.
Turning back to CPU usage, we can see the activity is all SYStem time and comes
in waves:
[private:/tmp] root# sar 1 100
SunOS private.thumper1 5.11 snv_43 i86pc 12/07/2006
10:38:05 %usr %sys %wio %idle
10:38:06 0 27 0 73
10:38:07 0 27 0 73
10:38:09 0 27 0 73
10:38:10 1 26 0 73
10:38:11 0 26 0 74
10:38:12 0 26 0 74
10:38:13 0 24 0 76
10:38:14 0 6 0 94
10:38:15 0 7 0 93
10:38:22 0 99 0 1 <--
10:38:23 0 94 0 6 <--
10:38:24 0 28 0 72
10:38:25 0 27 0 73
10:38:26 0 27 0 73
10:38:27 0 27 0 73
10:38:28 0 27 0 73
10:38:29 1 30 0 69
10:38:30 0 27 0 73
And so we consider whether there is a pattern to the frequency. The
following is sar output showing only the lines in which %sys is above 90%:
10:40:04 %usr %sys %wio %idle Delta
10:40:11 0 97 0 3
10:40:45 0 98 0 2 34 seconds
10:41:02 0 94 0 6 17 seconds
10:41:26 0 100 0 0 24 seconds
10:42:00 0 100 0 0 34 seconds
10:42:25 (end of sample) >25 seconds
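(For reference, that filtered view can be produced with something like
the following; a rough sketch where $3 is the %sys column in this layout
and the +0 coerces the header lines out:)
[private:/tmp] root# sar 1 600 | awk '$3+0 > 90'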
Looking at the congestion in the run queue:
[private:/tmp] root# sar -q 5 100
10:45:43 runq-sz %runocc swpq-sz %swpocc
10:45:51 27.0 85 0.0 0
10:45:57 1.0 20 0.0 0
10:46:02 2.0 60 0.0 0
10:46:13 19.8 99 0.0 0
10:46:23 17.7 99 0.0 0
10:46:34 24.4 99 0.0 0
10:46:41 22.1 97 0.0 0
10:46:48 13.0 96 0.0 0
10:46:55 25.3 102 0.0 0
Looking at the per-CPU breakdown:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 324 224 0 0 0 154 0 0 0 100 0 0
1 0 0 0 114 0 226 0 10 13 0 86 0 1 0 99
2 0 0 0 162 138 149 0 5 4 0 0 0 1 0 99
3 0 0 0 55 6 46 0 4 3 0 0 0 1 0 99
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 310 210 34 0 17 171 7 5 0 100 0 0
1 0 0 0 152 1 200 0 17 26 5 59 1 65 0 34
2 0 0 0 271 197 175 1 13 20 2 0 0 66 0 34
3 0 0 0 120 6 76 0 6 14 2 0 0 66 0 34
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 318 204 46 1 19 22 5 18 0 100 0 0
1 0 0 0 112 0 40 0 12 15 3 4 0 100 0 0
2 0 0 0 195 93 31 1 11 27 3 0 0 100 0 0
3 0 0 0 108 1 20 0 9 31 3 2 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 569 204 809 3 133 432 1 21 0 99 0 1
1 0 0 0 502 0 1159 1 201 453 1 8 0 97 0 3
2 0 0 0 2305 2197 147 6 51 279 0 0 0 100 0 0
3 0 0 0 242 2 433 10 75 301 1 8 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 371 227 110 0 31 1470 0 80 0 99 0 1
1 0 0 3 1034 0 2079 0 102 241 0 9 0 10 0 90
2 0 0 0 1892 1718 1558 3 129 214 0 36 0 11 0 89
3 0 0 0 303 6 549 0 50 329 0 5 0 8 0 92
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 312 205 24 0 10 39 4 4 0 100 0 0
1 0 0 1 134 0 99 0 10 20 3 6 0 90 0 10
2 0 0 0 187 90 40 0 8 15 4 10 0 90 0 10
3 0 0 1 100 1 19 0 8 9 4 6 0 90 0 10
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 2 528 206 916 8 161 312 1 39 0 99 0 1
1 0 0 1 206 1 350 6 120 147 1 68 0 100 0 0
2 83 0 0 1929 1816 192 9 87 176 0 50 0 100 0 0
3 0 0 0 142 4 66 21 12 460 1 17 1 99 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 360 228 163 11 25 743 0 13 0 100 0 0
1 0 0 0 1587 0 4126 13 520 1435 0 122 0 24 0 76
2 3 0 0 7785 7669 1499 29 225 933 0 48 0 27 0 73
3 1 0 0 1856 6 4683 0 467 1912 0 37 0 23 0 77
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 323 223 0 0 0 244 0 0 0 100 0 0
1 0 0 0 141 0 277 0 14 11 0 1 0 0 0 100
2 0 0 0 307 259 312 0 16 15 0 3 0 1 0 99
3 0 0 0 113 6 165 0 5 5 0 91 0 0 0 100
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 324 224 0 0 0 203 0 0 0 100 0 0
1 0 0 0 143 0 283 0 11 2 0 13 0 1 0 99
2 0 0 0 312 278 298 0 8 11 0 3 0 1 0 99
3 0 0 0 119 6 175 0 3 5 0 69 0 1 0 99
Looking at prstat we can see the number of nfsd threads is low, much
lower than the 1024 limit:
[private:/tmp] root# prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
22643 daemon 2228K 1776K sleep 60 -20 2:55:52 55% nfsd/18
Interestingly, using prstat -mL to monitor thread latency, we see that a
handful of threads are the culprits consuming massive CPU:
[private:/tmp] root# prstat -mL
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 416 1 0 0 nfsd/1506
22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 415 0 0 0 nfsd/1563
22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 417 0 0 0 nfsd/1554
22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 419 0 0 0 nfsd/1551
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 26 74 418 0 0 0 nfsd/1553
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1536
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1555
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1539
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1562
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1545
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1559
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419 1 0 0 nfsd/1541
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1546
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1543
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1560
Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86
Above, SLP represents "the percentage of time the process has spent
sleeping", LAT represents "the percentage of time the process has spent
waiting for CPU", VCX is the number of voluntary context switches, and
ICX is the number of involuntary context switches.
Using pstack to try to determine what these threads are doing, we see:
[private:/tmp] root# pstack 22643/1693
22643: /usr/lib/nfs/nfsd
----------------- lwp# 1693 / thread# 1693 --------------------
fee589b7 nfssys (e, feba0fc8)
fee5839f _thr_setup (fed4d400) + 51
fee585f0 _lwp_start (fed4d400, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1708
22643: /usr/lib/nfs/nfsd
----------------- lwp# 1708 --------------------------------
fee589b7 nfssys (e, febb0fc8)
fee5839f _thr_setup (fed44c00) + 51
fee585f0 _lwp_start (fed44c00, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1664
22643: /usr/lib/nfs/nfsd
----------------- lwp# 1664 / thread# 1664 --------------------
fee589b7 nfssys (e, fec90fc8)
fee5839f _thr_setup (fed43800) + 51
fee585f0 _lwp_start (fed43800, 0, 0, 0, 0, 0)
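Since every user stack bottoms out in nfssys(), whatever is burning CPU
must be in the kernel. A kernel profile limited to nfsd threads ought to
show where (a sketch; arg0 != 0 restricts the samples to kernel PCs):
[private:/] root# dtrace -n 'profile-997hz /execname == "nfsd" && arg0 != 0/ { @[stack()] = count(); }'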
Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I
get an idea of what ZFS is doing:
[private:/] root# dtrace -s zvop_times.d
dtrace: script 'zvop_times.d' matched 66 probes
^C
CPU ID FUNCTION:NAME
1 2 :END
ZFS COUNT
zfs_getsecattr 4
zfs_space 70
zfs_rename 111
zfs_readdir 284
zfs_read 367
zfs_mkdir 670
zfs_setattr 1054
zfs_frlock 1562
zfs_putpage 3916
zfs_write 4110
zfs_create 4178
zfs_remove 7794
zfs_fid 14960
zfs_inactive 17637
zfs_access 20809
zfs_fsync 29668
zfs_lookup 31273
zfs_getattr 175457
ZFS AVG TIME
zfs_fsync 2337
zfs_getattr 2735
zfs_access 2774
zfs_fid 2948
zfs_inactive 4081
zfs_lookup 6067
zfs_frlock 6750
zfs_putpage 6840
zfs_mkdir 9259
zfs_getsecattr 10940
zfs_setattr 15364
zfs_space 25214
zfs_readdir 25726
zfs_remove 26117
zfs_write 31942
zfs_rename 60504
zfs_read 144449
zfs_create 71215587
ZFS SUM TIME
zfs_getsecattr 43762
zfs_space 1765026
zfs_mkdir 6204101
zfs_rename 6715987
zfs_readdir 7306233
zfs_frlock 10544902
zfs_setattr 16194261
zfs_putpage 26787465
zfs_fid 44111892
zfs_read 53013073
zfs_access 57740468
zfs_fsync 69354886
zfs_inactive 71990292
zfs_write 131282275
zfs_lookup 189748320
zfs_remove 203558835
zfs_getattr 479998059
zfs_create 297538724997
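Given how far out of line zfs_create() is, a latency distribution rather
than a single average would show whether this is a few enormous outliers
or every create being slow. A sketch:
[private:/] root# dtrace -n 'fbt::zfs_create:entry { self->ts = timestamp; }
fbt::zfs_create:return /self->ts/ { @["zfs_create (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'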
[private:/] root# zfs get all joyous
NAME PROPERTY VALUE SOURCE
joyous type filesystem -
joyous creation Fri Jul 28 17:08 2006 -
joyous used 255G -
joyous available 15.7T -
joyous referenced 81.7K -
joyous compressratio 1.21x -
joyous mounted yes -
joyous quota none default
joyous reservation none default
joyous recordsize 128K default
joyous mountpoint /joyous default
joyous sharenfs off default
joyous checksum off local
joyous compression off local
joyous atime off local
joyous devices on default
joyous exec on default
joyous setuid on default
joyous readonly off default
joyous zoned off default
joyous snapdir hidden default
joyous aclmode groupmask default
joyous aclinherit secure default
So what exactly is nfsd doing???
[private:/] root# time dtrace -n 'syscall:::entry /execname == "nfsd"/ { @[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 225 probes
^C
open 3
close 5
lwp_mutex_timedlock 7
lwp_mutex_wakeup 7
pollsys 7
putmsg 8
getmsg 11
fcntl 12
lwp_park 30
mmap 41
munmap 42
fxstat 47
ioctl 58
lwp_continue 100
lwp_create 100
setcontext 100
lwp_kill 101
lwp_exit 110
schedctl 110
lwp_sigmask 138
nfs 203
real 3m22.464s
user 0m0.206s
sys 0m0.222s
And the answer is, not much of anything. Where is all this CPU disappearing to?!
Looking at the stacks:
[private:/] root# time dtrace -n 'syscall::nfs:entry /execname == "nfsd"/ { @[ustack()] = count(); }'
dtrace: description 'syscall::nfs:entry ' matched 1 probe
^C
libc.so.1`_nfssys+0x7
libc.so.1`_thr_setup+0x51
libc.so.1`_lwp_start
22
Not very helpful... let's try this:
[private:/] root# truss -c -p 22643
^C
syscall seconds calls errors
close .000 1
ioctl .000 2
lwp_park .000 1
lwp_unpark .000 1
getmsg .000 1
putmsg .000 1
setustack .000 30
nfssys 106.146 56
mmap .000 20
munmap .000 3
fxstat .000 4
lwp_create .001 30
lwp_exit .000 12
lwp_continue .000 30
lwp_kill .000 13 13
lwp_sigmask .000 14
pollsys .000 1
schedctl .000 12
-------- ------ ----
sys totals: 106.149 232 13
usr time: .001
elapsed: 47.230
Better. Clearly nfssys is a problem, but why?
[private:/] root# dtrace -n 'io:genunix::start { @[pid, execname] = count(); }'
dtrace: description 'io:genunix::start ' matched 3 probes
^C
3 fsflush 1
344 syslogd 2
22643 nfsd 59
0 sched 30346
The fucking scheduler is doing more IO than nfsd!
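To see what sched is actually writing on whose behalf, kernel stacks at
I/O start for pid 0 might help (a sketch; on ZFS, deferred
transaction-group writes are typically issued by kernel threads and so
get charged to sched):
[private:/] root# dtrace -n 'io:::start /pid == 0/ { @[stack()] = count(); }'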
Let's try iotop (http://www.brendangregg.com/DTrace/iotop). Below are
10-second intervals; DISKTIME is in microseconds (1000000 = 1s):
2006 Dec 7 12:06:47, load: 8.08, disk_r: 1408 KB, disk_w: 14482 KB
UID PID PPID CMD DEVICE MAJ MIN D DISKTIME
0 3 0 fsflush sd6 54 389 W 242
0 3 0 fsflush sd6 54 384 W 251
1 22643 1 nfsd sd17 54 1088 R 2733
1 22643 1 nfsd sd14 54 896 R 4420
1 22643 1 nfsd sd13 54 832 R 5296
1 22643 1 nfsd sd18 54 1152 R 7103
1 22643 1 nfsd sd47 54 3008 R 7240
1 22643 1 nfsd sd31 54 1984 R 14827
1 22643 1 nfsd sd32 54 2048 R 16653
1 22643 1 nfsd sd45 54 2880 R 17768
1 22643 1 nfsd sd30 54 1920 R 17962
1 22643 1 nfsd sd22 54 1408 R 18687
1 22643 1 nfsd sd8 54 512 R 19693
1 22643 1 nfsd sd33 54 2112 R 20659
1 22643 1 nfsd sd19 54 1216 R 22933
1 22643 1 nfsd sd43 54 2752 R 23602
1 22643 1 nfsd sd7 54 448 R 27458
1 22643 1 nfsd sd10 54 640 R 28715
1 22643 1 nfsd sd49 54 3136 R 29640
1 22643 1 nfsd sd5 54 320 R 46201
1 22643 1 nfsd sd11 54 704 R 46775
1 22643 1 nfsd sd24 54 1536 R 60011 <-- 0.06s
0 0 0 sched sd16 54 1024 W 140682
0 0 0 sched sd47 54 3008 W 142082
0 0 0 sched sd18 54 1152 W 151720
0 0 0 sched sd13 54 832 W 157107
0 0 0 sched sd14 54 896 W 159044
0 0 0 sched sd17 54 1088 W 164386
0 0 0 sched sd35 54 2240 W 167119
0 0 0 sched sd34 54 2176 W 168552
0 0 0 sched sd37 54 2368 W 171872
0 0 0 sched sd40 54 2560 W 193907
0 0 0 sched sd25 54 1600 W 195662
0 0 0 sched sd33 54 2112 W 196480
Disk response times seem really good actually.
iosnoop (http://www.brendangregg.com/DTrace/iosnoop) shows that sched is doing
a lot of writing, mostly 1k files. Testing on aeon suggests this is normal.
truss -D also points to this nfssys slowness:
[private:/tmp] root# truss -D -p `pgrep nfsd`
...
/2: 0.0001 lwp_create(0xFED71D20, LWP_DETACHED|LWP_SUSPENDED, 0xFED71F48) = 3266
/2: 0.0001 lwp_continue(3266) = 0
/3266: 260128.5978 lwp_create() (returning as new lwp ...) = 0
/3266: 0.0034 setustack(0xFED4BC60)
/3266: nfssys(14, 0xFEC40FC8) (sleeping...)
/3266: 43.1653 nfssys(14, 0xFEC40FC8) = 0
/3266: 0.0074 schedctl() = 0xFEF9F020
/3266: 0.0002 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF]
/3266: 0.0002 lwp_kill(3268, SIG#0) Err#3 ESRCH
/3266: 0.0002 munmap(0xFEC20000, 4096) = 0
/3266: 0.0002 lwp_exit()
Locks...
[private:/tmp] root# lockstat -P sleep 10
Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec)
Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
32 100% 100% 0.00 59963092 0xffffffffb282a1c8 dmu_object_alloc+0x75
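To get the full call path behind that dmu_object_alloc() spin, a deeper
capture with stack traces might help (a sketch; -s sets the recorded
stack depth):
[private:/tmp] root# lockstat -P -s 16 sleep 10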
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss