Hey Ben - I need more time to look at this and connect some dots,
but real quick....
Some nfsstat data that we could use to correlate with the local server
activity would be interesting. zfs_create() seems to be the heavy hitter,
but a periodic kernel profile (especially if we can catch a 97% SYS
period) would help:
#lockstat -i997 -Ik -s 10 sleep 60
Alternatively:
#dtrace -n 'profile-997hz / arg0 != 0 / { @s[stack()]=count(); }'
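For the nfsstat piece, even a crude periodic snapshot would do; a rough
sketch (server-side stats every 10 seconds, timestamped so they can be
lined up against sar):
#while :; do date; nfsstat -s; sleep 10; done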
It would also be interesting to see what the zfs_create()'s are doing.
Perhaps a quick:
#dtrace -n 'zfs_create:entry { printf("ZFS Create: %s\n", stringof(args[0]->v_path)); }'
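Or, rather than printing every call, aggregate on the directory path
(a sketch built on the same probe; the predicate just guards against a
NULL v_path):
#dtrace -n 'zfs_create:entry /args[0]->v_path != NULL/ { @[stringof(args[0]->v_path)] = count(); }'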
It would also be interesting to see the network stats. Grab Brendan's
nicstat and collect some samples....
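Assuming nicstat is installed and in the PATH, something as simple as
5-second samples would show whether any single link in the aggregation
is the bottleneck:
#nicstat 5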
Your reference to low traffic is in bandwidth terms, which, as you
indicate, is really, really low. But the data, at least up to this point,
suggests the workload is not data/bandwidth intensive, but attribute
intensive. Note again that zfs_create() is the heavy ZFS function, along
with zfs_getattr(). Perhaps it's the attribute-intensive nature of the
load that is at the root of this.
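To confirm the attribute-heavy theory from the NFS side as well, counting
the NFSv3 server entry points should line up with the ZFS numbers (a
sketch; this assumes the rfs3_* naming of the in-kernel NFSv3 server
routines):
#dtrace -n 'fbt::rfs3_*:entry { @[probefunc] = count(); }'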
I can spend more time on this tomorrow (traveling today).
Thanks,
/jim
Ben Rockwood wrote:
I've got a Thumper doing nothing but serving NFS. It's using B43 with
zil_disabled. The system is being consumed in waves, but by what I don't know.
Notice vmstat (the first column is the run queue; the last three are us/sy/id):
3 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 926 91 703 0 25 75
21 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 13 14 1720 21 1105 0 92 8
20 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 17 18 2538 70 834 0 100 0
25 0 0 25693580 2586268 0 0 0 0 0 0 0 0 0 0 0 745 18 179 0 100 0
37 0 0 25693552 2586240 0 0 0 0 0 0 0 0 0 7 7 1152 52 313 0 100 0
16 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 15 13 1543 52 767 0 100 0
17 0 0 25693592 2586280 0 0 0 0 0 0 0 0 0 2 2 890 72 192 0 100 0
27 0 0 25693572 2586260 0 0 0 0 0 0 0 0 0 15 15 3271 19 3103 0 98 2
0 0 0 25693456 2586144 0 11 0 0 0 0 0 0 0 281 249 34335 242 37289 0 46 54
0 0 0 25693448 2586136 0 2 0 0 0 0 0 0 0 0 0 2470 103 2900 0 27 73
0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1062 105 822 0 26 74
0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 1076 91 857 0 25 75
0 0 0 25693448 2586136 0 0 0 0 0 0 0 0 0 0 0 917 126 674 0 25 75
These spikes of sys load come in waves like this. While there are close to a
hundred systems mounting NFS shares on the Thumper, the amount of traffic is
really low. Nothing to justify this. We're talking less than 10MB/s.
NFS is pathetically slow. We're using NFSv3 over TCP, shared via ZFS
sharenfs, on a 3Gbps aggregation (3*1Gbps).
I've been slamming my head against this problem for days and can't make
headway. I'll post some of my notes below. Any thoughts or ideas are welcome!
benr.
===
Step 1 was to disable any ZFS features that might consume large amounts of CPU:
# zfs set compression=off joyous
# zfs set atime=off joyous
# zfs set checksum=off joyous
These changes had no effect.
Next was to consider that perhaps NFS was doing name lookups when it
shouldn't. Indeed, "dns" was specified in /etc/nsswitch.conf, which won't
work given that no DNS servers are accessible from the storage or private
networks; but again, no improvement. In the process I removed dns from
nsswitch.conf, deleted /etc/resolv.conf, and disabled the dns/client
service in SMF.
Turning back to CPU usage, we can see the activity is all SYStem time and comes
in waves:
[private:/tmp] root# sar 1 100
SunOS private.thumper1 5.11 snv_43 i86pc 12/07/2006
10:38:05 %usr %sys %wio %idle
10:38:06 0 27 0 73
10:38:07 0 27 0 73
10:38:09 0 27 0 73
10:38:10 1 26 0 73
10:38:11 0 26 0 74
10:38:12 0 26 0 74
10:38:13 0 24 0 76
10:38:14 0 6 0 94
10:38:15 0 7 0 93
10:38:22 0 99 0 1 <--
10:38:23 0 94 0 6 <--
10:38:24 0 28 0 72
10:38:25 0 27 0 73
10:38:26 0 27 0 73
10:38:27 0 27 0 73
10:38:28 0 27 0 73
10:38:29 1 30 0 69
10:38:30 0 27 0 73
And so we consider whether there is a pattern to the frequency. The
following is sar output showing only the lines in which %sys is above 90%:
10:40:04 %usr %sys %wio %idle Delta
10:40:11 0 97 0 3
10:40:45 0 98 0 2 34 seconds
10:41:02 0 94 0 6 17 seconds
10:41:26 0 100 0 0 24 seconds
10:42:00 0 100 0 0 34 seconds
10:42:25 (end of sample) >25 seconds
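(For reference, that filtered view can be produced with something like
the following; a rough sketch where $3 is the %sys column in this layout
and the +0 coerces the header lines out:)
[private:/tmp] root# sar 1 600 | awk '$3+0 > 90'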
Looking at the congestion in the run queue:
[private:/tmp] root# sar -q 5 100
10:45:43 runq-sz %runocc swpq-sz %swpocc
10:45:51 27.0 85 0.0 0
10:45:57 1.0 20 0.0 0
10:46:02 2.0 60 0.0 0
10:46:13 19.8 99 0.0 0
10:46:23 17.7 99 0.0 0
10:46:34 24.4 99 0.0 0
10:46:41 22.1 97 0.0 0
10:46:48 13.0 96 0.0 0
10:46:55 25.3 102 0.0 0
Looking at the per-CPU breakdown:
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 324 224 0 0 0 154 0 0 0 100 0 0
1 0 0 0 114 0 226 0 10 13 0 86 0 1 0 99
2 0 0 0 162 138 149 0 5 4 0 0 0 1 0 99
3 0 0 0 55 6 46 0 4 3 0 0 0 1 0 99
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 310 210 34 0 17 171 7 5 0 100 0 0
1 0 0 0 152 1 200 0 17 26 5 59 1 65 0 34
2 0 0 0 271 197 175 1 13 20 2 0 0 66 0 34
3 0 0 0 120 6 76 0 6 14 2 0 0 66 0 34
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 318 204 46 1 19 22 5 18 0 100 0 0
1 0 0 0 112 0 40 0 12 15 3 4 0 100 0 0
2 0 0 0 195 93 31 1 11 27 3 0 0 100 0 0
3 0 0 0 108 1 20 0 9 31 3 2 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 569 204 809 3 133 432 1 21 0 99 0 1
1 0 0 0 502 0 1159 1 201 453 1 8 0 97 0 3
2 0 0 0 2305 2197 147 6 51 279 0 0 0 100 0 0
3 0 0 0 242 2 433 10 75 301 1 8 0 100 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 371 227 110 0 31 1470 0 80 0 99 0 1
1 0 0 3 1034 0 2079 0 102 241 0 9 0 10 0 90
2 0 0 0 1892 1718 1558 3 129 214 0 36 0 11 0 89
3 0 0 0 303 6 549 0 50 329 0 5 0 8 0 92
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 1 312 205 24 0 10 39 4 4 0 100 0 0
1 0 0 1 134 0 99 0 10 20 3 6 0 90 0 10
2 0 0 0 187 90 40 0 8 15 4 10 0 90 0 10
3 0 0 1 100 1 19 0 8 9 4 6 0 90 0 10
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 2 528 206 916 8 161 312 1 39 0 99 0 1
1 0 0 1 206 1 350 6 120 147 1 68 0 100 0 0
2 83 0 0 1929 1816 192 9 87 176 0 50 0 100 0 0
3 0 0 0 142 4 66 21 12 460 1 17 1 99 0 0
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 360 228 163 11 25 743 0 13 0 100 0 0
1 0 0 0 1587 0 4126 13 520 1435 0 122 0 24 0 76
2 3 0 0 7785 7669 1499 29 225 933 0 48 0 27 0 73
3 1 0 0 1856 6 4683 0 467 1912 0 37 0 23 0 77
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 323 223 0 0 0 244 0 0 0 100 0 0
1 0 0 0 141 0 277 0 14 11 0 1 0 0 0 100
2 0 0 0 307 259 312 0 16 15 0 3 0 1 0 99
3 0 0 0 113 6 165 0 5 5 0 91 0 0 0 100
CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl usr sys wt idl
0 0 0 0 324 224 0 0 0 203 0 0 0 100 0 0
1 0 0 0 143 0 283 0 11 2 0 13 0 1 0 99
2 0 0 0 312 278 298 0 8 11 0 3 0 1 0 99
3 0 0 0 119 6 175 0 3 5 0 69 0 1 0 99
Looking at prstat we can see the number of nfsd threads is low, much
lower than the 1024 limit:
[private:/tmp] root# prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
22643 daemon 2228K 1776K sleep 60 -20 2:55:52 55% nfsd/18
Interestingly, using prstat -mL to monitor thread latency, we see that a
handful of threads are the culprits consuming massive CPU:
[private:/tmp] root# prstat -mL
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 416 1 0 0 nfsd/1506
22643 daemon 0.0 75 0.0 0.0 0.0 0.0 25 0.0 415 0 0 0 nfsd/1563
22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 417 0 0 0 nfsd/1554
22643 daemon 0.0 74 0.0 0.0 0.0 0.0 26 0.0 419 0 0 0 nfsd/1551
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 26 74 418 0 0 0 nfsd/1553
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1536
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1555
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1539
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1562
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1545
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1559
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 419 1 0 0 nfsd/1541
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1546
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 417 0 0 0 nfsd/1543
22643 daemon 0.0 0.2 0.0 0.0 0.0 0.0 100 0.0 418 0 0 0 nfsd/1560
Total: 33 processes, 218 lwps, load averages: 4.64, 6.20, 5.86
Above, SLP represents "the percentage of time the process has spent
sleeping", LAT represents "the percentage of time the process has spent
waiting for CPU", VCX is the number of voluntary context switches, and
ICX is the number of involuntary context switches.
Using pstack to try to determine what these threads are doing, we see:
[private:/tmp] root# pstack 22643/1693
22643: /usr/lib/nfs/nfsd
----------------- lwp# 1693 / thread# 1693 --------------------
fee589b7 nfssys (e, feba0fc8)
fee5839f _thr_setup (fed4d400) + 51
fee585f0 _lwp_start (fed4d400, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1708
22643: /usr/lib/nfs/nfsd
----------------- lwp# 1708 --------------------------------
fee589b7 nfssys (e, febb0fc8)
fee5839f _thr_setup (fed44c00) + 51
fee585f0 _lwp_start (fed44c00, 0, 0, 0, 0, 0)
[private:/tmp] root# pstack 22643/1664
22643: /usr/lib/nfs/nfsd
----------------- lwp# 1664 / thread# 1664 --------------------
fee589b7 nfssys (e, fec90fc8)
fee5839f _thr_setup (fed43800) + 51
fee585f0 _lwp_start (fed43800, 0, 0, 0, 0, 0)
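Since every user stack bottoms out in nfssys(), whatever is burning CPU
must be in the kernel. A kernel profile limited to nfsd threads ought to
show where (a sketch; arg0 != 0 restricts the samples to kernel PCs):
[private:/] root# dtrace -n 'profile-997hz /execname == "nfsd" && arg0 != 0/ { @[stack()] = count(); }'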
Running zvop_times.d (http://blogs.sun.com/erickustarz/resource/zvop_times.d) I
get an idea of what ZFS is doing:
[private:/] root# dtrace -s zvop_times.d
dtrace: script 'zvop_times.d' matched 66 probes
^C
CPU ID FUNCTION:NAME
1 2 :END
ZFS COUNT
zfs_getsecattr 4
zfs_space 70
zfs_rename 111
zfs_readdir 284
zfs_read 367
zfs_mkdir 670
zfs_setattr 1054
zfs_frlock 1562
zfs_putpage 3916
zfs_write 4110
zfs_create 4178
zfs_remove 7794
zfs_fid 14960
zfs_inactive 17637
zfs_access 20809
zfs_fsync 29668
zfs_lookup 31273
zfs_getattr 175457
ZFS AVG TIME
zfs_fsync 2337
zfs_getattr 2735
zfs_access 2774
zfs_fid 2948
zfs_inactive 4081
zfs_lookup 6067
zfs_frlock 6750
zfs_putpage 6840
zfs_mkdir 9259
zfs_getsecattr 10940
zfs_setattr 15364
zfs_space 25214
zfs_readdir 25726
zfs_remove 26117
zfs_write 31942
zfs_rename 60504
zfs_read 144449
zfs_create 71215587
ZFS SUM TIME
zfs_getsecattr 43762
zfs_space 1765026
zfs_mkdir 6204101
zfs_rename 6715987
zfs_readdir 7306233
zfs_frlock 10544902
zfs_setattr 16194261
zfs_putpage 26787465
zfs_fid 44111892
zfs_read 53013073
zfs_access 57740468
zfs_fsync 69354886
zfs_inactive 71990292
zfs_write 131282275
zfs_lookup 189748320
zfs_remove 203558835
zfs_getattr 479998059
zfs_create 297538724997
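Given how far out of line zfs_create() is, a latency distribution rather
than a single average would show whether this is a few enormous outliers
or every create being slow. A sketch:
[private:/] root# dtrace -n 'fbt::zfs_create:entry { self->ts = timestamp; }
fbt::zfs_create:return /self->ts/ { @["zfs_create (ns)"] = quantize(timestamp - self->ts); self->ts = 0; }'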
[private:/] root# zfs get all joyous
NAME PROPERTY VALUE SOURCE
joyous type filesystem -
joyous creation Fri Jul 28 17:08 2006 -
joyous used 255G -
joyous available 15.7T -
joyous referenced 81.7K -
joyous compressratio 1.21x -
joyous mounted yes -
joyous quota none default
joyous reservation none default
joyous recordsize 128K default
joyous mountpoint /joyous default
joyous sharenfs off default
joyous checksum off local
joyous compression off local
joyous atime off local
joyous devices on default
joyous exec on default
joyous setuid on default
joyous readonly off default
joyous zoned off default
joyous snapdir hidden default
joyous aclmode groupmask default
joyous aclinherit secure default
So what exactly is nfsd doing???
[private:/] root# time dtrace -n 'syscall:::entry /execname == "nfsd"/ { @[probefunc] = count(); }'
dtrace: description 'syscall:::entry ' matched 225 probes
^C
open 3
close 5
lwp_mutex_timedlock 7
lwp_mutex_wakeup 7
pollsys 7
putmsg 8
getmsg 11
fcntl 12
lwp_park 30
mmap 41
munmap 42
fxstat 47
ioctl 58
lwp_continue 100
lwp_create 100
setcontext 100
lwp_kill 101
lwp_exit 110
schedctl 110
lwp_sigmask 138
nfs 203
real 3m22.464s
user 0m0.206s
sys 0m0.222s
And the answer is, not much of anything. Where is all this CPU disappearing to?!
Looking at the stacks:
[private:/] root# time dtrace -n 'syscall::nfs:entry /execname == "nfsd"/ { @[ustack()] = count(); }'
dtrace: description 'syscall::nfs:entry ' matched 1 probe
^C
libc.so.1`_nfssys+0x7
libc.so.1`_thr_setup+0x51
libc.so.1`_lwp_start
22
Not very helpful... let's try this:
[private:/] root# truss -c -p 22643
^C
syscall seconds calls errors
close .000 1
ioctl .000 2
lwp_park .000 1
lwp_unpark .000 1
getmsg .000 1
putmsg .000 1
setustack .000 30
nfssys 106.146 56
mmap .000 20
munmap .000 3
fxstat .000 4
lwp_create .001 30
lwp_exit .000 12
lwp_continue .000 30
lwp_kill .000 13 13
lwp_sigmask .000 14
pollsys .000 1
schedctl .000 12
-------- ------ ----
sys totals: 106.149 232 13
usr time: .001
elapsed: 47.230
Better. Clearly nfssys is a problem, but why?
[private:/] root# dtrace -n 'io:genunix::start { @[pid, execname] = count(); }'
dtrace: description 'io:genunix::start ' matched 3 probes
^C
3 fsflush 1
344 syslogd 2
22643 nfsd 59
0 sched 30346
The fucking scheduler is doing more IO than nfsd!
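To see what sched is actually writing on whose behalf, kernel stacks at
I/O start for pid 0 might help (a sketch; on ZFS, deferred
transaction-group writes are typically issued by kernel threads and so
get charged to sched):
[private:/] root# dtrace -n 'io:::start /pid == 0/ { @[stack()] = count(); }'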
Let's try iotop (http://www.brendangregg.com/DTrace/iotop). Below are
10-second intervals; DISKTIME is in microseconds (1000000 = 1s):
2006 Dec 7 12:06:47, load: 8.08, disk_r: 1408 KB, disk_w: 14482 KB
UID PID PPID CMD DEVICE MAJ MIN D DISKTIME
0 3 0 fsflush sd6 54 389 W 242
0 3 0 fsflush sd6 54 384 W 251
1 22643 1 nfsd sd17 54 1088 R 2733
1 22643 1 nfsd sd14 54 896 R 4420
1 22643 1 nfsd sd13 54 832 R 5296
1 22643 1 nfsd sd18 54 1152 R 7103
1 22643 1 nfsd sd47 54 3008 R 7240
1 22643 1 nfsd sd31 54 1984 R 14827
1 22643 1 nfsd sd32 54 2048 R 16653
1 22643 1 nfsd sd45 54 2880 R 17768
1 22643 1 nfsd sd30 54 1920 R 17962
1 22643 1 nfsd sd22 54 1408 R 18687
1 22643 1 nfsd sd8 54 512 R 19693
1 22643 1 nfsd sd33 54 2112 R 20659
1 22643 1 nfsd sd19 54 1216 R 22933
1 22643 1 nfsd sd43 54 2752 R 23602
1 22643 1 nfsd sd7 54 448 R 27458
1 22643 1 nfsd sd10 54 640 R 28715
1 22643 1 nfsd sd49 54 3136 R 29640
1 22643 1 nfsd sd5 54 320 R 46201
1 22643 1 nfsd sd11 54 704 R 46775
1 22643 1 nfsd sd24 54 1536 R 60011 <-- 0.06s
0 0 0 sched sd16 54 1024 W 140682
0 0 0 sched sd47 54 3008 W 142082
0 0 0 sched sd18 54 1152 W 151720
0 0 0 sched sd13 54 832 W 157107
0 0 0 sched sd14 54 896 W 159044
0 0 0 sched sd17 54 1088 W 164386
0 0 0 sched sd35 54 2240 W 167119
0 0 0 sched sd34 54 2176 W 168552
0 0 0 sched sd37 54 2368 W 171872
0 0 0 sched sd40 54 2560 W 193907
0 0 0 sched sd25 54 1600 W 195662
0 0 0 sched sd33 54 2112 W 196480
Disk response times seem really good actually.
iosnoop (http://www.brendangregg.com/DTrace/iosnoop) shows that sched is doing
a lot of writing, mostly 1k files. Testing on aeon suggests this is normal.
truss -D also points to this nfssys slowness:
[private:/tmp] root# truss -D -p `pgrep nfsd`
...
/2: 0.0001 lwp_create(0xFED71D20, LWP_DETACHED|LWP_SUSPENDED, 0xFED71F48) = 3266
/2: 0.0001 lwp_continue(3266) = 0
/3266: 260128.5978 lwp_create() (returning as new lwp ...) = 0
/3266: 0.0034 setustack(0xFED4BC60)
/3266: nfssys(14, 0xFEC40FC8) (sleeping...)
/3266: 43.1653 nfssys(14, 0xFEC40FC8) = 0
/3266: 0.0074 schedctl() = 0xFEF9F020
/3266: 0.0002 lwp_sigmask(SIG_SETMASK, 0xFFBFFEFF, 0x0000FFF7) = 0xFFBFFEFF [0x0000FFFF]
/3266: 0.0002 lwp_kill(3268, SIG#0) Err#3 ESRCH
/3266: 0.0002 munmap(0xFEC20000, 4096) = 0
/3266: 0.0002 lwp_exit()
Locks...
[private:/tmp] root# lockstat -P sleep 10
Adaptive mutex spin: 20406 events in 10.026 seconds (2035 events/sec)
Count indv cuml rcnt spin Lock Caller
-------------------------------------------------------------------------------
32 100% 100% 0.00 59963092 0xffffffffb282a1c8 dmu_object_alloc+0x75
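To get the full call path behind that dmu_object_alloc() spin, a deeper
capture with stack traces might help (a sketch; -s sets the recorded
stack depth):
[private:/tmp] root# lockstat -P -s 16 sleep 10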
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss