Hi.

snv_39, SPARC.
I have several pools (no protection on ZFS) with several filesystems inside 
each pool.
Data are served by nfsd (over 3000 active threads).
The last change I made was to /etc/system:

   set rpcmod:cotsmaxdupreqs=8192
   set rpcmod:maxdupreqs=8192

Now I observe that every few hours nfsd stops issuing any I/O to most pools, 
with all of its threads in use (over 3000 right now, which is the configured 
limit). Locally I can issue I/O to the ZFS filesystems without any problem. 
After 10-25 minutes the problem goes away on its own, and then it comes back 
later.

Most nfsd threads are hanging in ZFS, waiting in zil_commit() called from 
zfs_fsync().

zpool iostat 1

nfs-s5-p0   2.45T  2.08T      0      0      0      0
nfs-s5-p1   1.60T  2.93T      0      0      0      0
nfs-s5-p2   1.39T  3.14T      0      0      0      0
nfs-s5-p3   41.2G  4.49T      0      0      0      0
nfs-s5-s8   4.40T   137G      0      7      0   255K
----------  -----  -----  -----  -----  -----  -----
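To pick out the stalled pools from a longer capture, a small filter over saved `zpool iostat` output may help. This is only a sketch: it assumes the seven-field data layout shown above (pool, used, avail, read ops, write ops, read bandwidth, write bandwidth), and the function name `idle_pools` is hypothetical.

```shell
# idle_pools: read saved `zpool iostat` output on stdin and print the names
# of pools whose read and write operation counts are both zero.
# Assumed layout of a data line (7 whitespace-separated fields):
#   pool used avail read_ops write_ops read_bw write_bw
# Header and separator lines are skipped because their 4th and 5th fields
# are not the number 0.
idle_pools() {
    awk 'NF == 7 && $4 == 0 && $5 == 0 { print $1 }'
}
```

For example, `zpool iostat 1 5 > /tmp/iostat.out` followed by `idle_pools < /tmp/iostat.out` would list a pool once for every sample in which it was idle (the file name is just an illustration).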

mdb -kw
> ::ps!grep nfsd
R    320      1    320    320      1 0x42300902 0000030035d9a040 nfsd
> 0000030035d9a040::walk thread|::findstack -v
stack pointer for thread 30001305020: 2a100687021
[ 000002a100687021 cv_wait+0x40() ]
  000002a1006870d1 exitlwps+0x11c(0, 200000, 42000002, 30035d9a040, 100000, 30035d9a106)
  000002a100687181 proc_exit+0x1c(1, 0, ff131c80, 0, f, 18afe38)
  000002a100687231 exit+8(1, 0, ff131c80, 0, f, ff3a2400)
  000002a1006872e1 syscall_trap32+0xcc(0, 0, ff131c80, 0, f, ff3a2400)
stack pointer for thread 3004138b920: 2a102176621
[ 000002a102176621 cv_wait+0x40() ]
  000002a1021766d1 zil_commit+0x74(600012742ec, 26ca1, 10, 60001274280, 0, 26ca1)
  000002a102176781 zfs_fsync+0xa8(0, 0, 3000081cf94, 0, 300d7fe1000, 0)
  000002a102176831 fop_fsync+0x14(300d7fea040, 0, 300be5d1358, 3ade4e4, 0, 7ba3d40c)
  000002a1021768e1 rfs3_remove+0x22c(2a102177198, 2a102177398, 0, 2a102177698, 300be5d1358, 2a102177220)
  000002a102176ab1 common_dispatch+0x44c(2a102177698, 300c07cbdc0, 2a102177500, 6003298f200, 7017a1c0, 7bb9c7a8)
  000002a102176dd1 svc_getreq+0x210(300c07cbdc0, 600096837c0, 6003269bc50, 300844084f8, 18feb90, 6003269bac0)
  000002a102176f21 svc_run+0x194(60001125190, 0, 0, 1, 600011251c8, 30035d9a040)
  000002a102176fd1 nfssys+0x1a4(e, ff0a1f9c, 7bb2f800, c, c, 1d0)
  000002a1021772e1 syscall_trap32+0xcc(e, ff0a1f9c, 0, 0, 0, 0)
stack pointer for thread 300be20d600: 2a101862621
[ 000002a101862621 cv_wait+0x40() ]
  000002a1018626d1 zil_commit+0x74(300418bb62c, 26a38, 10, 300418bb5c0, 0, 26a38)
  000002a101862781 zfs_fsync+0xa8(0, 0, 6000724d994, 0, 30060eb0010, 0)
  000002a101862831 fop_fsync+0x14(30182c5fa00, 0, 300be5d0dd8, 3aee1e6, 0, 7ba3d40c)
  000002a1018628e1 rfs3_remove+0x22c(2a101863198, 2a101863398, 0, 2a101863698, 300be5d0dd8, 2a101863220)
  000002a101862ab1 common_dispatch+0x44c(2a101863698, 300c0892c80, 2a101863500, 6002fddb000, 7017a1c0, 7bb9c7a8)
  000002a101862dd1 svc_getreq+0x210(300c0892c80, 6001966b0c0, 6002fe72750, c00, 18feb90, 6002fe725c0)
  000002a101862f21 svc_run+0x194(60001125190, 0, 160, 1, 600011251c8, 30035d9a040)
  000002a101862fd1 nfssys+0x1a4(e, fefe1f9c, 7bb2f800, c, c, 1d0)
  000002a1018632e1 syscall_trap32+0xcc(e, fefe1f9c, 0, 0, 0, 0)
stack pointer for thread 3000131b960: 2a1002ae621
[ 000002a1002ae621 cv_wait+0x40() ]
  000002a1002ae6d1 zil_commit+0x74(6000574486c, 22049, 10, 60005744800, 0, 22049)
  000002a1002ae781 zfs_fsync+0xa8(0, 0, 60005770d14, 0, 3007b62a648, 0)
  000002a1002ae831 fop_fsync+0x14(3008549dd00, 0, 300416fc220, 3ad91ac, 0, 7ba3d40c)
  000002a1002ae8e1 rfs3_remove+0x22c(2a1002af198, 2a1002af398, 0, 2a1002af698, 300416fc220, 2a1002af220)
  000002a1002aeab1 common_dispatch+0x44c(2a1002af698, 600013a96c0, 2a1002af500, 300b7293180, 7017a1c0, 7bb9c7a8)
  000002a1002aedd1 svc_getreq+0x210(600013a96c0, 6002e3d2980, 60021510710, 60003f5d0f8, 18feb90, 60021510580)
  000002a1002aef21 svc_run+0x194(60001125190, 0, 0, 1, 600011251c8, 30035d9a040)
  000002a1002aefd1 nfssys+0x1a4(e, fef31f9c, 7bb2f800, c, c, 1d0)
  000002a1002af2e1 syscall_trap32+0xcc(e, fef31f9c, 0, 0, 0, 0)
stack pointer for thread 300b38dd620: 2a1031f64e1
[ 000002a1031f64e1 cv_wait+0x40() ]
  000002a1031f6591 zil_commit+0x74(300418bbdac, 27f84, 10, 300418bbd40, 39081e08, 27f84)
  000002a1031f6641 zfs_fsync+0xa8(0, 10000, 30090ab13d4, 0, 3014523f8c8, 0)
  000002a1031f66f1 fop_fsync+0x14(3009fde44c0, 10000, 60001002218, a, 39081e08, 7ba3d40c)
  000002a1031f67a1 rfs3_create+0x7bc(2a1031f7500, 2a1031f7080, 1, 0, 60001002218, 2a1031f7220)
  000002a1031f6ab1 common_dispatch+0x44c(2a1031f7698, 300bfd2ce40, 2a1031f7500, 300b70f2340, 7017a1c0, 7bb9b568)
  000002a1031f6dd1 svc_getreq+0x210(300bfd2ce40, 60034dda0c0, 300ea327690, 1e3, 18feb90, 300ea327500)
  000002a1031f6f21 svc_run+0x194(60001125190, 1, 0, 1, 600011251c8, 30035d9a040)
  000002a1031f6fd1 nfssys+0x1a4(e, fedf1f9c, 7bb2f800, c, c, 1d0)
  000002a1031f72e1 syscall_trap32+0xcc(e, fedf1f9c, 0, 0, 0, 0)
[...]
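With over 3000 threads, reading the stacks one by one is not practical, so a rough summary of where the threads are blocked may be more useful. The sketch below assumes the `::findstack` output has been saved to a file and has the same shape as above: a "stack pointer for thread" header, one bracketed line for the current frame, then one line per caller. The function name `waiters_by_caller` is hypothetical.

```shell
# waiters_by_caller: read saved ::findstack output on stdin and count threads
# by the function that called the frame they are currently sitting in
# (the first regular frame after the bracketed one, e.g. zil_commit).
waiters_by_caller() {
    awk '
        # A new thread starts here; the next regular frame is the one we want.
        /^stack pointer for thread/ { pending = 1; next }
        # The bracketed line is the current frame (cv_wait above); skip it.
        /^\[/ { next }
        # Regular frame: "<address> <function>+<offset>(args...".
        # Wrapped argument lines have no "name+" second field and are ignored.
        pending && $2 ~ /^[A-Za-z_][A-Za-z0-9_]*\+/ {
            f = $2
            sub(/\+.*/, "", f)      # keep only the function name
            count[f]++
            pending = 0
        }
        END { for (f in count) print count[f], f }
    ' | sort -rn
}
```

Run as `waiters_by_caller < /tmp/nfsd.stacks` (file name chosen for illustration), it prints one line per caller with the number of threads waiting under it, most common first.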




Using mpstat I can see that one CPU (CPU 27) is 100% utilized, all of it in sys:

CPU minf mjf xcal  intr ithr  csw icsw migr smtx  srw syscl  usr sys  wt idl
  0    0   0   86   214  114    0    0    0    0    0     0    0   1   0  99
  1    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
  2    0   0    0    37    0   73    0    0    0    0     1    0   0   0 100
  3    0   0    3    12    0   22    0    1    1    0     0    0   0   0 100
  4    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
  5    0   0    2    10    0   18    0    0    0    0     0    0   0   0 100
  6    0   0    1     3    0    4    0    0    1    0     0    0   0   0 100
  7    0   0    3     9    0   20    0    0    0    0     0    0   1   0  99
  8    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
  9    0   0    0    10    2   14    0    0    0    0     0    0   1   0  99
 10    0   0    3     3    0    4    0    0    1    0     0    0   0   0 100
 11    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 12    0   0    0     6    0   10    0    1    0    0     0    0   0   0 100
 13    0   0    2     6    0   10    0    0    0    0     0    0   0   0 100
 14    0   0    1     6    0   10    0    0    1    0   226    0   0   0 100
 15    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 16    0   0    0     3    0    4    0    0    0    0     0    0   0   0 100
 17    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 18    0   0    0     4    0    6    0    0    0    0     0    0   0   0 100
 19    0   0    0     6    0   10    0    0    0    0     0    0   0   0 100
 20    0   0    1     5    0    8    0    1    0    0    18    1   0   0  99
 21    0   0   20    24   21    4    0    0    0    0     0    0   0   0 100
 22    0   0   22    47   38   16    0    0    0    0     0    0   0   0 100
 23    0   0    4    16    4   22    0    1    1    0     0    0   0   0 100
 24    0   0    4     5    4    0    0    0    1    0     0    0   0   0 100
 25    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 26    0   0    0     1    0    0    0    0    0    0     0    0   0   0 100
 27    0   0    0     1    0    0    0    0    0    0     0    0 100   0   0
 28    0   0    1    10    0   18    0    0    2    0     5    0   0   0 100
 29    0   0    2     8    0   14    0    0    0    0     0    0   0   0 100
 30    0   0    0     8    0   14    0    0    2    0   165    1   0   0  99
 31    0   0    2    11    0   20    0    0    0    0    19    0   0   0 100
^C
bash-3.00#
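On a 32-CPU box the one busy CPU is easy to miss in the table, so a filter over saved mpstat output could flag it. This is a sketch assuming the 16-column layout shown above, with idl as the last column; the function name `busy_cpus` and the default threshold of 10% idle are arbitrary choices, not anything mpstat provides.

```shell
# busy_cpus: read saved mpstat output on stdin and print CPUs whose idle
# percentage is at or below a threshold (first argument, default 10).
# Assumed columns: CPU minf mjf xcal intr ithr csw icsw migr smtx srw syscl
#                  usr sys wt idl   (usr = $13, sys = $14, idl = $16)
busy_cpus() {
    max_idle=${1:-10}
    awk -v max_idle="$max_idle" '
        # Data lines have 16 fields and a numeric CPU id; headers do not.
        NF == 16 && $1 ~ /^[0-9]+$/ && $16 + 0 <= max_idle + 0 {
            printf "cpu=%s usr=%s sys=%s idl=%s\n", $1, $13, $14, $16
        }'
}
```

For the sample above, `busy_cpus < /tmp/mpstat.out` (hypothetical file name) would report only CPU 27.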

I wanted to investigate with DTrace, but the CPU usage went away before I 
could, and now the system is almost 100% idle but nfsd is still issuing no 
I/O. Locally I can still issue I/O to these ZFS filesystems without any 
problem.
Finally nfsd stopped: I issued kill -9 to nfsd, and it exited after roughly 
10-15 minutes.
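If the busy CPU shows up again, sampling kernel stacks on just that CPU with the DTrace profile provider is probably the quickest way to see what it is doing. The helper below only builds and prints such a command so it can be reviewed before being run as root. The function name `profile_cpu_cmd`, the 997 Hz rate, and the default 10-second duration are assumptions for illustration.

```shell
# profile_cpu_cmd: print (do not run) a DTrace command that samples kernel
# stacks on a single CPU and exits after a fixed number of seconds.
#   $1 = CPU id to sample (e.g. 27)
#   $2 = duration in seconds (default 10)
profile_cpu_cmd() {
    cpu=$1
    secs=${2:-10}
    # profile-997 fires 997 times per second on every CPU; the predicate
    # keeps only samples taken on the chosen CPU. The aggregation of stack
    # counts is printed automatically when dtrace exits.
    printf "dtrace -n 'profile-997 /cpu == %s/ { @[stack()] = count(); } tick-%ss { exit(0); }'\n" "$cpu" "$secs"
}
```

For example, `profile_cpu_cmd 27` prints the command for CPU 27, and the most frequent kernel stacks appear at the end of its output once it is run.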
 
 
This message posted from opensolaris.org
