Posting to zfs-discuss. There's no reason this needs to be
kept confidential.
5-disk RAIDZ2 - doesn't that equate to only 3 data disks per group?
Seems pointless - they'd be much better off using mirrors, which are a
better choice for random I/O...
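Back-of-the-envelope: a RAIDZ2 vdev has to touch all of its data disks
for every random read, so 8 vdevs give roughly 8 disks' worth of random
read IOPS, while the same 40 drives as 20 mirror pairs would give roughly
20 vdevs' worth (and both halves of a mirror can service reads). The
capacity cost is modest - 1/2 of raw instead of 3/5. A rebuilt layout
might look something like this; device names are placeholders and the
pool would of course have to be destroyed and recreated:

# zpool create mdpool \
    mirror c0t0d0 c1t0d0 \
    mirror c4t0d0 c6t0d0 \
    mirror c7t0d0 c0t1d0 \
    ...one "mirror diskA diskB" clause per pair...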
Looking at this now...
/jim
Jeff Savit wrote:
Hi all,
I'm looking for suggestions for the following situation: I'm helping
another SE with a customer running a Thumper with a large ZFS pool,
used mostly as an NFS server, who is disappointed with its performance.
The storage is an intermediate holding place for data to be fed into a
relational database, and the complaint is that the NFS side can't keep
up with the data feeds being written to it as flat files.
The ZFS pool has 8 five-disk RAIDZ2 groups; 7.32TB is in use with
1.74TB available. There is plenty of idle CPU as shown by vmstat and
mpstat. iostat shows queued I/O, and I'm not happy about the total
latencies - wsvc_t in excess of 75ms at times. Average I/O size is
~60KB per read and only ~2.5KB per write. The Evil Tuning Guide tells
me that RAIDZ2 is happiest with long reads and writes, and that is not
the use case here.
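(Those per-read/per-write averages are just the per-device iostat
figures below divided out - for example, for c6t0d0:
996.9 kr/s / 15.8 r/s ~= 63KB per read, and
233.1 kw/s / 95.9 w/s ~= 2.4KB per write.)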
I was surprised to see commands like tar, rm, and chown running locally
on the NFS server, so it looks like they are doing file maintenance and
pruning locally at the same time the pool is being accessed remotely.
That would account for the short write lengths and for the high ZFS ACL
activity shown by DTrace. I also wonder if there is a lot of synchronous
I/O that would benefit from a separately defined ZIL device (whether SSD
or not), so I've asked them to look for fsync activity.
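Two quick ways to check that, in case it's useful: the one-liner counts
zil_commit() calls by process, which should catch fsync and other
synchronous writes, and the zpool line is only a sketch with a
placeholder device, applicable only if they do turn out to be
sync-bound:

# dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); }'
# zpool add mdpool log cXtYd0     (placeholder slog device)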
Data collected thus far is listed below. I've asked for verification
of the Solaris 10 level (I believe it's S10u6) and ZFS recordsize.
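(Concretely, something like:
# cat /etc/release
# zfs get -r recordsize mdpool
with the pool name taken from the zpool iostat output below.)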
Any suggestions will be appreciated.
regards, Jeff
---- stuff starts here ----
zpool iostat -v gives figures like:
bash-3.00# zpool iostat -v
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
mdpool      7.32T  1.74T    290    455  1.57M  3.21M
  raidz2     937G   223G     36     56   201K   411K
    c0t0d0      -      -     18     40  1.13M   141K
    c1t0d0      -      -     18     40  1.12M   141K
    c4t0d0      -      -     18     40  1.13M   141K
    c6t0d0      -      -     18     40  1.13M   141K
    c7t0d0      -      -     18     40  1.13M   141K
---the other 7 raidz2 groups have almost identical numbers on their devices---
iostat -iDnxz looks like:
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0 0 c5t0d0
15.8 95.9 996.9 233.1 4.3 1.3 38.2 12.0 20 37 c6t0d0
16.1 95.6 1018.5 232.4 2.5 2.6 22.2 23.2 16 36 c7t0d0
16.1 96.0 1012.5 232.8 2.8 2.9 24.5 26.1 19 38 c4t0d0
16.0 93.1 1012.9 242.2 3.6 1.5 33.2 14.2 18 36 c5t1d0
15.9 82.2 1000.5 235.0 1.9 1.6 19.2 16.0 12 31 c5t2d0
16.6 95.6 1046.7 232.7 2.5 2.7 22.2 23.7 18 37 c0t0d0
16.6 96.1 1042.4 232.8 4.7 0.6 42.0 5.2 19 38 c1t0d0
...snip...
16.5 95.4 1027.2 263.0 5.9 0.4 53.0 3.6 26 40 c0t4d0
16.6 95.4 1041.1 263.6 3.9 1.0 34.5 9.3 18 36 c1t4d0
16.8 99.1 1060.6 248.6 7.2 0.7 62.0 6.0 32 45 c0t5d0
16.5 99.6 1034.7 248.9 8.2 1.1 70.5 9.1 38 48 c1t5d0
17.0 82.5 1072.9 219.8 4.8 0.5 48.4 4.7 21 38 c0t6d0
prstat looks like:
bash-3.00# prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
815 daemon 3192K 2560K sleep 60 -20 83:10:07 0.6% nfsd/24
27918 root 1092K 920K cpu2 37 4 0:01:37 0.2% rm/1
19142 root 248M 247M sleep 60 0 1:24:24 0.1% chown/1
28794 root 2552K 1304K sleep 59 0 0:00:00 0.1% tar/1
29957 root 1192K 908K sleep 59 0 0:57:30 0.1% find/1
14737 root 7620K 1964K sleep 59 0 0:03:56 0.0% sshd/1
...
prstat -Lm looks like:
bash-3.00# prstat -Lm
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
27918 root 0.0 0.9 0.0 0.0 0.0 0.0 99 0.0 194 7 2K 0 rm/1
28794 root 0.1 0.6 0.0 0.0 0.0 0.0 99 0.0 209 10 909 0 tar/1
19142 root 0.0 0.6 0.0 0.0 0.0 0.0 99 0.0 224 3 1K 0 chown/1
29957 root 0.0 0.4 0.0 0.0 0.0 0.0 100 0.0 213 6 420 0 find/1
815 daemon 0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 197 0 0 0 nfsd/28230
815 daemon 0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 191 0 0 0 nfsd/28222
815 daemon 0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 185 0 0 0 nfsd/28211
---many more nfsd lines of similar appearance---
A small DTrace script for ZFS gives me:
# dtrace -n 'fbt::zfs*:entry { @[pid,execname,probefunc] = count() } END { trunc(@,20); printa(@) }'
^C
...some lines trimmed...
28835 tar zfs_dirlook 67761
28835 tar zfs_lookup 67761
28835 tar zfs_zaccess 69166
28835 tar zfs_dirent_lock 71083
28835 tar zfs_dirent_unlock 71084
28835 tar zfs_zaccess_common
28835 tar zfs_acl_node_read 77251
28835 tar zfs_acl_node_read_internal 77251
28835 tar zfs_acl_alloc 78656
28835 tar zfs_acl_free 78656
27918 rm zfs_acl_alloc 85888
27918 rm zfs_acl_free 85888
27918 rm zfs_acl_node_read 85888
27918 rm zfs_acl_node_read_internal 85888
27918 rm zfs_zaccess_common 85888
--
Jeff Savit
Principal Field Technologist
Sun Microsystems, Inc. Phone: 732-537-3451 (x63451)
2398 E Camelback Rd Email: jeff.sa...@sun.com
Phoenix, AZ 85016 http://blogs.sun.com/jsavit/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss