Posting to zfs-discuss. There's no reason this needs to be
kept confidential.
5-disk RAIDZ2 - doesn't that equate to only 3 data disks per group?
Seems pointless - they'd be much better off using mirrors, which are a
better choice for random I/O...
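Back-of-the-envelope: a RAIDZ2 vdev has to touch all of its data disks
for every random read, so 8 vdevs give roughly 8 disks' worth of random
read IOPS, while the same 40 drives as 20 mirror pairs would give roughly
20 vdevs' worth (and both halves of a mirror can service reads). The
capacity cost is modest - 1/2 of raw instead of 3/5. A rebuilt layout
might look something like this; device names are placeholders and the
pool would of course have to be destroyed and recreated:

# zpool create mdpool \
    mirror c0t0d0 c1t0d0 \
    mirror c4t0d0 c6t0d0 \
    mirror c7t0d0 c0t1d0 \
    ...one "mirror diskA diskB" clause per pair...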
Looking at this now...
/jim
Jeff Savit wrote:
Hi all,
I'm looking for suggestions for the following situation: I'm helping
another SE with a customer running a Thumper with a large ZFS pool,
used mostly as an NFS server, who is disappointed with its performance.
The storage is an intermediate holding place for data to be fed into a
relational database, and the complaint is that the NFS side can't keep
up with the data feeds being written to it as flat files.
The ZFS pool has 8 five-disk RAIDZ2 groups; 7.32TB is in use with
1.74TB available. There is plenty of idle CPU as shown by vmstat and
mpstat. iostat shows queued I/O, and I'm not happy about the total
latencies - wsvc_t in excess of 75ms at times. Average I/O size is
~60KB per read and only ~2.5KB per write. The Evil Tuning Guide tells
me that RAIDZ2 is happiest with long reads and writes, and that is not
the use case here.
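(Those per-read/per-write averages are just the per-device iostat
figures below divided out - for example, for c6t0d0:
996.9 kr/s / 15.8 r/s ~= 63KB per read, and
233.1 kw/s / 95.9 w/s ~= 2.4KB per write.)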
I was surprised to see commands like tar, rm, and chown running locally
on the NFS server, so it looks like they are doing file maintenance and
pruning locally at the same time the pool is being accessed remotely.
That would account for the short write lengths and for the high ZFS ACL
activity shown by DTrace. I also wonder if there is a lot of synchronous
I/O that would benefit from a separately defined ZIL device (whether SSD
or not), so I've asked them to look for fsync activity.
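Two quick ways to check that, in case it's useful: the one-liner counts
zil_commit() calls by process, which should catch fsync and other
synchronous writes, and the zpool line is only a sketch with a
placeholder device, applicable only if they do turn out to be
sync-bound:

# dtrace -n 'fbt::zil_commit:entry { @[execname] = count(); }'
# zpool add mdpool log cXtYd0     (placeholder slog device)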
Data collected thus far is listed below. I've asked for verification
of the Solaris 10 level (I believe it's S10u6) and ZFS recordsize.
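(Concretely, something like:
# cat /etc/release
# zfs get -r recordsize mdpool
with the pool name taken from the zpool iostat output below.)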
Any suggestions will be appreciated.
regards, Jeff
---- stuff starts here ----
zpool iostat -v gives figures like:
bash-3.00# zpool iostat -v
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
mdpool      7.32T  1.74T    290    455  1.57M  3.21M
  raidz2     937G   223G     36     56   201K   411K
    c0t0d0      -      -     18     40  1.13M   141K
    c1t0d0      -      -     18     40  1.12M   141K
    c4t0d0      -      -     18     40  1.13M   141K
    c6t0d0      -      -     18     40  1.13M   141K
    c7t0d0      -      -     18     40  1.13M   141K
---the other 7 raidz2 groups have almost identical numbers on their devices---
iostat -iDnxz looks like:
extended device statistics
r/s w/s kr/s kw/s wait actv wsvc_t asvc_t %w %b device
0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.1 0 0 c5t0d0
15.8 95.9 996.9 233.1 4.3 1.3 38.2 12.0 20 37 c6t0d0
16.1 95.6 1018.5 232.4 2.5 2.6 22.2 23.2 16 36 c7t0d0
16.1 96.0 1012.5 232.8 2.8 2.9 24.5 26.1 19 38 c4t0d0
16.0 93.1 1012.9 242.2 3.6 1.5 33.2 14.2 18 36 c5t1d0
15.9 82.2 1000.5 235.0 1.9 1.6 19.2 16.0 12 31 c5t2d0
16.6 95.6 1046.7 232.7 2.5 2.7 22.2 23.7 18 37 c0t0d0
16.6 96.1 1042.4 232.8 4.7 0.6 42.0 5.2 19 38 c1t0d0
...snip...
16.5 95.4 1027.2 263.0 5.9 0.4 53.0 3.6 26 40 c0t4d0
16.6 95.4 1041.1 263.6 3.9 1.0 34.5 9.3 18 36 c1t4d0
16.8 99.1 1060.6 248.6 7.2 0.7 62.0 6.0 32 45 c0t5d0
16.5 99.6 1034.7 248.9 8.2 1.1 70.5 9.1 38 48 c1t5d0
17.0 82.5 1072.9 219.8 4.8 0.5 48.4 4.7 21 38 c0t6d0
prstat looks like:
bash-3.00# prstat
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
815 daemon 3192K 2560K sleep 60 -20 83:10:07 0.6% nfsd/24
27918 root 1092K 920K cpu2 37 4 0:01:37 0.2% rm/1
19142 root 248M 247M sleep 60 0 1:24:24 0.1% chown/1
28794 root 2552K 1304K sleep 59 0 0:00:00 0.1% tar/1
29957 root 1192K 908K sleep 59 0 0:57:30 0.1% find/1
14737 root 7620K 1964K sleep 59 0 0:03:56 0.0% sshd/1
...
prstat -Lm looks like:
bash-3.00# prstat -Lm
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
27918 root 0.0 0.9 0.0 0.0 0.0 0.0 99 0.0 194 7 2K 0 rm/1
28794 root 0.1 0.6 0.0 0.0 0.0 0.0 99 0.0 209 10 909 0 tar/1
19142 root 0.0 0.6 0.0 0.0 0.0 0.0 99 0.0 224 3 1K 0 chown/1
29957 root 0.0 0.4 0.0 0.0 0.0 0.0 100 0.0 213 6 420 0 find/1
815 daemon 0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 197 0 0 0 nfsd/28230
815 daemon 0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 191 0 0 0 nfsd/28222
815 daemon 0.0 0.3 0.0 0.0 0.0 0.0 100 0.0 185 0 0 0 nfsd/28211
---many more nfsd lines of similar appearance---
A small DTrace script for ZFS gives me:
# dtrace -n 'fbt::zfs*:entry { @[pid,execname,probefunc] = count() } END { trunc(@,20); printa(@) }'
^C
...some lines trimmed...
28835 tar zfs_dirlook 67761
28835 tar zfs_lookup 67761
28835 tar zfs_zaccess 69166
28835 tar zfs_dirent_lock 71083
28835 tar zfs_dirent_unlock 71084
28835 tar zfs_zaccess_common
28835 tar zfs_acl_node_read 77251
28835 tar zfs_acl_node_read_internal 77251
28835 tar zfs_acl_alloc 78656
28835 tar zfs_acl_free 78656
27918 rm zfs_acl_alloc 85888
27918 rm zfs_acl_free 85888
27918 rm zfs_acl_node_read 85888
27918 rm zfs_acl_node_read_internal 85888
27918 rm zfs_zaccess_common 85888
--
Jeff Savit
Principal Field Technologist
Sun Microsystems, Inc. Phone: 732-537-3451 (x63451)
2398 E Camelback Rd Email: jeff.sa...@sun.com
Phoenix, AZ 85016 http://blogs.sun.com/jsavit/
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss