I'm experiencing occasional slow responsiveness on an OpenSolaris b118
system, typically noticed when running an 'ls' (no extra flags, so no
directory service lookups). The delay is anywhere between 2 and 30
seconds, and no correlation has been found between load on the server
and the slow returns. The problem was first noticed via NFS (v3; we are
migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated,
anticipated for snv_124), but it has since been observed locally on the
primary filesystem, in a locally automounted reference (/home/foo), and
remotely via NFS.
The zpool is a RAIDZ2 comprising 10 x 15kRPM SAS drives behind an LSI
1078 with 512MB BBWC, exposed as RAID0 LUNs (Dell MD1000 behind a PERC
6/E), plus 2 SSDs, each partitioned as a 10GB slog with the remaining
36GB as L2ARC, behind another LSI 1078 with 256MB BBWC (Dell R710
server with a PERC 6/i).
The system is configured as an NFS server (currently serving NFSv3), an
iSCSI target (COMSTAR) and a CIFS server (using the Sun SFW package
running Samba 3.0.34), with authentication against a remote OpenLDAP
server. Automount is in use both locally and remotely (Linux clients).
Locally, /home/* is remounted from the zpool; remotely, /home and
another filesystem (and its children) are mounted using autofs. There
has been some suspicion that automount is the problem, but no
definitive evidence as yet.
The problem has definitely been observed with stats (of some form,
typically '/usr/bin/ls' output) remotely, locally in /home/* and
locally in /zpool/home/* (the true source location). There is a clear
correlation between how recently the directories in question were read
and recurrence of the fault: one user has scripted a regular 'ls' (at
15-minute, 30-minute and hourly intervals so far) of the filesystems of
interest, and since starting down this path the fault has had minimal
noted impact for that user.
I have removed the L2ARC (cache) devices from the pool and the same
behaviour has been observed. My suspicion was that occasional high
synchronous load was causing heavy writes to the slog devices, and that
when a stat was requested it might have been faulting from ARC to L2ARC
before going to the primary data store. The slowness has still been
reported since removing the extra cache devices, which seems to rule
that out.
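
For what it is worth, one thing I was planning to try (before and after
warming the directories with an 'ls') is simply counting ARC hits and
misses per process, along these lines. This assumes the arc-hit/arc-miss
sdt probes are present in b118, which I have not verified, so treat it
as a sketch only:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Rough sketch: count ARC hits and misses per process every 30 seconds,
 * assuming the arc-hit/arc-miss sdt probes exist in this build.
 */

sdt:::arc-hit
{
        @[execname, "hit"] = count();
}

sdt:::arc-miss
{
        @[execname, "miss"] = count();
}

profile:::tick-30s
{
        printa("%-16s %-6s %@8d\n", @);
        trunc(@);
}

The idea is just to see whether the slow stats coincide with a burst of
misses attributed to the 'ls' process.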
Another thought I had was along the lines of filesystem caching and
heavy writes causing read blocking. I have no evidence that this is the
case, but there have been some suggestions on the list recently about
limiting the memory ZFS uses for write caching. Can anybody comment on
the effectiveness of this? (I have 256MB of write cache in front of the
slog SSDs and 512MB in front of the primary storage devices.)
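
In the meantime, to see whether reads are actually stalling behind txg
syncs, I was considering something along these lines. The function
names (spa_sync and zfs_getattr) and the 100ms threshold are my
assumptions and may not match this build exactly:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Sketch only: does a slow getattr coincide with a txg sync in
 * progress?  Assumes spa_sync() and zfs_getattr() are the relevant
 * entry points, and that one pool dominates (the "sync active" flag is
 * approximate if several pools are syncing).
 */

fbt::spa_sync:entry
{
        sync_start = timestamp;
}

fbt::spa_sync:return
/sync_start/
{
        @["spa_sync duration (ns)"] = quantize(timestamp - sync_start);
        sync_start = 0;
}

fbt::zfs_getattr:entry
{
        self->ts = timestamp;
}

/* report getattrs over 100ms, noting whether a sync was active */
fbt::zfs_getattr:return
/self->ts && timestamp - self->ts > 100000000/
{
        printf("%Y slow getattr: %d ms (txg sync %s)\n", walltimestamp,
            (timestamp - self->ts) / 1000000,
            sync_start != 0 ? "active" : "idle");
}

fbt::zfs_getattr:return
/self->ts/
{
        self->ts = 0;
}

If the slow getattrs consistently report a sync in progress, that would
at least point towards write-induced blocking.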
My DTrace is very poor, but I suspect it is the best way to root-cause
this problem. If somebody has any code that might assist in debugging
this and is able to share it, that would be much appreciated.
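
The best I have come up with so far is the very rough starting point
below, timing the stat and getdents calls a local 'ls' makes. The
wildcarded syscall names are a guess at covering stat/stat64/lstat64/
getdents64 on this build, and it obviously will not see the in-kernel
NFSv3 service path:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/*
 * Hypothetical starting point: time the stat and getdents variants
 * issued by a local 'ls'.
 */

syscall::*stat*:entry,
syscall::getdents*:entry
/execname == "ls"/
{
        self->ts = timestamp;
}

/* report anything slower than one second as it happens */
syscall::*stat*:return,
syscall::getdents*:return
/self->ts && timestamp - self->ts > 1000000000/
{
        printf("%Y %s took %d ms\n", walltimestamp, probefunc,
            (timestamp - self->ts) / 1000000);
}

/* and keep a latency distribution per syscall, printed on Ctrl-C */
syscall::*stat*:return,
syscall::getdents*:return
/self->ts/
{
        @[probefunc] = quantize(timestamp - self->ts);
        self->ts = 0;
}

Any call over a second is printed as it happens, and the per-syscall
latency distributions print on Ctrl-C.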
Any other suggestions for how to identify this fault and work around
it would be greatly appreciated.
cheers,
James