On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:

On Sun, Sep 6, 2009 at 9:15 AM, James Lever<j...@jamver.id.au> wrote:
I’m experiencing occasional slow responsiveness on an OpenSolaris b118
system, typically noticed when running an ‘ls’ (no extra flags, so no
directory service lookups). There is a delay of between 2 and 30 seconds,
and no correlation has been noticed between load on the server and the slow
return. This problem has only been noticed via NFS (v3; we are migrating to
NFSv4 once the O_EXCL/mtime bug fix has been integrated, anticipated for
snv_124). The problem has been observed both locally on the primary
filesystem, in a locally automounted reference (/home/foo), and remotely via
NFS.

I'm confused.  If "This problem has only been noticed via NFS (v3" then
how is it "observed locally?"

The zpool is a RAIDZ2 of 10 * 15kRPM SAS drives behind an LSI 1078 w/ 512MB
BBWC, exposed as RAID0 LUNs (Dell MD1000 behind a PERC 6/E), with 2x SSDs
each partitioned as a 10GB slog and the remaining 36GB as L2ARC, behind
another LSI 1078 w/ 256MB BBWC (Dell R710 server with PERC 6/i).
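
For reference, an equivalent pool layout would look something like the
following (a sketch only: the pool and device names are placeholders, and it
isn't stated whether the two slog partitions are mirrored or added as two
separate log devices):

  # illustrative layout; pool name and c#t#d# names are placeholders for the RAID0 LUNs
  zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 \
                           c1t5d0 c1t6d0 c1t7d0 c1t8d0 c1t9d0 \
      log mirror c2t0d0s0 c2t1d0s0 \
      cache c2t0d0s1 c2t1d0s1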

The system is configured as an NFS (currently serving NFSv3), iSCSI
(COMSTAR) and CIFS server (the latter using the Sun SFW package running
Samba 3.0.34), with authentication against a remote OpenLDAP server.

Automount is in use both locally and remotely (Linux clients). Locally,
/home/* is remounted from the zpool; remotely, /home and another filesystem
(and its children) are mounted using autofs. There was some suspicion that
automount was the problem, but there is no definitive evidence as yet.

The problem has definitely been observed with stats (of some form, typically
‘/usr/bin/ls’ output) remotely, locally in /home/*, and locally in
/zpool/home/* (the true source location). There is a clear correlation
between how recently the directories in question were read and recurrence of
the fault: one user has scripted a regular ‘ls’ (15-minute, 30-minute and
hourly tests so far) of the filesystems of interest, and since starting down
this path the fault has had minimal noted impact for them.
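
A minimal sketch of that sort of workaround (the path and interval are
placeholders) would be a crontab entry along these lines:

  # keep directory metadata warm by listing it every 15 minutes (illustrative only)
  0,15,30,45 * * * * /usr/bin/ls /home/foo > /dev/null 2>&1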

iostat(1m) is the program for troubleshooting performance issues
related to latency. It will show the latency of nfs mounts as well as
other devices.
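
For example (flags as on OpenSolaris; the interval is arbitrary):

  # extended per-device stats, logical names, skip idle devices, 5-second samples
  iostat -xnz 5
  # wsvc_t/asvc_t are wait and active service times in milliseconds;
  # NFS client mounts show up alongside the disk devices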

I have removed the l2arc(s) (cache devices) from the pool and the same
behaviour has been observed. My suspicion was that occasional high
synchronous load was causing heavy writes to the slog devices, and that when
a stat was requested it may have been faulting from ARC to L2ARC before
going to the primary data store. The slowness has still been reported since
removing the extra cache devices.
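
One way to sanity-check that theory is to watch the ARC kstats while the
stalls occur; a sketch (the interesting counters are the hit/miss pairs and,
when cache devices are present, the l2_* counters):

  # dump ZFS ARC statistics; repeat or diff over time to see miss rates
  kstat -p zfs:0:arcstats | egrep 'hits|misses'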

Another thought I had was along the lines of filesystem caching and heavy
writes causing read blocking. I have no evidence that this is the case, but
there have been some suggestions on the list recently about limiting the ZFS
memory usage for write caching. Can anybody comment on the effectiveness of
this (I have 256MB of write cache in front of the slog SSDs and 512MB in
front of the primary storage devices)?
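
For anyone wanting to experiment with that, the knob that has been mentioned
is an /etc/system tunable; a hedged sketch (the value is arbitrary, and I'm
assuming zfs_write_limit_override is present in b118):

  * /etc/system -- cap the size of a ZFS transaction group (bytes); reboot required
  set zfs:zfs_write_limit_override = 0x20000000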

stat(2) doesn't write, so you can stop worrying about the slog.


My DTrace is very poor, but I suspect it is the best way to root-cause this
problem. If somebody has any code that may assist in debugging it and is
able to share it, that would be much appreciated.
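
As a starting point, a rough DTrace sketch (assuming the slow calls are
stat-family syscalls; the 1-second threshold is arbitrary):

  #!/usr/sbin/dtrace -s
  /* record entry time for stat-family syscalls */
  syscall::stat*:entry,
  syscall::lstat*:entry,
  syscall::fstat*:entry
  { self->ts = timestamp; }

  /* report any call that took longer than 1 second */
  syscall::stat*:return,
  syscall::lstat*:return,
  syscall::fstat*:return
  /self->ts && timestamp - self->ts > 1000000000/
  {
      printf("%s %s took %d ms\n", execname, probefunc,
          (timestamp - self->ts) / 1000000);
  }

  /* clear the per-thread timestamp */
  syscall::stat*:return,
  syscall::lstat*:return,
  syscall::fstat*:return
  /self->ts/
  { self->ts = 0; }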

Any other suggestions for how to identify this fault and work around it
would be greatly appreciated.

Rule out the network by looking at retransmissions and ioerrors
with netstat(1m) on both the client and server.
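
For example (Solaris syntax; the Linux clients' equivalents differ):

  # per-interface input/output error and collision counts
  netstat -i
  # TCP retransmission counters
  netstat -s -P tcp | grep -i retrans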

That behavior sounds a lot like a process has a memory leak and is
filling the VM. On Linux there is an OOM killer for these, but on
OpenSolaris, you're the OOM killer.

See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with the
"Physical Memory Control Using the Resource Capping Daemon" chapter
in System Administration Guide: Solaris Containers-Resource
Management and Solaris Zones.
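
A minimal sketch of that, assuming a per-project RSS cap is what's wanted
(the project name and cap value are placeholders):

  # enable the resource capping daemon
  rcapadm -E
  # cap the resident set size of an example project at 4GB
  projmod -s -K 'rcap.max-rss=4GB' user.someuser
  # watch cap activity at 5-second intervals
  rcapstat 5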
 -- richard

You have iSCSI, NFS and CIFS to choose from (the most obvious); try
restarting them one at a time during downtime and see if performance
improves after each restart, to find the culprit.
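
A hedged sketch of that, assuming the stock SMF FMRIs (the Samba FMRI in
particular depends on how the SFW package registers it):

  # restart one service at a time, checking responsiveness in between
  svcadm restart svc:/network/nfs/server:default
  svcadm restart svc:/network/iscsi/target:default    # COMSTAR iSCSI target
  svcadm restart svc:/network/samba:default           # assumed FMRI for SFW Samba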

-Ross
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
