On Sep 6, 2009, at 5:06 PM, James Lever wrote:
On 07/09/2009, at 6:24 AM, Richard Elling wrote:
On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:
On Sun, Sep 6, 2009 at 9:15 AM, James Lever<j...@jamver.id.au> wrote:
I’m experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an ‘ls’ (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated, anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo), and remotely via NFS.

I'm confused. If "this problem has only been noticed via NFS (v3)", then
how is it "observed locally"?

Sorry, I was meaning to say it had not been noticed using CIFS or iSCSI.

It has been observed in client:/home/user (NFSv3 automount from server:/home/user, redirected to server:/zpool/home/user) and also in server:/home/user (local automount) and server:/zpool/home/user (origin).

Ok, just so I am clear, when you mean "local automount" you are
on the server and using the loopback -- no NFS or network involved?

iostat(1m) is the program for troubleshooting performance issues
related to latency. It will show the latency of nfs mounts as well as
other devices.

What specifically should I be looking for here (using ‘iostat -xen -T d’)? I’m guessing I’ll require a high level of granularity (1s intervals) to see the issue if it is a single disk or similar.

You are looking for I/O that takes seconds to complete or is stuck in
the device: look for the actv column stuck above 1 and asvc_t well above 1000 (milliseconds).
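One way to spot such devices is to filter the actv and asvc_t columns with awk. The sample output below is illustrative only, not captured from a real system; the column positions are assumed from the usual Solaris `iostat -xn` layout with the `-e` error columns appended before the device name:

```shell
# Hypothetical sample of `iostat -xen` output (layout assumed; verify
# column positions against your own iostat(1m) output first).
cat > /tmp/iostat.sample <<'EOF'
                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.5    2.1    4.0   33.2  0.0  0.1    0.0    4.2   0   1   0   0   0   0 c0t0d0
    0.0    0.3    0.0    1.1  0.0  2.4    0.0 2150.7   0  99   0   0   0   0 c0t1d0
EOF

# Flag devices where actv ($6) is stuck above 1 or asvc_t ($8) is far
# above 1000 ms -- the symptoms described above.
awk '$1 ~ /^[0-9]/ && ($6 > 1 || $8 > 1000) { print $NF, "actv=" $6, "asvc_t=" $8 }' /tmp/iostat.sample
```

In this made-up sample, only c0t1d0 would be flagged; running the same filter on live `iostat -xen 1` output at 1s intervals should catch a transiently stuck disk.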

stat(2) doesn't write, so you can stop worrying about the slog.

My concern here was I may have been trying to write (via other concurrent processes) at the same time as there was a memory fault from the ARC to L2ARC.

stat(2) looks at metadata, which is generally small and compressed.
It is also cached in the ARC, by default. If this is repeatable in a short
period of time, then it is not an I/O problem and you need to look at:
1. the number of files in the directory
2. the locale (ls sorts by default, and your locale affects the sort time)
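Both points can be checked cheaply. `ls -f` disables sorting entirely, and `LC_ALL=C` forces bytewise collation instead of the (potentially slower) locale-aware sort; if either makes the delay vanish on a large directory, the sort was the culprit. The directory path below is illustrative:

```shell
# Make a small throwaway directory to demonstrate the options.
mkdir -p /tmp/lstest
touch /tmp/lstest/banana /tmp/lstest/Apple /tmp/lstest/cherry

ls /tmp/lstest              # sorted in the current locale
LC_ALL=C ls /tmp/lstest     # bytewise sort: uppercase collates first
ls -f /tmp/lstest           # no sorting at all (also implies -a)
```

On a directory with a handful of entries the difference is invisible; on one with hundreds of thousands of entries, `ls -f` can be dramatically faster.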

Rule out the network by looking at retransmissions and ioerrors
with netstat(1m) on both the client and server.
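A sketch of that check: `netstat -s` reports cumulative TCP counters, so run it on both client and server and watch whether the retransmission counts grow between runs. The output below is an illustrative fragment in the Solaris `tcpRetransSegs` style, not real data:

```shell
# Hypothetical fragment of `netstat -s` TCP output (counter names in the
# Solaris MIB style; your output may differ).
cat > /tmp/netstat.sample <<'EOF'
TCP
        tcpActiveOpens      =    4123   tcpPassiveOpens     =    1290
        tcpOutSegs          = 9912345   tcpRetransSegs      =    1744
        tcpRetransBytes     =  954321   tcpInDupAck         =    5120
EOF

# Pull out just the retransmission counters.
grep -i retrans /tmp/netstat.sample
```

If tcpRetransSegs climbs noticeably during one of the slow `ls` episodes, the network is back in the frame despite the clean error/collision counters.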

No errors or collisions from either server or clients observed.

retrans?
As Ross mentioned, wireshark, snoop, or most other network monitors
will show network traffic in detail.
 -- richard

That behavior sounds a lot like a process has a memory leak and is
filling the VM. On Linux there is an OOM killer for these, but on
OpenSolaris, you're the OOM killer.

See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with
"Physical Memory Control Using the Resource Capping Daemon"
in System Administration Guide: Solaris Containers-Resource
Management and Solaris Zones.

Thanks Richard, I’ll have a look at that today and see where I get.

cheers,
James


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
