On Sep 6, 2009, at 5:06 PM, James Lever wrote:
On 07/09/2009, at 6:24 AM, Richard Elling wrote:
On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:
On Sun, Sep 6, 2009 at 9:15 AM, James Lever<j...@jamver.id.au> wrote:
I’m experiencing occasional slow responsiveness on an OpenSolaris b118 system, typically noticed when running an ‘ls’ (no extra flags, so no directory service lookups). There is a delay of between 2 and 30 seconds, but no correlation has been noticed between load on the server and the slow return. This problem has only been noticed via NFS (v3; we are migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated, anticipated for snv_124). The problem has been observed both locally on the primary filesystem, in a locally automounted reference (/home/foo), and remotely via NFS.

I'm confused. If "this problem has only been noticed via NFS (v3)", then
how is it "observed locally"?

Sorry, I was meaning to say it had not been noticed using CIFS or iSCSI.

It has been observed in client:/home/user (NFSv3 automount from server:/home/user, redirected to server:/zpool/home/user) and also in server:/home/user (local automount) and server:/zpool/home/user (origin).

Ok, just so I am clear, when you mean "local automount" you are
on the server and using the loopback -- no NFS or network involved?

iostat(1m) is the program for troubleshooting performance issues
related to latency. It will show the latency of nfs mounts as well as
other devices.

What specifically should I be looking for here (using ‘iostat -xen -T d’)? I’m guessing I’ll require a high level of granularity (1s intervals) to see the issue if it is a single disk or similar.

You are looking for I/O that takes seconds to complete or is stuck in
the device: look for the actv column stuck above 1 and asvc_t well above 1000 (milliseconds).
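One way to spot such devices is to filter the actv and asvc_t columns with awk. The sample output below is illustrative only, not captured from a real system; the column positions are assumed from the usual Solaris `iostat -xn` layout with the `-e` error columns appended before the device name:

```shell
# Hypothetical sample of `iostat -xen` output (layout assumed; verify
# column positions against your own iostat(1m) output first).
cat > /tmp/iostat.sample <<'EOF'
                            extended device statistics       ---- errors ---
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
    0.5    2.1    4.0   33.2  0.0  0.1    0.0    4.2   0   1   0   0   0   0 c0t0d0
    0.0    0.3    0.0    1.1  0.0  2.4    0.0 2150.7   0  99   0   0   0   0 c0t1d0
EOF

# Flag devices where actv ($6) is stuck above 1 or asvc_t ($8) is far
# above 1000 ms -- the symptoms described above.
awk '$1 ~ /^[0-9]/ && ($6 > 1 || $8 > 1000) { print $NF, "actv=" $6, "asvc_t=" $8 }' /tmp/iostat.sample
```

In this made-up sample, only c0t1d0 would be flagged; running the same filter on live `iostat -xen 1` output at 1s intervals should catch a transiently stuck disk.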

stat(2) doesn't write, so you can stop worrying about the slog.

My concern here was I may have been trying to write (via other concurrent processes) at the same time as there was a memory fault from the ARC to L2ARC.

stat(2) looks at metadata, which is generally small and compressed.
It is also cached in the ARC, by default. If this is repeatable in a short
period of time, then it is not an I/O problem and you need to look at:
1. the number of files in the directory
2. the locale (ls sorts by default, and your locale affects the sort time)
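Both points can be checked cheaply. `ls -f` disables sorting entirely, and `LC_ALL=C` forces bytewise collation instead of the (potentially slower) locale-aware sort; if either makes the delay vanish on a large directory, the sort was the culprit. The directory path below is illustrative:

```shell
# Make a small throwaway directory to demonstrate the options.
mkdir -p /tmp/lstest
touch /tmp/lstest/banana /tmp/lstest/Apple /tmp/lstest/cherry

ls /tmp/lstest              # sorted in the current locale
LC_ALL=C ls /tmp/lstest     # bytewise sort: uppercase collates first
ls -f /tmp/lstest           # no sorting at all (also implies -a)
```

On a directory with a handful of entries the difference is invisible; on one with hundreds of thousands of entries, `ls -f` can be dramatically faster.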

Rule out the network by looking at retransmissions and ioerrors
with netstat(1m) on both the client and server.
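A sketch of that check: `netstat -s` reports cumulative TCP counters, so run it on both client and server and watch whether the retransmission counts grow between runs. The output below is an illustrative fragment in the Solaris `tcpRetransSegs` style, not real data:

```shell
# Hypothetical fragment of `netstat -s` TCP output (counter names in the
# Solaris MIB style; your output may differ).
cat > /tmp/netstat.sample <<'EOF'
TCP
        tcpActiveOpens      =    4123   tcpPassiveOpens     =    1290
        tcpOutSegs          = 9912345   tcpRetransSegs      =    1744
        tcpRetransBytes     =  954321   tcpInDupAck         =    5120
EOF

# Pull out just the retransmission counters.
grep -i retrans /tmp/netstat.sample
```

If tcpRetransSegs climbs noticeably during one of the slow `ls` episodes, the network is back in the frame despite the clean error/collision counters.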

No errors or collisions from either server or clients observed.

retrans?
As Ross mentioned, wireshark, snoop, or most other network monitors
will show network traffic in detail.
 -- richard

That behavior sounds a lot like a process has a memory leak and is
filling the VM. On Linux there is an OOM killer for these, but on
OpenSolaris, you're the OOM killer.

See rcapd(1m), rcapadm(1m), and rcapstat(1m), along with
"Physical Memory Control Using the Resource Capping Daemon"
in System Administration Guide: Solaris Containers-Resource
Management and Solaris Zones.

Thanks Richard, I’ll have a look at that today and see where I get.

cheers,
James


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
