On Sep 6, 2009, at 5:06 PM, James Lever wrote:
On 07/09/2009, at 6:24 AM, Richard Elling wrote:
On Sep 6, 2009, at 7:53 AM, Ross Walker wrote:
On Sun, Sep 6, 2009 at 9:15 AM, James Lever<j...@jamver.id.au> wrote:
I’m experiencing occasional slow responsiveness on an OpenSolaris b118
system, typically noticed when running an ‘ls’ (no extra flags, so no
directory service lookups). There is a delay of between 2 and 30 seconds,
but no correlation has been noticed between load on the server and the
slow return. This problem has only been noticed via NFS (v3; we are
migrating to NFSv4 once the O_EXCL/mtime bug fix has been integrated,
anticipated for snv_124). The problem has been observed both locally on
the primary filesystem, in a locally automounted reference (/home/foo),
and remotely via NFS.
I'm confused. If "This problem has only been noticed via NFS (v3"
then how is it "observed locally"?
Sorry, I meant to say it had not been noticed using CIFS or iSCSI.
It has been observed in client:/home/user (NFSv3 automount from
server:/home/user, redirected to server:/zpool/home/user), and also
in server:/home/user (local automount) and server:/zpool/home/user
(origin).
Ok, just so I am clear: when you say "local automount", you are
on the server and using the loopback -- no NFS or network involved?
iostat(1m) is the program for troubleshooting performance issues
related to latency. It will show the latency of nfs mounts as well as
other devices.
What specifically should I be looking for here (using ‘iostat -xen -T d’)?
I’m guessing I’ll require a high level of granularity (1s intervals) to
see the issue if it is a single disk or similar.
You are looking for I/O that takes seconds to complete or is stuck in
the device: entries stuck in the actv column (> 1) with asvc_t well
above 1000 ms.
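A minimal filter for this (my sketch, not from the thread; the field
positions assume the `iostat -xn` column layout, so verify them against
the header line on your system):

```shell
# awk condition: actv is field 6 and asvc_t is field 8 in `iostat -xn`
# output (assumption -- check against your header line).
filter='$6+0 > 1 || $8+0 > 1000'

# On the server, sample once per second and print only suspect lines:
#   iostat -xen -T d 1 | awk "$filter"

# Demo against a canned device line with asvc_t = 2493.7 ms:
echo '0.2 3.1 1.6 24.8 0.0 0.3 0.0 2493.7 0 12 0 0 0 0 c1t0d0' | awk "$filter"
```

A healthy device line (asvc_t in the single-digit milliseconds) passes
through the filter silently; only the pathological ones are printed.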
stat(2) doesn't write, so you can stop worrying about the slog.
My concern here was I may have been trying to write (via other
concurrent processes) at the same time as there was a memory fault
from the ARC to L2ARC.
stat(2) looks at metadata, which is generally small and compressed.
It is also cached in the ARC by default. If this is repeatable in a
short period of time, then it is not an I/O problem and you need to
look at:
1. the number of files in the directory
2. the locale (ls sorts by default, and your locale affects the sort
time)
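A quick way to separate those two causes (a generic check of my own, not
from the thread) is to compare a locale-aware ls against a C-locale sort
and an unsorted listing on the slow directory:

```shell
# Point this at the slow directory, e.g. /zpool/home/user from this thread.
d=.

time ls "$d"           > /dev/null   # sorted using the current locale
time LC_ALL=C ls "$d"  > /dev/null   # sorted by raw byte order (cheaper)
time ls -f "$d"        > /dev/null   # no sorting at all (includes dotfiles)
```

If only the first invocation is slow, the time is going into
locale-aware collation rather than I/O.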
Rule out the network by looking at retransmissions and ioerrors
with netstat(1m) on both the client and server.
No errors or collisions from either server or clients observed.
retrans?
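To answer the retrans question concretely (a generic sketch; the exact
counter names differ between Solaris and other systems), snapshot the TCP
retransmission counters on both client and server and watch whether they
grow during a slow ‘ls’:

```shell
# On Solaris the counter of interest is tcpRetransSegs in the tcp MIB;
# the case-insensitive grep also matches other systems' wording.
if command -v netstat >/dev/null 2>&1; then
    netstat -s 2>/dev/null | grep -i retrans || echo "no retrans counters matched"
else
    echo "netstat not found on this host"
fi
```

A counter that climbs only while the stall is happening points at the
network; a flat counter rules it out.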
As Ross mentioned, wireshark, snoop, or most other network monitors
will show network traffic in detail.
-- richard
That behavior sounds a lot like a process has a memory leak and is
filling the VM. On Linux there is an OOM killer for these, but on
OpenSolaris, you're the OOM killer.
See rcapd(1m), rcapadm(1m), and rcapstat(1m) along with the
"Physical Memory Control Using the Resource Capping Daemon"
in System Administration Guide: Solaris Containers-Resource
Management, and Solaris Zones
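For reference, enabling the capping daemon per those man pages looks
roughly like this (a sketch only; the project name user.james and the
4 GB cap are illustrative values, not from the thread):

```shell
# Guarded so the sketch is a no-op on non-Solaris hosts.
if command -v rcapadm >/dev/null 2>&1; then
    # Cap the project's resident set size at 4 GB, then enable rcapd.
    projmod -s -K 'rcap.max-rss=4294967296' user.james
    rcapadm -E
    # Watch capping activity: 3 reports at 5-second intervals.
    rcapstat 5 3
    status=enabled
else
    status=unavailable
    echo "rcapd tools are Solaris-only; nothing to do on this host"
fi
```

rcapstat's output shows, per capped project, how much paging rcapd is
doing to hold the cap, which is the leak signature to look for.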
Thanks Richard, I’ll have a look at that today and see where I get.
cheers,
James
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss