I'm fighting an identical problem here and am very interested in this thread.

Solaris 10 127112-11 boxes running ZFS on a Fibre Channel RAID-5 device (hardware RAID).

Randomly, one LUN on a machine will stop writing for about 10-15 minutes (during a busy time of day) and then suddenly come back to life with a burst of write activity. Reads continue the whole time.

I just captured this today (problem volume is sd3):

                        extended device statistics
device     r/s      w/s     kr/s      kw/s   wait   actv   svc_t   %w    %b
sd3      122.6      0.0   7519.4       0.0    0.0    1.2     9.9    0    94
sd19      19.2     42.4   1121.5     284.2    0.0    0.3     4.5    0    17

sd3      140.2      0.0   7387.9       0.0    0.0    1.4     9.9    0    93
sd19      13.6     37.6    870.3     303.5    0.0    0.2     3.9    0    13


Then after a few minutes...

                        extended device statistics
device     r/s      w/s     kr/s      kw/s   wait   actv   svc_t   %w    %b
sd3       32.0   1375.3   1988.6   10631.2    0.0    8.4     5.9    1    63
sd19      13.0     41.6    701.9     246.7    0.0    0.2     3.8    0    12

sd3       13.8   2844.3    883.3   26842.2    0.0   29.9    10.5    2   100
sd19      19.4     52.2   1229.8     408.4    0.0    0.3     4.3    0    17

sd3        1.6    889.5     55.6    8856.7    0.0   35.0    39.3    1   100
sd19      22.8     45.6   1459.1     344.3    0.0    0.3     5.0    0    21


Then back to 'normal'...

                        extended device statistics
device     r/s      w/s     kr/s      kw/s   wait   actv   svc_t   %w    %b
sd3       62.0    179.4   3546.5    1086.0    0.0    1.5     6.3    0    48
sd19      15.4     38.8    927.1     223.9    0.0    0.2     3.8    0    14

sd3       26.2    128.6   1476.7     994.8    0.0    0.7     4.3    0    23
sd19      15.8     52.2    998.8     357.7    0.0    0.3     4.0    0    16
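
(In case anyone wants to watch for the same pattern: the tables above are
Solaris extended device statistics.  Something along these lines will produce
them; the device names and the 30-second interval are only an example, not
necessarily what I ran.)

    # extended device statistics for the two LUNs in question, every 30 seconds
    iostat -x sd3 sd19 30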



During the write stall, all my app servers were hung, stuck in write threads. The ZFS machines are Apache/WebDAV boxes.
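
In case it helps anyone chasing the same thing, here's roughly what I plan to
capture next time it wedges; the pid is a placeholder, and the mdb command
needs root and dumps a lot of output:

    # user-level stacks of one hung Apache/WebDAV process (pid is a placeholder)
    pstack 12345

    # kernel stacks of every thread, to spot writers blocked inside ZFS
    echo "::threadlist -v" | mdb -k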

I'm in the process of migrating these LUNs from hardware RAID-5 to "Enhanced JBOD" so that I can build raidz2 vdevs instead and see if that helps, but that is going to take months.
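
The rough shape of each rebuilt pool would be something like the following;
the pool name and device names are placeholders, not our actual LUNs:

    # one raidz2 vdev across six JBOD LUNs (pool and device names are placeholders)
    zpool create webpool raidz2 c4t0d0 c4t1d0 c4t2d0 c4t3d0 c4t4d0 c4t5d0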

Any ideas?

Pat S.



Gary Mills wrote:
On Sat, Apr 18, 2009 at 04:27:55PM -0500, Gary Mills wrote:
We have an IMAP server with ZFS for mailbox storage that has recently
become extremely slow on most weekday mornings and afternoons.  When
one of these incidents happens, the number of processes increases, the
load average increases, but ZFS I/O bandwidth decreases.  Users notice
very slow response to IMAP requests.  On the server, even `ps' becomes
slow.

After I moved a couple of Cyrus databases from ZFS to UFS on Sunday
morning, the server seemed to run quite nicely.  One of these
databases is memory-mapped by all of the lmtpd and pop3d processes.
The other is opened by all the lmtpd processes.  Both were quite
active, with many small writes, so I assumed they'd be better on UFS.
All of the IMAP mailboxes were still on ZFS.

However, this morning, things went from bad to worse.  All writes to
the ZFS filesystems stopped completely.  Look at this:

    $ zpool iostat 5 5
                   capacity     operations    bandwidth
    pool         used  avail   read  write   read  write
    ----------  -----  -----  -----  -----  -----  -----
    space       1.04T   975G     86     67  4.53M  2.57M
    space       1.04T   975G      5      0   159K      0
    space       1.04T   975G      7      0   337K      0
    space       1.04T   975G      3      0   179K      0
    space       1.04T   975G      4      0   167K      0

`fsstat' told me that there were both writes and memory-mapped I/O
to UFS, but nothing to ZFS.  At the same time, the `ps' command
would hang and could not be interrupted.  `truss' on `ps' looked
like this, but it eventually also stopped and could not be interrupted.

    open("/proc/6359/psinfo", O_RDONLY)             = 4
    read(4, "02\0\0\0\0\0\001\0\018D7".., 416)      = 416
    close(4)                                        = 0
    open("/proc/12782/psinfo", O_RDONLY)            = 4
    read(4, "02\0\0\0\0\0\001\0\0 1EE".., 416)      = 416
    close(4)                                        = 0

What could cause this sort of behavior?  It happened three times today!

