james.ma...@sun.com said:
> I'm not yet sure what's broken here, but there's something pathologically
> wrong with the IO rates to the device during the ZFS tests. In both cases,
> the wait queue is getting backed up, with horrific wait queue latency
> numbers. On the read side, I don't understand why we're seeing 4-5 seconds of
> zero disk activity on the read test in between bursts of a small number of
> reads. 

We observed such long pauses (with zero disk activity) with a disk array
that was being fed more operations than it could handle (FC queue depth).
The array was not losing ops, but the OS would fill the device's queue
and then freeze completely on any disk-related activity for the affected
LUNs.  All zpool and zfs commands touching those pools would be
unresponsive during those periods, until the load slowed down enough
that the OS was no longer ahead of the array.
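
If you want to confirm that it is the device queue backing up, the wait/actv
and wsvc_t/asvc_t columns of iostat are a quick first check (a generic
suggestion on my part, not something specific to our case):

        # per-device queue lengths and service times, once a second;
        # a large "wait" with "actv" pinned at the queue depth is the
        # symptom described above
        iostat -xnz 1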

This was with Solaris 10 here, not OpenSolaris or SXCE, but I suspect
the principle would still apply.  Naturally, the original poster may have
a very different situation, so take the above as you wish.  Maybe DTrace
can help:
        http://blogs.sun.com/chrisg/entry/latency_bubble_in_your_io
        http://blogs.sun.com/chrisg/entry/latency_bubbles_follow_up
        http://blogs.sun.com/chrisg/entry/that_we_should_make
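
Before digging into those posts, a minimal io-provider script along these
lines (my own rough sketch, not one of the scripts from the links) will show
a per-device latency distribution:

        #!/usr/sbin/dtrace -s
        #pragma D option quiet

        /* Time each buf from io:::start to io:::done and build a latency
           distribution (in microseconds) per device. */
        io:::start
        {
                start_ts[args[0]->b_edev, args[0]->b_blkno] = timestamp;
        }

        io:::done
        /start_ts[args[0]->b_edev, args[0]->b_blkno]/
        {
                @lat[args[1]->dev_statname] =
                    quantize((timestamp - start_ts[args[0]->b_edev,
                    args[0]->b_blkno]) / 1000);
                start_ts[args[0]->b_edev, args[0]->b_blkno] = 0;
        }

Run it as root during one of the stalls; outliers in the 60-120 second
range will stand out in the top buckets of the distribution.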

Note that, using the above references, DTrace showed that some of our FC
operations took 60 or even 120 seconds to complete.  Things got much better
here once we zeroed in on two settings (a sketch of where these knobs live
follows the list):

  (a) setting the FC queue depth for the device to match its back-end
      capacity (4 in our case), and
  (b) turning off the OS/driver's sorting of the queue (which evened out
      latency).
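
For reference, those settings typically live in /etc/system and the sd/ssd
driver's .conf file.  Treat the lines below as a sketch: the exact property
names, and whether your FC LUNs attach through sd or ssd, depend on the
Solaris release and driver, and "VENDOR  PRODUCT " is only a placeholder
for the array's inquiry strings.

        * /etc/system: cap outstanding commands per LUN at 4 (reboot needed)
        set ssd:ssd_max_throttle=4

        # ssd.conf (newer releases): per-device throttle plus no disksort
        ssd-config-list = "VENDOR  PRODUCT ", "throttle-max:4, disksort:false";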

Regards,

Marion

