You are both right.  More below...

On Sep 10, 2010, at 2:06 PM, Piotr Jasiukajtis wrote:

> I don't have any errors from fmdump or syslog.
> The machine is a SUN FIRE X4275; I don't use the mpt or lsi drivers.
> It could be a bug in a driver, since I see this on two identical machines.
> 
> On Fri, Sep 10, 2010 at 9:51 PM, Carson Gaspar <car...@taltos.org> wrote:
>> On 9/10/10 4:16 PM, Piotr Jasiukajtis wrote:
>>> 
>>> Ok, now I know it's not related to I/O performance, but to ZFS itself.
>>> 
>>> At some time all 3 pools were locked in that way:
>>> 
>>>                             extended device statistics       ---- errors ----
>>>     r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b s/w h/w trn tot device
>>>     0.0    0.0    0.0    0.0  0.0  0.0    0.0    0.0   0   0   0   1   0   1 c8t0d0
>>>     0.0    0.0    0.0    0.0  0.0  8.0    0.0    0.0   0 100   0   0   0   0 c7t0d0
>> 
>> Nope, most likely your disks or disk controller/driver. Note that you have 8
>> outstanding I/O requests that aren't being serviced. Look in your syslog,
>> and I bet you'll see I/O timeout errors. I have seen this before with
>> Western Digital disks attached to an LSI controller using the mpt driver.
>> There was a lot of work diagnosing it; see the list archives. An
>> /etc/system change fixed it for me (set xpv_psm:xen_support_msi = -1), but I
>> was using a xen kernel. Note that replacing my disks with larger Seagate
>> ones made the problem go away as well.

In this case, the diagnosis that I/Os are stuck at the drive, not being
serviced, is correct. This is clearly visible as actv > 0, asvc_t == 0, and
the derived %b == 100%. However, the error counters are also 0 for the stuck
device (c7t0d0): s/w, h/w, and trn. In many cases where we see I/O timeouts
and devices aborting commands, these are logged as transport (trn) errors.
For iostat, these error counters are reported since boot, not per sample
period, so we know that whatever is getting stuck isn't getting unstuck. The
symptom we see with questionable devices in the HBA-to-disk path is hundreds,
thousands, or even millions of transport errors reported.
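
One way to watch for that is to sample the error counters over time; a
minimal sketch (the 10-second interval is arbitrary; -e adds the error
columns and -n shows descriptive device names):

    # extended statistics plus per-device error counters, every 10 seconds;
    # the s/w, h/w, trn, and tot columns are cumulative since boot
    iostat -xne 10

If trn never moves between samples while actv stays pinned, nothing is being
logged as an error even though I/Os are stuck.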

Next question: what does the software stack look like? I knew the sd driver
intimately at one time (pictures were in the Enquirer :-) and it will retry
and send resets that will ultimately get logged. In this case, we know that
at least one hard error was returned for c8t0d0, so there is an ereport
somewhere with the details; try "fmdump -eV".
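
A minimal sketch of digging that out (note that ereports identify disks by
physical device path, so matching on the c8t0d0 name directly may not work):

    # summarize the error events (ereports) FMA has collected
    fmdump -e
    # full detail for each ereport, including device path and status
    fmdump -eV | less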

This is not a ZFS bug and cannot be fixed at the ZFS layer.
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com

Richard Elling
rich...@nexenta.com   +1-760-896-4422
Enterprise class storage for everyone
www.nexenta.com




