On May 10, 2011, at 9:18 AM, Ray Van Dolson wrote:

> We recently had a disk fail on one of our whitebox (SuperMicro) ZFS
> arrays (Solaris 10 U9).
> 
> The disk began throwing errors like this:
> 
> May  5 04:33:44 dev-zfs4 scsi: [ID 243001 kern.warning] WARNING: 
> /pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):
> May  5 04:33:44 dev-zfs4        mptsas_handle_event_sync: IOCStatus=0x8000, 
> IOCLogInfo=0x31110610

These are commonly seen when hardware is having difficulty and devices are
being reset.

> 
> And errors for the drive were incrementing in iostat -En output.
> Nothing was seen in fmdump.

That is unusual, because the ereports are sent by the same code in sd that
increments the error counters. Are you sure you ran "fmdump -e" as root or
with the appropriate privileges?
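
For example (this is just standard fmdump usage, nothing specific to your box):

  # fmdump -e          (one-line summary of each ereport)
  # fmdump -eV         (full detail, including the device path)

If the disk really was throwing errors for three hours, I would expect to see
a steady stream of ereport.io.scsi.* or ereport.fs.zfs.* entries there.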

> 
> Unfortunately, it took about three hours for ZFS (or maybe it was MPT)
> to decide the drive was actually dead:
> 
> May  5 07:41:06 dev-zfs4 scsi: [ID 107833 kern.warning] WARNING: 
> /scsi_vhci/disk@g5000c5002cbc76c0 (sd4):
> May  5 07:41:06 dev-zfs4        drive offline
> 
> During this three hours the I/O performance on this server was pretty
> bad and caused issues for us.  Once the drive "failed" completely, ZFS
> pulled in a spare and all was well.
> 
> My question is -- is there a way to tune the MPT driver or even ZFS
> itself to be more/less aggressive on what it sees as a "failure"
> scenario?

The mpt driver is closed source; contact its author for such details.

mpt_sas is open source, but on Solaris-derived OSes the decision to retire
a device is made by the Fault Management Architecture (FMA) agents. Many of
these have tunable algorithms, but AFAIK they are documented only in the
source.
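
You can at least see which agents are loaded and how much telemetry they have
processed (the module names below are what I'd expect on a stock Solaris 10 or
illumos install; check your own output):

  # fmadm config       (lists modules, e.g. zfs-diagnosis, zfs-retire, disk-transport)
  # fmstat             (per-module event and diagnosis statistics)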

That said, there are failure modes that do not fit the current algorithms very
well. Feel free to propose alternatives.
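
As an illustration only (the value is made up, and I have not tested it on a
Supermicro/mpt_sas config like yours), one knob people reach for is the sd
command timeout in /etc/system, which bounds how long each retry cycle can
stall I/O; it does not change the FMA diagnosis itself:

  * shorten the per-command timeout from the 60 second default (illustrative value)
  set sd:sd_io_time = 0x14

followed by a reboot.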

> 
> I suppose this would have been handled differently / better if we'd
> been using real Sun hardware?

Maybe, maybe not. These are generic conditions and can be seen on all
sorts of hardware under a wide variety of failure conditions.
 -- richard

> 
> Our other option is to watch better for log entries similar to the
> above and either alert someone or take some sort of automated action
> .. I'm hoping there's a better way to tune this via driver or ZFS
> settings however.
> 
> Thanks,
> Ray
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
