We recently had a disk fail on one of our white-box (SuperMicro) ZFS
arrays running Solaris 10 U9.

The disk began throwing errors like this:

May  5 04:33:44 dev-zfs4 scsi: [ID 243001 kern.warning] WARNING: 
/pci@0,0/pci8086,3410@9/pci15d9,400@0 (mpt_sas0):
May  5 04:33:44 dev-zfs4        mptsas_handle_event_sync: IOCStatus=0x8000, 
IOCLogInfo=0x31110610

Error counters for the drive were incrementing in iostat -En output,
but nothing showed up in fmdump.
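
For reference, this is roughly how we've been checking (flags from
memory; my understanding is that fmdump -e shows the raw error
reports FMA collects before it diagnoses an actual fault):

    # per-device error counters
    iostat -En

    # raw FMA error reports (telemetry), verbose
    fmdump -eV

    # diagnosed faults, if any
    fmadm faulty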

Unfortunately, it took about three hours for ZFS (or maybe it was MPT)
to decide the drive was actually dead:

May  5 07:41:06 dev-zfs4 scsi: [ID 107833 kern.warning] WARNING: 
/scsi_vhci/disk@g5000c5002cbc76c0 (sd4):
May  5 07:41:06 dev-zfs4        drive offline

During those three hours, I/O performance on this server was bad
enough to cause real problems for us.  Once the drive "failed"
completely, ZFS pulled in a spare and all was well.
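
(For anyone hitting this later, cleanup afterward is the usual
routine -- pool and device names below are placeholders:)

    # confirm the spare resilvered and the pool is healthy
    zpool status -x

    # swap a replacement disk in for the dead one; the hot spare
    # returns to the available list once the resilver finishes
    zpool replace <pool> <failed-disk> <new-disk>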

My question: is there a way to tune the MPT driver, or even ZFS
itself, to be more (or less) aggressive about what it treats as a
"failure" scenario?
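
The closest thing I've found so far is the sd(7D) per-command
timeout, which can apparently be shortened in /etc/system -- but
that's secondhand and untested on our boxes, and the value below is
just a guess, so please verify against your release first:

    * /etc/system: shorten the per-command timeout for sd(7D)
    * (default is 60 seconds; 0x10 = 16 seconds is an untested guess)
    set sd:sd_io_time = 0x10

A reboot is needed for /etc/system changes to take effect.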

I suppose this would have been handled differently / better if we'd
been using real Sun hardware?

Our other option is to watch more closely for log entries like the
ones above and either alert someone or take some sort of automated
action (a rough sketch follows).  I'm hoping there's a better way to
handle this via driver or ZFS settings, however.
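
Something like this out of cron every few minutes is what I have in
mind -- untested sketch; the threshold and address are made up:

    #!/bin/sh
    # Alert someone when mpt_sas starts logging IOCStatus warnings.
    LOG=/var/adm/messages
    COUNT=`grep -c 'mptsas_handle_event_sync: IOCStatus' $LOG`
    if [ "$COUNT" -gt 10 ]; then
        echo "mpt_sas errors on `hostname`: $COUNT" | \
            mailx -s "disk warning on `hostname`" admin@example.com
    fi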

Thanks,
Ray