[zfs-discuss] Write retry errors to SSD's on SAS backplane (mpt)

Ray Van Dolson Thu, 25 Mar 2010 11:26:14 -0700

We have a Silicon Mechanics server with a SuperMicro X8DT3-F (Rev 1.02)
motherboard, an onboard LSI 1068E (firmware 1.28.02.00), and a SuperMicro
SAS-846EL1 (Rev 1.1) backplane.


We have four Intel X25-E SSDs attached to the backplane, with two acting
as ZIL devices and two as L2ARC.

The remaining 21 drives are 1TB SATA.

The system is being used as an NFS datastore for VMware ESX, and, while
not too heavily loaded, we'll occasionally see these pop up in the
logs:

Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
Feb 28 22:46:22 prodsys-t2-zfs1         Log info 31126000 received for target 31.
Feb 28 22:46:22 prodsys-t2-zfs1         scsi_status=0, ioc_status=804b, scsi_state=c
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
Feb 28 22:46:22 prodsys-t2-zfs1         Log info 31126000 received for target 31.
Feb 28 22:46:22 prodsys-t2-zfs1         scsi_status=0, ioc_status=804b, scsi_state=c
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@8/pci15d9,1...@0/s...@1f,0 (sd24):
Feb 28 22:46:22 prodsys-t2-zfs1         Error for Command: write                   Error Level: Retryable
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice]   Requested Block: 591744                    Error Block: 591744
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice]   Vendor: ATA                                Serial Number: CVEM002600FD
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice]   Sense Key: Unit Attention
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice]   ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0

Mar  1 01:10:40 prodsys-t2-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
Mar  1 01:10:40 prodsys-t2-zfs1         Log info 31126000 received for target 30.
Mar  1 01:10:40 prodsys-t2-zfs1         scsi_status=0, ioc_status=804b, scsi_state=c
Mar  1 01:10:40 prodsys-t2-zfs1 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,3...@8/pci15d9,1...@0 (mpt0):
Mar  1 01:10:40 prodsys-t2-zfs1         Log info 31126000 received for target 30.
Mar  1 01:10:40 prodsys-t2-zfs1         scsi_status=0, ioc_status=804b, scsi_state=c
Mar  1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.warning] WARNING: /p...@0,0/pci8086,3...@8/pci15d9,1...@0/s...@1e,0 (sd23):
Mar  1 01:10:41 prodsys-t2-zfs1         Error for Command: write                   Error Level: Retryable
Mar  1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice]   Requested Block: 958744                    Error Block: 958744
Mar  1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice]   Vendor: ATA                                Serial Number: CVEM0033003T
Mar  1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice]   Sense Key: Unit Attention
Mar  1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.notice]   ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
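
For anyone reading the mpt codes in the messages above, here is a minimal
Python sketch that splits ioc_status and the "Log info" word into fields.
The bit layout is an assumption taken from LSI's MPI headers (mpi.h and
mpi_log_sas.h) and should be checked against them; the function and
constant names are made up for this illustration. If that layout holds,
0x804b is status 0x4b with the "log info available" flag set, and
0x31126000 breaks down into bus type 3, originator 1, code 0x12, and
subcode 0x6000.

```python
# Assumed layout (from LSI MPI headers, not confirmed by this thread):
#   ioc_status: bit 15 = "log info available" flag, bits 14..0 = status code
#   log info:   bits 31..28 bus type, 27..24 originator,
#               23..16 code, 15..0 subcode
IOCSTATUS_LOG_INFO_FLAG = 0x8000  # hypothetical name for the bit-15 flag
IOCSTATUS_MASK = 0x7FFF           # hypothetical name for the status mask


def decode_ioc_status(value):
    """Split ioc_status into the log-info flag and the bare status code."""
    return {
        "log_info_available": bool(value & IOCSTATUS_LOG_INFO_FLAG),
        "status": value & IOCSTATUS_MASK,
    }


def decode_log_info(value):
    """Split a 32-bit log info word into its four assumed fields."""
    return {
        "bus_type": (value >> 28) & 0xF,
        "originator": (value >> 24) & 0xF,
        "code": (value >> 16) & 0xFF,
        "subcode": value & 0xFFFF,
    }


if __name__ == "__main__":
    # Values copied from the log messages above.
    print(decode_ioc_status(0x804B))
    print(decode_log_info(0x31126000))
```

This only separates the fields; what each code means still has to be
looked up in the headers or asked of LSI.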

The errors _only_ correspond with whichever drives are being used for
ZIL.
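
One way to check that claim against a longer stretch of
/var/adm/messages is to tally the mpt targets and sd instances that
appear in these messages. The Python sketch below is only an
illustration: the regular expressions are written against the message
format shown above, the names tally and SAMPLE are made up, and SAMPLE is
a shortened excerpt of the log in this post.

```python
import re
from collections import Counter

# "Log info <hex> received for target <n>."; \s+ tolerates wrapped lines.
TARGET_RE = re.compile(
    r"Log info\s+([0-9a-fA-F]+)\s+received for target\s+(\d+)"
)
# "WARNING: <device path> (sdNN):"; the path contains no whitespace.
SD_RE = re.compile(r"WARNING:\s*\S+\s+\((sd\d+)\):")

# Shortened excerpt of the messages quoted in this post.
SAMPLE = """\
Feb 28 22:46:22 prodsys-t2-zfs1         Log info 31126000 received for target
31.
Feb 28 22:46:22 prodsys-t2-zfs1         Log info 31126000 received for target
31.
Feb 28 22:46:22 prodsys-t2-zfs1 scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@8/pci15d9,1...@0/s...@1f,0 (sd24):
Mar  1 01:10:40 prodsys-t2-zfs1         Log info 31126000 received for target
30.
Mar  1 01:10:41 prodsys-t2-zfs1 scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci8086,3...@8/pci15d9,1...@0/s...@1e,0 (sd23):
"""


def tally(log_text):
    """Count mpt log-info events per target and sd warnings per instance."""
    targets = Counter(int(m.group(2)) for m in TARGET_RE.finditer(log_text))
    disks = Counter(SD_RE.findall(log_text))
    return targets, disks


if __name__ == "__main__":
    targets, disks = tally(SAMPLE)
    print("mpt targets:", dict(targets))
    print("sd instances:", dict(disks))
```

The sd instances it reports can then be compared with the log devices
listed by zpool status. Note that the unit addresses in the device paths
(1f and 1e) are hexadecimal, which lines up with targets 31 and 30.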

The system is fully patched Solaris 10 U8, and the mpt driver is
version 1.92:

# modinfo | grep mpt
 40 ffffffffef8bc000  3b5f0 169   1  mpt (MPT HBA Driver v1.92)

The error messages above aren't fatal -- apparently the OS just retries
the write and all is well.  We haven't seen any performance impact
either, but would like to track the problem down.

We've already swapped out the SSDs.  The retries continue to occur as
above.

The only thing that "solves" the problem is to either attach the SSD
drives to the motherboard's SATA controllers or to attach them directly
to the LSI controller (bypassing the backplane).

This would seem to point the finger at the backplane; however, the
other 21 SATA drives never throw errors, and neither do the two SSDs
being used for L2ARC.

Could there be some sort of latency or timing issue with the mpt driver
that only manifests itself under a high level of writes to SSD devices
hanging off a backplane (a potentially longer latency path)?  Are there
SCSI command timeout settings I can tweak to perhaps "mask" these errors
for the mpt driver?

The vendor will probably want to send us a backplane, but I'm not
convinced it will fix the issue.

Suggestions or thoughts?

Thanks,
Ray
