On 16/09/2015 19:24, Nikola M wrote:
On 09/11/15 08:57 PM, Watson, Dan wrote:
I'm using mpt_sas with SATA drives, and I _DO_ have error counters climbing for some of those drives. Is that probably the cause?
Any other ideas?

It is generally strongly advised to use SATA disks on SATA controllers and SAS disks on SAS controllers, and to use a controller that can do JBOD.

Also, using SAS-to-SATA multipliers, or port multipliers at all, is strongly discouraged too, because they usually contain cheap logic that can go crazy, and the disk is then not under direct control of the controller.

A disk interface specialist was telling me earlier today what goes wrong here. The problem is that many SATA drives drop the phy interface when they have some internal problem, even just retrying transfers. Normally that doesn't matter a scrap when they are connected 1-to-1 to a SATA controller. However, if they are connected to SAS fabric, it will cause the SAS fabric to re-enumerate all the drives at least at that port multiplier level, likely losing outstanding IOs on other drives. Other SATA drives are hit hardest, as implementations of STP (SATA Tunneling Protocol) in SAS HBAs/expanders just aren't very good. This often causes OS drivers to report errors against the wrong drive, i.e. not necessarily the one which is the root cause but others where IOs are lost, and you can't necessarily tell which was to blame (and probably don't even realise you might be being misled). It happens again if/when the SATA drive recovers and brings its phy back up. This could cause FMA to fault out the wrong drives in situations where you do genuinely have a misbehaving drive, leaving the bad drive online when there's no pool redundancy left to fault out any more drives.
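
As a rough illustration of why per-drive counters can mislead here, the following Python sketch parses text shaped like `iostat -En` output and lists every drive with a non-zero transport error count. The sample text, the device names and the helper names are all invented for illustration, and the exact output layout may differ on your system. If several drives behind the same expander show transport errors together, that is at least consistent with one drive dropping its phy and the others losing IOs, rather than each of them being bad on its own.

```python
import re

# Sample text in the general shape of `iostat -En` output.
# Device names, products and counts are made up for illustration.
SAMPLE = """\
c2t0d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 0
Vendor: ATA      Product: EXAMPLE-DISK-A   Revision: 0001 Serial No: X1
c2t1d0           Soft Errors: 0 Hard Errors: 3 Transport Errors: 41
Vendor: ATA      Product: EXAMPLE-DISK-A   Revision: 0001 Serial No: X2
c2t2d0           Soft Errors: 0 Hard Errors: 0 Transport Errors: 17
Vendor: ATA      Product: EXAMPLE-DISK-A   Revision: 0001 Serial No: X3
"""

# Matches only the per-device summary line, not the Vendor/Product line.
HEADER = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+"
    r"Transport Errors:\s*(?P<transport>\d+)"
)


def parse_error_counters(text):
    """Return {device: {"soft": n, "hard": n, "transport": n}}."""
    counters = {}
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            counters[match.group("dev")] = {
                key: int(match.group(key))
                for key in ("soft", "hard", "transport")
            }
    return counters


def drives_with_transport_errors(counters):
    """Return the sorted device names whose transport error count is non-zero."""
    return sorted(dev for dev, c in counters.items() if c["transport"] > 0)


counters = parse_error_counters(SAMPLE)
suspects = drives_with_transport_errors(counters)
print("Drives with transport errors:", ", ".join(suspects))
```

Note that this only tells you which drives were affected, not which one started it, which is exactly the ambiguity described above.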

Why is this not a problem with SAS drives? Well, apparently they don't drop their phy interfaces anywhere near as easily when such things happen, because they are designed for use with SAS fabric, where doing so is known to be a problem. Even if they do drop their phy, it doesn't result in confusing error reports from other drives on the SAS fabric. Some SAS drives can actually reset and reboot their firmware if it crashes, without the phy interface being dropped.

Also, which OI/illumos version is that? I read a long time ago that some mpt_sas bugs had been fixed in illumos.

Somewhere around 18 months ago IIRC, Nexenta pushed a load of fixes for this into their git repo. I don't think I've seen these picked up yet by Illumos, although maybe I missed it? The fixes were in mpt_sas and FMA, to more accurately determine when disks are going bad by pushing the timing of the SCSI commands right down to the bottom of the stack (so delays in the software stack are not mistaken for bad drives), and to have FMA better analyse and handle errors when they do happen.

--
Andrew Gabriel

_______________________________________________
openindiana-discuss mailing list
openindiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
