Re: [OpenIndiana-discuss] Kernel panic on hung zpool accessed via lofi

Andrew Gabriel Wed, 16 Sep 2015 13:44:01 -0700

On 16/09/2015 19:24, Nikola M wrote:

On 09/11/15 08:57 PM, Watson, Dan wrote:
I'm using mpt_sas with SATA drives, and I _DO_ have error counters climbing for some of those drives, is it probably that?
Any other ideas?
It is generally strongly advised to use SATA disks on SATA controllers and SAS disks on SAS controllers, and to use a controller that can do JBOD.
Also, using SAS-to-SATA multipliers, or port multipliers at all, is strongly discouraged too, because they usually contain cheap logic that can go crazy, and the disk is not under direct control of the controller.

A disk interface specialist was telling me earlier today what goes wrong here. The problem is that many SATA drives drop the phy interface when they have some internal problem, even just retrying transfers. Normally that doesn't matter a scrap when they are connected 1-to-1 to a SATA controller. However, if they are connected to SAS fabric, it will cause the SAS fabric to re-enumerate all the drives at least at that port multiplier level, likely losing outstanding IOs on other drives, most particularly other SATA drives, as implementations of STP (SATA Tunneling Protocol) in SAS HBAs/expanders just aren't very good. This often causes OS drivers to report errors against the wrong drive, i.e. not necessarily the one which is the root cause but others where IOs are lost, and you can't necessarily tell which was to blame (and probably don't even realise you might be being misled). It happens again if/when the SATA drive recovers and brings its phy back up. This could cause FMA to fault out the wrong drives in situations where you do genuinely have a misbehaving drive, leaving the bad drive online when there's no pool redundancy left to fault out any more drives.
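
To make the misattribution concrete, here is a minimal Python sketch. The event records, device names and the 5-second window are all hypothetical, not taken from any real driver or FMA log. It only illustrates the idea that IO errors on other drives shortly after one drive's phy goes down or comes back up are suspect, rather than evidence against those other drives.

```python
from dataclasses import dataclass


@dataclass
class Event:
    t: float      # seconds since an arbitrary start (hypothetical data)
    device: str   # hypothetical device name
    kind: str     # "phy_down", "phy_up" or "io_error"


def suspect_errors(events, window=5.0):
    """Return io_error events that occur on a *different* device within
    `window` seconds after any phy state change. The window is an assumption."""
    phy_changes = [e for e in events if e.kind in ("phy_down", "phy_up")]
    suspects = []
    for e in events:
        if e.kind != "io_error":
            continue
        for p in phy_changes:
            if p.device != e.device and 0 <= e.t - p.t <= window:
                suspects.append(e)
                break
    return suspects


# Made-up timeline: disk3 drops its phy, then brings it back up later.
events = [
    Event(10.0, "disk3", "phy_down"),
    Event(10.4, "disk1", "io_error"),
    Event(11.2, "disk5", "io_error"),
    Event(40.0, "disk3", "phy_up"),
    Event(40.3, "disk1", "io_error"),
    Event(90.0, "disk2", "io_error"),
]

for e in suspect_errors(events):
    print(f"{e.t:6.1f}s  {e.device}: error may be collateral damage")
```

In this made-up timeline the errors on disk1 and disk5 cluster around disk3's phy changes, so disk3 is the more likely culprit; only the isolated disk2 error is left unflagged.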

Why is this not a problem with SAS drives? Well apparently they don't drop their phy interfaces anywhere near as easily when such things happen, because they are designed for use with SAS fabric where doing so is known to be a problem. Even if they do drop their phy, it doesn't result in confusing error reports from other drives on the SAS fabric. Some SAS drives can actually reset and reboot their firmware if it crashes without the phy interface being dropped.

Also what OI/illumos is that, because I was reading long ago there were some bugs solved in illumos for mpt_sas.

Somewhere around 18 months ago IIRC, Nexenta pushed a load of fixes for this into their git repo. I don't think I've seen these picked up yet by Illumos, although maybe I missed it? The fixes were in mpt_sas and FMA, to more accurately determine when disks are going bad by pushing the timing of the SCSI commands right down to the bottom of the stack (so delays in the software stack are not mistaken for bad drives), and to have FMA better analyse and handle errors when they do happen.
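
As a rough illustration of why the measurement point matters (this is not the Nexenta code, just a sketch with hypothetical field names, threshold and numbers): if a command is timed from where it enters the software stack, time spent queued in software counts against the drive, whereas timing from the hand-off to the HBA only counts the time the hardware actually took.

```python
from dataclasses import dataclass


@dataclass
class IoTiming:
    submitted: float   # command entered the top of the software stack
    dispatched: float  # command handed to the HBA (bottom of the stack)
    completed: float   # completion came back from the HBA


SLOW_THRESHOLD = 1.0   # seconds; a hypothetical cut-off for "slow drive"


def slow_by_top_of_stack(io):
    # Includes any time the command sat queued in software.
    return io.completed - io.submitted > SLOW_THRESHOLD


def slow_by_bottom_of_stack(io):
    # Counts only the time between hand-off to the HBA and completion.
    return io.completed - io.dispatched > SLOW_THRESHOLD


# Made-up examples: one command delayed in software, one delayed by the drive.
queued_in_software = IoTiming(submitted=0.0, dispatched=1.8, completed=1.85)
slow_drive = IoTiming(submitted=0.0, dispatched=0.01, completed=2.5)

for name, io in (("queued_in_software", queued_in_software),
                 ("slow_drive", slow_drive)):
    print(name, slow_by_top_of_stack(io), slow_by_bottom_of_stack(io))
```

With these made-up numbers, top-of-stack timing flags both commands as slow, while bottom-of-stack timing flags only the one where the drive itself took a long time.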


--
Andrew Gabriel

_______________________________________________
openindiana-discuss mailing list
openindiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
