On 16/09/2015 19:24, Nikola M wrote:
> On 09/11/15 08:57 PM, Watson, Dan wrote:
>> I'm using mpt_sas with SATA drives, and I _DO_ have error counters
>> climbing for some of those drives. Is that the likely cause?
>> Any other ideas?
> It is generally strongly advised to use SATA disks on SATA controllers
> and SAS disks on SAS controllers, and to use a controller that can do JBOD.
> Also, using SAS-to-SATA multipliers, or port multipliers at all,
> is strongly discouraged,
> because they usually contain cheap logic that can misbehave, and the
> disk is then not under direct control of the controller.
A disk interface specialist was telling me earlier today what goes wrong
here. The problem is that many SATA drives drop the phy interface when
they have some internal problem, even just retrying transfers. Normally
that doesn't matter a scrap when they are connected 1-to-1 to a SATA
controller. However, if they are connected to SAS fabric, it will cause
the SAS fabric to re-enumerate all the drives at least at that port
multiplier level, likely losing outstanding IOs on other drives. Other
SATA drives are hit hardest, as implementations of STP (SATA Tunneling
Protocol) in SAS HBAs/expanders just aren't very good. This often causes
OS drivers to report errors against the wrong drive, i.e. not
necessarily the one which is the root cause but others where IOs were
lost, and you can't necessarily tell which was to blame (and probably
don't even realise you might be being misled). It happens again if/when
the SATA drive recovers and brings its phy back up. This could cause FMA
to fault out the wrong drives in situations where you do genuinely have a
misbehaving drive, leaving the bad drive online when there's no pool
redundancy left to fault out any more drives.
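One rough way to check whether you are seeing this misattribution is to
look for errors on several drives that cluster in the same few seconds.
The sketch below is only an illustration of that idea in Python, not
anything FMA or mpt_sas actually does. The function name, the 5 second
window and the sample events are all made up; you would have to extract
the (timestamp, device) pairs from your own logs.

```python
def find_shared_bursts(events, window=5.0):
    """Return error bursts that involve more than one device.

    events: iterable of (timestamp_seconds, device_name) tuples.
    window: maximum gap in seconds between consecutive events in the
            same burst (an arbitrary, illustrative value).
    """
    bursts = []
    current = []
    for ts, dev in sorted(events):
        # Start a new burst when the gap to the previous event is too large.
        if current and ts - current[-1][0] > window:
            bursts.append(current)
            current = []
        current.append((ts, dev))
    if current:
        bursts.append(current)

    shared = []
    for burst in bursts:
        devices = sorted({dev for _, dev in burst})
        # Only bursts touching several drives hint at a fabric-level event.
        if len(devices) > 1:
            shared.append((burst[0][0], burst[-1][0], devices))
    return shared


# Hypothetical sample data: three drives erroring within about a second,
# then one isolated error much later.
events = [
    (100.0, "c2t3d0"),
    (100.4, "c2t5d0"),
    (101.1, "c2t7d0"),
    (900.0, "c2t5d0"),
]
print(find_shared_bursts(events))
```

A burst that hits several drives at once suggests a re-enumeration on
the fabric rather than several drives failing together, but it does not
tell you which drive dropped its phy.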
Why is this not a problem with SAS drives? Well apparently they don't
drop their phy interfaces anywhere near as easily when such things
happen, because they are designed for use with SAS fabric where doing so
is known to be a problem. Even if they do drop their phy, it doesn't
result in confusing error reports from other drives on the SAS fabric.
Some SAS drives can actually reset and reboot their firmware if it
crashes without the phy interface being dropped.
> Also, which OI/illumos version is that? I remember reading long ago
> that some mpt_sas bugs were fixed in illumos.
Somewhere around 18 months ago IIRC, Nexenta pushed a load of fixes for
this into their git repo. I don't think I've seen these picked up yet by
Illumos, although maybe I missed it? The fixes were in mpt_sas and FMA,
to more accurately determine when disks are going bad by pushing the
timing of the SCSI commands right down to the bottom of the stack (so
delays in the software stack are not mistaken for bad drives), and to
have FMA better analyse and handle errors when they do happen.
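To illustrate why the place where you time a command matters, here is a
small Python sketch. It is not the Nexenta code, which I haven't
reproduced here; the class, the field names and the 2 second threshold
are invented for the example. It just shows that a command which sat in
the software stack for a long time looks slow when timed from the top,
but not when timed from the point it was handed to the hardware.

```python
from dataclasses import dataclass


@dataclass
class CommandTiming:
    submitted: float   # command entered the top of the I/O stack
    dispatched: float  # driver handed the command to the hardware
    completed: float   # hardware reported completion

    @property
    def end_to_end(self):
        # Includes any time spent queued in software.
        return self.completed - self.submitted

    @property
    def device_time(self):
        # Only the time the hardware actually had the command.
        return self.completed - self.dispatched


def looks_slow(timing, threshold, measure_at_bottom):
    """Compare elapsed time against a threshold at the chosen layer."""
    elapsed = timing.device_time if measure_at_bottom else timing.end_to_end
    return elapsed > threshold


# Hypothetical command: 4.5 s stuck in software, 0.5 s on the drive.
t = CommandTiming(submitted=0.0, dispatched=4.5, completed=5.0)
print(looks_slow(t, threshold=2.0, measure_at_bottom=False))  # True
print(looks_slow(t, threshold=2.0, measure_at_bottom=True))   # False
```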
--
Andrew Gabriel
_______________________________________________
openindiana-discuss mailing list
openindiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss