On Dec 3, 2011, at 9:32 PM, Ryan Wehler wrote:
> On Dec 3, 2011, at 11:18 PM, Richard Elling wrote:
>
>> On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:
>>>
>>> On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:
>>>
>>>> On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
>>>>
>>>>> Hi Richard,
>>>>> Thanks for getting back to me.
>>>>>
>>>>> On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
>>>>>
>>>>>> On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
>>>>>>
>>>>>>> During the diagnostics of my SAN failure last week we thought we had
>>>>>>> seen a backplane failure due to high error counts with 'lsiutil'.
>>>>>>> However, even with a new backplane and ruling out failed cards (MPXIO
>>>>>>> or singular) or bad cables I'm still seeing my error count with LSIUTIL
>>>>>>> increment. I've got no disks attached to the array right now so I've
>>>>>>> also ruled those out.
>>>>>>
>>>>>> The link error counters are on the receiving side. To see the complete
>>>>>> picture, you need to look at link errors on both ends of each link
>>>>>> (more below…)
>>>>>>
>>>>>>> Even with nothing connected but the HBA to the backplane expander, a
>>>>>>> simple restart of the SAN into an OpenIndiana LiveCD or other
>>>>>>> distribution (NexentaStor) increments the counter.
>>>>>>
>>>>>> A few counters can tick up when the system is reset at boot. These can
>>>>>> be ignored.
>>>>>>
>>>>>> What you are looking for is a consistent increase of the counters
>>>>>> under load. In some cases I have seen millions of errors per minute
>>>>>> on a very unhappy system.
>>>>>
>>>>> But we're talking about 600,000 -> 2,000,000 errors on a simple reset at
>>>>> boot. Per my VAR their 6Gb/s hardware shows significantly less (in the
>>>>> 10s to 100s of errors, not 100s to millions).
>>>>
>>>> For high-quality hardware, I see 4 to 8. If I see > 1,000, then I start
>>>> replacing hardware.
>>>
>>> And how do you define "high quality hardware"? Obviously these aren't
>>> crummy SATA adapters and low cost drives. The chassis and backplane are on
>>> Nexenta's HSL. While the cards are not explicitly listed, the underlying
>>> chip (LSI 1068) is on another card (3081E-R) that is on the HSL.
>>
>> I recently tested a HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero
>> errors.
>
> I'm assuming these had some sort of LSI cards in them since that's the
> primary focus here. Do you happen to know models and what expander chip was
> used on the backplane(s)?

LSI 2008 chipset (HP SC08Ge HBA). Expanders are HP-branded; I'll speculate
they are LSI SAS2x28.

Note: there is also firmware on the HBAs and expanders. But I do not expect
firmware to change the link error counts. I suspect that is more of a
physical issue.

>> Currently, the test process for HSL records any errors, but as long as the
>> root cause can be explained, the devices can pass certification.
>
> Well.... since we can't even come to a reasonable justification on why these
> errors exist with no "true" indicator of bad hardware, something like this
> could pass the HSL if the VAR can justify it? I'm not saying that's what
> happened.. I'm just trying to understand the process.

A certification does not mean that any specific implementation operates
without errors. A failed part, noisy environment, or other influences will
affect any specific implementation.
 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss