On Dec 3, 2011, at 11:18 PM, Richard Elling wrote:
> On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:
>>
>> On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:
>>
>>> On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
>>>
>>>> Hi Richard,
>>>> Thanks for getting back to me.
>>>>
>>>> On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
>>>>
>>>>> On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
>>>>>
>>>>>> During the diagnostics of my SAN failure last week we thought we had
>>>>>> seen a backplane failure due to high error counts with 'lsiutil'.
>>>>>> However, even with a new backplane and ruling out failed cards (MPXIO or
>>>>>> singular) or bad cables I'm still seeing my error count with LSIUTIL
>>>>>> increment. I've got no disks attached to the array right now so I've
>>>>>> also ruled those out.
>>>>>
>>>>> The link error counters are on the receiving side. To see the complete
>>>>> picture, you need to look at link errors on both ends of each link
>>>>> (more below…)
>>>>>
>>>>>> Even with nothing connected but the HBA to the backplane expander, a
>>>>>> simple restart of the SAN into an OpenIndiana LiveCD or other
>>>>>> distribution (NexentaStor) increments the counter.
>>>>>
>>>>> A few counters can tick up when the system is reset at boot. These can be
>>>>> ignored.
>>>>>
>>>>> What you are looking for is a consistent increase of the counters under
>>>>> load. In some cases I have seen millions of errors per minute on a very
>>>>> unhappy system.
>>>>
>>>> But we're talking about 600,000 -> 2,000,000 errors on a simple reset at
>>>> boot. Per my VAR, their 6GB hardware shows significantly less (in the 10s
>>>> to 100s of errors, not 100s to millions).
>>>
>>> For high-quality hardware, I see 4 to 8. If I see > 1,000, then I start
>>> replacing hardware.
>>
>> And how do you define "high quality hardware"? Obviously these aren't
>> crummy SATA adapters and low cost drives. The chassis and backplane are on
>> Nexenta's HSL.
>> While the cards are not explicitly listed, the underlying chip (LSI 1068)
>> is on another card (3081E-R) that is on the HSL.
>
> I recently tested an HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero
> errors.

I'm assuming these had some sort of LSI cards in them since that's the
primary focus here. Do you happen to know the models and what expander chip
was used on the backplane(s)?

> Currently, the test process for HSL records any errors, but as long as the
> root cause can be explained, the devices can pass certification.

Well.... since we can't even come to a reasonable justification on why these
errors exist with no "true" indicator of bad hardware, something like this
could pass the HSL if the VAR can justify it? I'm not saying that's what
happened.. I'm just trying to understand the process.

>
>>>>>> I've been as careful as I can be to clear the counter between changes to
>>>>>> parts to try and eliminate a potentially bad cable/card/etc. You can
>>>>>> see phy 8-15 throw errors regardless of MPXIO or single card config,
>>>>>> OR which expander port I use on the backplane.
>>>>>
>>>>> The info you attached doesn't show the topology (lsiutil command 16), so
>>>>> it is difficult to say why this occurs.
>>>>
>>>> Attached is the output of option 16 on each card.
>>>>
>>>> <LSI1068.rtf>
>>>
>>> This shows that the handle 0009 phys 12 to 15 are the other HBA (initiator).
>>>
>>> It is unusual to see millions of errors there.
>>>
>>> Also, the number of errors is not symmetrical. From the HBA (Adapter phy 1)
>>> you see on the order of a thousand errors. From the expander (handle 0009)
>>> you see millions of errors on phys 12 to 15, which are connected to the HBA.
>>>
>>> Also interesting is that one of the phys, adapter phy 0, shows no errors,
>>> but we see errors on the others. This is unusual because there are 4 links
>>> in the cable.
>>>
>>> Still smells like hardware to me.
>>> -- richard
>>>
>>
>> I'm not quite extrapolating this data like you are. I see handle 0009 which
>> looks to be the expander. Card #1 is hooked to phy 8-11 and Card #2 is
>> hooked to phy 12-15.
>> (port 0 and 1 on the expander)
>>
>> As far as symmetrical errors, yeah the whole thing is screwy. The one thing
>> that stands out, which I did not notice before for some reason, is that
>> the "right" card (the one that normally handles phy 12-15) in my previous
>> output from my initial inquiry carries 1+M errors on the expander phys
>> regardless of the "right or left" cable. Perhaps that is an indicator of
>> hardware malfunction. The "left" card (usually responsible for phy 8-11)
>> throws something in the order of 600+K (under 1M) using "right or left"
>> cable (phy 8-11 or 12-15). Those numbers are uncomfortably high too, though.
>
> Agree.
>
>> Basically the output of my SAS Diag.txt was flipping between single use of
>> each card with each of the two cables I had available to me. If I were to
>> show the output now with both cards enabled, phy 8-15 on the expander
>> would all show a "link up" situation.
>
> Are the cables of the same make/model? Unfortunately, it is not uncommon to
> see bad cables :-(
> I had one just last week :-(

The cables are identical. My VAR put this all together about 2 years ago. I
don't have any other cables to test but the present fix is "upgrade to SAS3
(6GB) backplane/cards/cables".

>> The other mystery as you mentioned is why Adapter phy 0 is error free while
>> the other 3 phys are not. It's also persistent across cables used AND cards
>> used.
>
> A mystery…
> -- richard
> --
>
> ZFS and performance consulting
> http://www.RichardElling.com
> LISA '11, Boston, MA, December 4-9

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss