On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:
>
> On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:
>
>> On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
>>
>>> Hi Richard,
>>> Thanks for getting back to me.
>>>
>>>
>>> On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
>>>
>>>> On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
>>>>
>>>>> During the diagnostics of my SAN failure last week we thought we had seen
>>>>> a backplane failure due to high error counts with 'lsiutil'. However,
>>>>> even with a new backplane and ruling out failed cards (MPXIO or singular)
>>>>> or bad cables I'm still seeing my error count with LSIUTIL increment.
>>>>> I've got no disks attached to the array right now so I've also ruled
>>>>> those out.
>>>>
>>>> The link error counters are on the receiving side. To see the complete
>>>> picture, you need to look at link errors on both ends of each link
>>>> (more below…)
>>>>
>>>>>
>>>>> Even with nothing connected but the HBA to the backplane expander, a
>>>>> simple restart of the SAN into an OpenIndiana LiveCD or other distribution
>>>>> (NexentaStor) increments the counter.
>>>>
>>>> A few counters can tick up when the system is reset at boot. These can be
>>>> ignored.
>>>>
>>>> What you are looking for is a consistent increase of the counters under
>>>> load. In some cases I have seen millions of errors per minute on a very
>>>> unhappy system.
>>>
>>> But we're talking about 600,000 -> 2,000,000 errors on a simple reset at
>>> boot. Per my VAR, their 6Gb hardware shows significantly fewer (in the 10s
>>> to 100s of errors, not 100s to millions).
>>
>> For high-quality hardware, I see 4 to 8. If I see > 1,000, then I start
>> replacing hardware.
>
>
> And how do you define "high quality hardware"? Obviously these aren't crummy
> SATA adapters and low cost drives. The chassis and backplane are on
> Nexenta's HSL. While the cards are not explicitly listed, the underlying
> chip (LSI 1068) is on another card (3081E-R) that is on the HSL.

I recently tested an HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero
errors. Currently, the test process for HSL records any errors, but as long
as the root cause can be explained, the devices can pass certification.

>>>>> I've been as careful as I can be to clear the counter between changes to
>>>>> parts to try and eliminate a potentially bad cable/card/etc. You can see
>>>>> phy 8-15 throw errors regardless of MPXIO or single card config, OR
>>>>> which expander port I use on the backplane.
>>>>
>>>> The info you attached doesn't show the topology (lsiutil command 16), so
>>>> it is difficult to say why this occurs.
>>>
>>> Attached is the output of option 16 on each card.
>>>
>>> <LSI1068.rtf>
>>
>> This shows that the handle 0009 phys 12 to 15 are connected to the other
>> HBA (initiator).
>>
>> It is unusual to see millions of errors there.
>>
>> Also, the number of errors is not symmetrical. From the HBA (Adapter phy 1)
>> you see on the order of a thousand errors. From the expander (handle 0009)
>> you see millions of errors on phys 12 to 15, which are connected to the HBA.
>>
>> Also interesting is that one of the phys, adapter phy 0, shows no errors,
>> but we see errors on the others. This is unusual because there are 4 links
>> in the cable.
>>
>> Still smells like hardware to me.
>> -- richard
>>
>
> I'm not quite extrapolating this data like you are. I see handle 0009 which
> looks to be the expander. Card #1 is hooked to phy 8-11 and Card #2 is
> hooked to phy 12-15. (port 0 and 1 on the expander)
>
> As far as symmetrical errors, yeah the whole thing is screwy. The one thing
> that stands out, which I did not notice before for some reason, is that the
> "right" card (the one that normally handles phy 12-15) in the output from my
> initial inquiry carries 1M+ errors on the expander phys regardless of the
> "right" or "left" cable. Perhaps that is an indicator of hardware
> malfunction.
> The "left" card (usually responsible for phy 8-11) throws something on the
> order of 600K+ (under 1M) using the "right" or "left" cable (phy 8-11 or
> 12-15). Those numbers are uncomfortably high too, though.

Agree.

> Basically the output of my SAS Diag.txt was flipping between single use of
> each card with each of the two cables I had available to me. If I were to
> show the output now with both cards enabled, phys 8-15 on the expander would
> all show a "link up" state.

Are the cables of the same make/model? Unfortunately, it is not uncommon
to see bad cables :-( I had one just last week :-(

> The other mystery as you mentioned is why Adapter phy 0 is error free while
> the other 3 phys are not. It's also persistent across cables used AND cards
> used.

A mystery…
 -- richard

-- 
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
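
For anyone wanting to apply the rule of thumb from this thread (watch for counters that keep climbing after they are cleared, and treat more than about 1,000 errors as suspect), here is a rough Python sketch. It does not parse lsiutil output; the snapshot dictionaries, the `(device, phy)` keys, the function names and the sample numbers are all assumptions made up for illustration, with the counts meant to be copied by hand from whatever the tool reports on each end of a link.

```python
# Rule of thumb quoted in the thread: more than about 1,000 errors is suspect.
SUSPECT_THRESHOLD = 1000


def counter_deltas(before, after):
    """Return the per-phy increase between two counter snapshots.

    Each snapshot is a hypothetical dict mapping (device, phy) to a
    cumulative link error count; a phy missing from `before` counts as 0.
    """
    return {key: after[key] - before.get(key, 0) for key in after}


def suspect_phys(deltas, threshold=SUSPECT_THRESHOLD):
    """Return, in sorted order, the phys whose count grew past the threshold."""
    return sorted(key for key, delta in deltas.items() if delta > threshold)


if __name__ == "__main__":
    # Made-up numbers for illustration only, not taken from the attachment.
    before = {("expander", 12): 0, ("expander", 13): 0,
              ("hba", 0): 0, ("hba", 1): 0}
    after = {("expander", 12): 1_200_000, ("expander", 13): 6,
             ("hba", 0): 0, ("hba", 1): 1_500}

    deltas = counter_deltas(before, after)
    for device, phy in suspect_phys(deltas):
        print(f"{device} phy {phy}: +{deltas[(device, phy)]} errors")
```

Taking the "before" snapshot right after clearing the counters and the "after" snapshot following a period of load keeps the boot-time ticks out of the comparison, and recording both the HBA and the expander end of each link makes any asymmetry visible.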