On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:

> On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
>
>> Hi Richard,
>> Thanks for getting back to me.
>>
>>
>> On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
>>
>>> On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
>>>
>>>> During the diagnostics of my SAN failure last week we thought we had seen
>>>> a backplane failure due to high error counts with 'lsiutil'. However,
>>>> even with a new backplane and ruling out failed cards (MPXIO or singular)
>>>> or bad cables I'm still seeing my error count with LSIUTIL increment.
>>>> I've got no disks attached to the array right now so I've also ruled those
>>>> out.
>>>
>>> The link error counters are on the receiving side. To see the complete
>>> picture, you need to look at link errors on both ends of each link
>>> (more below…)
>>>
>>>>
>>>> Even with nothing connected but the HBA to the backplane expander, a
>>>> simple restart of the SAN into a OpenIndiana LiveCD or other distribution
>>>> (NexentaStor) increments the counter.
>>>
>>> A few counters can tick up when the system is reset at boot. These can be
>>> ignored.
>>>
>>> What you are looking for is a consistent increase of the counters under
>>> load. In some cases I have seen millions of errors per minute on a very
>>> unhappy system.
>>
>> But we're talking about 600,000 -> 2,000,000 errors on a simple reset at
>> boot. Per my VAR their 6GB hardware show significantly less (in the 10s to
>> 100s of errors, not 100s to millions).
>
> For high-quality hardware, I see 4 to 8. If I see > 1,000, then I start
> replacing hardware.

And how do you define "high quality hardware"? Obviously these aren't crummy SATA adapters and low cost drives. The chassis and backplane are on Nexenta's HSL. The cards are not explicitly listed, but the underlying chip (LSI 1068) is on another card (3081E-R) that is on the HSL.

>
>>>> I've been as careful as I can be to clear the counter between changes to
>>>> parts to try and eliminate a potentially bad cable/card/etc. You can see
>>>> phy 8-15 throw errors regardless of MPXIO or single card config, OR
>>>> which expander port I use on the backplane.
>>>
>>> The info you attached doesn't show the topology (lsiutil command 16), so it
>>> is difficult to say why this occurs.
>>
>> Attached is the output of option 16 on each card.
>>
>> <LSI1068.rtf>
>
> This shows that the handle 0009 phys 12 to 15 are the other HBA (initiator).
>
> It is unusual to see millions of errors there.
>
> Also, the number of errors is not symmetrical. From the HBA (Adapter phy 1)
> you see on the order of thousand errors. From the expander (handle 0009)
> you see millions of errors on phys 12 to 15, that are connected to the HBA.
>
> Also interesting is that one of the phys, adapter phy 0, shows no errors, but
> we see errors on the others. This is unusual because there are 4 links in
> the cable.
>
> Still smells like hardware to me.
> -- richard
>

I'm not quite extrapolating this data like you are. I see handle 0009, which looks to be the expander. Card #1 is hooked to phy 8-11 and Card #2 is hooked to phy 12-15 (ports 0 and 1 on the expander).

As far as symmetrical errors, yeah, the whole thing is screwy. The one thing that stands out, which I did not notice before for some reason, is that the "right" card (the one that normally handles phy 12-15) in the output from my initial inquiry carries 1M+ errors on the expander phys regardless of which cable ("right" or "left") is used. Perhaps that is an indicator of hardware malfunction.
The "left" card (usually responsible for phy 8-11) throws something on the order of 600K+ (under 1M) errors using either cable (phy 8-11 or 12-15). Those numbers are uncomfortably high too, though.

Basically, the output in my SAS Diags.txt came from flipping between single use of each card with each of the two cables I had available to me. If I were to show the output now with both cards enabled, phy 8-15 on the expander would all show "link up".

The other mystery, as you mentioned, is why Adapter phy 0 is error free while the other 3 phys are not. It's also persistent across the cables AND cards used.

>>>>
>>>> According to my VAR something in the mptsas code changed "recently" (not
>>>> sure what that means in time terms) and they do not see the problems with
>>>> 6GB backplanes and adapters.
>>>
>>> These counters are in the physical interfaces, far away from any OS.
>>>
>>>>
>>>> <SAS Diags.txt>
>>>>
>>>>
>>>> Attached is a log I took through NexentaStor 3.1.1 with my disks still
>>>> attached. The disks themselves don't seem to be throwing errors, so
>>>> that's good.
>>>
>>> To see errors from the disk's perspective, you need to look at the disk's
>>> logs.
>>> I use sg3 utils for this (sg_logs -a /dev/rdsk/...)
>>>
>>
>> I'd paste some of this, but the output would be pretty big. :) I'll look
>> more into this. Though my "errors corrected without substantial delay"
>> stands out as pretty high, even on a new disk I just received. Is there
>> anything specific I should be looking at?
>>
>>
>>>>
>>>>
>>>> Has anyone seen anything like this? I have not tried to boot into an
>>>> older version of Solaris or NexentaStor yet, but booting into Scientific
>>>> Linux 6.1 yields about the same results with lsiutil.
>>>
>>> Yes. Root cause is always hardware.
>>>
>>>>
>>>> Nothing from fmadm, /var/adm/messages or otherwise indicate these data
>>>> errors outside of lsiutil.
>>>
>>> Those errors are counters as part of the SAS link state machine.
>>> The symptoms will show as poor performance or occasional command resets
>>> at the OS level.
>>> -- richard
>>>
>>> --
>>>
>>> ZFS and performance consulting
>>> http://www.RichardElling.com
>>> LISA '11, Boston, MA, December 4-9

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
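
The rule of thumb quoted in this thread (a few errors at reset can be ignored; a consistent increase under load, or more than about 1,000, points at hardware) can be sketched as a comparison of two counter snapshots, one taken after clearing the counters and one taken after a period of load. This is only a rough sketch: the snapshot dictionaries are filled in by hand from whatever lsiutil reports, the numbers below are made up for illustration, and the intermediate "watch" tier and the function and constant names are assumptions, not anything lsiutil itself provides.

```python
# Thresholds borrowed from the rule of thumb quoted in the thread.
# Treating 8 as a noise floor and adding a "watch" tier are assumptions.
RESET_NOISE = 8      # "For high-quality hardware, I see 4 to 8"
REPLACE_AT = 1000    # "If I see > 1,000, then I start replacing hardware"


def classify(before, after):
    """Label each phy by how much its error count grew between snapshots.

    Both arguments map a (device, phy) tuple to a cumulative error count
    copied by hand from the link error counters on each end of the link.
    """
    report = {}
    for key in sorted(after):
        old = before.get(key, 0)
        new = after[key]
        # If the counter was cleared between snapshots, count from zero.
        delta = new - old if new >= old else new
        if delta > REPLACE_AT:
            report[key] = "suspect"
        elif delta > RESET_NOISE:
            report[key] = "watch"
        else:
            report[key] = "ok"
    return report


# Made-up illustrative numbers, not taken from the attached logs.
before = {("expander", 12): 0, ("adapter", 0): 0, ("adapter", 1): 0}
after = {("expander", 12): 1_200_000, ("adapter", 0): 0, ("adapter", 1): 40}

for key, verdict in classify(before, after).items():
    print(key, verdict)
```

Recording both ends of each link (adapter phys and expander phys) in the same snapshot makes asymmetry of the kind described above easy to spot, since the two ends of one cable appear side by side in the report.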