On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:

> During the diagnostics of my SAN failure last week we thought we had seen a
> backplane failure due to high error counts with 'lsiutil'. However, even
> with a new backplane and ruling out failed cards (MPXIO or singular) or bad
> cables I'm still seeing my error count with LSIUTIL increment. I've got no
> disks attached to the array right now so I've also ruled those out.

The link error counters are on the receiving side. To see the complete picture, you need to look at link errors on both ends of each link (more below…)

> Even with nothing connected but the HBA to the backplane expander, a simple
> restart of the SAN into a OpenIndiana LiveCD or other distribution
> (NexentaStor) increments the counter.

A few counters can tick up when the system is reset at boot. These can be ignored. What you are looking for is a consistent increase of the counters under load. In some cases I have seen millions of errors per minute on a very unhappy system.

> I've been as careful as I can be to clear the counter between changes to
> parts to try and eliminate a potentially bad cable/card/etc. You can see phy
> 8-15 throw errors irregardless of MPXIO or single card config, OR which
> expander port I use on the backplane.

The info you attached doesn't show the topology (lsiutil command 16), so it is difficult to say why this occurs.

> According to my VAR something in the mptsas code changed "recently" (not sure
> what that means in time terms) and they do not see the problems with 6GB
> backplanes and adapters.

These counters are in the physical interfaces, far away from any OS.

> <SAS Diags.txt>
>
> Attached is a log I took through NexentaStor 3.1.1 with my disks still
> attached. The disks themselves don't seem to be throwing errors, so that's
> good.

To see errors from the disk's perspective, you need to look at the disk's logs. I use sg3_utils for this (sg_logs -a /dev/rdsk/...)

> Has anyone seen anything like this? I have not tried to boot into an older
> version of Solaris or NexentaStor yet, but booting into Scientific Linux 6.1
> yields about the same results with lsiutil.

Yes. Root cause is always hardware.

> Nothing from fmadm, /var/adm/messages or otherwise indicate these data errors
> outside of lsiutil.

Those error counters are maintained as part of the SAS link state machine.
The symptoms will show as poor performance or occasional command resets at the OS level.

 -- richard

--
ZFS and performance consulting
http://www.RichardElling.com
LISA '11, Boston, MA, December 4-9

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
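
The advice in the reply, to ignore a few ticks at reset and look for a consistent increase of the counters under load, can be sketched roughly as follows. This is only an illustration: the snapshot shape (phy number mapped to a cumulative error count), the function name `rising_phys`, and the sample numbers are all hypothetical, and the counters would have to be collected by hand from lsiutil at intervals while the system is under load. Nothing here parses real lsiutil output.

```python
from typing import Dict, List

# Hypothetical data shape: phy number -> cumulative link error count,
# copied by hand from successive lsiutil readings taken under load.
Snapshot = Dict[int, int]


def rising_phys(snapshots: List[Snapshot], min_delta: int = 1) -> List[int]:
    """Return the phys whose counter grew by at least min_delta in every interval.

    A counter that ticked up once (for example at reset) and then stayed
    flat is not reported; only a consistent increase is.
    """
    if len(snapshots) < 2:
        return []  # need at least two readings to see any change
    flagged = []
    for phy in sorted(snapshots[0]):
        counts = [s.get(phy, 0) for s in snapshots]
        deltas = [later - earlier for earlier, later in zip(counts, counts[1:])]
        if all(d >= min_delta for d in deltas):
            flagged.append(phy)
    return flagged


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    samples = [
        {8: 3, 9: 2, 10: 0},     # shortly after boot
        {8: 1500, 9: 2, 10: 0},  # after a first interval under load
        {8: 4200, 9: 2, 10: 0},  # after a second interval under load
    ]
    print(rising_phys(samples))
```

With these made-up samples only phy 8 is reported: phy 9 picked up two errors at boot and then stayed flat, which is the pattern the reply says can be ignored. Since the counters are on the receiving side, the same comparison would need to be repeated for the other end of each link.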