On Dec 3, 2011, at 11:45 PM, Richard Elling wrote:

> On Dec 3, 2011, at 9:32 PM, Ryan Wehler wrote:
>> On Dec 3, 2011, at 11:18 PM, Richard Elling wrote:
>> 
>>> On Dec 3, 2011, at 9:02 PM, Ryan Wehler wrote:
>>>> 
>>>> On Dec 3, 2011, at 10:31 PM, Richard Elling wrote:
>>>> 
>>>>> On Dec 3, 2011, at 7:36 PM, Ryan Wehler wrote:
>>>>> 
>>>>>> Hi Richard,
>>>>>> Thanks for getting back to me.
>>>>>> 
>>>>>> 
>>>>>> On Dec 3, 2011, at 9:03 PM, Richard Elling wrote:
>>>>>> 
>>>>>>> On Dec 1, 2011, at 5:08 PM, Ryan Wehler wrote:
>>>>>>> 
>>>>>>>> During the diagnostics of my SAN failure last week we thought we had 
>>>>>>>> seen a backplane failure due to high error counts with 'lsiutil'.  
>>>>>>>> However, even with a new backplane, and after ruling out failed cards 
>>>>>>>> (MPXIO or singular) and bad cables, I'm still seeing my error count in 
>>>>>>>> lsiutil increment.  I've got no disks attached to the array right now, 
>>>>>>>> so I've also ruled those out.
>>>>>>> 
>>>>>>> The link error counters are on the receiving side. To see the complete 
>>>>>>> picture, you need to look at link errors on both ends of each link 
>>>>>>> (more below…)
>>>>>>> 
>>>>>>>> 
>>>>>>>> Even with nothing connected but the HBA to the backplane expander, a 
>>>>>>>> simple restart of the SAN into an OpenIndiana LiveCD or other 
>>>>>>>> distribution (NexentaStor) increments the counter.
>>>>>>> 
>>>>>>> A few counters can tick up when the system is reset at boot. These can 
>>>>>>> be ignored.
>>>>>>> 
>>>>>>> What you are looking for is a consistent increase of the counters 
>>>>>>> under load. In some cases I have seen millions of errors per minute on 
>>>>>>> a very unhappy system.
>>>>>> 
>>>>>> But we're talking about 600,000 -> 2,000,000 errors on a simple reset at 
>>>>>> boot.  Per my VAR, their 6Gb hardware shows significantly fewer (in the 
>>>>>> 10s to 100s of errors, not 100s to millions). 
>>>>> 
>>>>> For high-quality hardware, I see 4 to 8.  If I see > 1,000, then I start 
>>>>> replacing hardware.
>>>> 
>>>> 
>>>> And how do you define "high quality hardware"?  Obviously these aren't 
>>>> crummy SATA adapters and low cost drives.  The chassis and backplane are 
>>>> on Nexenta's HSL.  While the cards are not explicitly listed, the 
>>>> underlying chip (LSI 1068) is on another card (3081E-R) that is on the HSL.
>>> 
>>> I recently tested an HP DL380 G7 with D2600 and D2700 JBOD chassis. Zero 
>>> errors.
>> 
>> I'm assuming these had some sort of LSI cards in them since that's the 
>> primary focus here.  Do you happen to know models and what expander chip was 
>> used on the backplane(s)?
> 
> LSI 2008 chipset (HP SC08Ge HBA).  Expanders are HP-branded; I'll speculate 
> they are LSI SAS2x28.
> 
> Note: there is also firmware on the HBAs and expanders. But I do not expect 
> firmware to change the link error counts. I suspect that is more of a 
> physical issue.

In an effort to solve this problem I did update my 3442E-R HBAs from a 2009 
firmware to "Phase 21", which came out earlier this year from LSI.  The 
replacement backplane I got from my VAR, when they thought that was the issue, 
moved the backplane firmware from 7015 to 7017 per lsiutil's output.  You're 
right, it must be a physical issue, but it just seems highly unlikely that BOTH 
HBAs failed and BOTH SAS cables failed (we'll take the expander out of the 
equation since it was replaced).
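
To make the distinction discussed above concrete (a one-time jump at reset 
versus counters that keep climbing under load), here is a rough sketch in 
Python. It is only an illustration under assumptions: the numbers are typed in 
by hand rather than parsed from lsiutil, the phy labels and function names are 
made up, and the 1,000 cutoff is just Richard's rule of thumb from earlier in 
this thread.

```python
from typing import Dict

# Assumption: Richard's rule of thumb from this thread (> 1,000 errors is
# cause for concern). It is not a vendor-specified limit.
CONCERN_THRESHOLD = 1000


def classify_phy(after_reset: int, after_load: int,
                 threshold: int = CONCERN_THRESHOLD) -> str:
    """Classify one receive-side link error counter from two readings."""
    growth = after_load - after_reset
    if growth > threshold:
        return "growing under load"
    if after_reset > threshold:
        return "high at reset, flat under load"
    return "ok"


def classify_all(after_reset: Dict[str, int],
                 after_load: Dict[str, int]) -> Dict[str, str]:
    """Apply classify_phy to every phy present in both snapshots."""
    return {
        phy: classify_phy(after_reset[phy], after_load[phy])
        for phy in after_reset
        if phy in after_load
    }


if __name__ == "__main__":
    # Hypothetical readings; the phy names are invented labels.
    reset = {"hba0/phy0": 600000, "hba0/phy1": 6, "expander/phy0": 4}
    load = {"hba0/phy0": 600010, "hba0/phy1": 5006, "expander/phy0": 4}
    for phy, verdict in classify_all(reset, load).items():
        print(phy, verdict)
```

Since the counters are kept on the receiving side, each end of every link 
would need its own entry in both snapshots for the comparison to be complete.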

> 
>>> Currently, the test process for HSL records any errors, but as long as the 
>>> root cause can be explained, the devices can pass certification.
>> 
>> Well... since we can't even come to a reasonable justification for why these 
>> errors exist with no "true" indicator of bad hardware, something like this 
>> could pass the HSL if the VAR can justify it? I'm not saying that's what 
>> happened. I'm just trying to understand the process.
> 
> A certification does not mean that any specific implementation operates 
> without errors. A failed part, noisy environment, or other influences will 
> affect any specific implementation.

Would it not be more prudent to re-run the tests after a failure was fixed and 
try to eliminate environmental variables?  If you were to look up the reason it 
made it onto the HSL, it should be "It just works!", not "it works, but this is 
why we're seeing errors". Caveats like that create doubt when someone later 
tries to diagnose the same or similar hardware.

> -- richard
> 
> -- 
> 
> ZFS and performance consulting
> http://www.RichardElling.com
> LISA '11, Boston, MA, December 4-9 
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
