On Thu, 27 Jan 2011, Giovanni Tirloni wrote:

>>
>> That's a lot of data to have collected.  It sounds extremely valuable.
>>  Would you be able to publish the results?
<--snip-->
>> Tom
>> P.S.  Were you also gathering temperature data?
>>
>
> Yes, from the server but not from the individual disks. It seems the
> megaraid plugin/driver for smartctl doesn't work on Solaris.
>
> If people would like to discuss best practices or have any papers to suggest
> on how to collect and make sense of this all, that'd be nice. We've always
> wanted to take the guesswork out of this and perhaps creating a profile
> database would help others too.
>

You might take a look at the computer failure data repository:

http://cfdr.usenix.org/

Garth Gibson and Bianca Schroeder started a project to collect and publish
data about hardware failures. She might have pointers on best practices,
though this passage isn't promising:

"Currently, data collection and analysis is complicated by the fact that
there is no widely accepted format for anomaly data and there exist no
guidelines on what data to collect and how. We hope that the experiences
from working with a variety of sites on collecting and analyzing failure
data will lead to some best practices for failure data collection.
Providing such guidelines will make it easier for sites to collect data
that is useful and comparable across sites."

HTH,

Tony


> It took quite some time to adjust the media error count that our environment
> would be willing to accept before p1 incidents became the norm (i.e., we
> started replacing disks with any media error and increased the threshold
> until failing disks started to generate p1 incidents).
>
> -- 
> Giovanni Tirloni (gtirl...@sysdroid.com)
>
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
