On Thu, 27 Jan 2011, Giovanni Tirloni wrote:

>>
>> That's a lot of data to have collected. It sounds extremely valuable.
>> Would you be able to publish the results?
<--snip-->
>> Tom
>> P.S. Were you also gathering temperature data?
>>
>
> Yes, from the server but not from the individual disks. It seems the
> megaraid plugin/driver for smartmonctl doesn't work in Solaris.
>
> If people would like to discuss best practices or have any papers to suggest
> on how to collect and make sense of this all, that'd be nice. We've always
> wanted to take the guesswork out of this and perhaps creating a profile
> database would help others too.
>

You might take a look at the Computer Failure Data Repository:

http://cfdr.usenix.org/

Garth Gibson and Bianca Schroeder started a project to collect and
publish data about hardware failure. Bianca might have pointers on best
practices, though the following isn't promising:

"Currently, data collection and analysis is complicated by the fact that
there is no widely accepted format for anomaly data and there exist no
guidelines on what data to collect and how. We hope that the experiences
from working with a variety of sites on collecting and analyzing failure
data will lead to some best practices for failure data collection.
Providing such guidelines will make it easier for sites to collect data
that is useful and comparable across sites."

HTH,
Tony

> It took quite some time to adjust the media error count that our environment
> would be willing to accept before p1 incidents became the norm (ie. we
> started replacing disks with any media error and increased the threshold
> until failing disks started to generate p1 incidents).
>
> --
> Giovanni Tirloni (gtirl...@sysdroid.com)
>
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/