On Thu, Jan 27, 2011 at 4:26 AM, Giovanni Tirloni <gtirl...@sysdroid.com> wrote:
> On Wed, Jan 26, 2011 at 11:14 PM, <da...@lang.hm> wrote:
>>
>> My datacenter just had a problem and the air temp in the 'cold aisles' got
>> over 105F over about 75 min. I'm trying to make the case that I need to be
>> proactive and replace drives in large arrays that were in that area, but I
>> need some ammunition.
>>
>
> In our department we have 12000+ disks and found that the temperature didn't
> influence failures too much.
>
> We've been monitoring errors and proactively replacing disks when they reach
> 100-150 media errors or 50 transport errors. We've found that from that
> point on they usually go up to 800-900 errors very quickly and fail, while
> causing a lot of disruption in the systems (high latency while the HBA/RAID
> controller still thinks they are good and keeps retrying things). So far it
> decreased our p1 incidents considerably.

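
For anyone who wants to try the replacement policy described in the quote, here is a minimal sketch of the threshold check. It only encodes the numbers from the message (the lower bound of the 100-150 media-error range, and 50 transport errors). The names DiskErrorCounts, should_replace and disks_to_replace are hypothetical, and how the counters are actually collected is not covered in the thread, so that part is left out.

```python
from dataclasses import dataclass

# Thresholds from the quoted message. Using 100 (the lower end of the
# quoted 100-150 range) is an assumption made for this sketch.
MEDIA_ERROR_THRESHOLD = 100
TRANSPORT_ERROR_THRESHOLD = 50


@dataclass
class DiskErrorCounts:
    """Hypothetical per-disk error counters (collection method not specified)."""
    name: str
    media_errors: int
    transport_errors: int


def should_replace(disk: DiskErrorCounts) -> bool:
    """Return True when either counter has reached its threshold."""
    return (
        disk.media_errors >= MEDIA_ERROR_THRESHOLD
        or disk.transport_errors >= TRANSPORT_ERROR_THRESHOLD
    )


def disks_to_replace(disks):
    """Return the names of the disks that meet the replacement policy."""
    return [d.name for d in disks if should_replace(d)]
```

The two thresholds are combined with "or" because the message treats either counter as sufficient reason to pull the disk before it starts stalling the controller.
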
That's a lot of data to have collected. It sounds extremely valuable.
Would you be able to publish the results?

Tom

P.S. Were you also gathering temperature data?

-- 
http://EverythingSysadmin.com -- my blog (new posts Mon and Wed)
http://www.TomOnTime.com -- my advice (more videos coming soon)
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/