Re: [lopsa-discuss] hard drive failures

Tom Limoncelli Thu, 27 Jan 2011 05:06:36 -0800

On Thu, Jan 27, 2011 at 4:26 AM, Giovanni Tirloni <gtirl...@sysdroid.com> wrote:
> On Wed, Jan 26, 2011 at 11:14 PM, <da...@lang.hm> wrote:
>>
>> My datacenter just had a problem and the air temp in the 'cold aisles' got
>> over 105F over about 75 min. I'm trying to make the case that I need to be
>> proactive and replace drives in large arrays that were in that area, but I
>> need some ammunition.
>>
>
> In our department we have 12000+ disks and found that the temperature didn't
> influence failures too much.
>
> We've been monitoring errors and proactively replacing disks when they reach
> 100-150 media errors or 50 transport errors. We've found that from that
> point on they usually go up to 800-900 errors very quickly and fail, while
> causing a lot of disruption in the systems (high latency while the HBA/RAID
> controller still thinks they are good and keeps retrying things). So far it
> decreased our p1 incidents considerably.
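
For illustration, the replacement policy quoted above could be sketched roughly as follows. This is only a minimal sketch, not the poster's actual tooling: the function names and the dictionary layout are hypothetical, how the counters are collected is not specified, and the lower bound of the "100-150 media errors" range is assumed as the trigger.

```python
# Thresholds taken from the quoted message. Using the lower bound (100)
# of the "100-150 media errors" range is an assumption.
MEDIA_ERROR_LIMIT = 100
TRANSPORT_ERROR_LIMIT = 50


def needs_replacement(media_errors, transport_errors):
    """Return True once either error counter reaches its threshold."""
    return (media_errors >= MEDIA_ERROR_LIMIT
            or transport_errors >= TRANSPORT_ERROR_LIMIT)


def disks_to_replace(counters):
    """Given {disk_name: (media_errors, transport_errors)}, return the
    sorted names of disks that should be swapped proactively.

    The input layout is hypothetical; real counters would come from
    whatever the HBA/RAID controller or OS exposes.
    """
    return sorted(name for name, (media, transport) in counters.items()
                  if needs_replacement(media, transport))
```

A periodic job could feed the current counters into disks_to_replace() and open a ticket for each name returned, so a disk is pulled before its error count climbs toward the 800-900 range described above.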


That's a lot of data to have collected. It sounds extremely valuable.
Would you be able to publish the results?

Tom
P.S.  Were you also gathering temperature data?

-- 
http://EverythingSysadmin.com  -- my blog (new posts Mon and Wed)
http://www.TomOnTime.com -- my advice (more videos coming soon)
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
