On Jul 10, 2015, at 10:47, m.r...@5-cent.us wrote:
> 
> Trying to prevent this from happening again, I've decided to replace the
> drive that's in predictive failure. The array has a hot spare. I tried to
> remove it using hpacucli, but it refuses with "operation not permitted", and
> there doesn't *seem* to be a "mark as failed" command. *Do* I just yank the
> drive?

Hi Mark, 

I’ve never had any problem just pulling and replacing drives on HP hardware 
with the hardware RAID controllers (even the icky cheap one that came out 
around the DL360/380 Gen 8 timeframe, that isn’t really hardware RAID and needs 
closed drivers in Linux).
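
If it helps, here's roughly what I run before pulling anything, just to make
sure the logical drive is healthy and that I'm about to yank the right bay.
The slot number and the drive ID below are just examples; substitute whatever
"ctrl all show config" reports on your box:

    # list controllers, arrays, physical drives and spares
    hpacucli ctrl all show config

    # confirm the logical drive(s) are OK before touching anything
    hpacucli ctrl slot=0 ld all show status

    # details on the suspect drive (2I:1:6 is just an example ID)
    hpacucli ctrl slot=0 pd 2I:1:6 show detail

    # blink the bay LED so you pull the right one
    hpacucli ctrl slot=0 pd 2I:1:6 modify led=on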

That said, I also *test it*, long before putting anything important on the arrays… 

From past experience with HP stuff, the controller usually won’t move the data 
over to the hot spare (especially if it’s a “Global” hot spare and not dedicated 
to that array) until an actual failure occurs.  “Predictive failure” isn’t 
considered a failure in HP’s world.  I don’t think there is any setting to tell 
the controller to fail over to the hot spare on a “predictive failure”.
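
You can at least see how the spare is assigned from the config output; on the
controllers I've used, a dedicated spare is listed under the array it covers,
while a global one shows against every array (slot=0 is just an example):

    # spares show up in the per-array listing
    hpacucli ctrl slot=0 show config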

I’ve also had disks that triggered a “predictive failure” under heavy load, were 
simply popped out and back in, the controller rebuilt them, and they never did 
it again for *years*.  The error rate it takes to trip a “predictive failure” is 
pretty low.

That last one is more a question of policy than anything.  How much do you 
trust it?  At one employer the game was to pop out and back in any drive that 
showed “predictive failure” on HP systems (we handled Dell stuff differently at 
the time; it was less prone to false alarms, so to speak), and if a drive did it 
again “soonish”, we’d call for the replacement disk.  That’s how often the HP 
controllers flagged one.  In a rather large farm of HP stuff, I popped and 
replaced an HP drive a week, whenever I happened by the data center.

As for the question of whether you should be able to do it safely or not… if a 
hardware RAID controller won’t let me yank a physical drive out, shove another 
one in, and rebuild itself back to whatever level of redundancy I defined as 
“nominal” for that system, I don’t want it anyway.  Look at it this way… if the 
disk had a catastrophic electronics failure while installed in the array, the 
array should handle it… yanking it out is technically nicer than some of the 
failure modes that shorted electronics can inflict on the buses on the 
backplane. (GRIN)
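
For what it’s worth, once the new drive is in you can watch the rebuild from
the same tool; in my experience the logical drive status goes to something like
“Recovering” with a percentage until it’s back to OK (again, slot=0 is just an
example):

    # rebuild progress shows up in the logical drive status
    hpacucli ctrl slot=0 ld all show status

    # and the new physical drive shows "Rebuilding" while it syncs
    hpacucli ctrl slot=0 pd all show status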

Just sharing my thoughts… your call. :-)  YMMV.  We had a service contract at 
that place, so a new disk was always just a phone call away at no additional $, 
and even with that level of service we always did the “re-seat it once” thing.  
We’d log it, and if anyone else saw that same disk flashing the next time they 
were at the data center (we checked the logged ones before doing a “re-seat”), 
they’d make the phone call and the service company would drop a drive off a few 
hours later.

--
Nate Duehr
denverpi...@me.com
