Hi All,

I run ZFS (a version 6 pool) under FreeBSD. Whilst I realise this changes a 
*whole heap* of things - I'm more interested in whether I did 'anything wrong' 
when I had a recent drive failure...

One of a mirrored pair of drives on the system started failing, badly 
(confirmed by 'hard' read & write errors logged to the console). ZFS also 
started showing errors, and the machine started hanging, waiting for I/Os to 
complete (which is how I noticed it).

How many errors does a drive have to throw before it's considered "failed" 
by ZFS? Mine had got to about 30-40 (not a huge amount), but it was making 
the system unusable, so I manually attached a hot-spare drive to the 'good' 
device left in that mirrored pair.
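For reference, the attach was along these lines (the 'good' drive and spare 
names below are illustrative, not the actual devices - only ad4 is real):

  zpool attach vol ad6 ad8

i.e. 'zpool attach <pool> <existing-device> <new-device>', which kicks off 
the re-silver onto the spare.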

However, ZFS was still trying to read data off the failing drive - this 
pushed the re-silver time up to 755 hours, whilst the number of errors in 
the next forty minutes or so got to around 300. Not wanting my data 
unprotected for 755-odd hours (and fearing that number would just keep 
climbing) I did:

  zpool detach vol ad4

('ad4' was the failing drive).

This hung all I/O on the pool :( - I waited 5 hours, and then decided to 
reboot.

After the reboot the pool came back OK (with 'ad4' removed), and the 
re-silver continued and completed in half an hour.

Thinking about it - perhaps I should have detached ad4 (the failing drive) 
before attaching another device? My thinking at the time was that I didn't 
know how badly the drive had failed, and obviously removing what might have 
been 200GB of 'perfectly' accessible data from a mirrored pair, prior to 
re-silvering to a replacement, didn't sit right.
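(The alternative order I mean would have been roughly the following - again 
with illustrative names for the good drive and the replacement:

  zpool detach vol ad4
  zpool attach vol ad6 ad8

i.e. drop the failing half of the mirror first, then re-silver onto the 
replacement from the remaining good copy only.)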

I'm hoping ZFS shouldn't have hung when I later decided to fix the 
situation and remove ad4?

-Kp