Hi All, I run ZFS (a version 6 pool) under FreeBSD. Whilst I realise this changes a *whole heap* of things - I'm more interested in whether I did 'anything wrong' when I had a recent drive failure...
One of a mirrored pair of drives on the system started failing, badly (confirmed by 'hard' read & write errors logged to the console). ZFS also started showing errors, and the machine started hanging, waiting for I/Os to complete (which is how I noticed it). How many errors does a drive have to throw before it's considered "failed" by ZFS? Mine had got to about 30-40 [not a huge amount] but was making the system unusable, so I manually attached another hot-spare drive to the 'good' device left in that mirrored pair.

However, ZFS was still trying to read data off the failing drive - this pushed the re-silver time up to 755 hours, whilst the error count climbed to around 300 over the next forty minutes or so. Not wanting my data unprotected for 755-odd hours (and fearing the number would just keep going up), I did:

    zpool detach vol ad4

('ad4' was the failing drive). This hung all I/O on the pool :( - I waited 5 hours and then decided to reboot. After the reboot the pool came back OK (with 'ad4' removed), the re-silver continued, and it completed in half an hour.

Thinking about it - perhaps I should have detached ad4 (the failing drive) before attaching another device? My thinking at the time was that I didn't know how badly the drive had failed, and removing what might have been 200GB of 'perfectly' accessible data from a mirrored pair, prior to re-silvering to a replacement, didn't sit right.

I'm hoping ZFS shouldn't have hung when I later decided to fix the situation and remove ad4?

-Kp
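P.S. In case it helps, here is a rough sketch of the sequence I went through. Only the pool name 'vol' and the failing drive 'ad4' are the real names from above; 'ad6' (the surviving half of the mirror) and 'ad8' (the hot-spare I attached) are placeholder device names for illustration.

    # per-device read/write/checksum error counts, plus resilver progress/ETA
    zpool status -v vol

    # attach the spare to the surviving half of the mirror; this starts a
    # resilver onto ad8 (which, it seems, was still reading from ad4 too)
    zpool attach vol ad6 ad8

    # roughly forty minutes later, with the ETA at 755 hours, I removed the
    # failing drive - this is the command that hung all I/O on the pool
    zpool detach vol ad4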