Re: [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

Mark J Musante Thu, 04 Mar 2010 04:33:45 -0800

It looks like you're running into a DTL (dirty time log) issue.  ZFS believes 
that ad16p2 still holds some data that hasn't been copied off yet, and it isn't 
taking into account that ad16p2 is part of a raidz group and is being replaced 
by ad4p2.


There is a CR on this 
(http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6909724), but 
what's viewable in the bug database is pretty minimal.

If you haven't made a backup yet (or at least done a complete snapshot and 
generated a send stream from it), my advice would be to do that now.  Then 
reboot and see if that clears the DTL enough to let you do the detach.
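
As a rough sketch of that sequence, assuming the pool name (tank) and old 
device (ad16p2) from your status output: the snapshot name, the backup path 
and the run/backup_pool/after_reboot helpers below are hypothetical 
placeholders, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. "tank" and "ad16p2" come from the thread; the snapshot name
# and the backup path are hypothetical and must be adapted.
POOL="${POOL:-tank}"
OLD_DEV="${OLD_DEV:-ad16p2}"
SNAP="${SNAP:-pre-detach}"             # hypothetical snapshot name
BACKUP="${BACKUP:-/backup/tank.zfs}"   # hypothetical file outside the pool
DRY_RUN="${DRY_RUN:-1}"                # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1: complete snapshot plus a send stream generated from it.
backup_pool() {
    # Recursive snapshot of every dataset in the pool.
    run zfs snapshot -r "$POOL@$SNAP"
    # Replication stream of that snapshot, written to a file off the pool.
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ zfs send -R $POOL@$SNAP > $BACKUP"
    else
        zfs send -R "$POOL@$SNAP" > "$BACKUP"
    fi
}

# Step 2 (after the reboot): check the pool, then retry the detach.
after_reboot() {
    run zpool status -v "$POOL"
    run zpool detach "$POOL" "$OLD_DEV"
}

# Usage: call backup_pool before rebooting and after_reboot once the
# machine is back up. Nothing runs until one of them is called.
```

The backup file has to live somewhere other than the affected pool. Leaving 
DRY_RUN at its default lets you read the exact commands before anything 
touches the pool.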


On 3 Mar, 2010, at 18.46, Christian Heßmann wrote:

> Hello guys,
> 
> 
> I've already posted this on the FreeBSD forums, but so far the feedback has 
> not been great - it seems the FreeBSD guys aren't that keen on ZFS. I hope 
> you'll be more experienced with this kind of error:
> 
> I have a ZFS pool comprised of two 3-disk RAIDs which I've recently moved 
> from OS X to FreeBSD (8 stable).
> 
> One harddisk failed last weekend with lots of shouting, SMART messages and 
> even a kernel panic.
> I attached a new disk and started the replacement.
> Unfortunately, about 20% into the replacement, a second disk in the same RAID 
> showed signs of misbehaviour by giving me read errors. The resilvering did 
> finish, though, and it left me with only three broken files according to 
> zpool status:
> 
> [r...@camelot /]# zpool status -v tank
>  pool: tank
> state: DEGRADED
> status: One or more devices has experienced an error resulting in data
>        corruption.  Applications may be affected.
> action: Restore the file in question if possible.  Otherwise restore the
>        entire pool from backup.
>   see: http://www.sun.com/msg/ZFS-8000-8A
> scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
> config:
> 
>        NAME           STATE     READ WRITE CKSUM
>        tank           DEGRADED   137     0     0
>          raidz1       ONLINE       0     0     0
>            ad17p2     ONLINE       0     0     0
>            ad18p2     ONLINE       0     0     0
>            ad20p2     ONLINE       0     0     0
>          raidz1       DEGRADED   326     0     0
>            replacing  DEGRADED     0     0     0
>              ad16p2   OFFLINE      2  169K     6
>              ad4p2    ONLINE       0     0     0  839G resilvered
>            ad14p2     ONLINE       0     0     0  5.33G resilvered
>            ad15p2     ONLINE     418     0     0  5.33G resilvered
> 
> errors: Permanent errors have been detected in the following files:
> 
>        tank/DVD:<0x9cd>
>        tank/d...@20100222225100:/Memento.m4v
>        tank/d...@20100222225100:/Payback.m4v
>        tank/d...@20100222225100:/TheManWhoWasntThere.m4v
> 
> I have the feeling the problems on ad15p2 are related to a cable issue, since 
> it doesn't have any SMART errors, is quite a new drive (3 months old) and was 
> IMHO sufficiently "burned in" by repeatedly filling it to the brim and 
> checking the contents (via ZFS). So I'd like to switch off the server, 
> replace the cable and do a scrub afterwards to make sure it doesn't produce 
> additional errors.
> 
> Unfortunately, although it says the resilvering completed, I can't detach 
> ad16p2 (the first faulted disk) from the system:
> 
> [r...@camelot /]# zpool detach tank ad16p2
> cannot detach ad16p2: no valid replicas
> 
> To be honest, I don't know how to proceed now. It feels like my system is in 
> a very unstable state right now, with a replacement not yet finished and 
> errors on two drives in one RAIDZ1.
> 
> I deleted the files affected, but have about 20 snapshots of this filesystem 
> and think these files are in most of them since they're quite old.
> 
> So, what should I do now? Delete all snapshots? Move all other files from 
> this filesystem to a new filesystem and destroy the old filesystem? Try to 
> export and import the pool? Is it even safe to reboot the machine right now?
> 
> I got one response in the FreeBSD Forum telling me I should reboot the 
> machine and do a scrub afterwards; it should then detect that it doesn't need 
> the old disk anymore - I am a bit reluctant to do that, to be honest...
> 
> Any help would be appreciated.
> 
> Thank you.
> 
> Christian
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss

