Re: [zfs-discuss] (FreeBSD) ZFS RAID: Disk fails while replacing another disk

Victor Latushkin Fri, 05 Mar 2010 03:01:33 -0800

Mark J Musante wrote:

It looks like you're running into a DTL issue.  ZFS believes that ad16p2 has
some data on it that hasn't been copied off yet, and it's not considering the
fact that it's part of a raidz group and ad4p2.


There is a CR on this,
http://bugs.opensolaris.org/bugdatabase/view_bug.do?bug_id=6909724, but what's
viewable in the bug database is pretty minimal.

If you haven't made a backup yet (or at least done a complete snapshot and
generated a send stream from it), my advice would be to do that now.  Then
reboot and see if that clears the DTL enough to let you do the detach.
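
As a rough illustration of the "complete snapshot plus send stream" step, the sketch below only builds and prints the two command lines; nothing touches the pool unless run() is called. The snapshot name "pre-detach" and the stream path are made-up placeholders, and it assumes `zfs snapshot -r` and `zfs send -R` are available on the installed ZFS version.

```python
import subprocess


def backup_commands(pool, snap_name):
    """Return argv lists for a recursive snapshot and a replication send."""
    snapshot = f"{pool}@{snap_name}"
    return (
        ["zfs", "snapshot", "-r", snapshot],  # -r: also snapshot descendant datasets
        ["zfs", "send", "-R", snapshot],      # -R: replication stream of that snapshot tree
    )


def run(pool, snap_name, stream_path):
    """Take the snapshot, then write the send stream to stream_path."""
    snap_cmd, send_cmd = backup_commands(pool, snap_name)
    subprocess.run(snap_cmd, check=True)
    with open(stream_path, "wb") as out:
        subprocess.run(send_cmd, stdout=out, check=True)


# Hypothetical names; inspect the printed commands before running anything.
for cmd in backup_commands("tank", "pre-detach"):
    print(" ".join(cmd))
```

The stream should be written to storage outside the affected pool, for example run("tank", "pre-detach", "/mnt/other/tank.zfs"), where the path is again only a placeholder.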

Actually, besides the bug mentioned above, resilvering will not clear DTLs
upon completion, due to

6887372 DTLs not cleared after resilver if permanent errors present

since there are permanent errors present. By the way, they affect some files
referenced by snapshots, as 'zpool status -v' suggests:


>> tank/DVD:<0x9cd> tank/d...@20100222225100:/Memento.m4v
>> tank/d...@20100222225100:/Payback.m4v
>> tank/d...@20100222225100:/TheManWhoWasntThere.m4v

In the case of OpenSolaris it is not that difficult to work around this bug
without getting rid of the files with errors (or the snapshots referencing
them), but I'm not sure how to do the same on FreeBSD.


But you always have the option of destroying the snapshot indicated above
(and maybe more).
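
To find which snapshots would have to go, one option is to pull the snapshot names out of the 'zpool status -v' error list. The following is a minimal sketch, not a tested tool: it assumes each error entry has the form dataset@snapshot:/path, and the dataset name "tank/example" is a stand-in because the real name is truncated in the listing above. It only prints candidate commands and destroys nothing.

```python
def snapshots_with_errors(lines):
    """Return the distinct dataset@snapshot names found in error entries."""
    found = []
    for line in lines:
        # Drop surrounding whitespace and any leading quote markers.
        entry = line.strip().lstrip("> ").strip()
        name, sep, _path = entry.partition(":")
        # Entries without '@' (e.g. tank/DVD:<0x9cd>) are not snapshots.
        if sep and "@" in name and name not in found:
            found.append(name)
    return found


# Hypothetical sample modelled on the listing above.
sample = [
    "tank/DVD:<0x9cd>",
    "tank/example@20100222225100:/Memento.m4v",
    "tank/example@20100222225100:/Payback.m4v",
]
for snap in snapshots_with_errors(sample):
    print("zfs destroy", snap)
```

Since the same damaged blocks may be referenced by older snapshots too, the list may grow after the next scrub.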

regards,
victor



On 3 Mar, 2010, at 18.46, Christian Heßmann wrote:

Hello guys,


I've already written this on the FreeBSD forums, but so far, the feedback
is not so great - seems FreeBSD guys aren't that keen on ZFS. I have some
hopes you'll be more experienced with this kind of error:

I have a ZFS pool comprised of two 3-disk RAIDs which I've recently moved
from OS X to FreeBSD (8 stable).

One harddisk failed last weekend with lots of shouting, SMART messages and
even a kernel panic. I attached a new disk and started the replacement.
Unfortunately, about 20% into the replacement, a second disk in the same
RAID showed signs of misbehaviour by giving me read errors. The resilvering
did finish, though, and it left me with only three broken files according
to zpool status:

[r...@camelot /]# zpool status -v tank
  pool: tank
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
 scrub: resilver completed after 10h42m with 136 errors on Tue Mar  2 07:55:05 2010
config:

        NAME           STATE     READ WRITE CKSUM
        tank           DEGRADED   137     0     0
          raidz1       ONLINE       0     0     0
            ad17p2     ONLINE       0     0     0
            ad18p2     ONLINE       0     0     0
            ad20p2     ONLINE       0     0     0
          raidz1       DEGRADED   326     0     0
            replacing  DEGRADED     0     0     0
              ad16p2   OFFLINE      2  169K     6
              ad4p2    ONLINE       0     0     0  839G resilvered
            ad14p2     ONLINE       0     0     0  5.33G resilvered
            ad15p2     ONLINE     418     0     0  5.33G resilvered


errors: Permanent errors have been detected in the following files:

        tank/DVD:<0x9cd>
        tank/d...@20100222225100:/Memento.m4v
        tank/d...@20100222225100:/Payback.m4v
        tank/d...@20100222225100:/TheManWhoWasntThere.m4v
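
For readers who want to see at a glance which rows of a config table like the one above carry non-zero counters, here is a small sketch that parses the text. It is an illustration under assumptions, not part of any ZFS tooling: the helper names are invented, and it treats a "K" suffix as 1024, which is how zpool appears to abbreviate large counts, so suffixed values are approximate.

```python
def parse_count(text):
    """Convert a counter such as '0' or '169K' to an int (approximate if suffixed)."""
    suffixes = {"K": 1024, "M": 1024**2, "G": 1024**3}  # assumed 1024-based
    if text[-1] in suffixes:
        return int(float(text[:-1]) * suffixes[text[-1]])
    return int(text)


def devices_with_errors(config_text):
    """Return (name, state, read, write, cksum) for rows with any non-zero counter."""
    rows = []
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) < 5 or fields[0] == "NAME":
            continue  # skip blank lines and the header row
        read, write, cksum = (parse_count(f) for f in fields[2:5])
        if read or write or cksum:
            rows.append((fields[0], fields[1], read, write, cksum))
    return rows


config = """\
NAME           STATE     READ WRITE CKSUM
tank           DEGRADED   137     0     0
  raidz1       ONLINE       0     0     0
    ad17p2     ONLINE       0     0     0
    ad18p2     ONLINE       0     0     0
    ad20p2     ONLINE       0     0     0
  raidz1       DEGRADED   326     0     0
    replacing  DEGRADED     0     0     0
      ad16p2   OFFLINE      2  169K     6
      ad4p2    ONLINE       0     0     0  839G resilvered
    ad14p2     ONLINE       0     0     0  5.33G resilvered
    ad15p2     ONLINE     418     0     0  5.33G resilvered
"""
for row in devices_with_errors(config):
    print(row)
```

On this table the rows reported are the pool, the second raidz1 group, ad16p2 and ad15p2, that is, the offlined disk and the one suspected of a cable problem plus their parents.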


I have the feeling the problems on ad15p2 are related to a cable issue,
since it doesn't have any SMART errors, is quite a new drive (3 months old)
and was IMHO sufficiently "burned in" by repeatedly filling it to the brim
and checking the contents (via ZFS). So I'd like to switch off the server,
replace the cable and do a scrub afterwards to make sure it doesn't produce
additional errors.

Unfortunately, although it says the resilvering completed, I can't detach
ad16p2 (the first faulted disk) from the system:

[r...@camelot /]# zpool detach tank ad16p2
cannot detach ad16p2: no valid replicas

To be honest, I don't know how to proceed now. It feels like my system is
in a very unstable state right now, with a replacement not yet finished and
errors on two drives in one RAID-Z1.

I deleted the files affected, but have about 20 snapshots of this
filesystem and think these files are in most of them since they're quite
old.

So, what should I do now? Delete all snapshots? Move all other files from
this filesystem to a new filesystem and destroy the old filesystem? Try to
export and import the pool? Is it even safe to reboot the machine right
now?

I got one response in the FreeBSD Forum telling me I should reboot the
machine and do a scrub afterwards; it should then detect that it doesn't
need the old disk anymore. I am a bit reluctant to do that, to be
honest...

Any help would be appreciated.

Thank you.

Christian

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
