Re: [zfs-discuss] Scrub found error in metadata:0x0, is that always fatal? No checksum errors now...

Jim Klimov Fri, 02 Dec 2011 02:00:48 -0800

An intermediate update to my recent post:

2011-11-30 21:01, Jim Klimov wrote:

Hello experts,


I finally upgraded my troublesome oi-148a home storage box to oi-151a about a week ago 
(using the pkg update method from the wiki page; I'm not certain whether that repository 
is fixed at the release version or is a sliding "current" one).

After the OS upgrade I scrubbed my main pool (a 6-disk raidz2), and some checksum 
errors were discovered on individual disks, with one non-correctable error at 
the raid level. It named a file which was indeed not readable (I/O errors), so I 
deleted it. The dataset pool/media has no snapshots, and dedup was disabled on 
it, so I hoped the error was gone.

I cleared the errors (this only zeroed the counters; zpool status still complained 
that there were some metadata errors in pool/media:0x4) and reran the scrub. While 
the scrub was running, zpool status reported this error and metadata:0x0. The 
computer hung and was reset during the scrub, but the scrub apparently resumed 
from the same spot. When the operation completed, however, it had zero checksum 
errors at both disk and raid levels and the pool/media error was gone, but the 
metadata:0x0 error was still in place.

Searching the list archive I found a similar post relevant to snv134 and 135, 
and at that time Victor Latushkin suggested that the pool must be recreated. I 
have some unique data on the pool, so I'm reluctant to recreate it (besides, 
it's problematic to back up 10 TB of data at home, and it could take weeks to 
upload it to my work, even if there were that much free space there, which 
there is not).

So far I have cleared the errors and started a new scrub. I kind of hope that if 
the box doesn't hang, the scrub might find that there are no actual errors after 
all. I'll see that in about 100 hours. The pool is now imported and automounted, 
and I have not yet tried to export and reimport it.


The scrub is running slower this time: it has been going
for a couple of days now and is only nearing 25% completion
(the last runs took 89 and 101 hours). However, it seems to
have confirmed some raidz-/pool-level checksum errors
(without known individual disk errors); what puzzles me more
is that there are 2 raidz-level errors for the one
pool-level error:

# zpool status -v
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Nov 30 19:38:47 2011
    1.97T scanned out of 8.34T at 13.6M/s, 135h54m to go
    0 repaired, 23.68% done
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     1
          raidz2-0  ONLINE       0     0     2
            c7t0d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
        cache
          c4t1d0p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>

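As a rough cross-check of the "scan:" lines above, the remaining and total
scrub times can be recomputed from the reported sizes and rate. This is only
a sketch: it assumes binary units (1T = 1024*1024 M) and that the reported
13.6M/s holds for the whole pass; the function name scrub_hours is made up
for illustration.

```python
# Assumption: zpool status sizes are binary units, so 1T = 1024 * 1024 M.
MIB_PER_TIB = 1024 * 1024


def scrub_hours(size_tib: float, rate_mib_s: float) -> float:
    """Hours needed to scan size_tib at a constant rate of rate_mib_s."""
    return size_tib * MIB_PER_TIB / rate_mib_s / 3600


# Values copied from the "scan:" lines of the zpool status output.
scanned, total, rate = 1.97, 8.34, 13.6

remaining = scrub_hours(total - scanned, rate)  # time still to go
full_pass = scrub_hours(total, rate)            # whole scrub at this rate

print(f"remaining: {remaining:.1f} h, full pass: {full_pass:.1f} h")
```

Under these assumptions the remaining time comes out at roughly 136 hours,
in line with the reported 135h54m, and a full pass at this rate would take
close to 180 hours, against the 89 and 101 hours of the previous runs.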


My question still stands: is it possible to recover
from this error or somehow safely ignore it? ;)
I mean, without backing up data and recreating the
pool?

If the problem is in metadata but the pool
presumably still works, then this particular
metadata is either not critical or redundant, and
could perhaps be forged and replaced by valid metadata.
Is this a valid line of thought?

Are there any tools to remake such a metadata
block?

Again, I did not try to export/reimport the pool
yet, except for that time 3 days ago when the
machine hung, was reset and imported the pool
and continued the scrub automatically...

I think it is now too late to do an export and
a rollback import, too...

Still, I'd like to estimate now what my chances are of living on without 
recreating the pool or losing data. Are there perhaps some ways to actually check, 
fix or forge the needed metadata? Also, previously a zdb walk found some 
inconsistencies (allocated != referred); can that be better diagnosed or 
repaired? Could this discrepancy, a few sectors' worth in size, be a cause of, 
or be caused by, that reported metadata error?
Thanks,
// Jim Klimov

sent from a mobile, pardon any typos ,)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
