Well, I have an intermediate data point. One scrub run
completed without finding any new errors (besides one
at the pool level and two at the raidz2 level).
"Zpool clear" alone did not fix it, meaning that the
pool:metadata:<0x0> was still reported as problematic,
but a second attempt at "zpool clear" did clear the
errors from "zpool status".
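For reference, the sequence was roughly the following (a sketch; "pool" is a placeholder, not my actual pool name):

```shell
# "pool" is a placeholder pool name; substitute your own.
zpool status -v pool   # still lists the persistent metadata:<0x0> error
zpool clear pool       # first attempt: the entry remained
zpool status -v pool
zpool clear pool       # second attempt: the error list came up clean
zpool status -v pool
```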
Before running "zdb" as asked by other commenters,
I decided to rescrub. At some point between 200GB
and 1.7TB scanned, the errors returned to the stats.
So, contrary to Nigel's optimistic theory that
metadata is extra-redundant anyway and should be
easily fixable, it seems that I still have the
problem. It does not show itself in practice yet,
but scrub keeps finding it ;)
Once the current scrub completes in a few days,
I plan to run zdb as Steve asked. If anyone else
has theories, suggestions or requests to dig
up more clues - bring them on! ;)
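For the record, I don't have Steve's exact zdb invocation in front of me; a commonly suggested consistency check (run against the pool by name; it can take about as long as a scrub) looks something like this:

```shell
# -b traverses and counts all blocks, -c verifies their checksums,
# -s prints statistics, -v adds verbosity. "pool" is a placeholder name.
zdb -bcsv pool
```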
2011-12-02 20:08, Nigel W wrote:
On Fri, Dec 2, 2011 at 02:58, Jim Klimov <jimkli...@cos.ru> wrote:
My question still stands: is it possible to recover
from this error or somehow safely ignore it? ;)
I mean, without backing up data and recreating the
pool?
If the problem is in metadata but presumably the
pool still works, then this particular metadata
is either not critical or redundant, and somehow
can be forged and replaced by valid metadata.
Is this a rightful path of thought?
Are there any tools to remake such a metadata
block?
Again, I did not try to export/reimport the pool
yet, except for that time 3 days ago when the
machine hung, was reset and imported the pool
and continued the scrub automatically...
I think it is now too late to do an export and
a rollback import, too...
Unfortunately I cannot provide you with a direct answer as I have only
been a user of ZFS for about a year and in that time only encountered
this once.
Anecdotally, at work I had something similar happen on a Nexenta Core
3.0 (b134) box three days ago (seemingly caused by a hang and
eventual panic that resulted from attempting to add a drive with
read failures to the pool). When the box came back up, zfs reported
an error in metadata:0x0. We scrubbed the tank (~400GB used) and, as
in your case, the checksum error didn't clear. We ran a scrub again,
and it seems that the second scrub did clear the metadata error.
I don't know if that means it will work that way for everyone, every
time, or not. But considering that the pool and the data on it
appear to be fine (just without any replicas until we get the bad
disk replaced), and that all metadata is supposed to have <copies>+1
copies (with an apparent maximum of 3 copies [1]) on the pool at all
times, I can't see why this error shouldn't be cleared by a scrub.
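For what it's worth, the data-level "copies" setting that the metadata redundancy is derived from can be inspected and adjusted per dataset, e.g. (a sketch; "tank" and "tank/data" are placeholder names):

```shell
# "tank" and "tank/data" are placeholder names. Metadata is kept with
# one copy more than the data (capped at 3), per [1] below.
zfs get copies tank
zfs set copies=2 tank/data   # data gets 2 copies; its metadata then gets 3
```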
[1] http://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection
//Jim
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss