Re: [zfs-discuss] Scrub found error in metadata:0x0, is that always fatal? No checksum errors now...
An intermediate update to my recent post:

2011-11-30 21:01, Jim Klimov wrote:
> Hello experts,
>
> I've finally upgraded my troublesome oi-148a home storage box to
> oi-151a about a week ago (using the pkg update method from the wiki
> page - I'm not certain if that repository is fixed at the release
> version or is a sliding "current" one).
>
> After the OS upgrade I scrubbed my main pool - a 6-disk raidz2 - and
> some checksum errors were discovered on individual disks, with one
> non-correctable error at the raidz level. It named a file which was
> indeed not readable (I/O errors), so I deleted it. The dataset
> pool/media has no snapshots, and dedup was disabled on it, so I hoped
> the error was gone.
>
> I cleared the errors (this only zeroed the counters, but zpool status
> still complained that there were some metadata errors in
> pool/media:0x4) and reran the scrub. While the scrub was running,
> zpool status reported this error and metadata:0x0. The computer hung
> and was reset during the scrub, but apparently resumed from the same
> spot. When the operation completed, however, it had zero checksum
> errors at both the disk and raidz levels, and the pool/media error was
> gone, but the metadata:0x0 error is still in place.
>
> Searching the list archive I found a similar post relevant to snv_134
> and snv_135, and at that time Victor Latushkin suggested that the pool
> must be recreated. I have some unique data on the pool, so I'm
> reluctant to recreate it (besides, it is problematic to back up 10TB
> of data at home, and it could take weeks to try and upload it to my
> work - even if there were that much free space there, which there is
> not).
>
> So far I have cleared the errors and started a new scrub. I kind of
> hope that if the box won't hang, it might discover that there are no
> actual errors after all. I'll see that in about 100 hours. The pool is
> now imported and automounted, and I didn't yet try to export and
> reimport it.

The scrub is running slower this time - it has been going for a couple
of days now and is only nearing 25% completion (the last timings were
89 and 101 hours). However, it seems to have confirmed some
raidz-/pool-level checksum errors (without known individual-disk
errors); what puzzles me more is that there are 2 raidz-level errors
for the one pool-level error:

# zpool status -v
  pool: pool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://www.sun.com/msg/ZFS-8000-8A
  scan: scrub in progress since Wed Nov 30 19:38:47 2011
        1.97T scanned out of 8.34T at 13.6M/s, 135h54m to go
        0 repaired, 23.68% done
config:

        NAME        STATE     READ WRITE CKSUM
        pool        ONLINE       0     0     1
          raidz2-0  ONLINE       0     0     2
            c7t0d0  ONLINE       0     0     0
            c7t1d0  ONLINE       0     0     0
            c7t2d0  ONLINE       0     0     0
            c7t3d0  ONLINE       0     0     0
            c7t4d0  ONLINE       0     0     0
            c7t5d0  ONLINE       0     0     0
        cache
          c4t1d0p7  ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        <metadata>:<0x0>

My question still stands: is it possible to recover from this error or
somehow safely ignore it? ;) I mean, without backing up data and
recreating the pool?

If the problem is in metadata but presumably the pool still works, then
this particular metadata is either not critical or redundant, and
somehow could be forged and replaced by valid metadata. Is this a valid
line of thought?

Are there any tools to remake such a metadata block?

Again, I did not try to export/reimport the pool yet, except for that
time 3 days ago when the machine hung, was reset and imported the pool
and continued the scrub automatically...
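For reference, the clear-and-rescrub cycle described above is just the
stock commands - a minimal sketch, using the pool name "pool" from the
status output:

  zpool clear pool       # zero the error counters and logged errors;
                         # this repairs nothing by itself
  zpool scrub pool       # re-read and verify every block; raidz2
                         # rewrites whatever it can reconstruct
  zpool status -v pool   # watch progress and the permanent-error list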
I think it is now too late to do an export and a rollback import, too...

Still, I'd like to estimate my chances of living on without recreating
the pool and without losing data. Perhaps there are some ways to
actually check, fix or forge the needed metadata?

Also, a zdb walk previously found some inconsistencies (allocated !=
referred); can that be better diagnosed or repaired? Can this
discrepancy, a few sectors' worth of size, be a cause of - or be caused
by - that reported metadata error?

Thanks,
// Jim Klimov
(sent from a mobile, pardon any typos ;))
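(A note on the "rollback import" mentioned above, for the archives - a
sketch I have not verified on this pool: the rewind import discards the
most recent transactions, so it only helps soon after the damage
happens, which is why I consider it too late here:

  zpool export pool       # cleanly detach the pool first
  zpool import -nF pool   # dry run: report whether a rewind would
                          # succeed, without actually performing it
  zpool import -F pool    # rewind to an earlier consistent txg,
                          # discarding the newest transactions)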
Re: [zfs-discuss] ZFS not starting
On Thu 2011-12-01 (14:19), Freddie Cash wrote:
> You will need to find a lot of extra RAM to stuff into that machine in
> order for it to boot correctly, load the dedupe tables into ARC,
> process the intent log, and then import the pool.

Thanks guys, I managed to get 24GB together and it made it (looks like
it used 12GB of that).

> And, you'll need that extra RAM in order to destroy the ZFS filesystem
> that has dedupe enabled.

That filesystem is gone, and it seems like I've got mostly the right
amount of free space. Waiting to see what a scrub has to say...
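For anyone sizing RAM for a deduped pool ahead of time - a rough
back-of-the-envelope sketch; the ~320 bytes per in-core DDT entry is
the commonly quoted ballpark, not an exact figure:

  zdb -DD pool    # dump dedup-table statistics for a deduped pool,
                  # including the entry count and entry sizes
  zdb -S pool     # simulate dedup on an existing pool to estimate
                  # the table size before enabling it

  # e.g. ~100M unique blocks x ~320 bytes/entry = ~32GB of ARC for
  # the DDT alone, on top of normal metadata/data caching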
Re: [zfs-discuss] Scrub found error in metadata:0x0, is that always fatal? No checksum errors now...
2011-12-02 18:25, Steve Gonczi wrote:
> Hi Jim,
>
> Try to run a "zdb -b poolname"... This should report any leaked or
> double-allocated blocks. (It may or may not run; it tends to run out
> of memory and crash on large datasets.) I would be curious what zdb
> reports, and whether you are able to run it w/o crashing with "out of
> memory".

Ok, when/if it completes scrubbing the pool, I'll try that. But it is
likely to fail, unless there are some new failsafe workarounds for such
failures in oi_151a.

In the meanwhile, here are copies of the zdb walks which I did a couple
of weeks ago while repairing (finally, replacing) the rpool on this
box. At that time it was booted with the oi_148a LiveUSB.

Some of the walks (those with leak checking enabled) never completed:

root@openindiana:~# time zdb -bb -e 1601233584937321596

Traversing all blocks to verify nothing leaked ...
(box hung: LAN disconnected; RAM/swap used up according to "vmstat 1")

root@openindiana:~# time zdb -bsvc -e 1601233584937321596

Traversing all blocks to verify checksums and verify nothing leaked ...
Assertion failed: zio_wait(zio_claim(0L, zcb->zcb_spa, refcnt ? 0 :
spa_first_txg(zcb->zcb_spa), bp, 0L, 0L, ZIO_FLAG_CANFAIL)) == 0
(0x2 == 0x0), file ../zdb.c, line 1950
Abort

real    7197m41.288s
user    291m39.256s
sys     25m48.133s

This took most of the week just to fail. And a walk without leak checks
took half a day to find some discrepancies and "unreachable" blocks:

root@openindiana:~# time zdb -bsvL -e 1601233584937321596

Traversing all blocks ...

block traversal size 9044729487360 != alloc 9044729499648
(unreachable 12288)

        bp count:            85245222
        bp logical:     8891466103808    avg: 104304
        bp physical:    7985508591104    avg:  93676      compression: 1.11
        bp allocated:  12429007810560    avg: 145802      compression: 0.72
        bp deduped:     3384278323200    ref>1: 13909855  deduplication: 1.27
        SPA allocated:  9044729499648    used: 75.64%

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     -      -       -       -       -       -       -  unallocated
     2    32K      4K   72.0K   36.0K    8.00    0.00  object directory
     3  1.50K   1.50K    108K   36.0K    1.00    0.00  object array
     2    32K   2.50K   72.0K   36.0K   12.80    0.00  packed nvlist
     -      -       -       -       -       -       -  packed nvlist size
 7.80K   988M    208M   1.12G    147K    4.75    0.01  bpobj
     -      -       -       -       -       -       -  bpobj header
     -      -       -       -       -       -       -  SPA space map header
  183K   753M    517M   6.49G   36.3K    1.46    0.06  SPA space map
    22  1020K   1020K   1.58M   73.6K    1.00    0.00  ZIL intent log
  933K  14.6G   3.11G   25.2G   27.6K    4.69    0.22  DMU dnode
 1.75K  3.50M    896K   42.0M   24.0K    4.00    0.00  DMU objset
     -      -       -       -       -       -       -  DSL directory
   390   243K    200K   13.7M   36.0K    1.21    0.00  DSL directory child map
   388   298K    208K   13.6M   36.0K    1.43    0.00  DSL dataset snap map
   715  10.2M   1.14M   25.1M   36.0K    8.92    0.00  DSL props
     -      -       -       -       -       -       -  DSL dataset
     -      -       -       -       -       -       -  ZFS znode
     -      -       -       -       -       -       -  ZFS V0 ACL
 76.1M  8.06T   7.25T   11.2T    150K    1.11   98.67  ZFS plain file
 2.17M  2.76G   1.33G   52.7G   24.3K    2.08    0.46  ZFS directory
   341   314K    171K   7.99M   24.0K    1.84    0.00  ZFS master node
   857  25.5M   1.16M   20.1M   24.1K   21.94    0.00  ZFS delete queue
     -      -       -       -       -       -       -  zvol object
     -      -       -       -       -       -       -  zvol prop
     -      -       -       -       -       -       -  other uint8[]
     -      -       -       -       -       -       -  other uint64[]
     -      -       -       -       -       -       -  other ZAP
     -      -       -       -       -       -       -  persistent error log
    33  4.02M    763K   4.46M    139K    5.39    0.00  SPA history
     -      -       -       -       -       -       -  SPA history offsets
     1    512     512   36.0K   36.0K    1.00    0.00  Pool properties
     -      -       -       -       -       -       -  DSL permissions
 17.1K  12.7M   8.63M    411M   24.0K    1.48    0.00  ZFS ACL
     -      -       -       -       -       -       -  ZFS SYSACL
     5  80.0K   5.00K    120K   24.0K   16.00    0.00  FUID table
     -      -       -       -       -       -       -  FUID table size
 1.37K   723K    705K   49.3M   36.0K    1.03    0.00  DSL dataset next clones
     -      -       -       -       -       -       -  scan work queue
 2.69K  2.57M   1.36M   64.6M   24.0K    1.89    0.00  ZFS user/group used
     -      -       -       -       -       -       -  ZFS user/group quota
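(To spell out those invocations for the archives - as I understand the
flags, so treat this as a sketch; the long number after -e is the pool
GUID, used to examine a pool that is not currently imported:

  zdb -bb -e <pool-guid>    # block audit with leak checking: tracks
                            # every allocation in RAM, which is what
                            # exhausts memory on a large pool
  zdb -bsvc -e <pool-guid>  # same, plus -c to verify checksums, so it
                            # reads every block's data as well
  zdb -bsvL -e <pool-guid>  # -L disables leak tracking, trading the
                            # leak report for a much smaller footprint)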
Re: [zfs-discuss] Scrub found error in metadata:0x0, is that always fatal? No checksum errors now...
On Fri, Dec 2, 2011 at 02:58, Jim Klimov wrote:
> My question still stands: is it possible to recover from this error or
> somehow safely ignore it? ;) I mean, without backing up data and
> recreating the pool?
>
> If the problem is in metadata but presumably the pool still works,
> then this particular metadata is either not critical or redundant, and
> somehow could be forged and replaced by valid metadata. Is this a
> valid line of thought?
>
> Are there any tools to remake such a metadata block?
>
> Again, I did not try to export/reimport the pool yet, except for that
> time 3 days ago when the machine hung, was reset and imported the pool
> and continued the scrub automatically...
>
> I think it is now too late to do an export and a rollback import,
> too...

Unfortunately I cannot provide you with a direct answer, as I have only
been a user of ZFS for about a year and in that time have only
encountered this once.

Anecdotally, at work I had something similar happen to a Nexenta Core
3.0 (b134) box three days ago (seemingly caused by a hang and then an
eventual panic as a result of attempting to add a drive that is having
read failures to the pool). When the box came back up, ZFS reported an
error in metadata:0x0. We scrubbed the tank (~400GB used) and, as in
your case, the checksum error didn't clear. We ran a scrub again, and
it seems that the second scrub did clear the metadata error.

I don't know if that means it will work that way for everyone, every
time, or not. But considering that the pool and the data on it appear
to be fine (just not having any replicas until we get the bad disk
replaced), and that all metadata is supposed to have at least one extra
copy (with an apparent max of 3 copies [1]) on the pool at all times, I
can't see why this error shouldn't be cleared by a scrub.

[1] http://blogs.oracle.com/relling/entry/zfs_copies_and_data_protection
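(A quick way to see that copies behaviour on your own pool - a sketch;
note that the copies property governs user data, while ZFS adds the
extra ditto copies for metadata on its own, capped at 3:

  zfs get copies pool    # data copies per dataset (default 1); ZFS
                         # keeps copies+1 for metadata, up to 3
  zpool clear pool       # drop the logged error, then...
  zpool scrub pool       # ...rescrub; in our case the second scrub
                         # is what cleared metadata:0x0)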