Borys Saulyak wrote:
>> As a follow up to the whole story, with the fantastic help of
>> Victor, the failed pool is now imported and functional thanks to
>> the redundancy in the meta data.

> It would be really useful if you could publish the steps to recover
> the pools.

Here it is:

Executive summary:

Thanks to the copy-on-write (COW) nature of ZFS it was possible to
successfully recover a pool state that was only 5 seconds older than the
last, unopenable one.

Details:

The whole story started with a pool which was not importable. Any
attempt to import it with 'zpool import content' reported an I/O error.

This situation was preceded by a hardware-related issue where the host
could not communicate with the array over the SAS bus for some time. It
affected the array so badly that it had to be power-cycled to come back
to life.

The I/O error reported by 'zpool import' is fairly generic, so the first
step is to find out more about _when_ it happens, i.e. which stage of
the pool import process detects it. This is where DTrace is very useful.
A simple DTrace script tracing entries into and returns from ZFS module
functions and reporting the returned values provided us with the
following output:


...
   4          -> spa_last_synced_txg
   4          <- spa_last_synced_txg            = 156012
   4          -> dsl_pool_open
   4            -> dsl_pool_open_impl
...
   4            <- dsl_pool_open_impl           = 6597523216512
   4            -> dmu_objset_open_impl
   4              -> arc_read
...
   4                -> zio_read
...
   4                <- zio_read                 = 3299027537664
   4                -> zio_wait
...
   4                <- zio_wait                 = 5
   4              <- arc_read                   = 5
   4            <- dmu_objset_open_impl         = 5
...
   4            -> dsl_pool_close
   4            <- dsl_pool_close               = 6597523216512
   4          <- dsl_pool_open                  = 5
   4          -> vdev_set_state
   4          <- vdev_set_state                 = 6597441769472
   4          -> zfs_ereport_post
...
  28  <- zfs_ereport_post                       = 0
  28  <- spa_load
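
The script itself is not reproduced here; a DTrace one-liner along these
lines (an illustration only, not necessarily the exact script used)
produces similar flow output for the import attempt:

   # trace entry into and return from every function in the zfs module,
   # printing return values; -F gives the -> / <- flow indentation
   dtrace -F -n 'fbt:zfs::entry' -n 'fbt:zfs::return { trace(arg1); }' \
       -c 'zpool import content'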


The return value 5 here is EIO, and the source code reveals that it
means we failed to open the Meta Object Set (MOS) of the pool, which is
pointed to by the block pointer stored in the uberblock. But the MOS has
three copies (ditto blocks) even on unreplicated pools! Something really
bad must have happened.

This is the first point where it is worth stopping to understand what
this means. First, we got the pointer to the MOS from the active
uberblock, which means the uberblock passed checksum verification, so it
was written to disk successfully. It points to three copies of the MOS,
and all three are somehow corrupted. How could that happen? The answer
to this question is left as an exercise for the reader.

Next we extracted the offsets and sizes from the zios initiated to read
the MOS from disk (this could be done more easily by looking into the
active uberblock, but anyway). This provided us with the following
results:

CPU FUNCTION
   2  <- zio_read    cmd: 0, retries: 0, numerrors: 0, error: 0, checksum: 0,
                     compress: 0, ndvas: 0, txg: 156012, size: 512,
                     offset1: 23622594110, offset2: 40803003282, offset3: 1610705882
   2  <- zio_read    cmd: 0, retries: 0, numerrors: 0, error: 0, checksum: 0,
                     compress: 0, ndvas: 0, txg: 156012, size: 512,
                     offset1: 23622594110, offset2: 40803003282, offset3: 1610705882

With these offsets we read the three copies of the MOS out to files to
compare them individually (you need to add 4M to the offset to account
for the ZFS front labels if you are doing it with, say, 'dd'). It turned
out that all three were completely different. How could that happen?
Most likely because the corresponding writes never reached the disk
surface.
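
For illustration, reading one copy out with 'dd' looks roughly like this
('iseek=' is the Solaris dd spelling; 4194304 bytes is the 4M reserved
at the front of the device):

   dd if=/dev/dsk/c2t9d0s0 of=/tmp/mos.1 bs=1 count=512 \
       iseek=$((23622594110 + 4194304))
   # likewise for offset2 and offset3, then compare the three copies:
   cmp /tmp/mos.1 /tmp/mos.2
   cmp /tmp/mos.1 /tmp/mos.3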

Being unable to read the MOS is bad: since it is the starting point
into the pool state, without it you cannot discover any other data in
the pool.

So is all hope lost? No, not really.

Since ZFS is COW, it is natural to try the previous uberblock and see
whether it points to a consistent view of the pool. So we saved all the
front and back ZFS labels (which turned out to be identical) and looked
at the uberblock array to see what was available there:

Previous:
ub[107].txg = 2616b
o1 = 23622593482
o2 = 40803002710
o3 = 1610705850
timestamp = Sun Jul 27 06:04:03 2008

Active:
ub[108].txg = 2616c
o1 = 23622594110
o2 = 40803003282
o3 = 1610705882
timestamp = Sun Jul 27 06:04:08 2008


These are the interesting pieces of data we extracted from the active
and previous uberblocks. There are a couple of things to note:

1. 0x2616c = 156012 (as reported by spa_last_synced_txg() above)
2. All three offsets from the active uberblock are the same as those
discovered earlier with the DTrace script.
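
For reference, each ZFS label is 256K; labels 0 and 1 sit at the front
of the device (offsets 0 and 256K) and labels 2 and 3 at the end (device
size minus 512K and minus 256K), with the uberblock array occupying the
second half (128K) of each label in 1K slots. A rough sketch for saving
the two front labels (device and file names are just examples):

   dd if=/dev/dsk/c2t9d0s0 of=/tmp/label0 bs=256k count=1
   dd if=/dev/dsk/c2t9d0s0 of=/tmp/label1 bs=256k count=1 iseek=1
   # the nvlist part of the labels can also be dumped with 'zdb -l <device>'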

We read the three MOS copies pointed to by the previous uberblock and
found that all three were the same! So it was likely that the previous
uberblock pointed to a (more) consistent pool state; at least its MOS
might be OK.

So the next step was to deactivate the active uberblock and activate
the previous one.

How can this be done? Well, it is enough to invalidate the checksum of
the currently active uberblock and to make sure that the checksum of the
previous uberblock is correct. That way the previous uberblock will be
selected as the active one, since it will have the highest txg id (and
timestamp) of all uberblocks with a correct checksum.

This can be achieved with simple tools like 'dd'; you just need to be
sure to apply the "corruption" to all four labels. A dedicated tool for
this would be useful, though, and there is an RFE to address that:

CR 6667683 "need a way to select an uberblock from a previous txg"

Fortunately the pool configuration in this case is simple: it contains
only one vdev, so we were able to leverage a little quick-and-dirty
utility which dumps the uberblock array and allows any uberblock in the
array to be activated (by changing the checksums of all the others to be
incorrect).
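
To make the arithmetic explicit: with 1K uberblock slots the slot index
is txg mod 128, so txg 0x2616c (156012) lives in ub[108], i.e. at
<label offset> + 128K + 108 * 1K in each of the four labels. Overwriting
a few bytes inside that slot is enough to break its embedded checksum so
that it is no longer selected. A sketch of the idea for label 0 only
(dangerous, shown purely as an illustration; the actual change was done
with the utility mentioned above, and only after the labels had been
saved):

   dd if=/dev/zero of=/dev/dsk/c2t9d0s0 bs=1 count=8 conv=notrunc \
       seek=$((131072 + 108 * 1024))

The same then has to be repeated at the corresponding offsets in the
other three labels.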

The next step was to try to import the pool. One option was to attempt
the import on the live system (running with three other pools and
providing service) and see the outcome. But we came up with a better
idea.

zdb was very helpful here. We took zdb (along with libzpool and libzfs)
from Nevada (the one in Solaris 10 does not yet work with exported
pools) and ran it in userspace against the pool with the uberblock from
txg 0x2616b activated, to verify pool consistency.
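
Getting the Nevada zdb to pick up the matching Nevada libzpool/libzfs on
a Solaris 10 host can be done, for example, by pointing the runtime
linker at them (the path below is only an example of where the Nevada
bits might be unpacked):

   LD_LIBRARY_PATH=/var/tmp/nv/lib /var/tmp/nv/zdb ...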

So we started it with 'zdb -bbccsv content', and after a couple of days
it finished with the following output:

Traversing all blocks to verify checksums and verify nothing leaked ...

         No leaks (block sum matches space maps exactly)

         bp count:        66320566
         bp logical:    8629761329152     avg: 130121
         bp physical:   8624565999616     avg: 130043    compression:   1.00
         bp allocated:  8628035570688     avg: 130095    compression:   1.00
         SPA allocated: 8628035570688    used: 18.80%

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
      3  48.0K   5.50K   16.5K   5.50K    8.73     0.00      L1 deferred free
     35   152K   85.5K    174K   4.97K    1.78     0.00      L0 deferred free
     38   200K   91.0K    191K   5.01K    2.20     0.00  deferred free
      1    512     512      1K      1K    1.00     0.00  object directory
      3  1.50K   1.50K   3.00K      1K    1.00     0.00  object array
      1    16K      1K      2K      2K   16.00     0.00  packed nvlist
      -      -       -       -       -       -        -  packed nvlist size
      1    16K     16K     32K     32K    1.00     0.00  bplist
      -      -       -       -       -       -        -  bplist header
      -      -       -       -       -       -        -  SPA space map header
     39   624K    149K    447K   11.5K    4.19     0.00      L1 SPA space map
  2.22K  8.89M   5.21M   10.4M   4.69K    1.71     0.00      L0 SPA space map
  2.26K   9.5M   5.35M   10.9M   4.80K    1.77     0.00  SPA space map
      -      -       -       -       -       -        -  ZIL intent log
      1    16K      1K   3.00K   3.00K   16.00     0.00      L6 DMU dnode
      1    16K      1K   3.00K   3.00K   16.00     0.00      L5 DMU dnode
      1    16K      1K   3.00K   3.00K   16.00     0.00      L4 DMU dnode
      1    16K      1K   3.00K   3.00K   16.00     0.00      L3 DMU dnode
      1    16K   1.50K   4.50K   4.50K   10.67     0.00      L2 DMU dnode
     12   192K    110K    329K   27.4K    1.75     0.00      L1 DMU dnode
  1.47K  23.5M   4.85M    9.7M   6.62K    4.84     0.00      L0 DMU dnode
  1.49K  23.8M   4.97M   10.1M   6.77K    4.78     0.00  DMU dnode
      2     2K      1K   3.00K   1.50K    2.00     0.00  DMU objset
      -      -       -       -       -       -        -  DSL directory
      2     1K      1K      2K      1K    1.00     0.00  DSL directory child map
      1    512     512      1K      1K    1.00     0.00  DSL dataset snap map
      2     1K      1K      2K      1K    1.00     0.00  DSL props
      -      -       -       -       -       -        -  DSL dataset
      -      -       -       -       -       -        -  ZFS znode
      -      -       -       -       -       -        -  ZFS V0 ACL
  1.07K  17.1M   1.07M   2.14M   2.01K   15.94     0.00      L3 ZFS plain file
  8.13K   130M   31.8M   63.6M   7.83K    4.09     0.00      L2 ZFS plain file
   505K  7.89G   3.19G   6.38G   12.9K    2.48     0.08      L1 ZFS plain file
  62.7M  7.84T   7.84T   7.84T    128K    1.00    99.92      L0 ZFS plain file
  63.2M  7.85T   7.84T   7.85T    127K    1.00   100.00  ZFS plain file
    991  1.77M    596K   1.16M   1.20K    3.04     0.00  ZFS directory
      1    512     512      1K      1K    1.00     0.00  ZFS master node
      1    512     512      1K      1K    1.00     0.00  ZFS delete queue
      -      -       -       -       -       -        -  zvol object
      -      -       -       -       -       -        -  zvol prop
      -      -       -       -       -       -        -  other uint8[]
      -      -       -       -       -       -        -  other uint64[]
      -      -       -       -       -       -        -  other ZAP
      -      -       -       -       -       -        -  persistent error log
      1   128K   4.50K   9.00K   9.00K   28.44     0.00  SPA history
      -      -       -       -       -       -        -  SPA history offsets
      -      -       -       -       -       -        -  Pool properties
      -      -       -       -       -       -        -  DSL permissions
      -      -       -       -       -       -        -  ZFS ACL
      -      -       -       -       -       -        -  ZFS SYSACL
      -      -       -       -       -       -        -  FUID table
      -      -       -       -       -       -        -  FUID table size
      -      -       -       -       -       -        -  DSL dataset next clones
      -      -       -       -       -       -        -  scrub work queue
      1    16K      1K   3.00K   3.00K   16.00     0.00      L6 Total
      1    16K      1K   3.00K   3.00K   16.00     0.00      L5 Total
      1    16K      1K   3.00K   3.00K   16.00     0.00      L4 Total
  1.07K  17.1M   1.07M   2.15M   2.01K   15.94     0.00      L3 Total
  8.13K   130M   31.8M   63.6M   7.83K    4.09     0.00      L2 Total
   505K  7.89G   3.19G   6.38G   12.9K    2.48     0.08      L1 Total
  62.7M  7.84T   7.84T   7.84T    128K    1.00    99.92      L0 Total
  63.2M  7.85T   7.84T   7.85T    127K    1.00   100.00  Total

                       capacity   operations   bandwidth  ---- errors ----
description         used avail  read write  read write  read write cksum
content            7.85T 33.9T   420     0 51.8M     0     0     0     0
  /dev/dsk/c2t9d0s0 7.85T 33.9T   420     0 51.8M     0     0     0     0

This confirmed that the previous pool state was completely consistent,
so it should be safe to import the pool in this state.

The import worked just fine, and an additional scrub did not find any
errors.
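
In terms of commands, the final verification amounted to something like:

   zpool import content
   zpool scrub content
   zpool status -v content    # confirm the scrub completed with 0 errors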


Hope this helps,
Victor

