Borys Saulyak wrote:
>> As a follow up to the whole story, with the fantastic help of
>> Victor, the failed pool is now imported and functional thanks to
>> the redundancy in the meta data.
> It would be really useful if you could publish the steps to recover
> the pools.

Here it is:

Executive summary: thanks to the copy-on-write (COW) nature of ZFS it
was possible to recover a pool state that was only 5 seconds older
than the last, unopenable one.

Details:

The whole story started with a pool that was not importable. Any
attempt to import it with 'zpool import content' reported an I/O
error. This situation was preceded by a hardware-related issue where
the host could not communicate with the array over the SAS bus for
some time. It affected the array so badly that it had to be
power-cycled to come back to life.

The I/O error reported by 'zpool import' is fairly generic, so the
first step is to find out more about _when_ it happens, i.e. which
stage of the pool import process detects it. This is where DTrace is
very useful. A simple DTrace script tracing entries into and exits out
of ZFS module functions and reporting the returned value (a minimal
sketch of such a script is shown further below) provided us with the
following output:

 ...
  4  -> spa_last_synced_txg
  4  <- spa_last_synced_txg = 156012
  4  -> dsl_pool_open
  4  -> dsl_pool_open_impl
 ...
  4  <- dsl_pool_open_impl = 6597523216512
  4  -> dmu_objset_open_impl
  4  -> arc_read
 ...
  4  -> zio_read
 ...
  4  <- zio_read = 3299027537664
  4  -> zio_wait
 ...
  4  <- zio_wait = 5
  4  <- arc_read = 5
  4  <- dmu_objset_open_impl = 5
 ...
  4  -> dsl_pool_close
  4  <- dsl_pool_close = 6597523216512
  4  <- dsl_pool_open = 5
  4  -> vdev_set_state
  4  <- vdev_set_state = 6597441769472
  4  -> zfs_ereport_post
 ...
 28  <- zfs_ereport_post = 0
 28  <- spa_load

The source code reveals that this means we fail to open the Meta
Object Set (MOS) of the pool, which is pointed to by the block pointer
stored in the uberblock. But the MOS has three copies (ditto blocks)
even on unreplicated pools! Something bad must have happened.

This is the first moment where it is worth stopping and trying to
understand what that means. We get the pointer to the MOS from the
active uberblock, which itself passed checksum verification, so it was
written to disk successfully. It points to three copies of the MOS,
and all three are somehow corrupted. How could that happen? The answer
to this question is left as an exercise for the reader.

Next we extracted the offsets and sizes from the zios issued to read
the MOS from disk (this could be done more easily by looking into the
active uberblock, but anyway). This gave us the following results:

CPU  FUNCTION
  2  <- zio_read  cmd: 0, retries: 0, numerrors: 0, error: 0,
                  checksum: 0, compress: 0, ndvas: 0, txg: 156012,
                  size: 512, offset1: 23622594110,
                  offset2: 40803003282, offset3: 1610705882
  2  <- zio_read  cmd: 0, retries: 0, numerrors: 0, error: 0,
                  checksum: 0, compress: 0, ndvas: 0, txg: 156012,
                  size: 512, offset1: 23622594110,
                  offset2: 40803003282, offset3: 1610705882

With these offsets we read the three copies of the MOS into files so
we could compare them individually (you need to add 4M to each offset
to account for the front ZFS label area if you are doing this with,
say, 'dd'). It turned out that all three were completely different.
How could that happen? Most likely the corresponding writes never
reached the disk surface.

Being unable to read the MOS is bad: since it is the starting point
into the pool state, you cannot discover any other data in the pool.
So is all hope lost? No, not really. Since ZFS is COW, it is natural
to try the previous uberblock and see whether it points to a
consistent view of the pool.
So we saved all front and back ZFS labels (which turned out to be
identical) and looked at the uberblock array to see what was available
there:

Previous:
  ub[107].txg = 2616b
    o1 = 23622593482
    o2 = 40803002710
    o3 = 1610705850
    timestamp = Sun Jul 27 06:04:03 2008

Active:
  ub[108].txg = 2616c
    o1 = 23622594110
    o2 = 40803003282
    o3 = 1610705882
    timestamp = Sun Jul 27 06:04:08 2008

These are the interesting pieces of data we extracted from the active
and previous uberblocks. There are a couple of things to note:

1. 2616c (hex) = 156012 (decimal), the txg reported by
   spa_last_synced_txg() above.
2. All three offsets from the active uberblock are the same as those
   discovered earlier with the DTrace script.

We read the three MOS copies pointed to by the previous uberblock and
found that all three were the same! So it was likely that the previous
uberblock pointed to a (more) consistent pool state; at least its MOS
might be OK.

So the next step was to deactivate the active uberblock and activate
the previous one. How can this be done? Well, it is enough to change
the checksum of the currently active uberblock to make it invalid, and
to make sure that the checksum of the previous uberblock is correct.
That way the previous uberblock gets selected as active, since it then
has the highest txg (and timestamp) of all uberblocks with a correct
checksum. This can be achieved with simple tools like 'dd'; you just
need to be sure to apply the "corruption" to all four labels (see the
sketch below). A proper tool to do this would be useful, and there is
an RFE to address that:

  CR 6667683 "need a way to select an uberblock from a previous txg"

Fortunately the pool configuration in this case is simple - it
contains only one vdev - so we were able to use a little
quick-and-dirty utility which dumps the uberblock array and allows
activating any uberblock in it (by changing the checksums of all the
others so that they are incorrect).
The next step was to try importing the pool. One option was to import
it on a live system (running with three other pools and providing
service) and see the outcome, but we came up with a better idea: zdb
was very helpful here. We took zdb (along with libzpool and libzfs)
from Nevada (the one in Solaris 10 does not yet work with exported
pools) and ran it, entirely in userspace, against the pool with the
uberblock from txg 0x2616b activated, to verify pool consistency. We
started it with 'zdb -bbccsv content', and after a couple of days it
finished with the following output:

Traversing all blocks to verify checksums and verify nothing leaked ...

        No leaks (block sum matches space maps exactly)

        bp count:          66320566
        bp logical:     8629761329152    avg: 130121
        bp physical:    8624565999616    avg: 130043    compression: 1.00
        bp allocated:   8628035570688    avg: 130095    compression: 1.00
        SPA allocated:  8628035570688    used: 18.80%

Blocks  LSIZE   PSIZE   ASIZE     avg    comp   %Total  Type
     3  48.0K   5.50K   16.5K   5.50K    8.73     0.00  L1 deferred free
    35   152K   85.5K    174K   4.97K    1.78     0.00  L0 deferred free
    38   200K   91.0K    191K   5.01K    2.20     0.00  deferred free
     1    512     512      1K      1K    1.00     0.00  object directory
     3  1.50K   1.50K   3.00K      1K    1.00     0.00  object array
     1    16K      1K      2K      2K   16.00     0.00  packed nvlist
     -      -       -       -       -       -        -  packed nvlist size
     1    16K     16K     32K     32K    1.00     0.00  bplist
     -      -       -       -       -       -        -  bplist header
     -      -       -       -       -       -        -  SPA space map header
    39   624K    149K    447K   11.5K    4.19     0.00  L1 SPA space map
 2.22K  8.89M   5.21M   10.4M   4.69K    1.71     0.00  L0 SPA space map
 2.26K   9.5M   5.35M   10.9M   4.80K    1.77     0.00  SPA space map
     -      -       -       -       -       -        -  ZIL intent log
     1    16K      1K   3.00K   3.00K   16.00     0.00  L6 DMU dnode
     1    16K      1K   3.00K   3.00K   16.00     0.00  L5 DMU dnode
     1    16K      1K   3.00K   3.00K   16.00     0.00  L4 DMU dnode
     1    16K      1K   3.00K   3.00K   16.00     0.00  L3 DMU dnode
     1    16K   1.50K   4.50K   4.50K   10.67     0.00  L2 DMU dnode
    12   192K    110K    329K   27.4K    1.75     0.00  L1 DMU dnode
 1.47K  23.5M   4.85M    9.7M   6.62K    4.84     0.00  L0 DMU dnode
 1.49K  23.8M   4.97M   10.1M   6.77K    4.78     0.00  DMU dnode
     2     2K      1K   3.00K   1.50K    2.00     0.00  DMU objset
     -      -       -       -       -       -        -  DSL directory
     2     1K      1K      2K      1K    1.00     0.00  DSL directory child map
     1    512     512      1K      1K    1.00     0.00  DSL dataset snap map
     2     1K      1K      2K      1K    1.00     0.00  DSL props
     -      -       -       -       -       -        -  DSL dataset
     -      -       -       -       -       -        -  ZFS znode
     -      -       -       -       -       -        -  ZFS V0 ACL
 1.07K  17.1M   1.07M   2.14M   2.01K   15.94     0.00  L3 ZFS plain file
 8.13K   130M   31.8M   63.6M   7.83K    4.09     0.00  L2 ZFS plain file
  505K  7.89G   3.19G   6.38G   12.9K    2.48     0.08  L1 ZFS plain file
 62.7M  7.84T   7.84T   7.84T    128K    1.00    99.92  L0 ZFS plain file
 63.2M  7.85T   7.84T   7.85T    127K    1.00   100.00  ZFS plain file
   991  1.77M    596K   1.16M   1.20K    3.04     0.00  ZFS directory
     1    512     512      1K      1K    1.00     0.00  ZFS master node
     1    512     512      1K      1K    1.00     0.00  ZFS delete queue
     -      -       -       -       -       -        -  zvol object
     -      -       -       -       -       -        -  zvol prop
     -      -       -       -       -       -        -  other uint8[]
     -      -       -       -       -       -        -  other uint64[]
     -      -       -       -       -       -        -  other ZAP
     -      -       -       -       -       -        -  persistent error log
     1   128K   4.50K   9.00K   9.00K   28.44     0.00  SPA history
     -      -       -       -       -       -        -  SPA history offsets
     -      -       -       -       -       -        -  Pool properties
     -      -       -       -       -       -        -  DSL permissions
     -      -       -       -       -       -        -  ZFS ACL
     -      -       -       -       -       -        -  ZFS SYSACL
     -      -       -       -       -       -        -  FUID table
     -      -       -       -       -       -        -  FUID table size
     -      -       -       -       -       -        -  DSL dataset next clones
     -      -       -       -       -       -        -  scrub work queue
     1    16K      1K   3.00K   3.00K   16.00     0.00  L6 Total
     1    16K      1K   3.00K   3.00K   16.00     0.00  L5 Total
     1    16K      1K   3.00K   3.00K   16.00     0.00  L4 Total
 1.07K  17.1M   1.07M   2.15M   2.01K   15.94     0.00  L3 Total
 8.13K   130M   31.8M   63.6M   7.83K    4.09     0.00  L2 Total
  505K  7.89G   3.19G   6.38G   12.9K    2.48     0.08  L1 Total
 62.7M  7.84T   7.84T   7.84T    128K    1.00    99.92  L0 Total
 63.2M  7.85T   7.84T   7.85T    127K    1.00   100.00  Total

                        capacity   operations   bandwidth  ---- errors ----
description           used avail  read write  read write  read write cksum
content              7.85T 33.9T   420     0  51.8M     0     0     0     0
  /dev/dsk/c2t9d0s0  7.85T 33.9T   420     0  51.8M     0     0     0     0

bash-3.00#

This confirmed that the previous pool state was completely consistent,
so it should be safe to import the pool in that state. The import
worked just fine, and an additional scrub did not find any errors.

Hope this helps,
Victor