Hello all,

SHORT VERSION:
What conditions can cause the resilvering process to reset? My lost-and-found disk can't get back into the pool because the resilvers keep restarting...

LONG VERSION:

I maintain a number of older systems, among them a Sun Thumper (X4500) running OpenSolaris SXCE with nowadays-smallish 250GB disks and a ZPOOL v14 (with the OS supporting ZPOOL v16). This year it has "lost contact" with two different drives; however, a poweroff followed by a poweron restored the contact. The first time, a couple of months ago, the system used the hotspare disk successfully, with no issues worth mentioning or remembering.

This week, however, the hotspare took days to get used and ultimately never did, with the pool's CKSUM errors on the lost disk growing into the millions and dmesg clobbered with messages about the lost drive. There was not even a message about resilvering in "zpool status", other than the hotspare being marked "IN USE".

After a reboot the disk was found again and the resilver went humming along, with the hotspare still marked as in use. The error count is non-zero but has been stable overnight, and no data loss was reported:

  pool: pond
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h49m, 9.52% done, 7h46m to go
config:

        NAME            STATE     READ WRITE CKSUM
        pond            ONLINE       0     0     0
        ...
          raidz1        ONLINE       0     0     0
            c0t1d0      ONLINE       0     0     0
            spare       ONLINE       0     0    46
              c1t2d0    ONLINE       0     0     0  17.3G resilvered
              c5t6d0    ONLINE       0     0     0
            c4t3d0      ONLINE       0     0     0
            c6t5d0      ONLINE       0     0     0
            c7t6d0      ONLINE       0     0     0
        ...
        spares
          c5t6d0        INUSE     currently in use

errors: No known data errors

The problem I come here with is that the resilver does not complete. It has been reset after roughly 40, 84 and 80GB of resilvered progress, starting over from scratch each time. I hoped it would at least get further with every attempt, but that does not seem to be true as of the last two restarts.

The pool is indeed active, used by zfs-auto-snapshots in particular (I have disabled them for now while asking; they also perform very slowly, with some "zfs destroy" invocations overlapping each other and failing due to missing targets). There are no correlated errors in dmesg that would point me at a problem, and the CKSUM count stays the same. So far my guess is that the stale disk may be found to reference transactions that are gone from the live pool (e.g. via deletion of automatic snapshots), but I didn't think that should cause a restart of the whole resilvering process?..

The pool must be quite fragmented, I guess, because each resilver of a 250GB disk is estimated at about 10 hours - and every status report keeps promising roughly that, even with 80GB already behind. "zpool iostat 60" shows low figures, about 1MB/s of writes and about 5MB/s of reads; "iostat -xnz 1" also shows low utilization on the drives in general (about 5-10% busy, 500-900KB/s of reads), but with a bigger peak on the recovering raidz's drives and the hotspare disk.
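To catch the exact moment of the next restart and see whether it lines up with one of the automatic snapshot destroys (per my guess above), I plan to log the progress in a loop and compare the timestamps against the pool's internal event log. A minimal sketch - the log path and the 60-second interval are arbitrary choices of mine:

# while true; do
>   date
>   zpool status pond | grep 'resilver in progress'
>   sleep 60
> done >> /var/tmp/resilver-progress.log &

# zpool history -il pond | tail -50

If I read the man page right, "zpool history -i" also shows internally-logged events (snapshot creations/destroys, scrub/resilver starts and completions), so its timestamps should be comparable against the progress log.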
Coming back to the performance side: even these per-disk numbers do not seem to be hitting any obvious IOPS or bandwidth bottleneck:

# iostat -xzn c0t1d0 c1t2d0 c5t6d0 c4t3d0 c6t5d0 c7t6d0 1
                    extended device statistics
    r/s    w/s   kr/s   kw/s wait actv wsvc_t asvc_t  %w  %b device
  107.5   21.1 5200.2   93.1  1.3  0.9   10.0    6.9  22  37 c0t1d0
  107.7   21.2 5200.7   93.2  1.2  0.9    9.6    7.0  22  37 c4t3d0
  107.2   21.1 5199.6   93.2  1.3  0.9   10.0    7.1  22  38 c6t5d0
  107.6   21.1 5199.2   93.2  1.2  0.9    9.5    7.3  21  37 c7t6d0
    1.5  200.0   71.3 4574.1  0.5  0.4    2.7    1.9  10  16 c1t2d0
  106.9   21.2 5124.4   93.1  1.1  0.9    8.8    7.1  20  35 c5t6d0
                    extended device statistics
  177.9    0.0 4251.1    0.0  1.2  0.4    6.9    2.0  28  35 c0t1d0
  156.8    0.0 4074.2    0.0  1.1  0.4    6.8    2.3  25  35 c4t3d0
  156.8    0.0 4048.1    0.0  0.4  1.1    2.6    6.8  12  33 c6t5d0
  156.8    0.0 4188.3    0.0  1.2  0.4    7.6    2.6  30  41 c7t6d0
    0.0  343.8    0.0 2927.3  0.1  0.1    0.4    0.2   5   6 c1t2d0
  167.9    0.0 4017.4    0.0  1.3  0.4    7.7    2.3  29  37 c5t6d0
                    extended device statistics
  113.0  114.0 2588.9  134.5  1.7  0.3    7.4    1.5  29  33 c0t1d0
   79.0  122.0 2777.0  144.0  1.6  0.4    7.9    1.8  33  37 c4t3d0
   98.0  122.0 2775.0  144.5  0.6  1.1    2.8    4.9  17  35 c6t5d0
  125.0  125.0 2946.0  144.0  1.5  0.3    6.0    1.3  27  33 c7t6d0
    0.0  395.1    0.0 2363.4  0.3  0.1    0.9    0.2   7   9 c1t2d0
  131.0  114.0 3091.5  138.5  1.5  0.4    6.0    1.5  30  36 c5t6d0
                    extended device statistics
   82.0   80.0 1715.7   62.0  0.5  0.2    3.4    1.3   9  21 c0t1d0
   75.0   69.0 1673.8   57.5  0.7  0.2    4.8    1.4  11  20 c4t3d0
   68.0   69.0 1607.8   58.5  0.3  0.5    1.9    3.9   8  22 c6t5d0
   71.0   74.0 1539.8   57.0  0.5  0.2    3.1    1.2   8  17 c7t6d0
    0.0  205.0    0.0 1079.3  0.2  0.0    0.9    0.2   4   4 c1t2d0
   72.0   77.0 1679.8   60.0  0.6  0.2    3.9    1.3  12  19 c5t6d0

What conditions can cause the resilvering process to reset? And how can it be sped up to completion, other than by disabling the zfs auto-snapshots? There are no other heavy users of the system, though several VMs and webapp servers run off the pool's data and cannot be shut down...

The tuning knobs I know from OpenIndiana did not work here:

# echo zfs_resilver_delay/W0t0 | mdb -kw
mdb: failed to dereference symbol: unknown symbol name
# echo zfs_resilver_min_time_ms/W0t20000 | mdb -kw
mdb: failed to dereference symbol: unknown symbol name

Thanks for any ideas,
//Jim Klimov
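P.S. It occurred to me that zfs_resilver_delay and zfs_resilver_min_time_ms probably only exist in the newer pool-scan code, so on these older bits the equivalent knobs are likely named differently (my assumption - perhaps something like zfs_resilver_min_time/zfs_scrub_min_time and zfs_scrub_limit). Before poking anything with mdb, I guess the sane first step is to check which symbols this kernel actually has, e.g.:

# nm /dev/ksyms | egrep 'resilver|scrub'

and only then read and cautiously adjust whatever turns up, in the same style as above:

# echo zfs_resilver_min_time/D | mdb -k
# echo zfs_resilver_min_time/W0t10 | mdb -kw

(the symbol name and the value 0t10 here are just placeholders until I confirm what exists on this build and what units it uses).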