Ok, the resilver has been restarted a number of times over the past few days due to two main issues - a drive disconnecting itself, and power failure. I think my troubles are 100% down to these environmental factors, but would like some confidence that after the resilver has completed, if it reports there aren't any persistent errors, that there actually aren't any.
Attempt #1: the resilver started after I initiated the replace on my SXCE105 install. All was well until the box lost power. On starting back up, it hung while starting OpenSolaris, just after the line containing the system hostname. I've had this before when a scrub is in progress. My usual tactic is to boot with the 2009.06 live CD, import the pool, stop the scrub, export, reboot into SXCE105 again, and import. Of course, you can't stop a replace that's in progress, so the remaining attempts are in the 2009.06 live CD (build 111b perhaps?).

Attempt #2: the resilver started when I imported the pool in 2009.06. It was resilvering fine until one drive reported itself as offline. dmesg showed that the drive was 'gone'. I then noticed a lot of checksum errors at the pool level and at the RAIDZ1 level, and a large number of 'permanent' errors. In a panic, thinking that the resilver was now doing more harm than good, I exported the pool and rebooted.

Attempt #3: I imported in 2009.06 again. This time, the drive that was disconnected in the last attempt was online again, and proceeded to resilver along with the original drive. There was only one permanent error, in a particular snapshot of a ZVOL I'm not too concerned about. This is the point at which I wrote the original post, wondering if all of those 700+ errors reported the first time around weren't a problem any more. I had been running zpool clear in a loop because there were checksum errors on another of the drives (neither of the two that are part of the replacing vdev, and not the one that was removed previously). I didn't want it to be marked as faulty, so I kept the zpool clear running. Then .. power failure.

Attempt #4: I imported in 2009.06. This time, no errors detected at all. Is that a result of my zpool clear? Would that clear any 'permanent' errors? From the wording, I'd say it wouldn't, and therefore the action of starting the resilver again with all of the correct disks in place hasn't found any errors so far ... ?
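The zpool clear loop mentioned in attempt #3 could be sketched roughly as below. This is only an illustration of the idea, not the exact loop that was run: the function name clear_loop, its three arguments, and the ZPOOL_CMD override are hypothetical, and the pool name zp is assumed from the dataset path in the error listing.

```shell
#!/bin/sh
# Sketch only: repeatedly reset a pool's error counters while a resilver runs.
# clear_loop, its arguments and ZPOOL_CMD are hypothetical names for illustration.
ZPOOL_CMD="${ZPOOL_CMD:-zpool}"   # override to try the loop without touching a pool

clear_loop() {
    pool="$1"       # pool name (assumed: zp)
    interval="$2"   # seconds to wait between clears
    count="$3"      # how many times to run zpool clear
    i=0
    while [ "$i" -lt "$count" ]; do
        # 'zpool clear <pool>' resets the error counts on the pool's devices
        "$ZPOOL_CMD" clear "$pool" || return 1
        i=$((i + 1))
        # wait before the next pass, but not after the final one
        [ "$i" -lt "$count" ] && sleep "$interval"
    done
    return 0
}
```

Something like `clear_loop zp 60 1440` would then clear the counters once a minute for roughly a day. Whether clearing the counters also affects the 'permanent' error list is exactly the open question above, so the sketch makes no assumption either way.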
Then, disk removal again ... :-(

Attempt #5: I'm convinced that the drive removal is down to faulty cabling. I moved the machine, completely disconnected all drives, re-wired all connections with new cables, and started the scrub again in 2009.06. Now there are checksum errors again, so I'm running zpool clear in order to keep drives from being marked as faulted .. but I also have this:

errors: Permanent errors have been detected in the following files:

        zp/iscsi/meerkat_t...@20090905_1631:<0x1>

I have a few of my usual VMs powered up (ESXi connecting using NFS), and they appear to be fine. I've run a chkdsk in the Windows VMs, and no errors are reported, although I can't be 100% sure whether any of those files were in the original list of 700+ errors. In the absence of iscsitgtd, I'm not powering up the ones that rely on iSCSI just yet.

My next steps will be:

1. allow the resilver to finish. Assuming I don't have yet another power cut, this will be in about 24 hours.
2. zpool export
3. reboot into SXCE
4. zpool import
5. start all my usual virtual machines on the ESXi host
6. note whether that permanent error is still there <-- this will be an interesting one for me - will the export & import clear the error? Will my looped zpool clear have simply reset the checksum counters to zero, or will it have cleared this too?
7. zpool scrub to see what else turns up.

Chris
-- 
This message posted from opensolaris.org
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss