Ok, the resilver has been restarted a number of times over the past few days 
due to two main issues - a drive disconnecting itself, and power failure. I 
think my troubles are 100% down to these environmental factors, but would like 
some confidence that after the resilver has completed, if it reports there 
aren't any persistent errors, that there actually aren't any.

Attempt #1: the resilver started after I initiated the replace on my SXCE105 
install. All was well until the box lost power. On starting back up, it hung 
while starting OpenSolaris - just after the line containing the system 
hostname. I've had this before when a scrub is in progress. My usual tactic is 
to boot with the 2009.06 live CD, import the pool, stop the scrub, export, 
reboot into SXCE105 again, and import. Of course, you can't stop a replace 
that's in progress, so the remaining attempts are in the 2009.06 live CD (build 
111b perhaps?)

Attempt #2: the resilver started on importing the pool in 2009.06. It was 
resilvering fine until one drive reported itself as offline. dmesg showed that 
the drive was 'gone'. I then noticed a lot of checksum errors at the pool 
level, and RAIDZ1 level, and a large number of 'permanent' errors. In a panic, 
thinking that the resilver was now doing more harm than good, I exported the 
pool and rebooted.

Attempt #3: I imported in 2009.06 again. This time, the drive that was 
disconnected last attempt was online again, and proceeded to resilver along 
with the original drive. There was only one permanent error - in a particular 
snapshot of a ZVOL I'm not too concerned about. This is the point that I wrote 
the original post, wondering if all of those 700+ errors reported the first 
time around weren't a problem any more. I have been running zpool clear in a 
loop because there were checksum errors on another of the drives (neither of 
the two that are part of the replacing vdev, and not the one that was removed 
previously). I didn't want it to be marked as faulty, so I kept the zpool clear 
running. Then .. power failure.
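
One drawback of the looped zpool clear is that it throws away the counters it resets. A rough sketch of a loop that saves the zpool status -v output before each clear is below. It assumes Python 3.7 or later is available (which may well not be true of the 2009.06 live CD) and that the pool is called zp, as in my dataset paths. The log file name, the 60 second interval and the function names are arbitrary choices for illustration.

```python
import subprocess
import time
from datetime import datetime, timezone

POOL = "zp"  # assumption: pool name as it appears in my dataset paths
LOG_PATH = "zpool-status-history.log"  # hypothetical file name


def run(cmd):
    """Run a command and return its stdout as text."""
    return subprocess.run(cmd, capture_output=True, text=True, check=True).stdout


def snapshot_then_clear(pool, log, runner=run, now=None):
    """Append the current 'zpool status -v' output to log, then clear the counters."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    status = runner(["zpool", "status", "-v", pool])
    log.write(f"=== {stamp} ===\n{status}\n")
    log.flush()  # keep the record on disk in case of another power cut
    runner(["zpool", "clear", pool])


def main(interval_seconds=60):
    """Loop forever: record the status, clear the counters, sleep."""
    with open(LOG_PATH, "a") as log:
        while True:
            snapshot_then_clear(POOL, log)
            time.sleep(interval_seconds)
```

Calling main() as root would start the loop. The log should ideally live somewhere other than the pool being resilvered, so that it survives whatever happens to the pool.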

Attempt #4: I imported in 2009.06. This time, no errors detected at all. Is 
that a result of my zpool clear? Would that clear any 'permanent' errors? From 
the wording, I'd say it wouldn't, and therefore the action of starting the 
resilver again with all of the correct disks in place hasn't found any errors 
so far ... ? Then, disk removal again ... :-(

Attempt #5: I'm convinced that the drive removal is down to faulty cabling. I 
moved the machine, completely disconnected all drives, re-wired all connections 
with new cables, and started the scrub again in 2009.06. Now, there are checksum 
errors again, so I'm running zpool clear in order to keep drives from being 
marked as faulted .. but I also have this:

errors: Permanent errors have been detected in the following files:
        zp/iscsi/meerkat_t...@20090905_1631:<0x1>

I have a few of my usual VMs powered up (ESXi connecting using NFS), and they 
appear to be fine. I've run a chkdsk in the Windows VMs, and no errors are 
reported, although I can't be 100% confident that any of those files were in 
the original list of 700+ errors. In the absence of iscsitgtd, I'm not powering 
up the ones that rely on iSCSI just yet.

My next steps will be:
1. allow the resilver to finish. Assuming I don't have yet another power cut, 
this will be in about 24 hours.
2. zpool export
3. reboot into SXCE
4. zpool import
5. start all my usual virtual machines on the ESXi host
6. note whether that permanent error is still there <-- this will be an 
interesting one for me - will the export & import clear the error? will my 
looped zpool clear have simply reset the checksum counters to zero, or will it 
have cleared this too?
7. zpool scrub to see what else turns up.
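
To answer the question in step 6 with something better than memory, the permanent-error list could be captured before the export and again after the import, then compared. The following is only a sketch: it assumes Python 3.7 or later, a pool named zp, and that the entries appear as indented lines under the "Permanent errors" header, as they do in my output above. The function names are made up for illustration.

```python
import subprocess

MARKER = "errors: Permanent errors have been detected in the following files:"


def permanent_errors(status_text):
    """Return the set of entries listed under the permanent-errors header."""
    entries = set()
    in_list = False
    for line in status_text.splitlines():
        if line.strip() == MARKER:
            in_list = True
        elif in_list and line.strip():
            if not line[0].isspace():
                break  # a non-indented line means the list has ended
            entries.add(line.strip())
    return entries


def current_errors(pool="zp"):
    """Run 'zpool status -v' and extract the permanent-error entries."""
    out = subprocess.run(
        ["zpool", "status", "-v", pool],
        capture_output=True, text=True, check=True,
    ).stdout
    return permanent_errors(out)


def compare(before, after):
    """Summarise which entries disappeared, appeared, or persisted."""
    return {
        "gone": sorted(before - after),
        "new": sorted(after - before),
        "still_there": sorted(before & after),
    }
```

The idea would be to store the result of current_errors() before step 2, call it again after step 4 and after the scrub in step 7, and pass each pair to compare(). An entry that moves to "gone" without a scrub in between would suggest that the export/import or the zpool clear dropped it, not that the data was repaired.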

Chris
-- 
This message posted from opensolaris.org