> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Chris Murray
>
> I have some strange goings-on with my VM of Solaris Express 11, and I
> hope someone can help.
>
> It shares out other virtual machine files for use in ESXi 4.0 (it,
> too, runs in there)
The first thing I notice is that you're running Solaris 11 Express inside
ESXi 4.0. This is an unsupported configuration, and in my personal
experience, running an unsupported OS as either the host or the guest of
any virtualization product (VMware or otherwise) ends in random errors and
general instability. Maybe that's not the problem in your case, but I
would certainly consider it suspect.

> So, it seems I
> want to go from one single disk (c8t1d0), to a mirror of c8t4d0 and
> c8t5d0. In my mind, that's a 'zpool replace' onto c8t4d0 and a 'zpool
> attach' of c8t5d0. I kick off the replace, and all goes fine. Part way
> through I try to do the attach as well, but am politely told I can't.

This also might not be the cause of your problem, but you should probably
have done the attach first, waited for it to complete, run a scrub for
good measure, and only then done the replace (I've sketched the commands
further down). In fact, I'm surprised it's even POSSIBLE to do a replace
on a pool that has only one disk; I didn't know you could do that until
now.

> The replace itself completed without complaint, however on completion,
> virtual machines whose disks are inside 'vmpool' start hanging,
> checksum errors rapidly start counting up, and since there's no
> redundancy, nothing can be done to repair them.

The replace wrote to the new disk without reading anything back from it,
so there was no way for it to detect checksum errors while the replace was
running. Once the replace completed, the pool started reading the data
that had just been written, and on read it discovered the checksum
mismatches. You will need to go back to your original disk, as it was
before the replace.

> #1 - are there any suspicions as to what's happened here? How come
> the resilver completed fine but now there are checksum errors on the
> replacement disk? It does reside on the same physical disk, after all.
> Could this be something to do with me attempting the attach during the
> replace?

Even though the new vmdk lives on the same physical disk, the fact remains
that the new vmdk is reporting checksum errors. You should consider the
possibilities that (a) the disk really is experiencing hardware failure,
in which case back up as soon as possible, or (b) you're running into a
glitch in the unsupported virtual hardware, as I hinted above.

> #2 - in my mind, c8t1d0 contains the state of the pool just prior
> to the cutover to c8t4d0. Is there any way I can get this back, and
> scrap the contents of c8t4d0? A 'zpool import -D' is fruitless, but I
> imagine there's some way of tricking Solaris into seeing c8t1d0 this
> as a single disk pool again?

Good question. I'm not sure precisely what "zpool replace" leaves behind
on the old disk; maybe somebody else can answer this. After a "zpool
replace", is it still possible to move the old disk to a new system and
import the pool from it, or is the pool permanently removed from the old
disk? I would boot into command-line mode from the Solaris CD with only
that one disk attached, and then try "zpool import". If that works, you
know you have something. If it doesn't, I don't know what to tell you; you
might be hosed.

> Now that I've snapshotted the VM and have a sort of safety net, I run
> a scrub, which unsurprisingly unearths checksum errors and lists all
> of the files which have problems:

Because we're talking about ZFS, you should probably specify that you
created a VMware snapshot of the machine. You're not talking about a ZFS
snapshot, are you?
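Getting back to the attach-before-replace point from earlier, this is
roughly the sequence I had in mind for turning the single-disk pool into a
mirror without ever abandoning the original data. I'm only assuming the
pool and device names from your post; check zpool(1M) before running any
of it:

  # attach the first new disk, turning c8t1d0 into one side of a mirror
  zpool attach vmpool c8t1d0 c8t4d0

  # wait for the resilver to finish before doing anything else
  zpool status vmpool

  # attach the second new disk to make a three-way mirror
  zpool attach vmpool c8t4d0 c8t5d0

  # scrub for good measure, then drop the original disk from the mirror
  zpool scrub vmpool
  zpool detach vmpool c8t1d0

At every step in that sequence the pool still has at least one complete,
verified copy of the data, which is the whole point of doing the attach
before (or instead of) the replace.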
> I 'zpool clear vmpool', power on one of the VMs, and the checksum
> count quickly reaches 970.
>
> #3 - why would this be the case? I thought the purpose of a scrub
> was to traverse all blocks, read them, and unearth problems? I'm
> wondering why these 970 errors haven't been found in the scrub?

Checksum errors are not correctable when you have no redundancy. That
means that although you ran the scrub and the clear, the problems have not
been fixed; they will come back as soon as those blocks are read again.

> I power off the VM, perform another scrub. This time, 94 errors:

It is strange that the number of errors is lower the second time around.
The one thing that's certain: you either have a hardware failure, or
something that looks like one (such as a bug caused by the unsupported
virtualization configuration).
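In case it helps to see it concretely, this is how I'd check what ZFS now
considers permanently damaged (pool name taken from your post; your output
will obviously differ):

  # list the files with unrecoverable (permanent) errors
  zpool status -v vmpool

  # this only zeroes the error counters, it does not repair data
  zpool clear vmpool

  # a scrub, or simply reading the affected files, will turn up the
  # same damaged blocks again
  zpool scrub vmpool
  zpool status -v vmpool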