On Thu 14 Aug 2008 at 03:37PM, Evan Layton wrote:
> This error is coming from ZFS. Did you change out one of your disks in
> the mirror recently? If so you may want to run format on that disk and
> see if it has an EFI label on it. If it does you'll have to break the
> mirror and remove that disk from the mirror, re-label it and add it
> back into the mirror.
Evan, I would not recommend this procedure. Doing so will likely
(though not certainly) result in an unbootable system.
Yesterday I noticed that, by accident, I had an EFI labelled disk in my
root pool, and so set out to fix the issue. I did what you'd expect:
detached the device, re-ran fdisk on it, then repartitioned it with
format -e and put an SMI label on it.
The end result of my fiddling was a machine that would not boot
build 95. As I tried various remedies (running installgrub, booting
from the CD and massaging the pool, etc.), the problem got worse until
the system could not boot any of my BEs anymore.
Today I was lucky enough to have Lin, George and Erik from the ZFS team
all in my office helping me to debug this. They were awesome and we
quickly got to a root cause.
The heart of the problem is that /etc/zfs/zpool.cache in the boot
archive and the pool configuration stored in the disks themselves can
get out of sync with each other. That's bad, because when ZFS tries to
reconcile them at boot time, it will get upset and panic, thinking that
the pool is damaged. This can happen when you do a mirror attach or
detach because apparently disk GUIDs in the pool can change as the
pool topology changes and mirror vdevs come and go. We stepped
through the problem with KMDB and watched ZFS load up a healthy pool,
then shoot it down as broken due to this reconciliation problem.
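If you want to see the two copies of the configuration for yourself, zdb
can dump both. Here's an untested sketch; the pool name and device path
are placeholders, not from my machine, and it only prints the commands
rather than running them:

```shell
# Untested sketch: the two places the pool config lives.  "rpool" and
# the device path are assumptions -- substitute your own pool name and
# root disk slice.
pool=rpool
disk=/dev/rdsk/c1t0d0s0
# zdb -C reads the config cached in /etc/zfs/zpool.cache;
# zdb -l reads the label stored on the vdev itself.
printf 'zdb -C %s\nzdb -l %s\n' "$pool" "$disk"
```

If the vdev GUIDs printed by those two commands disagree, you are in
exactly the out-of-sync state described above.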
If you want to remove an EFI labelled disk from your root pool, my advice
to you would be to do the following. Note that I have not tested this
particular sequence, but I think it will work. Hah.
0) Backup your data and settings.
1) 'zpool detach' the EFI labelled disk from your pool. After you do this
YOU MUST NOT REBOOT. Your system is now in a fragile state.
2) Run 'zpool status' to ensure that your pool now has one disk.
3) Edit /etc/boot/solaris/filelist.ramdisk. Remove the only line in the
file:
etc/zfs/zpool.cache
4) Delete /platform/i86pc/boot_archive and /platform/i86pc/amd64/boot_archive
5) Run 'bootadm update-archive' -- This rebuilds the boot archive,
omitting the zpool.cache file.
It may also be necessary to run installgrub at this point. It probably
is, and it wouldn't hurt.
6) Reboot your system, to ensure that you have a working configuration.
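For the record, here is the whole sequence as a shell sketch. Again, I
have not tested it; the pool name (rpool) and disk names (c1t1d0s0 for
the EFI-labelled disk, c1t0d0s0 for the survivor) are placeholders, and
by default it only echoes what it would do -- set DRYRUN=0 to run it
for real:

```shell
# Untested sketch of steps 1-5 above.  rpool, c1t1d0s0 and c1t0d0s0
# are placeholder names; by default this only echoes the commands.
: "${DRYRUN:=1}"
run() { if [ "$DRYRUN" = 1 ]; then echo "$@"; else "$@"; fi; }

run zpool detach rpool c1t1d0s0   # 1) drop the EFI disk; do NOT reboot yet
run zpool status rpool            # 2) confirm the pool is down to one disk
# 3) remove the etc/zfs/zpool.cache line -- here by truncating the
#    file, since that is the only line in it
run cp /dev/null /etc/boot/solaris/filelist.ramdisk
# 4) delete the stale boot archives
run rm /platform/i86pc/boot_archive /platform/i86pc/amd64/boot_archive
run bootadm update-archive        # 5) rebuild, omitting zpool.cache
# probably necessary, and it wouldn't hurt:
run installgrub /boot/grub/stage1 /boot/grub/stage2 /dev/rdsk/c1t0d0s0
# 6) then reboot to verify you have a working configuration
```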
In Nevada, this is not an issue (George told me) because the boot archive
omits the zpool.cache file, so there's never any state to get out of sync.
I was left wondering why we populate /etc/boot/solaris/filelist.ramdisk
with "etc/zfs/zpool.cache". At a minimum, if we haven't already, we
should stop doing that as soon as possible.
I will be filing bugs to cover these issues tomorrow.
-dp
--
Daniel Price - Solaris Kernel Engineering - [EMAIL PROTECTED] - blogs.sun.com/dp
_______________________________________________
indiana-discuss mailing list
[email protected]
http://mail.opensolaris.org/mailman/listinfo/indiana-discuss