Hey,

On Sat, Oct 31, 2009 at 5:03 PM, Victor Latushkin
<victor.latush...@sun.com> wrote:
> Donald Murray, P.Eng. wrote:
>>
>> Hi,
>>
>> I've got an OpenSolaris 2009.06 box that will reliably panic whenever
>> I try to import one of my pools. What's the best practice for
>> recovering (before I resort to nuking the pool and restoring from
>> backup)?
>
> Could you please post panic stack backtrace?
>
>> There are two pools on the system: rpool and tank. The rpool seems to
>> be fine, since I can boot from a 2009.06 CD and 'zpool import -f
>> rpool'; I can also 'zpool scrub rpool', and it doesn't find any errors.
>> Hooray! Except I don't care about rpool. :-(
>>
>> If I boot from hard disk, the system begins importing zfs pools; once
>> it's imported everything, I usually have enough time to log in before
>> it panics. If I boot from CD and 'zpool import -f tank', it panics.
>>
>> I've just started a 'zdb -e tank' which I found on the intertubes
>> here: http://opensolaris.org/jive/thread.jspa?threadID=49020. Zdb
>> seems to be ... doing something. Not sure _what_ it's doing, but it
>> can't be making things worse for me, right?
>
> Yes, zdb only reads, so it cannot make things worse.
>
>> I'm going to try adding the following to /etc/system, as mentioned
>> here: http://opensolaris.org/jive/thread.jspa?threadID=114906
>> set zfs:zfs_recover=1
>> set aok=1
>
> Please do not rush with these settings. Let's look at the stack backtrace
> first.
>
> Regards,
> Victor
>


I think I've found the cause of my problem. I disconnected one side of
each mirror, rebooted, and imported. The system didn't panic! So one
of the disconnected drives (or cables, or controllers...) was the culprit.

I've since narrowed it down to a single 500GB drive. When that drive is
connected, a zpool import panics the system. When that drive is disconnected,
the pool imports fine.

r...@weyl:~# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: resilver completed after 0h8m with 0 errors on Sun Nov  1 22:11:15 2009
config:

        NAME                     STATE     READ WRITE CKSUM
        tank                     DEGRADED     0     0     0
          mirror                 DEGRADED     0     0     0
            7508645614192559694  FAULTED      0     0     0  was /dev/dsk/c7t0d0s0
            c6t1d0               ONLINE       0     0     0
          mirror                 ONLINE       0     0     0
            c5t1d0               ONLINE       0     0     6  21.2G resilvered
            c7t0d0               ONLINE       0     0     0

errors: No known data errors
r...@weyl:~#

The first thing that's jumping out at me: why does the first mirror think
the missing disk was c7t0d0? I have an old zpool status from before the
problem began, and that disk used to be c6t0d0.

r...@weyl:~# zpool status tank
  pool: tank
 state: ONLINE
 scrub: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank        ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c6t0d0  ONLINE       0     0     0
            c6t1d0  ONLINE       0     0     0
          mirror    ONLINE       0     0     0
            c5t1d0  ONLINE       0     0     0
            c7t0d0  ONLINE       0     0     0

errors: No known data errors
r...@weyl:~#
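
Once that 500GB drive (or its cable, or controller port) has been swapped
out, I'm expecting the repair itself to be a single 'zpool replace' against
the faulted GUID, something like the sketch below. The c6t0d0 target is only
my guess at where the new disk will show up; I haven't run this yet.

# zpool replace tank 7508645614192559694 c6t0d0
# zpool status tank     (watch the resilver; the mirror should go back ONLINE)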


Victor has been very helpful, living up to his reputation. Thanks Victor!

If we determine a root cause, I'll update the list.

Things I've learned along the way:
- pools import automatically based on cached information in
/etc/zfs/zpool.cache; if you move zpool.cache elsewhere, none of the
pools will import upon rebooting;
- import problematic pools via 'zpool import -f -R /a <poolname>'; this
doesn't update the cachefile, and mounts the pool on /a (example session
after this list);
- adding the following to /etc/system didn't prevent a hardware-induced panic:
set zfs:zfs_recover=1
set aok=1
- crash dumps are typically saved in /var/crash/$( uname -n ); the same
session below shows pulling a stack backtrace out of one with mdb;
- beadm is your friend;
- redundancy is your friend (okay, I already knew that);
- if you have a zfs problem, you want Victor Latushkin to be your friend.
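
To put the cachefile and crash-dump points into practice, a session looks
roughly like this (the dump index 0 is simply whatever savecore left on the
box; adjust to match yours):

# zpool import -f -R /a tank      (alternate root: mounts under /a and
                                   leaves /etc/zfs/zpool.cache untouched)
# cd /var/crash/weyl
# savecore -f vmdump.0            (only needed if the dump is still compressed)
# mdb unix.0 vmcore.0
> ::status                        (panic string and dump summary)
> ::stack                         (the stack backtrace Victor asked for)
> $q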

Cheers!