There are a lot of hits for this error on Google, but I've had trouble
identifying any that resemble my situation.  I apologize if you've
answered it before.  If it's better for me to open a case with Sun
Support, I can do that, but I'm hoping to cheat my way around the system
so that I don't have to send somebody Explorer output before they
escalate it.  Seems more efficient in the long run. :)

Most of my tale of woe is background:

I have a pool running under Solaris 10 5/08.  It's an 8-member raidz2
whose volumes are on a 2540 array with two controllers.  Volumes are
mapped 1:1 with physical disks.  I didn't really want a 2540, but I
couldn't get anyone to swear to me that any other Fibre Channel product
would work with Solaris.  I'm using Fibre Channel multipathing.

I've had two disk failures in the past two weeks.  Last week I replaced
the first.  No problems with ZFS initially; a 'zpool replace' did the
right thing.  Yesterday I replaced the second.  But while investigating
the problem I noticed that two of my paths had gone down, so that 6
disks had both paths attached, and 2 disks had only one path.
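
(For reference, the per-LUN path counts can be checked with mpathadm
against the scsi_vhci device names; the LUN below is just one of the
pool members, with s2 assumed for the whole LUN:

# mpathadm list lu
# mpathadm show lu /dev/rdsk/c6t600A0B800049F9E10000030548B3DF1Ed0s2

'list lu' prints total vs. operational path counts for each LUN, and
'show lu' gives the per-path detail.)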

At this time, 'zpool status' showed:
  pool: z
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: resilver completed with 0 errors on Fri Oct 24 20:04:51 2008
config:

        NAME                                         STATE     READ WRITE CKSUM
        z                                            DEGRADED     0     0     0
          raidz2                                     DEGRADED     0     0     0
            c6t600A0B800049F9E10000030548B3DF1Ed0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000030848B3DF52d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000030B48B3DF7Ed0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000030E48B3DFA6d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000031148B3DFD2d0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000031448B3DFFAd0s0  ONLINE       0     0     0
            c6t600A0B800049F9E10000031748B3E020d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E10000031A48B3E04Cd0s0  ONLINE       0     0     0


(At the time I hadn't figured it out, but I now believe that disk was
UNAVAIL because it hadn't been properly partitioned yet, so s0 was
undefined.)
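
(For what it's worth, copying the partition table from a healthy member
would be one way to fix that, assuming the LUNs carry SMI/VTOC labels
rather than EFI; with EFI labels format(1M) would be needed instead:

# prtvtoc /dev/rdsk/c6t600A0B800049F9E10000031A48B3E04Cd0s2 \
      | fmthard -s - /dev/rdsk/c6t600A0B800049F9E10000031748B3E020d0s2

That gives the new LUN the same slice layout, so s0 exists again.)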

Solaris 10's multipath support seems so far to be fairly intolerant of
reconfiguration without a reboot, and I wasn't ready to reboot yet, so I
thought I'd try resetting the controller that had lost its paths to two
of the disks.  But for some reason the CAM software appears to have
reset both controllers simultaneously.  The whole pool went into an
error state, and all disks became unavailable.  Very annoying, but not a
problem for zfs-discuss.


At this time, 'zpool status' showed:
  pool: z
 state: FAULTED
status: One or more devices could not be opened.  There are insufficient
        replicas for the pool to continue functioning.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-D3
 scrub: none requested
config:

        NAME                                         STATE     READ WRITE CKSUM
        z                                            FAULTED      0     0     0  corrupted data
          raidz2                                     DEGRADED     0     0     0
            c6t600A0B800049F9E10000030548B3DF1Ed0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000030848B3DF52d0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000030B48B3DF7Ed0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000030E48B3DFA6d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E10000031148B3DFD2d0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000031448B3DFFAd0s0  UNAVAIL      0     0     0  corrupted data
            c6t600A0B800049F9E10000031748B3E020d0s0  UNAVAIL      0     0     0  cannot open
            c6t600A0B800049F9E10000031A48B3E04Cd0s0  UNAVAIL      0     0     0  corrupted data


I don't know whether there's any chance of recovering this, but I
wanted to try.  I reset the 2540 again, but still no communication with
Solaris.  I rebooted the server, and communications resumed.  I had to
do some further repair/reconfig on the 2540 for the two disks marked
'cannot open', but it was a minor issue and worked fine.  Solaris was
then able to see all my disks.
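
(A quick way to confirm that, for anyone following along: running format
with its input redirected just prints the disk list and exits, so

# echo | format

should list all of the array LUNs under their scsi_vhci names along with
the local disks.)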


Now we come to the main point.

I still hadn't figured out the partitioning problem on ....E020d0s0.
It didn't occur to me because I believed it to be a spare disk I had
already partitioned, and let's face it, "I/O error" can mean anything.
I was wrong, though: I had already used the spare the previous week,
and this one was unformatted.  Lacking any other ideas, I tried
exporting and then importing the pool.  The export completed without
complaint, and after it I corrected the partitioning on the replacement
disk.  But now when I try to import:

# zpool import
  pool: z
    id: 1372922273220982501
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
        The pool may be active on another system, but can be imported using
        the '-f' flag.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        z                                            FAULTED   corrupted data
          raidz2                                     ONLINE
            c6t600A0B800049F9E10000030548B3DF1Ed0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000030848B3DF52d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000030B48B3DF7Ed0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000030E48B3DFA6d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031148B3DFD2d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031448B3DFFAd0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031748B3E020d0s0  UNAVAIL   corrupted data
            c6t600A0B800049F9E10000031A48B3E04Cd0s0  UNAVAIL   corrupted data


# zpool import z
cannot import 'z': pool may be in use from other system
use '-f' to import anyway


# zpool import -f z
cannot import 'z': one or more devices is currently unavailable


'zdb -l' shows four valid labels for each of these disks except for the
new one.  Is this what "unavailable" means, in this case?
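
(To be specific, the pool members are the s0 slices, so that's what I
ran it against, e.g.:

# zdb -l /dev/rdsk/c6t600A0B800049F9E10000030548B3DF1Ed0s0

which dumps labels 0 through 3 for that vdev.)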

Growing adventurous, or perhaps just desperate, I used 'dd' to copy the
first label from one of the good disks to the new one which lacked any
labels.  Then I binary-edited the label to patch in the correct guid for
that disk.  (I got the correct guid from the zdb -l output.)  I still
get the same results from zpool import, though.  Is this because I need
to patch in three more copies of the label?  I'm not sure how (or more
correctly, where) to do that.
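
My understanding of the on-disk layout, in case it helps frame the
question: each vdev keeps four 256K labels, two at the front of the
device (offsets 0 and 256K) and two at (roughly) 512K and 256K before
the end of the slice.  So patching in the other copies would presumably
look something like the following, where label.bin is my edited 256K
label and SECTORS is the size of the s0 slice in 512-byte blocks (this
is just a sketch; I haven't run it):

# dd if=label.bin of=/dev/rdsk/c6t600A0B800049F9E10000031748B3E020d0s0 bs=512 oseek=512
# dd if=label.bin of=/dev/rdsk/c6t600A0B800049F9E10000031748B3E020d0s0 bs=512 oseek=<SECTORS-1024>
# dd if=label.bin of=/dev/rdsk/c6t600A0B800049F9E10000031748B3E020d0s0 bs=512 oseek=<SECTORS-512>

I also gather that the label's nvlist region carries an embedded
checksum, so a hand-edited copy may not validate even with all four in
place, but I'd be happy to be told otherwise.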

Is this a lost cause?  Anyone have any suggestions?  Is there a nice
tool for writing zdb -l output as a new label to a new disk?  Why did
zpool export a pool that it can't import?  This is an experimental
development system, but I'd still like to recover the data if possible.


It may or may not be relevant that I have another exported pool, a
relic of old experiments, which believes it uses the same device
(...E020d0s0), the one that had no zfs label.  But I had that pool
before all of yesterday's trouble, too.  If there's a way to destroy an
exported pool I'm fine with doing so, but it doesn't seem like this is
part of today's problem.
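
(If it matters, I assume the way to get rid of it would be to import it
by name and then destroy it, something like

# zpool import <relicpool>
# zpool destroy <relicpool>

with the real pool name filled in, but since it claims a device that z
also uses I've left it alone for now.)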

Thanks for any ideas.

-- 
 -D.    [EMAIL PROTECTED]    NSIT    University of Chicago