[zfs-discuss] RAID-Z1 pool became faulted when a disk was removed.

Rince Wed, 01 Nov 2006 02:04:50 -0800

So I have attached to my system two 7-disk SCSI arrays, each of 18.2 GB disks.

Each of them is a RAID-Z1 zpool.

I had a disk I thought was a dud, so I pulled the fifth disk in my array and put the dud in. Sure enough, Solaris started spitting errors like there was no tomorrow in dmesg, and wouldn't use the disk. Ah well. Remove it, put the original back in - hey, Solaris still thinks the disk is offline, and cfgadm -c unconfigure [disk]; cfgadm -c configure [disk] didn't help. Okay, sane poweroff. Hey, this is going to take a while to rescrub, so why not switch to the wide SCSI module for this disk array rather than the narrow one? Okay, fine, put the module in (this module is known working and was, in fact, pulled from the other array).

I notice it takes nigh-forever to come back up, and I'm wondering why - it literally took over 5 minutes to give me console login. Commands took at least 5s between being typed and appearing in console - it was obvious something insane was going on. Load average was 6.33, and fmd was taking most of the CPU. zpool status took about 10 minutes to tell me that it thought c2t2d0 was missing and that c2t4d0 was corrupt, thereby screwing me.

Wait, what. I didn't touch that disk, what's going on here.

I try to convince ZFS that the disk is there and usable via zpool online moonside c2t2d0, but it just claims the pool is inaccessible (great, thanks ZFS). I figure it has to be the module swap that's confusing it, so I poweroff and switch back. Power back on...nope, still screwed the same way.

I try destroying the pool and importing it, but it "just" tells me the pool is corrupted because c2t4d0 has corrupt metadata.

  pool: moonside
    id: 8290331144559232496
 state: FAULTED
status: One or more devices contains corrupted data.
action: The pool cannot be imported due to damaged devices or data.
   see: http://www.sun.com/msg/ZFS-8000-5E
config:

        moonside    FAULTED   corrupted data
          raidz1    FAULTED   corrupted data
            c2t0d0 ONLINE
            c2t1d0 ONLINE
            c2t2d0 ONLINE
            c2t3d0 ONLINE
            c2t4d0 FAULTED   corrupted data
            c2t5d0 ONLINE
            c2t6d0 ONLINE

Thanks, ZFS. One disk (at most, one disk and attempting to use a different SCSI connector) blew up my RAID-Z1. That's...wonderful.

I try rebooting to see if it becomes less confused...

  pool: moonside
    id: 8290331144559232496
 state: FAULTED
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-3C
config:

        moonside    FAULTED   corrupted data
          raidz1    DEGRADED
            c2t0d0 ONLINE
            c2t1d0 ONLINE
            c2t2d0 ONLINE
            c2t3d0 ONLINE
            c2t4d0 UNAVAIL   cannot open
            c2t5d0 ONLINE
            c2t6d0 ONLINE

Uh, what. So the pool is "degraded", but the state is "faulted" because it has corrupted data somewhere that it can't tell me about? Screw this, force import.

# zpool import -f moonside
cannot import 'moonside': I/O error

...what!? I don't even know what that error means in this context, maybe my buddy dmesg does.

# dmesg | tail
Nov 1 03:28:02 maou scsi: [ID 193665 kern.info] sd2 at adp0: target 2 lun 0
Nov 1 03:28:02 maou genunix: [ID 936769 kern.info] sd2 is /[EMAIL PROTECTED],0/pci9004,[EMAIL PROTECTED]/[EMAIL PROTECTED],0
Nov 1 03:28:06 maou genunix: [ID 773945 kern.info]     UltraDMA mode 2 selected
Nov 1 03:28:31 maou last message repeated 7 times
Nov 1 03:28:52 maou genunix: [ID 408114 kern.info] /[EMAIL PROTECTED],0/pci9004,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd4) offline
Nov 1 03:28:55 maou genunix: [ID 773945 kern.info]     UltraDMA mode 2 selected
Nov 1 03:28:55 maou last message repeated 3 times
Nov 1 03:29:06 maou scsi: [ID 193665 kern.info] sd4 at adp0: target 4 lun 0
Nov 1 03:29:06 maou genunix: [ID 936769 kern.info] sd4 is /[EMAIL PROTECTED],0/pci9004,[EMAIL PROTECTED]/[EMAIL PROTECTED],0
Nov 1 03:29:06 maou genunix: [ID 408114 kern.info] /[EMAIL PROTECTED],0/pci9004,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd4) online

Nope, dmesg doesn't know either. Uh, what?
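
For reference, the only state changes in that excerpt are sd4 (target 4, so presumably c2t4d0) going offline and coming back online about 14 seconds later. A minimal Python sketch for pulling such transitions out of a longer log, assuming the "(sdN) offline" / "(sdN) online" line endings seen above; the function name and the shortened sample text are illustrative only, not anything Solaris ships:

```python
import re

# Matches the "(sdN) offline" / "(sdN) online" suffix seen in the excerpt.
# Assumption: other log formats are simply ignored.
EVENT = re.compile(r"\((sd\d+)\) (offline|online)\s*$")


def transitions(log_text):
    """Return a list of (timestamp, device, new_state) tuples, in log order."""
    events = []
    for line in log_text.splitlines():
        match = EVENT.search(line)
        if match:
            # The first three whitespace-separated fields are e.g. "Nov 1 03:28:52".
            stamp = " ".join(line.split()[:3])
            events.append((stamp, match.group(1), match.group(2)))
    return events


# Three lines copied from the dmesg excerpt (device paths left as archived).
SAMPLE = """\
Nov 1 03:28:02 maou scsi: [ID 193665 kern.info] sd2 at adp0: target 2 lun 0
Nov 1 03:28:52 maou genunix: [ID 408114 kern.info] /[EMAIL PROTECTED],0/pci9004,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd4) offline
Nov 1 03:29:06 maou genunix: [ID 408114 kern.info] /[EMAIL PROTECTED],0/pci9004,[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd4) online
"""

if __name__ == "__main__":
    for event in transitions(SAMPLE):
        print(event)
```

Nothing in the excerpt mentions sd2 changing state, which fits the complaint that dmesg shows no I/O errors to go with the import failure.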

Reboots fix everything. Reboot...

Now it's just really confused.

# zpool import -f moonside
cannot import 'moonside': one or more devices is currently unavailable

Can it not make up its mind? Does it want the missing seventh device to save it from the mean old corruption on that seventh device? What's with the claimed I/O errors that don't show up in dmesg?

  pool: moonside
    id: 8290331144559232496
 state: FAULTED
status: One or more devices are missing from the system.
action: The pool cannot be imported. Attach the missing
        devices and try again.
   see: http://www.sun.com/msg/ZFS-8000-3C
config:

        moonside    UNAVAIL   insufficient replicas
          raidz1    UNAVAIL   insufficient replicas
            c2t0d0 ONLINE
            c2t1d0 ONLINE
            c2t2d0 FAULTED   corrupted data
            c2t3d0 ONLINE
            c2t4d0 UNAVAIL   cannot open
            c2t5d0 ONLINE
            c2t6d0 ONLINE

Oh wow, that's really special. I'm not sure what's going on at this point. I swear there's no way I could have touched c2t2d0 by accident - this array is really sturdy and requires moderate physical effort to remove a disk from.
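
That last listing is at least self-consistent: RAID-Z1 carries single parity, so the vdev can lose one disk and no more, and the listing shows two disks that are not ONLINE (c2t2d0 and c2t4d0). A rough sketch of that check in Python, assuming a single-vdev config section laid out like the ones above; `summarize` and the `PARITY` table are hypothetical helpers, not part of any ZFS tool. It does not explain the earlier listings, where only c2t4d0 was bad and the pool was still FAULTED.

```python
import re

# How many leaf devices a vdev type can lose and still be readable.
# Assumption: only raidz1 (single parity) matters for this pool.
PARITY = {"raidz1": 1}

# Solaris-style disk names such as c2t4d0.
DISK = re.compile(r"c\d+t\d+d\d+")


def summarize(config_text):
    """Return (vdev_type, bad_disks, survivable) for a one-vdev config listing."""
    vdev_type = None
    bad_disks = []
    for line in config_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        name, state = fields[0], fields[1]
        if name in PARITY:
            vdev_type = name
        elif DISK.fullmatch(name) and state != "ONLINE":
            bad_disks.append(name)
    survivable = vdev_type is not None and len(bad_disks) <= PARITY[vdev_type]
    return vdev_type, bad_disks, survivable


# The config section from the last listing.
SAMPLE = """\
        moonside    UNAVAIL   insufficient replicas
          raidz1    UNAVAIL   insufficient replicas
            c2t0d0 ONLINE
            c2t1d0 ONLINE
            c2t2d0 FAULTED   corrupted data
            c2t3d0 ONLINE
            c2t4d0 UNAVAIL   cannot open
            c2t5d0 ONLINE
            c2t6d0 ONLINE
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

With c2t2d0 back to ONLINE the same check reports the vdev as survivable, which is what the DEGRADED raidz1 line in the second listing suggests.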

Is this behavior "expected", or is this a bug? Furthermore, should I ever expect to be able to see my precious data again?

snv b44, Pentium III 550.

- Rich

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
