Re: [zfs-discuss] zpool getting in a stuck state?

Cindy Swearingen Tue, 27 Oct 2009 13:32:44 -0700

Jeremy,

I can't comment on your hardware because I'm not familiar with it.


If you have a storage pool with ZFS redundancy and one device fails
or begins failing, the pool keeps running in a degraded mode and
generally remains available.

You can try setting the pool's failmode property to continue, which
allows reads to continue in case of a device failure and might prevent
the pool from hanging.
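
As a rough sketch of checking and changing that property, assuming the
pool is named raid3155 as in the ereport quoted below
(set_failmode_continue is only an illustrative helper name, not a ZFS
command):

```shell
# Hypothetical helper: show the current failmode, switch it to "continue",
# then show it again to confirm the change.
# Valid failmode values are wait (the default), continue, and panic.
set_failmode_continue() {
    pool="$1"
    zpool get failmode "$pool"
    zpool set failmode=continue "$pool"
    zpool get failmode "$pool"
}

# Example (run as root on the affected host):
#   set_failmode_continue raid3155
```

With failmode=continue, new writes to a failed pool return EIO instead
of blocking, while reads from the remaining healthy devices can still
succeed.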

If offlining the disk or replacing the disk doesn't help, let us know.
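
For reference, offlining and replacing a disk could look roughly like
the following. The helper names are made up for illustration, and the
pool and disk names (raid3155, c6t5d0) are taken from the quoted
message. Depending on the controller, the physical swap may need extra
steps that are not shown here.

```shell
# Hypothetical helpers wrapping the standard zpool subcommands.

# Take a suspect disk out of service; the pool stays up on its redundancy.
offline_disk() {
    zpool offline "$1" "$2"
}

# After the new disk is physically in the same slot, start the resilver
# and print the pool status so progress can be checked.
replace_disk() {
    zpool replace "$1" "$2"
    zpool status "$1"
}

# Example:
#   offline_disk raid3155 c6t5d0
#   (swap the drive)
#   replace_disk raid3155 c6t5d0
```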

Cindy

On 10/27/09 13:13, Jeremy Kitchen wrote:

Jeremy Kitchen wrote:

Cindy Swearingen wrote:

Jeremy,

I generally suspect device failures in this case. If possible,
review the contents of /var/adm/messages and the output of fmdump -eV
to see whether the pool hang can be attributed to failed or failing
devices.

perusing /var/adm/messages, I see:

Oct 22 05:06:11 homiebackup10 scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:11 homiebackup10   Log info 0x31080000 received for target 5.
Oct 22 05:06:11 homiebackup10   scsi_status=0x0, ioc_status=0x804b,
scsi_state=0x0
Oct 22 05:06:19 homiebackup10 scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:19 homiebackup10   Log info 0x31080000 received for target 5.
Oct 22 05:06:19 homiebackup10   scsi_status=0x0, ioc_status=0x804b,
scsi_state=0x1
Oct 22 05:06:19 homiebackup10 scsi: [ID 365881 kern.info]
/p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:19 homiebackup10   Log info 0x31080000 received for target 5.
Oct 22 05:06:19 homiebackup10   scsi_status=0x0, ioc_status=0x804b,
scsi_state=0x0

lots of messages like that just prior to rsync warnings:

Oct 22 05:55:29 homiebackup10 rsyncd[29746]: [ID 702911 daemon.warning]
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
Oct 22 05:55:29 homiebackup10 rsyncd[29746]: [ID 702911 daemon.warning]
rsync error: error in rsync protocol data stream (code 12) at io.c(453)
[receiver=2.6.9]
Oct 22 06:10:29 homiebackup10 rsyncd[178]: [ID 702911 daemon.warning]
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
Oct 22 06:10:29 homiebackup10 rsyncd[178]: [ID 702911 daemon.warning]
rsync error: error in rsync protocol data stream (code 12) at io.c(453)
[receiver=2.6.9]
Oct 22 06:25:27 homiebackup10 rsyncd[776]: [ID 702911 daemon.warning]
rsync: connection unexpectedly closed (0 bytes received so far) [receiver]

I think the rsync warnings are indicative of the pool being hung.  So it
would seem that the bus is freaking out and then the pool dies and
that's that?  The strange thing is that this machine is way underloaded
compared to another one we have (which has 5 shelves, so ~150TB of
storage attached) which hasn't really had any problems like this.  We
had issues with that one when rebuilding drives, but it's been pretty
stable since.

looking at fmdump -eV, I see lots and lots of these:

Oct 24 2009 05:02:54.098815545 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x882108543f200401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /p...@0,0/pci8086,4...@5/pci1000,3...@0/s...@30,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x51 0x9c 0xa5 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x4
        pkt-state = 0x0
        pkt-stats = 0x10
        __ttl = 0x1
        __tod = 0x4ae2ecee 0x5e3ce39


so doing some more reading here on the list and mucking about a bit
more, I've come across this in the fmdump log:

Oct 22 2009 05:03:56.687818542 ereport.fs.zfs.io
nvlist version: 0
        class = ereport.fs.zfs.io
        ena = 0x99eb889c6fe00001
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = zfs
                pool = 0x90ed10dfd0191c3b
                vdev = 0xf41193d6d1deedc2
        (end detector)

        pool = raid3155
        pool_guid = 0x90ed10dfd0191c3b
        pool_context = 0
        pool_failmode = wait
        vdev_guid = 0xf41193d6d1deedc2
        vdev_type = disk
        vdev_path = /dev/dsk/c6t5d0s0
        vdev_devid = id1,s...@n5000c50010a7666b/a
        parent_guid = 0xcbaa8ea60a3c133
        parent_type = raidz
        zio_err = 5
        zio_offset = 0xab2901da00
        zio_size = 0x200
        zio_objset = 0x4b
        zio_object = 0xa26ef4
        zio_level = 0
        zio_blkid = 0xf
        __ttl = 0x1
        __tod = 0x4ae04a2c 0x28ff472e
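
When the log holds many ereports like these, a per-device tally can
make a failing disk stand out. The following is a minimal sketch, not
an official tool: count_ereports_by_device is a hypothetical helper,
and it assumes the "name = value" layout shown above and a POSIX awk
(nawk on Solaris).

```shell
# Hypothetical helper: read `fmdump -eV` output on stdin and count how
# many ereports name each device, keyed on the vdev_path (ZFS ereports)
# or device-path (SCSI transport ereports) field.
count_ereports_by_device() {
    awk '$1 == "vdev_path" || $1 == "device-path" { n[$3]++ }
         END { for (d in n) print n[d], d }' | sort -rn
}

# Example (on the affected host):
#   fmdump -eV | count_ereports_by_device
```

The output is one line per device, most frequently reported first.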


c6t5d0 is in the problem pool (raid3155) so I've gone ahead and offlined
the drive and will be replacing it shortly.  Hopefully that will take
care of the problem!

If this doesn't solve the problem, do you have any suggestions on what
more I can look at to try to figure out what's wrong?  Is there some
sort of setting I can set which will prevent the zpool from hanging up
the entire system in the event of a single drive failure like this?
It's really annoying to not be able to log into the machine (and having
to forcefully reboot the machine) when this happens.

Thanks again for your help!

-Jeremy

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
