Hi Jeremy,
The ereport.io.scsi.cmd.disk.tran events describe connection
problems to the /p...@0,0/pci8086,4...@5/pci1000,3...@0/s...@30,0
device. I believe the .tran suffix stands for transport, i.e. the
command failed at the SCSI transport level rather than being
rejected by the disk itself.
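If you want a quick count of which ereport classes are firing,
something like this should work (the class is the last field of
each fmdump -e line):

  # fmdump -e | awk '{ print $NF }' | grep '^ereport' | sort | uniq -c | sort -n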
ZFS might be reporting problems with the device as well, but if the
zpool/zfs commands are hanging, it might be difficult to get that
confirmation. The zpool status command will report
device problems.
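For example (-x limits the output to pools with problems, and -v
lists any files with persistent errors):

  # zpool status -xv

assuming the command comes back at all while the pool is hung.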
When a device in a pool fails, I/O to the pool can block, though
reads might still succeed, depending on the pool's failmode setting.
See the failmode property description in zpool(1M).
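For example, using a placeholder pool name of tank:

  # zpool get failmode tank
  # zpool set failmode=continue tank

With the default failmode=wait, I/O blocks until the device
recovers; failmode=continue returns EIO to new writes but still
allows reads from the remaining healthy devices.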
Is this pool redundant? If so, you could attempt to offline this
device until it can be replaced. If you have another device
available, you might replace the suspect drive and see if that
resolves the pool hang.
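For example, with placeholder names (tank for the pool, c3t30d0 for
the suspect disk, c4t0d0 for a spare):

  # zpool offline tank c3t30d0
  # zpool replace tank c3t30d0 c4t0d0

Note that offline will refuse with "no valid replicas" if the pool
has no redundancy for that device.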
Cindy
On 10/27/09 12:04, Jeremy Kitchen wrote:
Cindy Swearingen wrote:
Jeremy,
I generally suspect device failures in this case and if possible,
review the contents of /var/adm/messages and fmdump -eV to see
if the pool hang could be attributed to failed or failing devices.
perusing /var/adm/messages, I see:
Oct 22 05:06:11 homiebackup10 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:11 homiebackup10 Log info 0x31080000 received for target 5.
Oct 22 05:06:11 homiebackup10 scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0
Oct 22 05:06:19 homiebackup10 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:19 homiebackup10 Log info 0x31080000 received for target 5.
Oct 22 05:06:19 homiebackup10 scsi_status=0x0, ioc_status=0x804b, scsi_state=0x1
Oct 22 05:06:19 homiebackup10 scsi: [ID 365881 kern.info] /p...@0,0/pci8086,4...@1/pci1000,3...@0 (mpt1):
Oct 22 05:06:19 homiebackup10 Log info 0x31080000 received for target 5.
Oct 22 05:06:19 homiebackup10 scsi_status=0x0, ioc_status=0x804b, scsi_state=0x0
lots of messages like that just prior to rsync warnings:
Oct 22 05:55:29 homiebackup10 rsyncd[29746]: [ID 702911 daemon.warning] rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
Oct 22 05:55:29 homiebackup10 rsyncd[29746]: [ID 702911 daemon.warning] rsync error: error in rsync protocol data stream (code 12) at io.c(453) [receiver=2.6.9]
Oct 22 06:10:29 homiebackup10 rsyncd[178]: [ID 702911 daemon.warning] rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
Oct 22 06:10:29 homiebackup10 rsyncd[178]: [ID 702911 daemon.warning] rsync error: error in rsync protocol data stream (code 12) at io.c(453) [receiver=2.6.9]
Oct 22 06:25:27 homiebackup10 rsyncd[776]: [ID 702911 daemon.warning] rsync: connection unexpectedly closed (0 bytes received so far) [receiver]
I think the rsync warnings are indicative of the pool being hung. So
it would seem that the bus is freaking out, the pool dies, and
that's that? The strange thing is that this machine is way
underloaded compared to another one we have (5 shelves, ~150TB of
storage attached), which hasn't really had any problems like this.
We had issues with that one when rebuilding drives, but it's been
pretty stable since.
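(Would the per-device error counters from something like:

  iostat -En

be a reasonable way to confirm it's just that one target? As I
understand it, that breaks out soft/hard/transport error counts per
drive.)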
looking at fmdump -eV, I see lots and lots of these:
Oct 24 2009 05:02:54.098815545 ereport.io.scsi.cmd.disk.tran
nvlist version: 0
        class = ereport.io.scsi.cmd.disk.tran
        ena = 0x882108543f200401
        detector = (embedded nvlist)
        nvlist version: 0
                version = 0x0
                scheme = dev
                device-path = /p...@0,0/pci8086,4...@5/pci1000,3...@0/s...@30,0
        (end detector)

        driver-assessment = retry
        op-code = 0x28
        cdb = 0x28 0x0 0x51 0x9c 0xa5 0x80 0x0 0x0 0x80 0x0
        pkt-reason = 0x4
        pkt-state = 0x0
        pkt-stats = 0x10
        __ttl = 0x1
        __tod = 0x4ae2ecee 0x5e3ce39
always with the same device name. So, it would appear that the drive at
that location is probably broken, and zfs just isn't detecting it properly?
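(I'm assuming I can map that device-path back to a cXtYdZ name by
grepping the /dev/rdsk symlinks, something like:

  ls -l /dev/rdsk/*s2 | grep '@30,0'

since those entries point into /devices -- does that sound right?)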
Also, I'm wondering if this is related to the recent thread titled
"[zfs-discuss] SNV_125 MPT warning in logfile", as we're using the
same controller that person mentions.
We're going to order some beefier controllers with the next shipment,
any suggestions on what to get? If we find that the new controllers
work much better, we may even go as far as replacing the ones in the
existing machines (or at least any machines experiencing these issues).
We're not married to LSI, but we use LSI controllers in our
webservers for the most part and they're pretty solid there (though
admittedly those are hardware RAID rather than JBOD).
Thanks so much for your help!
-Jeremy