We just had our first x4500 disk failure (which of course had to happen
late Friday night <sigh>). I've opened a ticket on it but don't expect a
response until Monday, so I was hoping to verify that the hot spare took
over correctly and that we still have redundancy pending device replacement.

This is an S10U6 box:

SunOS cartman 5.10 Generic_141445-09 i86pc i386 i86pc

Looks like the first errors started yesterday morning:

Jan  8 07:46:02 cartman marvell88sx: [ID 268337 kern.warning] WARNING:
marvell88sx1:device on port 2 failed to reset
Jan  8 07:46:15 cartman marvell88sx: [ID 268337 kern.warning] WARNING:
marvell88sx1:device on port 2 failed to reset
Jan  8 07:46:32 cartman sata: [ID 801593 kern.warning] WARNING:
/p...@0,0/pci1022,7...@2/pci11ab,1...@1:
Jan  8 07:46:32 cartman SATA device at port 2 - device failed
Jan  8 07:46:32 cartman scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci1022,7...@2/pci11ab,1...@1/d...@2,0 (sd26):
Jan  8 07:46:32 cartman         Command failed to complete...Device is gone

ZFS failed the drive about 11:15PM:

Jan  8 23:15:01 cartman zpool_check[3702]: [ID 702911 daemon.error] zpool export status: One or more devices has experienced an unrecoverable error.  An
Jan  8 23:15:01 cartman zpool_check[3702]: [ID 702911 daemon.error] zpool export status: attempt was made to correct the error.  Applications are unaffected.
Jan  8 23:15:01 cartman zpool_check[3702]: [ID 702911 daemon.error] unknown header see
Jan  8 23:15:01 cartman zpool_check[3702]: [ID 702911 daemon.error] warning: pool export health DEGRADED

However, the errors still continue:

Jan  9 03:54:48 cartman scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci1022,7...@2/pci11ab,1...@1/d...@2,0 (sd26):
Jan  9 03:54:48 cartman         Command failed to complete...Device is gone
[...]
Jan  9 07:56:12 cartman scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci1022,7...@2/pci11ab,1...@1/d...@2,0 (sd26):
Jan  9 07:56:12 cartman         Command failed to complete...Device is gone
Jan  9 07:56:12 cartman scsi: [ID 107833 kern.warning] WARNING:
/p...@0,0/pci1022,7...@2/pci11ab,1...@1/d...@2,0 (sd26):
Jan  9 07:56:12 cartman         drive offline

If ZFS removed the drive from the pool, why does the system keep
complaining about it? Is fault management stuff still poking at it?
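
I was planning to poke at it myself with something like the following
(just my guess at the right things to check -- I haven't confirmed the
cfgadm port mapping on this box):

# cfgadm -al | grep sata
# fmadm faulty

to see whether the SATA framework still thinks the device is attached and
whether FMA has flagged anything against it.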

Here's the zpool status output:

  pool: export
 state: DEGRADED
[...]
 scrub: scrub completed after 0h6m with 0 errors on Fri Jan  8 23:21:31 2010


        NAME          STATE     READ WRITE CKSUM
        export        DEGRADED     0     0     0

          mirror      DEGRADED     0     0     0
            c0t2d0    ONLINE       0     0     0
            spare     DEGRADED 18.9K     0     0
              c1t2d0  REMOVED      0     0     0
              c5t0d0  ONLINE       0     0 18.9K

        spares
          c5t0d0      INUSE     currently in use

Is the pool/mirror/spare still supposed to show up as degraded after the
hot spare is deployed?

The spare vdev shows 18.9K read errors, and the hot spare c5t0d0 itself
shows 18.9K checksum errors -- why would the replacement device be
accumulating errors too?

The scrub started at 11pm last night and the disk got booted at 11:15pm,
so presumably the scrub came across the failures the OS had been reporting.
The status above shows that scrub completing successfully, but what
happened to the resilver status? How can I tell whether the resilver was
successful? Did the resilver start and complete while the scrub was still
running, so that its status output was lost? Is there any way to see the
status of past scrubs/resilvers, or is only the most recent one available?
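
I was going to try digging for the resilver events along these lines
(assuming the -i/-l flags to zpool history are available on this release;
otherwise plain "zpool history export" at least shows the administrative
log):

# zpool history -il export | grep -i resilver
# zpool history export | tail -20

though I have no idea whether U6 actually logs resilver start/finish there.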

Fault management doesn't report any problems:

r...@cartman ~ # fmdump
TIME                 UUID                                 SUNW-MSG-ID
fmdump: /var/fm/fmd/fltlog is empty

Shouldn't this show a failed disk?
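
I was also going to sanity-check that the ZFS diagnosis/retire modules are
loaded with something like:

# fmadm config | egrep 'zfs|disk'
# fmdump -v

but I'm not sure what else to look at to figure out why no fault was
diagnosed.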

fmdump -e shows tons of bad stuff:

Jan 08 07:46:32.9467 ereport.fs.zfs.probe_failure
Jan 08 07:46:36.2015 ereport.fs.zfs.io
[...]
Jan 08 07:51:05.1865 ereport.fs.zfs.io

None of that results in a fault diagnosis?
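
In case it matters, I can pull the full ereport payloads with the verbose
flag (as I understand the stock fmdump behavior):

# fmdump -eV | less

which shows the detail members (pool/vdev guids, etc.) for each ereport,
though I don't know how to map that back to why no fault was produced.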

Mostly I'd like to verify my hot spare is working correctly. Given the
spare status is "degraded", the read errors on the spare device, and the
lack of successful resilver status output, it seems like the spare might
not have been added successfully.
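
Once the replacement disk arrives, my understanding of the standard
procedure (please correct me if I have this wrong) is roughly:

# zpool replace export c1t2d0       (resilver onto the new disk in that slot)
# zpool detach export c5t0d0        (return the spare to the spares list,
                                     if it doesn't detach automatically)

but I'd like to be confident the spare actually holds good data before I
get to that point.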

Thanks for any input you might provide...


-- 
Paul B. Henson  |  (909) 979-6361  |  http://www.csupomona.edu/~henson/
Operating Systems and Network Analyst  |  hen...@csupomona.edu
California State Polytechnic University  |  Pomona CA 91768