I have a large pool and I started getting the following errors on one
of the LUNs:

Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/[EMAIL PROTECTED] (sd337):
Mar 13 17:52:36 gdo-node-2      Error for Command: write(10)               Error Level: Retryable
Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.notice]        Requested Block: 15782                     Error Block: 15782
Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.notice]        Vendor: STK                                Serial Number:
Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.notice]        Sense Key: Hardware Error
Mar 13 17:52:36 gdo-node-2 scsi: [ID 107833 kern.notice]        ASC: 0x84 (<vendor unique code 0x84>), ASCQ: 0x0, FRU: 0x0
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/[EMAIL PROTECTED] (sd337):
Mar 13 17:52:37 gdo-node-2      Error for Command: write(10)               Error Level: Retryable
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Requested Block: 885894                    Error Block: 885894
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Vendor: STK                                Serial Number:
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Sense Key: Hardware Error
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        ASC: 0x84 (<vendor unique code 0x84>), ASCQ: 0x0, FRU: 0x0
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.warning] WARNING: /scsi_vhci/[EMAIL PROTECTED] (sd337):
Mar 13 17:52:37 gdo-node-2      Error for Command: write(10)               Error Level: Retryable
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Requested Block: 15779                     Error Block: 15779
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Vendor: STK                                Serial Number:
Mar 13 17:52:37 gdo-node-2 scsi: [ID 107833 kern.notice]        Sense Key: Hardware Error

There were other messages at a "Fatal" error level as well.
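
As far as I can tell, the driver's per-device error counters and the FMA
error telemetry behind these messages can also be dumped with something
like the following (using the suspect LUN from my pool):

  iostat -En c8t600A0B8000115EA20000FEDD45E81306d0
  fmdump -eV | more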

From the hardware side of things this LUN has failed as well.  The LUN
is actually composed of a single disk, with the entire disk made into
the LUN: 1 LUN / volume / disk.  I'm testing various configurations on
the hardware side, from RAID-5 volumes to these single-disk volumes.
Back to the issue...


So I was hoping that the hot spare would kick in, but since that didn't
seem to be the case I thought I would try to replace the disk manually.
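
My understanding, which may well be wrong, is that a spare only gets
pulled in automatically once the fault manager actually diagnoses the
device as faulted, so something along these lines should show whether
that ever happened:

  fmadm faulty
  fmdump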

I did the following on this disk, but the errors just keep coming.

  zpool replace -f gdo-pool-01 c8t600A0B8000115EA20000FEDD45E81306d0 \
c8t600A0B800011399600007D6F45E8149Bd0


Originally the replacement disk was one of the spares for this pool,
hence I think I had to use the -f.

I had removed the disk from the spares just prior to the above zpool
replace:

  zpool remove gdo-pool-01 c8t600A0B800011399600007D6F45E8149Bd0
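
A quick way to confirm it really left the spares list is to grep the
pool status for that device:

  zpool status gdo-pool-01 | grep c8t600A0B800011399600007D6F45E8149Bd0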


After the replacement the raidz2 group looked like:

bash-3.00# zpool status gdo-pool-01
  pool: gdo-pool-01
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: resilver completed with 0 errors on Tue Mar 13 17:38:21 2007
config:

...
<several raidz2 group listings deleted to make this email shorter >
...

          raidz2                                     ONLINE       0     0     0
            c8t600A0B800011399600007CDD45E80D31d0    ONLINE       0     0     0
            spare                                    ONLINE       0     0     0
              c8t600A0B8000115EA20000FEDD45E81306d0  ONLINE      14  121.9     0
              c8t600A0B800011399600007D6F45E8149Bd0  ONLINE       0     0     0
            c8t600A0B800011399600007D0745E80F03d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FEF945E814DEd0    ONLINE       0     0     0
            c8t600A0B800011399600007D3145E810E9d0    ONLINE       0     0     0
            c8t600A0B800011399600007D4F45E81263d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FF1F45E8183Ed0    ONLINE       0     0     0
            c8t600A0B800011399600007D6B45E81471d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FE8B45E80D46d0    ONLINE       0     0     0
            c8t600A0B800011399600007C6F45E80927d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FEA745E80ED4d0    ONLINE       0     0     0
            c8t600A0B800011399600007C9945E80ABDd0    ONLINE       0     0     0
            c8t600A0B800011399600007CB545E80B81d0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FEC345E8114Ed0    ONLINE       0     0     0
            c8t600A0B800011399600007CDF45E80D3Fd0    ONLINE       0     0     0
            c8t600A0B8000115EA20000FEDF45E81316d0    ONLINE       0     0     0

So even after the replace, the read and write errors continue to
accumulate in the zpool status output, and I continue to see errors
in /var/adm/messages.
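
I can zero the per-device counters with 'zpool clear' to watch whether
they keep climbing, along the lines of:

  zpool clear gdo-pool-01 c8t600A0B8000115EA20000FEDD45E81306d0

but that obviously doesn't address the underlying problem.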

This system is an x4600 running Solaris 10 update 3, with fairly recent
patches applied.
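
If the exact kernel/patch level matters, I can post the output of:

  uname -a
  cat /etc/release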


Any advice on what I should have done, or on what I can do to make the
system stop using the bad LUN, would be appreciated.
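
For example, once the resilver onto the replacement disk has finished,
would it be correct to simply offline or detach the bad device, along
the lines of:

  zpool offline gdo-pool-01 c8t600A0B8000115EA20000FEDD45E81306d0

or

  zpool detach gdo-pool-01 c8t600A0B8000115EA20000FEDD45E81306d0

or is there a better way given the spare/replace state shown above?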

Thank you,

David

