On Jul 13, 2009, at 11:33 AM, Ross <no-re...@opensolaris.org> wrote:

Gaaah, looks like I spoke too soon:

$ zpool status
 pool: rc-pool
state: DEGRADED
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
       using 'zpool clear' or replace the device with 'zpool replace'.
  see: http://www.sun.com/msg/ZFS-8000-9P
scrub: resilver in progress for 2h59m, 77.89% done, 0h50m to go
config:

       NAME              STATE     READ WRITE CKSUM
       rc-pool           DEGRADED     0     0     0
         mirror          DEGRADED     0     0     0
           c4t1d0        ONLINE       0     0     0  218M resilvered
            replacing     UNAVAIL      0  963K     0  insufficient replicas
             c4t2d0s0/o  FAULTED  1.71M 23.4M     0  too many errors
             c4t2d0      REMOVED      0  964K     0  67.0G resilvered
           c5t1d0        ONLINE       0     0     0  218M resilvered
         mirror          ONLINE       0     0     0
           c4t3d0        ONLINE       0     0     0
           c5t2d0        ONLINE       0     0     0
           c5t0d0        ONLINE       0     0     0
         mirror          ONLINE       0     0     0
           c5t3d0        ONLINE       0     0     0
           c4t5d0        ONLINE       0     0     0
           c4t4d0        ONLINE       0     0     0
         mirror          ONLINE       0     0     0
           c5t4d0        ONLINE       0     0     0
           c5t5d0        ONLINE       0     0     0
           c4t6d0        ONLINE       0     0     0
         mirror          ONLINE       0     0     0
           c4t7d0        ONLINE       0 13.0K     0
           c5t6d0        ONLINE       0     0     0
           c5t7d0        ONLINE       0     0     0
       logs              DEGRADED     0     0     0
         c6d1p0          ONLINE       0     0     0

errors: No known data errors
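For reference, the two recovery paths that status output suggests look roughly like this (a hedged sketch using the device names from the pool above; run them only after the resilver completes and you trust the hardware again, and note that c4t8d0 is a hypothetical spare, not a device from this pool):

```shell
# If the errors were transient (e.g. a cabling glitch that has since
# been fixed), clear the error counters and let ZFS carry on:
zpool clear rc-pool c4t2d0

# If the disk itself is bad, replace it with a new device
# (c4t8d0 here is a hypothetical spare):
zpool replace rc-pool c4t2d0 c4t8d0

# Then watch resilver progress:
zpool status rc-pool
```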


There are a whole bunch of errors in /var/adm/messages:

Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.warning] WARNING: /p...@1,0/pci1022,7...@1/pci11ab,1...@2/d...@2,0 (sd3):
Jul 13 15:56:53 rob-036        Error for Command: write(10)    Error Level: Retryable
Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice]  Requested Block: 83778048    Error Block: 83778048
Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice]  Vendor: ATA    Serial Number:
Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice]  Sense Key: Aborted_Command
Jul 13 15:56:53 rob-036 scsi: [ID 107833 kern.notice]  ASC: 0x0 (no additional sense info), ASCQ: 0x0, FRU: 0x0


Jul 13 15:57:31 rob-036 scsi: [ID 107833 kern.warning] WARNING: /p...@1,0/pci1022,7...@1/pci11ab,1...@2/d...@2,0 (sd3):
Jul 13 15:57:31 rob-036        Command failed to complete...Device is gone


Not what I would expect from a brand new drive!!

Does anybody have any tips on how I can work out where the fault lies here? I wouldn't expect the controller to be at fault with so many other drives working, and what on earth is the proper technique for replacing a drive that failed partway through a resilver?
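For what it's worth, the `zpool status` output above shows a stuck "replacing" vdev: the old half (c4t2d0s0/o, FAULTED) and the new disk (c4t2d0, REMOVED mid-resilver). The usual sequence in that situation, sketched with the device names from this pool, is something like:

```shell
# Detach the failed old half so the 'replacing' vdev collapses:
zpool detach rc-pool c4t2d0s0/o

# After fixing the underlying problem, bring the new disk back
# online so it can finish resilvering:
zpool online rc-pool c4t2d0

# ...or, if the new disk is truly dead, start over:
# zpool replace rc-pool c4t2d0
```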

I really believe there is a problem with either the cabling or the enclosure's backplane here.

Two disks failing is statistical coincidence; three disks means it ain't the disks that are bad (assuming you checked that there was no recall and the firmware is correct and up to date).

Fix the real problem and the disks already in place should resilver without further interruption.
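To back that up with data before swapping hardware, the standard Solaris tools for localizing this kind of fault are (a hedged sketch; all of these are read-only):

```shell
# Per-device error counters.  Transport errors spread across several
# disks on the same path point at cabling/backplane/controller
# rather than the disks themselves:
iostat -En

# FMA's error telemetry and any faults it has already diagnosed:
fmdump -eV | less
fmadm faulty

# Driver-level complaints, as already seen above:
grep scsi /var/adm/messages | tail -50
```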

-Ross

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
