I have an external FireWire enclosure holding four IDE drives, attached
to a Sun Blade 2500.  I've built the following pool on them:

banff[1]% zpool status
  pool: pond
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pond         ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c10t0d0  ONLINE       0     0     0
            c10t0d1  ONLINE       0     0     0
            c11t0d0  ONLINE       0     0     0
            c11t0d1  ONLINE       0     0     0

errors: No known data errors

I've partly filled it with data to stress it, and now when I run a
'zpool scrub', it gets to this point and stops:

banff[13]# zpool status
  pool: pond
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 35.30% done, 4h3m to go

This is usually followed by repeated complaints on the console:

Mar 19 18:46:30 banff scsi: WARNING: /[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd10):
Mar 19 18:46:30 banff   Error for Command: read(10)    Error Level: Retryable
Mar 19 18:46:30 banff scsi:     Requested Block: 218914280    Error Block: 218914280
Mar 19 18:46:30 banff scsi:     Vendor: WDC WD25    Serial Number:
Mar 19 18:46:30 banff scsi:     Sense Key: Media_Error
Mar 19 18:46:30 banff scsi: ASC: 0x4b (data phase error), ASCQ: 0x0, FRU: 0x0
(repeats four times)

Mar 19 18:47:33 banff scsi: WARNING: /[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],1 (sd11):
Mar 19 18:47:33 banff   SCSI transport failed: reason 'reset': retrying command
Mar 19 19:01:36 banff scsi: WARNING: /[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd10):
Mar 19 19:01:36 banff   SCSI transport failed: reason 'reset': giving up
Mar 19 19:09:34 banff scsi: WARNING: /[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],1 (sd11):
Mar 19 19:09:34 banff   SCSI transport failed: reason 'reset': giving up
Mar 19 19:12:37 banff scsi: WARNING: /[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd10):
Mar 19 19:12:37 banff   SCSI transport failed: reason 'reset': giving up
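
One way to check whether the same blocks keep failing on the same
devices is to tally the "Error Block" fields per device from the
console messages.  The following is only a rough sketch: the function
name tally_error_blocks is made up for illustration, the sample text is
a shortened copy of the messages above (device paths cut to "/..."),
and it assumes each "Error Block" field belongs to the most recent
"(sdN):" tag before it.

```python
import re
from collections import Counter

# Matches either a device tag such as "(sd10):" or an "Error Block: N" field.
TOKEN = re.compile(r"\((sd\d+)\):|Error Block:\s*(\d+)")

def tally_error_blocks(log_text):
    """Count how often each (device, block) pair is reported as failing.

    Hypothetical helper: it scans the whole text rather than line by line,
    so it does not care where the log lines happen to be wrapped.
    """
    counts = Counter()
    device = None
    for match in TOKEN.finditer(log_text):
        if match.group(1):
            # Remember the device named by the latest WARNING line.
            device = match.group(1)
        elif device is not None:
            # Attribute the failing block to that device.
            counts[(device, int(match.group(2)))] += 1
    return counts

# Shortened sample: the same record twice, as the console repeats it.
sample = """\
Mar 19 18:46:30 banff scsi: WARNING: /...,0 (sd10):
Mar 19 18:46:30 banff scsi: Requested Block: 218914280 Error Block: 218914280
Mar 19 18:46:30 banff scsi: WARNING: /...,0 (sd10):
Mar 19 18:46:30 banff scsi: Requested Block: 218914280 Error Block: 218914280
"""

for (dev, block), n in sorted(tally_error_blocks(sample).items()):
    print(dev, block, n)
```

Feeding it the real /var/adm/messages contents instead of the sample
would show at a glance whether a handful of blocks account for all the
media errors.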

After one of these, the machine and the storage never recover and
resume; all access grinds to a halt until I reboot the server.  The
same blocks keep failing on the same disks, and I'm surprised bad
block remapping doesn't come into play.  I tried overwriting the
failing block on one of the drives with zeros to force a remap, but
it didn't help:

banff[20]# dd if=/dev/zero of=/dev/rdsk/c10t0d0 oseek=218914280 count=1
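
As a sanity check on where that dd lands, the byte offset can be worked
out as below.  This is a sketch under two unverified assumptions: that
dd is using its default 512-byte block size (no bs= was given), and
that the "Error Block" number in the log counts 512-byte blocks from
the start of the same device node.  The name byte_offset is
hypothetical.

```python
SECTOR_BYTES = 512  # assumption: dd's default block size and the log's block unit

def byte_offset(block, sector_bytes=SECTOR_BYTES):
    """Return the byte position at which the given block starts."""
    return block * sector_bytes

offset = byte_offset(218914280)
print(offset, "bytes, roughly", round(offset / 2**30, 1), "GiB into the device")
```

If either assumption is wrong (for example, if the reported block is
relative to a slice rather than the whole disk), the write would land
somewhere other than the failing sector.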

Help?  I don't understand why zfs isn't handling this.  I do not have
confidence that the external case is my friend here (more about that
in another post), but I'm surprised at this failure mode.

Thanks,
Rob T
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
