I have an external FireWire enclosure with four IDE drives in it, attached
to a Sun Blade 2500. I've built the following pool on them:
banff[1]% zpool status
  pool: pond
 state: ONLINE
 scrub: none requested
config:

        NAME         STATE     READ WRITE CKSUM
        pond         ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            c10t0d0  ONLINE       0     0     0
            c10t0d1  ONLINE       0     0     0
            c11t0d0  ONLINE       0     0     0
            c11t0d1  ONLINE       0     0     0

errors: No known data errors
I've partly filled it with data to stress it, and now when I run a
'zpool scrub', it gets to this point and stops:
banff[13]# zpool status
  pool: pond
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are
        unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-9P
 scrub: scrub in progress, 35.30% done, 4h3m to go
This is usually followed by repeated complaints on the console:
Mar 19 18:46:30 banff scsi: WARNING:
/[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd10):
Mar 19 18:46:30 banff Error for Command: read(10)    Error Level: Retryable
Mar 19 18:46:30 banff scsi: Requested Block: 218914280    Error Block: 218914280
Mar 19 18:46:30 banff scsi: Vendor: WDC WD25    Serial Number:
Mar 19 18:46:30 banff scsi: Sense Key: Media_Error
Mar 19 18:46:30 banff scsi: ASC: 0x4b (data phase error), ASCQ: 0x0, FRU: 0x0
(repeats four times)
Mar 19 18:47:33 banff scsi: WARNING:
/[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],1 (sd11):
Mar 19 18:47:33 banff SCSI transport failed: reason 'reset': retrying command
Mar 19 19:01:36 banff scsi: WARNING:
/[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd10):
Mar 19 19:01:36 banff SCSI transport failed: reason 'reset': giving up
Mar 19 19:09:34 banff scsi: WARNING:
/[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],1 (sd11):
Mar 19 19:09:34 banff SCSI transport failed: reason 'reset': giving up
Mar 19 19:12:37 banff scsi: WARNING:
/[EMAIL PROTECTED],600000/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED]/[EMAIL PROTECTED],0 (sd10):
Mar 19 19:12:37 banff SCSI transport failed: reason 'reset': giving up
After one of these "giving up" messages, the machine and the storage
never recover; all access grinds to a halt until I reboot the server.
The same blocks fail on the same disks each time, and I'm surprised
the drives' bad block remapping doesn't come into play. I tried
overwriting the failing block on one of the drives with zeros to force
a remap, but it didn't help:
banff[20]# dd if=/dev/zero of=/dev/rdsk/c10t0d0 oseek=218914280 count=1
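As a sanity check on the dd arguments above: oseek counts in output blocks,
and dd's default block size is 512 bytes, so the write lands at byte offset
block * 512. The sketch below just does that arithmetic. It assumes 512-byte
sectors and that the "Error Block" reported by sd is relative to the start of
the same device node that dd writes to; neither is confirmed by the logs. The
function name block_to_bytes is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: convert a block number to a byte offset.
# Assumes 512-byte blocks unless a second argument says otherwise.
block_to_bytes() {
    block=$1
    bs=${2:-512}
    echo $(( block * bs ))
}

# Offset of the block named in the console messages.
block_to_bytes 218914280    # prints 112084111360
```

That is roughly 112 GB into the device. If the block number were relative to
a slice instead of the whole disk, the dd above would have written somewhere
else, which would be one reason it made no difference.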
Help? I don't understand why zfs isn't handling this. I do not have
confidence that the external case is my friend here (more about that
in another post), but I'm surprised at this failure mode.
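The observation that the same blocks keep failing on the same disks can be
checked by tallying the console messages. The following is a minimal sketch,
assuming the log text has the shape quoted above: a line naming the sd
instance in parentheses, followed by a line containing "Error Block: N". The
function name tally_error_blocks is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read console/syslog text on stdin and print
# "<sd instance> <error block> <count>" for each distinct pair.
tally_error_blocks() {
    awk '
        # Remember the most recent sd instance, e.g. "(sd10):" -> sd10.
        match($0, /\(sd[0-9]+\)/) {
            dev = substr($0, RSTART + 1, RLENGTH - 2)
        }
        # Attribute each reported error block to that instance.
        # "Error Block: " is 13 characters long.
        match($0, /Error Block: [0-9]+/) {
            blk = substr($0, RSTART + 13, RLENGTH - 13)
            n[dev " " blk]++
        }
        END { for (k in n) print k, n[k] }
    ' | sort
}
```

On Solaris the input would typically be something like
`tally_error_blocks < /var/adm/messages`, assuming the scsi warnings are
logged there. A short list with high counts would support the idea of a few
genuinely bad sectors; many different blocks spread across both disks would
point more toward the enclosure or the bus.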
Thanks,
Rob T
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss