On Dec 10, 2009, at 8:36 AM, Mark Grant wrote:
From what I remember, the problem with the hardware RAID controller
is that the long delay before the drive responds causes the drive to
be dropped from the array. If you then hit another error on a
different drive while rebuilding, that disk is also marked failed and
your whole filesystem is gone, even though most of the data is still
readable on the disks. Odds are you could have recovered 100% of the
data from what is still readable across the complete set of drives,
since the bad sectors on the two "failed" drives probably wouldn't be
in the same place. The end result is worse than not using RAID at
all, because you lose everything rather than just the files with bad
sectors (though with mirroring rather than parity you could
presumably recover most of the data eventually).
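To put rough, purely illustrative numbers on that (not figures from
the drives in question): if each of two 1 TB drives (about 2 x 10^9
512-byte sectors) has, say, 100 unreadable sectors, and the bad spots
fall independently, the expected number of sector addresses that are
bad on both drives is roughly 100 x 100 / (2 x 10^9) = 5 x 10^-6, so
the chance that any given stripe is unreadable on both is essentially
nil.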
Certainly if a disk were taking that long to respond I'd be replacing
it ASAP, but ASAP may not be fast enough if a second drive has bad
sectors too. And I have seen a consumer SATA drive repeatedly lock up
a system for a minute at a time doing retries when there was no
indication at all beforehand that the drive had problems.
For the Solaris sd(7d) driver, the default timeout is 60 seconds with
3 or 5 retries, depending on the hardware. Whether you notice this at
the application level depends on other factors: reads vs. writes, etc.
You can tune this, of course, and you have access to the source.
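
As a rough sketch (assuming the sd_io_time tunable applies to your
build; the example value is arbitrary), you could shorten the
per-command timeout in /etc/system:

  * shorten the sd per-command timeout from the 60-second default to 10 seconds
  set sd:sd_io_time = 10

This only takes effect on the next reboot, and the retry count is
controlled differently depending on the driver and release, so check
the source for your build before relying on specific values.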
-- richard