Hello,

There have been comparisons posted here (and in general out there on the net) 
for various RAID levels and the chances of e.g. double failures. One problem 
that is rarely addressed, though, is the various edge cases that significantly 
affect the probability of data loss.

In particular, I am concerned about the relative likelihood of bad sectors on 
a drive vs. entire-drive failure. On a raidz where uptime is not important, 
I would not want a dead drive plus a single bad sector on another drive to 
cause loss of data, yet a dead drive plus a bad sector is going to be a lot 
more likely than two dead drives within the same time window.
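To make that concrete, here is a back-of-envelope sketch (all numbers are 
assumptions picked for illustration, not measurements — drive size, annual 
failure rate, resilver window and unrecoverable-read-error rate would all 
need to be plugged in for your actual hardware):

```python
# Illustrative comparison for a degraded 4-disk raidz resilvering onto a
# replacement: P(a second whole drive dies during the resilver window)
# vs. P(an unrecoverable read error while reading the survivors in full).
# All inputs below are assumed, hypothetical values.

drive_bytes = 500e9        # assumed drive capacity
n_survivors = 3            # surviving drives in a 4-disk raidz
afr = 0.03                 # assumed annual whole-drive failure rate
resilver_days = 1.0        # assumed length of the resilver window
ure_per_bit = 1e-14        # assumed unrecoverable read errors per bit read

# Chance that at least one survivor dies outright during the window.
p_drive = afr * resilver_days / 365.0
p_second_death = 1.0 - (1.0 - p_drive) ** n_survivors

# Chance of at least one bad sector while reading every survivor end to end.
bits_read = drive_bytes * 8 * n_survivors
p_bad_sector = 1.0 - (1.0 - ure_per_bit) ** bits_read

print(f"P(second drive dies during resilver): {p_second_death:.4%}")
print(f"P(bad sector during resilver):        {p_bad_sector:.4%}")
```

With those (made-up) inputs the bad-sector case comes out orders of 
magnitude more likely than a second whole-drive death, which is exactly the 
asymmetry I am worried about.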

In many situations it may not feel worth it to move to a raidz2 just to avoid 
this particular case.

I would like a pool policy that allowed one to specify that at the moment a 
disk fails (where "fails" = considered faulty), all mutable I/O is 
immediately stopped (returning I/O errors to userspace, I presume), and any 
transaction in the process of being committed is rolled back. The result is 
that the drive that just failed completely does not immediately go out of 
date.

If one then triggers a bad block on another drive while resilvering onto a 
replacement drive, you know that you still have the failed drive as a last 
resort (given that a full-drive failure is unlikely to mean the drive was 
physically obliterated; perhaps the controller circuitry or certain other 
physical components can be replaced). In the case of raidz2, you effectively 
have another "half" level of redundancy.

Also, with either raidz/raidz2 one can imagine cases where a machine is booted 
with one or two drives missing (due to cabling issues, for example); 
guaranteeing that no pool is ever online for writable operations (thus making 
absent drives out of date) until the administrator explicitly asks for it 
would greatly reduce the probability of data loss due to a bad block in this 
case as well.

In short, if true irrevocable data loss is limited (assuming no software 
issues) to the complete obliteration of all data on n drives (for n levels of 
redundancy), or alternatively to the unlikely event of bad blocks coinciding 
on multiple drives, wouldn't reliability be significantly increased in cases 
where this is an acceptable practice?

Opinions?

-- 
/ Peter Schuller, InfiDyne Technologies HB

PGP userID: 0xE9758B7D or 'Peter Schuller <[EMAIL PROTECTED]>'
Key retrieval: Send an E-Mail to [EMAIL PROTECTED]
E-Mail: [EMAIL PROTECTED] Web: http://www.scode.org

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss