Chad Leigh -- Shire.Net LLC wrote:
>
> On Dec 1, 2006, at 10:17 PM, Ian Collins wrote:
>
>> Chad Leigh -- Shire.Net LLC wrote:
>>
>>> On Dec 1, 2006, at 4:34 PM, Dana H. Myers wrote:
>>>
>>>> Chad Leigh -- Shire.Net LLC wrote:
>>>>
>>>>> And this is different from any other storage system, how? (ie, JBOD
>>>>> controllers and disks can also have subtle bugs that corrupt data)
>>>>
>>>> Of course, but there isn't the expectation of data reliability with a
>>>> JBOD that there is with some RAID configurations.
>>>
>>> There is not?  People buy disk drives and expect them to corrupt
>>> their data?  I expect the drives I buy to work fine (knowing that
>>> there could be bugs etc in them, the same as with my RAID systems).
>>
>> So you trust your important data to a single drive?  I doubt it.  But I
>> bet you do trust your data to a hardware RAID array.
>
> Yes, but not because I expect a single drive to be more error prone
> (versus total failure).

Ah.  I was guessing you were thinking this.  You believe that a single
disk is no more prone to a soft data error than a reliable RAID
configuration.  This is arguably incorrect.  Soft errors and hard
failures are pretty similar in that they both result in data
corruption - where they differ is in the recovery.

While I cannot quote a specific figure, disk drives are not immune
from soft errors.  Normally, the drive is able to detect the error and
correct it transparently.  As a result, the apparent soft error rate is
quite low for a typical drive.  However, there are limits to the soft
errors that a drive can detect and correct; it is possible for an error
to slip through the drive's controller without correction if it affects
more than some number of bits.  I don't actually know current estimates
for undetected soft errors, but the rate is small, like 1 error in
10^30 bits or more.  My guess could be wildly wrong - it doesn't change
the outcome.  Perhaps one of the actual disk experts can help here :-).

Thus, undetected soft errors don't happen very often, but they do
happen.  Further, the drive is attached to the computer via a cable and
electronics which themselves are not immune to errors.  Again, the
probability of an error is very small, so small that we take for
granted the reliability of the data.

A single disk can have an undetected error - if the OS doesn't check,
it'll never notice either.  Since this happens rarely, when it does
happen, it may not even be noticed, or have a lasting impact.  If a
record is read from a database with an error but not modified, the
error won't be written back to the disk and may not occur again.  An
application may crash, and simply be restarted.  Since programming
errors occur far more frequently than soft read errors (they do), it's
probable that buggy software will be blamed rather than a soft error
from a disk.

> Total drive failure on a single disk loses all
> your data.  But we are not talking total failure, we are talking errors
> that corrupt data.
> I buy individual drives with the expectation that
> they are designed to be error free

That's not a reasonable expectation, but, for the above reasons, you
may retire and never have seen something that you believed was a soft
error.  The very nature of soft errors tends to hide them, while the
nature of a hard failure exacerbates them.

> and are error free for the most part

For the most part?  They're not always error-free?  Remember, for every
so many detected errors, there's probably an undetected error.

> and I do not expect a RAID array to be more robust in this regard (after
> all, the RAID is made up of a bunch of single drives).

Some RAID configurations are not more robust than a single drive - a
simple stripe/concat, for example.  It doesn't matter what kind of error
is encountered, a soft error or a hard failure; there's no redundancy of
the data.  A hard error can't be ignored, but a soft error probably is.

Reliable RAID configurations can tolerate at least one error - be it
soft or hard.  The real difference is the recovery protocol.

> Some people on this list think that the RAID arrays are more likely to
> corrupt your data than JBOD (both with ZFS on top, for example, a ZFS
> mirror of 2 raid arrays or a JBOD mirror or raidz).  There is no proof
> of this or even reasonable hypothetical explanation for this that I have
> seen presented.

We've had an example presented just last week in the field where a RAID
array running in a reliable configuration returned corrupt data as a
result of a faulty interconnection.  The error was detected by ZFS, but
recovery of the data was not possible because the RAID array, and not
ZFS, was trusted to maintain data integrity.  The RAID array could only
ensure the integrity of the data inside the RAID array; it cannot
detect or correct errors occurring in the interconnect.
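
To make that failure mode concrete, here is a toy model of it.  This is
only a sketch under my own assumptions, not how ZFS is actually
implemented: the class names (Device, FlakyLink, ChecksummedMirror) are
made up for illustration, the checksums live in an in-memory dict, and
the "interconnect fault" is simply a bit flipped on every read.  The
point is that a checksum kept at the top level catches the corruption
either way, but it can only recover if it has a second copy of its own
to fall back on.

```python
import hashlib


class Device:
    """Hypothetical toy block device: block number -> bytes."""

    def __init__(self):
        self.blocks = {}

    def write(self, n, data):
        self.blocks[n] = data

    def read(self, n):
        return self.blocks[n]


class FlakyLink(Device):
    """A device behind a faulty interconnect: the stored data is fine,
    but one bit is flipped in transit on every read."""

    def read(self, n):
        data = bytearray(self.blocks[n])
        data[0] ^= 0x01
        return bytes(data)


class ChecksummedMirror:
    """Top-level layer that keeps its own checksum for each block and
    returns the first copy whose checksum matches."""

    def __init__(self, devices):
        self.devices = devices
        self.sums = {}  # block number -> SHA-256 digest, kept apart from the data

    def write(self, n, data):
        self.sums[n] = hashlib.sha256(data).digest()
        for dev in self.devices:
            dev.write(n, data)

    def read(self, n):
        for dev in self.devices:
            data = dev.read(n)
            if hashlib.sha256(data).digest() == self.sums[n]:
                return data
        raise OSError(f"block {n}: checksum mismatch on every copy")


payload = b"important record"

# Case 1: a single opaque LUN (the RAID array) behind the bad interconnect.
single = ChecksummedMirror([FlakyLink()])
single.write(0, payload)
try:
    single.read(0)
    single_result = "ok"
except OSError:
    single_result = "detected, unrecoverable"

# Case 2: the top layer mirrors across two devices; only one path is bad.
mirror = ChecksummedMirror([FlakyLink(), Device()])
mirror.write(0, payload)
mirror_result = mirror.read(0)
```

In the first case the top layer notices the bad read but has nothing
else to read from; in the second it quietly returns the good copy from
the other device.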

If ZFS had been managing the disks in that array as JBOD, and the disks
were in a reliable configuration, the interconnect errors would most
likely have been detected and corrected transparently, since ZFS is
managing the data at the very top level in the computer, on top of all
interconnection and disks.

So it doesn't matter how often soft errors occur in a single disk; a
reliable configuration will reduce the rate, and a reliable ZFS
configuration will protect against errors in the entire path to the
disks.

Dana
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss