>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
 >> If you really mean there are devices out there which never
 >> return error codes, and always silently return bad data, please
 >> tell us which one and the story of when you encountered it,

    re> I blogged about one such case.
    re> http://blogs.sun.com/relling/entry/holy_smokes_a_holey_file

    re> However, I'm not inclined to publically chastise the vendor or
    re> device model.  It is a major vendor and a popular
    re> device. 'nuff said.

It's not really enough for me, and what's more, the case doesn't match
what we were looking for: a device which ``never returns error codes,
always silently returns bad data.''  I asked for this because you said
``However, not all devices return error codes which indicate
unrecoverable reads,'' which I think is wrong.  Rather, most devices
sometimes don't, not some devices always don't.

Your experience doesn't say anything about this drive's inability to
return UNC errors.  It says you suspect it of silently returning bad
data, once, but it doesn't even clearly implicate the device that
once: it could have been cabling/driver/power-supply/zfs-bugs when the
block was written.  I was hoping for a device in your ``bad stack''
which does it over and over.

Remember, I'm not arguing ZFS checksums are worthless---I think
they're great.  I'm arguing with your original statement that ZFS is
the only software RAID which deals with the dominant error you find in
your testing, unrecoverable reads.  This is untrue!

    re> This number should scare the *%^ out of you.  It basically
    re> means that no data redundancy is a recipe for disaster.

Yeah, but that 9.5% number alone isn't an argument for ZFS over other
software LVMs.

    re> 0.466%/yr is a per-disk rate.  If you have 10 disks, your
    re> exposure is 4.6% per year.  For 100 disks, 46% per year, etc.

No, you're doing the statistics wrong, and in a really elementary way:
you're counting multiple times the possible years in which more than
one disk out of the hundred fails.  If what you care about for 100
disks is the chance that no disk experiences an error within one year,
then you need to calculate

  (1 - 0.00466) ^ 100 = 62.7%

so that's a 37% probability that at least one disk silently corrupts
something, not 46%.  For 10 disks the mistake doesn't make much
difference, and 4.6% is about right.  (A two-line check of this
arithmetic appears below.)

I don't dispute ZFS checksums have value, but the point stands that
the reported-error failure mode is 20x more common in NetApp's study
than this one, and other software LVMs do take care of the more common
failure mode.

    re> UNCs don't cause ZFS to freeze as long as failmode != wait or
    re> ZFS manages the data redundancy.

The time between issuing the read and getting the UNC back can be up
to 30 seconds, and there are often several unrecoverable sectors in a
row, as well as lower-level retries multiplying that 30-second value.
So it ends up being a freeze.

To fix it, ZFS needs to dispatch read requests for redundant data if
the driver doesn't reply quickly.  ``Quickly'' can be ambiguous, but
the whole point of FMD was supposed to be that complicated statistics
could be collected at various levels to identify even more subtle
things than READ and CKSUM errors, like drives that are working at
1/10th the speed they should be.  Yet right now we can't even flag a
drive taking 30 seconds to read a sector: ZFS is still ``patiently
waiting'', and now that FMD is supposedly integrated, instead of a
discussion of what knobs and responses there are, you're passing the
buck to the drivers and their haphazard, nonuniform exception state
machines.
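(An aside: here is the promised check of the failure-rate arithmetic.
It's a sketch in Python, assuming the study's 0.466%/yr per-disk rate
and that disks corrupt data independently:

  # chance of at least one silent-corruption event per year,
  # across n disks, each with per-disk, per-year rate p
  p = 0.00466
  for n in (10, 100):
      p_none = (1 - p) ** n        # no disk has an event all year
      print("%3d disks: %4.1f%%" % (n, 100 * (1 - p_none)))

which prints 4.6% for 10 disks and 37.3% for 100, matching the numbers
above.)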
As for fixing the freeze: the best answer isn't changing drivers to
make the drive time out in 15 seconds instead---it's to send the read
to other disks quickly using a very simple state machine (a toy sketch
is at the end of this message), and to start actually using FMD and a
complicated state machine to generate suspicion-events for slow disks
that aren't returning errors.  Also, the driver and mid-layer need to
work with the hypothetical ZFS-layer timeouts to be as good as
possible about not stalling the SATA chip, the channel if there's a
port multiplier, or freezing the whole SATA stack including other
chips, just because one disk has an outstanding READ command waiting
to get a UNC back.

In some sense the disk drivers and ZFS have different goals.  The goal
of drivers should be to keep marginal disk/cabling/... subsystems
online as aggressively as possible, while the goal of ZFS should be to
notice and work around slightly-failing devices as soon as possible.
I thought the point of putting off reasonable exception handling for
two years while waiting for FMD was to be able to pursue both goals
simultaneously, without pressure to compromise one in favor of the
other.

In addition, I'm repeating myself like crazy at this point, but ZFS
tools used across all pools, like 'zpool status', need to not freeze
when a single pool, or a single device within a pool, is unavailable
or slow, and this expectation has nothing to do with failmode on the
failing pool.  And NFS running above ZFS should continue serving
filesystems from available pools even if some pools are faulted;
again, nothing to do with failmode.  Neither is the case now, and it's
not a driver fix.  But even beyond fixing these basic problems there's
vast room for improvement, to deliver something better than LVM2 and
closer to NetApp, rather than just catching up.
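Since I keep saying ``very simple state machine,'' here is the toy
sketch I mean.  It's plain Python threads, not ZFS code, and the
function names, the disk table, and the 100ms deadline are all made-up
illustrations: issue the read to one side of the mirror; if it hasn't
answered within the deadline, fan the same read out to the other
sides, take whichever copy arrives first, and emit a suspicion-event
for the slow disk.

  import concurrent.futures
  import time

  DEADLINE = 0.1    # 100ms, far below a 30-second firmware retry storm

  def read_from(disk, block):
      # stand-in for a device read; a sick disk can stall for ~30s
      time.sleep(disk["latency"])
      return (disk["name"], block, "<data>")

  def impatient_read(pool, mirrors, block):
      futures = [pool.submit(read_from, mirrors[0], block)]
      try:
          return futures[0].result(timeout=DEADLINE)
      except concurrent.futures.TimeoutError:
          # primary side is slow: emit a suspicion-event an FMD-style
          # agent could accumulate, then race the remaining sides
          print("suspicion-event: %s took > %gs on block %d"
                % (mirrors[0]["name"], DEADLINE, block))
          futures += [pool.submit(read_from, m, block)
                      for m in mirrors[1:]]
          done, _ = concurrent.futures.wait(
              futures, return_when=concurrent.futures.FIRST_COMPLETED)
          return next(iter(done)).result()

  mirrors = [{"name": "c0t0d0", "latency": 5.0},  # stuck retrying a UNC
             {"name": "c0t1d0", "latency": 0.01}]
  pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(mirrors))
  print(impatient_read(pool, mirrors, block=12345))

The read returns from c0t1d0 in milliseconds even though c0t0d0 is
stuck, which is the whole point.  Note the toy still lingers at exit
until the slow thread finishes; not stalling the controller behind the
abandoned command is exactly the driver/mid-layer work described
above.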