>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
re> I really don't know how to please you.

dd from the raw device instead of through ZFS would be better.  If
you could show that you can write data to a sector and read back
different data, without getting an error, over and over, I'd be
totally stunned.  (A rough sketch of that kind of test is at the end
of this message.)

The NetApp paper was different from your test in many ways that make
their claim that ``all drives silently corrupt data sometimes'' more
convincing than your claim that you have ``one drive which silently
corrupts data always and never returns UNC'':

 * not a desktop.  The circumstances were more tightly controlled,
   and their drive population was installed in a uniform, repeatable
   way.

 * their checksum measurement was better than ZFS's: it broke the
   errors into three buckets instead of one, their filesystem is more
   mature, and it is not already known to count CKSUM errors for
   circumstances other than silent corruption, all of which argues
   that their checksum hits are less likely to come from software
   bugs.

 * they make statistical arguments that at least some of the errors
   really are coming from the drives, by showing the errors have
   spatial locality w.r.t. the LBA on the drive and are correlated
   with drive age and impending drive failure.

The paper was less convincing in one way:

 * their drives are running nonstandard firmware.

re> Anyone who has been around for a while will have similar
re> anecdotes.

yeah, you'd think, but my similar anecdote is that (a) I can get UNCs
repeatably on a specific bad sector, persisting either forever or
until I write new data to that sector with dd, and I do get them on
at least 10% of my drives per year, and (b) I get CKSUM errors from
ZFS all the time with my iSCSI ghetto-SAN and with an IDE/Firewire
mirror, often from things I can specifically trace back to
not-a-drive-failure, but so far never from something I can for
certain trace back to silent corruption by the disk drive.

I don't doubt that silent corruption happens, but CKSUM isn't a way
to spot it.  ZFS may give me a way to stop it, but it doesn't give me
an accurate way to measure or notice it.

re> Indeed. Intuitively, the AFR and population is more easily
re> grokked by the masses.

It has nothing to do with the masses.  There's an error in your math.
It's not right under any circumstance.  Your point that a 100-drive
population has high odds of seeing silent corruption within a year
isn't diminished by the correction, but it would be nice if you would
own up to the statistics mistake, since we're taking you at your word
on a lot of other statistics.  (Worked numbers on the population math
are at the end of this message.)

>> so, it ends up being a freeze.

re> Untrue.  There are disks which will retry forever.

I don't understand.  ZFS freezes until the disk stops retrying and
returns an error.  Because some disks never stop retrying and never
return an error, and just lock up until they're power-cycled, it's
untrue that ZFS freezes?  I think either you or I have lost the
thread of the argument in our back-and-forth.

re> please file bugs.

OK, I filed the NFS bug, but unfortunately I don't have output to cut
and paste into it.  Glad to see the 'zpool status' bug is there
already and includes the point that lots of other things are probably
hanging which shouldn't.
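Here is a minimal sketch of the raw-device write-then-read-back test
mentioned at the top.  It is only an illustration: the device path,
sector offset, and pass count are hypothetical placeholders, it needs
root, and it must point at a raw device node you can afford to
destroy so the reads come from the drive rather than a cache.

  # sketch only: DEVICE and OFFSET are hypothetical placeholders.
  # point it at a raw device node you can afford to scribble on
  # (something like /dev/rdsk/... on Solaris), never at a disk with
  # data you care about.  using the raw node matters: reads should
  # come from the drive itself, not from a filesystem cache.
  import os

  DEVICE = "/dev/rdsk/c1t0d0s0"   # hypothetical
  OFFSET = 123456 * 512           # hypothetical sector, as a byte offset
  SECTOR = 512
  PASSES = 100

  fd = os.open(DEVICE, os.O_RDWR | os.O_SYNC)
  try:
      for i in range(PASSES):
          # vary the pattern each pass so a stale sector shows up as a mismatch
          pattern = bytes([i % 256]) * SECTOR
          os.lseek(fd, OFFSET, os.SEEK_SET)
          os.write(fd, pattern)
          os.lseek(fd, OFFSET, os.SEEK_SET)
          readback = os.read(fd, SECTOR)
          if readback != pattern:
              print("pass %d: wrote one thing, read back another" % i)
  finally:
      os.close(fd)

If a drive really corrupted every write silently, a loop like this
would complain on most passes without the drive ever returning an
error, which is exactly the behaviour being doubted above.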
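On the population math: the disputed calculation isn't shown in this
thread, so the numbers below are purely illustrative, with a made-up
per-drive rate p.  The only point is the difference between scaling a
per-drive probability by multiplying it out and computing the chance
that at least one of n independent drives is affected.

  # illustrative only: p is a made-up per-drive probability of silent
  # corruption within a year, not a measured figure.
  p = 0.004     # hypothetical per-drive probability
  n = 100       # drives in the population

  naive   = n * p                 # simple multiplication
  atleast = 1 - (1 - p) ** n      # P(at least one affected), assuming independence

  print("n*p        = %.4f" % naive)     # 0.4000
  print("1-(1-p)^n  = %.4f" % atleast)   # about 0.3302

The two agree when n*p is small, but the simple product can drift
above 1 (i.e. above 100%) as n grows, while 1-(1-p)^n cannot.  Either
way, the conclusion that a 100-drive population is fairly likely to
see at least one such event in a year survives the correction.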