darn, Darren, learning fast!

best,
z

----- Original Message -----
From: "A Darren Dunham" <ddun...@taos.com>
To: <zfs-discuss@opensolaris.org>
Sent: Wednesday, January 14, 2009 6:15 PM
Subject: Re: [zfs-discuss] What are the usual suspects in data errors?

> On Wed, Jan 14, 2009 at 04:39:03PM -0600, Gary Mills wrote:
>> I realize that any error can occur in a storage subsystem, but most
>> of these have an extremely low probability. I'm interested in this
>> discussion in only those that do occur occasionally, and that are
>> not catastrophic.
>
> What level is "extremely low" here?
>
>> Many of those components have their own error checking. Some have
>> error correction. For example, parity checking is done on a SCSI bus,
>> unless it's specifically disabled. Do SATA and PATA connections also
>> do error checking? Disk sector I/O uses CRC error checking and
>> correction. Memory buffers would often be protected by parity memory.
>> Is there any more that I've missed?
>
> Reports suggest that bugs in drive firmware could account for errors at
> a level that is not insignificant.
>
>> What can go wrong with the disk controller? A simple seek to the
>> wrong track is not a problem because the track number is encoded on
>> the platter. The controller will simply recalibrate the mechanism and
>> retry the seek. If it computes the wrong sector, that would be a
>> problem. Does this happen with any frequency?
>
> Netapp documents certain rewrite bugs that they've specifically seen. I
> would imagine they have good data on the frequency that they see it in
> the field.
>
>> In this case, ZFS
>> would detect a checksum error and obtain the data from its redundant
>> copy.
>
> Correct.
>
>> A logic error in ZFS might result in incorrect metadata being written
>> with valid checksum. In this case, ZFS might panic on import or might
>> correct the error. How is this sort of error prevented?
>
> It's very difficult to protect yourself from software bugs with the same
> piece of software. You can create assertions that are hopefully simpler
> and less prone to errors, but they will not catch all bugs.
>
>> Some errors might result from a loss of power if some ZFS data was
>> written to a disk cache but never was written to the disk platter.
>> Again, ZFS might panic on import or might correct the error. How is
>> this sort of error prevented?
>
> ZFS uses a multi-stage commit. It relies on the "disk" responding to a
> request to flush caches to the disk. If that assumption is correct,
> then there is no problem in general with power issues. The disk is
> consistent both before and after the cache is flushed.
>
> --
> Darren
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
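
For anyone following along, here is a rough sketch of the behaviour Darren confirms above: a read whose checksum does not match is served from a redundant copy instead. This is only an illustration, not ZFS code. The names checksum() and self_healing_read() are made up for the example, SHA-256 is used purely for convenience, and repairing the bad copy in place is a simplifying assumption of the sketch.

```python
import hashlib
from typing import List


def checksum(data: bytes) -> bytes:
    # Hypothetical helper: SHA-256 chosen for convenience in this sketch.
    return hashlib.sha256(data).digest()


def self_healing_read(copies: List[bytearray], expected: bytes) -> bytes:
    """Return the first copy matching the expected checksum; repair the rest."""
    good = None
    for copy in copies:
        if checksum(bytes(copy)) == expected:
            good = bytes(copy)
            break
    if good is None:
        # No redundant copy survived, so the error is detected but not fixable.
        raise IOError("all copies failed checksum verification")
    for copy in copies:
        if bytes(copy) != good:
            copy[:] = good  # overwrite the damaged copy with the good data
    return good


# Usage: a two-way "mirror" where the first copy has a corrupted last byte.
block = b"important data"
expected = checksum(block)
mirror = [bytearray(b"important dat\x00"), bytearray(block)]
data = self_healing_read(mirror, expected)
```

The point of the sketch is that the expected checksum is kept apart from the data it covers, so a copy that was silently corrupted (or written to the wrong place) cannot vouch for itself.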