Re: [zfs-discuss] What are the usual suspects in data errors?

JZ Wed, 14 Jan 2009 16:41:58 -0800

darn, Darren, learning fast!

best,
z



----- Original Message ----- 
From: "A Darren Dunham" <ddun...@taos.com>
To: <zfs-discuss@opensolaris.org>
Sent: Wednesday, January 14, 2009 6:15 PM
Subject: Re: [zfs-discuss] What are the usual suspects in data errors?


> On Wed, Jan 14, 2009 at 04:39:03PM -0600, Gary Mills wrote:
>> I realize that any error can occur in a storage subsystem, but most
>> of these have an extremely low probability.  In this discussion I'm
>> interested only in those that do occur occasionally and that are
>> not catastrophic.
> 
> What level is "extremely low" here?
> 
>> Many of those components have their own error checking.  Some have
>> error correction.  For example, parity checking is done on a SCSI bus,
>> unless it's specifically disabled.  Do SATA and PATA connections also
>> do error checking?  Disk sector I/O uses CRC error checking and
>> correction.  Memory buffers would often be protected by parity memory.
>> Is there any more that I've missed?
> 
> Reports suggest that bugs in drive firmware could account for errors at
> a level that is not insignificant.
> 
>> What can go wrong with the disk controller?  A simple seek to the
>> wrong track is not a problem because the track number is encoded on
>> the platter.  The controller will simply recalibrate the mechanism and
>> retry the seek.  If it computes the wrong sector, that would be a
>> problem.  Does this happen with any frequency? 
> 
> NetApp documents certain rewrite bugs that they've specifically seen.  I
> would imagine they have good data on how often they see them in
> the field.
> 
>> In this case, ZFS
>> would detect a checksum error and obtain the data from its redundant
>> copy.
> 
> Correct.
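
A rough sketch of that detect-and-repair path, in Python.  This is a toy
model and not ZFS code: ToyMirror, its methods and the use of SHA-256 are
all assumptions made for illustration.  The one idea it tries to show is
that the checksum is kept by the caller (standing in for a parent block
pointer) rather than next to the data, so a bad copy cannot vouch for
itself.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # SHA-256 is only a stand-in here; any strong block checksum would do.
    return hashlib.sha256(data).digest()


class ToyMirror:
    """Hypothetical two-way mirror; each side maps block id -> bytes."""

    def __init__(self):
        self.sides = [{}, {}]

    def write(self, block_id: int, data: bytes) -> bytes:
        for side in self.sides:
            side[block_id] = data
        # The caller stores this checksum separately from the data.
        return checksum(data)

    def read(self, block_id: int, expected: bytes) -> bytes:
        bad_sides = []
        for index, side in enumerate(self.sides):
            data = side.get(block_id)
            if data is not None and checksum(data) == expected:
                # Good copy found: rewrite any copies that failed earlier.
                for bad in bad_sides:
                    self.sides[bad][block_id] = data
                return data
            bad_sides.append(index)
        raise IOError(f"block {block_id}: no copy matches its checksum")


if __name__ == "__main__":
    mirror = ToyMirror()
    expected = mirror.write(7, b"payload")
    mirror.sides[0][7] = b"pay1oad"  # simulate a bad write on one side
    print(mirror.read(7, expected))
```

If both copies fail the check, the toy raises an error instead of
returning data it cannot verify.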
> 
>> A logic error in ZFS might result in incorrect metadata being written
>> with a valid checksum.  In this case, ZFS might panic on import or might
>> correct the error.  How is this sort of error prevented?
> 
> It's very difficult to protect yourself from software bugs with the same
> piece of software.  You can create assertions that are hopefully simpler
> and less prone to errors, but they will not catch all bugs.
> 
>> Some errors might result from a loss of power if some ZFS data was
>> written to a disk cache but never was written to the disk platter.
>> Again, ZFS might panic on import or might correct the error.  How is
>> this sort of error prevented?
> 
> ZFS uses a multi-stage commit.  It relies on the "disk" honoring a
> request to flush its caches to stable storage.  If that assumption
> holds, then power loss is not a problem in general.  The on-disk state
> is consistent both before and after the cache is flushed.
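
To make the ordering concrete, here is a toy model in Python of a
two-stage commit against a disk with a volatile write cache.  It is a
sketch under assumptions, not how ZFS is implemented: ToyDisk, commit,
current_data and the "root" address are invented names, and the model
only covers a disk that honors flush requests.  New data goes to an
unused address, the cache is flushed, and only then is the root pointer
switched and flushed again.

```python
class ToyDisk:
    """Hypothetical disk: writes land in a volatile cache until flushed."""

    def __init__(self):
        self.platter = {}  # survives power loss
        self.cache = {}    # lost on power loss

    def write(self, addr, value):
        self.cache[addr] = value

    def flush(self):
        self.platter.update(self.cache)
        self.cache.clear()

    def power_loss(self):
        self.cache.clear()

    def read(self, addr):
        if addr in self.cache:
            return self.cache[addr]
        return self.platter.get(addr)


ROOT = "root"  # fixed address holding a pointer to the live data


def commit(disk, new_addr, data, crash_after=None):
    """Write data copy-on-write style, then switch the root pointer.

    crash_after (1 to 4) simulates power loss right after that step.
    """
    steps = [
        lambda: disk.write(new_addr, data),  # 1: new data, unused address
        disk.flush,                          # 2: data reaches the platter
        lambda: disk.write(ROOT, new_addr),  # 3: point the root at it
        disk.flush,                          # 4: root reaches the platter
    ]
    for number, step in enumerate(steps, start=1):
        step()
        if crash_after == number:
            disk.power_loss()
            return


def current_data(disk):
    """Follow the root pointer to whatever data it names."""
    return disk.read(disk.read(ROOT))
```

Wherever the simulated power loss lands, following the root pointer
yields either the complete old data or the complete new data.  If the
disk ignores the flush in step 2, the root could reach the platter
before the data it points to, which is the failure the assumption above
rules out.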
> 
> -- 
> Darren
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss