From a reporting perspective, yes, zpool status should not hang, and 
should report an error if a drive goes away, or is in any way behaving 
badly.  No arguments there.  From the data integrity perspective, the 
only event ZFS needs to know about is when a bad drive is replaced, such 
that a resilver is triggered.  If a drive is suddenly gone, but it is 
only one component of a redundant set, your data should still be fine.  
Now, if enough drives go away to break the redundancy, that's a 
different story altogether.
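
For reference, the recovery path when a disk does die is short.  Roughly
(assuming a pool named tank and the dead disk at c1t2d0 -- substitute your
own names):

    zpool status -x tank         # confirm the pool has noticed the failure
    zpool replace tank c1t2d0    # run after swapping the disk; starts the resilver
    zpool status tank            # watch resilver progress

If the replacement disk shows up under a different device name, pass it as
a second argument to zpool replace.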

Jon

Ross Smith wrote:
> I agree that device drivers should perform the bulk of the fault 
> monitoring, however I disagree that this absolves ZFS of any 
> responsibility for checking for errors.  The primary goal of ZFS is to 
> be a filesystem and maintain data integrity, and that entails both 
> reading and writing data to the devices.  It is no good having 
> checksumming when reading data if you are losing huge amounts of data 
> when a disk fails.
>  
> I'm not saying that ZFS should be monitoring disks and drivers to 
> ensure they are working, just that if ZFS attempts to write data and 
> doesn't get the response it's expecting, an error should be logged 
> against the device regardless of what the driver says.  If ZFS is 
> really about end-to-end data integrity, then you do need to consider 
> the possibility of a faulty driver.  Now I don't know what the root 
> cause of this error is, but I suspect it will be either a bad response 
> from the SATA driver, or something within ZFS that is not working 
> correctly.  Either way, however, I believe ZFS should have caught this.
>  
> It's similar to the iSCSI problem I posted a few months back where the 
> ZFS pool hangs for 3 minutes when a device is disconnected.  There's 
> absolutely no need for the entire pool to hang when the other half of 
> the mirror is working fine.  ZFS is often compared to hardware RAID 
> controllers, but so far its ability to handle problems falls short.
>  
> Ross
>  
>
> > Date: Wed, 30 Jul 2008 09:48:34 -0500
> > From: [EMAIL PROTECTED]
> > To: [EMAIL PROTECTED]
> > CC: zfs-discuss@opensolaris.org
> > Subject: Re: [zfs-discuss] Supermicro AOC-SAT2-MV8 hang when drive removed
> >
> > On Wed, 30 Jul 2008, Ross wrote:
> > >
> > > Imagine you had a raid-z array and pulled a drive as I'm doing here.
> > > Because ZFS isn't aware of the removal it keeps writing to that
> > > drive as if it's valid. That means ZFS still believes the array is
> > > online when in fact it should be degraded. If any other drive now
> > > fails, ZFS will consider the status degraded instead of faulted, and
> > > will continue writing data. The problem is, ZFS is writing some of
> > > that data to a drive which doesn't exist, meaning all that data will
> > > be lost on reboot.
> >
> > While I do believe that device drivers, or the fault system, should
> > notify ZFS when a device fails (and ZFS should appropriately react), I
> > don't think that ZFS should be responsible for fault monitoring. ZFS
> > is in a rather poor position for device fault monitoring, and if it
> > attempts to do so then it will be slow and may misbehave in other
> > ways. The software which communicates with the device (i.e. the
> > device driver) is in the best position to monitor the device.
> >
> > The primary goal of ZFS is to be able to correctly read data which was
> > successfully committed to disk. There are programming interfaces
> > (e.g. fsync(), msync()) which may be used to ensure that data is
> > committed to disk, and which should return an error if there is a
> > problem. If you were performing your tests over an NFS mount then the
> > results should be considerably different since NFS requests that its
> > data be committed to disk.
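> >
> > To make the point concrete, a sketch along these lines (untested, and
> > /tank/testfile is just a placeholder path on the suspect pool) should
> > surface a write failure at the fsync() call:
> >
> > #include <errno.h>
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <string.h>
> > #include <unistd.h>
> >
> > int main(void)
> > {
> >     int fd = open("/tank/testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
> >     if (fd < 0) {
> >         perror("open");
> >         return 1;
> >     }
> >
> >     const char buf[] = "test data";
> >     if (write(fd, buf, sizeof(buf)) < 0)
> >         perror("write");   /* may still "succeed" while data sits in cache */
> >
> >     /* fsync() forces the data to stable storage; a dead or
> >      * misbehaving device should show up here as an error (e.g. EIO). */
> >     if (fsync(fd) < 0)
> >         fprintf(stderr, "fsync failed: %s\n", strerror(errno));
> >     else
> >         printf("data committed to disk\n");
> >
> >     close(fd);
> >     return 0;
> > }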
> >
> > Bob
>

-- 


-     _____/     _____/      /           - Jonathan Loran -           -
-    /          /           /                IT Manager               -
-  _____  /   _____  /     /     Space Sciences Laboratory, UC Berkeley
-        /          /     /      (510) 643-5146 [EMAIL PROTECTED]
- ______/    ______/    ______/           AST:7731^29u18e3
                                 

