On Oct 22, 2010, at 10:40 AM, Miles Nordin wrote:

>>>>>> "re" == Richard Elling <richard.ell...@gmail.com> writes:
> 
>    re> The risk here is not really different than that faced by
>    re> normal disk drives which have nonvolatile buffers (eg
>    re> virtually all HDDs and some SSDs).  This is why applications
>    re> can send cache flush commands when they need to ensure the
>    re> data is on the media.
> 
> It's probably different because of the iSCSI target reboot problem
> I've written about before:
> 
> iSCSI initiator         iSCSI target       nonvolatile medium
> 
> write A   ------------>
>                   <-----  ack A    
> write B   ------------>
>                   <-----  ack B
>                                  ---------->    [A]
>                         [REBOOT]
> write C   ------------>
> [timeout!]
> reconnect ------------>
>                   <-----  ack Connected
> write C   ------------>
>                   <-----  ack C
> flush     ------------>
>                                  --------->     [C]
>                   <-----  ack Flush
> 
> in the above time chart, the initiator thinks A, B, and C are written,
> but in fact only A and C are written.  I regard this as a failing of
> imagination in the SCSI protocol, but probably, with a better
> understanding of the details than I have, the initiator could be made
> to provably work around the problem.  My guess has always been that no
> current initiators actually do, though.
> 
> I think it could happen also with a directly-attached SATA disk if you
> remove power from the disk without rebooting the host, so as Richard
> said it is not really different, except that in the real world it's
> much more common for an iSCSI target to lose power without the
> initiator's also losing power than it is for a disk to lose power
> without its host adapter losing power.  

I agree. I'd like to have some good field information on this, but I think it
is safe to assume that for the average small server, when the local disks
lose power the server also loses power, and the exposure to this issue
is lost in the general recovery of the server. For the geezers who remember
the pain of NetWare or Aegis, NFS was a breath of fresh air and led to
much more robust designs.  In that respect, iSCSI, or even FC, is a step
backwards, down the protocol stack.
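
One way to gauge how exposed a given LUN is to this window is to check, from
the initiator side, whether it advertises a volatile write cache at all. A
hedged example, assuming a Linux initiator with sdparm installed (a Solaris
initiator can do the same through format -e, under its cache menu); the device
name is illustrative:

        # query the Write Cache Enable (WCE) bit of the SCSI caching mode page
        sdparm --get=WCE /dev/sdc
        # clearing it asks the target to commit writes before acknowledging them
        sdparm --clear=WCE /dev/sdc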

> The ancient practice of unix
> filesystem design always considers cord-yanking as something happening
> to the entire machine, and failing disks are not the filesystem's
> responsibility to work around because how could it?  This assumption
> should have been changed and wasn't, when we entered the era of RAID
> and removable disks, where the connections to disks and disks
> themselves are both allowed to fail.  However, when NFS was designed,
> the assumption *WAS* changed, and indeed NFSv2 and earlier operated
> always with the write cache OFF to be safe from this, just as COMSTAR
> does in its (default?) abysmal-performance mode (so campuses bought
> Prestoserve cards (equivalent to a DDRDrive except much less silly
> because they have onboard batteries), or Auspex servers with included
> NVRAM, which are analogous outside the NFS world to NetApp/Hitachi/EMC
> FC/iSCSI targets which always have big NVRAMs so they can leave the
> write cache off), and NFSv3 has a commit protocol that is smart enough
> to replay the 'write B' which makes the nonvolatile caches less
> necessary (so long as you're not closing files frequently, I guess?).

With COMSTAR, you can implement a commit-to-media policy in at least
three ways:
        1. server side: disable the writeback cache, per LUN
        2. server side: change the sync policy to "always" for the zvol
        3. client side: clear the write cache enable (WCE) bit, per LUN

For choices 1 and 2, the ZIL and separate log come into play.
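
Roughly, and hedged -- assuming a COMSTAR build with the wcd LU property, a
ZFS build with the sync dataset property, and illustrative LU and dataset
names -- the three choices map to commands like these:

        # choice 1, on the target: disable the writeback cache for an existing LU
        stmfadm modify-lu -p wcd=true <LU-GUID>
        # choice 2, on the target: commit every write to the zvol through the
        # ZIL (and the separate log, if one is configured) before it is acked
        zfs set sync=always tank/iscsivol
        # choice 3, on the initiator: clear the LUN's write cache enable bit,
        # e.g. sdparm --clear=WCE on Linux, or format -e on Solaris
        sdparm --clear=WCE /dev/sdc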

> I think it would be smart to design more storage systems so NFS can
> replace the role of iSCSI, for disk access.  

I agree.

> In Isilon or Lustre
> clusters this trick is common when a node can settle with unshared
> access to a subtree: create an image file on the NFS/Lustre back-end
> and fill it with an ext3 or XFS, and writes to that inner filesystem
> become much faster because this Rube Goldberg arrangement discards the
> close-to-open consistency guarantee.  We might use it in the ZFS world
> for actual physical disk access instead of iSCSI, e.g., it should be
> possible to NFS-export a zvol and see a share with a single file in it
> named 'theTarget' or something, but this file would be without
> read-ahead.  Better yet, to accommodate VMware limitations, would be to
> export a single fake /zvol share containing all NFS-shared zvols, and
> as you export zvols their files appear within this share.  Also it
> should be possible to mount vdev elements over NFS without
> deadlocks---I know that is difficult, but VMware does it.  Perhaps it
> cannot be done through the existing NFS client, but obviously it can
> be done somehow, and it would both solve the iSCSI target reboot
> problem and allow using more kinds of proprietary storage
> backends---the same reasons VMware wants to give admins a choice
> apply to ZFS.  When NFS is used in this way the disk image file is
> never closed, so the NFS server will not need a slog to give good
> performance: the same job is accomplished by double-caching the
> uncommitted data on the client so it can be replayed if the time
> diagram above happens.
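
The image-file trick described above looks roughly like this on a Linux
client; a hedged sketch with illustrative paths and sizes:

        # create a sparse 100 GB image on the NFS/Lustre back-end
        truncate -s 100G /mnt/backend/images/node1.img
        # put a local filesystem inside it and mount it via loopback;
        # close-to-open consistency no longer applies to the inner filesystem
        mkfs.ext3 -F /mnt/backend/images/node1.img
        mount -o loop /mnt/backend/images/node1.img /srv/node1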

In the case of VMs, I particularly dislike the failure policies.  With NFS, by
default, it was simple -- if the client couldn't hear the server, the processes
blocked on I/O remained blocked.  Later, the "soft" option was added so that
they would eventually return failures, but that just let system administrators
introduce the same sort of failure mode as iSCSI. In the VM world, it seems
the hypervisors try to do crazy things like making the disks read-only, which
is perhaps the worst thing you can do to a guest OS, because now it needs
to be rebooted -- kinda like the old NetWare world that we so gladly left
behind.
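
For reference, the client behaviors above are just NFS mount options; a hedged
example, Linux mount syntax shown, with an illustrative server and path:

        # default "hard" semantics: blocked processes stay blocked until the
        # server answers (intr lets them be killed)
        mount -o hard,intr server:/export/images /mnt/images
        # "soft" semantics: I/O eventually errors back to the application,
        # reintroducing the iSCSI-style failure mode described above
        mount -o soft,timeo=30,retrans=3 server:/export/images /mnt/images
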
 -- richard

