>>>>> "re" == Richard Elling <richard.ell...@gmail.com> writes:

    re> The risk here is not really different that that faced by
    re> normal disk drives which have nonvolatile buffers (eg
    re> virtually all HDDs and some SSDs).  This is why applications
    re> can send cache flush commands when they need to ensure the
    re> data is on the media.

It's probably different because of the iSCSI target reboot problem
I've written about before:

iSCSI initiator         iSCSI target       nonvolatile medium

write A   ------------>
                   <-----  ack A    
write B   ------------>
                   <-----  ack B
                                  ---------->    [A]
                         [REBOOT]
write C   ------------>
[timeout!]
reconnect ------------>
                   <-----  ack Connected
write C   ------------>
                   <-----  ack C
flush     ------------>
                                  --------->     [C]
                   <-----  ack Flush

In the above time chart, the initiator thinks A, B, and C are written,
but in fact only A and C ever reach the medium: B was acknowledged out
of the target's volatile write cache and then lost in the reboot.  I
regard this as a failure of imagination in the SCSI protocol, but with
a better understanding of the details than I have, the initiator could
probably be made to work around the problem provably.  My guess has
always been that no current initiator actually does, though.
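
For what it's worth, the workaround I am imagining on the initiator
side would look roughly like the following.  This is a minimal sketch
only, assuming a target handle with write(), flush(), and reconnect()
operations (not any real initiator's API), and no claim that any
shipping initiator behaves this way:

# Hypothetical sketch: remember every acknowledged-but-unflushed write,
# and after a reconnect replay the whole set before trusting a flush,
# so a lost 'write B' gets re-sent along with everything else.

class ParanoidInitiator:
    def __init__(self, target):
        self.target = target      # assumed handle with write()/flush()/reconnect()
        self.unflushed = {}       # lba -> data: acked by the target, not yet flushed

    def write(self, lba, data):
        self.target.write(lba, data)   # may time out if the target rebooted
        self.unflushed[lba] = data     # keep a copy until a flush succeeds

    def flush(self):
        self.target.flush()
        self.unflushed.clear()         # only now is it safe to forget the copies

    def reconnect_and_replay(self):
        self.target.reconnect()
        for lba, data in self.unflushed.items():
            self.target.write(lba, data)   # re-send everything the old session acked
        self.flush()

The cost is holding every acknowledged write in RAM until a cache
flush completes, which is exactly the double-caching I come back to at
the end of this message.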

I think the same thing could happen with a directly-attached SATA disk
if you remove power from the disk without rebooting the host, so as
Richard said it is not really different, except that in the real world
it's much more common for an iSCSI target to lose power without the
initiator also losing power than it is for a disk to lose power
without its host adapter losing power.

The ancient practice of unix filesystem design always treats
cord-yanking as something that happens to the entire machine, and a
failing disk is not the filesystem's responsibility to work around,
because how could it?  That assumption should have been changed, and
wasn't, when we entered the era of RAID and removable disks, where the
connections to disks and the disks themselves are both allowed to
fail.  When NFS was designed, however, the assumption *WAS* changed:
NFSv2 and earlier always operated with the write cache off to be safe
from this, just as COMSTAR does in its (default?) abysmal-performance
mode.  That is why campuses bought Prestoserve cards (equivalent to a
DDRDrive, except much less silly because they have onboard batteries)
or Auspex servers with built-in NVRAM, which are analogous, outside
the NFS world, to NetApp/Hitachi/EMC FC/iSCSI targets that always have
big NVRAMs so they can leave the write cache off.  NFSv3 added a
commit protocol smart enough to replay the 'write B', which makes the
nonvolatile caches less necessary (so long as you're not closing files
frequently, I guess?).
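
Roughly, the client side of that NFSv3 commit protocol looks like the
sketch below.  The server.write()/server.commit() helpers are
stand-ins for the real WRITE and COMMIT RPCs, but the verifier logic
is the part that matters: if the verifier changes, the server rebooted
and the client must re-send whatever it has not yet committed.

# Sketch of NFSv3-style commit handling: UNSTABLE writes stay buffered
# on the client until a COMMIT succeeds, and a changed write verifier
# (the server's boot "incarnation") means acked data may have been
# lost, so the client re-sends it, replaying the lost 'write B'.

def write_then_commit(server, fh, writes):
    pending = list(writes)             # (offset, data) pairs not yet known stable

    while pending:
        verf = None
        restart = False
        for offset, data in pending:   # send (or re-send) everything as UNSTABLE
            v = server.write(fh, offset, data, stable="UNSTABLE").verifier
            if verf is not None and v != verf:
                restart = True         # server rebooted mid-stream; earlier acks suspect
                break
            verf = v
        if restart:
            continue                   # re-send the whole batch to the new incarnation

        if server.commit(fh).verifier == verf:
            pending.clear()            # committed by the same incarnation: data is stable
        # otherwise the server rebooted after acking; loop around and re-send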

I think it would be smart to design more storage systems so NFS can
replace the role of iSCSI for disk access.  In Isilon or Lustre
clusters this trick is common when a node can settle for unshared
access to a subtree: create an image file on the NFS/Lustre back end
and fill it with an ext3 or XFS filesystem, and writes to that inner
filesystem become much faster because this Rube Goldberg arrangement
discards the close-to-open consistency guarantee.

We might use the same trick in the ZFS world for actual physical disk
access instead of iSCSI.  For example, it should be possible to
NFS-export a zvol and see a share containing a single file named
'theTarget' or something, served without read-ahead.  Better yet, to
accommodate VMWare limitations, would be to export a single fake /zvol
share containing all NFS-shared zvols, so that as you export zvols
their files appear within that share.  It should also be possible to
mount vdev elements over NFS without deadlocks---I know that is
difficult, but VMWare does it.  Perhaps it cannot be done through the
existing NFS client, but obviously it can be done somehow, and it
would both solve the iSCSI target reboot problem and allow more kinds
of proprietary storage back end---the same reasons VMWare wants to
give admins a choice apply to ZFS.  When NFS is used this way the disk
image file is never closed, so the NFS server will not need a slog to
give good performance: the same job is accomplished by double-caching
the uncommitted data on the client so it can be replayed if the time
diagram above happens.
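
To make that last point concrete, the double cache for a never-closed
disk image could look something like this.  It is only a conceptual
sketch; image.pwrite() and image.commit() are placeholders, not a real
API, and the replay loop is the same idea as the NFSv3 sketch above,
just sitting underneath a block-device consumer instead of a file:

# Conceptual sketch: block writes go to the NFS-backed image file AND
# into an in-memory journal, and the journal is only dropped once a
# commit by the same server incarnation succeeds, so the whole batch
# can be replayed if the 'ack B then reboot' diagram above happens.

class ImageBackedDisk:
    BLOCK = 4096

    def __init__(self, image):
        self.image = image        # placeholder handle onto the open image file
        self.journal = {}         # block number -> data, not yet committed

    def write_block(self, blkno, data):
        self.image.pwrite(blkno * self.BLOCK, data)
        self.journal[blkno] = data     # second copy, kept until committed

    def flush(self):
        while not self.image.commit():             # False: the server restarted
            for blkno, data in self.journal.items():
                self.image.pwrite(blkno * self.BLOCK, data)   # replay the journal
        self.journal.clear()           # safe to forget the copies now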
