>>>>> "re" == Richard Elling <richard.ell...@gmail.com> writes:
re> The risk here is not really different than that faced by
re> normal disk drives which have nonvolatile buffers (eg
re> virtually all HDDs and some SSDs).  This is why applications
re> can send cache flush commands when they need to ensure the
re> data is on the media.

It's probably different because of the iSCSI target reboot problem I've
written about before:

  iSCSI initiator            iSCSI target         nonvolatile medium

  write A     ------------>
              <------ ack A
  write B     ------------>
              <------ ack B
                                         ---------->  [A]
                             [REBOOT]
  write C     ------------>
              [timeout!]
  reconnect   ------------>
              <------ ack Connected
  write C     ------------>
              <------ ack C
  flush       ------------>
                                         ---------->  [C]
              <------ ack Flush

In the above time chart, the initiator thinks A, B, and C are written,
but in fact only A and C are written.  I regard this as a failing of
imagination in the SCSI protocol, but probably, with a better
understanding of the details than I have, the initiator could be made
to provably work around the problem (one way of doing that is sketched
further down).  My guess has always been that no current initiators
actually do, though.

I think it could also happen with a directly-attached SATA disk if you
remove power from the disk without rebooting the host, so as Richard
said it is not really different, except that in the real world it's
much more common for an iSCSI target to lose power without the
initiator also losing power than it is for a disk to lose power without
its host adapter losing power.

The ancient practice of unix filesystem design always treats
cord-yanking as something that happens to the entire machine, and
failing disks are not the filesystem's responsibility to work around,
because how could it?  This assumption should have been changed, and
wasn't, when we entered the era of RAID and removable disks, where the
connections to disks and the disks themselves are both allowed to fail.
However, when NFS was designed, the assumption *WAS* changed.  NFSv2
and earlier always operated with the write cache OFF to be safe from
this, just as COMSTAR does in its (default?) abysmal-performance mode.
That is why campuses bought Prestoserve cards (equivalent to a DDRdrive
except much less silly because they have onboard batteries) or Auspex
servers with included NVRAM, which are analogous, outside the NFS
world, to NetApp/Hitachi/EMC FC/iSCSI targets that always have big
NVRAMs so they can leave the write cache off.  NFSv3 has a commit
protocol that is smart enough to replay the 'write B', which makes the
nonvolatile caches less necessary (so long as you're not closing files
frequently, I guess?).

I think it would be smart to design more storage systems so that NFS
can replace the role of iSCSI for disk access.  In Isilon or Lustre
clusters this trick is common when a node can settle for unshared
access to a subtree: create an image file on the NFS/Lustre back end,
fill it with an ext3 or XFS filesystem, and writes to that inner
filesystem become much faster because this Rube Goldberg arrangement
discards the close-to-open consistency guarantee.  We might use the
same trick in the ZFS world for actual physical disk access instead of
iSCSI; for example, it should be possible to NFS-export a zvol and see
a share with a single file in it named 'theTarget' or something, but
this file would be without read-ahead.  Better yet, to accommodate
VMware limitations, would be to export a single fake /zvol share
containing all NFS-shared zvols, so that as you export zvols their
files appear within this share.  Also, it should be possible to mount
vdev elements over NFS without deadlocks---I know that is difficult,
but VMware does it.
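Going back to the time chart for a moment, here is a minimal sketch of
the kind of initiator-side workaround I mean.  It's Python and purely
illustrative: the 'transport' object and its write()/flush()/reconnect()
methods are made up, not any real iSCSI initiator API.  The idea is
just to keep every acknowledged-but-unflushed write in RAM and replay
the whole buffer after a reconnect, so the 'write B' above cannot be
silently lost:

class ReplayingInitiator:
    """Keep every write in RAM until a cache flush is acknowledged, and
    replay the whole buffer after any reconnect, so a target reboot
    between 'ack B' and the flush cannot silently drop B."""

    def __init__(self, transport):
        self.transport = transport
        self.unflushed = []        # (lba, data) acked but not yet flushed

    def write(self, lba, data):
        self.unflushed.append((lba, data))
        self._send_with_retry(lba, data)

    def flush(self):
        while True:
            try:
                self.transport.flush()   # SYNCHRONIZE CACHE equivalent
                self.unflushed.clear()   # only now is it safe to forget
                return
            except ConnectionError:
                self._reconnect_and_replay()

    def _send_with_retry(self, lba, data):
        while True:
            try:
                self.transport.write(lba, data)
                return
            except ConnectionError:
                self._reconnect_and_replay()
                # the loop resends this write; duplicates are harmless
                # because rewriting an LBA with the same data is idempotent

    def _reconnect_and_replay(self):
        # The target may have rebooted and lost its volatile cache, so
        # every write acked since the last successful flush is resent.
        self.transport.reconnect()
        for lba, data in self.unflushed:
            self.transport.write(lba, data)

The only load-bearing rule in there is that the buffer is forgotten no
earlier than a successful flush; everything else is plumbing.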
Perhaps it cannot be done through the existing NFS client, but
obviously it can be done somehow, and it would both solve the iSCSI
target reboot problem and allow using more kinds of proprietary storage
back end---the same reasons VMware wants to give admins a choice apply
to ZFS.  When NFS is used this way the disk image file is never closed,
so the NFS server will not need a slog to give good performance: the
same job is accomplished by double-caching the uncommitted data on the
client so that it can be replayed if the time diagram above happens
(roughly the logic sketched below).
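For comparison, here is an equally hypothetical sketch (again Python,
with an imaginary 'server' object rather than any real NFS client API)
of the NFSv3-style behaviour I'm describing: writes go out UNSTABLE but
are also kept in client memory, and the write verifier the server
returns is used to detect a reboot and trigger a replay before a COMMIT
is trusted:

class CommitAwareClient:
    """Double-cache uncommitted data on the client, NFSv3-style: writes
    go out UNSTABLE but stay in client memory until a COMMIT whose
    verifier matches proves they reached stable storage."""

    def __init__(self, server):
        self.server = server
        self.pending = []        # (offset, data) sent UNSTABLE, uncommitted
        self.last_verf = None    # write verifier seen on the last reply

    def write(self, offset, data):
        reply = self.server.write_unstable(offset, data)
        if self.last_verf is not None and reply.verf != self.last_verf:
            # The server rebooted mid-stream: earlier unstable writes
            # ('write B' in the time chart) may be gone, so resend them.
            self._replay()
        self.last_verf = reply.verf
        self.pending.append((offset, data))

    def commit(self):
        while True:
            reply = self.server.commit()
            if reply.verf == self.last_verf:
                self.pending.clear()   # on stable storage; safe to forget
                return
            # Verifier changed: the server lost its cache, so replay and retry.
            self.last_verf = reply.verf
            self._replay()

    def _replay(self):
        for offset, data in self.pending:
            self.server.write_unstable(offset, data)

That pending list is exactly the double-caching I mean: the client
spends some memory to survive the server forgetting its volatile cache,
which is why the server-side slog stops mattering for this workload.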