>>>>> "gm" == Greg Mason <gma...@msu.edu> writes:
>>>>> "g" == Gary Mills <mi...@cc.umanitoba.ca> writes:

    gm> I know disabling the ZIL is an Extremely Bad Idea,

but maybe you don't care about trashed Thunderbird databases.  You
just don't want to lose the whole pool to ``status: The pool metadata
is corrupted and cannot be opened. / action: Destroy the pool and
restore from backup.''  I've no answer for that---maybe someone else?

The known problem with ZIL disabling, AIUI, is that it breaks the
statelessness of NFS.  If the server reboots and the NFS clients do
not, then assumptions on which the NFS protocol is built could be
broken, and files could get corrupted.

Behind this dire warning is an expectation I'm not sure everyone
shares: if the NFS server reboots, and the clients do not, then
(modulo bugs) no data is lost---once the clients unfreeze, it's like
nothing ever happened.  I don't think other file-sharing protocols
like SMB or AFP attempt to keep that promise, so maybe people are
being warned about something most users assumed would happen anyway.

Will disabling the ZIL make NFS corrupt files worse than SMB or AFP
would when the server reboots?  Not sure.  At least SMB or AFP
_should_ return an error to userland when the server reboots, sort of
like NFS 'hard,intr' when you press ^C, so applications using SQLite
or Berkeley DB or whatever can catch that error and perform their own
user-level recovery, and if they call fsync() and get success they
can trust it absolutely, no matter whether the server or the client
reboots.  The ZIL-less NFS problems would probably be more silent,
more analogous to the ZFS-over-iSCSI problems except one layer higher
in the stack: programs think they've written to these .db files when
they haven't, and blindly scribble on, never learning that a batch of
writes in the past was silently discarded.  In practice everyone
always says to run FileMaker or Mail.app or Thunderbird or anything
with database files on ``a local disk'' only, so I suspect the SMB
and AFP error paths are not working right either, and the actual
expectation is very low.
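
To make that fsync() contract concrete, here is a minimal sketch of
the pattern a database library follows (the path and record are made
up for illustration); the comment marks the step where a ZIL-less NFS
server can break the contract silently:

  /* Sketch: write a record, then treat it as durable only after
   * fsync() succeeds.  The path and payload are invented. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          const char buf[] = "committed transaction record\n";
          int fd = open("/net/server/export/app.db-journal",
                        O_WRONLY | O_CREAT | O_APPEND, 0644);

          if (fd == -1) {
                  perror("open");
                  return 1;
          }
          if (write(fd, buf, sizeof (buf) - 1) == -1) {
                  perror("write");
                  return 1;
          }
          /*
           * If fsync() returns 0 the application assumes the record is
           * on stable storage and moves on.  Over NFS with the ZIL
           * disabled on the server, that success can be a lie: a server
           * reboot afterwards can discard the data, and nothing ever
           * reports the loss back to the program.
           */
          if (fsync(fd) == -1) {
                  perror("fsync");  /* user-level recovery starts here */
                  return 1;
          }
          (void) close(fd);
          return 0;
  }

On a local pool the ZIL is what makes that fsync() return value
trustworthy across a crash; that is the guarantee being traded away.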

     g> Consider a file server running ZFS that exports a volume with
     g> Iscsi.  Consider also an application server that imports the
     g> LUN with Iscsi and runs a ZFS filesystem on that LUN.

I was pretty sure there was a bug for the iscsitadm target ignoring
SYNCHRONIZE_CACHE, but I cannot find the bug number now and may be
wrong.
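
For concreteness, the flush being discussed reaches the device as
something like the sketch below.  This is not ZFS source, just a
userland illustration for Solaris: the device path is invented, and
my understanding is that the sd driver turns this ioctl into a SCSI
SYNCHRONIZE CACHE on its way to the target.

  /* Sketch: ask the driver to flush the device's volatile write
   * cache.  Device path is made up; needs privileges to run. */
  #include <sys/dkio.h>
  #include <stropts.h>
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/dev/rdsk/c2t1d0s0", O_RDWR);

          if (fd == -1) {
                  perror("open");
                  return 1;
          }
          /*
           * A NULL third argument asks for a synchronous flush.  A
           * target that ignores the resulting SYNCHRONIZE CACHE can
           * still report success here, which is exactly the failure
           * mode worried about above.
           */
          if (ioctl(fd, DKIOCFLUSHWRITECACHE, NULL) == -1)
                  perror("DKIOCFLUSHWRITECACHE");
          (void) close(fd);
          return 0;
  }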

Also, there is a separate problem with remote storage and filesystems
that depend heavily on SYNCHRONIZE CACHE.  Even setting aside the bug
I can't find, remote storage adds a failure case.  Normally you have
three main cases to handle:

  SYNCHRONIZE CACHE returns success after some delay

  SYNCHRONIZE CACHE never returns because someone yanked the
  cord---the whole system goes down.  You deal with it at boot, when
  mounting the filesystem.

  SYNCHRONIZE CACHE never returns because a drive went bad.

iSCSI adds a fourth:

  SYNCHRONIZE CACHE returns success
  SYNCHRONIZE CACHE returns success
  SYNCHRONIZE CACHE returns failure
  SYNCHRONIZE CACHE returns success

I think ZFS probably does not understand this case: a flush fails
transiently and later flushes succeed again, even though every write
acknowledged since the last successful flush may be gone.  The others
are easier, because either you have enough raidz/mirror redundancy,
or else you are allowed to handle the ``returns failure'' case by
implicitly unmounting the filesystem and killing everything that held
an open file.
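
Here is a sketch of what handling that fourth case would require.
Every name in it is invented (it is not ZFS code); the point is only
what a correct caller has to do when a flush fails and the device
then comes back:

  /* Sketch only: all names are invented, none of this is ZFS code.
   * It shows what a caller of a cache-flush primitive has to do when
   * the flush fails transiently and the device then comes back. */
  #include <stdio.h>

  struct dev {
          long last_good_flush;   /* marker for the last flush known good */
  };

  /* Stand-ins for the real I/O paths. */
  static int  issue_sync_cache(struct dev *d)     { (void)d; return -1; }
  static int  replay_writes_since(struct dev *d, long mark)
                                        { (void)d; (void)mark; return 0; }
  static void take_device_offline(struct dev *d)  { (void)d; }

  static int flush_write_barrier(struct dev *d)
  {
          if (issue_sync_cache(d) == 0)
                  return 0;       /* case 1: success after some delay */

          /*
           * Cases 2 and 3 (whole machine down, drive dead) either never
           * reach this point or stay failed forever.  The iSCSI case
           * reaches it and the *next* flush will succeed again, but
           * every write acknowledged since the last successful flush
           * may be gone.  The safe reactions are to replay those writes
           * from something the initiator still holds, or to stop using
           * the device; quietly retrying the flush and carrying on
           * would hide the loss.
           */
          if (replay_writes_since(d, d->last_good_flush) != 0)
                  take_device_offline(d);
          return -1;
  }

  int main(void)
  {
          struct dev d = { 0 };

          printf("barrier: %d\n", flush_write_barrier(&d));
          return 0;
  }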

NFS works around this with the COMMIT op and client-driven replay in
v3, or by making everything synchronous in v2.  iSCSI is _not_
v2-like because, even if there is no write caching in the initiator
or the target software (there probably ought to be), the underlying
physical disk inside the target still has a write cache, and the
entire target chassis can reboot and lose the contents of that cache.
And I suspect iSCSI is not using NFS-v3-like workarounds right now.
I think this hole is probably still open.
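
For contrast, here is roughly the shape of the NFSv3 workaround as
seen from the client (a sketch with made-up names, not the real
client code).  UNSTABLE WRITE replies and COMMIT replies both carry a
write verifier that changes when the server reboots; the client may
not drop its buffered copy of a write until a COMMIT succeeds under
the same verifier, and a changed verifier means the client resends
the data itself:

  /* Sketch of the NFSv3 client-side replay idea; all names are made
   * up.  The client keeps UNSTABLE writes buffered until a COMMIT
   * comes back under the same write verifier the WRITEs carried. */
  #include <stdint.h>
  #include <stdio.h>

  #define MAX_PENDING 128

  struct pending_write {
          uint64_t offset;
          char     data[512];
  };

  struct nfs_file {
          uint64_t             verf;      /* verifier from WRITE replies */
          int                  npending;
          struct pending_write pending[MAX_PENDING];
  };

  /* Stand-ins for the RPCs. */
  static uint64_t commit_rpc(struct nfs_file *f)     { return f->verf; }
  static void     write_rpc(struct pending_write *w) { (void)w; }

  static void nfs_commit(struct nfs_file *f)
  {
          uint64_t server_verf = commit_rpc(f);

          if (server_verf != f->verf) {
                  /*
                   * Server rebooted since the writes were sent: the
                   * data it acked as UNSTABLE may be gone, so the
                   * client replays it from its own buffers.
                   */
                  for (int i = 0; i < f->npending; i++)
                          write_rpc(&f->pending[i]);
                  f->verf = server_verf;
                  /* a second COMMIT would follow here */
                  return;
          }
          /* Verifier matched: data is on stable storage; drop it. */
          f->npending = 0;
  }

  int main(void)
  {
          struct nfs_file f = { .verf = 42, .npending = 0 };

          nfs_commit(&f);
          printf("pending after commit: %d\n", f.npending);
          return 0;
  }

iSCSI has nothing playing the role of that verifier today, as far as
I know, which is why the hole stays open when the target chassis
reboots.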
