>>>>> "ja" == James Andrewartha <jam...@daa.com.au> writes:

    ja> other people are arguing that POSIX says rename(2) is atomic,

Their statement is true, but it's NOT an argument against Ts'o, who is
100% right: applications that rely on that calling sequence for crash
consistency are not portable under POSIX.

Atomicity has nothing to do with crash consistency.

It's about the view of the filesystem by other processes on the same
system, e.g., the security vulnerabilities you can get with setuid
binaries that work in /tmp, if those binaries don't take advantage of
certain atomicity guarantees to avoid race conditions.  Obviously
/tmp has zero to do with what the filesystem looks like after a crash:
it always looks _empty_.
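
To be concrete about the kind of guarantee I mean, a minimal sketch
(made-up filename; mkstemp(3) is the idiomatic wrapper for the same
trick):

  /* O_CREAT|O_EXCL is atomic with respect to other processes: either
   * we create the file or we fail with EEXIST, with no window for an
   * attacker to slip a symlink under our feet.  Nothing here is about
   * crash consistency. */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("/tmp/myapp.scratch", O_CREAT | O_EXCL | O_WRONLY, 0600);
      if (fd < 0) {              /* EEXIST: someone (or something) beat us */
          perror("open");
          return 1;
      }
      if (write(fd, "scratch data\n", 13) != 13)
          perror("write");
      close(fd);
      unlink("/tmp/myapp.scratch");
      return 0;
  }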

For ext4 the argument is settled: fix the app.  But a more productive
way to approach the problem would be to look at the tradeoffs between
performance and crash consistency.  Maybe we need fbarrier() (which
could return faster; it sounds like on ZFS it could be a no-op)
instead of fsync(), or maybe something more, something genuinely
post-Unix like limited filesystem transactions that can open, commit,
and roll back.  It's hard for a generation that grew up under POSIX
to think outside it.
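
Concretely, the contested calling sequence, and where a hypothetical
fbarrier() would slot in (sketch only; fbarrier() doesn't exist
anywhere):

  /* The contested crash-consistency idiom: write a new copy, fsync,
   * rename over the old one.  POSIX only promises the rename is
   * atomic with respect to other processes; it does NOT promise the
   * data writes are ordered before the rename on disk, hence the
   * fsync(). */
  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  int replace_file(const char *path, const char *tmp,
                   const char *buf, size_t len)
  {
      int fd = open(tmp, O_CREAT | O_EXCL | O_WRONLY, 0644);
      if (fd < 0)
          return -1;
      if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }

      /* portable: wait for the data to reach stable storage.  A
       * hypothetical fbarrier(fd), meaning "order my writes before
       * the rename, but return immediately", could go here instead,
       * and on ZFS might be a no-op. */
      if (fsync(fd) < 0) { close(fd); return -1; }
      close(fd);

      return rename(tmp, path);  /* old contents or new, never neither */
  }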

A hypothetical new API ought to help balance performance and
consistency for networked filesystems too, like NFS or Lustre/OCFS/...
For example, networked filesystems often promise close-to-open
consistency, and that promise doesn't necessarily have anything to do
with crashing.  It means:

  client A             client B
   write
   close
   sendmsg  -------->   poll
                        open
                        read    (will see all A's writes)


  client A             client B
   write
   wait a while
   sendmsg  ------->    poll
                        read    (all bets are off)
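
Or, in syscall terms (just a sketch; error handling is elided, and
notify_b()/wait_for_a() are made-up stubs for the sendmsg/poll
out-of-band signal):

  #include <fcntl.h>
  #include <unistd.h>

  static void notify_b(void)   { /* e.g. sendmsg() to B */ }
  static void wait_for_a(void) { /* e.g. poll() on a socket */ }

  void client_a(const char *path, const char *buf, size_t len)
  {
      int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
      write(fd, buf, len);
      close(fd);     /* close-to-open: dirty pages reach the server HERE */
      notify_b();
  }

  void client_b(const char *path, char *buf, size_t len)
  {
      wait_for_a();
      int fd = open(path, O_RDONLY);  /* open AFTER A's close: revalidates */
      read(fd, buf, len);             /* sees all of A's writes */
      close(fd);
      /* had B kept the file open across A's writes instead, the
       * second diagram applies: all bets are off */
  }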

This could stand obvious improvement in two ways.  First, if I'm
trying to send data to B using the filesystem

                 (monkey chorus: don't do that!  it won't work!  you
                  have to send data between nodes with
libgnetdatasender and its associated avahi-using
                  setuid-nobody daemon!  just check it out of svn.  no
                  it doesn't support IPv6 but the NEXT VERSION, what,
                  1000 nodes? well then you definitely don't want
                  to---

                  DOWN, monkeychorus!  If I feel like writing in
                  Python or Javurscript or even PHP, let me.  If I
                  feel like sending data through a filesystem, find a
way to let me!  Why the hell not do it?  I said
                  post-POSIX.)

send USING THE FILESYSTEM, then maybe I don't want to close the file
all the time, because that's slow or just annoying.  Is there some
dance I can do using locks on B or A to say, ``I need B to see the
data, but I do not necessarily need, nor want to wait for, it to be
committed to disk; I just want it consistent on all clients''?  Like,
suppose I keep the file open on A and B at the same time over NFS.
Will taking a write lock on A and a read lock on B actually flush the
client's cache and get the data moved from A to B faster?
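
Here's the dance I mean, sketched with fcntl() byte-range locks.  At
least on Linux NFS clients the documented behavior is the one I want
(taking a lock revalidates the local cache, dropping one flushes
dirty data to the server); whether other clients and filesystems keep
the same promise is exactly the question:

  /* Lock-mediated cache coherence over NFS, file held open on both
   * sides, no fsync-to-disk anywhere.  This relies on the (Linux NFS,
   * at least) convention that acquiring a lock revalidates cached
   * pages and releasing one flushes dirty pages to the server. */
  #include <fcntl.h>
  #include <unistd.h>

  /* writer on client A: data goes to the server when the lock drops */
  void publish(int fd, const char *buf, size_t len)
  {
      struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
      fcntl(fd, F_SETLKW, &fl);   /* lock the whole file (l_len = 0) */
      pwrite(fd, buf, len, 0);
      fl.l_type = F_UNLCK;
      fcntl(fd, F_SETLK, &fl);    /* unlock => flush to server */
  }

  /* reader on client B: cache revalidated when the lock is taken */
  void consume(int fd, char *buf, size_t len)
  {
      struct flock fl = { .l_type = F_RDLCK, .l_whence = SEEK_SET };
      fcntl(fd, F_SETLKW, &fl);   /* lock => revalidate cached pages */
      pread(fd, buf, len, 0);
      fl.l_type = F_UNLCK;
      fcntl(fd, F_SETLK, &fl);
  }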

Second, we've discussed before that NFSv3's write-write-write-commit
batching doesn't work across close/open, so people need slogs to make
their servers fast at writing thousands of tiny files, while for
mounting VM disk images over NFS the slog might not be so badly
needed.  Even with the slog, the tiny-files scenario is still slowed
down by network round trips.  If we had a transaction API, we could
open a transaction, write 1000 files, then close it.  On a high-RTT
network this could be orders of magnitude faster than what we have
now.  But it's hard to imagine a transactional API that doesn't break
the good things about POSIX style, like ``relatively simple'',
``apparently-stateless NFS client-server sessions'', ``advisory
locking only'', ...
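
Purely for illustration, such an API might look like the sketch
below; every txn_* call is invented here, nothing like it exists:

  #include <fcntl.h>
  #include <stdio.h>
  #include <unistd.h>

  /* hypothetical, invented calls: an all-or-nothing namespace+data
   * transaction rooted at a directory */
  int txn_begin(const char *dir);
  int openat_txn(int txn, const char *name, int flags, mode_t mode);
  int txn_commit(int txn);     /* one round trip; everything lands */
  int txn_rollback(int txn);   /* nothing lands */

  void write_thousand_tiny_files(void)
  {
      int txn = txn_begin("/nfs/export/outdir");
      for (int i = 0; i < 1000; i++) {
          char name[32];
          snprintf(name, sizeof name, "file-%04d", i);
          int fd = openat_txn(txn, name, O_CREAT | O_WRONLY, 0644);
          write(fd, "tiny\n", 5);   /* no per-file round trip, no fsync */
          close(fd);
      }
      if (txn_commit(txn) < 0)      /* either all 1000 files appear... */
          txn_rollback(txn);        /* ...or none do */
  }

The batching is the whole point: the 1000 creates and writes stream
without waiting on the network, and you pay for consistency exactly
twice, at begin and at commit.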
