>>>>> "ja" == James Andrewartha <jam...@daa.com.au> writes:

    ja> other people are arguing that POSIX says rename(2) is atomic,

Their statement is true, but it's NOT an argument against Ts'o, who is
100% right: applications that rely on that calling sequence for crash
consistency are not portable under POSIX. Atomicity has nothing to do
with crash consistency. It's about the view of the filesystem seen by
other processes on the same system, e.g., the security vulnerabilities
one can have with setuid binaries that work in /tmp if said binaries
don't take advantage of certain atomicity guarantees to avoid race
conditions. Obviously /tmp has zero to do with what the filesystem
looks like after a crash: it always looks _empty_.

For ext4 the argument is settled: fix the app. But a more productive
way to approach the problem would be to look at tradeoffs between
performance and crash consistency. Maybe we need fbarrier() (which
could return faster---it sounds like on ZFS it could be a no-op)
instead of fsync(), or maybe something more, something genuinely
post-Unix like limited filesystem transactions that can open, commit,
and roll back. It's hard for a generation that grew up under POSIX to
think outside it.

A hypothetical new API ought to help balance performance/consistency
for networked filesystems, too, like NFS or Lustre/OCFS/... For
example, networked filesystems often promise close-to-open
consistency, and the promise doesn't necessarily have to do with
crashing. It means,

   client A              client B

   write
   close
   sendmsg  -------->    poll
                         open
                         read  (will see all A's writes)


   client A              client B

   write
   wait a while
   sendmsg  -------->    poll
                         read  (all bets are off)

This could stand improvement in two obvious ways.

First, if I'm trying to send data to B using the filesystem (monkey
chorus: don't do that! it won't work! you have to send data between
nodes with libgnetdatasender and its associated avahi-using
setuid-nobody daemon! just check it out of svn. no it doesn't support
IPv6 but the NEXT VERSION, what, 1000 nodes? well then you definitely
don't want to--- DOWN, monkey chorus! If I feel like writing in Python
or Javurscript or even PHP, let me. If I feel like sending data
through a filesystem, find a way to let me! why the hell not do it? I
said post-POSIX.) send USING THE FILESYSTEM, then maybe I don't want
to close the file all the time, because that's slow or just annoying.
Is there some dance I can do using locks on B or A to say, ``I need B
to see the data, but I do not necessarily need, nor want to wait, for
it to be committed to disk---I just want it consistent on all
clients''? Like, suppose I keep the file open on A and B at the same
time over NFS. Will taking a write lock on A and a read lock on B
actually flush the client's cache and get the information moved from A
to B faster?

Second, we've discussed before that NFSv3 write-write-write-commit
batching doesn't work across close/open, so people need slogs to make
their servers fast at writing thousands of tiny files, while for
mounting VM disk images over NFS the slog might not be so badly
needed. Even with the slog, the tiny-files scenario would be slowed
down by network round trips. If we had a transaction API, we could
open a transaction, write 1000 files, then close it. On a high-RTT
network this could be many orders of magnitude faster than what we
have now.

But it's hard to imagine a transactional API that doesn't break the
good things about the POSIX style, like ``relatively simple'',
``apparently-stateless NFS client-server sessions'', ``advisory
locking only'', ...
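
To make ``fix the app'' concrete, here is a rough sketch of what the fixed sequence usually looks like. I'm assuming ``that calling sequence'' means write-a-temp-file-then-rename(2) with no fsync() in between. The helper name atomic_write() is made up for illustration, and the final fsync() on the directory is common practice on Linux-ish systems rather than something the discussion above (or POSIX) spells out.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace `path` with `data` (bytes) using temp file + fsync + rename.

    Hypothetical helper, POSIX-only. The intent is that after a crash the
    file holds either the complete old contents or the complete new ones.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the target directory so that rename(2)
    # never has to cross a filesystem boundary.
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            # The step the broken apps skip: push the data to stable
            # storage BEFORE the rename makes it visible under `path`.
            os.fsync(f.fileno())
        # Atomic with respect to other processes on the same system.
        os.rename(tmp, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # Assumption: fsync the containing directory so the rename itself is
    # durable. Filesystems differ on whether this is needed.
    dfd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Calling atomic_write("settings.conf", b"key=value\n") twice in a row leaves one file and no stray temp files. The cost is one or two synchronous flushes per file, which is exactly the performance problem a cheaper fbarrier()-style call would be meant to address.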
_______________________________________________ zfs-discuss mailing list zfs-discuss@opensolaris.org http://mail.opensolaris.org/mailman/listinfo/zfs-discuss