file system journals may support a variety of availability models, ranging from simple fast recovery (return to consistency) with possible data loss, to models that attempt to provide synchronous write semantics with no data loss on failure, along with fast recovery
the simpler models use a persistent caching scheme for file system meta-data that can be used to limit the possible sources of file system corruption, avoiding a complete fsck run after a failure ... the journal specifies the only possible sources of corruption, allowing a quick check-and-recover mechanism ... here the journal is always written with (at least) the meta-data changes before the actual updated meta-data in question is written over its old location on disk ... after a failure, the journal indicates what meta-data must be checked for consistency

more elaborate models may cache both data and meta-data, to support limited data loss, synchronous writes and fast recovery ... newer file systems often let you choose among these features

since ZFS never updates any data or meta-data in place (anything written into a pool is always written to a new, unused location), it does not have the same consistency issues that traditional file systems have to deal with ... a ZFS pool is always in a consistent state, moving from an old state to a new state only after the new state has been completely committed to persistent store ... the final update to a new state depends on a single atomic write that either succeeds (moving the system to a consistent new state) or fails, leaving the system in its current consistent state ... there can be no interim inconsistent state

a ZFS pool builds its new state information in host memory for some period of time (about 5 seconds), as host IOs are generated by various applications ... at the end of this period these buffers are written to fresh locations on persistent store as described above, meaning that application writes are treated asynchronously by default, and in the face of a failure, some amount of information that has been accumulating in host memory can be lost

if an application requires synchronous writes and a guarantee of no data loss, then ZFS must somehow get the written information to persistent store before the application write call returns ... this is where the intent log comes in ... the system call information (including the data) involved in a synchronous write operation is written to the intent log on persistent store before the application write call returns ... but the information is also written into the host memory buffer scheduled for the 5 sec updates (just as if it were an asynchronous write) ... at the end of the 5 sec update time the new host buffers are written to disk, and, once committed, the intent log information written to the ZIL is no longer needed and can be jettisoned (so the ZIL never needs to be very large)

if the system fails, the accumulated but not flushed host buffer information will be lost, but the ZIL records will already be on disk for any synchronous writes and can be replayed when the host comes back up, or the pool is imported by some other living host ... the pool, of course, always comes up in a consistent state, but any ZIL records can be incorporated into a new consistent state before the pool is fully imported for use

the ZIL is always there in host memory, even when no synchronous writes are being done, since the POSIX fsync() call could be made on an open write channel at any time, requiring all to-date writes on that channel to be committed to persistent store before it returns to the application ... it's cheaper to write the ZIL at this point than to force the entire 5 sec buffer out prematurely
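to make the application side of that concrete, here's a minimal sketch (the path and payload are made up for illustration) of what a synchronous write looks like through the POSIX interface ... the write() by itself just lands in the host memory buffers waiting for the next 5 sec update, and it's the fsync() that forces ZFS to commit the outstanding records for that file to the ZIL on persistent store before returning ... opening the file with O_DSYNC instead would give the same no-data-loss guarantee on every write() without the explicit fsync():

/* sync_write.c -- minimal sketch of a synchronous write as an application
 * sees it; the path and payload are hypothetical.  ZFS must get the data
 * to persistent store (via the ZIL) before the fsync() below returns;
 * without it, the data would only be sitting in the in-memory buffers
 * that are flushed every ~5 seconds.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tank/demo/record.dat";   /* hypothetical path */
    const char buf[] = "critical record that must survive a crash\n";

    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    if (write(fd, buf, sizeof(buf) - 1) < 0) {    /* async by default */
        perror("write");
        return 1;
    }

    if (fsync(fd) < 0) {                          /* forces a ZIL commit */
        perror("fsync");
        return 1;
    }

    close(fd);
    return 0;
}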
synchronous writes can clearly have a significant negative performance impact in ZFS (or any other system) by forcing writes to disk before having a chance to do more efficient, aggregated writes (the 5 second type), but the ZIL solution in ZFS provides a good trade-off, with a lot of room to choose among various levels of performance and potential data loss ... this is especially true with the recent addition of separate ZIL device specification ... a small, fast (nvram type) device can be designated for ZIL use, leaving the slower spindle disks for the rest of the pool (a rough sketch of what that cost difference can look like from an application is appended below)

hope this helps ...

Bill
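appendix ... a rough, hand-rolled comparison (not a real benchmark) of plain buffered appends vs. an fsync() after every write, just to show the kind of gap the ZIL, and a separate log device, is there to narrow ... the file names and record count are arbitrary and the absolute numbers will vary a lot with the pool layout:

/* fsync_cost.c -- time NRECS small appends without fsync, then the same
 * appends with an fsync() after every write.  The gap between the two is
 * the cost of forcing data to persistent store per write instead of
 * letting it ride the aggregated 5 sec updates.
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define NRECS 1000

static double now_sec(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1e6;
}

static double run(const char *path, int do_fsync)
{
    char rec[512];
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    memset(rec, 'x', sizeof(rec));

    double t0 = now_sec();
    for (int i = 0; i < NRECS; i++) {
        if (write(fd, rec, sizeof(rec)) < 0) {
            perror("write");
            exit(1);
        }
        if (do_fsync && fsync(fd) < 0) {          /* per-write ZIL commit */
            perror("fsync");
            exit(1);
        }
    }
    double t1 = now_sec();
    close(fd);
    return t1 - t0;
}

int main(void)
{
    printf("async (no fsync)       : %.3f sec\n", run("async.dat", 0));
    printf("sync (fsync per write) : %.3f sec\n", run("sync.dat", 1));
    return 0;
}

on a pool with a separate nvram/ssd log device the fsync-per-write case should close a good part of that gap, since only the ZIL commits land on the fast device while the bulk data still goes out to the spindle disks in the aggregated 5 sec updates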