file system journals may support a variety of availability models, ranging from
simple support for fast recovery (return to consistency) with possible data 
loss, to those that attempt to support synchronous write semantics with no data 
loss on failure, along with fast recovery

the simpler models use a persistent caching scheme for file system meta-data
that can be used to limit the possible sources of file system corruption,
avoiding a complete fsck run after a failure ... the journal specifies the only
possible sources of corruption, allowing a quick check-and-recover mechanism
... here the journal is always written with (at least) the meta-data changes
before the updated meta-data itself is overwritten in its old location on
disk ... after a failure, the journal indicates which meta-data must be
checked for consistency
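
roughly, in code, that write-ahead ordering looks like the sketch below (the
record layout, file descriptors and helper name are made up for illustration,
not taken from any real journaling file system):

#include <fcntl.h>
#include <unistd.h>

struct meta_update {
    off_t where;          /* on-disk location of the meta-data being changed */
    char  newval[64];     /* the new meta-data contents */
};

/* hypothetical helper: journal first, then update in place */
int journal_then_update(int journal_fd, int meta_fd, const struct meta_update *u)
{
    /* 1. record the intended change in the journal ... */
    if (write(journal_fd, u, sizeof(*u)) != (ssize_t)sizeof(*u))
        return -1;
    /* ... and make sure it is on persistent store */
    if (fsync(journal_fd) != 0)
        return -1;

    /* 2. only now overwrite the meta-data in its old location */
    if (pwrite(meta_fd, u->newval, sizeof(u->newval), u->where) < 0)
        return -1;
    if (fsync(meta_fd) != 0)
        return -1;

    /* 3. after a crash, recovery only has to check (or replay) the locations
     *    named in the journal, not the whole file system */
    return 0;
}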

more elaborate models may cache both data and meta-data, to support 
limited data loss, synchronous writes and fast recovery ... newer file systems
often let you choose among these features

since ZFS never updates any data or meta-data in place (anything written into a
pool is always written to a new, unused location), it does not have the same
consistency issues that traditional file systems have to deal with ... a ZFS
pool is always in a consistent state, moving an old state to a new state only
after the new state has been completely committed to persistent store ...
the final update to a new state depends on a single atomic write that either
succeeds (moving the system to a consistent new state) or fails, leaving the
system in its current consistent state ... there can be no interim inconsistent
state
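
in code, that commit sequence can be sketched like this (root_ptr stands in
for the ZFS uberblock, but the layout and helper are invented for the example,
not the real on-disk format):

#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

struct root_ptr {                    /* stands in for the ZFS uberblock */
    uint64_t txg;                    /* transaction group number */
    uint64_t root_block;             /* points at the new block tree */
    char     pad[512 - 2 * sizeof(uint64_t)];  /* one sector, so the write is atomic */
};

int commit_new_state(int dev_fd, const void *new_blocks, size_t len,
                     off_t free_space, struct root_ptr *rp, off_t rp_offset)
{
    /* 1. write the whole new state into unused space; the old state is untouched */
    if (pwrite(dev_fd, new_blocks, len, free_space) < 0 || fsync(dev_fd) != 0)
        return -1;

    /* 2. one sector-sized write flips the root pointer: it either lands
     *    (new consistent state) or it doesn't (old consistent state) */
    rp->txg += 1;
    rp->root_block = (uint64_t)free_space;
    if (pwrite(dev_fd, rp, sizeof(*rp), rp_offset) != (ssize_t)sizeof(*rp))
        return -1;
    return fsync(dev_fd);
}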

a ZFS pool builds its new state information in host memory for some period of
time (about 5 seconds), as host IOs are generated by various applications ...
at the end of this period these buffers are written to fresh locations on 
persistent store as described above, meaning that application writes are
treated asynchronously by default, and in the face of a failure, some amount of
information that has been accumulating in host memory can be lost
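
from the application's side, "asynchronous by default" just means an ordinary
write() returns as soon as the data is in that in-memory batch ... for example
(the path here is made up):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tank/example/log.txt", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0)
        return 1;

    const char msg[] = "buffered in host memory\n";
    /* write() returns once the data is in the in-memory transaction group;
     * nothing here waits for it to reach the disks, so a failure in the
     * next few seconds can lose it */
    if (write(fd, msg, sizeof(msg) - 1) < 0)
        return 1;

    close(fd);    /* close() does not imply a flush either */
    return 0;
}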

if an application requires synchronous writes and a guarantee of no data loss,
then ZFS must somehow get the written information to persistent store before
the application write call returns ... this is where the intent log comes in
... the system call information (including the data) involved in a synchronous
write operation is written to the intent log on persistent store before the
write call returns ... but the information is also written into the host
memory buffer scheduled for its 5 sec updates (just as if it were an
asynchronous write) ... at the end of the 5 sec update time the new host
buffers are written to disk, and, once committed, the intent log information
written to the ZIL is no longer needed and can be jettisoned (so the ZIL
never needs to be very large)
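
a rough sketch of the two paths a synchronous write takes (zil_log_record()
and txg_buffer_add() are invented stand-ins here, not real ZFS interfaces):

#include <stddef.h>

struct write_record {
    long long   object;     /* which file/object */
    long long   offset;
    size_t      len;
    const void *data;
};

/* stand-in: would put the record, data included, on persistent store */
static int zil_log_record(const struct write_record *wr) { (void)wr; return 0; }

/* stand-in: would add the write to the in-memory ~5 sec batch */
static void txg_buffer_add(const struct write_record *wr) { (void)wr; }

int sync_write(const struct write_record *wr)
{
    /* 1. the record (including the data) goes to the intent log on stable
     *    store before the application's write call is allowed to return */
    if (zil_log_record(wr) != 0)
        return -1;

    /* 2. the same write also joins the normal in-memory transaction group,
     *    just as an asynchronous write would */
    txg_buffer_add(wr);

    /* once that transaction group commits, the ZIL copy of this record is
     * stale and can be thrown away, which is why the ZIL stays small */
    return 0;
}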

if the system fails, the accumulated but not flushed host buffer information
will be lost, but the ZIL records will already be on disk for any synchronous
writes and can be replayed when the host comes back up, or the pool is
imported by some other living host ... the pool, of course, always comes up
in a consistent state, but any ZIL records can be incorporated into a new 
consistent state before the pool is fully imported for use
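
the replay step at import time can be sketched like this (again, the record
layout and helpers are invented stand-ins):

struct zil_record { long long object, offset, len; };

/* stand-in: would read the next surviving record from the on-disk ZIL,
 * returning non-zero once the log is exhausted */
static int zil_read_next(struct zil_record *r) { (void)r; return 1; }

/* stand-in: would re-apply the logged synchronous write to the pool */
static void apply_to_new_state(const struct zil_record *r) { (void)r; }

void replay_intent_log(void)
{
    struct zil_record r;
    /* any record still in the ZIL is a synchronous write whose data never
     * reached a committed transaction group before the failure */
    while (zil_read_next(&r) == 0)
        apply_to_new_state(&r);   /* fold it into the new consistent state */
    /* only after replay does the import complete and the pool go live */
}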

the ZIL is always there in host memory, even when no synchronous writes
are being done, since the POSIX fsync() call could be made on an open 
write channel at any time, requiring all to-date writes on that channel
to be committed to persistent store before it returns to the application
... it's cheaper to write the ZIL at this point than to force the entire 5 sec
buffer out prematurely
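
in application terms the fsync() case looks like this (the path is only an
example):

#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tank/example/records.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return 1;

    const char rec[] = "must survive a crash\n";
    if (write(fd, rec, sizeof(rec) - 1) < 0)   /* asynchronous so far */
        return 1;

    /* fsync() cannot return until everything written on this descriptor is
     * on persistent store ... ZFS satisfies it by pushing ZIL records
     * rather than forcing the whole 5 sec transaction group out early */
    if (fsync(fd) != 0)
        return 1;

    close(fd);
    return 0;
}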

synchronous writes can clearly have a significant negative performance 
impact in ZFS (or any other system) by forcing writes to disk before having a
chance to do more efficient, aggregated writes (the 5 second type), but
the ZIL solution in ZFS provides a good trade-off with a lot of room to
choose among various levels of performance and potential data loss ...
this is especially true with the recent addition of separate ZIL device
specification ... a small, fast (nvram type) device can be designated for
ZIL use, leaving slower spindle disks for the rest of the pool 

hope this helps ... Bill
 
 