Re: [zfs-discuss] Oracle and ZFS

Miles Nordin Mon, 23 Jun 2008 15:01:01 -0700

>>>>> "re" == Richard Elling <[EMAIL PROTECTED]> writes:
>>>>> "kb" == Keith Bierman <[EMAIL PROTECTED]> writes:


    re> the disk lies about the persistence of the data.  ZFS knows
    re> disks lie, so it sends sync commands when necessary

(1) I don't think ``lie'' is a correct characterization given that the
    sync commands exist, but point taken about the other area of risk.

    I suspect there may be similar problems in ZFS's write path when
    one is using iSCSI targets.  Or it's just common for iSCSI target
    implementations to suck (lie).  Or maybe it's something else I'm
    seeing.

(2) I thought the recommendation that one give ZFS whole disks and let
    it put EFI labels on them came from a Solaris behavior: only in a
    whole-disk-for-ZFS configuration do the Solaris drivers refrain
    from explicitly disabling the write cache in these inexpensive
    disks.  So the cache shouldn't be a problem for UFS, but it might
    be for non-Solaris operating systems (even for ZFS on platforms
    where ZFS is ported but the SYNCHRONIZE CACHE commands don't make
    it through some mid-layer, CAM, or driver).

    kb> Aye, but isn't that the real rub ... when the power fails
    kb> after the write but *before* the fsync has occurred...

No, there is no rub here; I was only speaking precisely.  A proper
DBMS (anything except MySQL) is also designed to understand that power
failures happen.  It does its writes in a deliberate order such that
it won't return success to the application calling it until it gets
the return from fsync(), and also so that the system will never
recover such that a transaction is half-completed.
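
To make the ordering concrete, here is a minimal sketch of that
discipline, not any real DBMS's code.  The record layout (length plus
CRC32 header) and the names commit() and recover() are my own
assumptions for illustration.  commit() does not return until fsync()
has, and recover() stops at the first incomplete record, so a
half-written transaction is never replayed.

```python
import os
import struct
import zlib

# Hypothetical record layout: payload length, then CRC32 of the payload.
HEADER = struct.Struct("<II")


def commit(log_path, payload):
    """Append one record; return only after fsync() has returned."""
    record = HEADER.pack(len(payload), zlib.crc32(payload)) + payload
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        view = memoryview(record)
        while view:
            # os.write may write fewer bytes than asked, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Success is reported to the caller only once this returns.
        # (A newly created file would also need its directory fsync()ed;
        # that step is left out of this sketch.)
        os.fsync(fd)
    finally:
        os.close(fd)


def recover(log_path):
    """Return the complete records, ignoring a torn or corrupt tail."""
    try:
        with open(log_path, "rb") as f:
            data = f.read()
    except FileNotFoundError:
        return []
    records = []
    pos = 0
    while pos + HEADER.size <= len(data):
        length, crc = HEADER.unpack_from(data, pos)
        start = pos + HEADER.size
        payload = data[start:start + length]
        if len(payload) < length or zlib.crc32(payload) != crc:
            break  # power failed mid-write: this record never committed
        records.append(payload)
        pos = start + length
    return records
```

Whether the fsync() here actually reaches the platter still depends on
the write cache and SYNCHRONIZE CACHE path discussed above; the sketch
only shows the ordering the application side is responsible for.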

    re> the ZFS on-disk format is such that you can recover to a point
    re> in time where the file system is consistent.

Do you mean that, ``after a power outage ZFS will always recover the
filesystem to a state that it passed through in the moments leading up
to the outage,'' while UFS, which logs only metadata, typically
recovers to some state the filesystem never passed through---but it
never loses fsync()ed data nor data that wasn't written ``recently''
before the crash?

For casual filesystem use, or for applications that weren't designed
with cord-pulling in mind, ZFS's guarantee is larger and more
comforting.  But for databases, I don't think the distinction matters
because they call fsync() at deliberate moments and do their own
copy-on-write logging above the filesystem, so they provide the same
consistency guarantees whether operating on UFS or ZFS.  It would be
fine to feed a database the type of hacked non-CoW zvol that's used
for swap, if fsync could be made to work there, which maybe it can't.


_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
