On Fri, Aug 11, 2006 at 11:04:06AM -0500, Anton Rang wrote:
> >Once the data blocks are on disk we have the information
> >necessary to update the indirect blocks iteratively up to
> >the ueberblock. Those are the smaller I/Os; I guess that
> >because of ditto blocks they go to physically separate
> >locations, by design.
> 
> We shouldn't have to wait for the data blocks to reach disk,
> though.  We know where they're going in advance.  One of the
> key advantages of the überblock scheme is that we can, in a
> sense, speculatively write to disk.  We don't need the tight
> ordering that UFS requires to avoid security exposures and
> allow the file system to be repaired.  We can lay out all of
> the data and metadata, write them all to disk, choose new
> locations if the writes fail, etc. and not worry about any
> ordering or state issues, because the on-disk image doesn't
> change until we commit it.

> You're right, the ditto block mechanism will mean that some
> writes will be spread around (at least when using a
> non-redundant pool like mine), but then we should have at
> most three writes followed by the überblock update, assuming
> three degrees of replication.

The problem is that you don't know the actual *contents* of the parent block
until *all* of its children have been written to their final locations.
(This is because the block pointer's value depends on the child's final
location and checksum.)  The ditto blocks don't really affect this, since
they can all be written out in parallel.

So you end up with the current N phases: data, its parents,
their parents, ..., uberblock.
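
To illustrate the dependency (this is a toy model of my own, not the real
ZFS structures): a parent's contents are just block pointers, each embedding
a child's location and checksum, so the parent's bytes can't be finalized
until every child has its final location:

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def build_parent(children):
    """children: list of (location, data) pairs already assigned."""
    # The parent's contents are its block pointers: (location, checksum)
    # for each child.  Move any child and the parent's bytes change too.
    pointers = [(loc, checksum(data)) for loc, data in children]
    return repr(pointers).encode()

# Children get locations and contents first.
children = [(100, b"block A"), (200, b"block B")]
parent = build_parent(children)

# Re-locating one child (e.g. after a failed write) changes the parent's
# contents -- and hence the grandparent's pointer to the parent, and so on
# up to the uberblock.
relocated = [(300, b"block A"), (200, b"block B")]
parent_after = build_parent(relocated)
assert parent != parent_after
```

This is why ditto copies don't add phases: they are extra locations for the
same contents, all writable in parallel once those contents are known.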

> >But I follow you in that, it may be remotely possible to
> >reduce the number of iterations in the process by assuming
> >that the I/O will all succeed, then if some fail, fix up
> >the consequences and when all done, update the ueberblock. I
> >would not hold my breath quite yet for that.
> 
> Hmmm.  I guess my point is that we shouldn't need to iterate
> at all.  There are no dependencies between these writes; only
> between the complete set of writes and the überblock update.

Again, there is a dependency: if a block write fails, you have to re-write it
and all of its parents.  So the best you could do would be:

        1. assign locations for all blocks, and update the space bitmaps
           as necessary.
        2. update all of the non-uberblock blocks with their actual
           contents (which requires calculating checksums on all of the
           child blocks)
        3. write everything out in parallel.
        3a. if any write fails, re-do 1+2 for that block, and 2 for all of its
            parents, then start over at 3 with all of the changed blocks.

        4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but certainly seems
possible.
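
As a sketch of those steps (my own pseudocode on toy structures -- Block,
alloc, and write are made-up names, not ZFS interfaces -- and the "parallel"
write is done sequentially here for simplicity):

```python
import hashlib, itertools

def cksum(b):
    return hashlib.sha256(b).hexdigest()

class Block:
    def __init__(self, data=b"", children=()):
        self.children = list(children)
        self.data = data
        self.loc = None

    def contents(self):
        # A parent's contents are block pointers (location, checksum)
        # to its children, so they change whenever a child moves.
        if not self.children:
            return self.data
        return repr([(c.loc, cksum(c.contents()))
                     for c in self.children]).encode()

def collect(b):
    out = [b]
    for c in b.children:
        out.extend(collect(c))
    return out

def ancestors(root, target):
    # Every block on the path from root down to (but excluding) target.
    if target in root.children:
        return [root]
    for c in root.children:
        path = ancestors(c, target)
        if path:
            return [root] + path
    return []

def commit(root, alloc, write):
    for b in collect(root):
        b.loc = alloc()                    # step 1: assign locations
    pending = set(collect(root))           # step 2 happens via contents()
    while pending:
        failed = {b for b in pending
                  if not write(b.loc, b.contents())}   # step 3
        pending = set()
        for b in failed:                   # step 3a: re-place the block,
            b.loc = alloc()                # redo its ancestors' contents
            pending.add(b)
            pending.update(ancestors(root, b))
    # step 4: everything stable; only now would the uberblock be updated.

# Demo: one leaf whose first write fails and must be relocated.
_loc = itertools.count(1)
disk = {}
bad = {2}                                  # pretend this location is bad
def write(loc, data):
    if loc in bad:
        bad.discard(loc)                   # fails once
        return False
    disk[loc] = data
    return True

leaf = Block(data=b"hello")
root = Block(children=[leaf])
commit(root, alloc=lambda: next(_loc), write=write)
```

Note how a single failed leaf write pulls the whole ancestor chain back into
the pending set -- which is exactly the re-iteration being discussed above.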

Cheers,
- jonathan

(this is only my understanding of how ZFS works;  I could be mistaken)


-- 
Jonathan Adams, Solaris Kernel Development
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
