On Aug 11, 2006, at 12:38 PM, Jonathan Adams wrote:

> The problem is that you don't know the actual *contents* of the parent
> block until *all* of its children have been written to their final
> locations.  (This is because the block pointer's value depends on the
> final location.)
But I know where the children are going before I actually write them.
There is a dependency of the parent's contents on the *address* of its
children, but not on the actual write.  We can compute everything that
we are going to write before we start to write.

(Yes, in the event of a write failure we have to recover; but that's
very rare, and can easily be handled -- we just start over, since no
visible state has been changed.)
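To make that concrete, here's a rough C sketch (the types and names are
my own invention, not the real ZFS structures, and the checksum is just
a stand-in) of filling in a parent's pointer to a child whose location
has been assigned but whose write hasn't even been issued yet:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified block pointer -- not the real ZFS blkptr_t. */
typedef struct blkptr {
    uint64_t bp_offset;     /* assigned on-disk address of the child */
    uint64_t bp_size;       /* child block size in bytes */
    uint64_t bp_checksum;   /* checksum of the child's contents */
} blkptr_t;

/* Stand-in checksum (FNV-1a); a real system would use fletcher or SHA-256. */
static uint64_t
checksum(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 14695981039346656037ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return (h);
}

/*
 * Fill in the parent's pointer to a child whose location has been chosen
 * but whose write has not yet completed (or even been issued).  Nothing
 * here depends on the child's write -- only on its contents and address.
 */
static void
fill_parent_ptr(blkptr_t *slot, const void *child_buf, size_t child_len,
    uint64_t child_offset)
{
    slot->bp_offset   = child_offset;                   /* the *address* */
    slot->bp_size     = child_len;
    slot->bp_checksum = checksum(child_buf, child_len); /* the contents  */
}

Nothing in fill_parent_ptr() waits on I/O; the pointer depends only on
the child's in-memory contents and its assigned address.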

> The ditto blocks don't really affect this, since they can all be
> written out in parallel.
The reason they affect my desire to turn the update into a two-phase
commit (make all the changes, then update the überblock) is that the
ditto blocks are deliberately spread across the disk, so we can't
collect them into a single write (for a non-redundant pool, or at least
a one-disk pool -- presumably they wind up on different disks for a
two-disk pool, in which case we can still do a single write per disk).
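For the multi-disk case, the grouping I have in mind is just one queue
of pending copies per disk, so each disk ends up with a single large
write; a toy sketch (made-up names, nothing from the actual code):

#include <stddef.h>
#include <stdint.h>

#define MAX_DISKS 16

/* A pending copy of a block (each ditto copy lands on some disk). */
typedef struct pending_write {
    int         pw_disk;               /* disk this copy was allocated on */
    uint64_t    pw_offset;             /* assigned offset on that disk */
    const void *pw_buf;                /* in-memory contents */
    size_t      pw_len;
    struct pending_write *pw_next;
} pending_write_t;

/* One queue per disk; each queue is later issued as one (gather) write. */
static pending_write_t *disk_queue[MAX_DISKS];

static void
queue_copy(pending_write_t *pw)
{
    pw->pw_next = disk_queue[pw->pw_disk];
    disk_queue[pw->pw_disk] = pw;
}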

> Again, there is; if a block write fails, you have to re-write it and
> all of its parents.  So the best you could do would be:
>
>         1. assign locations for all blocks, and update the space bitmaps
>            as necessary.
>         2. update all of the non-Uberdata blocks with their actual
>            contents (which requires calculating checksums on all of the
>            child blocks)
>         3. write everything out in parallel.
>         3a. if any write fails, re-do 1+2 for that block, and 2 for all
>             of its parents, then start over at 3 with all of the changed
>             blocks.
>         4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but it certainly
seems possible.  (3a could actually be simplified to just "mark the bad
blocks as unallocatable, and go to 1", but it's more efficient as you
describe.)
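Putting the steps together, the whole sequence might look roughly like
the sketch below.  All the names are hypothetical; the externs are
placeholders for the real allocation, checksum, and I/O machinery, and
I'm assuming the dirty list is ordered children-before-parents:

#include <stddef.h>

typedef struct block block_t;       /* a dirty block: buffer, children, ... */

/* Placeholder hooks (hypothetical, not actual ZFS interfaces). */
extern void     assign_location(block_t *b);   /* step 1: pick an address,
                                                  update the space bitmaps */
extern void     fill_contents(block_t *b);     /* step 2: child checksums
                                                  and addresses go in here */
extern void     write_async(block_t *b);       /* step 3: issue the write */
extern void     wait_for_writes(void);
extern block_t *next_failed_write(void);       /* NULL once all writes landed */
extern void     reissue_ancestors(block_t *b); /* redo step 2 up the tree */
extern void     write_uberblock(void);         /* step 4 */

/* Sync one transaction group; 'dirty' is ordered children-before-parents. */
void
sync_txg(block_t *dirty[], int n)
{
    /* 1. Assign locations for every dirty block. */
    for (int i = 0; i < n; i++)
        assign_location(dirty[i]);

    /* 2. Fill in contents bottom-up, so each parent sees its children's
     *    final addresses and checksums before anything is written. */
    for (int i = 0; i < n; i++)
        fill_contents(dirty[i]);

    /* 3. Write everything out in parallel. */
    for (int i = 0; i < n; i++)
        write_async(dirty[i]);
    wait_for_writes();

    /* 3a. If a write failed, re-do 1+2 for that block, 2 for its parents,
     *     and reissue just the changed blocks. */
    block_t *bad;
    while ((bad = next_failed_write()) != NULL) {
        assign_location(bad);
        fill_contents(bad);
        reissue_ancestors(bad);
        write_async(bad);
        wait_for_writes();
    }

    /* 4. Only once everything is on stable storage, update the uberblock. */
    write_uberblock();
}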

The eventual advantage, though, is that we get the performance of a single
write (plus, always, the überblock update).  In a heavily loaded system,
the current approach (lots of small writes) won't scale so well. (Actually
we'd probably want to limit the size of each write to some small value,
like 16 MB, simply to allow the first write to start earlier under fairly
heavy loads.)

As I pointed out earlier, this would require getting scatter/gather support through the storage subsystem, but the potential win should be quite large.
Something to think about for the future.  :-)
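At user level a gather write already looks roughly like this -- many
separate in-memory buffers submitted as one I/O to a contiguous on-disk
range, capped at the 16 MB I mentioned above.  (The sketch uses
pwritev(2); the point is that the kernel storage stack would need the
equivalent capability end-to-end.)

#include <limits.h>
#include <sys/types.h>
#include <sys/uio.h>

#define SEGMENT_CAP (16 * 1024 * 1024)      /* cap each I/O at ~16 MB */

/*
 * Write 'n' in-memory buffers back-to-back starting at 'offset',
 * splitting into a new pwritev() whenever the cap (or IOV_MAX) is hit.
 * Returns 0 on success, -1 on the first short or failed write.
 */
int
gather_write(int fd, struct iovec *iov, int n, off_t offset)
{
    int i = 0;

    while (i < n) {
        size_t bytes = 0;
        int cnt = 0;

        while (i + cnt < n && cnt < IOV_MAX &&
            bytes + iov[i + cnt].iov_len <= SEGMENT_CAP) {
            bytes += iov[i + cnt].iov_len;
            cnt++;
        }
        if (cnt == 0) {             /* single buffer larger than the cap */
            bytes = iov[i].iov_len;
            cnt = 1;
        }

        ssize_t w = pwritev(fd, &iov[i], cnt, offset);
        if (w < 0 || (size_t)w != bytes)
            return (-1);

        offset += bytes;
        i += cnt;
    }
    return (0);
}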

Incidentally, this is part of how QFS gets its performance for streaming
I/O.  We use an "allocate forward" policy, allow very large allocation
blocks, and separate the metadata from the data.  This lets us write (or
read) data in fairly large I/O requests, without unnecessary disk head
motion.
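A toy illustration of the allocation side (nothing like the actual QFS
code, just the idea): a per-device cursor that only moves forward in
large units, with metadata allocated from a separate device:

#include <stdint.h>

#define ALLOC_UNIT (4ULL * 1024 * 1024)     /* large allocation block, e.g. 4 MB */

typedef struct device {
    uint64_t next;      /* allocation cursor, in bytes -- only moves forward */
    uint64_t size;      /* device capacity, in bytes */
} device_t;

static device_t data_dev;   /* file data lives here...          */
static device_t meta_dev;   /* ...metadata on a separate device */

/* Allocate 'len' bytes, rounded up to whole units, strictly forward. */
static uint64_t
alloc_forward(device_t *dev, uint64_t len)
{
    uint64_t units = (len + ALLOC_UNIT - 1) / ALLOC_UNIT;
    uint64_t off = dev->next;

    if (off + units * ALLOC_UNIT > dev->size)
        return (UINT64_MAX);                /* out of space (no reuse here) */
    dev->next = off + units * ALLOC_UNIT;
    return (off);
}

/* Data extents come from one device, inodes and the like from the other. */
static uint64_t alloc_data(uint64_t len) { return (alloc_forward(&data_dev, len)); }
static uint64_t alloc_meta(uint64_t len) { return (alloc_forward(&meta_dev, len)); }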

Anton

