On Aug 11, 2006, at 12:38 PM, Jonathan Adams wrote:
> The problem is that you don't know the actual *contents* of the parent
> block until *all* of its children have been written to their final
> locations.  (This is because the block pointer's value depends on the
> final location.)
But I know where the children are going before I actually write them.
There is a dependency of the parent's contents on the *address* of its
children, but not on the actual write.  We can compute everything we are
going to write before we start writing.

(Yes, in the event of a write failure we have to recover; but that's
very rare, and can easily be handled -- we just start over, since no
visible state has been changed.)
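
Roughly what I have in mind, with made-up types and helpers rather than
the real ZFS structures: once the allocator has handed out an offset for
each child, the parent's block pointers can be filled in from the
in-memory contents alone, before any I/O is issued.

#include <stddef.h>
#include <stdint.h>

typedef struct blkptr {
    uint64_t bp_offset;     /* assigned device offset */
    uint64_t bp_size;       /* block size in bytes */
    uint64_t bp_checksum;   /* checksum of the block contents */
} blkptr_t;

typedef struct block {
    void     *b_data;       /* in-memory contents, already final */
    uint64_t  b_size;
    uint64_t  b_offset;     /* filled in by the allocator */
} block_t;

/* Hypothetical helpers assumed to exist elsewhere. */
extern uint64_t alloc_space(uint64_t size);     /* assign a device offset */
extern uint64_t checksum(const void *buf, uint64_t size);

/*
 * Fill in a parent's block pointers for its children.  After this runs
 * the parent's contents are final, so it can be queued for I/O in the
 * same batch as the children.
 */
void
fill_parent_pointers(blkptr_t *bps, block_t *children, size_t nchildren)
{
    for (size_t i = 0; i < nchildren; i++) {
        children[i].b_offset = alloc_space(children[i].b_size);
        bps[i].bp_offset = children[i].b_offset;
        bps[i].bp_size = children[i].b_size;
        bps[i].bp_checksum = checksum(children[i].b_data,
            children[i].b_size);
    }
}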
> The ditto blocks don't really affect this, since they can all be
> written out in parallel.
The reason they affect my desire to turn the update into a two-phase
commit (make all the changes, then update the überblock) is that the
ditto blocks are deliberately spread across the disk, so we can't
collect them into a single write (for a non-redundant pool, or at least
a one-disk pool -- presumably they wind up on different disks in a
two-disk pool, in which case we can still do a single write per disk).
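
To illustrate the per-disk bucketing (the types and helpers here are
hypothetical, just to show the shape of it): group the pending blocks by
the device they were allocated on, and each device still gets one
gathered write even though the ditto copies are spread around.

#include <stddef.h>

typedef struct pending_io pending_io_t;    /* one allocated block to write */
typedef struct device device_t;

/* Hypothetical helpers assumed to exist elsewhere. */
extern device_t *io_device(const pending_io_t *io);  /* where it was allocated */
extern void queue_for_device(device_t *dv, pending_io_t *io);
extern void issue_one_write_per_device(void);        /* one gathered I/O each */

void
flush_pending(pending_io_t **ios, size_t n)
{
    for (size_t i = 0; i < n; i++)
        queue_for_device(io_device(ios[i]), ios[i]);
    issue_one_write_per_device();
}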
> Again, there is; if a block write fails, you have to re-write it and
> all of its parents.  So the best you could do would be:
>
>   1.  assign locations for all blocks, and update the space bitmaps
>       as necessary.
>   2.  update all of the non-Uberdata blocks with their actual contents
>       (which requires calculating checksums on all of the child blocks)
>   3.  write everything out in parallel.
>   3a. if any write fails, re-do 1+2 for that block, and 2 for all of
>       its parents, then start over at 3 with all of the changed blocks.
>   4.  once everything is on stable storage, update the uberblock.
>
> That's a lot more complicated than the current model, but certainly
> seems possible.
(3a could actually be simplified to simply "mark the bad blocks as
unallocatable, and go to 1", but it's more efficient as you describe.)
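
Putting those steps into rough code (the block-list type and every
helper here are made up for illustration; this is not the actual ZFS
transaction-group code):

#include <stdbool.h>

typedef struct blocklist blocklist_t;   /* an ordered set of dirty blocks */

/* Hypothetical helpers assumed to exist elsewhere. */
extern void assign_locations(blocklist_t *bl);  /* step 1: allocate + space maps */
extern void fill_contents(blocklist_t *bl);     /* step 2: checksums/addresses into parents */
extern blocklist_t *write_parallel(blocklist_t *bl);  /* step 3: returns failed writes */
extern blocklist_t *ancestors_of(blocklist_t *bl);    /* blocks whose contents must change */
extern blocklist_t *merge(blocklist_t *a, blocklist_t *b);
extern bool is_empty(const blocklist_t *bl);
extern void write_uberblock(void);              /* step 4: the commit point */

void
sync_transaction_group(blocklist_t *dirty)
{
    assign_locations(dirty);                    /* 1 */
    fill_contents(dirty);                       /* 2 */

    blocklist_t *failed = write_parallel(dirty);        /* 3 */
    while (!is_empty(failed)) {                         /* 3a */
        blocklist_t *retry = merge(failed, ancestors_of(failed));
        assign_locations(failed);       /* new locations for the bad blocks */
        fill_contents(retry);           /* recompute the affected parents */
        failed = write_parallel(retry); /* back to 3 with the changed blocks */
    }

    write_uberblock();          /* 4: nothing is visible until this succeeds */
}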
The eventual advantage, though, is that we get the performance of a
single write (plus, always, the überblock update).  In a heavily loaded
system, the current approach (lots of small writes) won't scale so well.
(Actually we'd probably want to limit the size of each write to some
small value, like 16 MB, simply to allow the first write to start
earlier under fairly heavy loads.)
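
The chunking itself is trivial; something like this, where the 16 MB cap
and issue_write() are purely illustrative:

#include <stddef.h>
#include <stdint.h>

#define MAX_WRITE_BYTES (16ULL << 20)   /* 16 MB per I/O, tunable */

/* Hypothetical: queues one write, which can start before later chunks are built. */
extern void issue_write(uint64_t offset, const void *buf, size_t len);

void
write_in_chunks(uint64_t offset, const char *buf, size_t len)
{
    while (len > 0) {
        size_t chunk = len < MAX_WRITE_BYTES ? len : MAX_WRITE_BYTES;
        issue_write(offset, buf, chunk);
        offset += chunk;
        buf += chunk;
        len -= chunk;
    }
}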
As I pointed out earlier, this would require getting scatter/gather
support through the storage subsystem, but the potential win should be
quite large.
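
By scatter/gather I mean the moral equivalent of writev(2) applied down
through the storage stack: several blocks sitting at scattered memory
addresses going out to the device as a single request.  A userland
illustration of the idea:

#include <sys/uio.h>
#include <unistd.h>

/* Gather three in-memory blocks into a single write to fd. */
ssize_t
gathered_write(int fd, void *b0, size_t l0, void *b1, size_t l1,
    void *b2, size_t l2)
{
    struct iovec iov[3] = {
        { .iov_base = b0, .iov_len = l0 },
        { .iov_base = b1, .iov_len = l1 },
        { .iov_base = b2, .iov_len = l2 },
    };

    return (writev(fd, iov, 3));
}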
Something to think about for the future. :-)
Incidentally, this is part of how QFS gets its performance for streaming
I/O.  We use an "allocate forward" policy, allow very large allocation
blocks, and separate the metadata from the data.  This allows us to
write (or read) data in fairly large I/O requests, without unnecessary
disk head motion.
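
A toy illustration of the allocate-forward idea (the types and rounding
policy are made up; this isn't QFS code): the allocator is just a cursor
that moves forward in large units, so successive allocations land at
successive disk addresses.

#include <stdint.h>

typedef struct fwd_alloc {
    uint64_t cursor;    /* next free byte on the device */
    uint64_t limit;     /* end of the allocatable region */
    uint64_t unit;      /* large allocation block size */
} fwd_alloc_t;

/* Returns the starting offset of the allocation, or UINT64_MAX if full. */
uint64_t
fwd_allocate(fwd_alloc_t *fa, uint64_t nbytes)
{
    /* Round the request up to a whole allocation block. */
    uint64_t units = (nbytes + fa->unit - 1) / fa->unit;
    uint64_t size = units * fa->unit;

    if (fa->cursor + size > fa->limit)
        return (UINT64_MAX);

    uint64_t off = fa->cursor;
    fa->cursor += size;         /* always move forward: no seeking back */
    return (off);
}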
Anton