On Aug 11, 2006, at 12:38 PM, Jonathan Adams wrote:

> The problem is that you don't know the actual *contents* of the parent
> block until *all* of its children have been written to their final
> locations.  (This is because the block pointer's value depends on the
> final location.)
But I know where the children are going before I actually write them.
There is a dependency of the parent's contents on the *address* of its
children, but not on the actual write.  We can compute everything that
we are going to write before we start to write.

(Yes, in the event of a write failure we have to recover; but that's
very rare, and can easily be handled -- we just start over, since no
visible state has been changed.)
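To make that concrete, here's a rough C sketch (the types and names are
my own invention, not the real ZFS structures, and the checksum is just
a stand-in) of filling in a parent's pointer to a child whose location
has been assigned but whose write hasn't even been issued yet:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified block pointer -- not the real ZFS blkptr_t. */
typedef struct blkptr {
    uint64_t bp_offset;     /* assigned on-disk address of the child */
    uint64_t bp_size;       /* child block size in bytes */
    uint64_t bp_checksum;   /* checksum of the child's contents */
} blkptr_t;

/* Stand-in checksum (FNV-1a); a real system would use fletcher or SHA-256. */
static uint64_t
checksum(const void *buf, size_t len)
{
    const unsigned char *p = buf;
    uint64_t h = 14695981039346656037ULL;

    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return (h);
}

/*
 * Fill in the parent's pointer to a child whose location has been chosen
 * but whose write has not yet completed (or even been issued).  Nothing
 * here depends on the child's write -- only on its contents and address.
 */
static void
fill_parent_ptr(blkptr_t *slot, const void *child_buf, size_t child_len,
    uint64_t child_offset)
{
    slot->bp_offset   = child_offset;                   /* the *address* */
    slot->bp_size     = child_len;
    slot->bp_checksum = checksum(child_buf, child_len); /* the contents  */
}

Nothing in fill_parent_ptr() waits on I/O; the pointer depends only on
the child's in-memory contents and its assigned address.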

> The ditto blocks don't really affect this, since they can all be
> written out in parallel.
The reason they affect my desire to turn the update into a two-phase
commit (make all the changes, then update the überblock) is that the
ditto blocks are deliberately spread across the disk, so we can't
collect them into a single write (for a non-redundant pool, or at least
a one-disk pool -- presumably they wind up on different disks for a
two-disk pool, in which case we can still do a single write per disk).
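For the multi-disk case, the grouping I have in mind is just one queue
of pending copies per disk, so each disk ends up with a single large
write; a toy sketch (made-up names, nothing from the actual code):

#include <stddef.h>
#include <stdint.h>

#define MAX_DISKS 16

/* A pending copy of a block (each ditto copy lands on some disk). */
typedef struct pending_write {
    int         pw_disk;               /* disk this copy was allocated on */
    uint64_t    pw_offset;             /* assigned offset on that disk */
    const void *pw_buf;                /* in-memory contents */
    size_t      pw_len;
    struct pending_write *pw_next;
} pending_write_t;

/* One queue per disk; each queue is later issued as one (gather) write. */
static pending_write_t *disk_queue[MAX_DISKS];

static void
queue_copy(pending_write_t *pw)
{
    pw->pw_next = disk_queue[pw->pw_disk];
    disk_queue[pw->pw_disk] = pw;
}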

> Again, there is; if a block write fails, you have to re-write it and
> all of its parents.  So the best you could do would be:
>
>         1. assign locations for all blocks, and update the space bitmaps
>            as necessary.
>         2. update all of the non-Uberdata blocks with their actual
>            contents (which requires calculating checksums on all of the
>            child blocks)
>         3. write everything out in parallel.
>         3a. if any write fails, re-do 1+2 for that block, and 2 for all
>             of its parents, then start over at 3 with all of the changed
>             blocks.
>         4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but it certainly
seems possible.  (3a could actually be simplified to just "mark the bad
blocks as unallocatable, and go to 1", but it's more efficient as you
describe.)
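Putting the steps together, the whole sequence might look roughly like
the sketch below.  All the names are hypothetical; the externs are
placeholders for the real allocation, checksum, and I/O machinery, and
I'm assuming the dirty list is ordered children-before-parents:

#include <stddef.h>

typedef struct block block_t;       /* a dirty block: buffer, children, ... */

/* Placeholder hooks (hypothetical, not actual ZFS interfaces). */
extern void     assign_location(block_t *b);   /* step 1: pick an address,
                                                  update the space bitmaps */
extern void     fill_contents(block_t *b);     /* step 2: child checksums
                                                  and addresses go in here */
extern void     write_async(block_t *b);       /* step 3: issue the write */
extern void     wait_for_writes(void);
extern block_t *next_failed_write(void);       /* NULL once all writes landed */
extern void     reissue_ancestors(block_t *b); /* redo step 2 up the tree */
extern void     write_uberblock(void);         /* step 4 */

/* Sync one transaction group; 'dirty' is ordered children-before-parents. */
void
sync_txg(block_t *dirty[], int n)
{
    /* 1. Assign locations for every dirty block. */
    for (int i = 0; i < n; i++)
        assign_location(dirty[i]);

    /* 2. Fill in contents bottom-up, so each parent sees its children's
     *    final addresses and checksums before anything is written. */
    for (int i = 0; i < n; i++)
        fill_contents(dirty[i]);

    /* 3. Write everything out in parallel. */
    for (int i = 0; i < n; i++)
        write_async(dirty[i]);
    wait_for_writes();

    /* 3a. If a write failed, re-do 1+2 for that block, 2 for its parents,
     *     and reissue just the changed blocks. */
    block_t *bad;
    while ((bad = next_failed_write()) != NULL) {
        assign_location(bad);
        fill_contents(bad);
        reissue_ancestors(bad);
        write_async(bad);
        wait_for_writes();
    }

    /* 4. Only once everything is on stable storage, update the uberblock. */
    write_uberblock();
}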

The eventual advantage, though, is that we get the performance of a single
write (plus, always, the überblock update).  In a heavily loaded system,
the current approach (lots of small writes) won't scale so well. (Actually
we'd probably want to limit the size of each write to some small value,
like 16 MB, simply to allow the first write to start earlier under fairly
heavy loads.)

As I pointed out earlier, this would require getting scatter/gather support through the storage subsystem, but the potential win should be quite large.
Something to think about for the future.  :-)
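At user level a gather write already looks roughly like this -- many
separate in-memory buffers submitted as one I/O to a contiguous on-disk
range, capped at the 16 MB I mentioned above.  (The sketch uses
pwritev(2); the point is that the kernel storage stack would need the
equivalent capability end-to-end.)

#include <limits.h>
#include <sys/types.h>
#include <sys/uio.h>

#define SEGMENT_CAP (16 * 1024 * 1024)      /* cap each I/O at ~16 MB */

/*
 * Write 'n' in-memory buffers back-to-back starting at 'offset',
 * splitting into a new pwritev() whenever the cap (or IOV_MAX) is hit.
 * Returns 0 on success, -1 on the first short or failed write.
 */
int
gather_write(int fd, struct iovec *iov, int n, off_t offset)
{
    int i = 0;

    while (i < n) {
        size_t bytes = 0;
        int cnt = 0;

        while (i + cnt < n && cnt < IOV_MAX &&
            bytes + iov[i + cnt].iov_len <= SEGMENT_CAP) {
            bytes += iov[i + cnt].iov_len;
            cnt++;
        }
        if (cnt == 0) {             /* single buffer larger than the cap */
            bytes = iov[i].iov_len;
            cnt = 1;
        }

        ssize_t w = pwritev(fd, &iov[i], cnt, offset);
        if (w < 0 || (size_t)w != bytes)
            return (-1);

        offset += bytes;
        i += cnt;
    }
    return (0);
}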

Incidentally, this is part of how QFS gets its performance for streaming
I/O.  We use an "allocate forward" policy, allow very large allocation
blocks, and separate the metadata from the data.  This lets us write (or
read) data in fairly large I/O requests, without unnecessary disk head
motion.
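A toy illustration of the allocation side (nothing like the actual QFS
code, just the idea): a per-device cursor that only moves forward in
large units, with metadata allocated from a separate device:

#include <stdint.h>

#define ALLOC_UNIT (4ULL * 1024 * 1024)     /* large allocation block, e.g. 4 MB */

typedef struct device {
    uint64_t next;      /* allocation cursor, in bytes -- only moves forward */
    uint64_t size;      /* device capacity, in bytes */
} device_t;

static device_t data_dev;   /* file data lives here...          */
static device_t meta_dev;   /* ...metadata on a separate device */

/* Allocate 'len' bytes, rounded up to whole units, strictly forward. */
static uint64_t
alloc_forward(device_t *dev, uint64_t len)
{
    uint64_t units = (len + ALLOC_UNIT - 1) / ALLOC_UNIT;
    uint64_t off = dev->next;

    if (off + units * ALLOC_UNIT > dev->size)
        return (UINT64_MAX);                /* out of space (no reuse here) */
    dev->next = off + units * ALLOC_UNIT;
    return (off);
}

/* Data extents come from one device, inodes and the like from the other. */
static uint64_t alloc_data(uint64_t len) { return (alloc_forward(&data_dev, len)); }
static uint64_t alloc_meta(uint64_t len) { return (alloc_forward(&meta_dev, len)); }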

Anton

