On Fri, Aug 11, 2006 at 11:04:06AM -0500, Anton Rang wrote:
> >Once the data blocks are on disk we have the information
> >necessary to update the indirect blocks iteratively up to
> >the ueberblock.  Those are the smaller I/Os; I guess that
> >because of ditto blocks they go to physically separate
> >locations, by design.
>
> We shouldn't have to wait for the data blocks to reach disk,
> though.  We know where they're going in advance.  One of the
> key advantages of the überblock scheme is that we can, in a
> sense, speculatively write to disk.  We don't need the tight
> ordering that UFS requires to avoid security exposures and
> allow the file system to be repaired.  We can lay out all of
> the data and metadata, write them all to disk, choose new
> locations if the writes fail, etc., and not worry about any
> ordering or state issues, because the on-disk image doesn't
> change until we commit it.
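(To illustrate the point Anton is making, here is a toy model in Python -- not actual ZFS code, and the names are made up -- of why copy-on-write needs no ordering among the data/metadata writes: every new block goes to a currently-unused location, so the visible tree cannot change until the single uberblock update.)

```python
# Toy model of copy-on-write commit semantics (not ZFS code).

class Disk:
    def __init__(self):
        self.blocks = {}        # location -> contents
        self.uberblock = None   # location of the current root block

    def write(self, loc, data):
        # Speculative write: order doesn't matter, because nothing
        # reachable from the current uberblock points at `loc` yet.
        self.blocks[loc] = data

    def commit(self, root_loc):
        # The one ordering constraint: all of the writes above must
        # be on stable storage before this single atomic update.
        self.uberblock = root_loc

disk = Disk()
disk.write(1, "old data")
disk.write(2, {"root": 1})      # old root block points at block 1
disk.commit(2)

# New transaction: data and the new root go anywhere free, in any
# order -- the new root can even hit disk before its child.  Until
# the commit, readers still see the old tree.
disk.write(4, {"root": 3})
disk.write(3, "new data")
assert disk.blocks[disk.uberblock] == {"root": 1}
disk.commit(4)
assert disk.blocks[disk.uberblock] == {"root": 3}
```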
> You're right, the ditto block mechanism will mean that some
> writes will be spread around (at least when using a
> non-redundant pool like mine), but then we should have at
> most three writes followed by the überblock update, assuming
> three degrees of replication.

The problem is that you don't know the actual *contents* of the
parent block until *all* of its children have been written to their
final locations.  (This is because the block pointer's value depends
on the final location.)  The ditto blocks don't really affect this,
since they can all be written out in parallel.  So you end up with
the current N phases:  data, its parents, its parents' parents, ...,
uberblock.

> >But I follow you in that, it may be remotely possible to
> >reduce the number of iterations in the process by assuming
> >that the I/O will all succeed, then if some fails, fix up
> >the consequence and when all done, update the ueberblock.  I
> >would not hold my breath quite yet for that.
>
> Hmmm.  I guess my point is that we shouldn't need to iterate
> at all.  There are no dependencies between these writes; only
> between the complete set of writes and the überblock update.

Again, there is; if a block write fails, you have to re-write it and
all of its parents.  So the best you could do would be:

 1. assign locations for all blocks, and update the space bitmaps as
    necessary.
 2. update all of the non-uberblock blocks with their actual contents
    (which requires calculating checksums on all of the child blocks)
 3. write everything out in parallel.
 3a. if any write fails, re-do 1+2 for that block, and 2 for all of
     its parents, then start over at 3 with all of the changed blocks.
 4. once everything is on stable storage, update the uberblock.

That's a lot more complicated than the current model, but certainly
seems possible.
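(A hypothetical sketch of steps 1-4, in Python rather than the actual kernel code; sha256 stands in for ZFS's checksums, a trivial bump allocator stands in for the space maps, and a single parent block stands in for the whole indirect-block tree.  None of these names come from the ZFS source.)

```python
import hashlib
import itertools

def checksum(data: bytes) -> str:
    # Stand-in for ZFS block checksums.
    return hashlib.sha256(data).hexdigest()[:8]

def sync_transaction(leaves, bad_locations):
    """Write `leaves` (a list of data blocks) plus one parent block,
    then commit.  `bad_locations` simulates failed writes."""
    alloc = itertools.count(1)              # step 1: trivial allocator
    disk = {}
    locs = [next(alloc) for _ in leaves]

    while True:
        # Step 2: the parent's contents embed each child's final
        # (location, checksum) pair -- so relocating any child
        # forces the parent's contents to change.
        parent = repr([(l, checksum(d))
                       for l, d in zip(locs, leaves)]).encode()
        parent_loc = next(alloc)

        # Step 3: issue everything "in parallel"; collect failures.
        failed = []
        blocks = list(zip(locs + [parent_loc], leaves + [parent]))
        for i, (loc, data) in enumerate(blocks):
            if loc in bad_locations:
                failed.append(i)
            else:
                disk[loc] = data    # stale copies at abandoned
                                    # locations are harmless; nothing
                                    # will ever point at them

        if not failed:
            break
        # Step 3a: re-assign locations for the failed blocks; the
        # loop then rebuilds the parent, since its pointers changed.
        for i in failed:
            if i < len(leaves):
                locs[i] = next(alloc)

    # Step 4: everything is stable; flip the uberblock.
    disk["uberblock"] = parent_loc
    return disk, parent_loc
```

For example, with two data blocks and location 1 marked bad, the first leaf is relocated and the parent rebuilt once before the commit:

```python
disk, root = sync_transaction([b"aaa", b"bbb"], bad_locations={1})
```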
Cheers,
- jonathan

(this is only my understanding of how ZFS works; I could be mistaken)

--
Jonathan Adams, Solaris Kernel Development
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss