On Aug 9, 2006, at 8:18 AM, Roch wrote:
>> So while I'm feeling optimistic :-) we really ought to be able to do this in two I/O operations. If we have, say, 500K of data to write (including all of the metadata), we should be able to allocate a contiguous 500K block on disk and write that with a single operation. Then we update the überblock.
>
> Hi Anton,
>
> Optimistic, a little, yes. The data blocks should have aggregated quite well into near-recordsize I/Os; are you sure they did not? No O_DSYNC in here, right?
When I repeated this with just 512K written in 1K chunks via dd, I saw six 16K writes; those were the largest. The others were around 1K-4K. No O_DSYNC:

    dd if=/dev/zero of=xyz bs=1k count=512

So some writes are being aggregated, but we're missing a lot.
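For reference, the dd run is roughly equivalent to this little C program (a sketch, not what dd actually does internally; the file name and the optional "dsync" argument are just for illustration). My understanding is that without O_DSYNC the 1K writes only dirty the in-memory cache, so ZFS is free to aggregate them into large device I/Os at the next transaction-group commit; with O_DSYNC each write would have to be made stable before write() returns, which limits aggregation.

/*
 * Sketch of the test above, with an optional O_DSYNC flag.
 * File name and counts are arbitrary, matching the dd example.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    int use_dsync = (argc > 1 && strcmp(argv[1], "dsync") == 0);
    int flags = O_WRONLY | O_CREAT | O_TRUNC | (use_dsync ? O_DSYNC : 0);
    int fd = open("xyz", flags, 0644);
    char buf[1024];
    int i;

    if (fd == -1) {
        perror("open");
        return (1);
    }
    memset(buf, 0, sizeof (buf));

    /* 512 writes of 1K each, like dd bs=1k count=512. */
    for (i = 0; i < 512; i++) {
        if (write(fd, buf, sizeof (buf)) != (ssize_t)sizeof (buf)) {
            perror("write");
            return (1);
        }
    }
    (void) close(fd);
    return (0);
}

Run with no arguments for the asynchronous case, or with "dsync" to add O_DSYNC.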
> Once the data blocks are on disk we have the information necessary to update the indirect blocks iteratively up to the überblock. Those are the smaller I/Os; I guess that because of ditto blocks they go to physically separate locations, by design.
We shouldn't have to wait for the data blocks to reach disk, though. We know where they're going in advance. One of the key advantages of the überblock scheme is that we can, in a sense, speculatively write to disk. We don't need the tight ordering that UFS requires to avoid security exposures and allow the file system to be repaired. We can lay out all of the data and metadata, write them all to disk, choose new locations if the writes fail, etc., and not worry about any ordering or state issues, because the on-disk image doesn't change until we commit it.

You're right, the ditto block mechanism will mean that some writes will be spread around (at least when using a non-redundant pool like mine), but then we should have at most three writes followed by the überblock update, assuming three degrees of replication.
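To make the ordering I have in mind concrete, here is a toy sketch in C against a flat file standing in for the pool (nothing like the real ZFS code; the file name, offsets, and block size are all invented). Every data and indirect block of the new tree goes to a pre-chosen free location, in whatever order the writes happen to complete; the only barrier is that all of them must be stable before the single überblock write that commits the tree.

/*
 * Toy sketch of the commit ordering against a flat file.  Phase 1
 * writes can go out in any order (or in parallel) because nothing on
 * disk refers to the new blocks yet; the uberblock write in phase 2 is
 * the single commit point.  Offsets and names are hypothetical.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ   4096

static int
write_block(int fd, off_t off, const char *tag)
{
    char blk[BLKSZ];

    memset(blk, 0, sizeof (blk));
    (void) snprintf(blk, sizeof (blk), "%s", tag);
    if (pwrite(fd, blk, sizeof (blk), off) != (ssize_t)sizeof (blk))
        return (-1);
    return (0);
}

int
main(void)
{
    /* Pre-chosen free locations for the new tree (made up). */
    off_t data_off = 16 * BLKSZ, indirect_off = 17 * BLKSZ, uber_off = 0;
    int fd = open("pool.img", O_RDWR | O_CREAT, 0644);

    if (fd == -1) {
        perror("open");
        return (1);
    }

    /* Phase 1: data and indirect blocks, no ordering between them. */
    if (write_block(fd, data_off, "data") != 0 ||
        write_block(fd, indirect_off, "indirect -> data") != 0) {
        perror("pwrite");
        return (1);
    }

    /* Barrier: all of phase 1 must be stable before the commit record. */
    if (fsync(fd) != 0) {
        perror("fsync");
        return (1);
    }

    /* Phase 2: a single uberblock update makes the new tree live. */
    if (write_block(fd, uber_off, "uberblock -> indirect") != 0 ||
        fsync(fd) != 0) {
        perror("commit");
        return (1);
    }
    (void) close(fd);
    return (0);
}

And if any of the phase-1 writes fails, nothing on disk refers to the new blocks yet, so we can simply pick new locations and retry before ever touching the überblock.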
> All of these, though, are normally done asynchronously to applications, unless the disks are flooded.
Which is a good thing (I think they're asynchronous anyway, unless the cache is full).
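Roughly the behavior I'd expect is captured by this toy, single-threaded model (purely illustrative; the cache limit, sync interval, and write sizes are all made up): writes normally just dirty memory and return, and the application only stalls when dirty data piles up faster than the periodic sync can push it out.

/*
 * Toy model of "asynchronous to applications": write() only dirties an
 * in-memory cache; a periodic sync drains it.  A tick is counted as a
 * stall when the cache is full and the writer would have to wait.
 * All limits are invented for illustration.
 */
#include <stdio.h>

#define DIRTY_LIMIT (8L * 1024 * 1024)  /* pretend cache limit */
#define SYNC_BYTES  (2L * 1024 * 1024)  /* drained per periodic sync */
#define TICKS       10000
#define SYNC_EVERY  100

static long
run(long bytes_per_write)
{
    long dirty = 0, stalls = 0;
    int tick;

    for (tick = 0; tick < TICKS; tick++) {
        if (dirty + bytes_per_write > DIRTY_LIMIT)
            stalls++;                    /* cache full: writer must wait */
        else
            dirty += bytes_per_write;    /* normal case: memory only */
        if (tick % SYNC_EVERY == 0)      /* periodic txg-style sync */
            dirty -= (dirty < SYNC_BYTES ? dirty : SYNC_BYTES);
    }
    return (stalls);
}

int
main(void)
{
    printf("slow writer: %ld stalls\n", run(1024));        /* 1K writes */
    printf("fast writer: %ld stalls\n", run(256 * 1024));  /* 256K writes */
    return (0);
}

The slow writer never stalls; the fast one spends most of its time waiting, which is the "disks are flooded" case.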
> But I follow you in that it may be remotely possible to reduce the number of iterations in the process by assuming that the I/Os will all succeed, then if some fail, fix up the consequences, and when all done, update the überblock. I would not hold my breath quite yet for that.
Hmmm. I guess my point is that we shouldn't need to iterate at all. There are no dependencies between these writes; only between the complete set of writes and the überblock update.

-- Anton