-r: ZFS's output aggregation mechanisms seem entirely adequate in terms of throughput, given that the ZIL should mask what would otherwise be poor disk utilization in the event of many small, synchronous writes. The problems are purely on the input side (just as they are with RAID-Z).
The read-side fragmentation problem occurs when an application writes at fine grain and subsequently reads at coarse grain, as in the example I mentioned of a tablespace which is updated at fine grain and then streamed back in bulk for sequential scans. Ironically, you already have part of a solution in the ZIL, at least if the fine-grained updates are small enough to place there: once an update is in the ZIL, you no longer need to worry about overwriting the original data in place, since you can simply reapply the ZIL images until they stick and update the checksums (if they're maintained - see earlier comments) accordingly. This ignores for the moment the impact on your snapshot facility, which is a drawback of block-oriented snapshots, but one you'll need to resolve if you ever want to defragment anything. It would also require using the ZIL as a conventional transaction log to protect the in-place update, but that's not much more of a stretch than its current small-update staging process.

ZFS does not appear to deal with such situations very well right now: either it uses coarse-grained checksumming, in which case each of those small (e.g., 4 KB) tablespace updates turns into a read/modify/write operation on a 128 KB entity, or it uses fine-grained (4 KB in this case) checksums, in which case these small blocks get spread all over the storage as they're individually updated and the subsequent sequential tablespace scans run at well under 1 MB/sec/disk (even worse if RAID-Z is used).

richard: Characterizing the disk-utilization problem as a classic big-block-vs.-small-block argument may be more a Unix mind-set issue than anything else: other file systems (including a few on Unix, for that matter) solve this by using extent-based allocation to aggregate many smaller (though still possibly variable-size) blocks into groups which can be streamed efficiently.
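To put rough numbers on the two checksum-granularity cases above (4 KB updates against 128 KB checksummed blocks, versus 4 KB blocks scattered by individual updates), here is a back-of-the-envelope sketch. The figure of 150 random I/Os per second per disk is purely my assumption for a typical spindle, and the function names are made up for illustration; nothing here is measured from ZFS.

```python
KB = 1024

def rmw_amplification(update_bytes, block_bytes):
    """Bytes moved (one block read plus one block written) per byte of
    payload when a small update forces a read/modify/write of the whole
    checksummed block."""
    return (2 * block_bytes) / update_bytes

def scattered_scan_rate(block_bytes, random_iops):
    """Bytes per second per disk when every block of a nominally
    sequential scan costs one random I/O because the blocks have been
    scattered by individual updates."""
    return block_bytes * random_iops

# ASSUMPTION: random I/Os per second a single disk can sustain.
ASSUMED_RANDOM_IOPS = 150

# Coarse-grained case: 4 KB update inside a 128 KB checksummed block.
amplification = rmw_amplification(4 * KB, 128 * KB)

# Fine-grained case: 4 KB blocks, one seek each during the scan.
scan_rate = scattered_scan_rate(4 * KB, ASSUMED_RANDOM_IOPS)

print(f"read/modify/write traffic per payload byte: {amplification:.0f}x")
print(f"scattered 4 KB scan rate: {scan_rate / KB:.0f} KB/sec/disk")
```

Under that assumption the coarse-grained case moves 64 bytes for every byte actually updated, and the fine-grained case scans at about 600 KB/sec/disk, which is where the "well under 1 MB/sec/disk" estimate comes from. A faster or slower disk changes the second figure proportionally.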
- bill