-r:

ZFS's output aggregation mechanisms seem entirely adequate in terms of 
throughput, given that the ZIL should mask what would otherwise be poor disk 
utilization in the event of many small, synchronous writes.  The problems are 
purely on the input side (just as they are with RAID-Z).

The read-side fragmentation problem occurs when an application writes at fine 
grain and subsequently reads at coarse grain, as I mentioned in the example of 
a tablespace which is updated at fine grain and then streamed back in bulk for 
sequential scans.  Ironically, you already have part of a solution in the ZIL, 
at least if the fine-grained updates are small enough to place there:  once an 
update is in the ZIL, you no longer need to worry about overwriting the 
original data in place, since you can simply reapply the ZIL images until they 
stick and update the checksums (if they're maintained - see earlier comments) 
accordingly.  This would require using the ZIL as a conventional transaction 
log to protect the overwrite, but that's not much more of a stretch than its 
current small-update staging role.  (I'm ignoring for the moment the impact on 
your snapshot facility - a drawback of block-oriented snapshots, but one 
you'll need to resolve if you ever want to defragment anything.)

ZFS does not appear to deal with such situations very well right now:  either 
it uses coarse-grained checksumming, in which case each of those small (e.g., 4 
KB) tablespace updates turns into a read/modify/write operation on a 128 KB 
entity, or it uses fine-grained (4 KB in this case) checksums in which case 
these small blocks get spread all over the storage as they're individually 
updated and the subsequent sequential tablespace scans run at well under 1 
MB/sec/disk (even worse if RAID-Z is used).
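To put rough numbers on both halves of that trade-off, here is a 
back-of-envelope sketch.  The 4 KB and 128 KB sizes come from the example 
above; the 150 random I/Os per second per disk is purely my assumption for a 
typical spinning disk, and the read/modify/write figure assumes the 128 KB 
record is not already cached, so the read really does hit the disk.  The 
function names are made up for illustration.

```python
KB = 1024

def rmw_amplification(update_bytes, record_bytes):
    """Bytes moved (one full read plus one full write of the record)
    per byte the application actually changed."""
    return (2 * record_bytes) / update_bytes

def fragmented_scan_rate(block_bytes, random_iops):
    """Bytes/sec for a scan in which every block costs one random I/O."""
    return block_bytes * random_iops

# Coarse-grained case: a 4 KB update inside a 128 KB checksummed record.
amp = rmw_amplification(4 * KB, 128 * KB)

# Fine-grained case: 4 KB blocks scattered across the disk.
# 150 IOPS is an assumed figure, not a measurement.
rate = fragmented_scan_rate(4 * KB, 150)

print(f"read/modify/write traffic per updated byte: {amp:.0f}x")
print(f"fragmented scan: {rate / (KB * KB):.2f} MB/s per disk")
```

Under those assumptions the coarse-grained case moves 64 bytes for every byte 
updated, and the fine-grained case scans at roughly 0.6 MB/sec/disk, which is 
in line with the "well under 1 MB/sec/disk" figure above.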

richard:

Characterizing the disk-utilization problem as a classic 
big-block-vs.-small-block argument may be more a Unix mind-set issue than 
anything else:  other file systems (including a few on Unix, for that matter) 
solve this by using extent-based allocation to aggregate many smaller (though 
still possibly variable-size) blocks into groups which can be streamed 
efficiently.
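One way to picture the extent idea is as a mapping that describes runs of 
adjacent blocks by (start, length) rather than block by block.  The sketch 
below is only an illustration of that bookkeeping, not how any particular 
file system implements it; the function name and the use of plain integer 
block numbers are my own assumptions.

```python
def coalesce_extents(block_numbers):
    """Collapse a set of block numbers into (start, length) extents,
    one extent per run of consecutive blocks."""
    extents = []
    for b in sorted(set(block_numbers)):
        if extents and b == extents[-1][0] + extents[-1][1]:
            # Block directly follows the last extent: grow it by one.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # Gap before this block: start a new extent.
            extents.append((b, 1))
    return extents

# Six small blocks that happen to form two contiguous runs.
runs = coalesce_extents([7, 3, 4, 5, 9, 8])
print(runs)
```

The fewer and longer the extents that come back for a file, the more of it 
can be read with large sequential transfers instead of one seek per block.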

- bill
 
 
This message posted from opensolaris.org