On Tue, May 30, 2006 at 11:43:41AM -0500, Anton Rang wrote:
> There's actually three separate issues here.
>
> The first is the fixed root block.  This one may be a problem, but it
> may be easy enough to mark certain logical units in a pool as "no root
> block on this device."
I don't think that's very creative.  Another way is to have lots of
pre-allocated next-uberblock locations, so that the seek to any one
uberblock is always short.  Each uberblock can point to its predecessor
and its copies, and list the pre-allocated possible locations of its
successors.  You'd still need some well-known, non-COWed
uber-uberblocks, but those would need to be updated infrequently --
less frequently than once per transaction -- the trade-off being the
time it takes to find the latest set of uberblocks at mount.  (A toy
sketch of such a chained record is appended after this message.)

Data/meta-data on-disk separation doesn't seem to be the answer for
write performance.  It may make a big difference to separate the memory
allocations used for caching data vs. meta-data, though, and there must
be a reason why separation is being pursued by the IETF NFSv4 WG (see
pNFS).  But for local write performance it makes no sense to me.  It
could be that transactions are a problem, for all I know, since
transacting may mean punctuating physical writes.  But that seems like
a matter of trade-offs, and clearly it's better to have transactions
than not.

> The second is the allocation policy.  If ZFS used an allocate-forward
> policy, as QFS does, it should be able to avoid seeks.  Note that this
> is optimal for data capture but not for most other workloads, as it
> tends to spread data across the whole disk over time, rather than
> keeping it concentrated in a smaller region (with concomitant faster
> seek times).

The on-disk layout of ZFS does not dictate block allocation policies.

> The third is the write scheduling policy.  QFS, when used in data
> capture applications, uses direct I/O and hence issues writes in
> sequential block order.  ZFS should do the same to get peak
> performance from its devices for streaming (though intelligent
> devices can absorb some misordering, it is usually at some
> performance penalty).

Again, so far we're talking about potential improvements to the
implementation, not to the on-disk layout, with the possible exception
of fixed, well-known uberblock locations.

> >> (For what it's worth, the current 128K-per-I/O policy of ZFS really
> >> hurts its performance for large writes.  I imagine this would not
> >> be too difficult to fix if we allowed multiple 128K blocks to be
> >> allocated as a group.)
> >
> > I've been following the thread on this and that's not clear yet.
> >
> > Sure, the block size may be 128KB, but ZFS can bundle more than one
> > per-file/transaction
>
> But it doesn't right now, as far as I can tell.  I never see ZFS
> issuing a 16 MB write, for instance.  You simply can't get the same
> performance from a disk array issuing 128 KB writes that you can with
> 16 MB writes.  It's physically impossible because of protocol
> overhead, even if the controller itself were infinitely fast.
> (There's also the issue that at 128 KB, most disk arrays will choose
> to cache rather than stream the data, since it's less than a single
> RAID stripe, which slows you down.)

I'll leave this to Ron, you, et al. to hash out, but nothing in the
on-disk layout prevents ZFS from bundling _most_ of each transaction
(i.e., everything but the uberblock updates) as one large write,
AFAICT.

Nico
--
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
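Purely as an illustration of the pre-allocated-successor idea described
above -- none of this is real ZFS code or format; the record layout,
field names, slot count, and the toy in-memory "disk" are all invented
for the sketch -- a chained record and the mount-time walk might look
something like this in C:

/*
 * Hypothetical sketch only -- not the actual ZFS on-disk format.  Each
 * "uberblock"-style record names a fixed set of pre-allocated slots
 * where its successor may live, so a reader can chain forward from an
 * infrequently updated, well-known record instead of relying on a
 * single fixed root location.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NSLOTS   4          /* pre-allocated successor slots per record */
#define DISKSIZE 64         /* toy "device": an array of record slots   */

typedef struct ub {
    uint64_t txg;           /* transaction group; 0 means "slot empty"  */
    uint64_t pred;          /* slot index of the predecessor record     */
    uint64_t succ[NSLOTS];  /* candidate slot indices for the successor */
} ub_t;

static ub_t disk[DISKSIZE]; /* stands in for real device reads          */

/*
 * Starting from a known-valid record, probe its candidate successor
 * slots and follow the valid one with the highest txg; repeat until no
 * newer successor exists.  Returns the slot of the newest record found.
 */
static uint64_t
find_newest(uint64_t start)
{
    uint64_t cur = start;

    for (;;) {
        uint64_t best = cur;
        for (int i = 0; i < NSLOTS; i++) {
            uint64_t s = disk[cur].succ[i];
            if (s < DISKSIZE && disk[s].txg > disk[best].txg &&
                disk[s].pred == cur)
                best = s;
        }
        if (best == cur)        /* no newer successor: done */
            return (cur);
        cur = best;
    }
}

int
main(void)
{
    /* Build a tiny chain: slot 0 (txg 1) -> slot 5 (txg 2) -> slot 9 (txg 3). */
    memset(disk, 0, sizeof (disk));
    disk[0] = (ub_t){ .txg = 1, .pred = 0, .succ = { 5, 6, 7, 8 } };
    disk[5] = (ub_t){ .txg = 2, .pred = 0, .succ = { 9, 10, 11, 12 } };
    disk[9] = (ub_t){ .txg = 3, .pred = 5, .succ = { 13, 14, 15, 16 } };

    printf("newest record is in slot %llu (txg %llu)\n",
        (unsigned long long)find_newest(0),
        (unsigned long long)disk[find_newest(0)].txg);
    return (0);
}

In this sketch the well-known record only needs rewriting when a fresh
set of successor slots has to be pre-allocated; ordinary transactions
just fill one of the existing slots, and the cost, as noted above, is
the chain walk at mount time.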
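And, again only as a hedged illustration of the 128K-vs-16MB point
above (this is not ZFS code; the file name, sizes, and buffering are
made up for the demo), the bundling argument amounts to issuing many
logically separate 128K blocks to the device as a single large write:

/*
 * Hypothetical sketch: 128 separate 128K blocks reach the device as one
 * 16 MB write because they are laid out contiguously and issued in a
 * single call, rather than as 128 individual 128K writes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

#define BLKSZ   (128 * 1024)   /* the 128K record size discussed above */
#define NBLKS   128            /* 128 x 128K = one 16 MB write         */

int
main(void)
{
    char *buf = malloc((size_t)BLKSZ * NBLKS);
    ssize_t n;
    int fd, i;

    if (buf == NULL)
        return (1);

    /*
     * Pretend each 128K region is a separately allocated block; a real
     * implementation would gather them with scatter/gather I/O instead
     * of copying them into one buffer.
     */
    for (i = 0; i < NBLKS; i++)
        memset(buf + (size_t)i * BLKSZ, i & 0xff, BLKSZ);

    fd = open("bundle-demo.dat", O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd == -1) {
        perror("open");
        return (1);
    }

    /* One large write instead of NBLKS small ones. */
    n = write(fd, buf, (size_t)BLKSZ * NBLKS);
    printf("issued one write of %zd bytes\n", n);

    close(fd);
    free(buf);
    return (0);
}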