On Tue, May 30, 2006 at 11:43:41AM -0500, Anton Rang wrote:
> There's actually three separate issues here.
> 
> The first is the fixed root block.  This one may be a problem, but it
> may be easy enough to mark certain logical units in a pool as "no root
> block on this device."

I don't think that's very creative.  Another way is to have lots of
pre-allocated next-uberblock locations, so that the seek to any one
uberblock is always short.  Each uberblock can point to its predecessor
and its copies and list the pre-allocated possible locations of its
successors.  You'd still need some well-known, non-COWed uber-
uberblocks, but these would need to be updated infrequently -- less
often than once per transaction, the trade-off being the time it takes
to find the latest set of uberblocks at mount.
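
Roughly what I have in mind, as a sketch (the struct and field names
below are made up for illustration; this is not the real ZFS uberblock
layout):

    /*
     * Hypothetical uberblock that records where its predecessor lives
     * and pre-allocates a handful of candidate locations for its
     * successor, so the next transaction's uberblock write is always a
     * short seek away.
     */
    #include <stdint.h>

    #define UB_NCOPIES      3
    #define UB_NSUCCESSORS  8

    typedef struct candidate_uberblock {
            uint64_t ub_txg;                      /* txg this uberblock commits */
            uint64_t ub_prev_lba;                 /* location of predecessor */
            uint64_t ub_copy_lba[UB_NCOPIES];     /* locations of this block's copies */
            uint64_t ub_next_lba[UB_NSUCCESSORS]; /* pre-allocated successor slots */
            uint64_t ub_checksum;                 /* newest self-consistent one wins */
    } candidate_uberblock_t;

At mount you'd read the well-known uber-uberblocks and then walk the
successor slots forward until the newest valid uberblock is found.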

Data/meta-data on-disk separation doesn't seem to be the answer for
write performance.  It may make a big difference to separate the memory
allocations used for caching data vs. meta-data, though, and there must
be a reason this is being pursued by the IETF NFSv4 WG (see pNFS).  But
for local write performance it makes no sense to me.

It could be that transactions are a problem, though, for all I know,
since transacting may mean punctuating physical writes.  But this
seems like a matter of trade-offs, and clearly it's better to have
transactions than not.

> The second is the allocation policy.  If ZFS used an allocate-forward
> policy, as QFS does, it should be able to avoid seeks.  Note that this
> is optimal for data capture but not for most other workloads, as it
> tends to spread data across the whole disk over time, rather than
> keeping it concentrated in a smaller region (with concomitant faster
> seek times).

The on-disk layout of ZFS does not dictate block allocation policies.
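
For illustration only, an allocate-forward policy is little more than a
cursor over free space; a sketch with made-up types (this is neither
QFS nor ZFS code):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct space_cursor {
            uint64_t sc_cursor;   /* next LBA to try */
            uint64_t sc_size;     /* device size in blocks */
            /* free-space query, assumed to exist elsewhere */
            int (*sc_is_free)(uint64_t lba, uint64_t nblocks);
    } space_cursor_t;

    /* Returns the starting LBA of the allocation, or UINT64_MAX if none. */
    uint64_t
    alloc_forward(space_cursor_t *sc, uint64_t nblocks)
    {
            uint64_t lba = sc->sc_cursor;
            uint64_t tried = 0;

            while (tried < sc->sc_size) {
                    if (lba + nblocks > sc->sc_size) {
                            /* wrap at the end of the device */
                            tried += sc->sc_size - lba;
                            lba = 0;
                            continue;
                    }
                    if (sc->sc_is_free(lba, nblocks)) {
                            /* cursor only ever moves forward */
                            sc->sc_cursor = lba + nblocks;
                            return (lba);
                    }
                    lba += nblocks;
                    tried += nblocks;
            }
            return (UINT64_MAX);
    }

Writes for a sequential capture then land in ascending block order, at
the cost of spreading data across the whole device over time, which is
exactly the trade-off Anton describes.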

> The third is the write scheduling policy.  QFS, when used in data
> capture applications, uses direct I/O and hence issues writes in
> sequential block order.  ZFS should do the same to get peak performance
> from its devices for streaming (though intelligent devices can absorb
> some misordering, it is usually at some performance penalty).

Again, that's an implementation matter, not a constraint of the on-disk
layout.
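
To make the write-scheduling point concrete, here's a sketch of issuing
a transaction's writes in ascending block order (hypothetical
structures and issue_write(); this is not the actual ZFS I/O pipeline):

    #include <stdint.h>
    #include <stdlib.h>

    typedef struct pending_write {
            uint64_t pw_lba;    /* starting block on the device */
            size_t   pw_len;    /* length in bytes */
            void    *pw_data;   /* buffer to write */
    } pending_write_t;

    static int
    pw_compare(const void *a, const void *b)
    {
            const pending_write_t *wa = a, *wb = b;

            if (wa->pw_lba < wb->pw_lba)
                    return (-1);
            return (wa->pw_lba > wb->pw_lba);
    }

    /* Sort the batch by LBA, then issue it front to back. */
    void
    issue_in_block_order(pending_write_t *writes, size_t count,
        void (*issue_write)(const pending_write_t *))
    {
            size_t i;

            qsort(writes, count, sizeof (pending_write_t), pw_compare);
            for (i = 0; i < count; i++)
                    issue_write(&writes[i]);
    }

A smart array can absorb some misordering, as Anton notes, but sorting
before issue avoids paying that penalty at all.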

So far we're talking about potential improvements to the implementation,
not the on-disk layout, with the possible exception of fixed well-known
uberblock locations.

> >>(For what it's worth, the current 128K-per-I/O policy of ZFS really
> >>hurts its performance for large writes. I imagine this would not be
> >>too difficult to fix if we allowed multiple 128K blocks to be
> >>allocated as a group.)
> >
> >I've been following the thread on this and that's not clear yet.
> >
> >Sure, the block size may be 128KB, but ZFS can bundle more than one
> >per-file/transaction
> 
> But it doesn't right now, as far as I can tell.  I never see ZFS issuing
> a 16 MB write, for instance.  You simply can't get the same performance
> from a disk array issuing 128 KB writes that you can with 16 MB writes.
> It's physically impossible because of protocol overhead, even if the
> controller itself were infinitely fast.  (There's also the issue that at
> 128 KB, most disk arrays will choose to cache rather than stream the
> data, since it's less than a single RAID stripe, which slows you down.)

I'll leave this to Ron, you, et al. to hash out, but nothing in the
on-disk layout prevents ZFS from bundling _most_ of each transaction
(i.e., everything but the uberblock updates) as one large write, AFAICT.
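
As a sketch of what such bundling could look like (made-up structures
and issue_write(); not ZFS code, and gathering the data buffers into a
single scatter/gather I/O is glossed over):

    #include <stdint.h>
    #include <stddef.h>

    #define BLKSZ   (128 * 1024)

    typedef struct txg_block {
            uint64_t tb_lba;    /* device address, in units of BLKSZ */
            void    *tb_data;   /* 128K of dirty data */
    } txg_block_t;

    /* blocks[] must already be sorted by tb_lba. */
    void
    issue_coalesced(txg_block_t *blocks, size_t count,
        void (*issue_write)(uint64_t lba, size_t nblocks))
    {
            size_t run_start = 0;
            size_t i;

            for (i = 1; i <= count; i++) {
                    /* extend the run while blocks are physically adjacent */
                    if (i < count &&
                        blocks[i].tb_lba == blocks[i - 1].tb_lba + 1)
                            continue;
                    /* one large write covers the whole contiguous run */
                    issue_write(blocks[run_start].tb_lba, i - run_start);
                    run_start = i;
            }
    }

A transaction whose dirty blocks were allocated forward and
contiguously would then go out as a few large writes rather than a
stream of 128KB ones.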

Nico