On May 30, 2006, at 11:25 AM, Nicolas Williams wrote:
On Tue, May 30, 2006 at 08:13:56AM -0700, Anton B. Rang wrote:
Well, I don't know about his particular case, but many QFS clients
have found the separation of data and metadata to be invaluable. The
primary reason is that it avoids disk seeks. We have QFS customers who
                                             ^^^^^^^^^^^^^^^^^^^^^^^^^

Are you talking about reads or writes?
Writes -- that's what's important for data capture, which is where I
entered this thread. ;-) Sorry for the confusion.
So we're talking about writes then, in which case ZFS should not seek
because there are no fixed inode locations (there are fixed root block
locations though).
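As a rough illustration of that point (this is not ZFS code; the offsets,
sizes, and structures below are made up), compare a fixed-location inode
update, which forces a seek back into the inode table, with a
copy-on-write update that simply lands at the current allocation cursor:

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t cursor;               /* next free byte on the device */

    /* Traditional layout: inode N always lives at a fixed offset,
     * so updating it means seeking away from the write stream. */
    static uint64_t update_in_place(uint64_t inode_num)
    {
        return 4096 + inode_num * 256;    /* hypothetical inode table layout */
    }

    /* Copy-on-write: the new version of the block goes wherever the
     * allocation cursor happens to be, i.e. right behind the data. */
    static uint64_t copy_on_write(uint64_t size)
    {
        uint64_t off = cursor;
        cursor += size;
        return off;
    }

    int main(void)
    {
        cursor = 1 << 20;                 /* pretend 1 MB of data is already on disk */
        printf("in-place inode update at offset %llu (seek backwards)\n",
               (unsigned long long)update_in_place(42));
        printf("copy-on-write update at offset  %llu (stays with the stream)\n",
               (unsigned long long)copy_on_write(512));
        return 0;
    }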
There are actually three separate issues here.
The first is the fixed root block. This one may be a problem, but it
may be easy enough to mark certain logical units in a pool as "no root
block on this device."
The second is the allocation policy. If ZFS used an allocate-forward
policy, as QFS does, it should be able to avoid seeks. Note that this
is optimal for data capture but not for most other workloads, as it
tends to spread data across the whole disk over time, rather than
keeping it concentrated in a smaller region (with concomitant faster
seek times).
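For what an allocate-forward policy amounts to, here is a minimal sketch
(not the actual QFS allocator; the block size and device size are made
up): a single cursor that only advances, so back-to-back allocations come
out physically contiguous and a streaming writer never has to seek.

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_SIZE     (128 * 1024)        /* illustrative allocation unit */
    #define DEVICE_BLOCKS  (1024ULL * 1024)    /* ~128 GB device, for illustration */

    static uint64_t cursor;                    /* next free block; never moves backward */

    /* Allocate an extent of nblocks; returns the starting block, or -1 when full. */
    static int64_t alloc_forward(uint64_t nblocks)
    {
        if (cursor + nblocks > DEVICE_BLOCKS)
            return -1;                         /* purely forward: no wrap, no best-fit search */
        uint64_t start = cursor;
        cursor += nblocks;
        return (int64_t)start;
    }

    int main(void)
    {
        /* Three back-to-back 1 MB allocations land on consecutive blocks. */
        for (int i = 0; i < 3; i++) {
            int64_t blk = alloc_forward(8);
            printf("extent %d starts at block %lld (byte offset %llu)\n",
                   i, (long long)blk,
                   (unsigned long long)((uint64_t)blk * BLOCK_SIZE));
        }
        return 0;
    }

The downside noted above is visible here too: the cursor only marches
toward the end of the device, so over time data ends up spread across
its whole surface rather than concentrated in a small region.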
The third is the write scheduling policy. QFS, when used in data
capture applications, uses direct I/O and hence issues writes in
sequential block order. ZFS should do the same to get peak performance
from its devices for streaming (though intelligent devices can absorb
some misordering, usually at some performance penalty).
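As a concrete sketch of that data-capture write pattern (assuming a
Linux-style O_DIRECT flag; QFS on Solaris would use directio(3C)
instead, and the path and sizes here are made up), the writer issues
large, aligned writes at strictly increasing offsets:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define IO_SIZE   (16 * 1024 * 1024)   /* one large 16 MB write per command */
    #define N_WRITES  8

    int main(void)
    {
        int fd = open("/capture/stream.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
        if (fd < 0) { perror("open"); return 1; }

        void *buf;
        if (posix_memalign(&buf, 4096, IO_SIZE) != 0) {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        memset(buf, 0xab, IO_SIZE);

        /* Offsets are strictly ascending: no seeks, nothing for the
         * filesystem or the device to reorder. */
        for (off_t i = 0; i < N_WRITES; i++) {
            if (pwrite(fd, buf, IO_SIZE, i * (off_t)IO_SIZE) != IO_SIZE) {
                perror("pwrite");
                return 1;
            }
        }
        close(fd);
        free(buf);
        return 0;
    }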
(For what it's worth, the current 128K-per-I/O policy of ZFS really
hurts its performance for large writes. I imagine this would not be
too difficult to fix if we allowed multiple 128K blocks to be
allocated as a group.)
I've been following the thread on this and that's not clear yet.
Sure, the block size may be 128KB, but ZFS can bundle more than one
block per file/transaction.
But it doesn't right now, as far as I can tell. I never see ZFS issuing
a 16 MB write, for instance. You simply can't get the same performance
from a disk array issuing 128 KB writes that you can with 16 MB writes.
It's physically impossible because of protocol overhead, even if the
controller itself were infinitely fast. (There's also the issue that at
128 KB, most disk arrays will choose to cache rather than stream the
data, since it's less than a single RAID stripe, which slows you down.)
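A back-of-the-envelope calculation shows why the I/O size matters so
much here. The per-command overhead and link rate below are assumed,
illustrative numbers, not measurements of any particular array:

    #include <stdio.h>

    int main(void)
    {
        const double total_mb       = 16.0;     /* amount to transfer */
        const double wire_mb_per_s  = 400.0;    /* assumed raw array/link rate */
        const double cmd_overhead_s = 0.0002;   /* assumed 0.2 ms per command */

        double io_sizes_kb[] = { 128, 1024, 16384 };
        for (int i = 0; i < 3; i++) {
            double ncmds    = total_mb * 1024.0 / io_sizes_kb[i];
            double transfer = total_mb / wire_mb_per_s;
            double total    = transfer + ncmds * cmd_overhead_s;
            printf("%6.0f KB I/Os: %4.0f commands, effective %.0f MB/s\n",
                   io_sizes_kb[i], ncmds, total_mb / total);
        }
        return 0;
    }

With those assumptions, moving 16 MB as 128 KB commands loses almost 40%
of the raw rate to command handling alone, while a single 16 MB command
gets within a few percent of it.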
-- Anton