On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
> I find it baffling that RAIDZ(2,3) was designed to split a
> record-size block into N (N = # of member devices) pieces and send
> the uselessly tiny requests to spinning rust, when we know the
> massive delays entailed in head seeks and rotational delay. The
> ZFS-mirror and load-balanced configurations do the obviously correct
> thing: they don't split records, and they gain more by utilizing
> parallel access. I can't imagine the code path for RAIDZ would be so
> hard to fix.
Knock yourself out :-)
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
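As a toy illustration of the split being described above (a simplified sketch only; the real column layout is computed by vdev_raidz_map_alloc() in the file linked, and this ignores sector rounding, uneven columns, and skip blocks):

# Rough model of how RAIDZ splits one record across a vdev's disks.
def per_disk_io_size(recordsize, n_disks, parity):
    """Approximate data chunk each disk services for one record."""
    data_disks = n_disks - parity
    return recordsize // data_disks

# A 128 KiB record on a 10-wide RAIDZ2 becomes 8 reads of ~16 KiB...
print(per_disk_io_size(128 * 1024, 10, 2))  # 16384 bytes
# ...and an 8 KiB record on a 5-wide RAIDZ1 becomes 4 reads of ~2 KiB.
print(per_disk_io_size(8 * 1024, 5, 1))     # 2048 bytes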
> I've read posts back to '06, and all I see are laments about the
> horrendous drop in IOPS, about sizing RAIDZ to ~4+P, and about trying
> to claw back performance by combining multiple such vdevs. I
> understand RAIDZ will never equal mirroring, but it could get damn
> close if it didn't break requests down and, better yet, utilized
> copies=N and properly placed the copies on disparate spindles. This
> is somewhat analogous to what the likes of 3PAR do, and it's not
> rocket science.
That is not the issue for small, random reads. The checksum is verified
on every read, and it covers the whole record: when you spread a record
across multiple disks, you need to read it back from all of those disks
before the checksum can be verified. In general, this means that as long
as the recordsize is larger than the requested small read, your
performance will approach the N/(N-P) * IOPS limit, where IOPS is the
per-disk rate. At the pathological edge, you can set recordsize to 512
bytes and you end up with mirroring (!)

The small, random read performance model I developed only calculates the
above IOPS limit and does not consider recordsize.
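As a rough illustration of that limit (a back-of-the-envelope sketch, not Richard's published model; it assumes ~150 IOPS per 7200 RPM spindle and ignores the caching and coalescing mentioned below):

def raidz_small_read_iops(n_disks, parity, disk_iops):
    # Every logical read touches all N-P data columns, so the vdev's
    # N * disk_iops of raw seeks yield N/(N-P) * disk_iops logical reads.
    return n_disks / (n_disks - parity) * disk_iops

def mirror_small_read_iops(n_disks, disk_iops):
    # Each logical read is served by a single spindle, so reads scale
    # with the full spindle count.
    return n_disks * disk_iops

disk_iops = 150  # assumed ~7200 RPM drive
print(raidz_small_read_iops(5, 1, disk_iops))   # 4+P RAIDZ1: 187.5
print(mirror_small_read_iops(8, disk_iops))     # 8-disk mirror: 1200.0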
The physical I/O is much more difficult to correlate to the logical I/O
because of all of the coalescing and caching that occurs at the lower
levels of the stack.
> An 8-disk mirror and a RAIDZ 8+2P with copies=2 give me the same
> amount of storage, but the latter is a hell of a lot more resilient,
> and max IOPS should be higher to boot. A non-broken-up RAIDZ 4+P
> would still be 1/2 the IOPS of the 8-disk mirror, but I'd at least
> save a bundle of coin in either reduced spindle count or slower
> drives.
>
> With all the great things ZFS is capable of, why hasn't this been
> redesigned long ago? What glaringly obvious truth am I missing?
Performance, dependability, space: pick two.
-- richard
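For reference, the storage arithmetic behind the 8-disk mirror vs. RAIDZ 8+2P comparison above, as a sketch assuming 2-way mirrors, identical drives, and no metadata or allocation overhead:

def mirror_usable(n_disks, ways=2):
    # A 2-way mirror exposes half of the raw spindles as usable space.
    return n_disks / ways

def raidz_usable(n_disks, parity, copies=1):
    # Parity disks are overhead; copies=N divides what's left by N.
    return (n_disks - parity) / copies

print(mirror_usable(8))               # 4.0 disks of usable space
print(raidz_usable(10, 2, copies=2))  # 4.0 disks of usable space

Both layouts net four disks' worth of space under these assumptions, though the RAIDZ variant needs ten spindles rather than eight to get there.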