On Jan 3, 2010, at 11:27 PM, matthew patton wrote:

I find it baffling that RAIDZ(2,3) was designed to split a record-size block into N (N = # of member devices) pieces and send the uselessly tiny requests to spinning rust, when we know the massive delays entailed in head seeks and rotational delay. ZFS mirrors and load-balanced configurations do the obviously correct thing: they don't split records, and they gain more by utilizing parallel access. I can't imagine the code path for RAIDZ would be so hard to fix.

Knock yourself out :-)
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
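
(A drastically simplified sketch of what that code does with one logical block. The real vdev_raidz_map_alloc() also handles skip sectors, asize rounding, and rotating parity placement, so treat the names and layout here as illustration, not the actual logic:)

    /*
     * Illustrative only: how a logical block of `size` bytes gets split
     * across a raidz vdev with `ndisks` children and `nparity` parity
     * columns. Every data column below becomes its own small disk I/O.
     */
    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
            uint64_t size    = 128 * 1024;  /* 128K recordsize block */
            uint64_t ashift  = 9;           /* 512-byte sectors */
            uint64_t ndisks  = 6;           /* children in the raidz vdev */
            uint64_t nparity = 2;           /* raidz2 */

            uint64_t sectors = size >> ashift;      /* sectors in the block */
            uint64_t ndata   = ndisks - nparity;    /* data columns */
            uint64_t q = sectors / ndata;   /* base sectors per data column */
            uint64_t r = sectors % ndata;   /* first r data columns get +1 */

            for (uint64_t c = 0; c < ndisks; c++) {
                    uint64_t n;
                    if (c < nparity)                /* parity columns */
                            n = q + (r == 0 ? 0 : 1);
                    else                            /* data columns */
                            n = q + ((c - nparity) < r ? 1 : 0);
                    printf("disk %llu: %llu sectors (%llu bytes)\n",
                        (unsigned long long)c, (unsigned long long)n,
                        (unsigned long long)(n << ashift));
            }
            /*
             * The thread's complaint: all of these per-disk reads (and
             * the seeks behind them) must complete before the block's
             * checksum can be verified.
             */
            return (0);
    }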

I've read posts back to '06, and all I see is lamenting about the horrendous drop in IOPS, about sizing RAIDZ to ~4+P, and about trying to claw back performance by combining multiple such vdevs. I understand RAIDZ will never equal mirroring, but it could get damn close if it didn't break requests down and, better yet, utilized copies=N and properly placed the copies on disparate spindles. This is somewhat analogous to what the likes of 3PAR do, and it's not rocket science.

That is not the issue for small, random reads. For all reads, the checksum is verified, so when you spread a record across multiple disks, you have to read the whole record back from all of those disks. In general, this means that as long as the recordsize is larger than the requested small read, your performance will approach a limit of N/(N-P) times the IOPS of a single disk. At the pathological edge, you can set recordsize to 512 bytes and you end up with mirroring (!). The small, random read performance model I developed only calculates the above IOPS limit; it does not consider recordsize.
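
(To put numbers on that limit, a toy calculation under the same model; the 150 IOPS per-disk figure is just an assumed example:)

    /*
     * Small, random read IOPS limit under the model above: each logical
     * read must touch all N-P data columns, so a vdev of N disks tops
     * out near N/(N-P) times the IOPS of one disk.
     */
    #include <stdio.h>

    static double
    raidz_read_iops(int n, int p, double disk_iops)
    {
            return ((double)n / (n - p) * disk_iops);
    }

    int
    main(void)
    {
            double disk_iops = 150.0;       /* assumed 7200rpm-class drive */

            /* raidz1 4+P: 5/4 * 150 = 187.5 IOPS for the whole vdev */
            printf("raidz 4+1:      %.1f\n", raidz_read_iops(5, 1, disk_iops));
            /* raidz2 8+2P: 10/8 * 150 = 187.5 IOPS */
            printf("raidz 8+2:      %.1f\n", raidz_read_iops(10, 2, disk_iops));
            /* 2-way mirror is the degenerate case: 2/1 * 150 = 300 IOPS */
            printf("2-way mirror:   %.1f\n", raidz_read_iops(2, 1, disk_iops));
            /* 8 disks as 4 x 2-way mirrors, each pair seeking independently */
            printf("4 mirror pairs: %.1f\n", 4 * raidz_read_iops(2, 1, disk_iops));
            return (0);
    }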

The physical I/O is much more difficult to correlate to the logical I/O because of all of the coalescing and caching that occurs at the lower levels of the stack.

An 8-disk mirror and a RAIDZ 8+2P with copies=2 give me the same amount of usable storage, but the latter is a hell of a lot more resilient, and max IOPS should be higher to boot. A non-broken-up RAIDZ 4+P would still be half the IOPS of the 8-disk mirror, but I'd at least save a bundle of coin in either reduced spindle count or slower drives.
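
(Working that comparison through under the read model above, with the same assumed 150 IOPS per disk: the 8-disk mirror, as four 2-way pairs, nets 4 disks of space and roughly 4 * 2 * 150 = 1200 small-read IOPS, while the 10-disk RAIDZ2 with copies=2 also nets 4 disks of space but tops out near 10/8 * 150 ≈ 188 IOPS per vdev; even if reads could be balanced across both copies, that is still well under the mirror's figure. The extra resilience is real, but the "max IOPS should be higher" expectation is exactly what the record splitting defeats.)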

With all the great things ZFS is capable of, why hasn't this been redesigned long ago? What glaringly obvious truth am I missing?

Performance, dependability, space: pick two.
 -- richard
