Richard Elling writes:

> On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
>
> > I find it baffling that RaidZ(2,3) was designed to split a record-size block into N (N = # of member devices) pieces and send the uselessly tiny requests to spinning rust, when we know the massive delays entailed in head seeks and rotational delay. The ZFS mirror and load-balanced configurations do the obviously correct thing: they don't split records, and they gain more by utilizing parallel access. I can't imagine the code path for RAIDZ would be so hard to fix.
>
> Knock yourself out :-)
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
>
> > I've read posts back to '06 and all I see are laments about the horrendous drop in IOPS, about sizing RAIDZ to ~4+P and trying to claw back performance by combining multiple such vdevs. I understand RAIDZ will never equal mirroring, but it could get damn close if it didn't break requests down and, better yet, utilized copies=N and properly placed the copies on disparate spindles. This is somewhat analogous to what the likes of 3PAR do, and it's not rocket science.
>
> That is not the issue for small, random reads. For all reads, the checksum is verified. When you spread the record across multiple disks, you need to read the record back from those disks. In general, this means that as long as the recordsize is larger than the requested small read, your performance will approach the N/(N-P) * IOPS limit. At the pathological edge, you can set recordsize to 512 bytes and you end up with mirroring (!) The small, random read performance model I developed only calculates the above IOPS limit and does not consider recordsize.
>
> The physical I/O is much more difficult to correlate to the logical I/O because of all the coalescing and caching that occurs at the lower levels of the stack.
>
> > An 8-disk mirror and a RAIDZ 8+2P w/ copies=2 give me the same amount of storage, but the latter is a hell of a lot more resilient, and max IOPS should be higher to boot. A non-broken-up RAIDZ 4+P would still be 1/2 the IOPS of the 8-disk mirror, but I'd at least save a bundle of coin in either reduced spindle count or slower drives.
> >
> > With all the great things ZFS is capable of, why hasn't this been redesigned long ago? What glaringly obvious truth am I missing?
>
> Performance, dependability, space: pick two.
>  -- richard
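To put rough numbers on the IOPS limit quoted above, here is a back-of-the-envelope sketch in Python. The 150 IOPS per spindle and the 10-wide raidz2 are assumptions, and the model deliberately ignores the caching and coalescing Richard mentions; it only illustrates why reading full-width records approaches the N/(N-P) * IOPS limit while a 512-byte recordsize behaves like a mirror.

DISK_IOPS = 150  # assumed small random read rate of one spindle

def mirror_read_iops(n_disks):
    # In a pool of striped mirrors every spindle can service an
    # independent small read.
    return n_disks * DISK_IOPS

def raidz_read_iops(n_disks, parity, cols_per_read):
    # Every logical read costs one physical read on 'cols_per_read'
    # disks (all data columns of the record, so the checksum can be
    # verified), so the pool's N * IOPS physical budget is divided by
    # that factor.
    return n_disks * DISK_IOPS / cols_per_read

print(mirror_read_iops(10))            # 1500
print(raidz_read_iops(10, 2, 10 - 2))  # 187.5 -> the N/(N-P) * IOPS limit
print(raidz_read_iops(10, 2, 1))       # 1500.0 -> recordsize=512: "mirroring"

So, under these assumptions, ten spindles as striped mirrors (or raidz2 at the 512-byte-recordsize edge) deliver on the order of 1500 small random reads/s, while full-width raidz2 records deliver roughly one tenth of that.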
If you store record X in a single column, as RAID-5 or -6 does, then you need to generate parity for that record X by grouping it with other, unrelated records Y, Z, T, etc. When X is freed in the filesystem, it still holds parity information protecting Y, Z and T, so you can't get rid of what was stored at X. If you try to store new data at X and in the associated parity but fail mid-stream, you hit the RAID-5 write hole. Moreover, now that X is no longer referenced by the filesystem, no checksum is associated with it, and if bit rot occurs in X and the disk holding Y dies, resilvering would generate garbage for Y. This seems to force us to chunk up the disks, with every unit checksummed even when freed. Secure deletion becomes a problem as well. And you can end up madly searching for free stripes, repositioning old blocks in partial stripes, even when the pool is only 10% full. Can one do this with RAID-DP?

http://blogs.sun.com/roch/entry/need_inodes

That said, I truly am for an evolution for random read workloads. RAID-Z on 4K sectors is quite appealing: it means that small objects become nearly mirrored, with good random read performance, while large objects are stored efficiently (a rough sketch of the space accounting is appended after the sig).

-r

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
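Appended sketch of the space accounting behind that last point, in Python. It assumes 4K (ashift=12) sectors and a hypothetical 10-wide raidz2, and it is only loosely modeled on vdev_raidz_asize() in the source linked above: a one-sector record carries a full parity sector per parity level, so it is effectively (P+1)-way mirrored and readable from a single disk, while a large record amortizes the parity over many data sectors.

import math

SECTOR = 4096  # assumed 4K sectors (ashift=12)

def raidz_asize_sectors(record_bytes, n_disks, parity):
    # Data sectors, plus one parity sector for every stripe row the
    # record occupies, rounded up to a multiple of parity+1 sectors so
    # freed segments stay allocatable -- a simplified take on
    # vdev_raidz_asize().
    data = math.ceil(record_bytes / SECTOR)
    rows = math.ceil(data / (n_disks - parity))
    total = data + rows * parity
    return math.ceil(total / (parity + 1)) * (parity + 1)

# A 4K record on a 10-wide raidz2: 1 data + 2 parity sectors, i.e. the
# block is effectively stored three times (near-mirrored), yet a read
# touches only one disk.
print(raidz_asize_sectors(4096, 10, 2))    # 3
# A 128K record: 32 data + 8 parity sectors, padded to 42 -- about 31%
# overhead, so large objects remain space-efficient.
print(raidz_asize_sectors(131072, 10, 2))  # 42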