Richard Elling writes:

> On Jan 3, 2010, at 11:27 PM, matthew patton wrote:
>
> > I find it baffling that RaidZ(2,3) was designed to split a record-size block into N (N = # of member devices) pieces and send the uselessly tiny requests to spinning rust, when we know the massive delays entailed in head seeks and rotational delay. The ZFS mirror and load-balanced configurations do the obviously correct thing: they don't split records, and they gain more by utilizing parallel access. I can't imagine the code path for RAIDZ would be so hard to fix.
>
> Knock yourself out :-)
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/vdev_raidz.c
>
> > I've read posts back to '06 and all I see are laments about the horrendous drop in IOPS, about sizing RAIDZ to ~4+P and trying to claw back performance by combining multiple such vdevs. I understand RAIDZ will never equal mirroring, but it could get damn close if it didn't break requests down and, better yet, utilized copies=N and properly placed the copies on disparate spindles. This is somewhat analogous to what the likes of 3PAR do, and it's not rocket science.
>
> That is not the issue for small, random reads. For all reads, the checksum is verified. When you spread the record across multiple disks, you need to read the record back from those disks. In general, this means that as long as the recordsize is larger than the requested small read, your performance will approach the N/(N-P) * IOPS limit. At the pathological edge, you can set recordsize to 512 bytes and you end up with mirroring (!) The small, random read performance model I developed only calculates the above IOPS limit and does not consider recordsize.
>
> The physical I/O is much more difficult to correlate to the logical I/O because of all the coalescing and caching that occurs at the lower levels of the stack.
>
> > An 8-disk mirror and a RAIDZ 8+2P w/ copies=2 give me the same amount of storage, but the latter is a hell of a lot more resilient, and max IOPS should be higher to boot. A non-broken-up RAIDZ 4+P would still be 1/2 the IOPS of the 8-disk mirror, but I'd at least save a bundle of coin in either reduced spindle count or slower drives.
> >
> > With all the great things ZFS is capable of, why hasn't this been redesigned long ago? What glaringly obvious truth am I missing?
>
> Performance, dependability, space: pick two.
>  -- richard
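To put rough numbers on the IOPS limit quoted above, here is a back-of-the-envelope sketch in Python. The 150 IOPS per spindle and the 10-wide raidz2 are assumptions, and the model deliberately ignores the caching and coalescing Richard mentions; it only illustrates why reading full-width records approaches the N/(N-P) * IOPS limit while a 512-byte recordsize behaves like a mirror.

DISK_IOPS = 150  # assumed small random read rate of one spindle

def mirror_read_iops(n_disks):
    # In a pool of striped mirrors every spindle can service an
    # independent small read.
    return n_disks * DISK_IOPS

def raidz_read_iops(n_disks, parity, cols_per_read):
    # Every logical read costs one physical read on 'cols_per_read'
    # disks (all data columns of the record, so the checksum can be
    # verified), so the pool's N * IOPS physical budget is divided by
    # that factor.
    return n_disks * DISK_IOPS / cols_per_read

print(mirror_read_iops(10))            # 1500
print(raidz_read_iops(10, 2, 10 - 2))  # 187.5 -> the N/(N-P) * IOPS limit
print(raidz_read_iops(10, 2, 1))       # 1500.0 -> recordsize=512: "mirroring"

So, under these assumptions, ten spindles as striped mirrors (or raidz2 at the 512-byte-recordsize edge) deliver on the order of 1500 small random reads/s, while full-width raidz2 records deliver roughly one tenth of that.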
If you store record X in a single column, as RAID-5 or -6 does, then you need to generate parity for that record X by grouping it with other, unrelated records Y, Z, T, etc. When X is freed in the filesystem, it still holds parity information protecting Y, Z and T, so you can't get rid of what was stored at X. If you try to store new data at X and in the associated parity but fail mid-stream, you hit the RAID-5 write hole. Moreover, now that X is no longer referenced by the filesystem, no checksum is associated with it, and if bit rot occurs in X and the disk holding Y dies, resilvering would generate garbage for Y. This seems to force us to chunk up the disks, with every unit checksummed even when freed. Secure deletion becomes a problem as well. And you can end up madly searching for free stripes, repositioning old blocks in partial stripes, even when the pool is only 10% full. Can one do this with RAID-DP?

http://blogs.sun.com/roch/entry/need_inodes

That said, I truly am for an evolution for random read workloads. RAID-Z on 4K sectors is quite appealing: it means that small objects become nearly mirrored, with good random read performance, while large objects are stored efficiently (a rough sketch of the space accounting is appended after the sig).

-r

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
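Appended sketch of the space accounting behind that last point, in Python. It assumes 4K (ashift=12) sectors and a hypothetical 10-wide raidz2, and it is only loosely modeled on vdev_raidz_asize() in the source linked above: a one-sector record carries a full parity sector per parity level, so it is effectively (P+1)-way mirrored and readable from a single disk, while a large record amortizes the parity over many data sectors.

import math

SECTOR = 4096  # assumed 4K sectors (ashift=12)

def raidz_asize_sectors(record_bytes, n_disks, parity):
    # Data sectors, plus one parity sector for every stripe row the
    # record occupies, rounded up to a multiple of parity+1 sectors so
    # freed segments stay allocatable -- a simplified take on
    # vdev_raidz_asize().
    data = math.ceil(record_bytes / SECTOR)
    rows = math.ceil(data / (n_disks - parity))
    total = data + rows * parity
    return math.ceil(total / (parity + 1)) * (parity + 1)

# A 4K record on a 10-wide raidz2: 1 data + 2 parity sectors, i.e. the
# block is effectively stored three times (near-mirrored), yet a read
# touches only one disk.
print(raidz_asize_sectors(4096, 10, 2))    # 3
# A 128K record: 32 data + 8 parity sectors, padded to 42 -- about 31%
# overhead, so large objects remain space-efficient.
print(raidz_asize_sectors(131072, 10, 2))  # 42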