Chris Siebenmann <c...@cs.toronto.edu> wrote:

>  People have already mentioned the RAID-[56] write hole, but it's more
> than that; in a never-overwrite system with multiple blocks in one RAID
> stripe, how do you handle updates to some of the blocks?
>
>  See:
>     http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSSensibleRAID

Oh, that's easy; NetApp has been doing this forever. A little extra 
metadata is nothing to worry about. By comparison, snapshots create massive 
metadata footprints.

Since it's all COW, why not do full-stripe writes all the time? Assume a 4-disk 
RAID-Z (3 data + parity):

A1 A2 A3 AP
B1 B2 B3 BP
C1 C2 C3 CP
...

Then the transaction group timer fires and dumps a bunch of records needing 
syncing: A2', B2', C3', D2', A1', B3'. I write these out to totally new/empty 
stripes as

A1' A2' C3' XP
B2' B3' D2' YP

I don't have to read any of the original blocks, and parity is calculated 
entirely from in-memory data.
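
Roughly, as a throwaway Python sketch (the names and structures are made up for 
illustration, not how the actual DMU/vdev code is organized): take the dirty 
records from the txg, pack them into brand-new stripes, and XOR the parity in 
memory.

# Toy sketch of full-stripe copy-on-write allocation for a width-W
# RAID-Z-like layout (W-1 data blocks + 1 parity per stripe).
# Hypothetical names; not the real ZFS code path.

def xor_blocks(blocks):
    """Parity is just the byte-wise XOR of the (equal-sized) data blocks."""
    parity = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, b in enumerate(blk):
            parity[i] ^= b
    return bytes(parity)

def write_txg(dirty_records, free_stripes, width, block_size):
    """Pack dirty records into fresh stripes; never touch existing ones."""
    data_per_stripe = width - 1
    writes = []                            # (stripe_id, data blocks, parity)
    for i in range(0, len(dirty_records), data_per_stripe):
        chunk = list(dirty_records[i:i + data_per_stripe])
        # Pad a short final stripe with empty blocks rather than doing a
        # read-modify-write against an existing stripe.
        while len(chunk) < data_per_stripe:
            chunk.append(bytes(block_size))
        stripe_id = free_stripes.pop(0)    # always a completely unused stripe
        writes.append((stripe_id, chunk, xor_blocks(chunk)))
    return writes

No read of old data, no read of old parity: everything needed for the new 
stripes is already in memory.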

Then I just modify the metadata to mark the original blocks as 
invalid/superseded. For XOR and stripe-recovery purposes we can leave the 
original stripe perfectly alone. When a full stripe is no longer valid (all 
blocks superseded) and isn't part of a snapshot, it gets put on the "clean/ready 
for reuse" list.

After a while one could potentially end up with all of the "A" blocks sitting 
on just one spindle. They are still fully protected, but a sequential read of 
A1-A3 will obviously be much slower than if they were properly spread across 3 
spindles. This is where array scrubbing would step in and rebalance the 
A-series. Alternatively, the elevator algorithm applied to the transaction 
group could order things such that all 'x1' blocks go on spindle 1, 'x2' blocks 
go on spindle 2, and so on. If there aren't enough blocks for a particular 
spindle, just use an empty block to fill in the hole, or, if that is too 
wasteful, resort to the less optimal ordering for the leftovers and let 
scrubbing eventually take care of it.
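
Something along these lines (hypothetical sketch; "column" here is the x1/x2/x3 
position within the logical file, and the function builds one stripe's worth of 
data blocks):

# Try to keep block column k of a file on spindle k so a later sequential
# read is spread across all data spindles; fall back to leftovers or empty
# filler when a column runs dry, and let scrub clean up later.

def order_for_spindles(records, width, block_size):
    """records: list of (column, data); returns one stripe's data blocks."""
    by_column = {}
    for col, data in records:
        by_column.setdefault(col % (width - 1), []).append(data)

    leftovers = [d for blocks in by_column.values() for d in blocks[1:]]
    stripe = []
    for col in range(width - 1):
        if by_column.get(col):
            stripe.append(by_column[col][0])    # ideal placement
        elif leftovers:
            stripe.append(leftovers.pop())      # less optimal ordering
        else:
            stripe.append(bytes(block_size))    # filler; scrub rebalances later
    return stripe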

Note that my representation mimics RAID-4 in layout. You can of course move the 
parity block around; indeed the parity-block spindle is a simple function of 
stripe index and array width, e.g. for stripe N on width W, parity goes on 
spindle W - (N mod W). Is distributed parity worth doing? No, I don't think so.
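
(Assuming spindles are numbered 1..W: with W = 4 that formula puts parity on 
spindle 4 for stripe 0, spindle 3 for stripe 1, spindle 2 for stripe 2, spindle 
1 for stripe 3, then back to spindle 4 for stripe 4, i.e. plain rotating parity.)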
