Chris Siebenmann <c...@cs.toronto.edu> wrote:
> People have already mentioned the RAID-[56] write hole, but it's more
> than that; in a never-overwrite system with multiple blocks in one RAID
> stripe, how do you handle updates to some of the blocks?
>
> See:
> http://utcc.utoronto.ca/~cks/space/blog/solaris/ZFSSensibleRAID

Oh, that's easy. NetApp's been doing this since forever. A little extra
metadata is nothing to worry about. Snapshots, by comparison, create
massive metadata footprints.

Since it's all COW, why not do full-stripe writes all the time? Assume a
4-disk raidZ (3+p):

    A1 A2 A3 AP
    B1 B2 B3 BP
    C1 C2 C3 CP
    ...

Then the transaction group timer fires and dumps a bunch of records
needing syncing: A2', B2', C3', D2', A1', B3'. I write these out in
totally new/empty stripes as

    A1' A2' C3' XP
    B2' B3' D2' XP

I don't have to read any of the original blocks, and parity is calculated
from the in-memory copies. Then I just modify the metadata to mark the
original blocks as invalid/superseded. But for XOR and stripe recovery
purposes we can leave the original stripe perfectly alone. When a full
stripe is no longer valid (all blocks superseded) and isn't part of a
snapshot, it gets put on the "clean/ready for reuse" list.

After a while one could potentially end up with all of the "A" blocks
sitting on just one spindle. They are still fully protected, but a
sequential read of A1-A3 will obviously be much slower than if they were
properly spread across 3 spindles. This is where array scrubbing would
step in and rebalance the A-series.

On the other hand, the elevator algorithm applied to the transaction group
could order things such that all 'x1' blocks go on spindle 1, 'x2' go on
spindle 2, etc. If there aren't enough blocks for a particular spindle,
then just use an empty block to fill in the hole or, if that is too
wasteful, resort to the less optimal ordering for the left-overs and
scrubbing will eventually take care of it.

Note that my representation mimics RAID4 in layout. You can of course move
the parity block around; indeed, the parity block's spindle is a simple
function of stripe index and array width. E.g. for stripe N on width W,
with spindles numbered 1 to W -> parity is on spindle W - (N mod W).

Is distributed parity worth doing? No, I don't think so.
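
To make the full-stripe idea above concrete, here is a rough Python sketch. It is not how ZFS or NetApp actually implement anything; the function names (xor_parity, parity_spindle, pack_full_stripes, rebuild_block), the fixed block size, and the zero-filled "empty" filler block are all assumptions made up for illustration. It packs the dirty records into brand-new stripes, computes parity purely from the in-memory data, and uses the W - (N mod W) rule with spindles numbered 1 to W.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def parity_spindle(stripe_index, width):
    """1-based spindle holding parity for stripe N on width W: W - (N mod W)."""
    return width - (stripe_index % width)


def pack_full_stripes(dirty, data_per_stripe, block_size):
    """Group dirty (name, data) records into full stripes.

    No original block is read: parity comes only from the in-memory data.
    A short final stripe is padded with zero-filled "empty" blocks
    (a hypothetical filler, standing in for "use an empty block").
    Returns a list of (data_blocks, parity) tuples.
    """
    stripes = []
    for start in range(0, len(dirty), data_per_stripe):
        group = list(dirty[start:start + data_per_stripe])
        while len(group) < data_per_stripe:
            group.append(("empty", bytes(block_size)))
        parity = xor_parity([data for _, data in group])
        stripes.append((group, parity))
    return stripes


def rebuild_block(surviving_blocks, parity):
    """Recover one lost data block from the survivors plus parity."""
    return xor_parity(list(surviving_blocks) + [parity])


if __name__ == "__main__":
    # The six dirty records from the example, already in write order.
    dirty = [(name, name.encode().ljust(4, b"\0"))
             for name in ["A1'", "A2'", "C3'", "B2'", "B3'", "D2'"]]
    for n, (group, parity) in enumerate(pack_full_stripes(dirty, 3, 4)):
        names = " ".join(name for name, _ in group)
        print(f"stripe {n}: {names}  parity on spindle {parity_spindle(n, 4)}")
```

The sketch leaves out the interesting parts (marking old blocks superseded, the reuse list, and the elevator-style ordering by spindle), which live in the metadata rather than in the parity math.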