>I understand the copy-on-write thing. That was very well illustrated in 
>"ZFS The Last Word in File Systems" by Jeff Bonwick.
>
>But if every block is its own RAID-Z stripe, if the block is lost, how 
>does ZFS recover the block?

You should perhaps not take "block" literally; the block is written as
part of a single transaction on all disks of the RAID-Z group.

Only once the block is safely stored on disk are the pointers referencing
it written.  For the whole block to be lost, either all disks must be
lost or the transaction must never have completed.

>Is the stripe parity (as opposed to block checksum which I understand) 
>stored somewhere else or within the same block?

Parts of the block are written to each disk; the parity is written to
whichever disk holds the parity for that stripe.

>But how exactly does "every RAID-Z write is a full stripe write" work? 
>More specifically, if in a 3 disk RAID-Z configuration, if one disk 
>fails completely and is replaced, exactly how does the "metadata driven 
>reconstruction" recover the newly replaced disk?

The metadata-driven reconstruction starts at the uberblock and walks the
block tree from there; for each block it re-reads the other disks and
reconstructs what belongs on the replaced disk (data or parity) while
also verifying checksums.

Not all data needs to be read and not all parity needs to be computed;
only the parts of the disks that are actually in use are verified and
have their parity recomputed.
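
As a rough illustration of that walk, here is a toy model in Python.  It
is not ZFS code: BlockPtr, write_block and resilver are made-up names,
each disk is just a dict from offset to bytes, and parity always lives
on disk 2 to keep the example short.  It only shows the idea that the
rebuild follows the block pointers, so unused disk space is never
touched.

```python
import hashlib
from dataclasses import dataclass, field

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

@dataclass
class BlockPtr:
    # Hypothetical, heavily simplified block pointer.
    offset: int                 # where the stripe lives on every disk
    checksum: str               # checksum of the block's data
    children: list = field(default_factory=list)

def write_block(disks, offset, data):
    # Full-stripe write: two data halves plus their XOR parity.
    half = len(data) // 2
    d0, d1 = data[:half], data[half:]
    disks[0][offset], disks[1][offset], disks[2][offset] = d0, d1, xor(d0, d1)
    return BlockPtr(offset, hashlib.sha256(data).hexdigest())

def resilver(disks, lost, root):
    # Walk the pointer tree; rebuild only the stripes that are referenced.
    survivors = [i for i in range(3) if i != lost]
    rebuilt = 0
    stack = [root]
    while stack:
        bp = stack.pop()
        a, b = (disks[i][bp.offset] for i in survivors)
        disks[lost][bp.offset] = xor(a, b)
        data = disks[0][bp.offset] + disks[1][bp.offset]
        assert hashlib.sha256(data).hexdigest() == bp.checksum
        rebuilt += 1
        stack.extend(bp.children)
    return rebuilt

disks = [{}, {}, {}]
leaf1 = write_block(disks, 10, b"A" * 8)   # blocks of different sizes
leaf2 = write_block(disks, 20, b"B" * 4)
root = write_block(disks, 30, b"root-ptr")
root.children = [leaf1, leaf2]

disks[1] = {}                    # disk 1 fails and is replaced by an empty one
count = resilver(disks, 1, root)
```

In this toy, the replaced disk ends up holding exactly the three stripes
reachable from the root and nothing else.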


>"....Well, the tricky bit here is RAID-Z reconstruction. Because the 
>stripes are all different sizes, there's no simple formula like "all the 
>disks XOR to zero." You have to traverse the filesystem metadata to 
>determine the RAID-Z geometry. Note that this would be impossible if the 
>filesystem and the RAID array were separate products, which is why 
>there's nothing like RAID-Z in the storage market today. You really need 
>an integrated view of the logical and physical structure of the data to 
>pull it off."
>
>Every stripe is different size? Is this because ZFS adapts to the nature 
>of the I/O coming to it?

It's because the blocks written are all of different sizes.

For example, if you write a 128K block on a 3-way RAID-Z, this can be
written as 2x64K of data + 1x64K of parity.

(Though I must admit that in such a scheme the disks still XOR to zero,
at least over the parts of the disks that are in use.)
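
A minimal sketch of the 128K case in Python, assuming plain single-parity
XOR; the helper names are invented for the example and have nothing to do
with the actual ZFS source:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split_with_parity(block: bytes):
    # Split the block into two data columns and compute one parity column.
    half = len(block) // 2
    d0, d1 = block[:half], block[half:]
    return d0, d1, xor_bytes(d0, d1)

block = bytes(range(256)) * 512          # 128K of sample data
d0, d1, p = split_with_parity(block)     # 2x64K data + 1x64K parity

# The three columns XOR to zero ...
all_zero = xor_bytes(xor_bytes(d0, d1), p) == bytes(len(p))

# ... so any one column can be rebuilt from the other two.
rebuilt_d1 = xor_bytes(d0, p)
```

Here rebuilt_d1 stands in for the column that was on the failed disk.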

Casper

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
