Steven Sim wrote:
Darren and Henk;

Firstly, thank you very much for both of your replies. I am very grateful indeed to you both for taking the time to answer my questions.

I understand RAID-5 quite well, and from both of your RAID-Z descriptions I see that the RAID-Z parity is also a separate block on a separate disk. Very well. This is just like RAID-5.

My confusion is simple. Would this not then also give rise to the write-hole vulnerability of RAID-5?

Jeff Bonwick states that "there's no way to update two or more disks atomically, so RAID stripes can become damaged during a crash or power outage."

If I understand correctly, then the data and parity blocks for RAID-Z are also written in two different atomic operations, as per RAID-5 (the only difference being that each stripe can be of a different size).

How then does this fit with Jeff's statement that "Every block is its own RAID-Z stripe"? (Perhaps I misunderstood, but I now think this statement refers to the fact that RAID-Z has a variable stripe size, rather than meaning that each block holds its own RAID-Z parity within itself.)

If the write-hole power outage situation as described by Jeff Bonwick occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?
Recall that no written blocks are actually part of the file system until all the metadata blocks above them are also written. This includes the uberblock, whose write is atomic. So although the physical writing of the blocks on different physical disks is not atomic, if a crash occurs between such writes it is as if none of the writes ever occurred.
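To make that concrete, here is a minimal toy sketch of the idea, not ZFS code. The class `ToyPool`, its methods, and the `crash_before_uberblock` flag are all hypothetical names invented for illustration; a single dictionary stands in for every disk in the pool.

```python
class ToyPool:
    """Toy copy-on-write store: live blocks are never overwritten in place."""

    def __init__(self):
        self.blocks = {}       # address -> contents (stands in for all disks)
        self.next_addr = 0     # next free address; old addresses are not reused
        self.uberblock = None  # address of the current root, switched atomically

    def _write_new(self, contents):
        # Always write to a fresh address, leaving existing blocks untouched.
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = contents
        return addr

    def commit(self, data_items, crash_before_uberblock=False):
        # 1. Write the new data blocks (the leaves) to fresh locations.
        leaf_addrs = [self._write_new(item) for item in data_items]
        # 2. Write a new metadata block that points at those leaves.
        root_addr = self._write_new(tuple(leaf_addrs))
        if crash_before_uberblock:
            return  # simulated power loss: new blocks exist but are unreachable
        # 3. One atomic pointer switch makes the whole new tree visible.
        self.uberblock = root_addr

    def read_all(self):
        # Readers only ever follow pointers starting from the uberblock.
        if self.uberblock is None:
            return []
        return [self.blocks[addr] for addr in self.blocks[self.uberblock]]
```

If `commit` is interrupted before step 3, `read_all` still returns the previous contents, because nothing reachable from the old uberblock was modified. The half-written blocks are simply unreferenced space.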
Through each block's independent checksum held one level above in the metadata block? Is this correct? Or am I completely off course?
Yes, that is correct. HTH, Fred
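A rough sketch of what "checksum held one level above" means in practice follows. It is an illustration only: `BlockPointer`, `write_block`, `read_verified` and `ChecksumError` are hypothetical names, a dictionary stands in for the disk, and SHA-256 is used merely as a convenient stand-in checksum.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when a block does not match the checksum held by its parent."""


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class BlockPointer:
    """What a parent (metadata) block stores about one child block."""

    def __init__(self, addr: int, data: bytes):
        self.addr = addr
        self.cksum = checksum(data)  # kept in the parent, not beside the data


def write_block(disk: dict, addr: int, data: bytes) -> BlockPointer:
    disk[addr] = data
    return BlockPointer(addr, data)


def read_verified(disk: dict, ptr: BlockPointer) -> bytes:
    data = disk[ptr.addr]
    if checksum(data) != ptr.cksum:
        # A torn or corrupted block cannot pass silently as valid data.
        raise ChecksumError(f"block at address {ptr.addr} failed verification")
    return data
```

Because the checksum lives in the parent rather than next to the data, a block that was only partly written cannot vouch for itself; the mismatch is caught on the way down the tree.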
Henk Langeveld's wonderful character-based diagrams describe what is basically a standard RAID-5 layout on 4 disks. How is RAID-Z any different from RAID-5? (Except for the ability to use stripes of different sizes, which allows RAID-Z to never have to do a read-modify-write. This increases performance very significantly, but I am unable to relate this to the write-hole vulnerability issue.)

Warmest Regards
Steven Sim

Darren Dunham wrote:

> "RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe width. Every block is its own RAID-Z stripe, regardless of blocksize. This means that every RAID-Z write is a full-stripe write. This, when combined with the copy-on-write transactional semantics of ZFS, completely eliminates the RAID write hole. RAID-Z is also faster than traditional RAID because it never has to do read-modify-write."
>
> I am unable to relate the above statement to the diagram shown in the PDF file 'zfs_last.pdf' entitled "ZFS THE LAST WORD IN FILE SYSTEMS" (also by Jeff Bonwick), on page 11.

On the copy I'm looking at, page 11 is "Copy-On-Write Transactions". Note that this page is showing only the "copy on write" stuff (which is used on all pools) and doesn't show anything about raidz.

> I was wondering whether Jeff or someone knowledgeable would elaborate further on the above and also answer the following questions:
>
> - The green and blue "blocks" shown in the diagram on page 11, do they represent actual physical blocks on individual disks or a single RAID-Z stripe write across multiple disks?

They represent filesystem "data" (the leaves) and filesystem "metadata" (the blocks above the leaves). They're written to a pool that may have some form of redundancy (mirroring, raidz), but that level is not presented in this slide.

> - The parity for RAID-Z, where is it?

Mentioned on page 17, but no diagram on this link.
Bill Moore presented this talk to BayLisa several months ago, and used a very similar presentation, but it had a diagram on the "RAID-Z" slide (the one on page 17) showing the data and parity blocks used by a raidz pool. Looking through Google, I see many links to similar ZFS presentations, but none with the diagram on that page.

Ah, found one...
http://www.snia.org/events/past/sdc2006/zfs_File_Systems-bonwick-moore.pdf
Page 13 in that stack.

> Surely not the checksum stored together in the upper level direct and indirect block pointer? And if not, and it is written as a separate block on another disk, then ... I am afraid I do not understand.

The parity is written as a separate block on a separate disk. It's very similar to how RAID4/RAID5 would write parity on another disk.

It's just that for R4/R5, any given data block on disk can be immediately calculated to be part of a particular stripe on the storage (which has a particular parity block). In the case of raidz, the stripe has a maximum length set by the raidz columns, but it may be smaller than that.

> - Could someone please elaborate more on the statement "Every block is its own RAID-Z stripe"? Is the block being referred to a single block across multiple disks, or on a single disk?

Every filesystem block (not disk block). So a single filesystem block that spans multiple disks.

> Fujitsu Asia Pte. Ltd.

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
--
Fred Zlotnick
Director, Solaris Data Technology
Sun Microsystems, Inc.
[EMAIL PROTECTED]
x85006/+1 650 786 5006
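Darren's description of raidz parity (a separate parity block on a separate disk, with a stripe no wider than the raidz columns but possibly narrower) can be sketched roughly as below. This is a simplified toy with single XOR parity, not the actual raidz layout code: `SECTOR`, `make_stripe` and `rebuild` are hypothetical names, the 4-byte sector size is made up for readability, and the sketch assumes a filesystem block fits in one row of the array.

```python
from functools import reduce

SECTOR = 4  # hypothetical, unrealistically small sector size in bytes


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def make_stripe(block: bytes, max_data_columns: int) -> list:
    """Turn ONE filesystem block into its own stripe: [parity, data0, data1, ...].

    The stripe is only as wide as the block needs, up to max_data_columns
    data sectors plus one parity sector.
    """
    if not block:
        raise ValueError("empty block")
    # Cut the block into sectors, zero-padding the last one.
    sectors = [block[i:i + SECTOR].ljust(SECTOR, b"\0")
               for i in range(0, len(block), SECTOR)]
    if len(sectors) > max_data_columns:
        raise ValueError("toy model: block must fit in a single row")
    # Parity is computed from this block's own sectors only, so writing it
    # never requires reading older data back (no read-modify-write).
    parity = reduce(xor_bytes, sectors)
    return [parity] + sectors


def rebuild(stripe: list, lost_index: int) -> bytes:
    """Recover one lost sector (data or parity) by XOR-ing the survivors."""
    survivors = [s for i, s in enumerate(stripe) if i != lost_index]
    return reduce(xor_bytes, survivors)
```

On a 4-disk raidz (so `max_data_columns=3`), an 8-byte block yields a stripe three sectors wide and a 2-byte block yields a stripe two sectors wide. In both cases every sector of the stripe belongs to that one block, which is the contrast with RAID-5's fixed stripe geometry.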