Re: [zfs-discuss] Trying to understand zfs RAID-Z

Fred Zlotnick Thu, 17 May 2007 06:49:26 -0700


Steven Sim wrote:

Darren and Henk;
Firstly, thank you very much for both of your replies. I am very grateful indeed to you all for taking the time to answer my questions.
I understand RAID-5 quite well, and from both of your RAID-Z descriptions I see that the RAID-Z parity is also a separate block on a separate disk. Very well. This is just like RAID-5.
My confusion is simple. Would this not then also give rise to the write-hole vulnerability of RAID-5?
Jeff Bonwick states that "there's no way to update two or more disks atomically, so RAID stripes can become damaged during a crash or power outage."
If I understand correctly, then the parity blocks for RAID-Z are also written in two different atomic operations, as per RAID-5 (the only difference being that each can be of a different stripe size).
How then does it fit into Jeff's statement that "Every block is its own RAID-Z stripe"? (Perhaps I misunderstood, but I now think this statement refers to the fact that RAID-Z has a variable stripe size, rather than meaning that each block holds its own RAID-Z parity within itself.)
If the write-hole power outage situation described by Jeff Bonwick occurs, how does RAID-Z "beat" the RAID-5 write-hole vulnerability?


Recall that no written blocks are actually part of the file system until
all the metadata blocks above them are also written. This includes the
uberblock, whose write is atomic.  So although the physical writing of
the blocks on different physical disks is not atomic, if a crash occurs
between such writes it is as if none of the writes ever occurred.
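
A minimal sketch of that ordering in Python, assuming a toy in-memory pool. The class ToyPool and its methods are hypothetical names made up for illustration, not ZFS code. New blocks always go to fresh addresses, and the only step that makes them visible is the single assignment to the uberblock:

```python
class ToyPool:
    """Toy copy-on-write store; every name here is hypothetical."""

    def __init__(self):
        self.blocks = {}       # address -> contents; stands in for the disks
        self.next_addr = 0
        self.uberblock = None  # address of the current root block, or None

    def _alloc_write(self, contents):
        # Always write to a fresh address; live blocks are never overwritten.
        addr = self.next_addr
        self.next_addr += 1
        self.blocks[addr] = contents
        return addr

    def commit(self, data_items, crash_before_uberblock=False):
        # 1. Write the new data blocks (the leaves).
        leaf_addrs = [self._alloc_write(d) for d in data_items]
        # 2. Write the metadata block that points at them.
        root_addr = self._alloc_write(tuple(leaf_addrs))
        if crash_before_uberblock:
            return  # simulated power loss: new blocks exist but are unreachable
        # 3. The single atomic step: switch the uberblock to the new root.
        self.uberblock = root_addr

    def read_all(self):
        if self.uberblock is None:
            return []
        return [self.blocks[a] for a in self.blocks[self.uberblock]]


pool = ToyPool()
pool.commit([b"old-1", b"old-2"])
pool.commit([b"new-1", b"new-2"], crash_before_uberblock=True)
print(pool.read_all())  # the pool still sees only the old contents
```

The interrupted commit leaves orphaned blocks behind, but nothing reachable from the uberblock refers to them, so the pool reads as if those writes never happened.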

Through each block's independent checksum held one level above in the metadata block? Is this correct? Or am I completely off course?


Yes, that is correct.
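
A rough sketch of the "checksum held one level above" idea, again in Python. The helper names (make_pointer, read_verified) are hypothetical, and SHA-256 is used only as a convenient stand-in for whatever checksum the pool is configured with:

```python
import hashlib


def checksum(data: bytes) -> bytes:
    # Stand-in checksum; an assumption made for this illustration.
    return hashlib.sha256(data).digest()


def make_pointer(addr: int, data: bytes) -> dict:
    # The parent stores the child's address AND the child's checksum.
    return {"addr": addr, "cksum": checksum(data)}


def read_verified(disk: dict, ptr: dict) -> bytes:
    # The child is trusted only if it matches the checksum held by its parent.
    data = disk[ptr["addr"]]
    if checksum(data) != ptr["cksum"]:
        raise IOError("checksum mismatch at address %d" % ptr["addr"])
    return data


disk = {0: b"hello"}
ptr = make_pointer(0, disk[0])
assert read_verified(disk, ptr) == b"hello"

disk[0] = b"hellp"  # simulate a torn or corrupted write
try:
    read_verified(disk, ptr)
except IOError as err:
    print("detected:", err)
```

Because the checksum lives in the parent rather than next to the data, a block that was only partially written cannot vouch for itself.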

HTH,
Fred

Henk Langeveld's wonderful character-based diagrams describe what is basically a standard RAID-5 layout on 4 disks. How is RAID-Z any different from RAID-5? (Except for the ability to stripe different sizes, which allows RAID-Z to never have to do a read-modify-write. This increases performance very significantly, but I am unable to relate this to the write-hole vulnerability issue.)


Warmest Regards
Steven Sim

Darren Dunham wrote:

"RAID-Z is a data/parity scheme like RAID-5, but it uses dynamic stripe
width. Every block is its own RAID-Z stripe, regardless of blocksize. This
means that every RAID-Z write is a full-stripe write. This, when combined
with the copy-on-write transactional semantics of ZFS, completely
eliminates the RAID write hole. RAID-Z is also faster than traditional RAID
because it never has to do read-modify-write."

I am unable to relate the above statement to the diagram shown in the
PDF file 'zfs_last.pdf' entitled "ZFS THE LAST WORD IN FILE
SYSTEMS" (also by Jeff Bonwick), on page 11.


On the copy I'm looking at, page 11 is "Copy-On-Write Transactions".
Note that this page is showing only the "copy on write" stuff (which is
used on all pools) and doesn't show anything about raidz.

I was wondering whether Jeff or someone knowledgeable would elaborate
further on the above and also answer the following questions:

- The green and blue "blocks" shown in the diagram on page 11, do
they represent actual physical blocks on individual disks or a single
RAID-Z stripe write across multiple disks?


They represent filesystem "data" (the leaves) and filesystem "metadata"
(the blocks above the leaves).  They're written to a pool that may have
some form of redundancy (mirroring, raidz), but that level is not
presented in this slide.

- The parity for RAID-Z, where is it?


Mentioned on page 17, but no diagram on this link.

Bill Moore presented this talk to BayLisa several months ago, and used a
very similar presentation, but it had a diagram on the "RAID-Z" slide
(the one on page 17) showing the data and parity blocks used by a raidz
pool.  Looking through Google, I see many links to similar ZFS
presentations, but none with the diagram on that page.

Ah, found one...
http://www.snia.org/events/past/sdc2006/zfs_File_Systems-bonwick-moore.pdf
Page 13 in that stack.

Surely not the checksum
stored together in the upper-level direct and indirect block pointers?
And if not, and it is written as a separate block on another disk, then
... I am afraid I do not understand.


The parity is written as a separate block on a separate disk.  It's very
similar to how RAID4/RAID5 would write parity on another disk.

It's just that for R4/R5, any given data block on disk can be
immediately calculated to be part of a particular stripe on the storage
(which has a particular parity block).  In the case of raidz, the stripe
has a maximum length set by the raidz columns, but it may be smaller
than that.
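
To make the variable stripe length concrete, here is a small Python sketch of single-parity XOR over one filesystem block. It is a simplification under stated assumptions: the function names, the 4-byte "sector", and the one-row limit are all made up for illustration and do not reflect the real on-disk layout.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def make_stripe(block: bytes, max_data_columns: int, sector: int = 4):
    """Split one block into sector-sized chunks plus one XOR parity chunk.

    The stripe is only as wide as the block needs, up to max_data_columns.
    """
    if not block or len(block) > max_data_columns * sector:
        raise ValueError("toy example handles a single row only")
    chunks = [block[i:i + sector].ljust(sector, b"\0")
              for i in range(0, len(block), sector)]
    parity = reduce(xor_bytes, chunks)
    return chunks, parity


def rebuild(chunks, parity, lost: int) -> bytes:
    # XOR of the parity with every surviving chunk yields the lost chunk.
    survivors = [c for i, c in enumerate(chunks) if i != lost]
    return reduce(xor_bytes, survivors, parity)


small_chunks, small_parity = make_stripe(b"abcd", max_data_columns=3)
big_chunks, big_parity = make_stripe(b"abcdefghij", max_data_columns=3)
print(len(small_chunks), len(big_chunks))  # stripe width follows block size
print(rebuild(big_chunks, big_parity, lost=1))
```

Since the parity is computed from the whole block being written, nothing already on disk has to be read back and modified, which is the contrast with a fixed-geometry RAID-5 partial-stripe update.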

- Could someone please elaborate more on the statement "Every block
is its own RAID-Z stripe"? Is the block being referred to a single
block across multiple disks or on a single disk?


Every filesystem block (not disk block).  So a single filesystem block
that spans multiple disks.





Fujitsu Asia Pte. Ltd.



------------------------------------------------------------------------

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss


--
Fred Zlotnick
Director, Solaris Data Technology
Sun Microsystems, Inc.
[EMAIL PROTECTED]
x85006/+1 650 786 5006