On Oct 16, 2010, at 4:57 AM, Edward Ned Harvey wrote:

>> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
>> 
>>> raidzN takes a really long time to resilver (code written inefficiently,
>>> it's a known problem.)  If you had a huge raidz3, it would literally never
>>> finish, because it couldn't resilver as fast as new data appears.  A week
>> 
>> In what way is the code written inefficiently?
> 
> Here is a link to one message in the middle of a really long thread, which
> touched on a lot of things, so it's difficult to read the thread now and get
> what it all boils down to and which parts are relevant to the present
> discussion.  Relevant comments below...
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg41998.html
> 
> In conclusion of the referenced thread:
> 
> The raidzN resilver code is inefficient, especially when there are a lot of
> disks in the vdev, because...
> 
> 1. It processes one slab at a time.  That's very important.  Each disk
> spends a lot of idle time waiting for the next disk to fetch something, so
> there is an opportunity to start prefetching data on the idle disks, and
> that is not happening.

Slabs don't matter here; resilver works on records, so the rest of this argument is moot.

> 2. Each slab is spread across many disks, so the average seek time to fetch
> the slab approaches the maximum seek time of a single disk.  That means an
> average 2x longer than average seek time.

nope.

> 2a. The more disks in the vdev, the smaller the piece of data that gets
> written to each individual disk.  So you are waiting for the maximum seek
> time, in order to fetch a slab fragment which is tiny ...

This is an oversimplification.  In all of the resilvering tests I've done, the
resilver time is entirely based on the random write performance of the
resilvering disk. 
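
As a rough back-of-envelope (the numbers below are illustrative assumptions,
not measurements from those tests):

    # Lower bound on resilver time when the rebuilt disk's random-write
    # rate is the bottleneck.  Both inputs are assumptions, not measurements.
    blocks_to_rebuild = 20_000_000   # e.g. on the order of 1 TB of small records
    write_iops = 300                 # assumed random-write IOPS of one HDD
    hours = blocks_to_rebuild / write_iops / 3600
    print(f"~{hours:.0f} hours")     # -> ~19 hours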

> 3. The order of slab fetching is determined by creation time, not by disk
> layout.  This is a huge setback.  It means each seek is essentially random,
> which yields maximum seek time, instead of being sequential which approaches
> zero seek time.  If you could cut the seek time down to zero, you would have
> infinitely faster IOPS.  Something divided by zero is infinity.  Suddenly
> you wouldn't care about seek time and you'd start paying attention to some
> other limiting factor.
> http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg42017.html

Seeks are usually quite small compared to the rotational delay, due to
the way data is written.
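
To put rough numbers on it (typical figures for a 7,200 rpm drive, not
measurements of any particular model):

    # Rotational delay vs. a short seek on a typical 7,200 rpm HDD.
    # Both figures are generic assumptions, not measurements.
    rpm = 7200
    avg_rotational_latency_ms = 0.5 * 60_000 / rpm   # half a revolution ~= 4.2 ms
    short_seek_ms = 1.0                              # typical short seek (assumed)
    print(avg_rotational_latency_ms, short_seek_ms)  # 4.166... 1.0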

> 4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
> they're trying to resilver at the same time.  Does the system ignore
> subsequently failed disks and concentrate on restoring a single disk
> quickly?  

No, of course not.

> Or does the system try to resilver them all simultaneously and
> therefore double or triple the time before any one disk is fully resilvered?

Yes, of course.

> 5. If all your files reside in one big raidz3, that means a little piece of
> *every* slab in the pool must be on each disk.  We've concluded above that
> you are approaching maximum seek time,

No, you are jumping to the conclusion that data is allocated at the beginning
and the end of the device, which is not the case.

> and now we're also concluding you
> must do the maximum number of possible seeks.  If instead, you break your
> big raidz3 vdev into 3 raidz1 vdev's, that means each raidz1 vdev will have
> approx 33% as many slab pieces on it.  

Again, misuse of the term "slab."  A record will exist in only one set.  So it
is simply a matter of finding the records that need to be resilvered.

> If you need to resilver a disk, even
> though you're resilvering approximately the same number of bytes per disk as
> you would have in raidz3, in the raidz1 you've cut the number of seeks down
> to 33%, and you've reduced the time necessary for each of those seeks.

No, not really. The metadata contains the information you need to locate
the records to be resilvered. By design, the metadata is redundant and spread
across top-level vdevs or, in the case of a single top-level vdev, made 
redundant and diverse. So there are two activities in play:
        1. metadata is read in time order and prefetched
        2. records are reconstructed from the surviving vdevs
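
In pseudocode, the shape of it is roughly the sketch below.  The names, the
BlockPointer shape, and the toy pool are all made up for illustration; this is
not the actual scrub/resilver code.

    # Structural sketch of the two activities above, with toy data.
    from typing import Iterator, NamedTuple

    class BlockPointer(NamedTuple):
        txg: int          # birth transaction group (drives the time ordering)
        vdevs: tuple      # top-level vdevs holding pieces of this record

    def walk_metadata_in_txg_order(pool) -> Iterator[BlockPointer]:
        # Activity 1: read (and prefetch) the redundant metadata in time order.
        return iter(sorted(pool, key=lambda bp: bp.txg))

    def resilver(pool, dead_vdev) -> int:
        rebuilt = 0
        for bp in walk_metadata_in_txg_order(pool):
            if dead_vdev not in bp.vdevs:
                continue          # record has nothing on the disk being rebuilt
            # Activity 2: reconstruct the record from the surviving vdevs and
            # issue the (random) write to the resilvering disk.
            rebuilt += 1
        return rebuilt

    # Toy example: three records, one of which touches vdev 2.
    toy_pool = [BlockPointer(txg=7, vdevs=(0, 1)),
                BlockPointer(txg=3, vdevs=(1, 2)),
                BlockPointer(txg=9, vdevs=(0, 1))]
    print(resilver(toy_pool, dead_vdev=2))   # -> 1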

> Still better ... Compare a 23-disk raidz3 (capacity of 20 disks) against 20
> mirrors.  Resilver one disk.  You only require 5% as many seeks, and each
> seek will go twice as fast.  

Again, this is an oversimplification that assumes seeks are not done in
parallel. In reality, the I/Os are scheduled to each device in the set 
concurrently,
so the total number of seeks per set is moot.
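
A tiny illustration of why counting seeks per set is the wrong arithmetic
(all numbers are made up):

    # Seeks issued to different disks overlap in time, so elapsed time is set
    # by the busiest disk, not by the sum of seeks across the whole set.
    per_disk_ios = [200_000] * 23     # assumed I/Os per disk during a rebuild
    ms_per_io = 8                     # assumed average service time per I/O
    serial_hours = sum(per_disk_ios) * ms_per_io / 3_600_000    # if strictly serial
    concurrent_hours = max(per_disk_ios) * ms_per_io / 3_600_000
    print(serial_hours, concurrent_hours)   # ~10.2 vs ~0.44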

> So the mirror will resilver 40x faster.  

I've never seen data to support this.  And yes, I've done many experiments
and observed real-life reconstruction.

> Also,
> if anybody is actually using the pool during that time, only 5% of the user
> operations will result in a seek on the resilvering mirror disk, while 100%
> of the user operations will hurt the raidz3 resilver.

Good argument for SSDs, yes? :-)

> 6. Please see the following calculation of probability of failure of 20
> mirrors vs 23 disk raidz3.  According to my calculations, the probability of
> 4 disk failure in raidz3 is approx 4.4E-4 and the probability of 2 disks in
> the same mirror failing is approx 5E-5. So the chances of either pool to
> fail is very small, but the raidz3 is approx 10x more likely to suffer pool
> failure than the mirror setup.  Granted there is some linear estimation
> which is not entirely accurate, but I think the calculation comes within an
> order of magnitude of being correct.  The mirror setup is 65% more hardware,
> 10x more reliable, and much faster than the raidz3 setup, same usable
> capacity.
> http://dl.dropbox.com/u/543241/raidz3%20vs%20mirrors.pdf 

Ok, you've shared the math and it isn't quite right. To build a better model,
you will need to work on the probability of each sector being corrupt and on
how far apart the corrupt sectors are.  What we tend to see in the field is
that the probability of failure follows the models from the vendors and the
locality follows more traditional location models.  Location models for HDDs
are not easy, because there are so many layers of reordering, caching, and
optimization.  IMHO it is better to rely on empirical studies, which I have
done. My data does not match your model very well.  Do you have measurements
to back up your hypothesis?

> Compare the 21disk raidz3 versus 3 vdev's of 7-disk raidz1.  You get more
> than 3x faster resilver time with the smaller vdev's, and you only get 3x
> the redundancy in the raidz3.  That means the probability of 4
> simultaneously failed disks in the raidz3 is higher than the probability of
> 2 failed disks in a single raidz1 vdev.

Disagree.  We do have models for this and can do the math.  Starting with the
model I described in "ZFS data protection comparison" and extending to 21
disks, we see:
        Config              MTTDL[1] (years)
        3x 7-disk raidz1               2,581
        21-disk raidz3            37,499,659
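
If you want to reproduce that kind of comparison yourself, the classic
MTTDL[1] approximation looks like the sketch below.  The MTBF and MTTR values
are placeholders, not the inputs behind the table above, so the absolute
numbers will differ; it only shows the shape of the calculation.

    # MTTDL[1]-style approximation: data is lost when parity+1 disks in one
    # set fail within a repair window.  MTBF/MTTR inputs are placeholders.
    def mttdl1_hours(n_disks, parity, mtbf_h, mttr_h):
        # MTTDL ~= MTBF^(p+1) / (N*(N-1)*...*(N-p) * MTTR^p)
        denom = 1
        for i in range(parity + 1):
            denom *= (n_disks - i)
        return mtbf_h ** (parity + 1) / (denom * mttr_h ** parity)

    MTBF_H = 1_000_000      # placeholder drive MTBF, hours
    MTTR_H = 48             # placeholder resilver time, hours
    HOURS_PER_YEAR = 8766

    raidz1_set_years = mttdl1_hours(7, 1, MTBF_H, MTTR_H) / HOURS_PER_YEAR
    pool_3x7_years = raidz1_set_years / 3          # three independent sets
    raidz3_years = mttdl1_hours(21, 3, MTBF_H, MTTR_H) / HOURS_PER_YEAR
    print(round(pool_3x7_years), round(raidz3_years))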

As I've said many times, and shown data to prove (next chance is at the
OpenStorage Summit in a few weeks :-), the resilver becomes constrained
by the performance of the resilvering disk, not the surviving disks.
 -- richard

-- 
OpenStorage Summit, October 25-27, Palo Alto, CA
http://nexenta-summit2010.eventbrite.com
USENIX LISA '10 Conference, November 7-12, San Jose, CA
ZFS and performance consulting
http://www.RichardElling.com