> From: Bob Friesenhahn [mailto:bfrie...@simple.dallas.tx.us]
> 
> > raidzN takes a really long time to resilver (code written inefficiently,
> > it's a known problem.)  If you had a huge raidz3, it would literally never
> > finish, because it couldn't resilver as fast as new data appears.  A week
> 
> In what way is the code written inefficiently?

Here is a link to one message in the middle of a really long thread.  The
thread touched on a lot of things, so it's difficult to read it now and work
out what it all boils down to and which parts are relevant to the present
discussion.  Relevant comments below...
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg41998.html

To summarize the conclusions of the referenced thread:

The raidzN resilver code is inefficient, especially when there are a lot of
disks in the vdev, because...

1. It processes one slab at a time.  That's very important.  Each disk
spends a lot of idle time waiting for the next disk to fetch something, so
there is an opportunity to start prefetching data on the idle disks, and
that is not happening.

2. Each slab is spread across many disks, so the time to fetch the slab is
set by the slowest disk, and the expected seek time approaches the maximum
seek time of a single disk.  That is roughly 2x the average seek time.

2a. The more disks in the vdev, the smaller the piece of data that gets
written to each individual disk.  So you are waiting for the maximum seek
time, in order to fetch a slab fragment which is tiny ...
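
To put a rough number on points 2 and 2a, here is a minimal sketch.  It
assumes each disk's seek time for its slab fragment is independent and
uniformly distributed between 0 and the maximum seek time; that uniform model
is a simplification of mine, not something measured, and the function names
are made up for illustration.  Under that assumption the slowest of n seeks
averages n/(n+1) of the maximum, while a single disk averages 1/2 of it.

```python
import random

def expected_slowest_seek(n_disks, max_seek=1.0):
    """Closed form for E[max of n independent Uniform(0, max_seek) seeks]."""
    return max_seek * n_disks / (n_disks + 1)

def simulated_slowest_seek(n_disks, trials=20000, max_seek=1.0, seed=0):
    """Monte Carlo estimate of the same quantity (assumed uniform model)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # A slab fetch finishes only when the slowest disk has answered.
        total += max(rng.uniform(0.0, max_seek) for _ in range(n_disks))
    return total / trials

if __name__ == "__main__":
    average_single_seek = 0.5  # mean of Uniform(0, 1)
    for n in (1, 2, 7, 23):
        closed = expected_slowest_seek(n) / average_single_seek
        simulated = simulated_slowest_seek(n) / average_single_seek
        print(n, round(closed, 2), round(simulated, 2))
```

The ratio printed for each vdev width shows how the slowest seek moves toward
2x the single-disk average as disks are added.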

3. The order of slab fetching is determined by creation time, not by disk
layout.  This is a huge setback.  It means each seek is essentially random,
which yields the maximum seek time, instead of sequential, which approaches
zero seek time.  If you could cut the seek time down to zero, seeks would no
longer limit your IOPS at all.  Suddenly you wouldn't care about seek time
and you'd start paying attention to some other limiting factor.
http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg42017.html
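
Point 3 can be illustrated with a toy model.  It assumes seek cost is
proportional to the distance the head travels between block offsets (real
disks also have rotational latency and settle time, which this ignores), and
it uses randomly generated offsets as a stand-in for creation-time order.
None of this is ZFS code; the names are hypothetical.

```python
import random

def head_travel(offsets):
    """Total distance the head moves when visiting offsets in the given order."""
    return sum(abs(b - a) for a, b in zip(offsets, offsets[1:]))

def compare_orders(n_blocks=10000, disk_size=1_000_000, seed=0):
    """Return (travel in creation order, travel in disk-layout order)."""
    rng = random.Random(seed)
    # Creation order is modelled as random placement across the disk.
    creation_order = [rng.randrange(disk_size) for _ in range(n_blocks)]
    # Disk-layout order visits the same blocks sorted by offset.
    layout_order = sorted(creation_order)
    return head_travel(creation_order), head_travel(layout_order)

if __name__ == "__main__":
    random_travel, sorted_travel = compare_orders()
    print(random_travel / sorted_travel)
```

In layout order the head sweeps the disk once, so its travel can never exceed
the disk size, no matter how many blocks are fetched.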

4. Guess what happens if you have 2 or 3 failed disks in your raidz3, and
they're trying to resilver at the same time.  Does the system ignore
subsequently failed disks and concentrate on restoring a single disk
quickly?  Or does the system try to resilver them all simultaneously and
therefore double or triple the time before any one disk is fully resilvered?

5. If all your files reside in one big raidz3, a little piece of *every* slab
in the pool must be on each disk.  We've concluded above that you are
approaching maximum seek time, and now we're also concluding you must do the
maximum possible number of seeks.  If instead you break your big raidz3 vdev
into 3 raidz1 vdevs, each raidz1 vdev will hold approx 33% as many slab
pieces.  If you need to resilver a disk, you're resilvering approximately
the same number of bytes per disk as you would have in raidz3, but in the
raidz1 you've cut the number of seeks down to 33%, and you've reduced the
time necessary for each of those seeks.
Still better ... Compare a 23-disk raidz3 (capacity of 20 disks) against 20
mirrors.  Resilver one disk.  You only require 5% as many seeks, and each
seek will go twice as fast.  So the mirror will resilver 40x faster.  Also,
if anybody is actually using the pool during that time, only 5% of the user
operations will result in a seek on the resilvering mirror disk, while 100%
of the user operations will hurt the raidz3 resilver.
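
The arithmetic behind the 40x figure, as a sketch.  It assumes resilver time
is dominated by seeks, so time is proportional to (number of seeks) x (time
per seek).  The 5% and 2x inputs are the estimates from the paragraph above;
the function name is mine.

```python
def resilver_speedup(seek_fraction, per_seek_speedup):
    """Speedup when a resilver needs `seek_fraction` as many seeks and each
    seek completes `per_seek_speedup` times faster (seek-bound assumption)."""
    relative_time = seek_fraction / per_seek_speedup
    return 1.0 / relative_time

if __name__ == "__main__":
    # 20 mirrors vs a 23-disk raidz3: 1/20 of the seeks, each twice as fast.
    print(resilver_speedup(1 / 20, 2.0))
    # 3 raidz1 vdevs vs one raidz3: 1/3 of the seeks, counting no per-seek gain.
    print(resilver_speedup(1 / 3, 1.0))
```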

6. Please see the following calculation of the probability of failure of 20
mirrors vs a 23-disk raidz3.  According to my calculations, the probability
of a 4-disk failure in the raidz3 is approx 4.4E-4, and the probability of 2
disks in the same mirror failing is approx 5E-5.  So the chance of either
pool failing is very small, but the raidz3 is approx 10x more likely to
suffer pool failure than the mirror setup.  Granted, there is some linear
estimation which is not entirely accurate, but I think the calculation comes
within an order of magnitude of being correct.  For the same usable
capacity, the mirror setup needs roughly 74% more hardware (40 disks vs 23),
is approx 10x more reliable, and is much faster than the raidz3 setup.
http://dl.dropbox.com/u/543241/raidz3%20vs%20mirrors.pdf 
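
For anyone who wants to experiment with point 6, here is a sketch of one way
to set up such a comparison.  It is not the calculation from the linked PDF.
It assumes disks fail independently, and the per-disk failure probabilities
over the exposure window (`p_raidz` and `p_mirror` below) are placeholders
you must supply yourself.  Because the raidz3 resilver window is so much
longer, `p_raidz` should be correspondingly larger than `p_mirror`.

```python
from math import comb

def prob_at_least(k, n, p):
    """P(at least k of n independent disks fail), each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

def raidz3_loss(n_disks, p):
    """A raidz3 vdev survives 3 failed disks and is lost on the 4th."""
    return prob_at_least(4, n_disks, p)

def mirror_pool_loss(n_mirrors, p):
    """A pool of 2-way mirrors is lost when both disks of any one mirror fail."""
    one_mirror_survives = 1 - p * p
    return 1 - one_mirror_survives ** n_mirrors

if __name__ == "__main__":
    p_mirror = 0.001          # placeholder, not a measured value
    p_raidz = 40 * p_mirror   # placeholder: scaled by the 40x longer resilver
    print(raidz3_loss(23, p_raidz), mirror_pool_loss(20, p_mirror))
```

The result is very sensitive to the two placeholder probabilities, so treat
the output as a way to explore the trade-off, not as a reliability figure.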

...

Compare a 21-disk raidz3 versus 3 vdevs of 7-disk raidz1.  You get more
than 3x faster resilver time with the smaller vdevs, and you only get 3x
the redundancy in the raidz3.  That means the probability of 4
simultaneously failed disks in the raidz3 is higher than the probability of
2 failed disks in a single raidz1 vdev.

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
