> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> boun...@opensolaris.org] On Behalf Of Ray Van Dolson
> 
> System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs.  16 vdevs of
> 15 disks each -- RAIDZ3.  NexentaStor 3.1.2.

I think you'll get better performance and reliability if you break each of
those 15-disk raidz3's into three 5-disk raidz1's.  Here's why:

Obviously, with raidz3, if any 3 of 15 disks fail, you're still in
operation, and on the 4th failure, you're toast.
Obviously, with raidz1, if any 1 of 5 disks fails, you're still in operation,
and on the 2nd failure, you're toast.

So it's all about computing the probability of 4 overlapping failures in the
15-disk raidz3, or 2 overlapping failures in a smaller 5-disk raidz1.  In
order to calculate that, you need to estimate the time to resilver any one
failed disk...
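
For anyone who wants to actually plug in numbers, here's a rough sketch of
that calculation.  The 3% annual failure rate is a made-up assumption, the
resilver windows anticipate the estimates further down, and failures are
treated as independent, which real disks aren't:

  # Sketch: probability of losing a vdev once one disk has already failed,
  # assuming independent failures at a fixed (assumed) annual failure rate.
  from math import comb

  def p_fail_within(days, afr=0.03):
      """Probability one disk fails within `days`, given annual rate `afr`."""
      return afr * days / 365.0

  def p_vdev_loss(remaining_disks, extra_failures_needed, resilver_days, afr=0.03):
      """P(at least `extra_failures_needed` of the surviving disks fail
      before the resilver completes), modeled as a binomial."""
      p = p_fail_within(resilver_days, afr)
      return sum(comb(remaining_disks, k) * p**k * (1 - p)**(remaining_disks - k)
                 for k in range(extra_failures_needed, remaining_disks + 1))

  # 5-disk raidz1: 4 survivors, 1 more failure kills it; assume ~8-day resilver
  print(p_vdev_loss(4, 1, resilver_days=8))
  # 15-disk raidz3: 14 survivors, 3 more failures kill it; assume ~24-day resilver
  print(p_vdev_loss(14, 3, resilver_days=24))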

In ZFS, suppose you have a record of 128k, and suppose you have a 2-way
mirror vdev.  Then each disk writes 128k.  If you have a 3-disk raidz1, then
each disk writes 64k.   If you have a 5-disk raidz1, then each disk writes
32k.  If you have a 15-disk raidz3, then each disk writes 10.6k.  
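
To make that arithmetic explicit (ignoring ZFS padding and allocation
overhead, so purely illustrative):

  # Per-disk segment size for one 128k record in each layout.
  RECORD = 128 * 1024  # bytes

  layouts = {
      "2-way mirror":   (2, 0),   # (total disks, parity disks)
      "3-disk raidz1":  (3, 1),
      "5-disk raidz1":  (5, 1),
      "15-disk raidz3": (15, 3),
  }

  for name, (disks, parity) in layouts.items():
      # A mirror writes the whole record to each disk; raidz splits it
      # across the data (non-parity) disks.
      data_disks = disks - parity if parity else 1
      print(f"{name}: {RECORD / data_disks / 1024:.1f} KiB per disk")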

Assume you have a machine in production, you are doing autosnapshots, and
your data is volatile.  Over time this fragments your data, and after a year
or two in production, your resilver will be composed almost entirely of
random IO.  Each of the non-failed disks must read its segment of the stripe
in order to reconstruct the data that will be written to the new good disk.
If you're in the 15-disk raidz3 configuration, your segment size is approx 3x
smaller than with a 5-disk raidz1, which means approx 3x more IO operations.

Another way of saying that:  assuming the amount of data you will write to
your pool is the same regardless of which architecture you choose, let's say
you write 3T to your pool, and let's momentarily assume your whole pool is
composed of 15 disks, either as a single raidz3 or as 3x 5-disk raidz1.  If
you use one big raidz3, the 3T will require at least 24 million 128k records
to hold it all, and each 128k record will be divided up across all the disks.
If you use the smaller raidz1's, only 1T gets written to each vdev, so each
disk only holds segments of about 8 million records.  Thus, to resilver the
large vdev, you will require 3x more IO operations.
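
Same arithmetic in sketch form (assuming 128k records, binary TiB, and no
metadata overhead, so the counts are approximate):

  TIB = 2**40
  RECORD = 128 * 1024

  # Single 15-disk raidz3: every record is striped across all disks, so each
  # surviving disk must seek for a segment of every record in the pool.
  total_records = 3 * TIB // RECORD
  # 3x 5-disk raidz1: each vdev only holds a third of the data.
  per_raidz1_vdev = total_records // 3

  print(f"single raidz3: ~{total_records / 1e6:.0f} million segments per disk to read")
  print(f"each raidz1:   ~{per_raidz1_vdev / 1e6:.0f} million segments per disk to read")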

Worse still, on each IO request, you have to wait for the slowest of all
disks to return.  If you were in a 2-way mirror situation, your seek time
would be the average seek time of a single disk.  But if you were in an
infinite-disk situation, your seek time would be the worst case seek time on
every single IO operation, which is about 2x longer than the average seek
time.  So not only do you have 3x more seeks to perform, you have up to 2x
longer to wait upon each seek...
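
A quick simulation of that "wait for the slowest disk" effect, assuming seek
times are uniformly distributed up to some assumed worst case (a
simplification, but it shows the trend):

  import random

  def expected_slowest(n_disks, worst_ms=16.0, trials=100_000):
      """Average per-IO wait when you must wait for the slowest of N disks."""
      return sum(max(random.uniform(0, worst_ms) for _ in range(n_disks))
                 for _ in range(trials)) / trials

  print(expected_slowest(1))    # ~8 ms:  average seek of a single disk
  print(expected_slowest(4))    # ~12.8 ms: 4 surviving disks in a 5-disk raidz1
  print(expected_slowest(14))   # ~14.9 ms: 14 surviving disks in a 15-disk raidz3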

Now, to put some numbers on this...
A single 1T disk can sustain (let's assume) 1.0 Gbit/sec sequential
read/write.  This means resilvering the entire disk sequentially, including
unused space (which is not what ZFS does), would require about 2.2 hours.  In
practice, on my 1T disks, which are in a mirrored configuration, I find
resilvering takes 12 hours.  I would expect this to be ~4 days if I were
using 5-disk raidz1, and I would expect it to be ~12 days if I were using
15-disk raidz3.

Your disks are all 2T, so you should double all the times I just wrote.
Your raidz3 should be able to resilver a single disk in approx 24 days.  A
5-disk raidz1 should be able to do one in ~8 days.  If you were using
mirrors, ~1 day.
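
To reproduce those estimates: start from the measured 12-hour mirror
resilver, multiply by the extra IO count implied by the smaller segment
sizes, and by the up-to-2x seek penalty.  A rough scaling, not a model of
what the resilver code actually does:

  mirror_hours_1tb = 12.0   # measured: 1T mirror resilver

  estimates_1tb = {
      "mirror":         mirror_hours_1tb,
      "5-disk raidz1":  mirror_hours_1tb * 4 * 2,    # 32k vs 128k segments -> 4x IOs, 2x seeks
      "15-disk raidz3": mirror_hours_1tb * 12 * 2,   # 10.6k vs 128k segments -> 12x IOs, 2x seeks
  }

  for name, hours in estimates_1tb.items():
      print(f"{name}: ~{hours/24:.1f} days for 1T disks, ~{hours*2/24:.1f} days for 2T disks")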

Suddenly the prospect of multiple overlapping failures doesn't seem so
unlikely.
