On Thu, May 03, 2012 at 07:35:45AM -0700, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Ray Van Dolson
> >
> > System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs.  16
> > vdevs of 15 disks each -- RAIDZ3.  NexentaStor 3.1.2.
>
> I think you'll get better performance and reliability if you break
> each of those 15-disk raidz3's into three 5-disk raidz1's.  Here's
> why:
>
> Obviously, with raidz3, if any 3 of 15 disks fail, you're still in
> operation, and on the 4th failure, you're toast.  Obviously, with
> raidz1, if any 1 of 5 disks fails, you're still in operation, and on
> the 2nd failure, you're toast.
>
> So it's all about computing the probability of 4 overlapping failures
> in the 15-disk raidz3 versus 2 overlapping failures in a smaller
> 5-disk raidz1.  In order to calculate that, you need to estimate the
> time to resilver any one failed disk...
>
> In ZFS, suppose you have a 128k record, and suppose you have a 2-way
> mirror vdev.  Then each disk writes 128k.  If you have a 3-disk
> raidz1, each disk writes 64k.  If you have a 5-disk raidz1, each disk
> writes 32k.  If you have a 15-disk raidz3, each disk writes about
> 10.7k.
>
> Assume you have a machine in production, you are doing autosnapshots,
> and your data is volatile.  Over time this fragments your data, and
> after a year or two in production your resilver will be composed
> almost entirely of random IO.  Each of the non-failed disks must read
> its segment of the stripe in order to reconstruct the data that will
> be written to the new good disk.  If you're in the 15-disk raidz3
> configuration, your segment size is approx 3x smaller, which means
> approx 3x more IO operations.
>
> Another way of saying that: assume the amount of data you will write
> to your pool is the same regardless of which architecture you choose.
> For discussion purposes, let's say you write 3T to your pool, and
> let's momentarily assume your whole pool is composed of 15 disks, in
> either a single raidz3 or 3x 5-disk raidz1.  If you use one big
> raidz3, the 3T will require at least 24 million 128k records to hold
> it all, and each 128k record will be divided up onto all the disks.
> If you use the smaller raidz1's, only 1T will get written to each
> vdev, so each vdev needs only 8 million records.  Thus, to resilver
> the large vdev, you will require 3x more IO operations.
>
> Worse still, on each IO request you have to wait for the slowest of
> all the disks to return.  If you were in a 2-way mirror situation,
> your seek time would be the average seek time of a single disk.  But
> if you were in an infinite-disk situation, your seek time would be
> the worst-case seek time on every single IO operation, which is about
> 2x longer than the average seek time.  So not only do you have 3x
> more seeks to perform, you have up to 2x longer to wait on each
> seek...
>
> Now, to put some numbers on this...  A single 1T disk can sustain
> (let's assume) 1.0 Gbit/sec sequential read/write.  This means
> resilvering the entire disk sequentially, including unused space
> (which is not what ZFS does), would require 2.2 hours.  In practice,
> on my 1T disks, which are in a mirrored configuration, I find
> resilvering takes 12 hours.  I would expect this to be ~4 days if I
> were using 5-disk raidz1, and ~12 days if I were using 15-disk
> raidz3.
> Your disks are all 2T, so you should double all the times I just
> wrote.  Your raidz3 should be able to resilver a single disk in
> approx 24 days.  A 5-disk raidz1 should be able to do one in ~8 days.
> If you were using mirrors, ~1 day.
>
> Suddenly the prospect of multiple overlapping failures doesn't seem
> so unlikely.
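To make the stripe arithmetic above concrete, here is a rough sketch
in plain Python (nothing ZFS-specific; the 128k recordsize and the 3T
working set are Ed's example figures):

    # Rough sketch of the stripe-width arithmetic above, not ZFS internals.
    # An N-disk raidzP vdev splits each 128k record across N - P data disks.

    RECORDSIZE = 128 * 1024      # bytes, the default ZFS recordsize
    POOL_DATA = 3 * 2**40        # 3T written to the pool, per Ed's example

    def per_disk_segment(ndisks, parity):
        """Bytes each disk stores for one full 128k record."""
        return RECORDSIZE / (ndisks - parity)

    def resilver_ios(vdev_data):
        """Records, hence random reads per surviving disk, in one vdev."""
        return vdev_data / RECORDSIZE

    # One big 15-disk raidz3 holds the whole 3T on a single vdev:
    print(per_disk_segment(15, 3))     # ~10923 bytes, the "about 10.7k"
    print(resilver_ios(POOL_DATA))     # ~25.2 million records

    # Three 5-disk raidz1's put only 1T on each vdev:
    print(per_disk_segment(5, 1))      # 32768 bytes (32k)
    print(resilver_ios(POOL_DATA / 3)) # ~8.4 million, roughly 3x fewer IOs

The 3x ratio in record counts per vdev is the same 3x Ed derives from
the segment sizes.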
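The "wait for the slowest disk" penalty can be checked the same way.
A quick Monte Carlo sketch, under the crude assumption (mine, not
Ed's) that individual seeks are uniformly distributed around a
hypothetical 8.5 ms average:

    # Expected wait when every surviving disk must seek and the read
    # completes only when the slowest one returns.  Crude assumption:
    # each seek is uniform on [0, 2*avg], so a single disk averages
    # `avg` while the max over many disks approaches 2*avg -- Ed's
    # "up to 2x longer".
    import random

    AVG_SEEK_MS = 8.5            # hypothetical 7200RPM average seek
    TRIALS = 100_000

    def expected_slowest(ndisks):
        total = 0.0
        for _ in range(TRIALS):
            total += max(random.uniform(0, 2 * AVG_SEEK_MS)
                         for _ in range(ndisks))
        return total / TRIALS

    print(expected_slowest(1))   # ~8.5 ms: one disk, the mirror case
    print(expected_slowest(4))   # ~13.6 ms: 4 survivors of a 5-disk raidz1
    print(expected_slowest(14))  # ~15.9 ms: 14 survivors of a 15-disk raidz3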
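Ed's 2.2-hour figure is just the raw-bandwidth floor:

    # Lower bound: rewriting an entire 1T disk sequentially at 1.0 Gbit/sec.
    DISK_BYTES = 10**12          # 1T, counted the way disk vendors count
    RATE_BPS = 1.0e9 / 8         # 1.0 Gbit/sec in bytes per second

    hours = DISK_BYTES / RATE_BPS / 3600
    print(round(hours, 1))       # 2.2 hours, matching the figure above

Everything past that floor -- the 12 hours he observes on mirrors and
the multi-day raidz estimates -- is random-IO overhead.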
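And on the closing point about overlap: a rough sketch of how likely
a second failure is during each resilver window, assuming independent
failures at a constant rate.  The 5% annual failure rate is my own
hypothetical input, not Ed's; the windows are his doubled-for-2T
estimates:

    # P(at least one more disk fails while the resilver is running),
    # assuming independent failures at a constant annual rate.
    AFR = 0.05                   # hypothetical 5% annual failure rate

    def p_overlap(survivors, days):
        """Chance that any of `survivors` disks fails within `days`."""
        p_one = 1 - (1 - AFR) ** (days / 365.0)
        return 1 - (1 - p_one) ** survivors

    print(p_overlap(14, 24))     # ~4.6%: during a 24-day raidz3 resilver
    print(p_overlap(4, 8))       # ~0.45%: during an 8-day raidz1 resilver
    print(p_overlap(1, 1))       # ~0.014%: during a 1-day mirror resilver

(This is only the overlap rate, not the pool-loss rate: the raidz3
still tolerates two further failures beyond that second one.)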
Ed, thanks for taking the time to write this all out.  Definitely food
for thought.

Ray
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss