On Thu, May 03, 2012 at 07:35:45AM -0700, Edward Ned Harvey wrote:
> > From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-
> > boun...@opensolaris.org] On Behalf Of Ray Van Dolson
> >
> > System is a 240x2TB (7200RPM) system in 20 Dell MD1200 JBODs.  16
> > vdevs of 15 disks each -- RAIDZ3.  NexentaStor 3.1.2.
>
> I think you'll get better performance and reliability if you break
> each of those 15-disk raidz3's into three 5-disk raidz1's.  Here's
> why:
>
> Obviously, with raidz3, if any 3 of 15 disks fail, you're still in
> operation, and on the 4th failure, you're toast.  Obviously, with
> raidz1, if any 1 of 5 disks fails, you're still in operation, and on
> the 2nd failure, you're toast.
>
> So it's all about computing the probability of 4 overlapping failures
> in the 15-disk raidz3 versus 2 overlapping failures in a smaller
> 5-disk raidz1.  In order to calculate that, you need to estimate the
> time to resilver any one failed disk...
>
> In ZFS, suppose you have a 128k record, and suppose you have a 2-way
> mirror vdev.  Then each disk writes 128k.  If you have a 3-disk
> raidz1, each disk writes 64k.  If you have a 5-disk raidz1, each disk
> writes 32k.  If you have a 15-disk raidz3, each disk writes about
> 10.7k.
>
> Assume you have a machine in production, you are doing autosnapshots,
> and your data is volatile.  Over time this fragments your data, and
> after a year or two in production your resilver will be composed
> almost entirely of random IO.  Each of the non-failed disks must read
> its segment of the stripe in order to reconstruct the data that will
> be written to the new good disk.  If you're in the 15-disk raidz3
> configuration, your segment size is approx 3x smaller, which means
> approx 3x more IO operations.
>
> Another way of saying that: assume the amount of data you will write
> to your pool is the same regardless of which architecture you choose.
> For discussion purposes, let's say you write 3T to your pool, and
> let's momentarily assume your whole pool is composed of 15 disks, in
> either a single raidz3 or 3x 5-disk raidz1.  If you use one big
> raidz3, the 3T will require at least 24 million 128k records to hold
> it all, and each 128k record will be divided up onto all the disks.
> If you use the smaller raidz1's, only 1T will get written to each
> vdev, so each vdev needs only 8 million records.  Thus, to resilver
> the large vdev, you will require 3x more IO operations.
>
> Worse still, on each IO request you have to wait for the slowest of
> all the disks to return.  If you were in a 2-way mirror situation,
> your seek time would be the average seek time of a single disk.  But
> if you were in an infinite-disk situation, your seek time would be
> the worst-case seek time on every single IO operation, which is about
> 2x longer than the average seek time.  So not only do you have 3x
> more seeks to perform, you have up to 2x longer to wait on each
> seek...
>
> Now, to put some numbers on this...  A single 1T disk can sustain
> (let's assume) 1.0 Gbit/sec sequential read/write.  This means
> resilvering the entire disk sequentially, including unused space
> (which is not what ZFS does), would require 2.2 hours.  In practice,
> on my 1T disks, which are in a mirrored configuration, I find
> resilvering takes 12 hours.  I would expect this to be ~4 days if I
> were using 5-disk raidz1, and ~12 days if I were using 15-disk
> raidz3.
> Your disks are all 2T, so you should double all the times I just
> wrote.  Your raidz3 should be able to resilver a single disk in
> approx 24 days.  A 5-disk raidz1 should be able to do one in ~8 days.
> If you were using mirrors, ~1 day.
>
> Suddenly the prospect of multiple overlapping failures doesn't seem
> so unlikely.
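To make the stripe arithmetic above concrete, here is a rough sketch
in plain Python (nothing ZFS-specific; the 128k recordsize and the 3T
working set are Ed's example figures):

    # Rough sketch of the stripe-width arithmetic above, not ZFS internals.
    # An N-disk raidzP vdev splits each 128k record across N - P data disks.

    RECORDSIZE = 128 * 1024      # bytes, the default ZFS recordsize
    POOL_DATA = 3 * 2**40        # 3T written to the pool, per Ed's example

    def per_disk_segment(ndisks, parity):
        """Bytes each disk stores for one full 128k record."""
        return RECORDSIZE / (ndisks - parity)

    def resilver_ios(vdev_data):
        """Records, hence random reads per surviving disk, in one vdev."""
        return vdev_data / RECORDSIZE

    # One big 15-disk raidz3 holds the whole 3T on a single vdev:
    print(per_disk_segment(15, 3))     # ~10923 bytes, the "about 10.7k"
    print(resilver_ios(POOL_DATA))     # ~25.2 million records

    # Three 5-disk raidz1's put only 1T on each vdev:
    print(per_disk_segment(5, 1))      # 32768 bytes (32k)
    print(resilver_ios(POOL_DATA / 3)) # ~8.4 million, roughly 3x fewer IOs

The 3x ratio in record counts per vdev is the same 3x Ed derives from
the segment sizes.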
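The "wait for the slowest disk" penalty can be checked the same way.
A quick Monte Carlo sketch, under the crude assumption (mine, not
Ed's) that individual seeks are uniformly distributed around a
hypothetical 8.5 ms average:

    # Expected wait when every surviving disk must seek and the read
    # completes only when the slowest one returns.  Crude assumption:
    # each seek is uniform on [0, 2*avg], so a single disk averages
    # `avg` while the max over many disks approaches 2*avg -- Ed's
    # "up to 2x longer".
    import random

    AVG_SEEK_MS = 8.5            # hypothetical 7200RPM average seek
    TRIALS = 100_000

    def expected_slowest(ndisks):
        total = 0.0
        for _ in range(TRIALS):
            total += max(random.uniform(0, 2 * AVG_SEEK_MS)
                         for _ in range(ndisks))
        return total / TRIALS

    print(expected_slowest(1))   # ~8.5 ms: one disk, the mirror case
    print(expected_slowest(4))   # ~13.6 ms: 4 survivors of a 5-disk raidz1
    print(expected_slowest(14))  # ~15.9 ms: 14 survivors of a 15-disk raidz3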
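Ed's 2.2-hour figure is just the raw-bandwidth floor:

    # Lower bound: rewriting an entire 1T disk sequentially at 1.0 Gbit/sec.
    DISK_BYTES = 10**12          # 1T, counted the way disk vendors count
    RATE_BPS = 1.0e9 / 8         # 1.0 Gbit/sec in bytes per second

    hours = DISK_BYTES / RATE_BPS / 3600
    print(round(hours, 1))       # 2.2 hours, matching the figure above

Everything past that floor -- the 12 hours he observes on mirrors and
the multi-day raidz estimates -- is random-IO overhead.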
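And on the closing point about overlap: a rough sketch of how likely
a second failure is during each resilver window, assuming independent
failures at a constant rate.  The 5% annual failure rate is my own
hypothetical input, not Ed's; the windows are his doubled-for-2T
estimates:

    # P(at least one more disk fails while the resilver is running),
    # assuming independent failures at a constant annual rate.
    AFR = 0.05                   # hypothetical 5% annual failure rate

    def p_overlap(survivors, days):
        """Chance that any of `survivors` disks fails within `days`."""
        p_one = 1 - (1 - AFR) ** (days / 365.0)
        return 1 - (1 - p_one) ** survivors

    print(p_overlap(14, 24))     # ~4.6%: during a 24-day raidz3 resilver
    print(p_overlap(4, 8))       # ~0.45%: during an 8-day raidz1 resilver
    print(p_overlap(1, 1))       # ~0.014%: during a 1-day mirror resilver

(This is only the overlap rate, not the pool-loss rate: the raidz3
still tolerates two further failures beyond that second one.)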
Ed, thanks for taking the time to write this all out.  Definitely food
for thought.

Ray
_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss