On 9/8/2010 10:08 PM, Freddie Cash wrote:
> On Wed, Sep 8, 2010 at 6:27 AM, Edward Ned Harvey <sh...@nedharvey.com> wrote:
>> Both of the above situations resilver in equal time, unless there is a bus
>> bottleneck. 21 disks in a single raidz3 will resilver just as fast as 7
>> disks in a raidz1, as long as you are avoiding the bus bottleneck. But 21
>> disks in a single raidz3 provides better redundancy than 3 vdev's each
>> containing a 7 disk raidz1.
> No, it (21-disk raidz3 vdev) most certainly will not resilver in the
> same amount of time. In fact, I highly doubt it would resilver at
> all.
>
> My first foray into ZFS resulted in a 24-disk raidz2 vdev using 500 GB
> Seagate ES.2 and WD RE3 drives connected to 3Ware 9550SXU and 9650SE
> multilane controllers. Nice 10 TB storage pool. Worked beautifully as
> we filled it with data. Had less than 50% usage when a disk died.
>
> No problem, it's ZFS, it's meant to be easy to replace a drive, just
> offline, swap, replace, wait for it to resilver.
>
> Well, 3 days later, it was still under 10%, and every disk light was
> still solid green. SNMP showed over 100 MB/s of disk I/O continuously,
> and the box was basically unusable (5 minutes to get the password prompt
> to appear on the console).
>
> Tried rebooting a few times, stopped all disk I/O to the machine (it
> was our backups box, running rsync every night for - at the time - 50+
> remote servers), let it do its thing.
>
> After 3 weeks of trying to get the resilver to complete (or even reach
> 50%), we pulled the plug and destroyed the pool, rebuilding it using
> 3x 8-drive raidz2 vdevs. Things have been a lot smoother ever since.
> Have replaced 8 of the drives (1 vdev) with 1.5 TB drives. Have
> replaced multiple dead drives. Resilvers, while running outgoing
> rsync all day and incoming rsync all night, take 3 days for a 1.5 TB
> drive (with SNMP showing 300 MB/s disk I/O).
>
> You most definitely do not want to use a single super-wide raidz vdev.
> It just won't work.
>> Instead of the Best Practices Guide saying "Don't put more than ___ disks
>> into a single vdev," the BPG should say "Avoid the bus bandwidth bottleneck
>> by constructing your vdev's using physical disks which are distributed
>> across multiple buses, as necessary per the speed of your disks and buses."
> Yeah, I still don't buy it. Even spreading disks out such that you
> have 4 SATA drives per PCI-X/PCIe bus, I don't think you'd be able to
> get a 500 GB SATA disk to resilver in a 24-disk raidz vdev (even a
> raidz1) in a 50% full pool. Especially if you are using the pool for
> anything at the same time.
The thing that folks tend to forget is that raidz is IOPS-limited. For
the most part, if I want to reconstruct a single slab (stripe) of data,
I have to issue a read to EACH surviving disk in the vdev, and wait for
all of those reads to return, before I can write the reconstructed value
out to the disk under reconstruction.
This is *regardless* of the amount of data being reconstructed.
So, the bottleneck tends to be the IOPS capability of the single disk
being reconstructed. Having fewer disks in a vdev means each slab puts a
larger chunk of data on every disk, so each I/O to the resilvering disk
rebuilds more data, and fewer I/Os are required to finish the resilver.
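
To make that dependency concrete, here's a toy sketch (Python, single-parity
XOR only; raidz2/raidz3 parity is more involved, and ZFS's real on-disk layout
is not this simple): rebuilding one disk's share of a slab requires reading,
and waiting on, every surviving disk's share first.

    from functools import reduce

    def xor_shares(shares):
        """Byte-wise XOR of equal-length shares."""
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), shares)

    def reconstruct_share(surviving_shares):
        """Toy single-parity rebuild: the missing disk's share of a slab is
        the XOR of every surviving disk's share (data + parity). The point
        is the dependency: every surviving disk must be read, and every read
        must complete, before the one write to the resilvering disk can go."""
        return xor_shares(surviving_shares)

    # 4 data shares + 1 parity share; "lose" one data share and rebuild it.
    data = [bytes([i] * 8) for i in (1, 2, 3, 4)]
    parity = xor_shares(data)
    survivors = data[:2] + data[3:] + [parity]   # data[2] is on the dead disk
    assert reconstruct_share(survivors) == data[2]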
Example (for ease of calculation, let's do the disk-drive mfg's cheat of
1k = 1000 bytes):
Scenario 1: I have 5 1TB disks in a raidz1, and I assume a 128k slab
size. Thus, 32k of data from each slab is written to each disk (4 x 32k
data + 32k parity for a 128k slab). So, each I/O to the failed drive
reconstructs 32k of data, and it takes about 1TB/32k = 31e6 I/Os to
reconstruct the full 1TB drive.
Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab
size. In this case, there's only about 14k of data on each drive for a
given slab (128k spread across 9 data drives). This means each I/O to the
failed drive only writes about 14k, so it takes 1TB/14k = 71e6 I/Os to
complete.
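
To make the arithmetic easy to play with, here's a quick back-of-the-envelope
script (same 1k = 1000 cheat; the drive capacity, slab size, and parity count
are just the assumptions from the scenarios above, not anything queried from
ZFS):

    # I/Os needed at the resilvering disk if every slab must be rebuilt
    # one I/O at a time. Assumptions: 1 TB drive, 128k slabs, raidz1.

    DISK_BYTES = 1e12     # 1 TB, using 1k = 1000
    SLAB_BYTES = 128e3    # assumed slab (stripe) size

    def resilver_ios(total_disks, parity_disks=1):
        data_disks = total_disks - parity_disks
        share = SLAB_BYTES / data_disks      # bytes written per I/O
        return DISK_BYTES / share, share

    for n in (5, 10):
        ios, share = resilver_ios(n)
        print(f"{n}-disk raidz1: ~{share/1e3:.0f}k per I/O, ~{ios/1e6:.0f}e6 I/Os")

    # 5-disk raidz1:  ~32k per I/O, ~31e6 I/Os
    # 10-disk raidz1: ~14k per I/O, ~70e6 I/Os (71e6 above comes from rounding to 14k)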
From this, it's pretty easy to see that the number of I/Os required to
the resilvering disk goes up roughly linearly with the number of data
drives in a vdev. Since you're always going to be IOPS-bound by that
single resilvering disk, that puts a hard ceiling on how fast the
resilver can go.
In addition, remember that having more disks means you wait longer for
each I/O to complete. That is, it takes longer (fractionally, but in the
aggregate a measurable amount) for 9 drives to each return 14k of data
than it does for 4 drives to each return 32k, because each reconstruction
I/O has to wait for the slowest of those reads, and the expected
worst-case seek and rotational delay grows with the number of drives
involved. So, not only are you doing more total I/Os in Scenario 2, but
each I/O takes longer to complete (the read portion taking longer, the
write/reconstruct portion taking the same amount of time).
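
Putting the two effects together, a crude model of resilver time is simply
(number of I/Os) x (average per-I/O service time). The millisecond figures
below are made-up placeholders, not measurements from any real pool; the only
point is to show how the two penalties compound:

    # Crude model: resilver time ~ (I/Os required) x (per-I/O service time).
    # Scenario 2 loses twice: more I/Os, and a slightly longer wait per I/O
    # because each reconstruction I/O waits on the slowest of more spindles.

    def resilver_hours(ios, per_io_ms):
        return ios * (per_io_ms / 1e3) / 3600.0

    print(resilver_hours(31e6, per_io_ms=8.0))   # Scenario 1: ~69 hours
    print(resilver_hours(70e6, per_io_ms=9.0))   # Scenario 2: ~175 hours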
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)