On 3/20/2011 2:23 PM, Richard Elling wrote:
On Mar 20, 2011, at 12:48 PM, David Magda wrote:
On Mar 20, 2011, at 14:24, Roy Sigurd Karlsbakk wrote:
It all depends on the number of drives in the VDEV(s), traffic patterns
during resilver, VDEV fill, speed of the drives, etc. Still, close to 6 days
is a lot. Can you detail your configuration?
How many times do we have to rehash this? The speed of resilver is
dependent on the amount of data, the distribution of data on the
resilvering device, speed of the resilvering device, and the throttle. It is NOT
dependent on the number of drives in the vdev.
Thanks for clearing this up - I've been told large VDEVs lead to long resilver
times, but then, I guess that was wrong.
There was a thread ("Suggested RaidZ configuration...") a little while back
where the topic of IOps and resilver time came up:
http://mail.opensolaris.org/pipermail/zfs-discuss/2010-September/thread.html#44633
I think this message by Erik Trimble is a good summary:
hmmm... I must've missed that one, otherwise I would have said...
Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab
sizes. Thus, I have 32k of data for each slab written to each disk. (4x32k
data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 32k
of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to
reconstruct the full 1TB drive.
Here, the IOPS doesn't matter because the limit will be the media write
speed of the resilvering disk -- bandwidth.
Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes.
In this case, there's only about 14k of data on each drive for a slab. This
means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k =
71e6 IOPS to complete.
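
A quick Python sketch of that arithmetic (a rough model only: it assumes
128k slabs spread evenly over the data drives of a raidz1 and a full 1TB of
allocated data, and ignores metadata and compression; the exact counts shift
a little with rounding):

KiB = 1024
TB = 10**12   # vendor-style 1 TB

def resilver_writes(total_bytes, slab_bytes, data_drives):
    """Chunk written to the replacement disk per slab, and how many
    such writes a full rebuild needs."""
    chunk = slab_bytes / data_drives
    return chunk, total_bytes / chunk

for drives in (5, 10):                  # raidz1 vdev width
    data_drives = drives - 1            # one disk's worth of parity
    chunk, writes = resilver_writes(TB, 128 * KiB, data_drives)
    print(f"{drives}-disk raidz1: ~{chunk / KiB:.0f}k per write, "
          f"~{writes / 1e6:.0f}e6 writes to rebuild 1TB")
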
Here, IOPS might matter, but I doubt it. Where we see IOPS matter is when
the block sizes are small (e.g. metadata). In some cases you can see widely
varying resilver times when the data is large versus small. These changes
follow the temporal distribution of the original data. For example, if a
pool's life begins with someone loading their MP3 collection (large blocks,
mostly sequential) and then working on source code (small blocks, more
random, lots of creates/unlinks), then the resilver will be bandwidth bound
as it resilvers the MP3s and then IOPS bound as it resilvers the source.
Hence, the prediction of when resilver will finish is not very accurate.
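
To make that concrete, here is a toy model in Python (every number in it is
made up: an assumed 80 MB/s streaming write rate and ~100 random writes/s,
with the pool's early life all large sequential blocks and its later life
all small random ones) showing why a "fraction of blocks done" progress
figure gives a poor ETA:

MB = 10**6
SEQ_BW = 80 * MB   # assumed streaming write bandwidth of the new disk
IOPS   = 100       # assumed random-write rate of the new disk

def phase_time(nblocks, block_bytes, sequential):
    """Seconds to resilver one phase of the pool's history."""
    if sequential:
        return nblocks * block_bytes / SEQ_BW   # bandwidth bound
    return nblocks / IOPS                       # IOPS bound

phases = [
    ("MP3 era (128k, sequential)", 2_000_000, 128 * 1024, True),
    ("source era (4k, random)",    2_000_000, 4 * 1024,   False),
]

total_blocks = sum(p[1] for p in phases)
elapsed, done = 0.0, 0
for name, nblocks, size, sequential in phases:
    elapsed += phase_time(nblocks, size, sequential)
    done += nblocks
    print(f"{name}: {done / total_blocks:.0%} of blocks done, "
          f"{elapsed / 3600:.1f} h elapsed")

With these assumptions, half the blocks are done in well under an hour and
the remaining half takes several more hours, so an ETA extrapolated from the
first phase is badly off.
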
From this, it is pretty easy to see that the number of required I/Os to the
resilvering disk goes up linearly with the number of data drives in a vdev.
Since you're always going to be IOPS bound by the single resilvering disk,
you have a fixed limit.
You will not always be IOPS bound by the resilvering disk. You will be speed
bound by the resilvering disk, where speed is either write bandwidth or
random write IOPS.
-- richard
Really? Can you really be bandwidth limited on a (typical) RAIDZ resilver?
I can see where you might be on a mirror, with large slabs and essentially
sequential read/write - that is, since the drivers can queue up several
read/write requests at a time, you have the potential to be reading/writing
several (let's say 4) 128k slabs per single I/O. That means you read/write
at 512k per I/O for a mirror (best-case scenario). For a 7200 RPM drive,
that's 100 IOPS x 0.5 MB per I/O = 50 MB/s, which is lower than the maximum
throughput of a modern SATA drive. For one of the 15k SAS drives able to do
300 IOPS, you get 150 MB/s, which indeed exceeds a SAS drive's write
bandwidth.
For RAIDZn configs, however, you're going to be limited by the size of an
individual read/write. As Roy pointed out before, the max size of an
individual portion of a slab is 128k/X, where X = the number of data drives
in the RAIDZn. So, for a typical 4-data-drive RAIDZn, even in the best-case
scenario where I can queue multiple slab requests (say 4) into a single I/O,
I'm likely to top out at about 128k of data to write to the resilvered drive
per I/O. That leads to roughly 13 MB/s for the 7200 RPM drive and roughly
39 MB/s for the 15k drive, both well under their respective bandwidth
capability.
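
Roughly, in Python (again just a sketch: it assumes 128k slabs, 4 slabs'
worth of work coalesced into each physical write, and ~100 / ~300 random
IOPS for the 7200 RPM and 15k drives; the MB/s figures land within rounding
of the ones above):

KiB, MB = 1024, 10**6
SLAB = 128 * KiB
QUEUE_DEPTH = 4          # slabs coalesced into one physical write (assumed)

def iops_bound_rate(data_drives, iops):
    """MB/s written to the replacement disk if random IOPS is the limit."""
    per_write = QUEUE_DEPTH * SLAB // data_drives   # bytes landing per write
    return iops * per_write / MB

for label, iops in (("7200 RPM", 100), ("15k SAS", 300)):
    print(f"{label}: mirror ~{iops_bound_rate(1, iops):.0f} MB/s, "
          f"4-data-drive raidz ~{iops_bound_rate(4, iops):.0f} MB/s")
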
Even with large slab sizes, I really can't see any place where a RAIDZ
resilver isn't going to be IOPS bound when using HDs as backing store.
Mirrors are more likely to be bandwidth bound, but still, even in that case,
I think you're going to hit the IOPS barrier far more often than the
bandwidth barrier. Now, with SSDs as backing store, yes, you become bandwidth
limited, because the IOPS values of SSDs are at least an order of magnitude
greater than those of HDs, though both have roughly the same max bandwidth
characteristics.
Now, the *total* time it takes to resilver either a mirror or RAIDZ is
indeed primarily dependent on the number of allocated slabs in the vdev,
and the level of fragmentation of slabs. That essentially defines the
total amount of work that needs to be done. The above discussion
compares resilver times based on IDENTICAL data - that is, I'm
comparing how a RAIDZ and mirror resilver a given data pattern. So, if
you want to come up with how much time it will take a resilver to
complete, you have to worry about four things:
(1) How many total slabs do I have to resilver? (total data size is
irrelevant, it's the number of slabs required to store that amount of data)
(2) How fragmented are my files? (sequentially written, never re-written,
rarely deleted pools will be much faster than heavily modified and deleted
pools - essentially, how much seeking is my drive going to have to do?)
(3) Do I have a mirror or RAIDZ config (and, if RAIDZ, how many data drives)?
(4) What are the IOPS/bandwidth characteristics of the backing store I use
in #3?
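
If you want a back-of-the-envelope number out of those four factors,
something like the following Python sketch works (every input is an
assumption - the slab count, the fraction of slabs that need a seek, and the
80 MB/s / 100 IOPS drive - so treat the output as illustrative only):

KiB, MB = 1024, 10**6

def estimate_resilver_hours(n_slabs,          # (1) allocated slabs
                            frag_fraction,    # (2) share of slabs needing a seek
                            data_drives,      # (3) mirror = 1, raidz = N data drives
                            seq_bw=80 * MB,   # (4) drive streaming write bandwidth
                            iops=100,         # (4) drive random write IOPS
                            slab_bytes=128 * KiB):
    chunk = slab_bytes / data_drives          # bytes written per slab
    seq_slabs = n_slabs * (1 - frag_fraction)
    rand_slabs = n_slabs * frag_fraction
    seconds = seq_slabs * chunk / seq_bw + rand_slabs / iops
    return seconds / 3600

# e.g. 8 million 128k slabs (about 1TB of data), half of it fragmented:
for label, drives in (("mirror", 1), ("4-data-drive raidz", 4)):
    print(f"{label}: ~{estimate_resilver_hours(8_000_000, 0.5, drives):.1f} h")

With those made-up inputs the seek-heavy half of the pool dominates the total
either way, which is the IOPS-bound behaviour described above.
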
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA