On 3/20/2011 2:23 PM, Richard Elling wrote:
On Mar 20, 2011, at 12:48 PM, David Magda wrote:
On Mar 20, 2011, at 14:24, Roy Sigurd Karlsbakk wrote:
It all depends on the number of drives in the VDEV(s), traffic patterns
during resilver, VDEV fill, speed of the drives, etc. Still, close to 6 days
is a lot. Can you detail your configuration?
How many times do we have to rehash this? The speed of resilver is
dependent on the amount of data, the distribution of data on the
resilvering device, speed of the resilvering device, and the throttle. It is NOT
dependent on the number of drives in the vdev.
Thanks for clearing this up - I've been told large VDEVs lead to long resilver
times, but then, I guess that was wrong.
There was a thread ("Suggested RaidZ configuration...") a little while back
where the topic of IOps and resilver time came up:
http://mail.opensolaris.org/pipermail/zfs-discuss/2010-September/thread.html#44633
I think this message by Erik Trimble is a good summary:
hmmm... I must've missed that one, otherwise I would have said...
Scenario 1: I have 5 1TB disks in a raidz1, and I assume I have 128k slab
sizes. Thus, I have 32k of data for each slab written to each disk. (4x32k
data + 32k parity for a 128k slab size). So, each IOPS gets to reconstruct 32k
of data on the failed drive. It thus takes about 1TB/32k = 31e6 IOPS to
reconstruct the full 1TB drive.
Here, the IOPS doesn't matter because the limit will be the media write
speed of the resilvering disk -- bandwidth.
Scenario 2: I have 10 1TB drives in a raidz1, with the same 128k slab sizes.
In this case, there's only about 14k of data on each drive for a slab. This
means each IOPS to the failed drive only writes 14k. So, it takes 1TB/14k =
71e6 IOPS to complete.
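
A quick Python sketch of that arithmetic (a rough model only: it assumes
128k slabs spread evenly over the data drives of a raidz1 and a full 1TB of
allocated data, and ignores metadata and compression; the exact counts shift
a little with rounding):

KiB = 1024
TB = 10**12   # vendor-style 1 TB

def resilver_writes(total_bytes, slab_bytes, data_drives):
    """Chunk written to the replacement disk per slab, and how many
    such writes a full rebuild needs."""
    chunk = slab_bytes / data_drives
    return chunk, total_bytes / chunk

for drives in (5, 10):                  # raidz1 vdev width
    data_drives = drives - 1            # one disk's worth of parity
    chunk, writes = resilver_writes(TB, 128 * KiB, data_drives)
    print(f"{drives}-disk raidz1: ~{chunk / KiB:.0f}k per write, "
          f"~{writes / 1e6:.0f}e6 writes to rebuild 1TB")
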
Here, IOPS might matter, but I doubt it. Where we see IOPS matter is when
the block sizes are small (e.g. metadata). In some cases you can see widely
varying resilver times when the data is large versus small. These changes
follow the temporal distribution of the original data. For example, if a
pool's life begins with someone loading their MP3 collection (large blocks,
mostly sequential) and then working on source code (small blocks, more
random, lots of creates/unlinks), then the resilver will be bandwidth bound
as it resilvers the MP3s and then IOPS bound as it resilvers the source.
Hence, the prediction of when resilver will finish is not very accurate.
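
To make that concrete, here is a toy model in Python (every number in it is
made up: an assumed 80 MB/s streaming write rate and ~100 random writes/s,
with the pool's early life all large sequential blocks and its later life
all small random ones) showing why a "fraction of blocks done" progress
figure gives a poor ETA:

MB = 10**6
SEQ_BW = 80 * MB   # assumed streaming write bandwidth of the new disk
IOPS   = 100       # assumed random-write rate of the new disk

def phase_time(nblocks, block_bytes, sequential):
    """Seconds to resilver one phase of the pool's history."""
    if sequential:
        return nblocks * block_bytes / SEQ_BW   # bandwidth bound
    return nblocks / IOPS                       # IOPS bound

phases = [
    ("MP3 era (128k, sequential)", 2_000_000, 128 * 1024, True),
    ("source era (4k, random)",    2_000_000, 4 * 1024,   False),
]

total_blocks = sum(p[1] for p in phases)
elapsed, done = 0.0, 0
for name, nblocks, size, sequential in phases:
    elapsed += phase_time(nblocks, size, sequential)
    done += nblocks
    print(f"{name}: {done / total_blocks:.0%} of blocks done, "
          f"{elapsed / 3600:.1f} h elapsed")

With these assumptions, half the blocks are done in well under an hour and
the remaining half takes several more hours, so an ETA extrapolated from the
first phase is badly off.
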
From this, it is pretty easy to see that the number of required I/Os to the
resilvering disk goes up linearly with the number of data drives in a vdev.
Since you're always going to be IOPS bound by the single resilvering disk,
you have a fixed limit.
You will not always be IOPS bound by the resilvering disk. You will be speed
bound by the resilvering disk, where speed is either write bandwidth or
random write IOPS.
-- richard
Really? Can you really be bandwidth limited on a (typical) RAIDZ resilver?
I can see where you might be on a mirror, with large slabs and essentially
sequential read/write - that is, since the drivers can queue up several
read/write requests at a time, you have the potential to be reading/writing
several (let's say 4) 128k slabs per single I/O. That means you read/write
at 512k per I/O for a mirror (best-case scenario). For a 7200 RPM drive,
that's 100 IOPS x 0.5 MB per I/O = 50 MB/s, which is lower than the maximum
throughput of a modern SATA drive. For one of the 15k SAS drives able to do
300 IOPS, you get 150 MB/s, which indeed exceeds a SAS drive's write
bandwidth.
For RAIDZn configs, however, you're going to be limited by the size of an
individual read/write. As Roy pointed out before, the max size of an
individual portion of a slab is 128k/X, where X = the number of data drives
in the RAIDZn. So, for a typical 4-data-drive RAIDZn, even in the best-case
scenario where I can queue multiple slab requests (say 4) into a single I/O,
I'm likely to top out at about 128k of data to write to the resilvered drive
per I/O. That leads to roughly 13 MB/s for the 7200 RPM drive and roughly
39 MB/s for the 15k drive, both well under their respective bandwidth
capability.
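
Roughly, in Python (again just a sketch: it assumes 128k slabs, 4 slabs'
worth of work coalesced into each physical write, and ~100 / ~300 random
IOPS for the 7200 RPM and 15k drives; the MB/s figures land within rounding
of the ones above):

KiB, MB = 1024, 10**6
SLAB = 128 * KiB
QUEUE_DEPTH = 4          # slabs coalesced into one physical write (assumed)

def iops_bound_rate(data_drives, iops):
    """MB/s written to the replacement disk if random IOPS is the limit."""
    per_write = QUEUE_DEPTH * SLAB // data_drives   # bytes landing per write
    return iops * per_write / MB

for label, iops in (("7200 RPM", 100), ("15k SAS", 300)):
    print(f"{label}: mirror ~{iops_bound_rate(1, iops):.0f} MB/s, "
          f"4-data-drive raidz ~{iops_bound_rate(4, iops):.0f} MB/s")
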
Even with large slab sizes, I really can't see any place where a RAIDZ
resilver isn't going to be IOPS bound when using HDs as backing store.
Mirrors are more likely to be bandwidth bound, but still, even in that case,
I think you're going to hit the IOPS barrier far more often than the
bandwidth barrier. Now, with SSDs as backing store, yes, you become bandwidth
limited, because the IOPS values of SSDs are at least an order of magnitude
greater than those of HDs, though both have roughly the same max bandwidth
characteristics.
Now, the *total* time it takes to resilver either a mirror or RAIDZ is
indeed primarily dependent on the number of allocated slabs in the vdev,
and the level of fragmentation of slabs. That essentially defines the
total amount of work that needs to be done. The above discussion
compares resilver times based on IDENTICAL data - that is, I'm
comparing how a RAIDZ and mirror resilver a given data pattern. So, if
you want to come up with how much time it will take a resilver to
complete, you have to worry about four things:
(1) How many total slabs do I have to resilver? (total data size is
irrelevant, it's the number of slabs required to store that amount of data)
(2) How fragmented are my files? (sequentially written, never re-written,
rarely deleted pools will be much faster than heavily modified and deleted
pools - essentially, how much seeking is my drive going to have to do?)
(3) Do I have a mirror or RAIDZ config (and, if RAIDZ, how many data drives)?
(4) What are the IOPS/bandwidth characteristics of the backing store I use
in #3?
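
If you want a back-of-the-envelope number out of those four factors,
something like the following Python sketch works (every input is an
assumption - the slab count, the fraction of slabs that need a seek, and the
80 MB/s / 100 IOPS drive - so treat the output as illustrative only):

KiB, MB = 1024, 10**6

def estimate_resilver_hours(n_slabs,          # (1) allocated slabs
                            frag_fraction,    # (2) share of slabs needing a seek
                            data_drives,      # (3) mirror = 1, raidz = N data drives
                            seq_bw=80 * MB,   # (4) drive streaming write bandwidth
                            iops=100,         # (4) drive random write IOPS
                            slab_bytes=128 * KiB):
    chunk = slab_bytes / data_drives          # bytes written per slab
    seq_slabs = n_slabs * (1 - frag_fraction)
    rand_slabs = n_slabs * frag_fraction
    seconds = seq_slabs * chunk / seq_bw + rand_slabs / iops
    return seconds / 3600

# e.g. 8 million 128k slabs (about 1TB of data), half of it fragmented:
for label, drives in (("mirror", 1), ("4-data-drive raidz", 4)):
    print(f"{label}: ~{estimate_resilver_hours(8_000_000, 0.5, drives):.1f} h")

With those made-up inputs the seek-heavy half of the pool dominates the total
either way, which is the IOPS-bound behaviour described above.
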
--
Erik Trimble
Java System Support
Mailstop: usca22-123
Phone: x17195
Santa Clara, CA