On 9/9/2010 6:19 AM, Edward Ned Harvey wrote:
From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Erik Trimble

the thing that folks tend to forget is that RaidZ is IOPS limited.  For
the most part, if I want to reconstruct a single slab (stripe) of data,
I have to issue a read to EACH disk in the vdev, and wait for that disk
to return the value, before I can write the computed parity value out to
the disk under reconstruction.
If I'm trying to interpret your whole message, Erik, and condense it, I
think I get the following.  Please tell me if and where I'm wrong.

In any given zpool, some number of slabs are used across the whole pool.  In
raidzN, a portion of each slab is written on each disk.  Therefore, during
resilver, if there are a total of 1 million slabs used in the zpool, each good
disk will need to read 1 million partial slabs, and the replaced disk will need
to write 1 million partial slabs.  Each good disk receives a read request in
parallel, and all of them must complete before a write is issued to the new
disk.  Each read/write cycle completes before the next cycle begins.  (It seems
this could be accelerated by letting the good disks continue reading in
parallel instead of waiting, right?)
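
Purely to illustrate that cycle (my own toy model, not anything from the ZFS
code; the disk count, slab count, and latency figures below are made-up
assumptions), a few lines of Python:

import random

N_DISKS = 8         # surviving disks in the raidz vdev (assumed figure)
N_SLABS = 100_000   # partial slabs to reconstruct (assumed figure)
WRITE_MS = 8.0      # assumed write latency on the replacement disk, in ms

def read_ms():
    """Assumed single-disk random-read latency, in milliseconds."""
    return random.uniform(4.0, 12.0)

serial_ms = 0.0     # read all survivors, write, only then start the next slab
pipelined_ms = 0.0  # hypothetical: survivors keep reading while the write runs

for _ in range(N_SLABS):
    slowest_read = max(read_ms() for _ in range(N_DISKS))
    serial_ms += slowest_read + WRITE_MS
    pipelined_ms += max(slowest_read, WRITE_MS)

print(f"strictly serialized resilver: ~{serial_ms / 3_600_000:.2f} h")
print(f"reads overlapped with writes: ~{pipelined_ms / 3_600_000:.2f} h")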

The conclusion I would reach is:

Given no bus bottleneck:

It is true that resilvering a raidz will be slower with many disks in the
vdev, because the average latency for the worst of N disks will increase as
N increases.  But that effect is only marginal, and bounded between the
average latency of a single disk, and the worst case latency of a single
disk.
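
A quick way to convince yourself of that bound (again a toy sketch with
made-up latency numbers, assuming a single-disk random read takes 4-12 ms):
the per-cycle cost is the slowest of N draws from the single-disk latency
distribution, and that can never exceed the distribution's worst case.

import random

def read_ms():
    # assumed single-disk random-read latency: ~8 ms average, 12 ms worst case
    return random.uniform(4.0, 12.0)

TRIALS = 20_000
for n_disks in (2, 4, 8, 16, 32):
    avg_slowest = sum(max(read_ms() for _ in range(n_disks))
                      for _ in range(TRIALS)) / TRIALS
    print(f"{n_disks:2d} disks: expected slowest read ~{avg_slowest:.1f} ms")

# The figure creeps from the ~8 ms single-disk average toward the 12 ms worst
# case as the disk count grows, but never past it -- the "marginal and
# bounded" effect described above.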

The characteristic that *really* makes a big difference is the number of
slabs in the pool, i.e. whether your filesystem is composed mostly of small
files or fragments, versus mostly large unfragmented files.
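
Rough numbers (my own back-of-envelope, assuming one resilver cycle per slab
and a disk that sustains on the order of 100 random-read cycles per second):

# Back-of-envelope only: resilver time ~ (slabs to walk) / (cycles per second).
# 100 cycles/s is an assumed figure for a 7200rpm disk; real numbers vary.
CYCLES_PER_SEC = 100

for slabs in (1_000_000, 10_000_000, 50_000_000):
    hours = slabs / CYCLES_PER_SEC / 3600
    print(f"{slabs:>11,} slabs -> ~{hours:7.1f} hours ({hours / 24:5.1f} days)")

# 1M slabs (few large files)   -> ~2.8 hours
# 50M slabs (many small files) -> ~138.9 hours, i.e. nearly 6 days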



Oh, and a mea culpa on converting hours to weeks instead of days. I did the math, then forgot which unit I was dealing in. Oops.


Your reading of my posts is correct. Indeed, the number of slabs is critical, as it directly determines the number of IOPS needed. One of the very nice speedups for resilvering would be the ability to read several contiguous slabs (as physically laid out on the disks) in a single I/O - the difference between reading a 128k slab portion and 5 consecutive 64k slab portions is trivial, so the ability to do more than one slab at a time would be critical for improving resilver times. I have *no* idea how hard this is - given that resilvering currently walks the space allocation tree (which is in creation-time order), it generally doesn't generate good consecutive slab requests, so things would have to change from being tree-driven to being layout-on-disk-driven.
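
To make that concrete, here is a toy sketch of mine (not anything in the
actual ZFS resilver code): sort the slab extents by physical offset and merge
adjacent ones into a single larger read, and the I/O count collapses wherever
the on-disk layout happens to be contiguous.

def coalesce(extents, max_io=1_048_576):
    """Merge physically adjacent (offset, length) extents into larger reads.
    Toy illustration only, not how the ZFS resilver actually issues I/O."""
    merged = []
    for off, length in sorted(extents):
        prev_off, prev_len = merged[-1] if merged else (None, None)
        if merged and off == prev_off + prev_len and prev_len + length <= max_io:
            merged[-1] = (prev_off, prev_len + length)
        else:
            merged.append((off, length))
    return merged

# Five consecutive 64k slab portions collapse into one 320k read: 5 I/Os -> 1.
slabs = [(i * 65_536, 65_536) for i in range(5)]
print(coalesce(slabs))   # [(0, 327680)]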

--
Erik Trimble
Java System Support
Mailstop:  usca22-123
Phone:  x17195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
