Since my name was mentioned, a couple of things:

(a) I'm not infallible. :-)

(b) In my posts, I said "slab" where I really should have said
"record".  "Record" is more accurate to what's actually going on.

(c) It is possible for the constituent drives in a RaidZ to be issued
concurrent requests for portions of a record, which *may* increase
efficiency. So, the "assembly" of a complete record isn't a completely
serial operation (that is, ZFS doesn't wait for all the parts of one
record to be assembled before issuing requests for the next record).
As a result, drives may have requests for portions of multiple records
sitting in their "todo" queues. Thus, all "good" (i.e. being rebuilt
*from*) drives should be constantly busy, not waiting around for the
others to finish reading their data.  That all said, I don't see where
in the code the limit on how many records can be done in parallel is
set. 2? 4? 20?  It matters quite a bit.
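
To make the parallelism in (c) concrete, here's a toy model - emphatically
NOT the actual ZFS code; the names and the MAX_INFLIGHT constant are made
up for illustration, since I can't find the real limit - of fragments from
several records sitting in each drive's queue at once:

/* Hypothetical sketch: each raidz child has its own request queue, and a
 * bounded number of records may have fragments outstanding at once. */
#include <stdio.h>

#define NDRIVES      5   /* raidz children being read from            */
#define MAX_INFLIGHT 4   /* assumed: records reconstructed in parallel */

struct frag_req {
    int record_id;       /* which record this fragment belongs to */
    int drive;           /* which child drive must read it        */
};

/* Issue reads for every fragment of a record without waiting for the
 * previous record's fragments to complete. */
static void issue_record_reads(int record_id, int queued[NDRIVES])
{
    for (int d = 0; d < NDRIVES; d++) {
        struct frag_req r = { record_id, d };
        queued[r.drive]++;   /* fragment lands in drive d's "todo" queue */
        printf("record %d: fragment queued on drive %d (depth now %d)\n",
               r.record_id, r.drive, queued[r.drive]);
    }
}

int main(void)
{
    int queued[NDRIVES] = { 0 };

    /* Up to MAX_INFLIGHT records have fragments outstanding at once, so
     * every healthy drive stays busy instead of idling while the slowest
     * drive finishes the current record. */
    for (int rec = 0; rec < MAX_INFLIGHT; rec++)
        issue_record_reads(rec, queued);

    return 0;
}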

(d) Writing the completed record parts (i.e. the segments that need to
be resilvered) is also queued up, so, for the most part, the replaced
drive is doing relatively sequential IO.  That is, *usually* the head
doesn't have to seek and *may* not even have to wait much on rotational
delay - it just stays where it left off and writes the next
reconstructed data.  Now, for drives which are not being replaced, but
are rather just "stale", this often isn't true, and those drives may be
stuck seeking quite a bit.  But, since they're usually only slightly
stale, it isn't noticed that much.


(e) Given (c) above, the performance of a drive being read from tends
to be about "average" for random IO - that is, half the max seek time,
plus half the average rotational latency. NCQ/etc. will help this by
clustering reads, so actual performance should be better than a pure
average, but I'd not bet on a significant improvement.  And, for a
typical pool, I'm going to make a bald-faced statement that the HD read
cache is going to be much less helpful than usual (for a typical
filesystem with lots of small files, most files fit in a single record,
and the next location on the HD is likely NOT to be something you want)
- that is, HD read-ahead cache misses are going to be frequent.  All
this assumes you are reconstructing a drive which has not been written
to sequentially - those types of zpools will resilver much faster than
zpools exposed to "typical" read/write patterns.
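
For a rough feel for (e), here's that back-of-the-envelope arithmetic as a
tiny C program.  The drive numbers (16ms full-stroke seek, 7200rpm) are
assumptions for illustration, not measurements from any particular disk:

#include <stdio.h>

int main(void)
{
    double max_seek_ms = 16.0;             /* assumed full-stroke seek time  */
    double rotation_ms = 60000.0 / 7200.0; /* ~8.33 ms per rev at 7200rpm    */

    /* "half the max seek time, plus half the average rotational latency";
     * average rotational latency is itself half a revolution.              */
    double avg_ms = max_seek_ms / 2.0 + (rotation_ms / 2.0) / 2.0;
    double iops   = 1000.0 / avg_ms;

    printf("estimated service time: %.1f ms -> ~%.0f IOPS per drive\n",
           avg_ms, iops);
    return 0;
}

With those assumed numbers you land around 10ms per IO, or roughly 100 IOPS
per drive, before NCQ does whatever it can.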

(f)  IOPS is going to be the limiting factor, particularly for the
resilvering drive, since there is less opportunity to group writes than
there is to group reads (even allowing for (d) above).  My reading of
the code is that ZFS issues writes to the resilvering drive as the
opportunity arises - that is, ZFS itself doesn't try to batch up
multiple records into a single write request.  I'd like verification of
this, though.
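
And a minimal sketch of what I'm describing in (f), assuming my reading of
the code is right - the struct and function names here are hypothetical,
not from the ZFS source: each reconstructed record goes out as its own
write, with no coalescing of adjacent records into a bigger request.

#include <stdio.h>

struct record {
    unsigned long long offset;  /* where the rebuilt data belongs */
    size_t             size;    /* one record's worth of data     */
};

/* One write request per reconstructed record. */
static void resilver_write(const struct record *r)
{
    printf("write %zu bytes at offset %llu\n", r->size, r->offset);
}

int main(void)
{
    struct record recs[] = {
        { 0,      131072 },
        { 131072, 131072 },
        { 262144, 131072 },
    };

    /* No batching: three adjacent records become three separate write
     * requests, which is why the resilvering drive tends to be IOPS-bound. */
    for (size_t i = 0; i < sizeof(recs) / sizeof(recs[0]); i++)
        resilver_write(&recs[i]);

    return 0;
}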



-Erik


-- 
Erik Trimble
Java System Support
Mailstop:  usca22-317
Phone:  x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
