Since my name was mentioned, a couple of things: (a) I'm not infallible. :-)
(b) In my posts, I swapped "slab" for "record". I really should have said "record"; it's more correct as to what's going on.

(c) It is possible for the constituent drives in a raidz to be issued concurrent requests for portions of a record, which *may* increase efficiency. So the "assembly" of a complete record isn't a completely serial operation; that is, ZFS doesn't wait for all the parts of one record to be assembled before issuing further requests for the next record. Drives may therefore have requests for portions of multiple records sitting in their "todo" queues, so all the "good" drives (i.e. the ones being rebuilt *from*) should be constantly busy, not waiting around for the others to finish reading data. That all said, I don't see where in the code the number of records that can be reconstructed in parallel is set. 2? 4? 20? It matters quite a bit. (There's a toy sketch of the idea at the end of this post.)

(d) Writing completed record parts (i.e. the segments that need to be resilvered) is also queued up, so, for the most part, the replaced drive is doing relatively sequential I/O. That is, the head *usually* doesn't have to seek and *may* not even have to wait much for rotational delay; it just stays where it left off and writes the next reconstructed data. For drives which are not replaced but merely "stale", this often isn't true, and those drives may be stuck seeking quite a bit. But since they're usually only slightly stale, it isn't noticed that much.

(e) Given (c) above, the average performance of a drive being read from tends to be "average" for random I/O: half the maximum seek time plus half the rotational period (i.e. the average rotational latency). NCQ and the like will help by clustering reads, so actual performance should be better than a pure average, but I wouldn't bet on a significant improvement. And, for typical pools, I'm going to make a bald-faced statement that the HD read cache is going to be much less helpful than usual: in a typical filesystem with lots of small files, most files fit in a single record, and the next location on the HD is likely NOT to be something you want, so HD read-ahead cache misses are going to be frequent. All this assumes you are reconstructing a drive which has not been written mostly sequentially; those types of zpools will resilver much faster than zpools exposed to "typical" read/write patterns. (Some back-of-the-envelope numbers are at the end of this post.)

(f) IOPS is going to be the limiting factor, particularly for the resilvering drive, as there is less opportunity to group writes than there is to group reads (even allowing for (d) above). My reading of the code is that ZFS issues writes to the resilvering drive as the opportunity comes; that is, ZFS itself doesn't try to batch multiple records into a single write request. I'd like verification of this, though.

-Erik

--
Erik Trimble
Java System Support
Mailstop:  usca22-317
Phone:  x67195
Santa Clara, CA
Timezone: US/Pacific (GMT-0800)
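P.S. Here's the toy sketch I mentioned under (c). This is NOT the actual ZFS resilver code; the names and the MAX_INFLIGHT / drive / record counts are all made up. It's only meant to show why allowing a bounded number of records to be in flight keeps every source drive's queue non-empty instead of letting drives idle between records:

/*
 * Toy model (NOT ZFS code) of point (c): if several records can be
 * "in flight" at once, every source drive in a raidz keeps a queue of
 * column reads and never sits idle waiting for one record to finish.
 */
#include <stdio.h>

#define NDRIVES      4    /* source drives we read columns from          */
#define NRECORDS     8    /* records to reconstruct                      */
#define MAX_INFLIGHT 3    /* assumed cap on records being rebuilt at once */

int main(void)
{
    int queued[NDRIVES] = { 0 };  /* column reads sitting in each drive's queue */
    int issued = 0, completed = 0;

    while (completed < NRECORDS) {
        /* Issue column reads for new records up to the in-flight cap,
         * without waiting for earlier records to finish assembling.    */
        while (issued < NRECORDS && issued - completed < MAX_INFLIGHT) {
            for (int d = 0; d < NDRIVES; d++)
                queued[d]++;      /* one column read per source drive   */
            printf("record %d: column reads queued on all %d drives\n",
                   issued, NDRIVES);
            issued++;
        }

        /* Pretend the oldest in-flight record finishes: each drive has
         * serviced one read from its queue, and the reconstructed data
         * is queued as a (mostly sequential) write to the new drive.   */
        for (int d = 0; d < NDRIVES; d++)
            queued[d]--;
        printf("record %d: reconstructed, write queued to new drive "
               "(reads still pending per source drive: %d)\n",
               completed, queued[0]);
        completed++;
    }
    return 0;
}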
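And the back-of-the-envelope numbers I mentioned under (e)/(f). The seek time, RPM, pool size, and the "one random I/O per record on the bottleneck drive" model are assumptions I picked for illustration, not measurements of any particular setup:

/*
 * Back-of-the-envelope numbers for (e)/(f).  All drive parameters are
 * assumptions for a generic 7200 RPM disk, and the "one random I/O per
 * record" model is deliberately crude.
 */
#include <stdio.h>

int main(void)
{
    double max_seek_ms = 16.0;           /* assumed full-stroke seek time */
    double rpm         = 7200.0;
    double rotation_ms = 60000.0 / rpm;  /* ~8.33 ms per revolution       */

    /* Per (e): half the max seek plus half a rotation (the average
     * rotational latency).                                              */
    double access_ms = max_seek_ms / 2.0 + rotation_ms / 2.0;
    double iops      = 1000.0 / access_ms;
    printf("avg random access %.1f ms -> roughly %.0f IOPS per drive\n",
           access_ms, iops);

    /* Crude resilver estimate: 1 TB of data in 128 KB records, one random
     * I/O per record on the bottleneck drive.                            */
    double records = 1e12 / (128.0 * 1024.0);
    double hours   = records / iops / 3600.0;
    printf("1 TB of 128 KB records: ~%.0f million records, ~%.0f hours\n",
           records / 1e6, hours);
    return 0;
}

With those made-up numbers you land somewhere around 80 random IOPS per drive and a day-ish of resilver time, which is why IOPS rather than raw bandwidth is what I'd watch.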