On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble <erik.trim...@oracle.com> wrote:
>
> The problem boils down to this:
>
> When ZFS does a resilver, it walks the METADATA tree to determine what order to rebuild things from. That means it resilvers the very first slab ever written, then the next oldest, and so on. The problem here is that slab "age" has nothing to do with where that data physically resides on the actual disks. If you've used the zpool as a WORM device, then, sure, there should be a strict correlation between increasing slab age and locality on the disk. However, in any reasonable case, files get deleted regularly. This means there is a high probability that a slab B, written immediately after slab A, WON'T be physically near slab A.
>
> In the end, the problem is that using metadata order, while reducing the total amount of work to do in the resilver (as you only resilver live data, not every bit on the drive), increases the physical inefficiency for each slab. That is, seek time between cylinders begins to dominate your slab reconstruction time. In RAIDZ, this problem is magnified by both the much larger average vdev size vs. mirrors, and the necessity that all drives containing that slab's information return their data before the corrected data can be written to the resilvering drive.
>
> Thus, current ZFS resilvering tends to be seek-time limited, NOT throughput limited. This is really the "fault" of the underlying media, not ZFS. For instance, if you have a RAIDZ of SSDs (where seek time is negligible, but throughput isn't), they resilver really, really fast. In fact, they resilver at the maximum write throughput rate. However, HDs are severely seek-limited, so that dominates HD resilver time.
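To put rough numbers on the seek-dominance point above, here is a toy cost model (Python; this is not ZFS code, and both constants are invented) that visits the same set of live blocks once in birth/metadata order and once sorted by disk offset:

    import random

    SEEK_MS_PER_GB = 0.5      # made-up constant: ms of seek per GB of head travel
    XFER_MS_PER_BLOCK = 0.1   # made-up constant: ms to transfer one block

    def visit_cost_ms(offsets_gb):
        """Crude seek + transfer cost of visiting blocks in the given order."""
        t, pos = 0.0, 0.0
        for off in offsets_gb:
            t += abs(off - pos) * SEEK_MS_PER_GB + XFER_MS_PER_BLOCK
            pos = off
        return t

    # live blocks whose birth order is uncorrelated with their disk position
    birth_order = [random.uniform(0, 1000) for _ in range(100_000)]
    lba_order = sorted(birth_order)

    print("metadata/birth order: %.0f s" % (visit_cost_ms(birth_order) / 1000))
    print("LBA-sorted order:     %.0f s" % (visit_cost_ms(lba_order) / 1000))

With these made-up constants the birth-order pass comes out several orders of magnitude slower than the sorted pass, which is the seek-limited behaviour described above.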
You guys may be interested in a solution I used in a totally different situation. There, an identical tree data structure had to be maintained on every node of a distributed system. When a new node was added, it needed to be initialized with an identical copy before it could be put in operation. But this had to be done while the rest of the system was operational, and there could even be updates from a central node during the `mirroring' operation. Some of these updates could completely change the tree! Starting at the root was not going to work, since a subtree being copied might stop existing mid-copy and its space be reused! In a way this is a similar problem (but worse!). I needed something foolproof and simple.

My algorithm started copying sequentially from the start. If N blocks were already copied when an update came along, updates of any block with block# > N were ignored (since the sequential copy would get to them eventually). Updates of any block# <= N were queued up (a further update of the same block would overwrite the old update, to reduce work). Periodically they would be flushed out to the new node. This was paced so as to not affect normal operation much.

I should think a variation would work for active filesystems. You sequentially read some amount of data from all the disks that contribute data for the new disk being prepared, and write it out sequentially. Each time, read enough data that reading time dominates any seek time. Handle concurrent updates as above. If you dedicate N% of the time to resilvering, the total time to complete the resilver will be 100/N times the sequential read time of the whole disk. (For example: 1TB disk, 100MBps IO speed, 25% for resilver => under 12 hours.) How much worse this gets depends on the amount of updates during resilvering.

At the time of resilvering your FS is more likely to be near full than near empty, so I wouldn't worry about optimizing the mostly-empty FS case.

Bakul
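P.S. For concreteness, a minimal sketch of the copy-plus-queued-updates scheme in Python. It is an in-memory toy (the "disks" are just lists, and random_updates stands in for whatever write traffic arrives during the copy), not filesystem code:

    import random

    NBLOCKS = 1000

    def mirror(src, dst, updates_this_step):
        pending = {}                          # block# -> latest queued data
        copied = 0                            # N: blocks sequentially copied so far

        for blkno in range(NBLOCKS):
            dst[blkno] = src[blkno]           # sequential copy
            copied = blkno + 1

            # handle writes that arrived while copying this block
            for upd_blk, data in updates_this_step():
                src[upd_blk] = data           # the live copy always takes the write
                if upd_blk < copied:          # behind the cursor: queue it; a newer
                    pending[upd_blk] = data   # update simply overwrites an older one
                # ahead of the cursor: ignored, the sequential pass will pick it up

            if len(pending) >= 64:            # periodic, paced flush
                for b, d in pending.items():
                    dst[b] = d
                pending.clear()

        for b, d in pending.items():          # final flush
            dst[b] = d

    def random_updates():
        # a couple of writes per step, anywhere on the "disk"
        return [(random.randrange(NBLOCKS), random.random()) for _ in range(2)]

    src = [random.random() for _ in range(NBLOCKS)]
    dst = [None] * NBLOCKS
    mirror(src, dst, random_updates)
    assert dst == src                         # the new copy ends up identical

The "paced" part here is just a flush threshold; in a real filesystem the pacing would be whatever fraction of IO time you give the resilver, which is where the 100/N bound above comes from.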