Erik, just a hypothetical what-if ...
In the case of resilvering on a mirrored disk, why not take a snapshot, and
then resilver by doing a pure block copy from the snapshot? It would be
sequential so long as the original data was unmodified, and random access
only when dealing with the modified blocks, right? After the original
snapshot had been replicated, a second pass would be done to update the
clone to 100% live data.

Not knowing enough about the inner workings of ZFS snapshots, I don't know
why this would not be doable. (I'm biased towards mirrors for busy
filesystems.) I'm supposing that a block-level snapshot is not doable -- or
is it? (I've sketched the shape of what I mean at the bottom of this
message.)

Mark

On Dec 20, 2010, at 1:27 PM, Erik Trimble wrote:

> On 12/20/2010 9:20 AM, Saxon, Will wrote:
>>> -----Original Message-----
>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Edward Ned Harvey
>>> Sent: Monday, December 20, 2010 11:46 AM
>>> To: 'Lanky Doodle'; zfs-discuss@opensolaris.org
>>> Subject: Re: [zfs-discuss] A few questions
>>>
>>>> From: zfs-discuss-boun...@opensolaris.org [mailto:zfs-discuss-boun...@opensolaris.org] On Behalf Of Lanky Doodle
>>>>
>>>>> I believe Oracle is aware of the problem, but most of
>>>>> the core ZFS team has left. And of course, a fix for
>>>>> Oracle Solaris no longer means a fix for the rest of
>>>>> us.
>>>>
>>>> OK, that is a bit concerning then. As good as ZFS may be, I'm not sure
>>>> I want to commit to a file system that is 'broken' and may not be
>>>> fully fixed, if at all.
>>>
>>> ZFS is not "broken." It does, however, have a weak spot: resilver is
>>> very inefficient. For example:
>>>
>>> On my server, which is made up of 10krpm SATA drives, 1TB each... My
>>> drives can each sustain 1Gbit/sec sequential read/write. This means, if
>>> I needed to resilver the entire drive (in a mirror) sequentially, it
>>> would take ... 8,000 sec = 133 minutes. About 2 hours. In reality, I
>>> have ZFS mirrors, my disks are around 70% full, and resilver takes
>>> 12-14 hours.
>>>
>>> So although resilver is "broken" by some standards, it is bounded, and
>>> you can limit it to something which is survivable by using mirrors
>>> instead of raidz. For most people, even using 5-disk or 7-disk raidzN
>>> will still be fine. But it starts getting unsustainable if you go up to
>>> a 21-disk raidz3, for example.
>>
>> This argument keeps coming up on the list, but I don't see where anyone
>> has made a good suggestion about whether this can even be 'fixed' or how
>> it would be done.
>>
>> As I understand it, you have two basic types of array reconstruction: in
>> a mirror you can make a block-by-block copy, and that's easy; but in a
>> parity array you have to perform a calculation on the existing data
>> and/or existing parity to reconstruct the missing piece. This is pretty
>> easy when you can guarantee that all your stripes are the same width and
>> start/end on the same sectors/boundaries/whatever, and thus know that a
>> piece of each stripe lives on every drive in the set. I don't think this
>> is possible with ZFS, since we have variable stripe width. A failed disk
>> d may or may not contain data from stripe s (or transaction t), and this
>> has to be discovered by looking at the transaction records. Right?
>>
>> Can someone speculate as to how you could rebuild a variable-stripe-width
>> array without replaying all the available transactions? I am no
>> filesystem engineer, but I can't wrap my head around how this could be
>> handled any better than it already is.
>>
>> I've read that resilvering is throttled - presumably to keep performance
>> degradation to a minimum during the process - maybe this could be a
>> tunable (e.g. priority: low, normal, high)?
>>
>> Do we know if resilvers on a mirror are actually handled differently from
>> those on a raidz?
>>
>> Sorry if this has already been explained. I think this is an issue that
>> everyone who uses ZFS should understand completely before jumping in,
>> because the behavior (while not 'wrong') is clearly NOT the same as with
>> more conventional arrays.
>>
>> -Will
>
> The "problem" is NOT the checksum/error-correction overhead; that's
> relatively trivial. The problem isn't really even variable-width slabs
> (i.e. the variable number of disks a slab crosses).
>
> The problem boils down to this:
>
> When ZFS does a resilver, it walks the METADATA tree to determine what
> order to rebuild things in. That means it resilvers the very first slab
> ever written, then the next oldest, and so on. The problem here is that
> slab "age" has nothing to do with where the data physically resides on
> the actual disks. If you've used the zpool as a WORM device, then, sure,
> there should be a strict correlation between increasing slab age and
> locality on the disk. However, in any reasonable case, files get deleted
> regularly, which means that for a slab B written immediately after slab
> A, the odds are good that it WON'T be physically near slab A.
>
> In the end, the problem is that using metadata order, while reducing the
> total amount of work to do in the resilver (as you only resilver live
> data, not every bit on the drive), increases the physical inefficiency
> for each slab. That is, seek time between cylinders begins to dominate
> your slab reconstruction time. In RAIDZ, this problem is magnified both
> by the much larger average vdev size vs. mirrors, and by the requirement
> that every drive holding part of a slab return its data before the
> reconstructed data can be written to the resilvering drive.
>
> Thus, current ZFS resilvering tends to be seek-time limited, NOT
> throughput limited. This is really the "fault" of the underlying media,
> not ZFS. For instance, if you have a raidZ of SSDs (where seek time is
> negligible, but throughput isn't), they resilver really, really fast. In
> fact, they resilver at the maximum write throughput rate. However, HDs
> are severely seek-limited, so that dominates HD resilver time.
>
> The "answer" isn't simple, as the problem is media-specific.
>
> --
> Erik Trimble
> Java System Support
> Mailstop: usca22-123
> Phone: x17195
> Santa Clara, CA
> Timezone: US/Pacific (GMT-0800)
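To put rough numbers on the seek-time argument above -- this is only a
back-of-envelope model, and the block size, average seek time, and fill
level are assumptions I've picked for illustration, not measurements of
anyone's pool:

# Back-of-envelope resilver model (Python). All parameters are assumed.
CAPACITY_BYTES  = 1 * 10**12       # 1 TB drive
SEQ_BW_BYTES_S  = 125 * 10**6      # ~1 Gbit/sec sequential throughput
FULL_FRACTION   = 0.70             # pool roughly 70% full
AVG_BLOCK_BYTES = 128 * 1024       # assume ~128K average live block
AVG_SEEK_SEC    = 0.008            # assume ~8 ms average seek + rotation

# Sequential whole-disk copy (ignores what is live):
seq_h = CAPACITY_BYTES / SEQ_BW_BYTES_S / 3600
print(f"sequential copy:        {seq_h:.1f} h")   # ~2.2 h (the 8,000 sec above)

# Metadata-order rebuild: only live data is copied, but assume most
# blocks cost an average seek because temporal order != on-disk order:
live    = CAPACITY_BYTES * FULL_FRACTION
nblocks = live / AVG_BLOCK_BYTES
seek_h  = nblocks * AVG_SEEK_SEC / 3600
xfer_h  = live / SEQ_BW_BYTES_S / 3600
print(f"metadata-order rebuild: {seek_h + xfer_h:.1f} h")   # ~13 h, seek-bound

With those (assumed) numbers the sequential pass lands right at the ~2 hours
quoted above, and the metadata-order pass lands in the 12-14 hour range that
was actually observed -- consistent with seeks, not bandwidth, being the
limit.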
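And, for what it's worth, here is the shape of the two-pass scheme I was
hand-waving about at the top. Everything in it (the FakeDisk class and its
snapshot/dirty-block bookkeeping) is made up purely for illustration -- ZFS
exposes no block-level interface like this -- but it shows the
sequential-first, fix-up-second idea:

# Toy model (Python) of a "snapshot, then block copy" mirror rebuild.
# FakeDisk and its snapshot/dirty bookkeeping are hypothetical.
class FakeDisk:
    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks
        self.dirty = set()            # blocks written since last snapshot

    def write(self, i, data):
        self.blocks[i] = data
        self.dirty.add(i)

    def snapshot(self):
        self.dirty.clear()
        return list(self.blocks)      # frozen point-in-time image

def resilver(source, target, writes_during_copy=()):
    snap = source.snapshot()
    # Pass 1: purely sequential block-for-block copy of the frozen image.
    for i, data in enumerate(snap):
        target.blocks[i] = data
        for j, d in writes_during_copy:
            if j == i:                # simulate writes landing mid-copy
                source.write(j, d)
    # Pass 2: random-access fix-up, proportional to the churn during
    # pass 1, not to the size of the disk. (A real implementation would
    # iterate, or briefly quiesce writes, to converge on 100% live data.)
    for i in sorted(source.dirty):
        target.blocks[i] = source.blocks[i]

src, dst = FakeDisk(100), FakeDisk(100)
src.write(10, b"old data")
resilver(src, dst, writes_during_copy=[(50, b"written mid-copy")])
assert dst.blocks == src.blocks       # the clone is 100% live after pass 2

Whether anything like this could be grafted onto real ZFS mirrors I have no
idea -- that was the question.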