On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:

> While waiting for that resilver to complete last week, I caught myself wondering how the resilvers (are supposed to) work in ZFS?
The devil finds work for idle hands... :-)

> Based on what I see in practice and read in this list and some blogs, I've built a picture and would be grateful if some experts actually familiar with code and architecture would say how far off I guessed from the truth ;)

Well, I'm not that - certainly not on the code. It would probably be best (for both of us) to spend idle time looking at the code, before spending too much on speculation. Nonetheless, let's have at it! :)

> Ultimately I wonder if there are possible optimizations to make the scrub process more resembling a sequential drive-cloning (bandwidth/throughput-bound), than an IOPS-bound random seek thrashing for hours that we often see now, at least on (over?)saturated pools.

The tradeoff will be code complexity and resulting fragility. Choose wisely what you wish for.

> This may possibly improve zfs send speeds as well.

Less likely; that's pretty much always going to have to go in txg order.

> First of all, I state (and ask to confirm): I think resilvers are a subset of scrubs, in that:
> 1) resilvers are limited to a particular top-level VDEV (and its number is a component of each block's DVA address) and
> 2) when scrub finds a block mismatching its known checksum, scrub reallocates the whole block anew using the recovered known-valid data - in essence it is a newly written block with a new path in BP tree and so on; a resilver expects to have a disk full of known-missing pieces of blocks, and reconstructed pieces are written on the resilvering disk "in-place" at an address dictated by the known DVA - this allows to not rewrite the other disks and BP tree as COW would otherwise require.

No. Scrub (and any other repair, such as for errors found in the course of normal reads) rewrites the reconstructed blocks in-place: to the original DVA as referenced by its parents in the BP tree, even if the device underneath that DVA is actually a new disk. There is no COW. This is not a rewrite, and there is no original data to preserve; this is a repair: making the disk sector contain what the rest of the filesystem tree 'expects' it to contain. More specifically, making it contain data that checksums to the value that block pointers elsewhere say it should, via reconstruction using redundant information (same DVA on a mirror/RAIDZ reconstruction, or ditto blocks at different DVAs found in the parent BP for copies>1, including metadata).

BTW, if a new BP tree were required to repair blocks, we'd have bp-rewrite already (or we wouldn't have repair yet).

> Other than these points, resilvers and scrubs should work the same, perhaps with nuances like separate tunables for throttling and such - but generic algorithms should be nearly identical.
>
> Q1: Is this assessment true?

In a sense, yes, despite the correction above. There is less difference between these cases than you expected, so they are nearly identical :-)

> So I'll call them both a "scrub" below - it's shorter :)

Call them all repair. The difference is not in how repair happens, but in how the need for a given sector to be repaired is discovered. Let's go over those, and clarify terminology, before going through the rest of your post:

* Normal reads: a device error or checksum failure triggers a repair.
* Scrub: Devices may be fine, but we want to verify that and fix any errors. In particular, we want to check all redundant copies.
* Resilver: A device has been offline for a while, and needs to be 'caught up', from its last known-good TXG to current.
* Replace: A device has gone, and needs to be completely reconstructed.

Scrub is very similar to normal reads, apart from checking all copies rather than serving the data from whichever copy successfully returns first. Errors are not expected; they are counted and repaired as/if found.

Resilver and Replace are very similar, and the terms are often used interchangeably. Replace is essentially resilver with a starting TXG of 0 (plus some labelling). In both cases, an error is expected or assumed from the device in question, and repair is initiated unconditionally (and without incrementing error counters).

You're suggesting an asymmetry between Resilver and Replace to exploit the possible speedup of sequential access; OK, it seems attractive at first blush, so let's explore the idea.

> Now, as everybody knows, at least by word-of-mouth on this list, the scrub tends to be slow on pools with a rich life (many updates and deletions, causing fragmentation, with "old" and "young" blocks intermixed on disk), more so if the pools are quite full (over about 80% for some reporters). This slowness (on non-SSD disks with non-zero seek latency) is attributed to several reasons I've seen stated and/or thought up while pondering. The reasons may include statements like:
>
> 1) "Scrub goes on in TXG order".

Yes, it does, approximately. More below.

> If it is indeed so - the system must find older blocks, then newer ones, and so on. IF the block-pointer tree starting from uberblock is the only reference to the entirety of the on-disk blocks (unlike say DDT)

(aside: it is. The DDT is not special in this sense, because to find the DDT you have to follow the bp tree too.)

> then this tree would have to be read into memory and sorted by TXG age and then processed.
>
> From my system's failures I know that this tree would take about 30Gb on my home-NAS box with 8Gb RAM, and the kernel crashes the machine by depleting RAM and not going into swap after certain operations (i.e. large deletes on datasets with enabled deduplication). That was discussed last year by me, and recently by other posters.
>
> Since the scrub does not do that and does not even press on RAM in a fatal manner, I think this "reason" is wrong.

Well, your observations and analysis of what scrub is not doing are correct and sound... :-)

> I also fail to see why one would do that processing ordering in the first place - on a fairly fragmented system even the blocks from "newer" TXGs do not necessarily follow those from the "previous" ones.

You're thinking too much about the on-disk ordering of sector numbers. Understandable, since you're trying to find a way to do sequential repair. For now, let's just say that going in TXG order is the easiest way to iterate over the disk and be sure to get all live data, without doing other complicated and memory/IO-intensive sorts. Again, we'll come back to this.

> What this rumour could reflect, however, is that a scrub (or more importantly, a resilver) are indeed limited by the "interesting" range of TXGs, such as picking only those blocks which were written between the last TXG that a lost-and-reconnected disk knew of (known to the system via that disk's stale uberblock), and the current TXG at the moment of its reconnection. Newer writes would probably land onto all disks anyway, so a resilver has only to find and fix those missing TXG numbers.

Yes, for resilver this is spot on, as above.
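To pin down how little actually differs between the cases, here's a rough sketch in Python of the common repair path and the per-case trigger logic. This is purely illustrative - every name in it (bp.birth_txg, pool.reconstruct_from_redundancy(), and so on) is invented for the sketch, not taken from the ZFS source:

```python
# Illustrative sketch only -- none of these names come from the real ZFS code.
import hashlib

def checksum(data):
    # Stand-in for whatever checksum algorithm the parent block pointer records.
    return hashlib.sha256(data).digest()

def needs_repair_for_resilver(bp, last_good_txg):
    # Resilver: only blocks born after the device's last known-good TXG can be
    # missing from it.  Replace is the same test with last_good_txg == 0,
    # i.e. reconstruct everything.
    return bp.birth_txg > last_good_txg

def repair_in_place(pool, bp, bad_dva):
    # The common repair path shared by normal reads, scrub, resilver and
    # replace: rebuild what the parent bp says should live at bad_dva (from
    # the mirror twin, RAIDZ parity, or a ditto copy) and write it straight
    # back to that same DVA.  No COW, no new block pointer, no new TXG.
    data = pool.reconstruct_from_redundancy(bp, exclude=bad_dva)
    assert checksum(data) == bp.checksum   # verified against the parent bp
    pool.write_physical(bad_dva, data)

def scrub_read(pool, bp):
    # Scrub reads *every* copy, counts errors, and repairs any that fail.
    for dva in bp.dvas:
        if checksum(pool.read_physical(dva)) != bp.checksum:
            pool.count_error(dva)
            repair_in_place(pool, bp, dva)
```

In this picture, normal reads call repair_in_place() only when a read or checksum actually fails, scrub calls it for whichever copies fail verification, and resilver/replace call it unconditionally (no error counters) for every block that passes the TXG-window test.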
> In my problematic system however I only saw full resilvers even after they restarted numerously... This may actually support the idea that scrubs are NOT txg-ordered, otherwise a regularly updated tabkeeping attribute on the disk (in uberblock?) would note that some TXGs are known to fully exist on the resilvering drive - and this is not happening.

Now you have two problems:

* confusing scrub (as a way of checking and possibly triggering repair) with resilver (known need to repair).
* older code: in newer code there is better bookkeeping, at least for scrub, that allows a resume (after, say, a reboot) from where it left off. I'm not sure about resilver here, though (and note the complexity with the optimisation of 'new writes' past the offline window, above).

> 2) "Scrub walks the block-pointer tree".

Yes, it does. It's essentially the same as the previous point, though: scrub walks the bp tree in txg order.

> That seems like a viable reason for lots of random reads (hitting the IOPS barrier).

Yep. We're getting closer to the real reason here, but let's play it out in full as we go.

> It does not directly explain the reports I think I've seen about L2ARC improving scrub speeds and system responsiveness - although extra caching takes the repetitive load off the HDDs and leaves them some more timeslices to participate in scrubbing (and *that* should incur reads from disks, not caches).

If L2ARC indeed helps, it will surely be mostly to do with improving responsiveness on other reads and freeing up the disks to do scrubs.

> On an active system, block pointer entries are relatively short-lived, with whole branches of a tree being updated and written in a new location upon every file update. This image is bound to look like good cheese after a while even if the writes were initially coalesced into few IOs.

You might be surprised; you probably have more long-lived data than you thought, especially with snapshots in place. The full metadata bp tree path to that old data is also retained.

Note also the corollary: whenever data is COW'd, the full metadata path is also COW'd (possibly rolled up together with other updates in the same TXG). What that means is that, to read data for a new TXG as you progress in a resilver, replace or scrub, you have to read all new metadata.

> 3) "If there are N top-level VDEVs in a pool, then only the one with the resilvering disk would be hit for performance" - not quite true, because pieces of the BPtree are spread across all VDEVs. The one resilvering would get the most bulk traffic, when DVAs residing on it are found and userdata blocks get transferred, but random read seeks caused by the resilvering process should happen all over the pool.

I'm not sure what this one means, and I think it's mostly false for the reason you state. Either resilvering or replacing, the disk being rebuilt is mostly getting writes - and cacheable writes at that - from this activity. For resilver especially, it might see reads for other concurrent activity. The IOPS limitation is for the seeks necessary to satisfy reads, mostly from other disks, to provide data for reconstruction. As noted above, if a disk is being resilvered for TXG n, it won't have any of the metadata for that TXG either, so won't really be servicing any reads.

> Q2: Am I correct with the interpretation of statements 1-3?

Not quite, as discussed above.
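(If it helps to picture that corollary, here's a toy path-copying sketch - again just an illustration with made-up structures, nothing like the real DMU code:)

```python
# Toy copy-on-write path copying -- made-up structures, not ZFS code.
from dataclasses import dataclass, field

@dataclass
class Node:
    birth_txg: int
    children: dict = field(default_factory=dict)   # name -> Node
    data: bytes = b""

def cow_write(root, path, data, txg):
    # Writing one leaf allocates a fresh copy of every node on the path from
    # the root down to that leaf, all stamped with the new TXG; subtrees that
    # were not touched are shared, unchanged, with the previous tree.
    new_root = Node(birth_txg=txg, children=dict(root.children))
    node = new_root
    for name in path[:-1]:
        old_child = node.children[name]
        new_child = Node(birth_txg=txg, children=dict(old_child.children))
        node.children[name] = new_child
        node = new_child
    node.children[path[-1]] = Node(birth_txg=txg, data=data)
    return new_root   # a snapshot holding the old root still sees the old subtrees
```

Everything on the path down to the new leaf carries the new birth TXG, while untouched subtrees keep their old one - which is exactly what the walk described next keys off.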
Let's go over the scrub case in detail (resilver being a txg window-limited variant, and both resilver and replace enabling different error reporting logic).

* Every meta/data block on the disk was written in a given TXG.
* Every meta/data block is reachable by a path through the bp tree, from the root at the close of that TXG, down through however many indirect levels are needed.
* For every later TXG while the data remains current, the new root and top few nodes in the tree will change (due to other writes), but those upper nodes will refer to the same subtree below the point of divergence caused by those later writes. In other words, each TXG assembles a bp tree from a new root, and reuses subtrees from the previous TXG where no changes have been made.
* Snapshots are simply additional references to old (filesystem) root bp's, as a way to keep that subtree live.

And here's the kicker for any attempt at LBA-sequential repair:

* The checksum for a given block, which allows it to be verified, is stored in the bp that refers to it.

So, if reading blocks sequentially, you can't verify them. You don't know what their checksums are supposed to be, or even what they contain or where to look for the checksums, even if you were prepared to seek to find them. This is why scrub walks the bp tree.

When doing a scrub, you start at the root bp and walk the tree, doing reads for everything, verifying checksums, and letting repair happen for any errors. That traversal is either a breadth-first or depth-first traversal of the tree (I'm not sure which), done in TXG order.

When you're done with that bp tree, the pool has almost certainly moved on with new TXGs. Get the new root bp, and do the traversal again. This time, any bp with a birth time equal to or older than the TXG you previously finished has already been verified, including the entire subtree below it, and so can be skipped.

This is why scrub walks in TXG order. It's also why the disk access is in 'approximate TXG order', as you'll sometimes see the more pedantic commenters state.

Note that there can be a lot of fanout in the tree; don't make the mistake of thinking that the directories and filesystems you see are the tree in question; the ZPL is a layer on top of the ZAP object store.

> IDEA1
>
> One optimization that could take place here would be to store some of the BPs' ditto copies in compact locations on disk (not all over it evenly), albeit maybe hurting the write performance. This way a resilver run, or even a scrub or zfs send, might be like a vdev-prefetch - a scooping read of several megabytes worth of blockpointers (this would especially help if the whole tree would fit in RAM/L2ARC/swap), then sorting out the tree or its major branches. The benefit would be little mechanical seeking for lots of BP data. This might possibly require us to invalidate the freed BP slots somehow as well :\
>
> In case of scrubs, where we would have to read in all of the allocated blocks from the media to test it, this would let us schedule a sequential read of the drives userdata while making sense of the sectors we find (as particular zfs blocks).
>
> In case of resilvering - this would let us find DVAs of blocks in the interesting TLVDEV and in the TXG range and also schedule huge sequential reads instead of random seeking.
>
> In case of zfs send, this would help us pick out the TXG-limited ranges of the blocks for a dataset, and again schedule the sequential reads for userdata (if any).
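Before getting to Q3, here's the walk-and-skip logic I described above as a rough Python sketch, since that is what any IDEA along these lines has to beat. Again the names are invented (scrub_read() is the read-all-copies-and-repair helper from the earlier sketch), the recursion is depth-first purely for brevity, and the real scan code is considerably more involved:

```python
# Rough sketch of the scrub walk -- invented names, not the real scan code.
def scrub_pass(pool, root_bp, last_done_txg):
    # Walk the bp tree from this root, verifying (and repairing) everything
    # born after the TXG the previous pass finished with; a subtree whose
    # root bp is no newer than that was fully verified last time.
    def walk(bp):
        if bp.birth_txg <= last_done_txg:
            return                        # whole subtree already verified
        scrub_read(pool, bp)              # read every copy, repair mismatches
        for child in pool.read_indirect_children(bp):   # empty for data blocks
            walk(child)
    walk(root_bp)
    return root_bp.birth_txg              # next pass starts from here

def scrub(pool):
    done_txg = 0
    while True:
        root_bp = pool.current_root_bp()  # the pool keeps moving on under us
        if root_bp.birth_txg <= done_txg:
            return                        # nothing written since the last pass
        done_txg = scrub_pass(pool, root_bp, done_txg)
```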
> Q3: Does the IDEA above make sense - storing BP entries (one of the ditto blocks) in some common location on disk, so as to minimize mechanical seeks while reading much of the BP tree?

It's not going to help a scrub, since that reads all of the ditto block copies, so bunching just one copy isn't useful. It might potentially help metadata-heavy activities that don't touch the data, like find(1), at the expense of several other issues, at least some of which you note.

That said, there are always opportunities for tweaks and improvements to the allocation policy, or even for multiple allocation policies, each more suited/tuned to specific workloads if known in advance.

> IDEA2
>
> It seems possible to enable defragmentation of the BP tree (those ditto copies that are stored together) by just relocating the valid ones in correct order onto a free metaslab.

"Just" is such a four-letter word. If you move a bp, you change its DVA, which means that the parent bp pointing to it needs to be updated and rewritten, and its parents as well. This is new, COW data, with a new TXG attached -- but referring to data that is old and has not been changed. You just broke snapshots and scrub, at least. This is bp rewrite, or rather, why bp-rewrite is hard.

> It seems that ZFS keeps some free space for passive defrag purposes anyway - why not use it actively? Live migration of blocks like this seems to be available with scrub's repair of the mismatching blocks.

This gets back to the misunderstanding (way) above. Repair is not COW; repair is restoring the disk block to its original, correct contents.

> However, here some care should be taken to take into account that the parent blockpointers would also need to be reallocated since the childrens' checksums would change - so the whole tree/branch of reallocations would have to be planned and written out in sequential order onto the spare free space.

And more complexities, since you want this done on a live pool.

> Overall, if my understanding somewhat resembles how things really are, these ideas may help create and maintain such layout of metadata that it can be bulk-read, which is IMHO critical for many operations as well as to shorted recovery windows when resilvering disks.
>
> Q4: I wonder if similar (equivalent) solutions are already in place and did not help much? ;)

At least scrub does more book-keeping in more recent code and will avoid restarts and rework. I would like to see a replace variant that signals that at least some of the data on the disk may already be valid, so it could potentially be used in reconstruction when multiple disks have errors.

-- 
Dan.