Hello all,

While waiting for a resilver to complete last week, I caught myself wondering how resilvers (are supposed to) work in ZFS.
Based on what I see in practice and read on this list and in some blogs, I've built a picture of it, and I'd be grateful if some experts actually familiar with the code and architecture would say how far my guesses are from the truth ;)

Ultimately I wonder whether there are possible optimizations that could make the scrub process resemble a sequential drive-cloning (bandwidth/throughput-bound) rather than the IOPS-bound random-seek thrashing for hours that we often see now, at least on (over?)saturated pools. This might improve zfs send speeds as well.

First of all, I state (and ask you to confirm): I think resilvers are a subset of scrubs, in that:

1) a resilver is limited to a particular top-level VDEV (whose number is a component of each block's DVA address); and

2) when a scrub finds a block mismatching its known checksum, it reallocates the whole block anew using the recovered known-valid data - in essence a newly written block, with a new path in the BP tree and so on. A resilver, by contrast, expects a disk full of known-missing pieces of blocks, and the reconstructed pieces are written to the resilvering disk "in place" at the address dictated by the known DVA - this avoids rewriting the other disks and the BP tree, as COW would otherwise require.

Other than these points, resilvers and scrubs should work the same, perhaps with nuances like separate tunables for throttling and such - but the generic algorithm should be nearly identical.

Q1: Is this assessment true?

So I'll call them both a "scrub" below - it's shorter :)

Now, as everybody knows, at least by word of mouth on this list, scrubs tend to be slow on pools with a rich life (many updates and deletions, causing fragmentation, with "old" and "young" blocks intermixed on disk), more so if the pools are quite full (over about 80% for some reporters). This slowness (on non-SSD disks with non-zero seek latency) is attributed to several reasons I've seen stated and/or thought up while pondering.
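To make point 2 of my guess concrete, here's a toy sketch of the distinction as I imagine it - pure illustration in Python with made-up names, not actual ZFS code, and of course only as correct as my guess is:

```python
import hashlib
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class BlockPointer:
    vdev: int        # top-level vdev number (a component of the DVA)
    offset: int      # allocation offset within that vdev
    birth_txg: int   # TXG in which the block was born
    checksum: bytes  # checksum the parent BP records for the data

def scrub_repair(bp, recovered_data, allocate, write_at):
    # My guess at scrub repair: the recovered data is reallocated as
    # a new block (COW); the DVA changes, so the parent BP recording
    # it - and every ancestor above - must be rewritten too.
    new_offset = allocate()
    write_at(bp.vdev, new_offset, recovered_data)
    return replace(bp, offset=new_offset)

def resilver_repair(bp, recovered_data, write_at):
    # My guess at resilver repair: the reconstructed piece goes back
    # "in place" at the address the known DVA already dictates, so
    # the other disks and the BP tree stay untouched.
    write_at(bp.vdev, bp.offset, recovered_data)
    return bp
```

The asymmetry is the whole point: one path cascades BP rewrites up the tree, the other leaves the tree alone.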
The reasons may include statements like:

1) "Scrub goes in TXG order." If that is indeed so, the system must find the older blocks, then the newer ones, and so on. If the block-pointer tree starting from the uberblock is the only reference to the entirety of the on-disk blocks (unlike, say, the DDT), then this tree would have to be read into memory, sorted by TXG age, and then processed. From my system's failures I know that this tree would take about 30 GB on my home-NAS box with 8 GB RAM, and the kernel crashes the machine by depleting RAM without going into swap after certain operations (e.g. large deletes on datasets with deduplication enabled). That was discussed last year by me, and recently by other posters. Since a scrub does not do that, and does not even press on RAM in a fatal manner, I think this "reason" is wrong. I also fail to see why one would do that ordering in the first place - on a fairly fragmented system, even blocks from "newer" TXGs do not necessarily follow those from "previous" ones on disk.

What this rumour could reflect, however, is that a scrub (or, more importantly, a resilver) is indeed limited to an "interesting" range of TXGs, such as picking only those blocks which were written between the last TXG that a lost-and-reconnected disk knew of (known to the system via that disk's stale uberblock) and the current TXG at the moment of its reconnection. Newer writes would probably land on all disks anyway, so a resilver only has to find and fix the blocks from those missing TXG numbers. On my problematic system, however, I only ever saw full resilvers, even after they restarted numerous times... This may actually support the idea that scrubs are NOT TXG-ordered; otherwise a regularly updated bookkeeping attribute on the disk (in the uberblock?) would note that some TXGs are known to fully exist on the resilvering drive - and this is not happening.

2) "Scrub walks the block-pointer tree." That seems like a viable reason for lots of random reads (hitting the IOPS barrier).
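If the TXG-range interpretation above is right, the resilver's per-block selection test would conceptually be as simple as this - again a made-up sketch of my mental model, not real code (`dvas` stands for the block's up-to-three ditto addresses):

```python
from dataclasses import dataclass

@dataclass
class DVA:
    vdev: int    # top-level vdev number
    offset: int  # offset within that vdev

@dataclass
class BlockPointer:
    birth_txg: int  # TXG in which the block was written
    dvas: list      # addresses of the ditto copies

def needs_resilver(bp, target_vdev, last_known_txg, reconnect_txg):
    # Only blocks born while the disk was away, with a copy on the
    # resilvering top-level vdev, need reconstruction; everything
    # else can be skipped without touching the media.
    born_while_away = last_known_txg < bp.birth_txg <= reconnect_txg
    on_target = any(d.vdev == target_vdev for d in bp.dvas)
    return born_while_away and on_target
```

A full resilver would then just be the degenerate case where `last_known_txg` is 0, which is what I kept observing.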
This does not directly explain the reports I think I've seen about L2ARC improving scrub speeds and system responsiveness - although the extra caching takes repetitive load off the HDDs and leaves them more timeslices to participate in the scrub (and *that* should incur reads from the disks, not the caches). On an active system, block-pointer entries are relatively short-lived, with whole branches of the tree being updated and written to new locations upon every file update. The on-disk image is bound to look like good cheese after a while, even if the writes were initially coalesced into few IOs.

3) "If there are N top-level VDEVs in a pool, then only the one with the resilvering disk takes the performance hit." Not quite true, because pieces of the BP tree are spread across all VDEVs. The resilvering one would get the most bulk traffic, as DVAs residing on it are found and userdata blocks get transferred, but the random read seeks caused by the resilvering process should happen all over the pool.

Q2: Am I correct in my interpretation of statements 1-3?

IDEA1

One optimization here would be to store some of the BPs' ditto copies in compact locations on disk (not spread evenly all over it), even at some cost to write performance. A resilver run, or even a scrub or zfs send, could then begin like a vdev prefetch - a scooping read of several megabytes' worth of block pointers (this would especially help if the whole tree fits in RAM/L2ARC/swap), followed by sorting out the tree or its major branches. The benefit would be little mechanical seeking for lots of BP data. This might also require us to somehow invalidate the freed BP slots :\

In the case of scrubs, where we have to read all the allocated blocks from the media to test them, this would let us schedule a sequential read of the drive's userdata while making sense of the sectors we find (as particular ZFS blocks).
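The payoff I'm after can be shown with a trivial sketch: once the wanted addresses are known up front (because the BPs were scooped in bulk), the per-block random reads collapse into a few sequential sweeps. Illustrative Python only; the `gap_limit` knob (how much unwanted data we'd tolerate reading to keep the head moving sequentially) is my invention:

```python
def coalesce(extents, gap_limit):
    # Sort the wanted (offset, size) extents and merge near-neighbours
    # into large sequential reads, instead of issuing one random seek
    # per block. Extents closer than gap_limit are read in one sweep.
    merged = []
    for off, size in sorted(extents):
        if merged and off <= merged[-1][0] + merged[-1][1] + gap_limit:
            last_off, last_size = merged[-1]
            merged[-1] = (last_off,
                          max(last_off + last_size, off + size) - last_off)
        else:
            merged.append((off, size))
    return merged
```

On a fragmented pool the win depends entirely on how clustered the wanted extents are - which is exactly why I want the BP copies stored compactly in the first place.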
In the case of resilvering, this would let us find the DVAs of blocks on the interesting TLVDEV and in the TXG range, and likewise schedule huge sequential reads instead of random seeking. In the case of zfs send, this would help us pick out the TXG-limited ranges of blocks for a dataset, and again schedule sequential reads for the userdata (if any).

Q3: Does the idea above make sense - storing BP entries (one of the ditto blocks) in some common location on disk, so as to minimize mechanical seeks while reading much of the BP tree?

IDEA2

It also seems possible to defragment the BP tree (those ditto copies that are stored together) by simply relocating the valid entries, in correct order, onto a free metaslab. ZFS seems to keep some free space for passive defrag purposes anyway - why not use it actively? Live migration of blocks like this seems to be available already, in scrub's repair of mismatching blocks. However, some care is needed here: the parent block pointers would also have to be reallocated, since the children's checksums would change - so the whole tree/branch of reallocations would have to be planned and written out, in sequential order, onto the spare free space.

Overall, if my understanding somewhat resembles how things really are, these ideas may help create and maintain a layout of metadata that can be bulk-read, which is IMHO critical for many operations, as well as for shortening recovery windows when resilvering disks.

Q4: I wonder if similar (equivalent) solutions are already in place and did not help much? ;)

Thanks,
//Jim

_______________________________________________
zfs-discuss mailing list
zfs-discuss@opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss