On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:
>   While waiting for that resilver to complete last week,
> I caught myself wondering how the resilvers (are supposed
> to) work in ZFS?

The devil finds work for idle hands... :-)

>   Based on what I see in practice and read in this list
> and some blogs, I've built a picture and would be grateful
> if some experts actually familiar with code and architecture
> would say how far off I guessed from the truth ;)

Well, I'm not that - certainly not on the code side.  It would probably
be best (for both of us) to spend the idle time looking at the code
before spending too much of it on speculation. Nonetheless, let's have
at it! :)

>   Ultimately I wonder if there are possible optimizations
> to make the scrub process more resembling a sequential
> drive-cloning (bandwidth/throughput-bound), than an
> IOPS-bound random seek thrashing for hours that we
> often see now, at least on (over?)saturated pools.

The tradeoff will be code complexity and resulting fragility. Choose
wisely what you wish for.

> This may possibly improve zfs send speeds as well.

Less likely; zfs send is pretty much always going to have to go in
TXG order.

>   First of all, I state (and ask to confirm): I think
> resilvers are a subset of scrubs, in that:
> 1) resilvers are limited to a particular top-level VDEV
> (and its number is a component of each block's DVA address)
> and
> 2) when scrub finds a block mismatching its known checksum,
> scrub reallocates the whole block anew using the recovered
> known-valid data - in essence it is a newly written block
> with a new path in BP tree and so on; a resilver expects
> to have a disk full of known-missing pieces of blocks,
> and reconstructed pieces are written on the resilvering
> disk "in-place" at an address dictated by the known DVA -
> this allows to not rewrite the other disks and BP tree
> as COW would otherwise require.

No. Scrub (and any other repair, such as for errors found in the
course of normal reads) rewrites the reconstructed block in-place: to
the original DVA referenced by its parents in the BP tree, even if
the device underneath that DVA is actually a new disk.

There is no COW. This is not a rewrite, and there is no original data
to preserve; this is a repair: making the disk sector contain what the
rest of the filesystem tree 'expects' it to contain. More specifically,
making it contain data that checksums to the value that block pointers
elsewhere say it should, via reconstruction from redundant information
(the same DVA on a mirror or RAID-Z reconstruction, or ditto blocks at
different DVAs found in the parent BP for copies>1, including metadata).

BTW, if a new BP tree were required to repair blocks, we'd have
bp-rewrite already (or we wouldn't have repair yet).
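
If it helps to make that concrete, here's a toy sketch in Python of the
shape of a repair. None of the names or structures here come from the
actual ZFS code - the "disk" is just a dict mapping DVA to bytes, and
crc32 stands in for the real checksums:

    import zlib

    disk = {}                                    # toy model: dva -> data

    def checksum(data):
        return zlib.crc32(data)                  # stand-in for fletcher/sha256

    def repair_block(parent_bp, bad_dva):
        # Find a redundant copy whose contents match the checksum that the
        # parent bp records, then write it back to the bad DVA in-place.
        for dva in parent_bp["dvas"]:
            if dva == bad_dva:
                continue
            data = disk.get(dva)
            if data is not None and checksum(data) == parent_bp["checksum"]:
                disk[bad_dva] = data             # same DVA, no COW, and no
                return True                      # change anywhere in the bp tree
        return False                             # not enough redundancy left

    # usage: a mirrored block where one side has rotted
    good = b"the data"
    bp = {"dvas": ["d0", "d1"], "checksum": checksum(good)}
    disk["d0"], disk["d1"] = b"garbage", good
    repair_block(bp, "d0")
    assert disk["d0"] == good

The point being: the parent bp already says what the data must checksum
to, so the fix is simply to put matching data back at the same DVA.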

>   Other than these points, resilvers and scrubs should
> work the same, perhaps with nuances like separate tunables
> for throttling and such - but generic algorithms should
> be nearly identical.
>
> Q1: Is this assessment true?

In a sense, yes, despite the correction above.  There is less
difference between these cases than you expected, so they are nearly
identical :-)

>   So I'll call them both a "scrub" below - it's shorter :)

Call them all repair.

The difference is not in how repair happens, but in how the need for a
given sector to be repaired is discovered.

Let's go over those, and clarify terminology, before going through the
rest of your post:

 * Normal reads: a device error or checksum failure triggers a
   repair. 

 * Scrub: Devices may be fine, but we want to verify that and fix any
   errors. In particular, we want to check all redundant copies.

 * Resilver: A device has been offline for a while, and needs to be
   'caught up', from its last known-good TXG to current.

 * Replace: A device has gone, and needs to be completely
   reconstructed.

Scrub is very similar to normal reads, apart from checking all copies
rather than serving the data from whichever copy successfully returns
first. Errors are not expected, but are counted and repaired as/if found.

Resilver and Replace are very similar, and the terms are often used
interchangeably. Replace is essentially a resilver with a starting TXG
of 0 (plus some labelling). In both cases, an error is expected or
assumed from the device in question, and repair is initiated
unconditionally (and without incrementing error counters).
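
In sketch form (made up, just to show the relationship; the
traverse_and_repair callable stands in for the bp-tree walk we'll get
to below):

    def resilver(vdev, last_good_txg, current_txg, traverse_and_repair):
        # The device missed the window (last_good_txg, current_txg];
        # repair is unconditional and does not bump error counters.
        traverse_and_repair(vdev,
                            from_txg=last_good_txg,
                            to_txg=current_txg,
                            count_errors=False)

    def replace(vdev, current_txg, traverse_and_repair):
        # A brand-new disk knows nothing: resilver from TXG 0.
        resilver(vdev, 0, current_txg, traverse_and_repair)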

You're suggesting an asymmetry between Resilver and Replace to exploit
the possible speedup of sequential access; OK, it seems attractive at
first blush, so let's explore the idea.

>   Now, as everybody knows, at least by word-of-mouth on
> this list, the scrub tends to be slow on pools with a rich
> life (many updates and deletions, causing fragmentation,
> with "old" and "young" blocks intermixed on disk), more
> so if the pools are quite full (over about 80% for some
> reporters). This slowness (on non-SSD disks with non-zero
> seek latency) is attributed to several reasons I've seen
> stated and/or thought up while pondering. The reasons may
> include statements like:
>
> 1) "Scrub goes on in TXG order".

Yes, it does, approximately. More below.

> If it is indeed so - the system must find older blocks,
> then newer ones, and so on. IF the block-pointer tree
> starting from uberblock is the only reference to the
> entirety of the on-disk blocks (unlike say DDT)

(aside: it is. The DDT is not special in this sense, because to find
the DDT you have to follow the bp tree too.)

> then
> this tree would have to be read into memory and sorted
> by TXG age and then processed.
>
> From my system's failures I know that this tree would
> take about 30Gb on my home-NAS box with 8Gb RAM, and
> the kernel crashes the machine by depleting RAM and
> not going into swap after certain operations (i.e.
> large deletes on datasets with enabled deduplication).
> That was discussed last year by me, and recently by
> other posters.
>
> Since the scrub does not do that and does not even
> press on RAM in a fatal manner, I think this "reason"
> is wrong.

Well, your observations and analysis of what scrub is *not* doing are
correct and sound... :-)

> I also fail to see why one would do that
> processing ordering in the first place - on a fairly
> fragmented system even the blocks from "newer" TXGs
> do not necessarily follow those from the "previous"
> ones.

You're thinking too much about the on-disk ordering of sector
numbers.  Understandable, since you're trying to find a way to do
sequential repair.  

For now, let's just say that going in TXG order is the easiest way to
iterate over the disk and be sure to get all live data, without doing
other complicated and memory/IO-intensive sorts. Again, we'll come back
to this.  

> What this rumour could reflect, however, is that a scrub
> (or more importantly, a resilver) are indeed limited by
> the "interesting" range of TXGs, such as picking only
> those blocks which were written between the last TXG that
> a lost-and-reconnected disk knew of (known to the system
> via that disk's stale uberblock), and the current TXG
> at the moment of its reconnection. Newer writes would
> probably land onto all disks anyway, so a resilver has
> only to find and fix those missing TXG numbers.

Yes, for resilver this is spot on, as above.

> In my problematic system however I only saw full resilvers
> even after they restarted numerously... This may actually
> support the idea that scrubs are NOT txg-ordered, otherwise
> a regularly updated tabkeeping attribute on the disk (in
> uberblock?) would note that some TXGs are known to fully
> exist on the resilvering drive - and this is not happening.

Now you have two problems:

 * confusing scrub (as a way of checking and possibly triggering
   repair) with resilver (known need to repair).
 * older code: newer code has better bookkeeping, at least for scrub,
   which allows it to resume (after, say, a reboot) from where it left
   off.  I'm not sure about resilver here, though (and note the
   complexity of the 'newer writes land on all disks anyway'
   optimisation mentioned above).

> 2) "Scrub walks the block-pointer tree".

Yes, it does. It's essentially the same as the previous point, though:
scrub walks the bp tree in TXG order.

> That seems like a viable reason for lots of random reads
> (hitting the IOPS barrier). 

Yep.  We're getting closer to the real reason here, but let's play it
out in full as we go.

> It does not directly explain
> the reports I think I've seen about L2ARC improving scrub
> speeds and system responsiveness - although extra caching
> takes the repetitive load off the HDDs and leaves them
> some more timeslices to participate in scrubbing (and
> *that* should incur reads from disks, not caches).

If L2ARC indeed helps, it will surely be mostly a matter of improving
responsiveness for other reads and freeing the disks up to do scrub I/O.

> On an active system, block pointer entries are relatively
> short-lived, with whole branches of a tree being updated
> and written in a new location upon every file update.
> This image is bound to look like good cheese after a while
> even if the writes were initially coalesced into few IOs.

You might be surprised; you probably have more long-lived data than
you think, especially with snapshots in place.  The full metadata
bp tree path to that old data is also retained.

Note also the corollary: whenever data is COW'd, the full metadata
path is also COW'd (possibly rolled up together with other updates in
the same TXG).  What that means is that, to read data for a new TXG as
you progress in a resilver, replace or scrub, you have to read all new
metadata. 

> 3) "If there are N top-level VDEVs in a pool, then only
> the one with the resilvering disk would be hit for
> performance" - not quite true, because pieces of the
> BPtree are spread across all VDEVs. The one resilvering
> would get the most bulk traffic, when DVAs residing on
> it are found and userdata blocks get transferred, but
> random read seeks caused by the resilvering process
> should happen all over the pool.

I'm not sure exactly what this one means, but I think the quoted claim
is mostly false, for the reason you state.

Whether resilvering or replacing, the disk in question mostly gets
writes - and cacheable writes at that - from this activity. A
resilvering disk especially might also see reads from other concurrent
activity.

The IOPS limitation is for seeks necessary to satisfy reads, mostly
from other disks, to provide data for reconstruction.  As noted above,
if a disk is being resilvered for TXG n, it won't have any of the
metadata for that TXG either, so won't really be servicing any reads.

> Q2: Am I correct with the interpretation of statements 1-3?

Not quite, as discussed above.

Let's go over the scrub case in detail (resilver being a
TXG-window-limited variant, with both resilver and replace enabling
different error-reporting logic).

 * Every meta/data block in the disk was written in a given TXG.
 * Every meta/data block is reachable by a path through the bp tree,
   from the root at the close of that TXG, down through however many
   indirect levels are needed. 
 * For every later TXG while the data remains current, the new root
   and top few nodes in the tree will change (due to other writes), but
   those upper nodes will refer to the same subtree below the point of
   divergence caused by those later writes. In other words, each TXG
   assembles a bp tree from a new root, and reuses subtrees from the
   previous TXG where no changes have been made.
 * Snapshots are simply additional references to old (filesystem) root
   bp's, as a way to keep that subtree live.

And here's the kicker for any attempt at LBA-sequential repair:

 * The checksum for a given block, that allows it to be verified, is
   stored in the bp that refers to it.

So, if reading blocks sequentially, you can't verify them. You don't
know what their checksums are supposed to be, or even what they
contain or where to look for the checksums, even if you were prepared
to seek to find them.  This is why scrub walks the bp tree.
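
As a mental model, roughly (these are just the fields that matter for
this discussion, not the on-disk blkptr_t layout):

    # Simplified mental model of a block pointer.
    class BlockPointer:
        def __init__(self, dvas, checksum, birth_txg, level):
            self.dvas = dvas            # where the child block lives
                                        #   (up to 3 ditto copies)
            self.checksum = checksum    # checksum of the CHILD it points to
            self.birth_txg = birth_txg  # TXG in which that child was born
            self.level = level          # >0: the child holds more block pointers

    # A data block on disk does not carry its own expected checksum; only
    # the bp in its parent does.  Read sectors in LBA order and you hold
    # bytes you cannot verify until you find the parent that points to them.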

When doing a scrub, you start at the root bp and walk the tree, doing
reads for everything, verifying checksums, and letting repair happen
for any errors. That traversal is either a breadth-first or
depth-first traversal of the tree (I'm not sure which) done in TXG
order.  

When you're done with that bp tree, the pool has almost certainly
moved on with new TXGs. Get the new root bp, and do the traversal
again. This time, any bp with a birth time equal to or older than the
TXG you previously finished has already been verified, including the
entire subtree below it, and so can be skipped.  This is why scrub
walks in TXG order.  It's also why the disk access is in 'approximate
TXG order', as you'll sometimes see the more pedantic commenters state.
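
Put together, a pass looks roughly like this (a toy sketch, not the
real traversal code, which lives in the scan/traverse machinery and
differs in plenty of detail; 'verify' stands in for read-all-copies,
check checksums, repair on mismatch):

    class Node:
        def __init__(self, birth_txg, children=()):
            self.birth_txg = birth_txg      # when this block was written
            self.children = list(children)  # bp's held in this block, if any

    def scrub_pass(root, verified_up_to_txg, verify):
        # Walk one bp tree; skip any subtree whose root was born at or
        # before the TXG already covered by the previous pass.
        if root.birth_txg <= verified_up_to_txg:
            return                          # unchanged subtree: skip entirely
        verify(root)
        for child in root.children:
            scrub_pass(child, verified_up_to_txg, verify)

    # usage: the first pass covers everything; a later pass only touches
    # blocks born after TXG 100
    tree = Node(120, [Node(90, [Node(80)]), Node(110, [Node(105)])])
    touched = []
    scrub_pass(tree, 0, touched.append)     # visits all 5 nodes
    touched.clear()
    scrub_pass(tree, 100, touched.append)   # visits only the 3 born after 100

The subtree skip is safe because COW guarantees a parent is never older
than its children - the corollary noted above.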

Note that there can be a lot of fanout in the tree; don't make the
mistake of thinking that the directories and filesystems you see are
the tree in question - the ZPL is a layer built on top of the
underlying object store (DMU objects, with ZAP objects for directories).

> IDEA1
>
> One optimization that could take place here would be to
> store some of the BPs' ditto copies in compact locations
> on disk (not all over it evenly), albeit maybe hurting
> the write performance. This way a resilver run, or even
> a scrub or zfs send, might be like a vdev-prefetch - a
> scooping read of several megabytes worth of blockpointers
> (this would especially help if the whole tree would fit
> in RAM/L2ARC/swap), then sorting out the tree or its major
> branches. The benefit would be little mechanical seeking
> for lots of BP data. This might possibly require us to
> invalidate the freed BP slots somehow as well :\
>
> In case of scrubs, where we would have to read in all of
> the allocated blocks from the media to test it, this would
> let us schedule a sequential read of the drives userdata
> while making sense of the sectors we find (as particular
> zfs blocks).
>
> In case of resilvering - this would let us find DVAs of
> blocks in the interesting TLVDEV and in the TXG range and
> also schedule huge sequential reads instead of random
> seeking.
>
> In case of zfs send, this would help us pick out the
> TXG-limited ranges of the blocks for a dataset, and
> again schedule the sequential reads for userdata (if any).
>
> Q3: Does the IDEA above make sense - storing BP entries
> (one of the ditto blocks) in some common location on disk,
> so as to minimize mechanical seeks while reading much of
> the BP tree?

It's not going to help a scrub, since that reads all of the ditto
block copies, so bunching just one copy isn't useful. It might
potentially help metadata-heavy activities that don't touch the data,
like find(1), at the expense of several other issues, at least some of
which you note.

That said, there are always opportunities for tweaks and improvements
to the allocation policy, or even for multiple allocation policies
each more suited/tuned to specific workloads if known in advance.

> IDEA2
>
> It seems possible to enable defragmentation of the BP tree
> (those ditto copies that are stored together) by just
> relocating the valid ones in correct order onto a free
> metaslab.

"Just" is such a four-letter word.

If you move a bp, you change its DVA. Which means that the parent bp
pointing to it needs to be updated and rewritten, and its parents as
well. This is new, COW data, with a new TXG attached -- but referring
to data that is old and has not been changed. You just broke snapshots
and scrub, at least.

This is bp rewrite, or rather, why bp-rewrite is hard.
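
A sketch of the cascade, if it helps (made-up names again, checksum
recomputation elided):

    def relocate(child_bp, new_dva, ancestors, new_txg):
        # ancestors = [parent of child_bp, grandparent, ..., root]
        child_bp["dva"] = new_dva          # the pointer to the moved data changes,
        for bp in ancestors:               # so the block holding it changes, so
            bp["checksum"] = "recomputed"  # its parent's recorded checksum changes,
            bp["birth_txg"] = new_txg      # and every ancestor gets rewritten COW
                                           # with a new birth TXG, even though the
                                           # user data itself never changed.
        # Meanwhile any snapshot still references the OLD root, whose subtree
        # still points at the old DVA; free that location and the snapshot is
        # broken.  That, roughly, is why bp rewrite is hard.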

> It seems that ZFS keeps some free space for
> passive defrag purposes anyway - why not use it actively?
> Live migration of blocks like this seems to be available
> with scrub's repair of the mismatching blocks. 

This gets back to the misunderstanding (way) above.  Repair is not
COW; repair is restoring the disk block to its original, correct
contents.

> However,
> here some care should be taken to take into account that
> the parent blockpointers would also need to be reallocated
> since the childrens' checksums would change - so the whole
> tree/branch of reallocations would have to be planned and
> written out in sequential order onto the spare free space.

And more complexities, since you want this done on a live pool. 

> Overall, if my understanding somewhat resembles how things
> really are, these ideas may help create and maintain such
> layout of metadata that it can be bulk-read, which is IMHO
> critical for many operations as well as to shorted recovery
> windows when resilvering disks.
>
> Q4: I wonder if similar (equivalent) solutions are already
> in place and did not help much? ;)

At least scrub does more book-keeping in more recent code and will
avoid restarts and rework.

I would like to see a replace variant that signals that at least some
of the data on the disk may already be valid, so it could potentially
be used in reconstruction when multiple disks have errors.

--
Dan.
