First of all, thank you, Daniel, for taking the time to post a lengthy reply! I do not get that kind of high-quality feedback very often :)
I hope the community and googlers will benefit from this conversation sometime. At the very least, I did straighten out some of my own thoughts and (mis-)understandings - more on that below :)

2012-05-18 15:30, Daniel Carosone wrote:
> On Fri, May 18, 2012 at 03:05:09AM +0400, Jim Klimov wrote:
>> While waiting for that resilver to complete last week,
>> I caught myself wondering how the resilvers (are supposed
>> to) work in ZFS?

> The devil finds work for idle hands... :-)

Or rather, brains ;)

> Well, I'm not that - certainly not on the code. It would probably be
> best (for both of us) to spend idle time looking at the code, before
> spending too much on speculation. Nonetheless, let's have at it! :)
...Yes, I should look at the code instead of posting speculation.
Good idea any day, but rather lengthy in time. I have looked at the code, at blogs, at mailing list archives, at the aged ZFS spec, for about a year on-and-off now, and as you can see - my understanding remains imperfect ;)

Besides, turning the specific C code, even with the good comments that are in place, into a narrative description like we did in this thread is bulky, time-consuming and likely useless (not conveyed) to other people wanting to understand the same things and perhaps hoping to contribute - even if only algorithmic ideas ;)

Finally, breaking my head over the existing code only, instead of sitting back and doing some educated thinking (speculation), *may* be useless if the current algorithms (or their implementation) work unsatisfactorily for at least the use-cases I see them applied to. Thus, as a n00b researcher, I might care a bit less about what exactly is wrong in a system that does not work (the way I want it to, at least), and a bit more about designing and planning = speculating = how (I think) it should work to suit my needs and usage patterns. In this regard the existing implementation may be seen as a POC which demonstrates what can be done, even if sub-optimally. It works somewhat, and since we see downsides - it might work better.

At the very least I can try to understand how it works now and why some particular choices and tradeoffs were made (perhaps we do use the lesser of evils indeed) - explained in higher-level concepts and natural-language words that correspondents like you or other ZFS experts (and authors) on this list can quickly confirm or deny, without wasting their precious time (no sarcasm) on lengthy posts like these, describing it all in detail. This is a useful experience and learning source, and different from what reading the code alone gives me.

Anyway, this "speculation" would be done by this n00b reader of the code implicitly, and with less (without any?) constructive discussion (thanks again for that!), if I were to dig into the code trying to fix something without planning ahead - and I know that often does not end very well.

Ultimately, I guess I got more understanding by spending a few hours formulating the right questions (and thankfully getting some answers) than by compiling all the disparate (and often outdated) docs, blogs and code into some form of a structure in my head. I also got to confirm that much of this compilation was correct, and to see which parts I had missed ;) Perhaps now I (or someone else) won't waste months inventing or implementing something senseless from the start, or will find ways to make a pluggable writing policy to test different allocators for different purposes, or something of that kind... - as you propose here:

> That said, there are always opportunities for tweaks and improvements
> to the allocation policy, or even for multiple allocation policies
> each more suited/tuned to specific workloads if known in advance.

Now, on to my ZFS questions and your selected responses:

>> This may possibly improve zfs send speeds as well.
>
> Less likely, that's pretty much always going to have to go in txg
> order.

Would that really be TXG order - i.e. send blocks from TXG(N), then send blocks from TXG(N+1), and so on; OR a BPtree walk of the selected branch (starting from the root of the snapshot dataset), perhaps limiting the range of chosen TXG numbers by the snapshot's creation and completion "TXG timestamps"?
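To make the second option concrete, here is a minimal sketch of what I imagine such a TXG-pruned depth-first walk to look like - this is my speculation, not the actual dmu_traverse.c code, and read_indirect(), nchildren() and emit_data_block() are hypothetical stand-ins for the real machinery:

/*
 * Sketch only: a depth-first block-pointer walk pruned by a minimum TXG,
 * roughly what I imagine an incremental send would do.
 */
#include <sys/spa.h>            /* blkptr_t, blk_birth */

extern blkptr_t *read_indirect(const blkptr_t *bp);     /* hypothetical */
extern int       nchildren(const blkptr_t *bp);         /* hypothetical */
extern void      emit_data_block(const blkptr_t *bp);   /* hypothetical */

typedef struct walk_ctx {
        uint64_t min_txg;   /* birth TXG of the "from" snapshot; 0 for a full send */
} walk_ctx_t;

static void
visit_bp(walk_ctx_t *ctx, const blkptr_t *bp, int level)
{
        /*
         * Nothing under a bp can be younger than the bp itself, so an
         * "old enough" bp lets us skip its whole subtree.  This is a
         * pruned tree walk, not a global sort of all blocks by TXG.
         */
        if (bp->blk_birth <= ctx->min_txg)
                return;

        if (level > 0) {
                /* Indirect block: read it and recurse into each child bp. */
                blkptr_t *kids = read_indirect(bp);
                for (int i = 0; i < nchildren(bp); i++)
                        visit_bp(ctx, &kids[i], level - 1);
        } else {
                /* Leaf level: a data block to be placed into the stream. */
                emit_data_block(bp);
        }
}

If something like this is closer to the truth, then "txg order" would really mean "TXG-pruned tree walk order" rather than a strictly monotonic TXG sequence in the stream - which is what I try to disentangle next.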
Essentially, I don't want to quote all those pieces of text, but I still doubt that the tree walks are done in TXG order - at least the way I understand it (which may differ from your or others' understanding): I interpreted "TXG order" as I said above - a monotonic incremental walk from older TXG numbers to newer ones. In order to do that you would have to have the whole tree in RAM and sort it by TXG (perhaps building an array of all defined TXGs with pointers to the individual block pointers born in each one), which is lengthy, heavy on RAM, and not something I think I see happening in real life.

If the statement means "when walking the tree, first walk the child branch with the lower TXG", then it makes some sense - but it is not strictly "TXG-ordered", I think. At the very least, the walk starts from the most recent TXG, being the uberblock (or pool-wide root block) ;) Such a walk would indeed reach the oldest TXGs in a particular branch first, but it starts from (and backtracks to) newer ones.

So, in order to benefit from sequential reads during a tree walk, the written blocks of the block-pointer tree (at least one copy of them) should be stored on disk in essentially the same order that a tree-walking reader expects to find them. Then a read request (with its associated vdev prefetch) would find large portions of the BP tree needed "now or in a few steps" in one mechanical IO...

> So, if reading blocks sequentially, you can't verify them. You don't
> know what their checksums are supposed to be, or even what they
> contain or where to look for the checksums, even if you were prepared
> to seek to find them. This is why scrub walks the bp tree.

...And perhaps, to take more advantage of this, the system should not descend into a single child BP and its branch right away, but rather check the rolling prefetch cache (after a read was satisfied by a mechanical IO) for more of the soon-to-be-needed blkptrs that are in RAM right now and should be relocated to the ARC/L2ARC before they roll out of the prefetch cache - even if the actual requests for them would only come after the subtree walk, perhaps in a few seconds or minutes. If the subtree is so big that these ARCed entries get pushed out by then, well, we did all we could to speed up the system for smaller branches and lost little time in the process. And cache misses could be logged so users know to upgrade their ARCs. (I sketch this idea a bit further below, after the quoted exchange.)

> No. Scrub (and any other repair, such as for errors found in the
> course of normal reads) rewrite the reconstructed blocks in-place: to
> the original DVA as referenced by its parents in the BP tree, even if
> the device underneath that DVA is actually a new disk.
> There is no COW. This is not a rewrite, and there is no original data
> to preserve...

Okay, thanks, I guess this simplifies things - although it somewhat defeats the BPtree defrag approach I proposed.

> BTW, if a new BP tree was required to repair blocks, we'd have
> bp-rewrite already (or we wouldn't have repair yet).

I'm not so sure. I've seen many small tasks discussed (and proposed) that could be done by a BP rewrite in general, but can also be done "elsehow". Taking as an example my (mis)understanding of scrub repairs: the recovered block data could simply be written into the pool like any other new data block, causing the rewriting of the BP tree branch leading to it. If that is not done (or required) here - well, so much the better, I guess.

> ...This is bp rewrite, or rather, why bp-rewrite is hard.
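Coming back to the prefetch idea I flagged a few paragraphs above, here is the rough sketch I promised. It reuses walk_ctx_t/visit_bp() from my earlier sketch; arc_promote_if_cached() is a purely hypothetical name - I know of no such ARC interface - standing for "if the vdev prefetch already has this block in RAM, promote it into the ARC without new disk IO":

/* Sketch only - hypothetical helpers, my speculation, not ZFS code. */
extern void arc_promote_if_cached(const blkptr_t *bp);  /* hypothetical */

static void
visit_indirect(walk_ctx_t *ctx, blkptr_t *kids, int nkids, int level)
{
        int i;

        /* Pass 1: salvage whatever the vdev prefetch already holds in RAM. */
        for (i = 0; i < nkids; i++) {
                if (kids[i].blk_birth > ctx->min_txg)
                        arc_promote_if_cached(&kids[i]);
        }

        /* Pass 2: the usual depth-first descent into each child. */
        for (i = 0; i < nkids; i++)
                visit_bp(ctx, &kids[i], level - 1);
}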
The generic BP rewrite should also handle things like reduction of VDEV sizes, removal of TLVDEVs, changes to TLVDEV layouts (i.e. migration between raidz levels) and so on. That is likely hard (especially to do online) indeed. But the individual operations - defragmentation, recompression or dedup of existing data - can already be achieved today by zfs-sending the data away from the pool, cleaning it up, and zfs-receiving the data back, without any of the low-level layout changes that BP rewrite could do. Why not in-place? Unlike manual send-away-and-receive cycles, which incur downtime, the equivalent in-place manipulations could be done transparently to ZPL/ZVOL users by just invalidating parts of the ARC (by DVA of the reallocated blocks), I think, and they do not seem as inherently difficult as a complete BP rewrite. Again, this interim solution might be just a POC for later work on BP rewrite to include and improve :)

> "Just" is such a four-letter word.
>
> If you move a bp, you change its DVA. Which means that the parent bp
> pointing to it needs to be updated and rewritten, and its parents as
> well. This is new, COW data, with a new TXG attached -- but referring
> to data that is old and has not been changed.
> This gets back to the misunderstanding (way) above. Repair is not
> COW; repair is repairing the disk block to the original, correct
> contents.

Changes of DVAs causing reallocation of the whole branch of BPs during the defrag - yes, as I also wrote. However, I am not sure that it would induce changes to TXG numbers that must be fatal to snapshots and scrubs: as I've seen in the code (unlike in the ZFS on-disk format docs), the current blkptr_t includes two fields for a TXG number - the birth TXG and (IIRC) the write TXG. I guess one refers to the timestamp of when the data block was initially allocated in the queue, and the other (if non-zero) refers to the timestamp of when the block was optionally reallocated and written into the pool - perhaps upon recovery from the ZIL, or (as I thought above) upon generic repair, or under my proposed idea of defrag. So perhaps the system is already prepared to correctly process such reallocations, or can be cheated into it by "clever" use and/or ignoring of one of these fields... (I quote the two fields from the struct definition a bit further below, after this exchange.)

> You just broke snapshots and scrub, at least.

As for snapshots: you can send a series of incremental snapshots from one system to another, and of course the TXG numbers for the blocks of the snapshot dataset will differ on each pool. But this does not matter, as long as they are committed on disk in a particular order, with BPtree branches properly pointing to timestamp-ordered snapshots of the parent dataset. Your concern seems valid indeed, but I think it can be countered by scheduling a BPtree defrag to relocate and update block pointers for all snapshots of a dataset (and maybe its clones), or at least to ensure that the parent blocks of newer snapshots have higher TXG numbers - if that is required. This may place non-trivial demands on cache or buffer memory in order to prepare the big transaction for large datasets, so perhaps if the system detects that it cannot properly defrag a BPtree branch in one operation, it should abort without crashing the OS into scanrate-hell ;)

> It's not going to help a scrub, since that reads all of the ditto
> block copies, so bunching just one copy isn't useful.

I can agree - but only partially.
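Before elaborating on that partial agreement, here is the blkptr_t I referred to above, quoted from memory from sys/spa.h (so take the comments, which are mine, with a grain of salt). As far as I can tell from the BP_PHYSICAL_BIRTH() macro, blk_phys_birth is only non-zero when the physical allocation TXG differs from the logical birth TXG (dedup being the case I know of), and it falls back to blk_birth otherwise - which may or may not match my guesses above:

typedef struct blkptr {
        dva_t           blk_dva[SPA_DVAS_PER_BP]; /* data virtual addresses */
        uint64_t        blk_prop;       /* size, compression, type, etc.    */
        uint64_t        blk_pad[2];     /* extra space for the future       */
        uint64_t        blk_phys_birth; /* txg when the block was physically allocated/written */
        uint64_t        blk_birth;      /* txg of the block's logical birth */
        uint64_t        blk_fill;       /* fill count                       */
        zio_cksum_t     blk_cksum;      /* 256-bit checksum                 */
} blkptr_t;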
Coming back to that partial agreement: if the goal of storing the block pointers together - minimizing the mechanical reads needed to fetch many of them at once - is achievable, then it becomes possible to quickly "preread" the "colocated" copy of the BP tree, or large portions of it (provided there are no checksum or device errors during such reads - otherwise we fall back to the scattered ditto copies of the corrupted BP tree blocks). Then we can schedule more optimal reads for the scattered data, including the ditto blocks of the BP tree we have already read in (i.e. the other copies of those blocks). It would be the same walk covering the same data objects on disk, just done in a different (and hopefully faster) manner than today. (A rough sketch of this two-phase idea follows below.)
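Purely to illustrate that two-phase idea (my speculation, not how scrub works today): phase one walks the bunched copy of the BP tree and only harvests DVAs, phase two sorts them by vdev and offset and issues the actual data and ditto reads in mostly-sequential order. The types and the read_and_verify() helper below are hypothetical:

#include <stdint.h>
#include <stdlib.h>

/*
 * One entry per DVA harvested in phase 1 (including all ditto copies).
 * A real version would also carry the expected checksum taken from the
 * parent blkptr, so that phase 2 can verify what it reads.
 */
typedef struct scrub_entry {
        uint64_t se_vdev;       /* top-level vdev id  */
        uint64_t se_offset;     /* offset within vdev */
        uint64_t se_asize;      /* allocated size     */
} scrub_entry_t;

extern void read_and_verify(const scrub_entry_t *se);   /* hypothetical */

static int
scrub_entry_cmp(const void *a, const void *b)
{
        const scrub_entry_t *ea = a, *eb = b;

        if (ea->se_vdev != eb->se_vdev)
                return (ea->se_vdev < eb->se_vdev ? -1 : 1);
        if (ea->se_offset != eb->se_offset)
                return (ea->se_offset < eb->se_offset ? -1 : 1);
        return (0);
}

/* Phase 2: sort by physical location, then read back mostly sequentially. */
static void
scrub_phase2(scrub_entry_t *entries, size_t nentries)
{
        qsort(entries, nentries, sizeof (scrub_entry_t), scrub_entry_cmp);
        for (size_t i = 0; i < nentries; i++)
                read_and_verify(&entries[i]);
}

Whether the extra pass pays for itself obviously depends on how badly the BP tree copy and the data are intermixed on disk - that is the part I cannot judge without measuring.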
Thanks a lot for the discussion, I really appreciate it :)

//Jim Klimov