On Tue, Aug 4, 2020 at 9:44 AM Robert Haas <robertmh...@gmail.com> wrote: > I think it might be possible to distinguish between different types of > corruption and to separate, at least to some degree, the checking > associated with each type. I think one can imagine something that > checks the structure of a btree without regard to the contents. That > is, it cares that left and right links are consistent with each other > and with downlinks from the parent level. So it checks things like the > left link of the page to which my right link points is pointing back > to me, and that's also the page to which my parent's next downlink > points.
I think that this kind of phased approach to B-Tree verification is possible, more or less, but hard to justify. And it seems impossible to do with only an AccessShareLock. It's not clear that what you describe is much better than just checking a bunch of indexes and seeing what patterns emerge. For example, the involvement of collated text might be a common factor across indexes. That kind of pattern is the first thing that I look for, and often the only thing. It also serves to give me an idea of how messed up things are. There are not that many meaningful degrees of messed-up with indexes in my experience. The first error really does tell you most of what you need to know about any given corrupt index. Kind of like how you can bucket the number of cockroaches in your home into perhaps three meaningful buckets: 0 cockroaches, at least 1 cockroach, and lots of cockroaches. (Even there, if you really care about the distinction between the second and third bucket, something has gone terribly wrong -- so even three buckets seems like a lot to me.) FWIW, current DEBUG1 + DEBUG2 output for amcheck shows you quite a lot of details about the tree structure. It's a handy way of getting a sense of what's going on at a high level. For example, if index corruption is found very early on, that strongly suggests that it's pretty pervasive. > Now a second type of checking, which can also be done without regard > to keys, is checking that the TIDs in the index point to TIDs that are > on heap pages that actually exist, and that the corresponding items > are not unused, nor are they tuples which are not the root of a HOT > chain. Passing a check of this type doesn't prove that the index and > heap are consistent, but failing it proves that they are inconsistent. > This kind of check can be done on every leaf index page you can find > by any means even if it fails the structural checks described above. > Failure of these checks on one page does not preclude checking the > same invariants for other pages. Let's call this kind of thing "basic > index-heap sanity checking." One real weakness in the current code is our inability to detect index tuples that are in the correct order and so on, but point to the wrong thing -- we can detect that if it manifests itself as the absence of an index tuple that should be in the index (when you use heapallindexed verification), but we cannot *reliably* detect the presence of an index tuple that shouldn't be in the index at all (though in practice it probably mostly gets caught). The checks on the tree structure itself are excellent with bt_index_parent_check() following Alexander's commit d114cc53 (which I thought was really excellent work). But we still have that one remaining blind spot in verify_nbtree.c, even when you opt in to every possible type of verification (i.e. bt_index_parent_check() with all options). I'd much rather fix that, or help with the new heap checker stuff. > A fourth type of checking is to verify the index key against the keys > in the heap tuples to which they point, but only for index tuples that > passed the basic index-heap sanity checking and where the tuples have > not been pruned. This can be sensibly done even if the structural > checks or index-ordering checks have failed. That's going to require the equivalent of a merge join, which is terribly expensive relative to such a small benefit. > Aside from providing a way to usefully continue after errors, this > would also be useful in certain scenarios where you want to know what > kind of corruption you have. For example, suppose that I start getting > wrong answers from index lookups on a particular index. Upon > investigation, it turns out that my last glibc update changed my OS > collation definitions for the collation I'm using, and therefore it is > to be expected that some of my keys may appear to be out of order with > respect to the new definitions. Now what I really want to know before > running REINDEX is that this is the only problem I have. It would be > amazing if I could run the tool and have it give me a list of problems > so that I could confirm that I have only index-ordering problems, not > any other kind, and even more amazing if it could tell me the specific > keys that were affected so that I could understand exactly how the > sorting behavior changed. This detail seems really hard. There are probably lots of cases where the sorting behavior changed but it just didn't affect you, given the data you had -- it just so happened that you didn't have exactly the wrong kind of diacritic mark or whatever. After all, revisions to how strings in a given natural language are supposed to sort are likely to be relatively rare and relatively obscure (even among people that speak the language in question). Also, the task of figuring out if the tuple to the left or the right is in the wrong order seems kind of daunting. Meanwhile, a simple smoke test covering many indexes probably gives you a fairly meaningful idea of the extent of the damage, without requiring that we do any hard engineering work. > I'm speaking here with fairly limited knowledge of the details of how > all this actually works and, again, I'm not trying to suggest that you > or anyone is obligated to do any work on this, or that it would be > easy to accomplish or worth the time it took. I'm just trying to > sketch out what I see as maybe being theoretically possible, and why I > think it would be useful if it did. I don't think that your relatively limited knowledge of the B-Tree code is an issue here -- your intuitions seem pretty reasonable. I appreciate your perspective here. Corruption detection presents us with some odd qualitative questions of the kind that are just awkward to discuss. Discouraging perspectives that don't quite match my own would be quite counterproductive. That having been said, I suspect that this is a huge task for a small benefit. It's exceptionally hard to test because you have lots of non-trivial code that only gets used in circumstances that by definition should never happen. If users really needed to recover the data in the index then maybe it would happen -- but they don't. The biggest problem that amcheck currently has is that it isn't used enough, because it isn't positioned as a general purpose tool at all. I'm hoping that the work from Mark helps with that. -- Peter Geoghegan