On Tue, Aug 4, 2020 at 12:00 PM Peter Geoghegan <p...@bowt.ie> wrote: > With indexes you tend to have redundancy in how relationships among > pages are described. So you have siblings whose pointers must be in > agreement (left points to right, right points to left), and it's not > clear which one you should trust when they don't agree. It's not like > simple heuristics get you all that far. I really can't think of a good > one, and detecting corruption should mean detecting truly exceptional > cases. I guess you could build a model based on Bayesian methods, or > something like that. But that is very complicated, and only used when > you actually have corruption -- which is presumably extremely rare in > reality. That's very unappealing as a project.
I think it might be possible to distinguish between different types of corruption and to separate, at least to some degree, the checking associated with each type. I think one can imagine something that checks the structure of a btree without regard to the contents. That is, it cares that left and right links are consistent with each other and with downlinks from the parent level. So it checks things like the left link of the page to which my right link points is pointing back to me, and that's also the page to which my parent's next downlink points. It could also verify that there's a proper tree structure, where every page has a well-defined tree level. So you assign the root page level 1, and each time you traverse a downlink you assign that page a level one larger. If you ever try to assign to a page a level unequal to the level previously assigned to it, you report that as a problem. You can check, too, that if a page does not have a left or right link, it's actually the last page at that level according what you saw at the parent, grandparent, etc. levels. Finally, you can check that all of the max-level pages you can find are leaf pages, and the others are all internal pages. All of this structural stuff can be verified without caring a whit about what keys you've got or what they mean or whether there's even a heap associated with this index. Now a second type of checking, which can also be done without regard to keys, is checking that the TIDs in the index point to TIDs that are on heap pages that actually exist, and that the corresponding items are not unused, nor are they tuples which are not the root of a HOT chain. Passing a check of this type doesn't prove that the index and heap are consistent, but failing it proves that they are inconsistent. This kind of check can be done on every leaf index page you can find by any means even if it fails the structural checks described above. Failure of these checks on one page does not preclude checking the same invariants for other pages. Let's call this kind of thing "basic index-heap sanity checking." A third type of checking is to verify the relationship between the index keys within and across the index pages: are the keys actually in order within a page, and are they in order across pages? The first part of this can be checked individually for each page pretty much no matter what other problems we may have; we only have to abandon this checking for a particular page if it's total garbage and we cannot identify any index items on the page at all. The second part, though, has the problem you mention. I think the solution is to skip the second part of the check for any pages that failed related structural checks. For example, if my right sibling thinks that I am not its left sibling, or my right sibling and I agree that we are siblings but do not agree on who our parent is, or if that parent does not agree that we have the same sibling relationship that we think we have, then we should report that problem and forget about issuing any complaints about the relationship between my key space and that sibling's key space. The internal consistency of each page with respect to key ordering can still be verified, though, and it's possible that my key space can be validly compared to the key space of my other sibling, if the structural checks pass on that side. A fourth type of checking is to verify the index key against the keys in the heap tuples to which they point, but only for index tuples that passed the basic index-heap sanity checking and where the tuples have not been pruned. This can be sensibly done even if the structural checks or index-ordering checks have failed. I don't mean to suggest that one would implement all of these things as separate phases; that would be crazy expensive, and what if things changed by the time you visit the page? Rather, the checks likely ought to be interleaved, just keeping track internally of which things need to be skipped because prerequisite checks have already failed. Aside from providing a way to usefully continue after errors, this would also be useful in certain scenarios where you want to know what kind of corruption you have. For example, suppose that I start getting wrong answers from index lookups on a particular index. Upon investigation, it turns out that my last glibc update changed my OS collation definitions for the collation I'm using, and therefore it is to be expected that some of my keys may appear to be out of order with respect to the new definitions. Now what I really want to know before running REINDEX is that this is the only problem I have. It would be amazing if I could run the tool and have it give me a list of problems so that I could confirm that I have only index-ordering problems, not any other kind, and even more amazing if it could tell me the specific keys that were affected so that I could understand exactly how the sorting behavior changed. If I were to discover that my index also has structural problems or inconsistencies with the heap, then I'd know that it couldn't be right to blame it only the collation update; something else has gone wrong. I'm speaking here with fairly limited knowledge of the details of how all this actually works and, again, I'm not trying to suggest that you or anyone is obligated to do any work on this, or that it would be easy to accomplish or worth the time it took. I'm just trying to sketch out what I see as maybe being theoretically possible, and why I think it would be useful if it did. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company