On Wed, Nov 16, 2016 at 1:03 PM, Andres Freund <and...@anarazel.de> wrote: > I think the patch could use a pgindent run.
Okay. > I'd really want a function that runs all check on a table. Why not just use a custom SQL query? The docs have an example of one such query, that verifies all user indexes, but there could be another example. >> +#define CHECK_SUPERUSER() { \ >> + if (!superuser()) \ >> + ereport(ERROR, \ >> + >> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ >> + (errmsg("must be superuser to use >> verification functions")))); } > > Why is this a macro? This was just copied from pageinspect, TBH. What would you prefer? >> +#ifdef NOT_USED >> +#define CORRUPTION PANIC >> +#define CONCERN WARNING >> +#define POSITION NOTICE >> +#else >> +#define CORRUPTION ERROR >> +#define CONCERN DEBUG1 >> +#define POSITION DEBUG2 >> +#endif > > Hm, if we want that - and it doesn't seem like a bad idea - I think we > should be make it available without recompiling. I suppose, provided it doesn't let CORRUPTION elevel be < ERROR. That might be broken if it was allowed. > Hm, I think it'd be, for the long term, better if we'd move the btree > check code to amcheck_nbtree.c or such. Agreed. >> +Datum >> +bt_index_parent_check(PG_FUNCTION_ARGS) >> +{ >> + Oid indrelid = PG_GETARG_OID(0); > Theoretically we should recheck that the oids still match, theoretically > the locking here allows the index->table mapping to change. It's not a > huge window tho... >> + /* Check for active uses of the index in the current transaction */ >> + CheckTableNotInUse(indrel, "bt_index_parent_check"); > > Why do we actually care? This was left-over from duplicating REINDEX code. Indeed, we have no reason to care at all. >> + if (!rel->rd_index->indisready) >> + ereport(ERROR, >> + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), >> + errmsg("cannot check index \"%s\"", >> + RelationGetRelationName(rel)), >> + errdetail("Index is not yet ready for >> insertions"))); > > Why can't we check this? Seems valuable to allow checking after a > partial CONCURRENTLY index build? Okay. > Why don't you use IndexIsReady et al? Not sure if the patch compiles > against an earlier release (there weren't valid/ready/live dead fields, > but instead it was compressed into fewer)? This version is almost identical to the Github version of amcheck that we've been using with 9.4+. Agreed on IndexIsReady(). >> +/* >> + * bt_check_every_level() >> + * >> + * Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in >> + * logical order, verifying invariants as it goes. >> + * >> + * It is the caller's responsibility to acquire appropriate heavyweight >> lock on >> + * the index relation, and advise us if extra checks are safe when an >> + * ExclusiveLock is held. An ExclusiveLock is generally assumed to prevent >> any >> + * kind of physical modification to the index structure, including >> + * modifications that VACUUM may make. >> + */ > > Hm, what kind of locking does the kill_prior_tuple stuff use? The killing itself, or the setting of LP_DEAD bit? Ordinary index scans could concurrently set LP_DEAD with only an AccessShareLock. But, that's just metadata that amcheck doesn't care about. Anything that needs to actually recyle space based on finding an LP_DEAD bit set is a write operation, that is precluded from running when bt_index_parent_check() is called by its ExclusiveLock. _bt_vacuum_one_page() is only called by one nbtinsert.c code path, when it looks like a page split might be needed. >> +static void >> +bt_check_every_level(Relation rel, bool exclusivelock) >> +{ > >> + /* >> + * Certain deletion patterns can result in "skinny" B-Tree indexes, >> where >> + * the fast root and true root differ. >> + * >> + * Start from the true root, not the fast root, unlike conventional >> index >> + * scans. This approach is more thorough, and removes the risk of >> + * following a stale fast root from the meta page. >> + */ >> + if (metad->btm_fastroot != metad->btm_root) >> + ereport(CONCERN, >> + (errcode(ERRCODE_DUPLICATE_OBJECT), >> + errmsg("fast root mismatch in index %s", >> + RelationGetRelationName(rel)), >> + errdetail_internal("Fast block %u (level %u) " >> + >> "differs from true root " >> + "block >> %u (level %u).", >> + >> metad->btm_fastroot, metad->btm_fastlevel, >> + >> metad->btm_root, metad->btm_level))); > > Hm, isn't that a potentially spurious report? Well, it's not an error, so no. I think that it would be better with a lower elevel, though, to indicate the low severity. >> + ereport(CONCERN, >> + >> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), >> + errmsg("block %u of index >> \"%s\" ignored", >> + current, >> RelationGetRelationName(state->rel)))); > > Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few > of these? It is, but you shouldn't get very many of them, since pages that are ignorable don't have sibling links at the second phase of deletion. I guess WARNING might be too strong here, too. > "next level down's leftmost block number" - it'd be good to simplify > that a bit. Okay. >> +/* >> + * bt_target_page_check() > > Why "target"? Conceptually, the verification scan vists from true root to leaf, left to right, and makes each page it finds along the way the target exactly once. It's represented as such in the state that we pass around -- there is some metadata stashed about the target there (e.g., LSN is kept there for easy retrieval by possible ereport(ERROR) calls). I think that that makes things quite a lot clearer, particularly when we need to discuss checks that happen against the sibling or child of target (conceptually, we're still checking the target). The description of what happens and why that is okay is a lot clearer because we constantly refer to the current target page, IMV -- that's always our frame of reference. >> + /* >> + * ******************** >> + * * Page order check * >> + * ******************** > > Isn't that item order, rather than page order? I don't know...it checks that the items on the page are in the right order. > (Various minor niggles about style) Okay. > This paragraph is too complicated to follow in the noisy environment > where we both currently are ;) As I pointed out yesterday, that paragraph is 95%+ of what makes the patch hard to understand. It's worth it, though, to detect problems with things like transposed pages while holding only an AccessShareLock. > downlinks? Downlinks are an accepted terminology for the index tuples in internal pages, that point to child pages (rather than having TID pointer to table). > "parent page's rightmost page"? I guess that should be "parent page's rightmost child". This is covered in the nbtree README: """ To preserve consistency on the parent level, we cannot merge the key space of a page into its right sibling unless the right sibling is a child of the same parent --- otherwise, the parent's key space assignment changes too, meaning we'd have to make bounding-key updates in its parent, and perhaps all the way up the tree. Since we can't possibly do that atomically, we forbid this case. That means that the rightmost child of a parent node can't be deleted unless it's the only remaining child, in which case we will delete the parent too (see below). ''"" > I like this. Some of the more complex pieces towards the end of the > field need some attention, there's a fair amount of word-smithing > needed, and I do think we want to make the structural changes outlined > above, but besides these, imo fairly simple adaptions, I do think this > is useful and not that far from being committable. Cool. I'll try to get a revision out soon. I'm happy to do that much. -- Peter Geoghegan -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers