Hi,
I think the patch could use a pgindent run. On 2016-09-07 11:44:01 -0700, Peter Geoghegan wrote: > diff --git a/contrib/amcheck/amcheck--1.0.sql > b/contrib/amcheck/amcheck--1.0.sql > new file mode 100644 > index 0000000..ebbd6ac > --- /dev/null > +++ b/contrib/amcheck/amcheck--1.0.sql > @@ -0,0 +1,20 @@ > +/* contrib/amcheck/amcheck--1.0.sql */ > + > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION > +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit > + > +-- > +-- bt_index_check() > +-- > +CREATE FUNCTION bt_index_check(index regclass) > +RETURNS VOID > +AS 'MODULE_PATHNAME', 'bt_index_check' > +LANGUAGE C STRICT; > + > +-- > +-- bt_index_parent_check() > +-- > +CREATE FUNCTION bt_index_parent_check(index regclass) > +RETURNS VOID > +AS 'MODULE_PATHNAME', 'bt_index_parent_check' > +LANGUAGE C STRICT; I'd really want a function that runs all check on a table. > +#define CHECK_SUPERUSER() { \ > + if (!superuser()) \ > + ereport(ERROR, \ > + > (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ > + (errmsg("must be superuser to use > verification functions")))); } Why is this a macro? > +/* > + * Callers to verification functions should never receive a false positive > + * indication of corruption. Therefore, when using amcheck functions for > + * stress testing, it may be useful to temporally change the CORRUPTION > elevel > + * to PANIC, to immediately halt the server in the event of detecting an > + * invariant condition violation. This may preserve more information about > the > + * nature of the underlying problem. Note that modifying the CORRUPTION > + * constant to be an elevel < ERROR is not well tested. > + */ > +#ifdef NOT_USED > +#define CORRUPTION PANIC > +#define CONCERN WARNING > +#define POSITION NOTICE > +#else > +#define CORRUPTION ERROR > +#define CONCERN DEBUG1 > +#define POSITION DEBUG2 > +#endif Hm, if we want that - and it doesn't seem like a bad idea - I think we should be make it available without recompiling. > +/* > + * A B-Tree cannot possibly have this many levels, since there must be one > + * block per level, which is bound by the range of BlockNumber: > + */ > +#define InvalidBtreeLevel ((uint32) InvalidBlockNumber) Hm, I think it'd be, for the long term, better if we'd move the btree check code to amcheck_nbtree.c or such. > +Datum > +bt_index_parent_check(PG_FUNCTION_ARGS) > +{ > + Oid indrelid = PG_GETARG_OID(0); > + Oid heapid; > + Relation indrel; > + Relation heaprel; > + > + CHECK_SUPERUSER(); > + > + /* > + * We must lock table before index to avoid deadlocks. However, if the > + * passed indrelid isn't an index then IndexGetRelation() will fail. > + * Rather than emitting a not-very-helpful error message, postpone > + * complaining, expecting that the is-it-an-index test below will fail. > + */ > + heapid = IndexGetRelation(indrelid, true); > + if (OidIsValid(heapid)) > + heaprel = heap_open(heapid, ShareLock); > + else > + heaprel = NULL; > + > + /* > + * Open the target index relations separately (like relation_openrv(), > but > + * with heap relation locked first to prevent deadlocking). In hot > standby > + * mode this will raise an error. > + */ > + indrel = index_open(indrelid, ExclusiveLock); Theoretically we should recheck that the oids still match, theoretically the locking here allows the index->table mapping to change. It's not a huge window tho... > + /* Check for active uses of the index in the current transaction */ > + CheckTableNotInUse(indrel, "bt_index_parent_check"); Why do we actually care? > +static void > +btree_index_checkable(Relation rel) > +{ > + if (rel->rd_rel->relkind != RELKIND_INDEX || > + rel->rd_rel->relam != BTREE_AM_OID) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("only nbtree access method indexes are > supported"), > + errdetail("Relation \"%s\" is not a B-Tree > index.", > + > RelationGetRelationName(rel)))); > + > + if (RELATION_IS_OTHER_TEMP(rel)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot access temporary tables of > other sessions"), > + errdetail("Index \"%s\" is associated with > temporary relation.", > + > RelationGetRelationName(rel)))); > + > + if (!rel->rd_index->indisready) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("cannot check index \"%s\"", > + RelationGetRelationName(rel)), > + errdetail("Index is not yet ready for > insertions"))); Why can't we check this? Seems valuable to allow checking after a partial CONCURRENTLY index build? Why don't you use IndexIsReady et al? Not sure if the patch compiles against an earlier release (there weren't valid/ready/live dead fields, but instead it was compressed into fewer)? > +/* > + * bt_check_every_level() > + * > + * Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in > + * logical order, verifying invariants as it goes. > + * > + * It is the caller's responsibility to acquire appropriate heavyweight lock > on > + * the index relation, and advise us if extra checks are safe when an > + * ExclusiveLock is held. An ExclusiveLock is generally assumed to prevent > any > + * kind of physical modification to the index structure, including > + * modifications that VACUUM may make. > + */ Hm, what kind of locking does the kill_prior_tuple stuff use? > +static void > +bt_check_every_level(Relation rel, bool exclusivelock) > +{ > + /* > + * Certain deletion patterns can result in "skinny" B-Tree indexes, > where > + * the fast root and true root differ. > + * > + * Start from the true root, not the fast root, unlike conventional > index > + * scans. This approach is more thorough, and removes the risk of > + * following a stale fast root from the meta page. > + */ > + if (metad->btm_fastroot != metad->btm_root) > + ereport(CONCERN, > + (errcode(ERRCODE_DUPLICATE_OBJECT), > + errmsg("fast root mismatch in index %s", > + RelationGetRelationName(rel)), > + errdetail_internal("Fast block %u (level %u) " > + > "differs from true root " > + "block > %u (level %u).", > + > metad->btm_fastroot, metad->btm_fastlevel, > + > metad->btm_root, metad->btm_level))); Hm, isn't that a potentially spurious report? > +static BtreeLevel > +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level) > +{ > + do > + { > + /* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */ > + CHECK_FOR_INTERRUPTS(); > + > + /* Initialize state for this iteration */ > + state->targetblock = current; > + state->target = palloc_btree_page(state, state->targetblock); > + state->targetlsn = PageGetLSN(state->target); > + > + opaque = (BTPageOpaque) PageGetSpecialPointer(state->target); > + > + if (P_IGNORE(opaque)) > + { > + if (P_RIGHTMOST(opaque)) > + ereport(CORRUPTION, > + > (errcode(ERRCODE_INDEX_CORRUPTED), > + errmsg("block %u fell off the > end of index \"%s\"", > + current, > RelationGetRelationName(state->rel)))); > + else > + ereport(CONCERN, > + > (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), > + errmsg("block %u of index > \"%s\" ignored", > + current, > RelationGetRelationName(state->rel)))); Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few of these? > + /* > + * Before beginning any non-trivial examination of > level, establish > + * next level down's leftmost block number, which next > call here > + * will pass as its leftmost (iff this isn't leaf > level). "next level down's leftmost block number" - it'd be good to simplify that a bit. > +/* > + * bt_target_page_check() Why "target"? > +static void > +bt_target_page_check(BtreeCheckState *state) > +{ > + /* > + * ******************** > + * * Page order check * > + * ******************** Isn't that item order, rather than page order? > + } > + /* > + * ******************** > + * * Last item check * > + * ******************** Newline before block wouldn't hurt. > + * Check last item against next/right page's first data item's > when > + * last item on page is reached. I think this need some polishing. > + /* > + * ******************** > + * * Downlink check * > + * ******************** > + * > + * Additional check of child items against target page (their > parent), > + * iff this is internal page and caller holds ExclusiveLock on > index s/is/is an/ /holds/holds an/ > +static ScanKey > +bt_right_page_check_scankey(BtreeCheckState *state) > +{ > + * A deleted page won't actually be recycled by VACUUM early enough for > us > + * to fail to be able follow its right link (or left link, or downlink), "for us to fail to be able" sounds odd. > + /* > + * No ExclusiveLock held case -- why it's safe to proceed. > + * > + * Problem: > + * > + * We must avoid false positive reports of corruption when caller treats > + * item returned here as an upper bound on target's last item. In > general, > + * false positives are disallowed. Ensuring they don't happen in > + * !exclusivelock case is subtle. s/item/an item/? This paragraph is too complicated to follow in the noisy environment where we both currently are ;) > +/* > + * bt_downlink_check() > + * > + * Checks one of target's downlink against its child page. downlinks? > + * Conceptually, the target page continues to be what is checked here. The > + * target block is still blamed in the event of finding an invariant > violation. > + * The downlink insertion into the target is probably where any problem > raised > + * here arises, and there is no such thing as a parent link, so doing the > + * verification this way around is much more practical. > + */ > +static void > +bt_downlink_check(BtreeCheckState *state, BlockNumber childblock, > + ScanKey targetkey) > +{ > + * between the deleted page and its right sibling. (We cannot delete a > + * parent page's rightmost page unless it is the last child page, and we > + * intend to delete the parent itself.) "parent page's rightmost page"? I like this. Some of the more complex pieces towards the end of the field need some attention, there's a fair amount of word-smithing needed, and I do think we want to make the structural changes outlined above, but besides these, imo fairly simple adaptions, I do think this is useful and not that far from being committable. - Andres -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers