Re: [HACKERS] amcheck (B-Tree integrity checking tool)

Andres Freund Wed, 16 Nov 2016 13:04:10 -0800

Hi,


I think the patch could use a pgindent run.

On 2016-09-07 11:44:01 -0700, Peter Geoghegan wrote:
> diff --git a/contrib/amcheck/amcheck--1.0.sql 
> b/contrib/amcheck/amcheck--1.0.sql
> new file mode 100644
> index 0000000..ebbd6ac
> --- /dev/null
> +++ b/contrib/amcheck/amcheck--1.0.sql
> @@ -0,0 +1,20 @@
> +/* contrib/amcheck/amcheck--1.0.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit
> +
> +--
> +-- bt_index_check()
> +--
> +CREATE FUNCTION bt_index_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_check'
> +LANGUAGE C STRICT;
> +
> +--
> +-- bt_index_parent_check()
> +--
> +CREATE FUNCTION bt_index_parent_check(index regclass)
> +RETURNS VOID
> +AS 'MODULE_PATHNAME', 'bt_index_parent_check'
> +LANGUAGE C STRICT;

I'd really want a function that runs all check on a table.


> +#define CHECK_SUPERUSER() { \
> +             if (!superuser()) \
> +                     ereport(ERROR, \
> +                                     
> (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \
> +                                      (errmsg("must be superuser to use 
> verification functions")))); }

Why is this a macro?


> +/*
> + * Callers to verification functions should never receive a false positive
> + * indication of corruption.  Therefore, when using amcheck functions for
> + * stress testing, it may be useful to temporally change the CORRUPTION 
> elevel
> + * to PANIC, to immediately halt the server in the event of detecting an
> + * invariant condition violation.  This may preserve more information about 
> the
> + * nature of the underlying problem.  Note that modifying the CORRUPTION
> + * constant to be an elevel < ERROR is not well tested.
> + */
> +#ifdef NOT_USED
> +#define CORRUPTION           PANIC
> +#define CONCERN                      WARNING
> +#define POSITION             NOTICE
> +#else
> +#define CORRUPTION           ERROR
> +#define CONCERN                      DEBUG1
> +#define POSITION             DEBUG2
> +#endif

Hm, if we want that - and it doesn't seem like a bad idea - I think we
should be make it available without recompiling.


> +/*
> + * A B-Tree cannot possibly have this many levels, since there must be one
> + * block per level, which is bound by the range of BlockNumber:
> + */
> +#define InvalidBtreeLevel    ((uint32) InvalidBlockNumber)

Hm, I think it'd be, for the long term, better if we'd move the btree
check code to amcheck_nbtree.c or such.


> +Datum
> +bt_index_parent_check(PG_FUNCTION_ARGS)
> +{
> +     Oid                     indrelid = PG_GETARG_OID(0);
> +     Oid                     heapid;
> +     Relation        indrel;
> +     Relation        heaprel;
> +
> +     CHECK_SUPERUSER();
> +
> +     /*
> +      * We must lock table before index to avoid deadlocks.  However, if the
> +      * passed indrelid isn't an index then IndexGetRelation() will fail.
> +      * Rather than emitting a not-very-helpful error message, postpone
> +      * complaining, expecting that the is-it-an-index test below will fail.
> +      */
> +     heapid = IndexGetRelation(indrelid, true);
> +     if (OidIsValid(heapid))
> +             heaprel = heap_open(heapid, ShareLock);
> +     else
> +             heaprel = NULL;
> +
> +     /*
> +      * Open the target index relations separately (like relation_openrv(), 
> but
> +      * with heap relation locked first to prevent deadlocking).  In hot 
> standby
> +      * mode this will raise an error.
> +      */
> +     indrel = index_open(indrelid, ExclusiveLock);

Theoretically we should recheck that the oids still match, theoretically
the locking here allows the index->table mapping to change. It's not a
huge window tho...


> +     /* Check for active uses of the index in the current transaction */
> +     CheckTableNotInUse(indrel, "bt_index_parent_check");

Why do we actually care?



> +static void
> +btree_index_checkable(Relation rel)
> +{
> +     if (rel->rd_rel->relkind != RELKIND_INDEX ||
> +             rel->rd_rel->relam != BTREE_AM_OID)
> +             ereport(ERROR,
> +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                              errmsg("only nbtree access method indexes are 
> supported"),
> +                              errdetail("Relation \"%s\" is not a B-Tree 
> index.",
> +                                                
> RelationGetRelationName(rel))));
> +
> +     if (RELATION_IS_OTHER_TEMP(rel))
> +             ereport(ERROR,
> +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                              errmsg("cannot access temporary tables of 
> other sessions"),
> +                              errdetail("Index \"%s\" is associated with 
> temporary relation.",
> +                                                
> RelationGetRelationName(rel))));
> +
> +     if (!rel->rd_index->indisready)
> +             ereport(ERROR,
> +                             (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +                              errmsg("cannot check index \"%s\"",
> +                                             RelationGetRelationName(rel)),
> +                              errdetail("Index is not yet ready for 
> insertions")));

Why can't we check this? Seems valuable to allow checking after a
partial CONCURRENTLY index build?

Why don't you use IndexIsReady et al? Not sure if the patch compiles
against an earlier release (there weren't valid/ready/live dead fields,
but instead it was compressed into fewer)?


> +/*
> + * bt_check_every_level()
> + *
> + * Main entry point for B-Tree SQL-callable functions.  Walks the B-Tree in
> + * logical order, verifying invariants as it goes.
> + *
> + * It is the caller's responsibility to acquire appropriate heavyweight lock 
> on
> + * the index relation, and advise us if extra checks are safe when an
> + * ExclusiveLock is held.  An ExclusiveLock is generally assumed to prevent 
> any
> + * kind of physical modification to the index structure, including
> + * modifications that VACUUM may make.
> + */

Hm, what kind of locking does the kill_prior_tuple stuff use?


> +static void
> +bt_check_every_level(Relation rel, bool exclusivelock)
> +{

> +     /*
> +      * Certain deletion patterns can result in "skinny" B-Tree indexes, 
> where
> +      * the fast root and true root differ.
> +      *
> +      * Start from the true root, not the fast root, unlike conventional 
> index
> +      * scans.  This approach is more thorough, and removes the risk of
> +      * following a stale fast root from the meta page.
> +      */
> +     if (metad->btm_fastroot != metad->btm_root)
> +             ereport(CONCERN,
> +                             (errcode(ERRCODE_DUPLICATE_OBJECT),
> +                              errmsg("fast root mismatch in index %s",
> +                                             RelationGetRelationName(rel)),
> +                              errdetail_internal("Fast block %u (level %u) "
> +                                                                     
> "differs from true root "
> +                                                                     "block 
> %u (level %u).",
> +                                                                     
> metad->btm_fastroot, metad->btm_fastlevel,
> +                                                                     
> metad->btm_root, metad->btm_level)));

Hm, isn't that a potentially spurious report?


> +static BtreeLevel
> +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level)
> +{

> +     do
> +     {
> +             /* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */
> +             CHECK_FOR_INTERRUPTS();
> +
> +             /* Initialize state for this iteration */
> +             state->targetblock = current;
> +             state->target = palloc_btree_page(state, state->targetblock);
> +             state->targetlsn = PageGetLSN(state->target);
> +
> +             opaque = (BTPageOpaque) PageGetSpecialPointer(state->target);
> +
> +             if (P_IGNORE(opaque))
> +             {
> +                     if (P_RIGHTMOST(opaque))
> +                             ereport(CORRUPTION,
> +                                             
> (errcode(ERRCODE_INDEX_CORRUPTED),
> +                                              errmsg("block %u fell off the 
> end of index \"%s\"",
> +                                                             current, 
> RelationGetRelationName(state->rel))));
> +                     else
> +                             ereport(CONCERN,
> +                                             
> (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> +                                              errmsg("block %u of index 
> \"%s\" ignored",
> +                                                             current, 
> RelationGetRelationName(state->rel))));

Hm, why is this worth a WARNING? Isn't it perfectly normal to have a few
of these?


> +                     /*
> +                      * Before beginning any non-trivial examination of 
> level, establish
> +                      * next level down's leftmost block number, which next 
> call here
> +                      * will pass as its leftmost (iff this isn't leaf 
> level).

"next level down's leftmost block number" - it'd be good to simplify
that a bit.


> +/*
> + * bt_target_page_check()

Why "target"?

> +static void
> +bt_target_page_check(BtreeCheckState *state)
> +{


> +             /*
> +              *   ********************
> +              *   * Page order check *
> +              *   ********************

Isn't that item order, rather than page order?


> +             }
> +             /*
> +              *   ********************
> +              *   * Last item check  *
> +              *   ********************

Newline before block wouldn't hurt.


> +              * Check last item against next/right page's first data item's 
> when
> +              * last item on page is reached.

I think this need some polishing.



> +             /*
> +              *   ********************
> +              *   * Downlink check   *
> +              *   ********************
> +              *
> +              * Additional check of child items against target page (their 
> parent),
> +              * iff this is internal page and caller holds ExclusiveLock on 
> index

s/is/is an/ /holds/holds an/


> +static ScanKey
> +bt_right_page_check_scankey(BtreeCheckState *state)
> +{

> +      * A deleted page won't actually be recycled by VACUUM early enough for 
> us
> +      * to fail to be able follow its right link (or left link, or downlink),

"for us to fail to be able" sounds odd.

> +     /*
> +      * No ExclusiveLock held case -- why it's safe to proceed.
> +      *
> +      * Problem:
> +      *
> +      * We must avoid false positive reports of corruption when caller treats
> +      * item returned here as an upper bound on target's last item.  In 
> general,
> +      * false positives are disallowed.  Ensuring they don't happen in
> +      * !exclusivelock case is subtle.

s/item/an item/?


This paragraph is too complicated to follow in the noisy environment
where we both currently are ;)

> +/*
> + * bt_downlink_check()
> + *
> + * Checks one of target's downlink against its child page.

downlinks?

> + * Conceptually, the target page continues to be what is checked here.  The
> + * target block is still blamed in the event of finding an invariant 
> violation.
> + * The downlink insertion into the target is probably where any problem 
> raised
> + * here arises, and there is no such thing as a parent link, so doing the
> + * verification this way around is much more practical.
> + */
> +static void
> +bt_downlink_check(BtreeCheckState *state, BlockNumber childblock,
> +                               ScanKey targetkey)
> +{

> +      * between the deleted page and its right sibling.  (We cannot delete a
> +      * parent page's rightmost page unless it is the last child page, and we
> +      * intend to delete the parent itself.)

"parent page's rightmost page"?


I like this. Some of the more complex pieces towards the end of the
field need some attention, there's a fair amount of word-smithing
needed, and I do think we want to make the structural changes outlined
above, but besides these, imo fairly simple adaptions, I do think this
is useful and not that far from being committable.


- Andres


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] amcheck (B-Tree integrity checking tool)

Reply via email to