I was assigned an "action point" during the FOSDEM developer meeting: "Post new version of btree consistency checker patch". I attach a new WIP version of my consistency checker tool, amcheck. This patch is proposed for 9.6, as an extension in contrib -- maybe we can still get it in. This is the first time I've added any version of this to a commitfest, although I've posted a couple of rough versions of this in the past couple of years. The attached version has received a major overhaul, and is primarily aimed at finding corruption in production systems, although I think it will still have significant value for debugging too. Maybe it can help with some of the B-Tree patches in the final commitfest, for example. I also have some hope that it will become a learning tool for people interested in how B-Tree indexes work.
To recap, the extension adds some SQL-callable functions that verify certain invariant conditions hold within some particular B-Tree index. These are the conditions that index scans rely on always being true. The tool's scope may eventually cover other AMs, including heapam, but nbtree seems like the best place to start. Note that no function currently checks that the index is consistent with the heap, which would be very useful (that's probably how I'd eventually target the heapam, actually). Invariants ======== nbtree invariants that the tool verifies with just an AccessShareLock on the relation are: * That all items are in the correct, opclass order on each page. * That the page "high key", if any, actually bounds the items on the page. * That the last item on a page is less than or equal to the first item on the next page (the page to its right). The idea here is that the key space spans multiple pages, not just one page, so it make sense to check the last item where we can. With an ExclusiveLock + ShareLock, some addition invariants are verified: * That child pages actually have their parent's downlink as a lower bound. * Sane right links and left links at each level. Obviously, this tool is all about distrusting the structure of a B-Tree index. That being the case, it's not always totally clear where to draw the line. I think I have the balance about right, though. Interface ======= There are only 2 SQL callable functions in the extension, which are very similar: bt_index_check(index regclass) bt_index_parent_check(index regclass) The latter is more thorough than the former -- it performs all checks, including those checks that I mentioned required an ExclusiveLock. So, bt_index_check() only requires an AccessShareLock. bt_index_parent_check() requires an ExclusiveLock on the index relation, and a ShareLock on its heap relation, almost like REINDEX. bt_index_parent_check() performs verification that is a superset of the verification performed by bt_index_check() -- mostly, the extra verification/work is that it verifies downlinks against child pages. Both functions raise an error in the event of observing that an invariant in a B-Tree was violated, such as items being out of order on a page. I've written extensive documentation, which goes into practical aspects of using amcheck effectively. It doesn't go into significant detail about every invariant that is checked, but gives a good idea of roughly what checks are performed. I could almost justify only having one function with an argument about the downlink/child verification, but that would be a significant footgun given the varying locking requirements that such a function would have. Locking ====== We never rely on something like holding on to a buffer pin as an interlock for correctness (the vacuum interlock thing isn't generally necessary, because we don't look at the heap at all). We simply pin + BT_READ lock a buffer, copy it into local memory allocated by palloc(), and then immediately release the buffer lock and drop the pin. This is the same in all instances. There is never more than one buffer lock or pin held at a time. We do, on the other hand, have a detailed rationale for why it's okay that we don't have an ExclusiveLock on the index relation for checks that span the key space of more than one page by following right links to compare items across sibling pages. This isn't the same thing as having an explicit interlock like a pin -- our interlock is one against *recycling* by vacuum, which is based on recentGlobalXmin. This rationale requires expert review. Performance ========== Trying to keep the tool as simple as possible, while still making it do verification that is as useful as possible was my priority here, not performance. Still, verification completes fairly quickly. Certainly, it takes far less time than having to REINDEX the index, and doesn't need too much memory. I think that in practice most problems that can be detected by the B-Tree checker functions will be detected with the lighter variant. -- Peter Geoghegan
From 573810d8d3c994ce1a16ecffb2f5d208c0ff93e3 Mon Sep 17 00:00:00 2001 From: Peter Geoghegan <p...@bowt.ie> Date: Tue, 10 Jun 2014 22:20:40 -0700 Subject: [PATCH] Add amcheck extension to contrib The new extension makes available two SQL-callable functions, each of which accept a single argument (a regclass/nbtree index relation argument) and return void. The SQL-callable function bt_index_check() checks invariants of the target index by walking the tree, performing various checks of the sanity of the page space using insertion scankeys to compare items. The SQL-callable function bt_index_parent_check() performs a superset of the checks performed by bt_index_check(): the same checks, as well as another check verifying that each downlink is actually respected as a lower bound on its child page. bt_index_check() requires only an AccessShareLock on the target. bt_index_parent_check() requires an ExclusiveLock on the target, and ShareLock on its heap relation. --- contrib/amcheck/Makefile | 19 + contrib/amcheck/amcheck--1.0.sql | 20 + contrib/amcheck/amcheck.c | 1117 ++++++++++++++++++++++++++++++++++++++ contrib/amcheck/amcheck.control | 5 + doc/src/sgml/amcheck.sgml | 244 +++++++++ doc/src/sgml/contrib.sgml | 1 + doc/src/sgml/filelist.sgml | 1 + 7 files changed, 1407 insertions(+) create mode 100644 contrib/amcheck/Makefile create mode 100644 contrib/amcheck/amcheck--1.0.sql create mode 100644 contrib/amcheck/amcheck.c create mode 100644 contrib/amcheck/amcheck.control create mode 100644 doc/src/sgml/amcheck.sgml diff --git a/contrib/amcheck/Makefile b/contrib/amcheck/Makefile new file mode 100644 index 0000000..7bb29aa --- /dev/null +++ b/contrib/amcheck/Makefile @@ -0,0 +1,19 @@ +# contrib/amcheck/Makefile + +MODULE_big = amcheck +OBJS = amcheck.o $(WIN32RES) + +EXTENSION = amcheck +DATA = amcheck--1.0.sql +PGFILEDESC = "amcheck - functions to verify access method invariants" + +ifdef USE_PGXS +PG_CONFIG = pg_config +PGXS := $(shell $(PG_CONFIG) --pgxs) +include $(PGXS) +else +subdir = contrib/amcheck +top_builddir = ../.. +include $(top_builddir)/src/Makefile.global +include $(top_srcdir)/contrib/contrib-global.mk +endif diff --git a/contrib/amcheck/amcheck--1.0.sql b/contrib/amcheck/amcheck--1.0.sql new file mode 100644 index 0000000..ebbd6ac --- /dev/null +++ b/contrib/amcheck/amcheck--1.0.sql @@ -0,0 +1,20 @@ +/* contrib/amcheck/amcheck--1.0.sql */ + +-- complain if script is sourced in psql, rather than via CREATE EXTENSION +\echo Use "CREATE EXTENSION amcheck" to load this file. \quit + +-- +-- bt_index_check() +-- +CREATE FUNCTION bt_index_check(index regclass) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_check' +LANGUAGE C STRICT; + +-- +-- bt_index_parent_check() +-- +CREATE FUNCTION bt_index_parent_check(index regclass) +RETURNS VOID +AS 'MODULE_PATHNAME', 'bt_index_parent_check' +LANGUAGE C STRICT; diff --git a/contrib/amcheck/amcheck.c b/contrib/amcheck/amcheck.c new file mode 100644 index 0000000..44bc7bb --- /dev/null +++ b/contrib/amcheck/amcheck.c @@ -0,0 +1,1117 @@ +/*------------------------------------------------------------------------- + * + * amcheck.c + * Verifies the integrity of access methods based on invariants. + * + * Provides SQL-callable functions for verifying that various logical + * invariants in the structure of index access methods are respected. This + * includes, for example, the invariant that each page in the target B-Tree + * index has "real" items in logical order as reported by an insertion scankey + * (the insertion scankey sort-wise NULL semantics are useful for + * verification). + * + * + * Copyright (c) 2016, PostgreSQL Global Development Group + * + * IDENTIFICATION + * contrib/amcheck/amcheck.c + * + *------------------------------------------------------------------------- + */ +#include "postgres.h" + +#include "access/nbtree.h" +#include "access/transam.h" +#include "catalog/index.h" +#include "catalog/pg_am.h" +#include "commands/tablecmds.h" +#include "miscadmin.h" +#include "storage/lmgr.h" +#include "utils/memutils.h" +#include "utils/snapmgr.h" + + +PG_MODULE_MAGIC; + +#define CHECK_SUPERUSER() { \ + if (!superuser()) \ + ereport(ERROR, \ + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), \ + (errmsg("must be superuser to use verification functions")))); } + +/* +* As noted in comments above _bt_compare(), there is special handling of the +* first data item (that is, the first item with a valid downlink -- not the +* high key item) on a non-leaf (internal) page. There is clearly no point in +* having amcheck functions make any comparison of or against these "minus +* infinity" items, because they contain no actual information other than the +* downlink. +*/ +#define OFFSET_IS_MINUS_INFINITY(opaque, offset) \ + (!P_ISLEAF(opaque) && offset == P_FIRSTDATAKEY(opaque)) + +/* + * Callers to verification functions should never receive a false positive + * indication of corruption. Therefore, when using amcheck functions for + * stress testing, it may be useful to temporally change the CORRUPTION elevel + * to PANIC, to immediately halt the server in the event of detecting an + * invariant condition violation. This may preserve more information about the + * nature of the underlying problem. Note that modifying the CORRUPTION + * constant to be an elevel < ERROR is not well tested. + */ +#ifdef NOT_USED +#define CORRUPTION PANIC +#define CONCERN WARNING +#else +#define CORRUPTION ERROR +#define CONCERN INFO +#endif + +/* + * A B-Tree cannot possibly have this many levels, since there must be one + * block per page, and that is bound by the range of BlockNumber: + */ +#define InvalidBtreeLevel ((uint32) InvalidBlockNumber) + +/* + * State associated with verifying a B-Tree index + */ +typedef struct BtreeCheckState +{ + /* + * Unchanging state, established at start of verification: + * + * rel: B-Tree Index Relation + * exclusivelock: ExclusiveLock held on rel; else AccessShareLock + * checkstrategy: Buffer access strategy + * targetcontext: Per-target-page memory context + */ + Relation rel; + bool exclusivelock; + BufferAccessStrategy checkstrategy; + MemoryContext targetcontext; + + /* + * Mutable state, for verification of particular page: + * + * target: Main target page + * targetblock: Main target block number of page + * + * target is the point of reference for a verification operation. + * + * Other B-Tree pages may be allocated, but those are always auxiliary + * (e.g. they are target's child pages). Conceptually, only the target + * page is checked. Each page found by verification's left/right, + * top/bottom scan becomes the target once. + * + * Memory is managed by resetting targecontext after verification of some + * target page finishes (possibly including target verification that + * depends on non-target page state). + */ + Page target; + BlockNumber targetblock; + XLogRecPtr targetlsn; + +} BtreeCheckState; + +/* + * Starting point for verifying an entire B-Tree index level + */ +typedef struct BtreeLevel +{ + /* Level number (0 is leaf page level). */ + uint32 level; + + /* Left most block on level. Scan of level begins here. */ + BlockNumber leftmost; + + /* Is this level reported as "true" root level by meta page? */ + bool istruerootlevel; +} BtreeLevel; + +PG_FUNCTION_INFO_V1(bt_index_check); +PG_FUNCTION_INFO_V1(bt_index_parent_check); + +static void btree_index_checkable(Relation rel); +static void bt_check_every_level(Relation rel, bool exclusivelock); +static BtreeLevel bt_check_level_from_leftmost(BtreeCheckState *state, + BtreeLevel level); +static void bt_target_page_check(BtreeCheckState *state); +static ScanKey bt_right_page_check_scankey(BtreeCheckState *state); +static void bt_downlink_check(BtreeCheckState *state, BlockNumber childblock, + ScanKey targetkey); +static bool invariant_key_less_than_equal_offset(BtreeCheckState *state, + ScanKey key, + OffsetNumber upperbound); +static bool invariant_key_greater_than_equal_offset(BtreeCheckState *state, + ScanKey key, + OffsetNumber lowerbound); +static bool invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state, + Page other, + ScanKey key, + OffsetNumber upperbound); +static Page palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum); + +/* + * bt_index_check(index regclass) + * + * Verify integrity of B-Tree index. + * + * Only acquires AccessShareLock on index relation. Does not consider + * invariants that exist between parent/child pages. + */ +Datum +bt_index_check(PG_FUNCTION_ARGS) +{ + Oid relid = PG_GETARG_OID(0); + Relation indrel; + + CHECK_SUPERUSER(); + + indrel = relation_open(relid, AccessShareLock); + + /* Relation suitable for checking as B-Tree? */ + btree_index_checkable(indrel); + + /* Check index */ + bt_check_every_level(indrel, false); + + relation_close(indrel, AccessShareLock); + + PG_RETURN_VOID(); +} + +/* + * bt_index_parent_check(index regclass) + * + * Verify integrity of B-Tree index. + * + * Acquires ExclusiveLock on index relation, and ShareLock on the associated + * heap relation, a bit like REINDEX. Verifies that downlinks in parent pages + * are valid lower bounds on child pages. + */ +Datum +bt_index_parent_check(PG_FUNCTION_ARGS) +{ + Oid indrelid = PG_GETARG_OID(0); + Oid heapid; + Relation indrel; + Relation heaprel; + + CHECK_SUPERUSER(); + + /* + * We must lock table before index to avoid deadlocks. However, if the + * passed indrelid isn't an index then IndexGetRelation() will fail. + * Rather than emitting a not-very-helpful error message, postpone + * complaining, expecting that the is-it-an-index test below will fail. + */ + heapid = IndexGetRelation(indrelid, true); + if (OidIsValid(heapid)) + heaprel = heap_open(heapid, ShareLock); + else + heaprel = NULL; + + /* + * Open the target index relations separately (like relation_openrv(), but + * with heap relation locked first to prevent deadlocking). In hot standby + * mode this will raise an error. + */ + indrel = index_open(indrelid, ExclusiveLock); + + /* Check for active uses of the index in the current transaction */ + CheckTableNotInUse(indrel, "bt_index_parent_check"); + + /* Relation suitable for checking as B-Tree? */ + btree_index_checkable(indrel); + + /* Check index */ + bt_check_every_level(indrel, true); + + index_close(indrel, ExclusiveLock); + if (heaprel) + heap_close(heaprel, ShareLock); + + PG_RETURN_VOID(); +} + +/* + * btree_index_checkable() + * + * Basic checks about the suitability of a relation for checking as a B-Tree + * index. + */ +static void +btree_index_checkable(Relation rel) +{ + if (rel->rd_rel->relkind != RELKIND_INDEX || + rel->rd_rel->relam != BTREE_AM_OID) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("only nbtree access method indexes are supported"), + errdetail("index \"%s\" does not using the nbtree access method.", + RelationGetRelationName(rel)))); + + if (RELATION_IS_OTHER_TEMP(rel)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot access temporary tables of other sessions"), + errdetail("index \"%s\" is associated with temporary relation.", + RelationGetRelationName(rel)))); + + if (!rel->rd_index->indisready) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot check index \"%s\"", + RelationGetRelationName(rel)), + errdetail("index is not yet ready for insertions"))); + + if (!rel->rd_index->indisvalid) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("cannot check index \"%s\"", + RelationGetRelationName(rel)), + errdetail("index is not valid"))); +} + +/* + * bt_check_every_level() + * + * Main entry point for B-Tree SQL-callable functions. Walks the B-Tree in + * logical order, verifying invariants as it goes. + * + * It is the caller's responsibility to acquire appropriate heavyweight lock on + * the index relation, and advise us if extra checks are safe when an + * ExclusiveLock is held. An ExclusiveLock is generally assumed to prevent any + * kind of physically modification to the index structure, including + * modifications that VACUUM may make. + */ +static void +bt_check_every_level(Relation rel, bool exclusivelock) +{ + BtreeCheckState *state; + Page metapage; + BTMetaPageData *metad; + uint32 previouslevel; + BtreeLevel current; + + /* + * RecentGlobalXmin assertion matches index_getnext_tid(). See note on + * RecentGlobalXmin/B-Tree page deletion. + */ + Assert(TransactionIdIsValid(RecentGlobalXmin)); + + /* + * Initialize state for entire verification operation + */ + state = palloc(sizeof(BtreeCheckState)); + state->rel = rel; + state->exclusivelock = exclusivelock; + state->checkstrategy = GetAccessStrategy(BAS_BULKREAD); + /* Create context for page */ + state->targetcontext = AllocSetContextCreate(CurrentMemoryContext, + "amcheck page data", + ALLOCSET_DEFAULT_MINSIZE, + ALLOCSET_DEFAULT_INITSIZE, + ALLOCSET_DEFAULT_MAXSIZE); + + /* Get true root block from meta-page */ + metapage = palloc_btree_page(state, BTREE_METAPAGE); + metad = BTPageGetMeta(metapage); + + /* + * Certain deletion patterns can result in "skinny" B-Tree indexes, where + * the fast root and true root differ. + * + * Start from the true root, not the fast root, unlike conventional index + * scans. This approach is more thorough, and removes the risk of + * following a stale fast root from the meta page. + */ + if (metad->btm_fastroot != metad->btm_root) + ereport(CONCERN, + (errcode(ERRCODE_DUPLICATE_OBJECT), + errmsg("fast root mismatch in index %s", + RelationGetRelationName(rel)), + errdetail_internal("Fast block %u (level %u) " + "differs from true root " + "block %u (level %u).", + metad->btm_fastroot, metad->btm_fastlevel, + metad->btm_root, metad->btm_level))); + + /* + * Starting at the root, verify every level. Move left to right, top to + * bottom. Note that there may be no pages other than the meta page (meta + * page can indicate that root is P_NONE when the index is totally empty). + */ + previouslevel = InvalidBtreeLevel; + current.level = metad->btm_level; + current.leftmost = metad->btm_root; + current.istruerootlevel = true; + while (current.leftmost != P_NONE) + { + /* + * Verify this level, and get left most page for next level down, if + * not at leaf level + */ + current = bt_check_level_from_leftmost(state, current); + + if (current.leftmost == InvalidBlockNumber) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" has no valid pages on level below %u or first level", + RelationGetRelationName(rel), previouslevel))); + + /* + * Even when concurrent page splits are possible, we should reliably + * end up on level immediately below that of last iteration. In + * general, it's impossible for the height of the tree to decrease. + */ + if (previouslevel != InvalidBtreeLevel && + previouslevel - 1 != current.level) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" is missing expected level %u", + RelationGetRelationName(rel), + previouslevel - 1))); + + previouslevel = current.level; + } + + if (metad->btm_root != P_NONE && current.level != 0) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" has internal pages but lacks leaf pages", + RelationGetRelationName(rel)))); + + /* Be tidy: */ + MemoryContextDelete(state->targetcontext); +} + +/* + * bt_check_level_from_leftmost() + * + * Given a left-most block at some level, move right, verifying each page + * individually (with more verification across pages for "exclusivelock" + * callers). Caller should pass the true root page as the leftmost initially, + * working their way down by passing what is returned for the last call here + * until level 0 (leaf page level) was reached. + * + * Returns state for next call, if any. This includes left-most block number + * one level lower that should be passed on next level/call, or P_NONE leaf + * level is checked. Level numbers follow the nbtree convention: higher levels + * have higher numbers, because new levels are added only due to a root page + * split. Note that prior to the first root page split, the root is also a + * leaf page. This means that there is always a level 0, and it's always the + * last level processed. + * + * Note on memory management: State's per-page context is reset here, between + * each call to bt_target_page_check(). + */ +static BtreeLevel +bt_check_level_from_leftmost(BtreeCheckState *state, BtreeLevel level) +{ + /* State to establish early, concerning entire level */ + BTPageOpaque opaque; + MemoryContext oldcontext; + BtreeLevel nextleveldown; + + /* Variables for iterating across level using right links */ + BlockNumber leftcurrent = P_NONE; + BlockNumber current = level.leftmost; + + /* Initialize return state */ + nextleveldown.leftmost = InvalidBlockNumber; + nextleveldown.level = InvalidBtreeLevel; + nextleveldown.istruerootlevel = false; + + /* Use page-level context for duration of this call */ + oldcontext = MemoryContextSwitchTo(state->targetcontext); + + do + { + /* Don't rely on CHECK_FOR_INTERRUPTS() calls at lower level */ + CHECK_FOR_INTERRUPTS(); + + /* Initialize state for this iteration */ + state->targetblock = current; + state->target = palloc_btree_page(state, state->targetblock); + state->targetlsn = PageGetLSN(state->target); + + opaque = (BTPageOpaque) PageGetSpecialPointer(state->target); + + if (P_IGNORE(opaque)) + { + if (P_RIGHTMOST(opaque)) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("block %u fell off the end of index \"%s\"", + current, RelationGetRelationName(state->rel)))); + else + ereport(CONCERN, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("block %u of index \"%s\" ignored", + current, RelationGetRelationName(state->rel)))); + goto nextpage; + } + else if (nextleveldown.leftmost == InvalidBlockNumber) + { + /* + * A concurrent page split could make the caller supplied leftmost + * block no longer contain the leftmost page, or no longer be the + * true root, but where that isn't possible due to heavyweight + * locking, check that the first valid page meets caller's + * expectations. + */ + if (state->exclusivelock) + { + if (!P_LEFTMOST(opaque)) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("block %u is not leftmost in index \"%s\"", + current, RelationGetRelationName(state->rel)))); + + if (level.istruerootlevel && !P_ISROOT(opaque)) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("block %u is not root in index \"%s\"", + current, RelationGetRelationName(state->rel)))); + } + + /* + * Before beginning any non-trivial examination of level, establish + * next level down's leftmost block number, which next call here + * will pass as its leftmost (iff this isn't leaf level). + */ + if (!P_ISLEAF(opaque)) + { + IndexTuple itup; + ItemId itemid; + + /* Internal page -- downlink gets leftmost on next level */ + itemid = PageGetItemId(state->target, P_FIRSTDATAKEY(opaque)); + itup = (IndexTuple) PageGetItem(state->target, itemid); + nextleveldown.leftmost = ItemPointerGetBlockNumber(&(itup->t_tid)); + } + else + { + /* + * Leaf page -- final level caller must process. + * + * Note that this could also be the root page, if there has + * been no root page split yet. + */ + nextleveldown.leftmost = P_NONE; + } + + /* + * Store level, now that non-deleted page was found (there must be + * at least one per level) + */ + nextleveldown.level = opaque->btpo.level; + + /* + * Finished setting up state for this call/level. Control will + * never end up back here in any future loop iteration for this + * level. + */ + } + + if (state->exclusivelock && leftcurrent != opaque->btpo_prev) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("left link/right link pair in index \"%s\" don't comport", + RelationGetRelationName(state->rel)), + errdetail_internal("Deleted block=%u left block=%u link reported block=%u.", + current, leftcurrent, opaque->btpo_prev))); + + /* Verify invariants for page -- all important checks occur here */ + bt_target_page_check(state); + +nextpage: + + /* Try to detect circular links */ + if (current == leftcurrent || current == opaque->btpo_prev) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("circular link chain found in block %u of index \"%s\"", + current, RelationGetRelationName(state->rel)))); + + leftcurrent = current; + current = opaque->btpo_next; + + /* Free page and associated memory for this iteration */ + MemoryContextReset(state->targetcontext); + } + while (current != P_NONE); + + /* Don't change context for caller */ + MemoryContextSwitchTo(oldcontext); + + return nextleveldown; +} + +/* + * bt_target_page_check() + * + * Function performs the following checks on target page, or pages ancillary to + * target page: + * + * - That every "real" data item is less than or equal to the high key, which + * is an upper bound on the items on the pages (where there is a high key at + * all -- pages that are rightmost lack one). + * + * - That within the page, every "real" item is less than or equal to the item + * immediately to its right, if any (i.e., that the items are in order within + * the page, so that the binary searches performed by index scans are sane). + * + * - That the last item stored on the page is less than or equal to the first + * "real" data item on the page to the right (if such a first item is + * available). + * + * Furthermore, when state passed shows ExclusiveLock held, function also + * checks: + * + * - That all child pages comport with downlinks (internal pages only). + * + * Note: This routine is not especially proactive in freeing memory. High + * watermark memory consumption is bound to some small fixed multiple of + * BLCKSZ, though. Caller should reset the current context between calls here. + */ +static void +bt_target_page_check(BtreeCheckState *state) +{ + OffsetNumber offset; + OffsetNumber max; + BTPageOpaque topaque; + + topaque = (BTPageOpaque) PageGetSpecialPointer(state->target); + max = PageGetMaxOffsetNumber(state->target); + + /* + * Loop over page items, but don't start from P_HIKEY (don't have iteration + * directly considering high key item, if any). That's something that is + * used as part of verifying all other items, but doesn't get its own + * iteration. + */ + for (offset = P_FIRSTDATAKEY(topaque); offset <= max; offset++) + { + ItemId itemid; + IndexTuple itup; + ScanKey skey; + + CHECK_FOR_INTERRUPTS(); + + /* Don't try to generate scankey using "minus infinity" garbage data */ + if (OFFSET_IS_MINUS_INFINITY(topaque, offset)) + continue; + + /* Build insertion scankey for current page offset */ + itemid = PageGetItemId(state->target, offset); + itup = (IndexTuple) PageGetItem(state->target, itemid); + skey = _bt_mkscankey(state->rel, itup); + + /* + * ******************** + * * High key check * + * ******************** + * + * If there is a high key, which there must be for a non-rightmost + * page, check that it actually is upper bound on page items. + */ + if (!P_RIGHTMOST(topaque) && + !invariant_key_less_than_equal_offset(state, skey, P_HIKEY)) + { + char *itid, *htid; + + itid = psprintf("(%u,%u)", state->targetblock, offset); + htid = psprintf("(%u,%u)", + ItemPointerGetBlockNumber(&(itup->t_tid)), + ItemPointerGetOffsetNumber(&(itup->t_tid))); + + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("high key invariant violated for index \"%s\"", + RelationGetRelationName(state->rel)), + errdetail_internal("Index tid=%s points to %s tid=%s " + "page lsn=%X/%X.", + itid, + P_ISLEAF(topaque) ? "heap":"index", + htid, + (uint32) (state->targetlsn >> 32), + (uint32) state->targetlsn))); + } + + /* + * ******************** + * * Page order check * + * ******************** + * + * Check that items are stored on page in logical order, by checking + * current item is less than or equal to next item (if any). + */ + if (OffsetNumberNext(offset) <= max && + !invariant_key_less_than_equal_offset(state, skey, + OffsetNumberNext(offset))) + { + char *itid, *htid, + *nitid, *nhtid; + + itid = psprintf("(%u,%u)", state->targetblock, offset); + htid = psprintf("(%u,%u)", + ItemPointerGetBlockNumber(&(itup->t_tid)), + ItemPointerGetOffsetNumber(&(itup->t_tid))); + nitid = psprintf("(%u,%u)", state->targetblock, + OffsetNumberNext(offset)); + + /* Reuse itup to get pointed-to heap location of second item */ + itemid = PageGetItemId(state->target, OffsetNumberNext(offset)); + itup = (IndexTuple) PageGetItem(state->target, itemid); + nhtid = psprintf("(%u,%u)", + ItemPointerGetBlockNumber(&(itup->t_tid)), + ItemPointerGetOffsetNumber(&(itup->t_tid))); + + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("page order invariant violated for index \"%s\"", + RelationGetRelationName(state->rel)), + errdetail_internal("Lower index tid=%s (points to %s tid=%s) " + "higher index tid=%s (points to %s tid=%s) " + "page lsn=%X/%X.", + itid, + P_ISLEAF(topaque) ? "heap":"index", + htid, + nitid, + P_ISLEAF(topaque) ? "heap":"index", + nhtid, + (uint32) (state->targetlsn >> 32), + (uint32) state->targetlsn))); + } + /* + * ******************** + * * Last item check * + * ******************** + * + * Check last item against next page when last item on page is reached. + * + * The general idea here is that checking the ordering of items on the + * page should still perform some check on the last item on the page, + * if at all possible. In other words, this is roughly the same + * process as the page order check that has already been performed for + * every other "real" item on target page by now; we just need to reach + * into the next page to get a scankey to compare against lower bound + * of max. + */ + else if (offset == max) + { + ScanKey rightkey; + + /* Get first item in next/right page */ + rightkey = bt_right_page_check_scankey(state); + + /* + * Why this is correct for !exclusivelock callers: + * + * Concurrent page splits are not a problem for ordinary index + * scans, since the key space always moves in a way that lets index + * scans not miss things: they might have to move right, but they + * never have to move left (leaving aside index scans backwards, a + * special case). A concurrent page split could occur here, but + * just as with index scans we're following the stale right link, + * which will reliably get us further along in the key space, which + * is all we really need to get an item further along in key space + * to check invariant in target page. + * + * (Note that routines like _bt_search() don't require *any* page + * split interlock when descending the tree, including something + * very light like a buffer pin. That's why it's okay that we + * don't either.) + * + * That just leaves concurrent page deletion. It's okay if we + * follow a rightlink and find a half-dead or dead page. Either + * way, there must be a sane further right link to follow for these + * ignorable pages, because page deletion refuses to merge the key + * space between adjacent pages that do not share a common parent + * (that is, merging of the key space has to be among true sibling + * pages, never cousin pages). We should succeed in finding a page + * to the right that isn't ignorable before too long. And so, here + * too we can always get further in the key space by moving right + * one or more times. + * + * A deleted page won't actually be recycled by VACUUM early enough + * for us to fail to be able follow its right link, because it + * doesn't do so until it knows that no possible index scan could + * land on the page with the expectation of at least being able to + * move right and eventually find a non-ignorable page. (see page + * recycling/RecentGlobalXmin notes in nbtree README.) + */ + if (rightkey && + !invariant_key_greater_than_equal_offset(state, rightkey, max)) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("cross page order invariant violated for index \"%s\"", + RelationGetRelationName(state->rel)), + errdetail_internal("Last item on page tid=(%u,%u) " + "right page block=%u " + "page lsn=%X/%X.", + state->targetblock, offset, + topaque->btpo_next, + (uint32) (state->targetlsn >> 32), + (uint32) state->targetlsn))); + } + + /* + * ******************** + * * Downlink check * + * ******************** + * + * Additional check of child items against target page (their parent), + * iff this is internal page and holds ExclusiveLock on index relation. + */ + if (!P_ISLEAF(topaque) && state->exclusivelock) + { + BlockNumber childblock = ItemPointerGetBlockNumber(&(itup->t_tid)); + + bt_downlink_check(state, childblock, skey); + } + } +} + +/* + * bt_right_page_check_scankey() + * + * Return a scankey for the first real data item on page to right of current + * target (or the first non-ignorable page), sufficient to check ordering + * invariant on last item in current target page. Returned scankey relies on + * local memory allocated for the child page, which caller cannot pfree(). + * Caller's memory context should be reset between calls here. + * + * Note that we always get the first real data item -- not the high key, and + * not the undefined first "real" item that internal pages must have (the + * "minus infinity" item). + * + * Iff there is no sane first real data item, return NULL. Typically, this + * happens because the target page is rightmost and there simply is no page to + * find the first item on, but it's also possible because leaf pages may have + * no "real" items after VACUUM is run, or because the only item available in a + * "minus infinity" item. + */ +static ScanKey +bt_right_page_check_scankey(BtreeCheckState *state) +{ + BTPageOpaque opaque; + ItemId firstsane; + OffsetNumber nline; + BlockNumber targetnext; + Page rightpage; + + /* Determine target's next block number */ + opaque = (BTPageOpaque) PageGetSpecialPointer(state->target); + + /* No right-link; target it already at end of key space */ + if (P_RIGHTMOST(opaque)) + return NULL; + + targetnext = opaque->btpo_next; + for (;;) + { + CHECK_FOR_INTERRUPTS(); + + rightpage = palloc_btree_page(state, targetnext); + opaque = (BTPageOpaque) PageGetSpecialPointer(rightpage); + + if (!P_IGNORE(opaque)) + break; + + /* + * Right-most page should not be ignorable: + */ + if (P_RIGHTMOST(opaque)) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("fell off the end of index \"%s\"", + RelationGetRelationName(state->rel)))); + + /* + * We landed on a deleted page, so step right to find a live page + * (there should be one; if not, the "fell off index" check would have + * just failed) + */ + targetnext = opaque->btpo_next; + ereport(CONCERN, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("level %u leftmost page of index \"%s\" was found deleted or half dead", + opaque->btpo.level, RelationGetRelationName(state->rel)), + errdetail_internal("Deleted page found when getting scankey of first item to right."))); + + pfree(rightpage); + } + + nline = PageGetMaxOffsetNumber(rightpage); + + if (P_ISLEAF(opaque) && nline >= P_FIRSTDATAKEY(opaque)) + { + /* For leaf page, simply return first data item (if any) */ + firstsane = PageGetItemId(rightpage, P_FIRSTDATAKEY(opaque)); + } + else if (!P_ISLEAF(opaque) && + nline >= OffsetNumberNext(P_FIRSTDATAKEY(opaque))) + { + /* + * Internal page must have at least one downlink (real item), but that + * could just be a "minus infinity" item, which would be useless to + * caller. Return first item after the internal page's undefined + * "minus infinity" item, if any. + */ + firstsane = PageGetItemId(rightpage, + OffsetNumberNext(P_FIRSTDATAKEY(opaque))); + } + else + { + /* + * No first item. Page must be empty leaf page, or internal page with + * only a "minus infinity" item. + * + * XXX: Maybe we should return a scankey with the high key, if any, + * here, which would suit the caller's purposes. It's simpler for this + * function to have one clear purpose, though, and it's far from clear + * that not doing so will ever actually matter. + */ + ereport(CONCERN, + (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE), + errmsg("%s block %u of index \"%s\" has no first data item", + P_ISLEAF(opaque) ? "leaf":"internal", targetnext, + RelationGetRelationName(state->rel)))); + return NULL; + } + + /* + * Return scankey of first real item. Note that this relies on right page + * memory remaining allocated. + */ + return _bt_mkscankey(state->rel, + (IndexTuple) PageGetItem(rightpage, firstsane)); +} + +/* + * bt_downlink_check() + * + * Checks one of target's downlink against its child page. + * + * Conceptually, the target page continues to be what is checked here. The + * target block is still blamed in the event of finding an invariant violation. + * The downlink insertion into the target is probably where any problem raised + * here arises, and their is no such thing as a parent link, so doing the + * verification this way around is much more practical. + */ +static void +bt_downlink_check(BtreeCheckState *state, BlockNumber childblock, + ScanKey targetkey) +{ + OffsetNumber offset; + OffsetNumber maxoffset; + Page child; + BTPageOpaque copaque; + + Assert(state->exclusivelock); + + /* + * Verify child page has the down-link key from target page (its parent) as + * a lower bound. + * + * This check would probably be almost as effective if only the first item + * was checked, since the ordering of all items will be checked when we + * descend to the next level anyway. This behavior seems simpler, though, + * and it might well make an appreciable difference to examine everything + * when the operator class has subtle bugs. + */ + child = palloc_btree_page(state, childblock); + copaque = (BTPageOpaque) PageGetSpecialPointer(child); + maxoffset = PageGetMaxOffsetNumber(child); + + for (offset = P_FIRSTDATAKEY(copaque); offset <= maxoffset; offset++) + { + /* + * Skip comparison of target page key against "minus infinity" item, if + * any. Checking it would indicate that its not an upper bound, but + * that's only because of the hard-coding within _bt_compare(). + */ + if (OFFSET_IS_MINUS_INFINITY(copaque, offset)) + continue; + + if (!invariant_key_less_than_equal_nontarget_offset(state, child, + targetkey, offset)) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("down-link lower bound invariant violated for index \"%s\"", + RelationGetRelationName(state->rel)), + errdetail_internal("Parent block=%u child index tid=(%u,%u) " + "parent page lsn=%X/%X.", + state->targetblock, childblock, offset, + (uint32) (state->targetlsn >> 32), + (uint32) state->targetlsn))); + } + + pfree(child); +} + +/* + * invariant_key_less_than_equal_offset() + * + * Does the invariant hold that the key is less than or equal to a given upper + * bound offset item? + * + * If this function returns false, convention is that caller throws error due + * to corruption. + */ +static bool +invariant_key_less_than_equal_offset(BtreeCheckState *state, ScanKey key, + OffsetNumber upperbound) +{ + int16 natts = state->rel->rd_rel->relnatts; + int32 cmp; + + cmp = _bt_compare(state->rel, natts, key, state->target, upperbound); + + return cmp <= 0; +} + +/* + * invariant_key_less_than_equal_offset() + * + * Does the invariant hold that the key is greater than or equal to a given + * lower bound offset item? + * + * If this function returns false, convention is that caller throws error due + * to corruption. + */ +static bool +invariant_key_greater_than_equal_offset(BtreeCheckState *state, ScanKey key, + OffsetNumber lowerbound) +{ + int16 natts = state->rel->rd_rel->relnatts; + int32 cmp; + + cmp = _bt_compare(state->rel, natts, key, state->target, lowerbound); + + return cmp >= 0; +} + +/* + * invariant_key_less_than_equal_nontarget_offset() + * + * Does the invariant hold that the key is less than or equal to a given upper + * bound offset item, with the offset relating to a caller-supplied page that + * is not the current target page? Caller's non-target page is typically a + * child page of the target, checked as part of checking a property of the + * target page (i.e. the key comes from the target). + * + * If this function returns false, convention is that caller throws error due + * to corruption. + */ +static bool +invariant_key_less_than_equal_nontarget_offset(BtreeCheckState *state, + Page nontarget, ScanKey key, + OffsetNumber upperbound) +{ + int16 natts = state->rel->rd_rel->relnatts; + int32 cmp; + + cmp = _bt_compare(state->rel, natts, key, nontarget, upperbound); + + return cmp <= 0; +} + +/* + * palloc_btree_page() + * + * Given a block number of a B-Tree page, return page in palloc()'d memory. + * While at it, perform some basic checks of the page. + * + * There is never an attempt to get a consistent view of multiple pages using + * multiple concurrent buffer locks; in general, we prefer to have only one pin + * and buffer lock at a time, which is often all that the nbtree code requires. + * + * Operating on a copy of the page is useful because it prevents control + * getting stuck in an uninterruptible state when an underlying operator class + * misbehaves. + */ +static Page +palloc_btree_page(BtreeCheckState *state, BlockNumber blocknum) +{ + Buffer buffer; + Page page; + BTPageOpaque opaque; + + page = palloc(BLCKSZ); + + /* + * We copy the page into local storage to avoid holding pin on the buffer + * longer than we must. + */ + buffer = ReadBufferExtended(state->rel, MAIN_FORKNUM, blocknum, RBM_NORMAL, + state->checkstrategy); + LockBuffer(buffer, BT_READ); + + /* + * Perform the same basic sanity checking that nbtree itself performs for + * every page: + */ + _bt_checkpage(state-> rel, buffer); + + /* Only use copy of page in palloc()'d memory */ + memcpy(page, BufferGetPage(buffer), BLCKSZ); + UnlockReleaseBuffer(buffer); + + opaque = (BTPageOpaque) PageGetSpecialPointer(page); + + if (opaque->btpo_flags & BTP_META && blocknum != BTREE_METAPAGE) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("invalid meta page found at block %u in index \"%s\"", + blocknum, RelationGetRelationName(state->rel)))); + + /* Check page from block that ought to be meta page */ + if (blocknum == BTREE_METAPAGE) + { + BTMetaPageData *metad = BTPageGetMeta(page); + + if (!(opaque->btpo_flags & BTP_META) || + metad->btm_magic != BTREE_MAGIC) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("index \"%s\" meta page is corrupt", + RelationGetRelationName(state->rel)))); + + if (metad->btm_version != BTREE_VERSION) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("version mismatch in index \"%s\": file version %d, code version %d", + RelationGetRelationName(state->rel), + metad->btm_version, BTREE_VERSION))); + } + + /* + * Deleted pages have no sane "level" field, so can only check non-deleted + * page level + */ + if (P_ISLEAF(opaque) && !P_ISDELETED(opaque) && opaque->btpo.level != 0) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("invalid leaf page level %u for block %u in index \"%s\"", + opaque->btpo.level, blocknum, RelationGetRelationName(state->rel)))); + + if (blocknum != BTREE_METAPAGE && !P_ISLEAF(opaque) && + !P_ISDELETED(opaque) && opaque->btpo.level == 0) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("invalid internal page level 0 for block %u in index \"%s\"", + opaque->btpo.level, RelationGetRelationName(state->rel)))); + + if (!P_ISLEAF(opaque) && P_HAS_GARBAGE(opaque)) + ereport(CORRUPTION, + (errcode(ERRCODE_INDEX_CORRUPTED), + errmsg("internal page block %u in index \"%s\" has garbage items", + blocknum, RelationGetRelationName(state->rel)))); + + return page; +} diff --git a/contrib/amcheck/amcheck.control b/contrib/amcheck/amcheck.control new file mode 100644 index 0000000..180f150 --- /dev/null +++ b/contrib/amcheck/amcheck.control @@ -0,0 +1,5 @@ +# amcheck extension +comment = 'verify access method invariants' +default_version = '1.0' +module_pathname = '$libdir/amcheck' +relocatable = true diff --git a/doc/src/sgml/amcheck.sgml b/doc/src/sgml/amcheck.sgml new file mode 100644 index 0000000..b042e58 --- /dev/null +++ b/doc/src/sgml/amcheck.sgml @@ -0,0 +1,244 @@ +<!-- doc/src/sgml/amcheck.sgml --> + +<sect1 id="amcheck" xreflabel="amcheck"> + <title>amcheck</title> + + <indexterm zone="amcheck"> + <primary>amcheck</primary> + </indexterm> + + <para> + The <filename>amcheck</> module provides functions that allow you to + verify the logical consistency of the structure of indexes. If the + structure appears to be valid, no error is raised. + </para> + + <para> + The functions verify various <emphasis>invariants</> in the + structure of the representation of particular indexes. The + correctness of the access method functions behind index scans and + other important operations is predicated on these invariants always + holding. For example, certain functions verify, among other things, + that all B-Tree pages have items in <quote>logical</> order; if that + particular invariant somehow fails to hold, we can expect binary + searches on any affected page to produce wrong answers. As a + consequence of that, index scans may subtly produce wrong answers. + </para> + + <para> + Verification is performed using the same procedures as those used by + index scans themselves, which may be user-defined operator class + code. For example, B-Tree index verification relies on comparisons + made with one or more B-Tree support function 1 routines, much like + B-Tree index scans rely on the routines to guide the scan to a point + in the underlying table; see <xref linkend="xindex-support"> for + details of operator class support functions. + </para> + + <para> + <filename>amcheck</> functions may be used only by superusers. + </para> + + <sect2> + <title>Functions</title> + + <variablelist> + <varlistentry> + <term> + <function>bt_index_check(index regclass) returns void</function> + <indexterm> + <primary>bt_index_check</primary> + </indexterm> + </term> + + <listitem> + <para> + <function>bt_index_check</function> tests that its target, a + B-Tree index, respects a variety of invariants. Example usage: +<screen> +test=# SELECT bt_index_check(c.oid), c.relname, c.relpages +FROM pg_index i +JOIN pg_opclass op ON i.indclass[0] = op.oid +JOIN pg_am am ON op.opcmethod = am.oid +JOIN pg_class c ON i.indexrelid = c.oid +JOIN pg_namespace n ON c.relnamespace = n.oid +WHERE am.amname = 'btree' AND n.nspname = 'pg_catalog' +-- Don't check pg_class (bt_index_parent_check() requires this): +AND c.relname NOT LIKE 'pg_class%' +-- Function may throw an error when this is omitted: +AND i.indisready AND i.indisvalid +ORDER BY c.relpages DESC LIMIT 10; + bt_index_check | relname | relpages +----------------+---------------------------------+---------- + | pg_depend_reference_index | 43 + | pg_depend_depender_index | 40 + | pg_proc_proname_args_nsp_index | 31 + | pg_description_o_c_o_index | 21 + | pg_attribute_relid_attnam_index | 14 + | pg_proc_oid_index | 10 + | pg_attribute_relid_attnum_index | 9 + | pg_amproc_fam_proc_index | 5 + | pg_amop_opr_fam_index | 5 + | pg_amop_fam_strat_index | 5 +(10 rows) +</screen> + This example shows a session that performs verification of every + catalog index in the database <quote>test</> (except those + associated with the <structname>pg_class</structname> catalog). + Details of just the 10 largest indexes verified are displayed. + Naturally, this query could easily be changed to call + <function>bt_index_check</function> for every index in the + database where verification is supported. + </para> + <para> + An <literal>AccessShareLock</> is acquired on the target index + by <function>bt_index_check</function>. This lock mode is the + same lock mode acquired on relations by simple + <literal>SELECT</> statements. + <function>bt_index_check</function> does not verify invariants + that span child/parent relationships, nor does it verify that + the target index is consistent with its heap relation. When a + routine, lightweight test for corruption is required in a live + production environment, using + <function>bt_index_check</function> often provides the best + trade-off between thoroughness of verification and limiting the + impact on application performance and availability. + </para> + </listitem> + </varlistentry> + + <varlistentry> + <term> + <function>bt_index_parent_check(index regclass) returns void</function> + <indexterm> + <primary>bt_index_parent_check</primary> + </indexterm> + </term> + + <listitem> + <para> + <function>bt_index_parent_check</function> tests that its + target, a B-Tree index, respects a variety of invariants. The + checks performed by <function>bt_index_parent_check</function> + are a superset of the checks performed by + <function>bt_index_check</function>. + <function>bt_index_parent_check</function> can be thought of as + a more thorough variant of <function>bt_index_check</function>: + unlike <function>bt_index_check</function>, + <function>bt_index_parent_check</function> also checks + invariants that span parent/child relationships. However, it + does not verify that the target index is consistent with its + heap relation. + </para> + <para> + An <literal>ExclusiveLock</> is required on the target index by + <function>bt_index_parent_check</function> (a + <literal>ShareLock</> is also acquired on the heap relation). + These locks prevent concurrent data modification from + <command>INSERT</>, <command>UPDATE</>, and <command>DELETE</> + commands. The locks also prevent the underlying relation from + being concurrently processed by <command>VACUUM</> and certain + other utility commands. Note that the function holds locks for + as short a duration as possible, so there is no advantage to + verifying each index individually in a series of transactions. + </para> + <para> + <function>bt_index_parent_check</function>'s additional + verification is more likely to detect various pathological + cases. These cases may involve an incorrectly implemented + B-Tree operator class used by the index that is checked, or, + hypothetically, undiscovered bugs in the underlying B-Tree index + access method code. Note that + <function>bt_index_parent_check</function> will raise an error + when called while Hot Standby mode is enabled, unlike + <function>bt_index_check</function>. + </para> + </listitem> + </varlistentry> + </variablelist> + </sect2> + + <sect2> + <title>Using <filename>amcheck</> effectively</title> + + <para> + <filename>amcheck</> can be effective at detecting various types of + failure modes that <link + linkend="app-initdb-data-checksums"><application>data page + checksums</></link> will always fail to catch. These include: + + <itemizedlist> + <listitem> + <para> + Filesystem or storage subsystem faults where checksums happen to + simply not be enabled. Note that <filename>amcheck</> examines a + page as represented in some shared memory buffer at the time of + verification if there is only a shared buffer hit when accessing + the block. Consequently, <filename>amcheck</> does not + necessarily examine data read from the filesystem at the time of + verification. Note that when checksums are enabled, + <filename>amcheck</> may raise an error due to a checksum failure + when a corrupt block is read into a buffer. + </para> + </listitem> + <listitem> + <para> + Corruption caused by faulty RAM, and the broader memory subsystem + and operating system. <productname>PostgreSQL</> does not + protect against correctable memory errors and it is assumed you + will operate using RAM that uses industry standard Error + Correcting Codes (ECC) or better protection. However, ECC memory + is typically only immune to single-bit errors, and should not be + assumed to provide <emphasis>absolute</emphasis> protection + against failures that result in memory corruption. + </para> + </listitem> + <listitem> + <para> + Structural inconsistencies caused by incorrect operator class + implementations, including issues due to the comparison rules of + operating system collations changing the ordering implied by + comparisons of an underlying collatable type, such as + <type>text</>. Though rare, updates to operating system + collation rules can cause these issues. More commonly, an + inconsistency in the collation order between a master server and + a standby server is implicated, possibly because the + <emphasis>major</> operating system version in use is + inconsistent. Such inconsistencies will generally only arise on + standby servers, and so can generally only be detected on standby + servers. If a problem like this arises, it may not affect each + individual index that is ordered using an affected collation, + simply because <emphasis>indexed</> values might happen to have + the same absolute ordering regardless of the behavioral + inconsistency. See <xref linkend="locale"> and <xref + linkend="collation"> for further details about how + <productname>PostgreSQL</> uses operating system locales and + collations. + </para> + </listitem> + <listitem> + <para> + Corruption caused by hypothetical undiscovered bugs in the + underlying <productname>PostgreSQL</> access method code. + Automatic verification of the structural integrity of indexes + plays a role in the general testing of new or proposed + <productname>PostgreSQL</> features that could plausibly allow a + logical inconsistency to be introduced. One obvious testing + strategy is to call <filename>amcheck</> functions continuously + when running the standard regression tests. See <xref + linkend="regress-run"> for details on running the tests. + </para> + </listitem> + </itemizedlist> + + No error concerning corruption raised by <filename>amcheck</> should + ever be a false positive. Users are strongly advised to perform a + <command>REINDEX </> for any index where corruption is indicated, + although it is also important to identify and correct the underlying + problem. In general, <filename>amcheck</> can only prove the + presence of corruption; it cannot prove its absence. + </para> + + </sect2> + +</sect1> diff --git a/doc/src/sgml/contrib.sgml b/doc/src/sgml/contrib.sgml index 1b3d2d9..164f594 100644 --- a/doc/src/sgml/contrib.sgml +++ b/doc/src/sgml/contrib.sgml @@ -103,6 +103,7 @@ CREATE EXTENSION <replaceable>module_name</> FROM unpackaged; </para> &adminpack; + &amcheck; &auth-delay; &auto-explain; &btree-gin; diff --git a/doc/src/sgml/filelist.sgml b/doc/src/sgml/filelist.sgml index a12fee7..48476bd 100644 --- a/doc/src/sgml/filelist.sgml +++ b/doc/src/sgml/filelist.sgml @@ -104,6 +104,7 @@ <!-- contrib information --> <!ENTITY contrib SYSTEM "contrib.sgml"> <!ENTITY adminpack SYSTEM "adminpack.sgml"> +<!ENTITY amcheck SYSTEM "amcheck.sgml"> <!ENTITY auth-delay SYSTEM "auth-delay.sgml"> <!ENTITY auto-explain SYSTEM "auto-explain.sgml"> <!ENTITY btree-gin SYSTEM "btree-gin.sgml"> -- 1.9.1
-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers