Re: [HACKERS] 9.4 btree index corruption

Heikki Linnakangas Sun, 25 May 2014 07:18:06 -0700

On 05/21/2014 10:22 PM, Jeff Janes wrote:

Testing partial-write crash-recovery in 9.4 (e12d7320ca494fd05134847e30)
with foreign keys, I found some btree index corruption.


28807 VACUUM 2014-05-21 15:33:46.878 PDT:ERROR:  right sibling 4044 of
block 460 is not next child 23513 of block 1264 in index "foo_p_id_idx"
28807 VACUUM 2014-05-21 15:33:46.878 PDT:STATEMENT:  VACUUM;

It took ~8 hours on 8 cores to encounter this problem.  This is a single
occurrence, it has not yet been reproduced.

I don't know that the partial-writes, or the crash recovery, or the foreign
key, parts of this test are important--it could be a more generic problem
that only happened to be observed here.  Nor do I know yet if it occurs in
9_3_STABLE.

Below is the testing harness and the data directory (massively bloated at
3.7GB once uncompressed).  It is currently in wrap-around shutdown, but
that is the effect of persistent vacuum failures, not the cause of them.
  You can restart the data directory and it will repeat the above sibling
error once autovac kicks in.  I don't know if the bloat is due to the
vacuum failure or if it was already in process before the failures started.
  I've cranked up the logging on that front future efforts.

I'm using some fast-foward code on the xid consumption so that freezing
occurs more often, and some people have expressed reservations that the
code might be imperfect, and I can't rule that out as the cause (but I've
never traced any other problems back to that code).  But it did make it
through 4 complete wraps before this problem was encountered, so if that is
the problem it must be probabilistic rather than deterministic.

https://drive.google.com/folderview?id=0Bzqrh1SO9FcENWd6ZXlwVWpxU0E&usp=sharing

I downloaded the data directory and investigated. I got this messagewhen I started it up:

20392 2014-05-25 05:51:37.835 PDT:ERROR: right sibling 4044 of block460 is not next child 23513 of block 86458 in index "foo_p_id_idx"20392 2014-05-25 05:51:37.835 PDT:CONTEXT: automatic vacuum of table"jjanes.public.foo"

Interestingly, it's complaining about parent page 86458, while yoursclaimed it was 1264. I don't know why; perhaps a huge number ofinsertions happened after that error, causing the parent level pages tobe split, moving the downlinks it complains about to the right. Did youcontinue running the test after that error occurred?


This is what the tree looks like around those pages:

Level 1:
+-------------+     +-------------+     +-------------+
| Blk 1264    |     | Blk 160180  |     | Blk 86458   |
|             |     |             |     |             |
| Downlinks:  | <-> | Downlinks:  | <-> | Downlinks:  |
| ...         |     | ...         |     |     1269    |
|             |     |             |     |      460    |
|             |     |             |     |    23513    |
+-------------+     +-------------+     +-------------+


Leaf level:
+-------------+     +---------+     +-----------+     +-------------+
| Blk 1269    |     | Blk 460 |     | Blk 4044  |     | Blk 23513   |
|             | <-> |         | <-> | HALF_DEAD | <-> |             |
| (314 items) |     | (empty) |     |           |     | (212 items) |
+-------------+     +---------+     +-----------+     +-------------+

Leaf block 4044 is marked as half-dead, and there is no downlink to iton the parent block, 86458. That is normal, when a vacuum is interruptedduring page deletion (in 9.4). The next vacuum ought to continue it.However, because block 460 is empty, the next vacuum that comes alongwill try to start the page deletion process on that block. Page deletiondoes a cross-check at the parent level, checking that the next downlinkon the parent page points to the right sibling of the page we're aboutto delete. However, because the right sibling is already marked ashalf-dead, and the downlink pointing to it has already been removed, thecross-check fails.

So, the sanity check is a bit overzealous. It's normal that the rightsibling has no downlink, if it's marked as half-dead. The trivial fix isto just remove the cross-check, but I'd rather not do that becausegenerally-speaking more sanity checks are good.

I think we could fetch the right sibling page to see if it's marked ashalf-dead, and if it is, just give up on deleting the current page. Thisvacuum will eventually get to the right sibling page and finish thedeletion, and the next vacuum will then be able to delete the emptypage. Or we could finish the deletion of the right page immediately, butI'd like to keep the rarely-occurring corner-case codepaths to the minimum.

With the attached patch, I was able to get past that error, but whenVACUUM reaches the end, I got this:


jjanes=# vacuum foo;

ERROR: database is not accepting commands to avoid wraparound data lossin database "jjanes"

HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

That seems like an unrelated issue, I'll start a new thread about that.

Many thanks for the testing, Jeff!

- Heikki

diff --git a/src/backend/access/nbtree/nbtpage.c b/src/backend/access/nbtree/nbtpage.c
index d357b33..fd7e9b5 100644
--- a/src/backend/access/nbtree/nbtpage.c
+++ b/src/backend/access/nbtree/nbtpage.c
@@ -954,18 +954,36 @@ _bt_delitems_delete(Relation rel, Buffer buf,
 	END_CRIT_SECTION();
 }
 
+
+static bool
+_bt_is_page_halfdead(Relation rel, BlockNumber blk)
+{
+	Buffer		buf;
+	Page		page;
+	BTPageOpaque opaque;
+	bool		result;
+
+	buf = _bt_getbuf(rel, blk, BT_READ);
+	page = BufferGetPage(buf);
+	opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+	result = P_ISHALFDEAD(opaque);
+	_bt_relbuf(rel, buf);
+
+	return result;
+}
+
 /*
  * Subroutine to find the parent of the branch we're deleting.  This climbs
  * up the tree until it finds a page with more than one child, i.e. a page
  * that will not be totally emptied by the deletion.  The chain of pages below
- * it, with one downlink each, will be part of the branch that we need to
- * delete.
+ * it, with one downlink each, will form the branch that we need to delete.
  *
  * If we cannot remove the downlink from the parent, because it's the
  * rightmost entry, returns false.  On success, *topparent and *topoff are set
  * to the buffer holding the parent, and the offset of the downlink in it.
  * *topparent is write-locked, the caller is responsible for releasing it when
- * done.  *target is set to the topmost page in the branch to-be-deleted, ie.
+ * done.  *target is set to the topmost page in the branch to-be-deleted, i.e.
  * the page whose downlink *topparent / *topoff point to, and *rightsib to its
  * right sibling.
  *
@@ -994,7 +1012,7 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 	BTPageOpaque opaque;
 	BlockNumber leftsib;
 
-	/* Locate the parent's downlink (updating the stack entry if needed) */
+	/* Locate the downlink in the parent (updating the stack entry if needed) */
 	ItemPointerSet(&(stack->bts_btentry.t_tid), child, P_HIKEY);
 	pbuf = _bt_getstackbuf(rel, stack, BT_WRITE);
 	if (pbuf == InvalidBuffer)
@@ -1066,6 +1084,13 @@ _bt_lock_branch_parent(Relation rel, BlockNumber child, BTStack stack,
 				_bt_relbuf(rel, lbuf);
 			}
 
+			if (_bt_is_page_halfdead(rel, *rightsib))
+			{
+				elog(LOG, "could not delete page %u because its right sibling %u is half-dead",
+					 parent, *rightsib);
+				return false;
+			}
+
 			return _bt_lock_branch_parent(rel, parent, stack->bts_parent,
 										topparent, topoff, target, rightsib);
 		}
@@ -1234,6 +1259,13 @@ _bt_pagedel(Relation rel, Buffer buf)
 					lbuf = _bt_getbuf(rel, leftsib, BT_READ);
 					lpage = BufferGetPage(lbuf);
 					lopaque = (BTPageOpaque) PageGetSpecialPointer(lpage);
+					/*
+					 * If the left sibling is split again by another backend,
+					 * after we released the lock, we know that the first
+					 * split must be finished, because we don't allow an
+					 * incompletely-split page to be split again. So we don't
+					 * need to walk right here.
+					 */
 					if (lopaque->btpo_next == BufferGetBlockNumber(buf) &&
 						P_INCOMPLETE_SPLIT(lopaque))
 					{
@@ -1338,6 +1370,13 @@ _bt_mark_page_halfdead(Relation rel, Buffer leafbuf, BTStack stack)
 	leafblkno = BufferGetBlockNumber(leafbuf);
 	leafrightsib = opaque->btpo_next;
 
+	if (_bt_is_page_halfdead(rel, leafrightsib))
+	{
+		elog(LOG, "could not delete page %u because its right sibling %u is half-dead",
+			 leafblkno, leafrightsib);
+		return false;
+	}
+
 	/*
 	 * We cannot delete a page that is the rightmost child of its immediate
 	 * parent, unless it is the only child --- in which case the parent has to

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] 9.4 btree index corruption

Reply via email to