Hi,

On 2019-01-29 12:23:51 -0800, Andres Freund wrote:
> On 2019-01-29 11:25:41 -0800, Andres Freund wrote:
> > On 2019-01-28 22:37:53 -0500, Tom Lane wrote:
> > > Andres Freund <and...@anarazel.de> writes:
> > > > I did that now. I couldn't reproduce it locally, despite a lot of
> > > > runs. Looking at the buildfarm it looks like the failures were,
> > > > excluding handfish which failed without recognizable symptoms before and
> > > > after, on BSD-derived platforms (netbsd, freebsd, OSX), which certainly
> > > > is interesting.
> > > 
> > > Isn't it now.  Something about the BSD scheduler perhaps?  But we've
> > > got four or five different BSD-ish platforms that reported failures,
> > > and it's hard to believe they've all got identical schedulers.
> > > 
> > > That second handfish failure does match the symptoms elsewhere:
> > > 
> > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=handfish&dt=2019-01-29%2000%3A20%3A22
> > > 
> > > --- /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/expected/thread-thread.stderr	2018-10-30 20:11:45.551967381 -0300
> > > +++ /home/filiperosset/dev/client-code-REL_8/HEAD/pgsql.build/src/interfaces/ecpg/test/results/thread-thread.stderr	2019-01-28 22:38:20.614211568 -0200
> > > @@ -0,0 +1,20 @@
> > > +SQL error: page 0 of relation "test_thread" should be empty but is not on line 125
> > > 
> > > so it's not quite 100% BSD, but certainly the failure rate on BSD is
> > > way higher than elsewhere.  Puzzling.
> > 
> > Interesting.
> > 
> > While chatting with Robert about this issue I came across the following
> > section of code:
> > 
> >             /*
> >              * If the FSM knows nothing of the rel, try the last page
> >              * before we give up and extend.  This avoids
> >              * one-tuple-per-page syndrome during bootstrapping or in a
> >              * recently-started system.
> >              */
> >             if (targetBlock == InvalidBlockNumber)
> >             {
> >                     BlockNumber nblocks = RelationGetNumberOfBlocks(relation);
> > 
> >                     if (nblocks > 0)
> >                             targetBlock = nblocks - 1;
> >             }
> > 
> > 
> > I think that explains the issue (albeit not why it is much more frequent
> > on BSDs).  Because we're not going through the FSM, it's perfectly
> > possible to find a page that is uninitialized, *and* is not yet in the
> > FSM. The only reason this wasn't previously actively broken, I think, is
> > that while we previously *also* looked at that page (before the extending
> > backend acquired a lock!), when looking at the page,
> > PageGetHeapFreeSpace(), via PageGetFreeSpace(), decides there's no free
> > space, because it just interprets the zeroes in pd_upper - pd_lower as no
> > free space.
> > 
> > Hm, thinking about what a good solution here could be.
> 
> I wonder if we should just expand the logic we have for
> RBM_ZERO_AND_LOCK so it can be used in hio.c (we probably could
> just use it without any changes, but the name seems a bit confusing) -
> because that'd prevent the current weirdness that it's possible that the
> buffer can be locked by somebody between the ReadBufferBI(P_NEW) and
> the LockBuffer(BUFFER_LOCK_EXCLUSIVE).  I think that'd allow us to
> altogether drop the cleanup lock logic we currently have, and also
> protect us against the FSM issue I'd outlined upthread?
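
To make the free-space detail above concrete, here's roughly what that
check does (a paraphrase of PageGetFreeSpace() from bufpage.c, renamed
and trimmed - not the verbatim source): on an all-zeroes page both
pd_upper and pd_lower are 0, so the computed space is 0 and the page
looks full.

    /*
     * Paraphrase of PageGetFreeSpace(); on a zeroed page header,
     * pd_upper == pd_lower == 0, so this returns 0.
     */
    Size
    PageGetFreeSpaceSketch(Page page)
    {
        int         space;

        /* the hole between the line pointer array and the tuple data */
        space = (int) ((PageHeader) page)->pd_upper -
            (int) ((PageHeader) page)->pd_lower;

        /* require room for at least one more line pointer */
        if (space <= (int) sizeof(ItemIdData))
            return 0;

        return (Size) (space - sizeof(ItemIdData));
    }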

Here's a version of the patch implementing this approach.  I assume this
solves the FreeBSD issue, but I'm running tests in a loop on Thomas'
machine.
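
The core of the change is the extension path: by extending with
RBM_ZERO_AND_LOCK, the new page comes back already zeroed and
exclusively locked, so there's no window in which another backend could
lock it first. Condensed from the patch below (a sketch, not a drop-in
excerpt):

    /* Extend and lock in one step; the buffer is returned zeroed and
     * exclusively locked, so nobody can have touched it in between. */
    buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
    page = BufferGetPage(buffer);

    /* Double-check that the new page really is empty. */
    if (!PageIsNew(page))
        elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
             BufferGetBlockNumber(buffer),
             RelationGetRelationName(relation));

    PageInit(page, BufferGetPageSize(buffer), 0);
    MarkBufferDirty(buffer);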

I did not rename RBM_ZERO_AND_LOCK. New buffers are zeroed too, so that
still seems apt enough.
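
The one wrinkle is the update path, where RelationGetBufferForTuple()
is called with another buffer already pinned. That other page has a
lower block number, so the deadlock-avoidance rules say it ought to be
locked first; instead the patch tries a conditional lock, and only on
failure falls back to locking in the canonical order and retrying.
Again condensed from the patch below (a sketch; buffer is the newly
extended page, otherBuffer the page of the tuple being updated):

    if (otherBuffer != InvalidBuffer)
    {
        if (unlikely(!ConditionalLockBuffer(otherBuffer)))
        {
            /* Fall back to the canonical lock order. */
            LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
            LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
            LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);

            /* The new page was briefly unlocked; if its free space got
             * used up in the meantime, start over. */
            if (len > PageGetHeapFreeSpace(page))
            {
                LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
                UnlockReleaseBuffer(buffer);
                goto loop;
            }
        }
    }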

Greetings,

Andres Freund
From 2f211ba82bc3a9c0935bfd0cd7fb58355134b929 Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Fri, 1 Feb 2019 07:08:10 -0800
Subject: [PATCH v3] Move page initialization from RelationAddExtraBlocks() to
 use, take 2.

Previously we initialized pages when bulk extending in
RelationAddExtraBlocks(). That has a major disadvantage: it ties
RelationAddExtraBlocks() to heap, as other types of storage are likely
to need different amounts of special space and to have different
amounts of free space (previously determined by PageGetHeapFreeSpace()).

That we're relying on initializing pages, but not WAL-logging the
initialization, also means the risk of getting
"WARNING:  relation \"%s\" page %u is uninitialized --- fixing"
style warnings during vacuum after crashes/immediate shutdowns is
considerably higher. The warning sounds much more serious than it
actually is.

Fix those two issues together by not initializing pages in
RelationAddExtraBlocks() (but continue to do so in
RelationGetBufferForTuple(), which is linked much more closely to
heap), and by accepting uninitialized pages as normal in
vacuumlazy.c. When vacuumlazy.c encounters an empty page it now adds it
to the FSM, but does nothing else.  We chose not to issue a debug
message, much less a warning, in that case - it seems rarely useful,
and quite likely to scare people unnecessarily.

For now empty pages aren't added to the VM, because standbys would not
re-discover such pages after a promotion. In contrast to other sources
of empty pages, there are no corresponding WAL records triggering FSM
updates during replay.

Previously, when extending the relation, there was a moment between
extending the relation and acquiring an exclusive lock on the new
page, in which another backend could lock the page. To avoid new
content being put on that new page, vacuumlazy.c needed to acquire the
extension lock for a brief moment when encountering a new page. A
second corner case, only working somewhat by accident, was that
RelationGetBufferForTuple() sometimes checks the last page in a
relation for free space, without consulting the FSM; that only worked
because PageGetHeapFreeSpace() interprets the zeroed page header in a
new page as no free space.  The lack of handling this properly
required reverting the previous attempt in 684200543b.

This issue can be solved by using RBM_ZERO_AND_LOCK when extending the
relation, thereby avoiding this window. There's some added complexity
when RelationGetBufferForTuple() is called with another buffer (for
updates), to avoid deadlocks.

Author: Andres Freund
Reviewed-By: Tom Lane
Discussion: https://postgr.es/m/20181219083945.6khtgm36mivon...@alap3.anarazel.de
---
 src/backend/access/heap/hio.c        | 120 ++++++++++++++++++---------
 src/backend/access/heap/vacuumlazy.c |  84 ++++++++++---------
 2 files changed, 127 insertions(+), 77 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 3da0b49ccc4..5a108b7fe66 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -74,23 +74,31 @@ RelationPutHeapTuple(Relation relation,
 }
 
 /*
- * Read in a buffer, using bulk-insert strategy if bistate isn't NULL.
+ * Read in a buffer in the given mode, using bulk-insert strategy if bistate isn't NULL.
  */
 static Buffer
 ReadBufferBI(Relation relation, BlockNumber targetBlock,
-			 BulkInsertState bistate)
+			 ReadBufferMode mode, BulkInsertState bistate)
 {
 	Buffer		buffer;
 
 	/* If not bulk-insert, exactly like ReadBuffer */
 	if (!bistate)
-		return ReadBuffer(relation, targetBlock);
+		return ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
+								  mode, NULL);
 
 	/* If we have the desired block already pinned, re-pin and return it */
 	if (bistate->current_buf != InvalidBuffer)
 	{
 		if (BufferGetBlockNumber(bistate->current_buf) == targetBlock)
 		{
+			/*
+			 * Currently the LOCK variants are only used for extending the
+			 * relation, which should never reach this branch.
+			 */
+			Assert(mode != RBM_ZERO_AND_LOCK &&
+				   mode != RBM_ZERO_AND_CLEANUP_LOCK);
+
 			IncrBufferRefCount(bistate->current_buf);
 			return bistate->current_buf;
 		}
@@ -101,7 +109,7 @@ ReadBufferBI(Relation relation, BlockNumber targetBlock,
 
 	/* Perform a read using the buffer strategy */
 	buffer = ReadBufferExtended(relation, MAIN_FORKNUM, targetBlock,
-								RBM_NORMAL, bistate->strategy);
+								mode, bistate->strategy);
 
 	/* Save the selected block as target for future inserts */
 	IncrBufferRefCount(buffer);
@@ -204,11 +212,10 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
 		/*
 		 * Extend by one page.  This should generally match the main-line
 		 * extension code in RelationGetBufferForTuple, except that we hold
-		 * the relation extension lock throughout.
+		 * the relation extension lock throughout, and we don't immediately
+		 * initialize the page (see below).
 		 */
-		buffer = ReadBufferBI(relation, P_NEW, bistate);
-
-		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
 		page = BufferGetPage(buffer);
 
 		if (!PageIsNew(page))
@@ -216,18 +223,18 @@ RelationAddExtraBlocks(Relation relation, BulkInsertState bistate)
 				 BufferGetBlockNumber(buffer),
 				 RelationGetRelationName(relation));
 
-		PageInit(page, BufferGetPageSize(buffer), 0);
-
 		/*
-		 * We mark all the new buffers dirty, but do nothing to write them
-		 * out; they'll probably get used soon, and even if they are not, a
-		 * crash will leave an okay all-zeroes page on disk.
+		 * Add the page to the FSM without initializing. If we were to
+		 * initialize here, the page would potentially get flushed out to
+		 * disk before we add any useful content. There's no guarantee that
+		 * that'd happen before a potential crash, so we need to deal with
+		 * uninitialized pages anyway; thus we avoid the potential for
+		 * unnecessary writes.
 		 */
-		MarkBufferDirty(buffer);
 
 		/* we'll need this info below */
 		blockNum = BufferGetBlockNumber(buffer);
-		freespace = PageGetHeapFreeSpace(page);
+		freespace = BufferGetPageSize(buffer) - SizeOfPageHeaderData;
 
 		UnlockReleaseBuffer(buffer);
 
@@ -412,7 +419,7 @@ loop:
 		if (otherBuffer == InvalidBuffer)
 		{
 			/* easy case */
-			buffer = ReadBufferBI(relation, targetBlock, bistate);
+			buffer = ReadBufferBI(relation, targetBlock, RBM_NORMAL, bistate);
 			if (PageIsAllVisible(BufferGetPage(buffer)))
 				visibilitymap_pin(relation, targetBlock, vmbuffer);
 			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
@@ -479,6 +486,18 @@ loop:
 		 * we're done.
 		 */
 		page = BufferGetPage(buffer);
+
+		/*
+	 * Initialize the page; it'll be used soon.  We could avoid dirtying
+	 * the buffer here, and rely on the caller to do so whenever it puts a
+	 * tuple onto the page, but there seems little benefit in doing so.
+		 */
+		if (PageIsNew(page))
+		{
+			PageInit(page, BufferGetPageSize(buffer), 0);
+			MarkBufferDirty(buffer);
+		}
+
 		pageFreeSpace = PageGetHeapFreeSpace(page);
 		if (len + saveFreeSpace <= pageFreeSpace)
 		{
@@ -571,28 +590,7 @@ loop:
 	 * it worth keeping an accurate file length in shared memory someplace,
 	 * rather than relying on the kernel to do it for us?
 	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
-
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
-
-	/*
-	 * Release the file-extension lock; it's now OK for someone else to extend
-	 * the relation some more.  Note that we cannot release this lock before
-	 * we have buffer lock on the new page, or we risk a race condition
-	 * against vacuumlazy.c --- see comments therein.
-	 */
-	if (needLock)
-		UnlockRelationForExtension(relation, ExclusiveLock);
+	buffer = ReadBufferBI(relation, P_NEW, RBM_ZERO_AND_LOCK, bistate);
 
 	/*
 	 * We need to initialize the empty new page.  Double-check that it really
@@ -607,6 +605,52 @@ loop:
 			 RelationGetRelationName(relation));
 
 	PageInit(page, BufferGetPageSize(buffer), 0);
+	MarkBufferDirty(buffer);
+
+	/*
+	 * Release the file-extension lock; it's now OK for someone else to extend
+	 * the relation some more.
+	 */
+	if (needLock)
+		UnlockRelationForExtension(relation, ExclusiveLock);
+
+	/*
+	 * Lock the other buffer. It's guaranteed to be of a lower page number
+	 * than the new page. To conform with the deadlock prevention rules, we
+	 * ought to lock otherBuffer first, but that would give other backends a
+	 * chance to put tuples on our page. To reduce the likelihood of that,
+	 * attempt to lock the other buffer conditionally; that's very likely to
+	 * work. Otherwise we need to lock the buffers in the correct order, and
+	 * retry if the space has been used up in the meantime.
+	 *
+	 * Alternatively, we could acquire the lock on otherBuffer before
+	 * extending the relation, but that'd require holding the lock while
+	 * performing IO, which seems worse than an unlikely retry.
+	 */
+	if (otherBuffer != InvalidBuffer)
+	{
+		Assert(otherBuffer != buffer);
+
+		if (unlikely(!ConditionalLockBuffer(otherBuffer)))
+		{
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+			LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+			/*
+			 * Because the buffer was unlocked for a while, it's possible,
+			 * although unlikely, that the page was filled. If so, just retry
+			 * from the start.
+			 */
+			if (len > PageGetHeapFreeSpace(page))
+			{
+				LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
+				UnlockReleaseBuffer(buffer);
+
+				goto loop;
+			}
+		}
+	}
 
 	if (len > PageGetHeapFreeSpace(page))
 	{
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 37aa484ec3a..26dfb0c7e0f 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -860,43 +860,46 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 
 		if (PageIsNew(page))
 		{
+			bool		still_new;
+
 			/*
-			 * An all-zeroes page could be left over if a backend extends the
-			 * relation but crashes before initializing the page. Reclaim such
-			 * pages for use.
+			 * All-zeroes pages can be left over if a backend extends the
+			 * relation by a single page, but crashes before the newly
+			 * initialized page has been written out, or when bulk-extending
+			 * the relation (which creates a number of empty pages at the
+			 * tail end of the relation, but enters them into the FSM).
 			 *
-			 * We have to be careful here because we could be looking at a
-			 * page that someone has just added to the relation and not yet
-			 * been able to initialize (see RelationGetBufferForTuple). To
-			 * protect against that, release the buffer lock, grab the
-			 * relation extension lock momentarily, and re-lock the buffer. If
-			 * the page is still uninitialized by then, it must be left over
-			 * from a crashed backend, and we can initialize it.
+			 * Make sure these pages are in the FSM, to ensure they can be
+			 * reused. Do that by testing if there's any space recorded for
+			 * the page. If not, enter it.
 			 *
-			 * We don't really need the relation lock when this is a new or
-			 * temp relation, but it's probably not worth the code space to
-			 * check that, since this surely isn't a critical path.
-			 *
-			 * Note: the comparable code in vacuum.c need not worry because
-			 * it's got exclusive lock on the whole relation.
+			 * Note we do not enter the page into the visibilitymap. That has
+			 * the downside that we repeatedly visit this page in subsequent
+			 * vacuums, but otherwise we'd never discover the space on a
+			 * promoted standby. The harm of repeated checking ought to
+			 * normally not be too bad - the space usually should be used at
+			 * some point, otherwise there wouldn't be any regular vacuums.
 			 */
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockRelationForExtension(onerel, ExclusiveLock);
-			UnlockRelationForExtension(onerel, ExclusiveLock);
-			LockBufferForCleanup(buf);
-			if (PageIsNew(page))
-			{
-				ereport(WARNING,
-						(errmsg("relation \"%s\" page %u is uninitialized --- fixing",
-								relname, blkno)));
-				PageInit(page, BufferGetPageSize(buf), 0);
-				empty_pages++;
-			}
-			freespace = PageGetHeapFreeSpace(page);
-			MarkBufferDirty(buf);
+
+			/*
+			 * Perform the FSM check after releasing the lock; the FSM is
+			 * approximate, after all.
+			 */
+			still_new = PageIsNew(page);
 			UnlockReleaseBuffer(buf);
 
-			RecordPageWithFreeSpace(onerel, blkno, freespace);
+			if (still_new)
+			{
+				empty_pages++;
+
+				if (GetRecordedFreeSpace(onerel, blkno) == 0)
+				{
+					Size		freespace;
+
+					freespace = BufferGetPageSize(buf) - SizeOfPageHeaderData;
+					RecordPageWithFreeSpace(onerel, blkno, freespace);
+				}
+			}
 			continue;
 		}
 
@@ -905,7 +908,10 @@ lazy_scan_heap(Relation onerel, int options, LVRelStats *vacrelstats,
 			empty_pages++;
 			freespace = PageGetHeapFreeSpace(page);
 
-			/* empty pages are always all-visible and all-frozen */
+			/*
+			 * Empty pages are always all-visible and all-frozen (note that
+			 * the same is currently not true for new pages, see above).
+			 */
 			if (!PageIsAllVisible(page))
 			{
 				START_CRIT_SECTION();
@@ -1639,12 +1645,13 @@ lazy_check_needs_freeze(Buffer buf, bool *hastup)
 
 	*hastup = false;
 
-	/* If we hit an uninitialized page, we want to force vacuuming it. */
-	if (PageIsNew(page))
-		return true;
-
-	/* Quick out for ordinary empty page. */
-	if (PageIsEmpty(page))
+	/*
+	 * New and empty pages, obviously, don't contain tuples. We could make
+	 * sure that the page is registered in the FSM, but it doesn't seem worth
+	 * waiting for a cleanup lock just for that, especially because it's
+	 * likely that the pin holder will do so.
+	 */
+	if (PageIsNew(page) || PageIsEmpty(page))
 		return false;
 
 	maxoff = PageGetMaxOffsetNumber(page);
@@ -2029,7 +2036,6 @@ count_nondeletable_pages(Relation onerel, LVRelStats *vacrelstats)
 
 		if (PageIsNew(page) || PageIsEmpty(page))
 		{
-			/* PageIsNew probably shouldn't happen... */
 			UnlockReleaseBuffer(buf);
 			continue;
 		}
-- 
2.18.0.rc2.dirty
