Hi,

Every now and then over the last few weeks I've spent a bit of time
making this more efficient.

I had a bit of a problem reproducing the problems I'd seen in
production on physical hardware (I found EC2 too variable to benchmark
this), but luckily 2ndQuadrant today allowed me access to their
four-socket machine[1] from the AXLE project.  Thanks Simon and
Tomas!

First, some mostly juicy numbers:

My benchmark was a parallel COPY into a single WAL-logged target
table:
CREATE TABLE data(data text);
The source data has been generated with
narrow:
COPY (select g.i::text FROM generate_series(1, 10000) g(i)) TO 
'/tmp/copybinary' WITH BINARY;
wide:
COPY (select repeat(random()::text, 10) FROM generate_series(1, 10000) g(i)) TO 
'/tmp/copybinarywide' WITH BINARY;

Between each test I ran TRUNCATE data; CHECKPOINT;

For each number of clients I ran pgbench for 70 seconds. I'd previously
determined using -P 1 that the numbers are fairly stable. Longer runs
would have been nice, but then I'd not have finished in time.
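
Roughly, each pgbench client just loads the generated file back into the
target table via a custom script, e.g. for the narrow case:

copy.sql:
COPY data FROM '/tmp/copybinary' WITH BINARY;

pgbench -n -f copy.sql -c $CLIENTS -j $CLIENTS -T 70

(The script name and the -n/-j flags above are guesses for illustration;
client count and duration are as described.)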

shared_buffers = 48GB, narrow table contents:
clients    tps (patched)   tps (unpatched)
1          180.255577      210.125143
2          338.231058      391.875088
4          638.814300      405.243901
8          1126.852233     370.922271
16         1242.363623     498.487008
32         1229.648854     484.477042
48         1223.288397     468.127943
64         1198.007422     438.238119
96         1201.501278     370.556354
128        1198.554929     288.213032
196        1189.603398     193.841993
256        1144.082291     191.293781
512        643.323675      200.782105

shared_buffers = 1GB, narrow table contents:
clients    tps (patched)   tps (unpatched)
1          191.137410      210.787214
2          351.293017      384.086634
4          649.800991      420.703149
8          1103.770749     355.947915
16         1287.192256     489.050768
32         1226.329585     464.936427
48         1187.266489     443.386440
64         1182.698974     402.251258
96         1208.315983     331.290851
128        1183.469635     269.250601
196        1202.847382     202.788617
256        1177.924515     190.876852
512        572.457773      192.413191

shared_buffers = 48GB, wide table contents:
clients    tps (patched)   tps (unpatched)
1          59.685215       68.445331
2          102.034688      103.210277
4          179.434065      78.982315
8          222.613727      76.195353
16         232.162484      77.520265
32         231.979136      71.654421
48         231.981216      64.730114
64         230.955979      57.444215
96         228.016910      56.324725
128        227.693947      45.701038
196        227.410386      37.138537
256        224.626948      35.265530
512        105.356439      34.397636

shared_buffers = 1GB, wide table contents:
(ran out of patience)

Note that the peak performance with the patch is significantly better,
but there's currently a noticeable regression in single-threaded
performance. That undoubtedly needs to be addressed.


So, to get to the actual meat: my goal was essentially to get rid of the
exclusive lock over relation extension altogether. I think I found a way
to do that which addresses the concerns raised in this thread.

The new algorithm basically is (a condensed code sketch follows the list):
1) Acquire victim buffer, clean it, and mark it as pinned
2) Get the current size of the relation, save it as blockno
3) Try to insert an entry into the buffer table for blockno
4) If the page is already in the buffer table, increment blockno by 1,
   goto 3)
5) Try to read the page. In most cases it won't exist yet, but the page
   might have been written concurrently by another backend and already
   evicted from shared buffers. If it already exists, goto 1)
6) Zero out the page on disk.
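
In code terms, heavily condensed from ExtendRelation() in the attached
patch (pinning, buffer-mapping partition locking and error handling
omitted), that loop looks roughly like this:

	buf = GetVictimBuffer(strategy, &oldFlags);		/* step 1 */
	bufBlock = BufHdrGetBlock(buf);

	blockno = smgrnblocks(reln->rd_smgr, forkNum);		/* step 2 */

	for (;;)
	{
		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node,
					   forkNum, blockno);
		newHash = BufTableHashCode(&newTag);
		if (BufTableInsert(&newTag, newHash, buf->buf_id) >= 0)	/* step 3 */
		{
			blockno++;	/* step 4: already taken, try the next block */
			continue;
		}
		break;			/* the entry is ours */
	}

	if (smgrtryread(reln->rd_smgr, forkNum, blockno, bufBlock) == BLCKSZ)
		goto retry_victim;	/* step 5: page already exists on disk */

	MemSet((char *) bufBlock, 0, BLCKSZ);			/* step 6 */
	smgrextend(reln->rd_smgr, forkNum, blockno, (char *) bufBlock, false);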

I think this does handle the concurrency issues.

This patch very clearly is in the POC stage. But I do think the approach
is generally sound.  I'd like to see some comments before deciding
whether to carry on.


Greetings,

Andres Freund

PS: Yes, I know that this much precision in the benchmark numbers isn't
warranted, but I'm too lazy to truncate them.

[1]
[10:28:11 PM] Tomas Vondra: 4x Intel Xeon E5-4620 Eight Core 2.2GHz Processors (Sandy Bridge EP generation)
each core handles 2 threads, so 16 threads total
256GB (16x16GB) ECC REG System Validated Memory (1333 MHz)
2x 250GB SATA 2.5” Enterprise Level HDs (RAID 1, ~250GB)
17x 600GB SATA 2.5” Solid State HDs (RAID 0, ~10TB)
LSI MegaRAID 9271-8iCC controller and Cache Vault Kit (1GB cache)
2 x Nvidia Tesla K20 Active GPU Cards (GK110GL)

From fc095897a6f4207d384559a095f80a36cf49648c Mon Sep 17 00:00:00 2001
From: Andres Freund <and...@anarazel.de>
Date: Sun, 29 Mar 2015 20:55:32 +0200
Subject: [PATCH] WIP: Saner heap extension.

---
 src/backend/access/heap/hio.c       |  86 ++++----
 src/backend/commands/vacuumlazy.c   |  39 ++--
 src/backend/storage/buffer/bufmgr.c | 377 ++++++++++++++++++++++++++----------
 src/backend/storage/smgr/md.c       |  62 ++++++
 src/backend/storage/smgr/smgr.c     |  20 +-
 src/include/storage/buf_internals.h |   1 +
 src/include/storage/bufmgr.h        |   1 +
 src/include/storage/smgr.h          |   7 +-
 8 files changed, 417 insertions(+), 176 deletions(-)

diff --git a/src/backend/access/heap/hio.c b/src/backend/access/heap/hio.c
index 6db73bf..b47f9fe 100644
--- a/src/backend/access/heap/hio.c
+++ b/src/backend/access/heap/hio.c
@@ -15,6 +15,8 @@
 
 #include "postgres.h"
 
+#include "miscadmin.h"
+
 #include "access/heapam.h"
 #include "access/hio.h"
 #include "access/htup_details.h"
@@ -237,7 +239,6 @@ RelationGetBufferForTuple(Relation relation, Size len,
 				saveFreeSpace;
 	BlockNumber targetBlock,
 				otherBlock;
-	bool		needLock;
 
 	len = MAXALIGN(len);		/* be conservative */
 
@@ -433,63 +434,50 @@ RelationGetBufferForTuple(Relation relation, Size len,
 	/*
 	 * Have to extend the relation.
 	 *
-	 * We have to use a lock to ensure no one else is extending the rel at the
-	 * same time, else we will both try to initialize the same new page.  We
-	 * can skip locking for new or temp relations, however, since no one else
-	 * could be accessing them.
+	 * We no longer hold the relation extension lock while searching for a
+	 * victim buffer; instead bufmgr.c's ExtendRelation() hands every backend
+	 * its own, freshly allocated block, so there is no race between two
+	 * backends initializing the same page.
 	 */
-	needLock = !RELATION_IS_LOCAL(relation);
-
-	if (needLock)
-		LockRelationForExtension(relation, ExclusiveLock);
+	while(true)
+	{
+		buffer = ExtendRelation(relation, MAIN_FORKNUM, bistate->strategy);
 
-	/*
-	 * XXX This does an lseek - rather expensive - but at the moment it is the
-	 * only way to accurately determine how many blocks are in a relation.  Is
-	 * it worth keeping an accurate file length in shared memory someplace,
-	 * rather than relying on the kernel to do it for us?
-	 */
-	buffer = ReadBufferBI(relation, P_NEW, bistate);
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * We can be certain that locking the otherBuffer first is OK, since it
-	 * must have a lower page number.
-	 */
-	if (otherBuffer != InvalidBuffer)
-		LockBuffer(otherBuffer, BUFFER_LOCK_EXCLUSIVE);
+		/*
+		 * Now acquire lock on the new page.
+		 */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
 
-	/*
-	 * Now acquire lock on the new page.
-	 */
-	LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
 
-	/*
-	 * Release the file-extension lock; it's now OK for someone else to extend
-	 * the relation some more.  Note that we cannot release this lock before
-	 * we have buffer lock on the new page, or we risk a race condition
-	 * against vacuumlazy.c --- see comments therein.
-	 */
-	if (needLock)
-		UnlockRelationForExtension(relation, ExclusiveLock);
+		/*
+		 * While unlikely, it's possible that another backend managed to
+		 * initialize the page and use up the free space till we got the
+		 * exclusive lock. That'd require the page to be vacuumed (to be put
+		 * on the free space list) and then be used; possible but fairly
+		 * unlikely in practice. If it happens and there's not enough space,
+		 * just retry.
+		 */
+		if (PageIsNew(page))
+		{
+			PageInit(page, BLCKSZ, 0);
 
-	/*
-	 * We need to initialize the empty new page.  Double-check that it really
-	 * is empty (this should never happen, but if it does we don't want to
-	 * risk wiping out valid data).
-	 */
-	page = BufferGetPage(buffer);
+			Assert(len <= PageGetHeapFreeSpace(page));
+			break;
+		}
+		else if (len <= PageGetHeapFreeSpace(page))
+			break;
 
-	if (!PageIsNew(page))
-		elog(ERROR, "page %u of relation \"%s\" should be empty but is not",
-			 BufferGetBlockNumber(buffer),
-			 RelationGetRelationName(relation));
+		if (otherBuffer != InvalidBuffer)
+			LockBuffer(otherBuffer, BUFFER_LOCK_UNLOCK);
 
-	PageInit(page, BufferGetPageSize(buffer), 0);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
 
-	if (len > PageGetHeapFreeSpace(page))
-	{
-		/* We should not get here given the test at the top */
-		elog(PANIC, "tuple is too big: size %zu", len);
+		CHECK_FOR_INTERRUPTS();
 	}
 
 	/*
diff --git a/src/backend/commands/vacuumlazy.c b/src/backend/commands/vacuumlazy.c
index a01cfb4..896731c 100644
--- a/src/backend/commands/vacuumlazy.c
+++ b/src/backend/commands/vacuumlazy.c
@@ -674,35 +674,18 @@ lazy_scan_heap(Relation onerel, LVRelStats *vacrelstats,
 			/*
 			 * An all-zeroes page could be left over if a backend extends the
 			 * relation but crashes before initializing the page. Reclaim such
-			 * pages for use.
-			 *
-			 * We have to be careful here because we could be looking at a
-			 * page that someone has just added to the relation and not yet
-			 * been able to initialize (see RelationGetBufferForTuple). To
-			 * protect against that, release the buffer lock, grab the
-			 * relation extension lock momentarily, and re-lock the buffer. If
-			 * the page is still uninitialized by then, it must be left over
-			 * from a crashed backend, and we can initialize it.
-			 *
-			 * We don't really need the relation lock when this is a new or
-			 * temp relation, but it's probably not worth the code space to
-			 * check that, since this surely isn't a critical path.
-			 *
-			 * Note: the comparable code in vacuum.c need not worry because
-			 * it's got exclusive lock on the whole relation.
+			 * pages for use.  It is also possible that we're looking at a
+			 * page that has just been added but not yet initialized (see
+			 * RelationGetBufferForTuple). In that case we just initialize the
+			 * page here. That means the page will end up in the free space
+			 * map a little earlier, but that seems fine.
 			 */
-			LockBuffer(buf, BUFFER_LOCK_UNLOCK);
-			LockRelationForExtension(onerel, ExclusiveLock);
-			UnlockRelationForExtension(onerel, ExclusiveLock);
-			LockBufferForCleanup(buf);
-			if (PageIsNew(page))
-			{
-				ereport(WARNING,
-				(errmsg("relation \"%s\" page %u is uninitialized --- fixing",
-						relname, blkno)));
-				PageInit(page, BufferGetPageSize(buf), 0);
-				empty_pages++;
-			}
+			ereport(DEBUG2,
+					(errmsg("relation \"%s\" page %u is uninitialized --- fixing",
+							relname, blkno)));
+			PageInit(page, BufferGetPageSize(buf), 0);
+			empty_pages++;
+
 			freespace = PageGetHeapFreeSpace(page);
 			MarkBufferDirty(buf);
 			UnlockReleaseBuffer(buf);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index e4b25587..4613666 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -392,6 +392,7 @@ static Buffer ReadBuffer_common(SMgrRelation reln, char relpersistence,
 				  ForkNumber forkNum, BlockNumber blockNum,
 				  ReadBufferMode mode, BufferAccessStrategy strategy,
 				  bool *hit);
+static volatile BufferDesc *GetVictimBuffer(BufferAccessStrategy strategy, BufFlags *oldFlags);
 static bool PinBuffer(volatile BufferDesc *buf, BufferAccessStrategy strategy);
 static void PinBuffer_Locked(volatile BufferDesc *buf);
 static void UnpinBuffer(volatile BufferDesc *buf, bool fixOwner);
@@ -483,6 +484,176 @@ PrefetchBuffer(Relation reln, ForkNumber forkNum, BlockNumber blockNum)
 #endif   /* USE_PREFETCH */
 }
 
+Buffer
+ExtendRelation(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
+{
+	BlockNumber	blockno;
+	Buffer		buf_id;
+	volatile BufferDesc *buf;
+	BufFlags	oldFlags;
+	Block		bufBlock;
+	bool		isLocalBuf = RelationUsesLocalBuffers(reln);
+	int			readblocks;
+
+	BufferTag	oldTag;			/* previous identity of selected buffer */
+	uint32		oldHash;		/* hash value for oldTag */
+	LWLock	   *oldPartitionLock;		/* buffer partition lock for it */
+
+	BufferTag	newTag;
+	uint32		newHash;
+	LWLock	   *newPartitionLock;
+
+	/* FIXME: This obviously isn't acceptable for integration */
+	if (isLocalBuf)
+	{
+		return ReadBufferExtended(reln, forkNum, P_NEW, RBM_NORMAL, strategy);
+	}
+
+	/* Open it at the smgr level if not already done */
+	RelationOpenSmgr(reln);
+
+	/* Make sure we will have room to remember the buffer pin */
+	ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
+
+retry_victim:
+	/* we'll need a clean unassociated victim buffer */
+	while (true)
+	{
+		bool		gotIt = false;
+
+		/*
+		 * Returns a buffer that was unpinned and not dirty at the time of the
+		 * check.
+		 */
+		buf = GetVictimBuffer(strategy, &oldFlags);
+
+		if (oldFlags & BM_TAG_VALID)
+		{
+			oldTag = buf->tag;
+			oldHash = BufTableHashCode(&oldTag);
+			oldPartitionLock = BufMappingPartitionLock(oldHash);
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
+		}
+
+		LockBufHdr(buf);
+
+		/* somebody else might have re-pinned the buffer by now */
+		if (buf->refcount != 1 || (buf->flags & BM_DIRTY))
+		{
+			UnlockBufHdr(buf);
+		}
+		else
+		{
+			buf->flags &= ~(BM_TAG_VALID | BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_NEW);
+
+			UnlockBufHdr(buf);
+
+			gotIt = true;
+
+			if (oldFlags & BM_TAG_VALID)
+				BufTableDelete(&oldTag, oldHash);
+		}
+
+		if (oldFlags & BM_TAG_VALID)
+			LWLockRelease(oldPartitionLock);
+
+		if (gotIt)
+			break;
+		else
+			UnpinBuffer(buf, true);
+	}
+
+	/*
+	 * At this point we have an empty victim buffer, pinned to prevent it from
+	 * being reused.
+	 */
+
+	/*
+	 * First try the current end of the relation. If a concurrent process has
+	 * acquired that, try the next one after that.
+	 */
+	blockno = smgrnblocks(reln->rd_smgr, forkNum);
+
+	while (true)
+	{
+		INIT_BUFFERTAG(newTag, reln->rd_smgr->smgr_rnode.node, forkNum, blockno);
+
+		newHash = BufTableHashCode(&newTag);
+		newPartitionLock = BufMappingPartitionLock(newHash);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+
+		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+		if (buf_id >= 0)
+		{
+			/* somebody else got this block, try the next one */
+			LWLockRelease(newPartitionLock);
+			blockno++;
+			continue;
+		}
+
+		LockBufHdr(buf);
+
+		buf->tag = newTag;
+		if (reln->rd_rel->relpersistence == RELPERSISTENCE_PERMANENT)
+			buf->flags |= BM_NEW | BM_TAG_VALID | BM_PERMANENT;
+		else
+			buf->flags |= BM_NEW | BM_TAG_VALID;
+		buf->usage_count = 1;
+
+		UnlockBufHdr(buf);
+		LWLockRelease(newPartitionLock);
+
+		break;
+	}
+
+	/*
+	 * By here we have made an entry into the buffer table, but haven't yet
+	 * read/written the page.  We can't just initialize the page: while we
+	 * were busy with the above, another backend could have extended the
+	 * relation, written something, and the buffer could already have been
+	 * reused for something else.
+	 */
+
+	if (!StartBufferIO(buf, true))
+	{
+		/*
+		 * Somebody else is already using this block. Just try another one.
+		 */
+		UnpinBuffer(buf, true);
+		goto retry_victim;
+	}
+
+	/*
+	 * FIXME: if we die here we might have a problem: Everyone trying to read
+	 * this block will get a failure. Need to add checks for BM_NEW against
+	 * that. That's not really new to this code though.
+	 */
+
+	bufBlock = isLocalBuf ? LocalBufHdrGetBlock(buf) : BufHdrGetBlock(buf);
+
+	readblocks = smgrtryread(reln->rd_smgr, forkNum, blockno, bufBlock);
+
+	if (readblocks != BLCKSZ)
+	{
+		MemSet((char *) bufBlock, 0, BLCKSZ);
+
+		smgrextend(reln->rd_smgr, forkNum, blockno, (char *) bufBlock, false);
+
+		/* Set BM_VALID, terminate IO, and wake up any waiters */
+		TerminateBufferIO(buf, false, BM_VALID);
+	}
+	else
+	{
+		/* Set BM_VALID, terminate IO, and wake up any waiters */
+		TerminateBufferIO(buf, false, BM_VALID);
+		UnpinBuffer(buf, true);
+
+		goto retry_victim;
+	}
+
+	return BufferDescriptorGetBuffer(buf);
+}
 
 /*
  * ReadBuffer -- a shorthand for ReadBufferExtended, for reading from main
@@ -847,6 +1018,112 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	return BufferDescriptorGetBuffer(bufHdr);
 }
 
+static volatile BufferDesc *
+GetVictimBuffer(BufferAccessStrategy strategy, BufFlags *oldFlags)
+{
+	volatile BufferDesc *buf;
+
+	/*
+	 * Ensure, while the spinlock's not yet held, that there's a free refcount
+	 * entry.
+	 */
+	ReservePrivateRefCountEntry();
+
+retry:
+	/*
+	 * Select a victim buffer.  The buffer is returned with its header
+	 * spinlock still held!
+	 */
+	buf = StrategyGetBuffer(strategy);
+
+	Assert(buf->refcount == 0);
+
+	/* Must copy buffer flags while we still hold the spinlock */
+	*oldFlags = buf->flags;
+
+	/* Pin the buffer and then release the buffer spinlock */
+	PinBuffer_Locked(buf);
+
+	/*
+	 * If the buffer was dirty, try to write it out.  There is a race
+	 * condition here, in that someone might dirty it after we released it
+	 * above, or even while we are writing it out (since our share-lock
+	 * won't prevent hint-bit updates).  We will recheck the dirty bit
+	 * after re-locking the buffer header.
+	 */
+	if (*oldFlags & BM_DIRTY)
+	{
+		/*
+		 * We need a share-lock on the buffer contents to write it out
+		 * (else we might write invalid data, eg because someone else is
+		 * compacting the page contents while we write).  We must use a
+		 * conditional lock acquisition here to avoid deadlock.  Even
+		 * though the buffer was not pinned (and therefore surely not
+		 * locked) when StrategyGetBuffer returned it, someone else could
+		 * have pinned and exclusive-locked it by the time we get here. If
+		 * we try to get the lock unconditionally, we'd block waiting for
+		 * them; if they later block waiting for us, deadlock ensues.
+		 * (This has been observed to happen when two backends are both
+		 * trying to split btree index pages, and the second one just
+		 * happens to be trying to split the page the first one got from
+		 * StrategyGetBuffer.)
+		 */
+		if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
+		{
+			/*
+			 * If using a nondefault strategy, and writing the buffer
+			 * would require a WAL flush, let the strategy decide whether
+			 * to go ahead and write/reuse the buffer or to choose another
+			 * victim.  We need lock to inspect the page LSN, so this
+			 * can't be done inside StrategyGetBuffer.
+			 */
+			if (strategy != NULL)
+			{
+				XLogRecPtr	lsn;
+
+				/* Read the LSN while holding buffer header lock */
+				LockBufHdr(buf);
+				lsn = BufferGetLSN(buf);
+				UnlockBufHdr(buf);
+
+				if (XLogNeedsFlush(lsn) &&
+					StrategyRejectBuffer(strategy, buf))
+				{
+					/* Drop lock/pin and loop around for another buffer */
+					LWLockRelease(buf->content_lock);
+					UnpinBuffer(buf, true);
+					goto retry;
+				}
+			}
+
+			/* OK, do the I/O */
+			TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
+										   smgr->smgr_rnode.node.spcNode,
+											smgr->smgr_rnode.node.dbNode,
+										  smgr->smgr_rnode.node.relNode);
+
+			FlushBuffer(buf, NULL);
+			LWLockRelease(buf->content_lock);
+
+			TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
+										   smgr->smgr_rnode.node.spcNode,
+											smgr->smgr_rnode.node.dbNode,
+										  smgr->smgr_rnode.node.relNode);
+		}
+		else
+		{
+			/*
+			 * Someone else has locked the buffer, so give it up and loop
+			 * back to get another one.
+			 */
+			UnpinBuffer(buf, true);
+			goto retry;
+		}
+	}
+
+	return buf;
+}
+
 /*
  * BufferAlloc -- subroutine for ReadBuffer.  Handles lookup of a shared
  *		buffer.  If no buffer exists already, selects a replacement
@@ -940,102 +1217,8 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
-		/*
-		 * Ensure, while the spinlock's not yet held, that there's a free
-		 * refcount entry.
-		 */
-		ReservePrivateRefCountEntry();
-
-		/*
-		 * Select a victim buffer.  The buffer is returned with its header
-		 * spinlock still held!
-		 */
-		buf = StrategyGetBuffer(strategy);
-
-		Assert(buf->refcount == 0);
-
-		/* Must copy buffer flags while we still hold the spinlock */
-		oldFlags = buf->flags;
-
-		/* Pin the buffer and then release the buffer spinlock */
-		PinBuffer_Locked(buf);
-
-		/*
-		 * If the buffer was dirty, try to write it out.  There is a race
-		 * condition here, in that someone might dirty it after we released it
-		 * above, or even while we are writing it out (since our share-lock
-		 * won't prevent hint-bit updates).  We will recheck the dirty bit
-		 * after re-locking the buffer header.
-		 */
-		if (oldFlags & BM_DIRTY)
-		{
-			/*
-			 * We need a share-lock on the buffer contents to write it out
-			 * (else we might write invalid data, eg because someone else is
-			 * compacting the page contents while we write).  We must use a
-			 * conditional lock acquisition here to avoid deadlock.  Even
-			 * though the buffer was not pinned (and therefore surely not
-			 * locked) when StrategyGetBuffer returned it, someone else could
-			 * have pinned and exclusive-locked it by the time we get here. If
-			 * we try to get the lock unconditionally, we'd block waiting for
-			 * them; if they later block waiting for us, deadlock ensues.
-			 * (This has been observed to happen when two backends are both
-			 * trying to split btree index pages, and the second one just
-			 * happens to be trying to split the page the first one got from
-			 * StrategyGetBuffer.)
-			 */
-			if (LWLockConditionalAcquire(buf->content_lock, LW_SHARED))
-			{
-				/*
-				 * If using a nondefault strategy, and writing the buffer
-				 * would require a WAL flush, let the strategy decide whether
-				 * to go ahead and write/reuse the buffer or to choose another
-				 * victim.  We need lock to inspect the page LSN, so this
-				 * can't be done inside StrategyGetBuffer.
-				 */
-				if (strategy != NULL)
-				{
-					XLogRecPtr	lsn;
-
-					/* Read the LSN while holding buffer header lock */
-					LockBufHdr(buf);
-					lsn = BufferGetLSN(buf);
-					UnlockBufHdr(buf);
-
-					if (XLogNeedsFlush(lsn) &&
-						StrategyRejectBuffer(strategy, buf))
-					{
-						/* Drop lock/pin and loop around for another buffer */
-						LWLockRelease(buf->content_lock);
-						UnpinBuffer(buf, true);
-						continue;
-					}
-				}
-
-				/* OK, do the I/O */
-				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_START(forkNum, blockNum,
-											   smgr->smgr_rnode.node.spcNode,
-												smgr->smgr_rnode.node.dbNode,
-											  smgr->smgr_rnode.node.relNode);
-
-				FlushBuffer(buf, NULL);
-				LWLockRelease(buf->content_lock);
-
-				TRACE_POSTGRESQL_BUFFER_WRITE_DIRTY_DONE(forkNum, blockNum,
-											   smgr->smgr_rnode.node.spcNode,
-												smgr->smgr_rnode.node.dbNode,
-											  smgr->smgr_rnode.node.relNode);
-			}
-			else
-			{
-				/*
-				 * Someone else has locked the buffer, so give it up and loop
-				 * back to get another one.
-				 */
-				UnpinBuffer(buf, true);
-				continue;
-			}
-		}
+		/* returns a nondirty buffer, with potentially valid contents */
+		buf = GetVictimBuffer(strategy, &oldFlags);
 
 		/*
 		 * To change the association of a valid buffer, we'll need to have
@@ -1171,7 +1354,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 * 1 so that the buffer can survive one clock-sweep pass.)
 	 */
 	buf->tag = newTag;
-	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT);
+	buf->flags &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED | BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT | BM_NEW);
 	if (relpersistence == RELPERSISTENCE_PERMANENT)
 		buf->flags |= BM_TAG_VALID | BM_PERMANENT;
 	else
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 42a43bb..0038c91 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -729,6 +729,68 @@ mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	}
 }
 
+
+/*
+ *	mdtryread() -- Read the specified block from a relation.
+ */
+int
+mdtryread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+	   char *buffer)
+{
+	off_t		seekpos;
+	int			nbytes;
+	MdfdVec    *v;
+
+	TRACE_POSTGRESQL_SMGR_MD_READ_START(forknum, blocknum,
+										reln->smgr_rnode.node.spcNode,
+										reln->smgr_rnode.node.dbNode,
+										reln->smgr_rnode.node.relNode,
+										reln->smgr_rnode.backend);
+
+	v = _mdfd_getseg(reln, forknum, blocknum, false, EXTENSION_RETURN_NULL);
+
+	/* would need another segment */
+	if (v == NULL)
+		return 0;
+
+	seekpos = (off_t) BLCKSZ *(blocknum % ((BlockNumber) RELSEG_SIZE));
+
+	Assert(seekpos < (off_t) BLCKSZ * RELSEG_SIZE);
+
+	if (FileSeek(v->mdfd_vfd, seekpos, SEEK_SET) != seekpos)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek to block %u in file \"%s\": %m",
+						blocknum, FilePathName(v->mdfd_vfd))));
+
+	nbytes = FileRead(v->mdfd_vfd, buffer, BLCKSZ);
+
+	TRACE_POSTGRESQL_SMGR_MD_READ_DONE(forknum, blocknum,
+									   reln->smgr_rnode.node.spcNode,
+									   reln->smgr_rnode.node.dbNode,
+									   reln->smgr_rnode.node.relNode,
+									   reln->smgr_rnode.backend,
+									   nbytes,
+									   BLCKSZ);
+
+	if (nbytes < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not read block %u in file \"%s\": %m",
+						blocknum, FilePathName(v->mdfd_vfd))));
+
+	if (nbytes > 0 && nbytes < BLCKSZ)
+	{
+		ereport(LOG,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
+						blocknum, FilePathName(v->mdfd_vfd),
+						nbytes, BLCKSZ)));
+	}
+
+	return nbytes;
+}
+
 /*
  *	mdwrite() -- Write the supplied block at the appropriate location.
  *
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 244b4ea..f0e9a7b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -51,6 +51,8 @@ typedef struct f_smgr
 											  BlockNumber blocknum);
 	void		(*smgr_read) (SMgrRelation reln, ForkNumber forknum,
 										  BlockNumber blocknum, char *buffer);
+	int			(*smgr_tryread) (SMgrRelation reln, ForkNumber forknum,
+										  BlockNumber blocknum, char *buffer);
 	void		(*smgr_write) (SMgrRelation reln, ForkNumber forknum,
 						 BlockNumber blocknum, char *buffer, bool skipFsync);
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
@@ -66,7 +68,7 @@ typedef struct f_smgr
 static const f_smgr smgrsw[] = {
 	/* magnetic disk */
 	{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
-		mdprefetch, mdread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
+		mdprefetch, mdread, mdtryread, mdwrite, mdnblocks, mdtruncate, mdimmedsync,
 		mdpreckpt, mdsync, mdpostckpt
 	}
 };
@@ -626,6 +628,22 @@ smgrread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	(*(smgrsw[reln->smgr_which].smgr_read)) (reln, forknum, blocknum, buffer);
 }
 
+
+/*
+ *	smgrtryread() -- read a particular block from a relation into the supplied
+ *				  buffer.
+ *
+ *		This routine is called from the buffer manager in order to
+ *		instantiate pages in the shared buffer cache.  All storage managers
+ *		return pages in the format that POSTGRES expects.
+ */
+int
+smgrtryread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+		 char *buffer)
+{
+	return (*(smgrsw[reln->smgr_which].smgr_tryread)) (reln, forknum, blocknum, buffer);
+}
+
 /*
  *	smgrwrite() -- Write the supplied buffer out.
  *
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index 521ee1c..5f961af 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -40,6 +40,7 @@
 #define BM_CHECKPOINT_NEEDED	(1 << 7)		/* must write for checkpoint */
 #define BM_PERMANENT			(1 << 8)		/* permanent relation (not
 												 * unlogged) */
+#define BM_NEW					(1 << 9)		/* Not guaranteed to exist on disk */
 
 typedef bits16 BufFlags;
 
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index ec0a254..b52591f 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -153,6 +153,7 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
 						  ReadBufferMode mode, BufferAccessStrategy strategy);
+extern Buffer ExtendRelation(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 69a624f..07a331c 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -94,6 +94,8 @@ extern void smgrprefetch(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber blocknum);
 extern void smgrread(SMgrRelation reln, ForkNumber forknum,
 		 BlockNumber blocknum, char *buffer);
+extern int smgrtryread(SMgrRelation reln, ForkNumber forknum,
+		 BlockNumber blocknum, char *buffer);
 extern void smgrwrite(SMgrRelation reln, ForkNumber forknum,
 		  BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
@@ -114,12 +116,15 @@ extern void mdclose(SMgrRelation reln, ForkNumber forknum);
 extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
 extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
 extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
+extern void mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdappend(SMgrRelation reln, ForkNumber forknum,
 		 BlockNumber blocknum, char *buffer, bool skipFsync);
 extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber blocknum);
 extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
 	   char *buffer);
+extern int mdtryread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+	   char *buffer);
 extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
 		BlockNumber blocknum, char *buffer, bool skipFsync);
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-- 
2.3.0.149.gf3f4077.dirty
