Good day, Kyotaro Horiguchi and hackers.

On Thu, 17/02/2022 at 14:16 +0900, Kyotaro Horiguchi wrote:
> At Wed, 16 Feb 2022 10:40:56 +0300, Yura Sokolov <y.soko...@postgrespro.ru> wrote in 
> > Hello, all.
> > 
> > I thought about patch simplification, and tested version
> > without BufTable and dynahash api change at all.
> > 
> > It performs surprisingly well. It is just a bit worse
> > than v1 since there is more contention around dynahash's
> > freelist, but most of the improvement remains.
> > 
> > I'll finish benchmarking and will attach graphs with
> > next message. Patch is attached here.
> 
> Thanks for the new patch.  The patch as a whole looks fine to me. But
> some comments need to be revised.

Thank you for the review and remarks.

> 
> (existing comments)
> > * To change the association of a valid buffer, we'll need to have
> > * exclusive lock on both the old and new mapping partitions.
> ...
> > * Somebody could have pinned or re-dirtied the buffer while we were
> > * doing the I/O and making the new hashtable entry.  If so, we can't
> > * recycle this buffer; we must undo everything we've done and start
> > * over with a new victim buffer.
> 
> We no longer take a lock on the new partition and have no new hash
> entry (if others have not yet created one) at this point.

fixed

> +        * Clear out the buffer's tag and flags.  We must do this to ensure that
> +        * linear scans of the buffer array don't think the buffer is valid. We
> 
> The reason we can clear out the tag is that it's safe to use the victim
> buffer at this point. This comment needs to mention that reason.

Tried to describe.

> +        *
> +        * Since we are single pinner, there should no be PIN_COUNT_WAITER or
> +        * IO_IN_PROGRESS (flags that were not cleared in previous code).
> +        */
> +       Assert((oldFlags & (BM_PIN_COUNT_WAITER | BM_IO_IN_PROGRESS)) == 0);
> 
> It seems to be a test for potential bugs in other functions.  As
> the comment is saying, we are sure that no other processes are pinning
> the buffer, and the existing code doesn't seem to care about that
> condition.  Is it really needed?

Ok, I agree this check is excessive.
These two flags were not cleared in the previous code, and I don't get
why. Probably it is just a historical accident.

> 
> +       /*
> +        * Try to make a hashtable entry for the buffer under its new tag. This
> +        * could fail because while we were writing someone else allocated another
> 
> The most significant point of this patch is the reason that the victim
> buffer is protected from stealing until it is set up for the new tag. I
> think we need an explanation about the protection here.

I don't quite get what you mean :( . I would appreciate your
suggestion for this comment.

> 
> 
> +        * buffer for the same block we want to read in. Note that we have not yet
> +        * removed the hashtable entry for the old tag.
> 
> Since we have removed the hash table entry for the old tag at this
> point, the comment is now wrong.

Thanks, changed.

> +                * the first place.  First, give up the buffer we were planning to use
> +                * and put it to free lists.
> ..
> +               StrategyFreeBuffer(buf);
> 
> This is one downside of this patch. But it seems to me that the odds
> are low that many buffers are freed in a short time by this logic.  By
> the way, it would be better if the sentence starting with "First" had a
> separate comment section.

Split the comment.

> (existing comment)
> |        * Okay, it's finally safe to rename the buffer.
> 
> We don't "rename" the buffer here.  And the safety is already
> establishsed at the end of the oldPartitionLock section. So it would
> be just something like "Now allocate the victim buffer for the new
> tag"?

Changed to "Now it is safe to use victim buffer for new tag."


There is also a tiny code change at block reuse finalization: instead
of a LockBufHdr+UnlockBufHdr pair I use a single atomic_fetch_or,
retried under WaitBufHdrUnlocked if the header happens to be locked.
I've tried to explain its safety. Please check it.
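
In short, the finalization becomes the following (a simplified sketch
of what the attached patch does; see its last hunk for the real code):

	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
	while (unlikely(buf_state & BM_LOCKED))
	{
		/*
		 * Somebody holds the buffer header lock.  UnlockBufHdr does a
		 * raw write to the state, so the bits we OR-ed in could be
		 * lost; wait for the lock to be released and apply them again.
		 */
		WaitBufHdrUnlocked(buf);
		buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
	}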


Benchmarks:
- base point is 6ce16088bfed97f9
- notebook with i7-1165G7 and server with Xeon 8354H (1 & 2 sockets)
- pgbench select-only, scale 100 (1.5GB on disk)
- two shared_buffers values: 128MB and 1GB
- huge pages enabled
- second-best result from five runs (rough invocation sketched below)
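
A single run looked roughly like this (the flags are my reconstruction
from the setup above, not copied from the benchmark scripts, so treat
them as approximate; -c/-j vary per table row):

	pgbench -i -s 100 bench
	pgbench -S -M prepared -c 53 -j 53 -T 30 bench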

Notebook (tps):
  conns |     master |   patch_v3 |  master 1G | patch_v3 1G 
--------+------------+------------+------------+------------
      1 |      29508 |      29481 |      31774 |      32305 
      2 |      57139 |      56694 |      63393 |      62968 
      3 |      89759 |      90861 |     101873 |     102399 
      5 |     133491 |     134573 |     145451 |     145009 
      7 |     156739 |     155832 |     164633 |     164562 
     17 |     216863 |     216379 |     251923 |     251017 
     27 |     209532 |     209802 |     244202 |     243709 
     53 |     212615 |     213552 |     248107 |     250317 
     83 |     214446 |     218230 |     252414 |     252337 
    107 |     211276 |     217109 |     252762 |     250328 
    139 |     208070 |     214265 |     248350 |     249684 
    163 |     206764 |     214594 |     247369 |     250323 
    191 |     205478 |     213511 |     244597 |     246877 
    211 |     200976 |     212976 |     244035 |     245032 
    239 |     196588 |     211519 |     243897 |     245055 
    271 |     195813 |     209631 |     237457 |     242771 
    307 |     192724 |     208074 |     237658 |     241759 
    353 |     187847 |     207189 |     234548 |     239008 
    397 |     186942 |     205317 |     230465 |     238782

I don't get why the numbers changed since my first letter ))
But there is still no slowdown, and a measurable gain at 128MB shared
buffers.

Xeon, 1 socket (tps):

  conns |     master |   patch_v3 |  master 1G | patch_v3 1G 
--------+------------+------------+------------+------------
      1 |      41975 |      41799 |      52898 |      52715 
      2 |      77693 |      77531 |      97571 |      98547 
      3 |     114713 |     114533 |     142709 |     143579 
      5 |     188898 |     187241 |     239322 |     236682 
      7 |     261516 |     260249 |     329119 |     328562 
     17 |     521821 |     518981 |     672390 |     662987 
     27 |     555487 |     557019 |     674630 |     675703 
     53 |     868213 |     897097 |    1190734 |    1202575 
     83 |     868232 |     881705 |    1164997 |    1157764 
    107 |     850477 |     855169 |    1140597 |    1128027 
    139 |     816311 |     826756 |    1101471 |    1096197 
    163 |     794788 |     805946 |    1078445 |    1071535 
    191 |     765934 |     783209 |    1059497 |    1039936 
    211 |     738656 |     786171 |    1083356 |    1049025 
    239 |     713124 |     837040 |    1104629 |    1125969 
    271 |     692138 |     847741 |    1094432 |    1131968 
    307 |     682919 |     847939 |    1086306 |    1124649 
    353 |     679449 |     844596 |    1071482 |    1125980 
    397 |     676217 |     833009 |    1058937 |    1113496 

There is a small slowdown at some connection counts (17, 107-191),
and it is reproducible. Probably it is due to one more atomic write,
or perhaps to some other scheduling issue (processes block less on
the buffer manager but compete more on other resources). I could not
reliably determine why, because the change is too small, and `perf
record` itself harms performance more than that at this point.

This is the reason I've changed the finalization to an atomic_or
instead of the Lock+Unlock pair. The change helped a bit, but did
not remove the slowdown completely.

Xeon, 2 sockets (tps):

  conns |     master |   patch_v3 |  master 1G | patch_v3 1G 
--------+------------+------------+------------+------------
      1 |      44317 |      43747 |      53920 |      53759 
      2 |      81193 |      79976 |      99138 |      99213 
      3 |     120755 |     114481 |     148102 |     146494 
      5 |     190007 |     187384 |     232078 |     229627 
      7 |     258602 |     256657 |     325545 |     322417 
     17 |     551814 |     549041 |     692312 |     688204 
     27 |     787353 |     787916 |    1023509 |    1020995 
     53 |     973880 |     996019 |    1228274 |    1246128 
     83 |    1108442 |    1258589 |    1596292 |    1662586 
    107 |    1072188 |    1317229 |    1542401 |    1684603 
    139 |    1000446 |    1272759 |    1490757 |    1672507 
    163 |     967378 |    1224513 |    1461468 |    1660012 
    191 |     926010 |    1178067 |    1435317 |    1645886 
    211 |     909919 |    1148862 |    1417437 |    1629487 
    239 |     895944 |    1108579 |    1393530 |    1616824 
    271 |     880545 |    1078280 |    1374878 |    1608412 
    307 |     865560 |    1056988 |    1355164 |    1601066 
    353 |     857591 |    1033980 |    1330069 |    1586769 
    397 |     840374 |    1016690 |    1312257 |    1573376 

regards,
Yura Sokolov
Postgres Professional
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From 04b07d0627ec65ba3327dc8338d59dbd15c405d8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH v3] [PGPRO-5616] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.

Tags: lwlock_numa
---
 src/backend/storage/buffer/bufmgr.c | 206 ++++++++++++++--------------
 1 file changed, 105 insertions(+), 101 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index b4532948d3f..bb8b1cd2f4b 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1114,6 +1114,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferDesc *buf;
 	bool		valid;
 	uint32		buf_state;
+	uint32		new_bits;
 
 	/* create a tag so we can lookup the buffer */
 	INIT_BUFFERTAG(newTag, smgr->smgr_rnode.node, forkNum, blockNum);
@@ -1288,93 +1289,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1382,52 +1306,132 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * Clear out the buffer's tag and flags.  We must do this to ensure that
+	 * linear scans of the buffer array don't think the buffer is valid. We
+	 * also reset the usage_count since any recency of use of the old content
+	 * is no longer relevant.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * We have the buffer pinned and we are the only pinner at the moment, so
+	 * there are no other pinners.  We hold the buffer header lock, and the
+	 * exclusive partition lock if the tag is valid.  Given that, it is safe
+	 * to clear the tag since no other process can inspect it at the moment.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete the old tag from the hash table if it was valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will have
+	 * to return our buffer to the free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it back
+		 * on the free list.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now it is safe to use the victim buffer for the new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
+	 *
+	 * We use a direct atomic OR instead of Lock+Unlock since no other backend
+	 * could be interested in the buffer.  But StrategyGetBuffer,
+	 * Flush*Buffers and Drop*Buffers scan all buffers and lock them to
+	 * compare tags, and UnlockBufHdr does a raw write to the state.  So we
+	 * have to spin if we find the buffer locked.
+	 *
+	 * Note that we write the tag unlocked.  That is also safe since there is
+	 * always a check for BM_VALID when the tag is compared.
 	 */
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
-		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
+		new_bits = BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
-		buf_state |= BM_TAG_VALID | BUF_USAGECOUNT_ONE;
-
-	UnlockBufHdr(buf, buf_state);
+		new_bits = BM_TAG_VALID | BUF_USAGECOUNT_ONE;
 
-	if (oldPartitionLock != NULL)
+	buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
+	while (unlikely(buf_state & BM_LOCKED))
 	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
+		WaitBufHdrUnlocked(buf);
+		buf_state = pg_atomic_fetch_or_u32(&buf->state, new_bits);
 	}
 
 	LWLockRelease(newPartitionLock);
-- 
2.35.1

<<attachment: res.zip>>
