Good day, Kyotaro-san.
Good day, hackers.

On Sun, 20/03/2022 at 12:38 +0300, Yura Sokolov wrote:
> On Thu, 17/03/2022 at 12:02 +0900, Kyotaro Horiguchi wrote:
> > At Wed, 16 Mar 2022 14:11:58 +0300, Yura Sokolov <y.soko...@postgrespro.ru> 
> > wrote in 
> > > On Wed, 16/03/2022 at 12:07 +0900, Kyotaro Horiguchi wrote:
> > > > At Tue, 15 Mar 2022 13:47:17 +0300, Yura Sokolov 
> > > > <y.soko...@postgrespro.ru> wrote in 
> > > > In v7, HASH_ENTER returns the element stored in DynaHashReuse using
> > > > the freelist_idx of the new key.  v8 uses that of the old key (at the
> > > > time of HASH_REUSE).  So in the "REUSE->ENTER(elem exists and
> > > > returns the stashed)" case the stashed element is returned to its
> > > > original partition.  But it is not what I mentioned.
> > > > 
> > > > On the other hand, once the stashed element is reused by HASH_ENTER,
> > > > it gives the same resulting state as the HASH_REMOVE->HASH_ENTER(borrow
> > > > from old partition) case.  I suspect that the frequent freelist
> > > > starvation comes from the latter case.
> > > 
> > > Doubtful. By probability theory, a single partition is unlikely to
> > > overflow too much. Hence, the freelists.
> > 
> > Yeah.  I think so generally.
> > 
> > > But! With 128kB shared buffers there are just 32 buffers. With 32 entries
> > > for 32 freelist partitions, some freelist partitions will certainly
> > > have 0 entries even if all entries are in freelists. 
> > 
> > Anyway, it's an extreme condition and the starvation happens only at a
> > negligible ratio.
> > 
> > > > RETURNED: 2
> > > > ALLOCED: 0
> > > > BORROWED: 435
> > > > REUSED: 495444
> > > > ASSIGNED: 495467 (-23)
> > > > 
> > > > Now "BORROWED" happens 0.8% of REUSED
> > > 
> > > 0.08% actually :)
> > 
> > Mmm.  Doesn't matter:p
> > 
> > > > > > > I lost access to Xeon 8354H, so returned to old Xeon X5675.
> > > > > > ...
> > > > > > > Strange thing: both master and patched versions have higher
> > > > > > > peak tps on X5675 at medium connection counts (17 or 27 clients)
> > > > > > > than in the first October version [1]. But lower tps at higher
> > > > > > > connection counts (>= 191 clients).
> > > > > > > I'll try to bisect on master this unfortunate change.
> > ...
> > > I've checked. Looks like something has changed on the server, since the
> > > old master commit now behaves the same as the new one (and differently
> > > from how it behaved in October).
> > > I remember maintenance downtime of the server in November/December.
> > > Probably the kernel was upgraded or some system settings were changed.
> > 
> > One thing I'm a little concerned about is that the numbers show a steady
> > 1-2% degradation for connection numbers < 17.
> > 
> > I think there are two possible causes of the degradation.
> > 
> > 1. The additional branch from consolidating HASH_ASSIGN into HASH_ENTER.
> >   This might cause degradation for memory-contended use.
> > 
> > 2. The nalloced operation might cause degradation on non-shared dynahashes?
> >   I believe it doesn't, but I'm not sure.
> > 
> >   In a simple benchmark with pgbench on a laptop, dynahash
> >   allocation (including shared and non-shared) happened about 50
> >   times per second with 10 processes and 200 with 100 processes.
> > 
> > > > I don't think nalloced needs to be the same width as long.  For
> > > > platforms with a 32-bit long, the possible degradation, if any, from a
> > > > 64-bit atomic there doesn't matter anyway.  So why don't we always
> > > > define the atomic as 64-bit and use the pg_atomic_* functions directly?
> > > 
> > > Some 32-bit platforms have no native 64-bit atomics. There they are
> > > emulated with locks.
> > > 
> > > Well, and for 32-bit platforms long is just enough. Why spend another
> > > 4 bytes per dynahash?
> > 
> > I don't think the additional bytes matter, but emulated atomic
> > operations can matter. However, I'm not sure which platforms use those
> > fallback implementations.  (x86 seems to have had __sync_fetch_and_add()
> > since the P4).
> > 
> > My opinion in the previous mail was that if that level of degradation
> > caused by emulated atomic operations matters, we shouldn't use atomics
> > there at all, since atomic operations on modern platforms are not
> > free either.
> > 
> > In relation to 2 above, if we observe that the degradation disappears
> > when (tentatively) using non-atomic operations for nalloced, we should go
> > back to the previous per-freelist nalloced.
> 
> Here is a version with nalloced being a union of the appropriate atomic
> type and long.
> 

Ok, I got access to a stronger server, did the benchmarks, found weird
things, and so here is a new version :-)

First, I found that if the table size is strictly limited to NBuffers and
HASH_FIXED_SIZE is set, then under high concurrency get_hash_entry may fail
to find a free entry even though one must exist. It seems that while one
process scans the free lists, other concurrent processes "move entries
around": one concurrent process fetches an entry from one free list, another
process puts a new entry into a different freelist, and the unfortunate
process misses it because it tests each freelist only once.
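
As a toy illustration (a hypothetical, single-threaded model, not dynahash
code), the following sketch shows how a single pass can fail even though a
free entry exists at every moment of the scan:

    #include <stdio.h>

    #define NLISTS 4

    int
    main(void)
    {
        /* exactly one entry is free at any moment of the scan */
        int     nfree[NLISTS] = {0, 0, 1, 0};
        int     found = 0;

        for (int i = 0; i < NLISTS; i++)
        {
            if (nfree[i] > 0)
            {
                found = 1;
                break;
            }
            /*
             * Simulated concurrent backend: right after list 1 is checked,
             * it takes the entry from list 2 and frees another entry into
             * the already-visited list 1.
             */
            if (i == 1)
            {
                nfree[2] = 0;
                nfree[1] = 1;
            }
        }
        puts(found ? "found a free entry"
                   : "missed: a free entry existed during the whole scan");
        return 0;
    }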

Second, I confirm there is a problem with freelist spreading.
If I keep the old entry's freelist_idx, then one freelist becomes crowded.
If I use the new entry's freelist_idx, then one freelist gets drained
constantly.

Third, I found that increased concurrency can hurt. When a popular block is
evicted for some reason, a thundering herd effect occurs: many backends want
to read the same block, they evict many other buffers, but only one is
inserted; the others go to the freelist. The evicted buffers by themselves
reduce the cache hit ratio and provoke more work. The old version resists
this effect by not removing the old buffer before the new entry is
successfully inserted.

To fix these issues I did the following:

# Concurrency

First, I limit concurrency by introducing another lwlock tranche -
BufferEvict. It is 8 times larger than the BufferMapping tranche (1024 vs
128 partitions).
If a backend doesn't find the buffer in the buffer table and wants to insert
it, it first calls
    LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE)
If the lock is acquired, the backend proceeds to the eviction and replacement
process. Otherwise, it waits for the lock to be released and repeats the
search.

This greatly improves performance for > 400 clients in pgbench.
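
In BufferAlloc() terms, the control flow is roughly as follows (a condensed
sketch of the second patch below; pinning, I/O and error handling are
omitted):

    retry:
        LWLockAcquire(newPartitionLock, LW_SHARED);
        buf_id = BufTableLookup(&newTag, newHash);
        if (buf_id >= 0)
        {
            /* block is already in the buffer pool: pin it and return */
        }
        LWLockRelease(newPartitionLock);

        /*
         * Only one backend per eviction partition proceeds to evict a victim
         * and load the block.  The losers sleep on the lock and, once it is
         * released, retry the lookup: by then the winner has most likely
         * inserted the block into the buffer table.
         */
        if (!LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE))
            goto retry;

        /* ... pick a victim, evict it, insert the new tag, release locks ... */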

I tried another variant as well:
- first insert an entry with a dummy buffer index into the buffer table;
- if such an entry is already there, wait for it to be filled;
- otherwise find a victim buffer and replace the dummy index with the real one.
The wait was done with a shared lock on EvictPartitionLock as well.
This variant performed about the same.

Logically I like that variant more, but there is one gotcha:
FlushBuffer could fail with elog(ERROR), so the entry with the dummy
index then has to be removed reliably.
And after all, I still need to hold EvictPartitionLock to notify the
waiters.

I've tried to use a ConditionVariable instead, but its performance was much
worse.

# Dynahash capacity and freelists.

I reverted the buffer table initialization:
- removed the HASH_FIXED_SIZE restriction introduced in the previous version;
- returned to `NBuffers + NUM_BUFFER_PARTITIONS`.
I really think there should be more spare items, since `element_alloc` is
almost always called at least once (with 128MB shared_buffers). But let's
keep it as is for now.

`get_hash_entry` was changed to probe NUM_FREELISTS/4 (== 8) freelists
before falling back to `element_alloc`, and probing was changed from
linear to quadratic. This greatly reduces the number of calls to
`element_alloc`, so more shared memory is left intact. And I didn't notice
a large performance hit from it. Probably there is some, but I think it is
an adequate trade-off.
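
As a standalone illustration of the probe order (a minimal sketch, not the
patched `get_hash_entry` itself): with a power-of-two NUM_FREELISTS the
quadratic step visits every freelist exactly once before repeating, which is
what the assertion in the patch below relies on.

    #include <stdio.h>

    #define NUM_FREELISTS 32        /* must be a power of 2 */

    int
    main(void)
    {
        int     idx = 5;            /* freelist_idx of the requested hashcode */
        int     d = 1;

        /*
         * Same step as in the patch: (idx + d++) & (NUM_FREELISTS - 1).
         * The offsets are triangular numbers, which enumerate all residues
         * modulo a power of two, so all 32 lists are printed exactly once.
         */
        for (int i = 0; i < NUM_FREELISTS; i++)
        {
            printf("%d ", idx);
            idx = (idx + d++) & (NUM_FREELISTS - 1);
        }
        printf("\n");
        return 0;
    }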

`free_reused_entry` now returns the entry to a random freelist. This
flattens the spread of free entries, although it is not enough without the
other changes (the thundering herd mitigation and probing more lists in
`get_hash_entry`).

# Benchmarks

Benchmarked on a two-socket Xeon(R) Gold 5220 CPU @ 2.20GHz:
18 cores per socket + hyper-threading, up to 72 virtual cores total.
Turbo-boost disabled.
Linux 5.10.103-1 (Debian).

pgbench scale 100 simple_select + simple select with 3 keys (sql file
attached).

shared buffers 128MB & 1GB
huge_pages=on

1 socket
  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      27882 |      27738 |      32735 |      32439 
      2 |      54082 |      54336 |      64387 |      63846 
      3 |      80724 |      81079 |      96387 |      94439 
      5 |     134404 |     133429 |     160085 |     157399 
      7 |     185977 |     184502 |     219916 |     217142 
     17 |     335345 |     338214 |     393112 |     388796 
     27 |     393686 |     394948 |     447945 |     444915 
     53 |     572234 |     577092 |     678884 |     676493 
     83 |     558875 |     561689 |     669212 |     655697 
    107 |     553054 |     551896 |     654550 |     646010 
    139 |     541263 |     538354 |     641937 |     633840 
    163 |     532932 |     531829 |     635127 |     627600 
    191 |     524647 |     524442 |     626228 |     617347 
    211 |     521624 |     522197 |     629740 |     613143 
    239 |     509448 |     554894 |     652353 |     652972 
    271 |     468190 |     557467 |     647403 |     661348 
    307 |     454139 |     558694 |     642229 |     657649 
    353 |     446853 |     554301 |     635991 |     654571 
    397 |     441909 |     549822 |     625194 |     647973 

1 socket 3 keys

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      16677 |      16477 |      22219 |      22030 
      2 |      32056 |      31874 |      43298 |      43153 
      3 |      48091 |      47766 |      64877 |      64600 
      5 |      78999 |      78609 |     105433 |     106101 
      7 |     108122 |     107529 |     148713 |     145343 
     17 |     205656 |     209010 |     272676 |     271449 
     27 |     252015 |     254000 |     323983 |     323499 
     53 |     317928 |     334493 |     446740 |     449641 
     83 |     299234 |     327738 |     437035 |     443113 
    107 |     290089 |     322025 |     430535 |     431530 
    139 |     277294 |     314384 |     422076 |     423606 
    163 |     269029 |     310114 |     416229 |     417412 
    191 |     257315 |     306530 |     408487 |     416170 
    211 |     249743 |     304278 |     404766 |     416393 
    239 |     243333 |     310974 |     397139 |     428167 
    271 |     236356 |     309215 |     389972 |     427498 
    307 |     229094 |     307519 |     382444 |     425891 
    353 |     224385 |     305366 |     375020 |     423284 
    397 |     218549 |     302577 |     364373 |     420846 

2 sockets

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      27287 |      27631 |      32943 |      32493 
      2 |      52397 |      54011 |      64572 |      63596 
      3 |      76157 |      80473 |      93363 |      93528 
      5 |     127075 |     134310 |     153176 |     149984 
      7 |     177100 |     176939 |     216356 |     211599 
     17 |     379047 |     383179 |     464249 |     470351 
     27 |     545219 |     546706 |     664779 |     662488 
     53 |     728142 |     728123 |     857454 |     869407 
     83 |     918276 |     957722 |    1215252 |    1203443 
    107 |     884112 |     971797 |    1206930 |    1234606 
    139 |     822564 |     970920 |    1167518 |    1233230 
    163 |     788287 |     968248 |    1130021 |    1229250 
    191 |     772406 |     959344 |    1097842 |    1218541 
    211 |     756085 |     955563 |    1077747 |    1209489 
    239 |     732926 |     948855 |    1050096 |    1200878 
    271 |     692999 |     941722 |    1017489 |    1194012 
    307 |     668241 |     920478 |     994420 |    1179507 
    353 |     642478 |     908645 |     968648 |    1174265 
    397 |     617673 |     893568 |     950736 |    1173411 

2 sockets 3 keys

  conns |     master |  patch-v11 |  master 1G | patch-v11 1G 
--------+------------+------------+------------+------------
      1 |      16722 |      16393 |      20340 |      21813 
      2 |      32057 |      32009 |      39993 |      42959 
      3 |      46202 |      47678 |      59216 |      64374 
      5 |      78882 |      72002 |      98054 |     103731 
      7 |     103398 |      99538 |     135098 |     135828 
     17 |     205863 |     217781 |     293958 |     299690 
     27 |     283526 |     290539 |     414968 |     411219 
     53 |     336717 |     356130 |     460596 |     474563 
     83 |     307310 |     342125 |     419941 |     469989 
    107 |     294059 |     333494 |     405706 |     469593 
    139 |     278453 |     328031 |     390984 |     470553 
    163 |     270833 |     326457 |     384747 |     470977 
    191 |     259591 |     322590 |     376582 |     470335 
    211 |     263584 |     321263 |     375969 |     469443 
    239 |     257135 |     316959 |     370108 |     470904 
    271 |     251107 |     315393 |     365794 |     469517 
    307 |     246605 |     311585 |     360742 |     467566 
    353 |     236899 |     308581 |     353464 |     466936 
    397 |     249036 |     305042 |     344673 |     466842 

I skipped v10 since I used it internally for the variant
"insert entry with dummy index, then search for a victim".


------

regards

Yura Sokolov
y.soko...@postgrespro.ru
funny.fal...@gmail.com
From 68800f6f02f062320e6d9fe42c986809a06a37cb Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Mon, 21 Feb 2022 08:49:03 +0300
Subject: [PATCH 1/4] bufmgr: do not acquire two partition locks.

Acquiring two partition locks leads to a complex dependency chain that hurts
at high concurrency levels.

There is no need to hold both locks simultaneously. The buffer is pinned, so
other processes cannot select it for eviction. If the tag is cleared and the
buffer is removed from the old partition, other processes will not find it.
Therefore it is safe to release the old partition lock before acquiring the
new partition lock.

---
 src/backend/storage/buffer/bufmgr.c | 198 ++++++++++++++--------------
 1 file changed, 96 insertions(+), 102 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f5459c68f89..f7dbfc90aaa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1275,8 +1275,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		}
 
 		/*
-		 * To change the association of a valid buffer, we'll need to have
-		 * exclusive lock on both the old and new mapping partitions.
+		 * To change the association of a valid buffer, we'll need to reset
+		 * the tag first, so we need to have an exclusive lock on the old
+		 * mapping partition.
 		 */
 		if (oldFlags & BM_TAG_VALID)
 		{
@@ -1289,93 +1290,16 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 			oldHash = BufTableHashCode(&oldTag);
 			oldPartitionLock = BufMappingPartitionLock(oldHash);
 
-			/*
-			 * Must lock the lower-numbered partition first to avoid
-			 * deadlocks.
-			 */
-			if (oldPartitionLock < newPartitionLock)
-			{
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
-			else if (oldPartitionLock > newPartitionLock)
-			{
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-				LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
-			}
-			else
-			{
-				/* only one partition, only one lock */
-				LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
-			}
+			LWLockAcquire(oldPartitionLock, LW_EXCLUSIVE);
 		}
 		else
 		{
-			/* if it wasn't valid, we need only the new partition */
-			LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
 			/* remember we have no old-partition lock or tag */
 			oldPartitionLock = NULL;
 			/* keep the compiler quiet about uninitialized variables */
 			oldHash = 0;
 		}
 
-		/*
-		 * Try to make a hashtable entry for the buffer under its new tag.
-		 * This could fail because while we were writing someone else
-		 * allocated another buffer for the same block we want to read in.
-		 * Note that we have not yet removed the hashtable entry for the old
-		 * tag.
-		 */
-		buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
-
-		if (buf_id >= 0)
-		{
-			/*
-			 * Got a collision. Someone has already done what we were about to
-			 * do. We'll just handle this as if it were found in the buffer
-			 * pool in the first place.  First, give up the buffer we were
-			 * planning to use.
-			 */
-			UnpinBuffer(buf, true);
-
-			/* Can give up that buffer's mapping partition lock now */
-			if (oldPartitionLock != NULL &&
-				oldPartitionLock != newPartitionLock)
-				LWLockRelease(oldPartitionLock);
-
-			/* remaining code should match code at top of routine */
-
-			buf = GetBufferDescriptor(buf_id);
-
-			valid = PinBuffer(buf, strategy);
-
-			/* Can release the mapping lock as soon as we've pinned it */
-			LWLockRelease(newPartitionLock);
-
-			*foundPtr = true;
-
-			if (!valid)
-			{
-				/*
-				 * We can only get here if (a) someone else is still reading
-				 * in the page, or (b) a previous read attempt failed.  We
-				 * have to wait for any active read attempt to finish, and
-				 * then set up our own read attempt if the page is still not
-				 * BM_VALID.  StartBufferIO does it all.
-				 */
-				if (StartBufferIO(buf, true))
-				{
-					/*
-					 * If we get here, previous attempts to read the buffer
-					 * must have failed ... but we shall bravely try again.
-					 */
-					*foundPtr = false;
-				}
-			}
-
-			return buf;
-		}
-
 		/*
 		 * Need to lock the buffer header too in order to change its tag.
 		 */
@@ -1383,40 +1307,117 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/*
 		 * Somebody could have pinned or re-dirtied the buffer while we were
-		 * doing the I/O and making the new hashtable entry.  If so, we can't
-		 * recycle this buffer; we must undo everything we've done and start
-		 * over with a new victim buffer.
+		 * doing the I/O.  If so, we can't recycle this buffer; we must undo
+		 * everything we've done and start over with a new victim buffer.
 		 */
 		oldFlags = buf_state & BUF_FLAG_MASK;
 		if (BUF_STATE_GET_REFCOUNT(buf_state) == 1 && !(oldFlags & BM_DIRTY))
 			break;
 
 		UnlockBufHdr(buf, buf_state);
-		BufTableDelete(&newTag, newHash);
-		if (oldPartitionLock != NULL &&
-			oldPartitionLock != newPartitionLock)
+		if (oldPartitionLock != NULL)
 			LWLockRelease(oldPartitionLock);
-		LWLockRelease(newPartitionLock);
 		UnpinBuffer(buf, true);
 	}
 
 	/*
-	 * Okay, it's finally safe to rename the buffer.
+	 * We are the single pinner, we hold the buffer header lock and the
+	 * exclusive partition lock (if the tag is valid). It means no other
+	 * process can inspect it at the moment.
 	 *
-	 * Clearing BM_VALID here is necessary, clearing the dirtybits is just
-	 * paranoia.  We also reset the usage_count since any recency of use of
-	 * the old content is no longer relevant.  (The usage_count starts out at
-	 * 1 so that the buffer can survive one clock-sweep pass.)
+	 * But we will release partition lock and buffer header lock. We must be
+	 * sure other backend will not use this buffer until we reuse it for new
+	 * tag. Therefore, we clear out the buffer's tag and flags and remove it
+	 * from buffer table. Also buffer remains pinned to ensure
+	 * StrategyGetBuffer will not try to reuse the buffer concurrently.
+	 *
+	 * We also reset the usage_count since any recent use of the old
+	 * content is no longer relevant.
+	 */
+	CLEAR_BUFFERTAG(buf->tag);
+	buf_state &= ~(BUF_FLAG_MASK | BUF_USAGECOUNT_MASK);
+	UnlockBufHdr(buf, buf_state);
+
+	/* Delete old tag from hash table if it were valid. */
+	if (oldFlags & BM_TAG_VALID)
+		BufTableDelete(&oldTag, oldHash);
+
+	if (oldPartitionLock != newPartitionLock)
+	{
+		if (oldPartitionLock != NULL)
+			LWLockRelease(oldPartitionLock);
+		LWLockAcquire(newPartitionLock, LW_EXCLUSIVE);
+	}
+
+	/*
+	 * Try to make a hashtable entry for the buffer under its new tag. This
+	 * could fail because while we were writing someone else allocated another
+	 * buffer for the same block we want to read in. In that case we will have
+	 * to return our buffer to free list.
+	 */
+	buf_id = BufTableInsert(&newTag, newHash, buf->buf_id);
+
+	if (buf_id >= 0)
+	{
+		/*
+		 * Got a collision. Someone has already done what we were about to do.
+		 * We'll just handle this as if it were found in the buffer pool in
+		 * the first place.
+		 */
+
+		/*
+		 * First, give up the buffer we were planning to use and put it to
+		 * free lists.
+		 */
+		UnpinBuffer(buf, true);
+		StrategyFreeBuffer(buf);
+
+		/* remaining code should match code at top of routine */
+
+		buf = GetBufferDescriptor(buf_id);
+
+		valid = PinBuffer(buf, strategy);
+
+		/* Can release the mapping lock as soon as we've pinned it */
+		LWLockRelease(newPartitionLock);
+
+		*foundPtr = true;
+
+		if (!valid)
+		{
+			/*
+			 * We can only get here if (a) someone else is still reading in
+			 * the page, or (b) a previous read attempt failed.  We have to
+			 * wait for any active read attempt to finish, and then set up our
+			 * own read attempt if the page is still not BM_VALID.
+			 * StartBufferIO does it all.
+			 */
+			if (StartBufferIO(buf, true))
+			{
+				/*
+				 * If we get here, previous attempts to read the buffer must
+				 * have failed ... but we shall bravely try again.
+				 */
+				*foundPtr = false;
+			}
+		}
+
+		return buf;
+	}
+
+	/*
+	 * Now reuse victim buffer for new tag.
 	 *
 	 * Make sure BM_PERMANENT is set for buffers that must be written at every
 	 * checkpoint.  Unlogged buffers only need to be written at shutdown
 	 * checkpoints, except for their "init" forks, which need to be treated
 	 * just like permanent relations.
+	 *
+	 * The usage_count starts out at 1 so that the buffer can survive one
+	 * clock-sweep pass.
 	 */
+	buf_state = LockBufHdr(buf);
 	buf->tag = newTag;
-	buf_state &= ~(BM_VALID | BM_DIRTY | BM_JUST_DIRTIED |
-				   BM_CHECKPOINT_NEEDED | BM_IO_ERROR | BM_PERMANENT |
-				   BUF_USAGECOUNT_MASK);
 	if (relpersistence == RELPERSISTENCE_PERMANENT || forkNum == INIT_FORKNUM)
 		buf_state |= BM_TAG_VALID | BM_PERMANENT | BUF_USAGECOUNT_ONE;
 	else
@@ -1424,13 +1425,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 	UnlockBufHdr(buf, buf_state);
 
-	if (oldPartitionLock != NULL)
-	{
-		BufTableDelete(&oldTag, oldHash);
-		if (oldPartitionLock != newPartitionLock)
-			LWLockRelease(oldPartitionLock);
-	}
-
 	LWLockRelease(newPartitionLock);
 
 	/*
-- 
2.35.1


From 4e5695ec50a4ade734375ba88111a44e645e1d79 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Sun, 27 Mar 2022 14:36:58 +0300
Subject: [PATCH 2/4] prevent thundering herd and limit concurrency in bufmgr.

Benchmarks show that with a huge number of clients, concurrent evicters that
try to load the same page start to contend a lot on partition locks and
freelists.
The situation can be worse than the old behaviour with simultaneous lock
acquisition for the old and new partitions, since the page was not deleted
before the new page was inserted.

This patch adds another lock tranche that prevents concurrent loading of the
same buffer.

Tags: bufmgr
---
 src/backend/storage/buffer/bufmgr.c   | 11 +++++++++++
 src/backend/storage/buffer/freelist.c |  8 ++++----
 src/backend/storage/lmgr/lwlock.c     |  5 +++++
 src/include/storage/buf_internals.h   |  5 +++++
 src/include/storage/lwlock.h          |  6 +++++-
 5 files changed, 30 insertions(+), 5 deletions(-)

diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index f7dbfc90aaa..4c6c57e0ea6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1107,6 +1107,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	BufferTag	newTag;			/* identity of requested block */
 	uint32		newHash;		/* hash value for newTag */
 	LWLock	   *newPartitionLock;	/* buffer partition lock for it */
+	LWLock	   *newEvictPartitionLock;	/* buffer partition lock for it */
 	BufferTag	oldTag;			/* previous identity of selected buffer */
 	uint32		oldHash;		/* hash value for oldTag */
 	LWLock	   *oldPartitionLock;	/* buffer partition lock for it */
@@ -1122,7 +1123,9 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	/* determine its hash code and partition lock ID */
 	newHash = BufTableHashCode(&newTag);
 	newPartitionLock = BufMappingPartitionLock(newHash);
+	newEvictPartitionLock = BufEvictPartitionLock(newHash);
 
+retry:
 	/* see if the block is in the buffer pool already */
 	LWLockAcquire(newPartitionLock, LW_SHARED);
 	buf_id = BufTableLookup(&newTag, newHash);
@@ -1170,6 +1173,12 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	 */
 	LWLockRelease(newPartitionLock);
 
+	/*
+	 * Prevent "thundering herd" problem and limit concurrency.
+	 */
+	if (!LWLockAcquireOrWait(newEvictPartitionLock, LW_EXCLUSIVE))
+		goto retry;
+
 	/* Loop here in case we have to try another victim buffer */
 	for (;;)
 	{
@@ -1380,6 +1389,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		/* Can release the mapping lock as soon as we've pinned it */
 		LWLockRelease(newPartitionLock);
+		LWLockRelease(newEvictPartitionLock);
 
 		*foundPtr = true;
 
@@ -1426,6 +1436,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	UnlockBufHdr(buf, buf_state);
 
 	LWLockRelease(newPartitionLock);
+	LWLockRelease(newEvictPartitionLock);
 
 	/*
 	 * Buffer contents are currently invalid.  Try to obtain the right to
diff --git a/src/backend/storage/buffer/freelist.c b/src/backend/storage/buffer/freelist.c
index 3b98e68d50f..36218975200 100644
--- a/src/backend/storage/buffer/freelist.c
+++ b/src/backend/storage/buffer/freelist.c
@@ -481,10 +481,10 @@ StrategyInitialize(bool init)
 	 *
 	 * Since we can't tolerate running out of lookup table entries, we must be
 	 * sure to specify an adequate table size here.  The maximum steady-state
-	 * usage is of course NBuffers entries, but BufferAlloc() tries to insert
-	 * a new entry before deleting the old.  In principle this could be
-	 * happening in each partition concurrently, so we could need as many as
-	 * NBuffers + NUM_BUFFER_PARTITIONS entries.
+	 * usage is of course NBuffers entries.  But due to concurrent access
+	 * to the numerous free lists in dynahash we can miss a free entry that
+	 * has moved between free lists.  So it is better to have some spare free
+	 * entries to reduce the probability of entry allocation after server start.
 	 */
 	InitBufTable(NBuffers + NUM_BUFFER_PARTITIONS);
 
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 8f7f1b2f7c3..08e7cb6b031 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -155,6 +155,8 @@ static const char *const BuiltinTrancheNames[] = {
 	"LockFastPath",
 	/* LWTRANCHE_BUFFER_MAPPING: */
 	"BufferMapping",
+	/* LWTRANCHE_BUFFER_EVICT: */
+	"BufferEvict",
 	/* LWTRANCHE_LOCK_MANAGER: */
 	"LockManager",
 	/* LWTRANCHE_PREDICATE_LOCK_MANAGER: */
@@ -525,6 +527,9 @@ InitializeLWLocks(void)
 	lock = MainLWLockArray + BUFFER_MAPPING_LWLOCK_OFFSET;
 	for (id = 0; id < NUM_BUFFER_PARTITIONS; id++, lock++)
 		LWLockInitialize(&lock->lock, LWTRANCHE_BUFFER_MAPPING);
+	lock = MainLWLockArray + BUFFER_EVICT_LWLOCK_OFFSET;
+	for (id = 0; id < NUM_BUFFER_EVICT_PARTITIONS; id++, lock++)
+		LWLockInitialize(&lock->lock, LWTRANCHE_BUFFER_EVICT);
 
 	/* Initialize lmgrs' LWLocks in main array */
 	lock = MainLWLockArray + LOCK_MANAGER_LWLOCK_OFFSET;
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index b903d2bcaf0..a1bb6ce60a0 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -126,9 +126,14 @@ typedef struct buftag
  */
 #define BufTableHashPartition(hashcode) \
 	((hashcode) % NUM_BUFFER_PARTITIONS)
+#define BufTableEvictPartition(hashcode) \
+	((hashcode) % NUM_BUFFER_EVICT_PARTITIONS)
 #define BufMappingPartitionLock(hashcode) \
 	(&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + \
 		BufTableHashPartition(hashcode)].lock)
+#define BufEvictPartitionLock(hashcode) \
+	(&MainLWLockArray[BUFFER_EVICT_LWLOCK_OFFSET + \
+		BufTableEvictPartition(hashcode)].lock)
 #define BufMappingPartitionLockByIndex(i) \
 	(&MainLWLockArray[BUFFER_MAPPING_LWLOCK_OFFSET + (i)].lock)
 
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c3d5889d7b2..12960cb79f5 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -81,6 +81,7 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
 
 /* Number of partitions of the shared buffer mapping hashtable */
 #define NUM_BUFFER_PARTITIONS  128
+#define NUM_BUFFER_EVICT_PARTITIONS  (NUM_BUFFER_PARTITIONS * 8)
 
 /* Number of partitions the shared lock tables are divided into */
 #define LOG2_NUM_LOCK_PARTITIONS  4
@@ -92,8 +93,10 @@ extern PGDLLIMPORT int NamedLWLockTrancheRequests;
 
 /* Offsets for various chunks of preallocated lwlocks. */
 #define BUFFER_MAPPING_LWLOCK_OFFSET	NUM_INDIVIDUAL_LWLOCKS
-#define LOCK_MANAGER_LWLOCK_OFFSET		\
+#define BUFFER_EVICT_LWLOCK_OFFSET	\
 	(BUFFER_MAPPING_LWLOCK_OFFSET + NUM_BUFFER_PARTITIONS)
+#define LOCK_MANAGER_LWLOCK_OFFSET		\
+	(BUFFER_EVICT_LWLOCK_OFFSET + NUM_BUFFER_EVICT_PARTITIONS)
 #define PREDICATELOCK_MANAGER_LWLOCK_OFFSET \
 	(LOCK_MANAGER_LWLOCK_OFFSET + NUM_LOCK_PARTITIONS)
 #define NUM_FIXED_LWLOCKS \
@@ -179,6 +182,7 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_REPLICATION_SLOT_IO,
 	LWTRANCHE_LOCK_FASTPATH,
 	LWTRANCHE_BUFFER_MAPPING,
+	LWTRANCHE_BUFFER_EVICT,
 	LWTRANCHE_LOCK_MANAGER,
 	LWTRANCHE_PREDICATE_LOCK_MANAGER,
 	LWTRANCHE_PARALLEL_HASH_JOIN,
-- 
2.35.1


From 0e4ce63c9fb8c53565af1b1ef11e0c7c224663ec Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Mon, 28 Feb 2022 12:19:17 +0300
Subject: [PATCH 3/4] Add HASH_REUSE and use it in BufTable.

Avoid dynahash's freelist locking when BufferAlloc reuses a buffer for a
different tag.

HASH_REUSE acts like HASH_REMOVE, but stores the element to reuse in a static
variable instead of a freelist partition. HASH_ENTER may then use that
element.

Unfortunately, FreeListData->nentries had to be manipulated even in this
case. So instead of manipulating nentries, we replace nentries with
nfree (the actual length of the free list) and add nalloced (the count of
allocated entries). This was suggested by Robert Haas in
https://postgr.es/m/CA%2BTgmoZkg-04rcNRURt%3DjAG0Cs5oPyB-qKxH4wqX09e-oXy-nw%40mail.gmail.com

Also, get_hash_entry is modified to try NUM_FREELISTS/4 partitions before
falling back to the allocator. This greatly reduces the need for shared
allocation without noticeable harm to performance.

---
 src/backend/storage/buffer/buf_table.c |   7 +-
 src/backend/storage/buffer/bufmgr.c    |   4 +-
 src/backend/utils/hash/dynahash.c      | 271 ++++++++++++++++++-------
 src/include/storage/buf_internals.h    |   2 +-
 src/include/utils/hsearch.h            |   3 +-
 5 files changed, 213 insertions(+), 74 deletions(-)

diff --git a/src/backend/storage/buffer/buf_table.c b/src/backend/storage/buffer/buf_table.c
index dc439940faa..c189555751e 100644
--- a/src/backend/storage/buffer/buf_table.c
+++ b/src/backend/storage/buffer/buf_table.c
@@ -143,10 +143,13 @@ BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id)
  * BufTableDelete
  *		Delete the hashtable entry for given tag (which must exist)
  *
+ * If the reuse flag is true, the deleted entry is cached for reuse, and the
+ * caller must call BufTableInsert next.
+ *
  * Caller must hold exclusive lock on BufMappingLock for tag's partition
  */
 void
-BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
+BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse)
 {
 	BufferLookupEnt *result;
 
@@ -154,7 +157,7 @@ BufTableDelete(BufferTag *tagPtr, uint32 hashcode)
 		hash_search_with_hash_value(SharedBufHash,
 									(void *) tagPtr,
 									hashcode,
-									HASH_REMOVE,
+									reuse ? HASH_REUSE : HASH_REMOVE,
 									NULL);
 
 	if (!result)				/* shouldn't happen */
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 4c6c57e0ea6..a5a34133d29 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1349,7 +1349,7 @@ retry:
 
 	/* Delete old tag from hash table if it were valid. */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, true);
 
 	if (oldPartitionLock != newPartitionLock)
 	{
@@ -1545,7 +1545,7 @@ retry:
 	 * Remove the buffer from the lookup hashtable, if it was in there.
 	 */
 	if (oldFlags & BM_TAG_VALID)
-		BufTableDelete(&oldTag, oldHash);
+		BufTableDelete(&oldTag, oldHash, false);
 
 	/*
 	 * Done with mapping lock.
diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 3babde8d704..436b6f5af41 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -14,7 +14,7 @@
  * a hash table in partitioned mode, the HASH_PARTITION flag must be given
  * to hash_create.  This prevents any attempt to split buckets on-the-fly.
  * Therefore, each hash bucket chain operates independently, and no fields
- * of the hash header change after init except nentries and freeList.
+ * of the hash header change after init except nfree and freeList.
  * (A partitioned table uses multiple copies of those fields, guarded by
  * spinlocks, for additional concurrency.)
  * This lets any subset of the hash buckets be treated as a separately
@@ -98,6 +98,8 @@
 
 #include "access/xact.h"
 #include "common/hashfn.h"
+#include "common/pg_prng.h"
+#include "port/atomics.h"
 #include "port/pg_bitutils.h"
 #include "storage/shmem.h"
 #include "storage/spin.h"
@@ -138,8 +140,7 @@ typedef HASHBUCKET *HASHSEGMENT;
  *
  * In a partitioned hash table, each freelist is associated with a specific
  * set of hashcodes, as determined by the FREELIST_IDX() macro below.
- * nentries tracks the number of live hashtable entries having those hashcodes
- * (NOT the number of entries in the freelist, as you might expect).
+ * nfree tracks the actual number of free hashtable entries in the freelist.
  *
  * The coverage of a freelist might be more or less than one partition, so it
  * needs its own lock rather than relying on caller locking.  Relying on that
@@ -147,16 +148,26 @@ typedef HASHBUCKET *HASHSEGMENT;
  * need to "borrow" entries from another freelist; see get_hash_entry().
  *
  * Using an array of FreeListData instead of separate arrays of mutexes,
- * nentries and freeLists helps to reduce sharing of cache lines between
+ * nfree and freeLists helps to reduce sharing of cache lines between
  * different mutexes.
  */
 typedef struct
 {
 	slock_t		mutex;			/* spinlock for this freelist */
-	long		nentries;		/* number of entries in associated buckets */
+	long		nfree;			/* number of free entries in the list */
 	HASHELEMENT *freeList;		/* chain of free elements */
 } FreeListData;
 
+typedef union
+{
+#if SIZEOF_LONG == 4
+	pg_atomic_uint32 a;
+#else
+	pg_atomic_uint64 a;
+#endif
+	long		l;
+}			nalloced_t;
+
 /*
  * Header structure for a hash table --- contains all changeable info
  *
@@ -170,7 +181,7 @@ struct HASHHDR
 	/*
 	 * The freelist can become a point of contention in high-concurrency hash
 	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nentries count, instead of just a single one.  Although the freelists
+	 * nfree count, instead of just a single one.  Although the freelists
 	 * normally operate independently, we will scavenge entries from freelists
 	 * other than a hashcode's default freelist when necessary.
 	 *
@@ -195,6 +206,7 @@ struct HASHHDR
 	long		ssize;			/* segment size --- must be power of 2 */
 	int			sshift;			/* segment shift = log2(ssize) */
 	int			nelem_alloc;	/* number of entries to allocate at once */
+	nalloced_t	nalloced;		/* number of entries allocated */
 
 #ifdef HASH_STATISTICS
 
@@ -254,6 +266,15 @@ struct HTAB
  */
 #define MOD(x,y)			   ((x) & ((y)-1))
 
+/*
+ * Struct for reuse element.
+ */
+struct HASHREUSE
+{
+	HTAB	   *hashp;
+	HASHBUCKET	element;
+};
+
 #ifdef HASH_STATISTICS
 static long hash_accesses,
 			hash_collisions,
@@ -269,6 +290,7 @@ static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
 static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
+static void free_reused_entry(HTAB *hashp);
 static void hdefault(HTAB *hashp);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
@@ -293,6 +315,12 @@ DynaHashAlloc(Size size)
 }
 
 
+/*
+ * Support for HASH_REUSE + HASH_ENTER
+ */
+static struct HASHREUSE DynaHashReuse = {NULL, NULL};
+
+
 /*
  * HashCompareFunc for string keys
  *
@@ -306,6 +334,42 @@ string_compare(const char *key1, const char *key2, Size keysize)
 	return strncmp(key1, key2, keysize - 1);
 }
 
+static inline long
+hctl_nalloced(HASHHDR *hctl)
+{
+	if (IS_PARTITIONED(hctl))
+#if SIZEOF_LONG == 4
+		return (long) pg_atomic_read_u32(&hctl->nalloced.a);
+#else
+		return (long) pg_atomic_read_u64(&hctl->nalloced.a);
+#endif
+	return hctl->nalloced.l;
+}
+
+static inline void
+hctl_nalloced_add(HASHHDR *hctl, long v)
+{
+	if (IS_PARTITIONED(hctl))
+#if SIZEOF_LONG == 4
+		pg_atomic_fetch_add_u32(&hctl->nalloced.a, (int32) v);
+#else
+		pg_atomic_fetch_add_u64(&hctl->nalloced.a, (int64) v);
+#endif
+	else
+		hctl->nalloced.l += v;
+}
+
+static inline void
+hctl_nalloced_init(HASHHDR *hctl)
+{
+	hctl->nalloced.l = 0;
+	if (IS_PARTITIONED(hctl))
+#if SIZEOF_LONG == 4
+		pg_atomic_init_u32(&hctl->nalloced.a, 0);
+#else
+		pg_atomic_init_u64(&hctl->nalloced.a, 0);
+#endif
+}
 
 /************************** CREATE ROUTINES **********************/
 
@@ -534,6 +598,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 		hctl->num_partitions = info->num_partitions;
 	}
 
+	hctl_nalloced_init(hctl);
+
 	if (flags & HASH_SEGMENT)
 	{
 		hctl->ssize = info->ssize;
@@ -932,6 +998,8 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  *		HASH_ENTER: look up key in table, creating entry if not present
  *		HASH_ENTER_NULL: same, but return NULL if out of memory
  *		HASH_REMOVE: look up key in table, remove entry if present
+ *		HASH_REUSE: same as HASH_REMOVE, but stores removed element in static
+ *					variable instead of free list.
  *
  * Return value is a pointer to the element found/entered/removed if any,
  * or NULL if no match was found.  (NB: in the case of the REMOVE action,
@@ -943,6 +1011,11 @@ calc_bucket(HASHHDR *hctl, uint32 hash_val)
  * HASH_ENTER_NULL cannot be used with the default palloc-based allocator,
  * since palloc internally ereports on out-of-memory.
  *
+ * If HASH_REUSE was called, then the next dynahash operation must be
+ * HASH_ENTER on the same dynahash instance; otherwise an assertion is
+ * triggered.  HASH_ENTER will reuse the element stored with HASH_REUSE if
+ * no duplicate entry is found.
+ *
  * If foundPtr isn't NULL, then *foundPtr is set true if we found an
  * existing entry in the table, false otherwise.  This is needed in the
  * HASH_ENTER case, but is redundant with the return value otherwise.
@@ -1000,7 +1073,10 @@ hash_search_with_hash_value(HTAB *hashp,
 		 * Can't split if running in partitioned mode, nor if frozen, nor if
 		 * table is the subject of any active hash_seq_search scans.
 		 */
-		if (hctl->freeList[0].nentries > (long) hctl->max_bucket &&
+		long		nentries;
+
+		nentries = hctl_nalloced(hctl) - hctl->freeList[0].nfree;
+		if (nentries > (long) hctl->max_bucket &&
 			!IS_PARTITIONED(hctl) && !hashp->frozen &&
 			!has_seq_scans(hashp))
 			(void) expand_table(hashp);
@@ -1044,6 +1120,9 @@ hash_search_with_hash_value(HTAB *hashp,
 	if (foundPtr)
 		*foundPtr = (bool) (currBucket != NULL);
 
+	/* Check there is no unfinished HASH_REUSE + HASH_ENTER pair */
+	Assert(action == HASH_ENTER || DynaHashReuse.element == NULL);
+
 	/*
 	 * OK, now what?
 	 */
@@ -1057,20 +1136,17 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_REMOVE:
 			if (currBucket != NULL)
 			{
-				/* if partitioned, must lock to touch nentries and freeList */
+				/* if partitioned, must lock to touch nfree and freeList */
 				if (IS_PARTITIONED(hctl))
 					SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
 
-				/* delete the record from the appropriate nentries counter. */
-				Assert(hctl->freeList[freelist_idx].nentries > 0);
-				hctl->freeList[freelist_idx].nentries--;
-
 				/* remove record from hash bucket's chain. */
 				*prevBucketPtr = currBucket->link;
 
 				/* add the record to the appropriate freelist. */
 				currBucket->link = hctl->freeList[freelist_idx].freeList;
 				hctl->freeList[freelist_idx].freeList = currBucket;
+				hctl->freeList[freelist_idx].nfree++;
 
 				if (IS_PARTITIONED(hctl))
 					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
@@ -1084,6 +1160,21 @@ hash_search_with_hash_value(HTAB *hashp,
 			}
 			return NULL;
 
+		case HASH_REUSE:
+			if (currBucket != NULL)
+			{
+				/* remove record from hash bucket's chain. */
+				*prevBucketPtr = currBucket->link;
+
+				/* and store for the next HASH_ENTER */
+				DynaHashReuse.element = currBucket;
+				DynaHashReuse.hashp = hashp;
+
+				/* Caller should perform HASH_ENTER as the very next step. */
+				return (void *) ELEMENTKEY(currBucket);
+			}
+			return NULL;
+
 		case HASH_ENTER_NULL:
 			/* ENTER_NULL does not work with palloc-based allocator */
 			Assert(hashp->alloc != DynaHashAlloc);
@@ -1092,7 +1183,12 @@ hash_search_with_hash_value(HTAB *hashp,
 		case HASH_ENTER:
 			/* Return existing element if found, else create one */
 			if (currBucket != NULL)
+			{
+				if (unlikely(DynaHashReuse.element != NULL))
+					free_reused_entry(hashp);
+
 				return (void *) ELEMENTKEY(currBucket);
+			}
 
 			/* disallow inserts if frozen */
 			if (hashp->frozen)
@@ -1100,6 +1196,7 @@ hash_search_with_hash_value(HTAB *hashp,
 					 hashp->tabname);
 
 			currBucket = get_hash_entry(hashp, freelist_idx);
+
 			if (currBucket == NULL)
 			{
 				/* out of memory */
@@ -1292,87 +1389,121 @@ hash_update_hash_key(HTAB *hashp,
  * Allocate a new hashtable entry if possible; return NULL if out of memory.
  * (Or, if the underlying space allocator throws error for out-of-memory,
  * we won't return at all.)
+ * Return element stored with HASH_REUSE if any.
  */
 static HASHBUCKET
 get_hash_entry(HTAB *hashp, int freelist_idx)
 {
 	HASHHDR    *hctl = hashp->hctl;
 	HASHBUCKET	newElement;
+	bool		allocFailed = false;
+	int			borrow_from_idx;
+	int			num_freelists;
+	int			ntries;
+	int			d;
 
-	for (;;)
+	if (unlikely(DynaHashReuse.element != NULL))
 	{
-		/* if partitioned, must lock to touch nentries and freeList */
-		if (IS_PARTITIONED(hctl))
-			SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
+		Assert(DynaHashReuse.hashp == hashp);
 
-		/* try to get an entry from the freelist */
-		newElement = hctl->freeList[freelist_idx].freeList;
+		newElement = DynaHashReuse.element;
+		DynaHashReuse.element = NULL;
+		DynaHashReuse.hashp = NULL;
 
-		if (newElement != NULL)
-			break;
+		return newElement;
+	}
 
-		if (IS_PARTITIONED(hctl))
-			SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
+	num_freelists = IS_PARTITIONED(hctl) ? NUM_FREELISTS : 1;
+	ntries = num_freelists / 4 + 1;
+	borrow_from_idx = freelist_idx;
+	d = 1;
 
-		/*
-		 * No free elements in this freelist.  In a partitioned table, there
-		 * might be entries in other freelists, but to reduce contention we
-		 * prefer to first try to get another chunk of buckets from the main
-		 * shmem allocator.  If that fails, though, we *MUST* root through all
-		 * the other freelists before giving up.  There are multiple callers
-		 * that assume that they can allocate every element in the initially
-		 * requested table size, or that deleting an element guarantees they
-		 * can insert a new element, even if shared memory is entirely full.
-		 * Failing because the needed element is in a different freelist is
-		 * not acceptable.
-		 */
-		if (!element_alloc(hashp, hctl->nelem_alloc, freelist_idx))
+	for (; ntries || !allocFailed; ntries--)
+	{
+		if (ntries == 0)
 		{
-			int			borrow_from_idx;
-
-			if (!IS_PARTITIONED(hctl))
-				return NULL;	/* out of memory */
-
-			/* try to borrow element from another freelist */
+			/*
+			 * No free elements in first NUM_FREELISTS/4 freelists. To reduce
+			 * contention we prefer now to try to get another chunk of buckets
+			 * from the main shmem allocator. If that fails, though, we *MUST*
+			 * loop through all the remaining freelists before giving up.
+			 * There are multiple callers that assume that they can allocate
+			 * every element in the initially requested table size, or that
+			 * deleting an element guarantees they can insert a new element,
+			 * even if shared memory is entirely full. Failing because the
+			 * needed element is in a different freelist is not acceptable.
+			 */
+			allocFailed = !element_alloc(hashp, hctl->nelem_alloc,
+										 freelist_idx);
 			borrow_from_idx = freelist_idx;
-			for (;;)
-			{
-				borrow_from_idx = (borrow_from_idx + 1) % NUM_FREELISTS;
-				if (borrow_from_idx == freelist_idx)
-					break;		/* examined all freelists, fail */
+			ntries = num_freelists;
+			d = 1;
+		}
 
-				SpinLockAcquire(&(hctl->freeList[borrow_from_idx].mutex));
-				newElement = hctl->freeList[borrow_from_idx].freeList;
+		/* if partitioned, must lock to touch nfree and freeList */
+		if (IS_PARTITIONED(hctl))
+			SpinLockAcquire(&hctl->freeList[borrow_from_idx].mutex);
 
-				if (newElement != NULL)
-				{
-					hctl->freeList[borrow_from_idx].freeList = newElement->link;
-					SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
+		newElement = hctl->freeList[borrow_from_idx].freeList;
 
-					/* careful: count the new element in its proper freelist */
-					SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);
-					hctl->freeList[freelist_idx].nentries++;
-					SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
-
-					return newElement;
-				}
+		if (newElement != NULL)
 
+		{
+			Assert(hctl->freeList[borrow_from_idx].nfree > 0);
+			hctl->freeList[borrow_from_idx].freeList = newElement->link;
+			hctl->freeList[borrow_from_idx].nfree--;
+			if (IS_PARTITIONED(hctl))
 				SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
-			}
 
-			/* no elements available to borrow either, so out of memory */
-			return NULL;
+			return newElement;
 		}
+
+		if (IS_PARTITIONED(hctl))
+			SpinLockRelease(&(hctl->freeList[borrow_from_idx].mutex));
+
+		/* Check num_freelists is power of 2 */
+		Assert((num_freelists & (num_freelists - 1)) == 0);
+		/* Quadratic probing guarantees we loop through all entries. */
+		borrow_from_idx = (borrow_from_idx + d++) & (num_freelists - 1);
+	}
+
+	return NULL;				/* out of memory */
+}
+
+/* Return entry stored with HASH_REUSE into appropriate freelist. */
+static void
+free_reused_entry(HTAB *hashp)
+{
+	HASHHDR    *hctl = hashp->hctl;
+	int			freelist_idx = 0;
+
+	Assert(DynaHashReuse.hashp == hashp);
+
+	/*
+	 * Testing shows the best strategy is to spread reused entries in a
+	 * random way.  Otherwise there is a chance of a pathological case with
+	 * crowding at the partition of a hot element.
+	 */
+	if (IS_PARTITIONED(hctl))
+	{
+		freelist_idx = pg_prng_int32p(&pg_global_prng_state);
+		freelist_idx %= NUM_FREELISTS;
 	}
 
-	/* remove entry from freelist, bump nentries */
-	hctl->freeList[freelist_idx].freeList = newElement->link;
-	hctl->freeList[freelist_idx].nentries++;
+	/* if partitioned, must lock to touch nfree and freeList */
+	if (IS_PARTITIONED(hctl))
+		SpinLockAcquire(&(hctl->freeList[freelist_idx].mutex));
+
+	/* add the record to the appropriate freelist. */
+	DynaHashReuse.element->link = hctl->freeList[freelist_idx].freeList;
+	hctl->freeList[freelist_idx].freeList = DynaHashReuse.element;
+	hctl->freeList[freelist_idx].nfree++;
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
 
-	return newElement;
+	DynaHashReuse.element = NULL;
+	DynaHashReuse.hashp = NULL;
 }
 
 /*
@@ -1382,7 +1513,9 @@ long
 hash_get_num_entries(HTAB *hashp)
 {
 	int			i;
-	long		sum = hashp->hctl->freeList[0].nentries;
+	long		sum = hctl_nalloced(hashp->hctl);
+
+	sum -= hashp->hctl->freeList[0].nfree;
 
 	/*
 	 * We currently don't bother with acquiring the mutexes; it's only
@@ -1392,7 +1525,7 @@ hash_get_num_entries(HTAB *hashp)
 	if (IS_PARTITIONED(hashp->hctl))
 	{
 		for (i = 1; i < NUM_FREELISTS; i++)
-			sum += hashp->hctl->freeList[i].nentries;
+			sum -= hashp->hctl->freeList[i].nfree;
 	}
 
 	return sum;
@@ -1739,6 +1872,8 @@ element_alloc(HTAB *hashp, int nelem, int freelist_idx)
 	/* freelist could be nonempty if two backends did this concurrently */
 	firstElement->link = hctl->freeList[freelist_idx].freeList;
 	hctl->freeList[freelist_idx].freeList = prevElement;
+	hctl->freeList[freelist_idx].nfree += nelem;
+	hctl_nalloced_add(hctl, nelem);
 
 	if (IS_PARTITIONED(hctl))
 		SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
diff --git a/src/include/storage/buf_internals.h b/src/include/storage/buf_internals.h
index a1bb6ce60a0..c1c6fdc1e33 100644
--- a/src/include/storage/buf_internals.h
+++ b/src/include/storage/buf_internals.h
@@ -333,7 +333,7 @@ extern void InitBufTable(int size);
 extern uint32 BufTableHashCode(BufferTag *tagPtr);
 extern int	BufTableLookup(BufferTag *tagPtr, uint32 hashcode);
 extern int	BufTableInsert(BufferTag *tagPtr, uint32 hashcode, int buf_id);
-extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode);
+extern void BufTableDelete(BufferTag *tagPtr, uint32 hashcode, bool reuse);
 
 /* localbuf.c */
 extern PrefetchBufferResult PrefetchLocalBuffer(SMgrRelation smgr,
diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
index 854c3312414..1ffb616d99e 100644
--- a/src/include/utils/hsearch.h
+++ b/src/include/utils/hsearch.h
@@ -113,7 +113,8 @@ typedef enum
 	HASH_FIND,
 	HASH_ENTER,
 	HASH_REMOVE,
-	HASH_ENTER_NULL
+	HASH_ENTER_NULL,
+	HASH_REUSE
 } HASHACTION;
 
 /* hash_seq status (should be considered an opaque type by callers) */
-- 
2.35.1


From 121d620126c441289c440ef094d82dcb80d80d17 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Sun, 20 Mar 2022 12:32:06 +0300
Subject: [PATCH 4/4] reduce memory allocation for non-partitioned dynahash

A non-partitioned hash table doesn't use the 32 partitions of
HASHHDR->freeList, so let's allocate just a single free list in this case.

Tags: bufmgr
---
 src/backend/utils/hash/dynahash.c | 37 +++++++++++++++++--------------
 1 file changed, 20 insertions(+), 17 deletions(-)

diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
index 436b6f5af41..aba60109d04 100644
--- a/src/backend/utils/hash/dynahash.c
+++ b/src/backend/utils/hash/dynahash.c
@@ -178,18 +178,6 @@ typedef union
  */
 struct HASHHDR
 {
-	/*
-	 * The freelist can become a point of contention in high-concurrency hash
-	 * tables, so we use an array of freelists, each with its own mutex and
-	 * nfree count, instead of just a single one.  Although the freelists
-	 * normally operate independently, we will scavenge entries from freelists
-	 * other than a hashcode's default freelist when necessary.
-	 *
-	 * If the hash table is not partitioned, only freeList[0] is used and its
-	 * spinlock is not used at all; callers' locking is assumed sufficient.
-	 */
-	FreeListData freeList[NUM_FREELISTS];
-
 	/* These fields can change, but not in a partitioned table */
 	/* Also, dsize can't change in a shared table, even if unpartitioned */
 	long		dsize;			/* directory size */
@@ -217,6 +205,18 @@ struct HASHHDR
 	long		accesses;
 	long		collisions;
 #endif
+
+	/*
+	 * The freelist can become a point of contention in high-concurrency hash
+	 * tables, so we use an array of freelists, each with its own mutex and
+	 * nfree count, instead of just a single one.  Although the freelists
+	 * normally operate independently, we will scavenge entries from freelists
+	 * other than a hashcode's default freelist when necessary.
+	 *
+	 * If the hash table is not partitioned, only freeList[0] is used and its
+	 * spinlock is not used at all; callers' locking is assumed sufficient.
+	 */
+	FreeListData freeList[NUM_FREELISTS];
 };
 
 #define IS_PARTITIONED(hctl)  ((hctl)->num_partitions != 0)
@@ -291,7 +291,7 @@ static bool dir_realloc(HTAB *hashp);
 static bool expand_table(HTAB *hashp);
 static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
 static void free_reused_entry(HTAB *hashp);
-static void hdefault(HTAB *hashp);
+static void hdefault(HTAB *hashp, bool partitioned);
 static int	choose_nelem_alloc(Size entrysize);
 static bool init_htab(HTAB *hashp, long nelem);
 static void hash_corrupted(HTAB *hashp);
@@ -570,7 +570,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	if (!hashp->hctl)
 	{
-		hashp->hctl = (HASHHDR *) hashp->alloc(sizeof(HASHHDR));
+		Assert(!(flags & HASH_PARTITION));
+		hashp->hctl = (HASHHDR *) hashp->alloc(offsetof(HASHHDR, freeList[1]));
 		if (!hashp->hctl)
 			ereport(ERROR,
 					(errcode(ERRCODE_OUT_OF_MEMORY),
@@ -579,7 +580,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
 
 	hashp->frozen = false;
 
-	hdefault(hashp);
+	hdefault(hashp, (flags & HASH_PARTITION) != 0);
 
 	hctl = hashp->hctl;
 
@@ -689,11 +690,13 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
  * Set default HASHHDR parameters.
  */
 static void
-hdefault(HTAB *hashp)
+hdefault(HTAB *hashp, bool partition)
 {
 	HASHHDR    *hctl = hashp->hctl;
 
-	MemSet(hctl, 0, sizeof(HASHHDR));
+	MemSet(hctl, 0, partition ?
+		   sizeof(HASHHDR) :
+		   offsetof(HASHHDR, freeList[1]));
 
 	hctl->dsize = DEF_DIRSIZE;
 	hctl->nsegs = 0;
-- 
2.35.1

Attachment: simple_select3.sql
Description: application/sql
