14.02.2025 13:24, Alexander Korotkov wrote:
> On Fri, Feb 14, 2025 at 11:45 AM Pavel Borisov <pashkin.e...@gmail.com> wrote:
>> On Fri, 14 Feb 2025 at 00:59, Alexander Korotkov <aekorot...@gmail.com>
>> wrote:
>>> On Thu, Feb 13, 2025 at 6:39 PM Pavel Borisov <pashkin.e...@gmail.com>
>>> wrote:
>>>> On Thu, 13 Feb 2025 at 14:08, Alexander Korotkov <aekorot...@gmail.com>
>>>> wrote:
>>>>>
>>>>> On Thu, Feb 13, 2025 at 11:45 AM Yura Sokolov <y.soko...@postgrespro.ru>
>>>>> wrote:
>>>>>> 13.02.2025 12:34, Alexander Korotkov wrote:
>>>>>>> On Wed, Feb 12, 2025 at 8:16 PM Yura Sokolov <y.soko...@postgrespro.ru>
>>>>>>> wrote:
>>>>>>>> 08.02.2025 13:07, Alexander Korotkov wrote:
>>>>>>>>> On Fri, Feb 7, 2025 at 1:39 PM Alexander Korotkov
>>>>>>>>> <aekorot...@gmail.com> wrote:
>>>>>>>>>> Good, thank you. I think the 0001 patch is generally good, but needs
>>>>>>>>>> some further polishing, e.g. more comments explaining how it works.
>>>>>>>>
>>>>>>>> I tried to add more comments. I'm not good at it, so recommendations
>>>>>>>> are welcome.
>>>>>>>>
>>>>>>>>> Two things come to mind worth rechecking about 0001.
>>>>>>>>> 1) Are XLogCtl->InitializeReserved, XLogCtl->InitializedUpTo and
>>>>>>>>> XLogCtl->xlblocks always page-aligned? Because the algorithm seems to be
>>>>>>>>> sensitive to that. If so, I would propose to explicitly comment that
>>>>>>>>> and add corresponding asserts.
>>>>>>>>
>>>>>>>> They're certainly page aligned, since they are page borders.
>>>>>>>> I added an assert on the alignment of InitializeReserved for sanity.
>>>>>>>>
>>>>>>>>> 2) Check if there are concurrency issues between
>>>>>>>>> AdvanceXLInsertBuffer() and switching to the new WAL file.
>>>>>>>>
>>>>>>>> There are no issues:
>>>>>>>> 1. CopyXLogRecordToWAL for isLogSwitch follows the same protocol, i.e. it
>>>>>>>> uses GetXLogBuffer to zero out the WAL page.
>>>>>>>> 2. WALINSERT_SPECIAL_SWITCH forces an exclusive lock on all insertion
>>>>>>>> locks, so switching WAL is not concurrent. (Although there is no need
>>>>>>>> for this exclusiveness, imho.)
>>>>>>>
>>>>>>> Good, thank you. I've also revised commit message and comments.
>>>>>>>
>>>>>>> But I see another issue with this patch. In the worst case, we do
>>>>>>> XLogWrite() by ourselves, and it could potentially error out.
>>>>>>> Without the patch, that would cause WALBufMappingLock to be released and
>>>>>>> XLogCtl->InitializedUpTo not advanced. With the patch, that would
>>>>>>> cause other processes to wait indefinitely until we finish the
>>>>>>> initialization.
>>>>>>>
>>>>>>> A possible solution would be to save the position of the page to be
>>>>>>> initialized, and set it back to XLogCtl->InitializeReserved on error
>>>>>>> (everywhere we do LWLockReleaseAll()). We also must check that on
>>>>>>> error we only move XLogCtl->InitializeReserved backwards, because
>>>>>>> there could be multiple concurrent failures. We also need to
>>>>>>> broadcast XLogCtl->InitializedUpToCondVar to wake up waiters.
>>>>>>
>>>>>> The single place where AdvanceXLInsertBuffer is called outside of a
>>>>>> critical section is in XLogBackgroundFlush. All other call stacks will
>>>>>> cause a server restart if XLogWrite raises an error inside
>>>>>> AdvanceXLInsertBuffer.
>>>>>>
>>>>>> XLogBackgroundFlush explicitly avoids writing buffers by passing the
>>>>>> opportunistic=true parameter.
>>>>>>
>>>>>> Therefore, an error in XLogWrite will not cause a hang in
>>>>>> AdvanceXLInsertBuffer, since the server will shut down/restart.
>>>>>>
>>>>>> Perhaps we just need to insert `Assert(CritSectionCount > 0);` before the
>>>>>> call to XLogWrite here?
>>>>>
>>>>> You're correct. I just reflected this in the next revision of the patch.
>>>>
>>>> I've looked at the patchset v6.
>>>
>>> Oh, sorry, I really did something wrong. I did git format-patch on the wrong
>>> local branch for v5 and v6. The patches I sent for v5 and v6 are
>>> actually the same as my v1. This is a real pity. Please find the
>>> right version of the patchset attached.
>>
>> I've rechecked v7. In v6 a proposal from [1] was not reflected; now it
>> has landed in v7.
>>
>> Other changes do not concern code behavior. The things from my
>> previous review that still apply to v7:
>>
>> For 0001:
>>
>> Proposed comment change:
>> "lock-free with cooperation with" -> "lock-free accompanied by changes
>> to..." (or maybe another variant)
>
> Good catch. I've rephrased this comment even more.
>
>> I propose a new define:
>> #define FirstValidXLogRecPtr 1
>> While FirstValidXLogRecPtr = InvalidXLogRecPtr + 1 is true, in the code
>> that has no semantic meaning, and it's better to avoid using direct
>> arithmetic to derive the meaning of FirstValidXLogRecPtr from
>> InvalidXLogRecPtr.
>
> Makes sense, but I'm not sure if this change is required at all. I've
> reverted this to the state of master, and everything seems to work.
>
>> For 0002, both comment proposals from my message on v6 apply
>> to v7 as well.
>
> Thank you for pointing that out. For now, I'm concentrating on improvements
> to 0001. Perhaps Yura could work on your notes on 0002.
I wrote a good commit message for 0002 with calculated probabilities and a
simple Ruby program which computes them, to explain the choice of 2
conditional attempts. (At least I hope the message is good.) I also added a
simple comment before `int attempts = 2;`.
Also, I simplified 0002 a bit to look a bit prettier (i.e. without goto), and
added a static assert on NUM_XLOGINSERT_LOCKS being a power of 2.
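To illustrate, that assert is roughly of the following shape (just a sketch
assuming the usual StaticAssertDecl macro; the exact form in the patch may
differ). The power-of-2 requirement matters because stepping by an odd delta
modulo a power of 2 visits every lock before repeating:

    StaticAssertDecl((NUM_XLOGINSERT_LOCKS & (NUM_XLOGINSERT_LOCKS - 1)) == 0,
                     "NUM_XLOGINSERT_LOCKS must be a power of 2");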
(The 0001 patch is the same as in v8.)
-------
regards
Yura Sokolov aka funny-falcon
From f9e4289000627801506226b0cd140872639e95ab Mon Sep 17 00:00:00 2001
From: Alexander Korotkov <akorot...@postgresql.org>
Date: Thu, 13 Feb 2025 01:20:20 +0200
Subject: [PATCH v9 1/2] Get rid of WALBufMappingLock
Allow multiple backends to initialize WAL buffers concurrently. This way
`MemSet((char *) NewPage, 0, XLOG_BLCKSZ);` can run in parallel without
taking a single LWLock in exclusive mode.
The new algorithm works as follows:
* reserve a page for initialization,
* ensure the page is written out,
* once the page is initialized, signal concurrent initializers using a
ConditionVariable,
* repeat the previous steps until we reserve initialization up to the target
WAL position,
* wait until concurrent initialization finishes using a ConditionVariable.
Now multiple backends can concurrently reserve pages, initialize them, and
advance XLogCtl->InitializedUpTo to point to the latest initialized page.
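In simplified form, each backend's loop in AdvanceXLInsertBuffer() looks
roughly like the following (a sketch of the logic only, not the exact code;
see the diff below for the real thing):

    ReservedPtr = pg_atomic_read_u64(&XLogCtl->InitializeReserved);
    while (upto >= ReservedPtr)
    {
        /* claim the next page; on CAS failure somebody else reserved it */
        if (!pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved,
                                            &ReservedPtr,
                                            ReservedPtr + XLOG_BLCKSZ))
            continue;
        /* write out the old page occupying this buffer slot, if still dirty */
        /* initialize the new page header and set xlblocks[] for the slot */
        /* advance InitializedUpTo over consecutive initialized pages and
         * broadcast InitializedUpToCondVar to wake up waiters */
    }
    /* wait on InitializedUpToCondVar until initialization reaches 'upto' */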
Author: <y.soko...@postgrespro.ru>
Co-authored-by: Alexander Korotkov <aekorot...@gmail.com>
Reviewed-by: Pavel Borisov <pashkin.e...@gmail.com>
---
src/backend/access/transam/xlog.c | 176 +++++++++++++-----
.../utils/activity/wait_event_names.txt | 2 +-
src/include/storage/lwlocklist.h | 2 +-
3 files changed, 132 insertions(+), 48 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a50fd99d9e5..c232ba2d599 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -302,11 +302,6 @@ static bool doPageWrites;
* so it's a plain spinlock. The other locks are held longer (potentially
* over I/O operations), so we use LWLocks for them. These locks are:
*
- * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.
- * It is only held while initializing and changing the mapping. If the
- * contents of the buffer being replaced haven't been written yet, the mapping
- * lock is released while the write is done, and reacquired afterwards.
- *
* WALWriteLock: must be held to write WAL buffers to disk (XLogWrite or
* XLogFlush).
*
@@ -473,21 +468,32 @@ typedef struct XLogCtlData
pg_atomic_uint64 logFlushResult; /* last byte + 1 flushed */
/*
- * Latest initialized page in the cache (last byte position + 1).
+ * Latest page in the cache reserved for initialization (last byte
+ * position + 1).
*
- * To change the identity of a buffer (and InitializedUpTo), you need to
- * hold WALBufMappingLock. To change the identity of a buffer that's
+ * To change the identity of a buffer, you need to advance
+ * InitializeReserved first. To change the identity of a buffer that's
* still dirty, the old page needs to be written out first, and for that
* you need WALWriteLock, and you need to ensure that there are no
* in-progress insertions to the page by calling
* WaitXLogInsertionsToFinish().
*/
- XLogRecPtr InitializedUpTo;
+ pg_atomic_uint64 InitializeReserved;
+
+ /*
+ * Latest initialized page in the cache (last byte position + 1).
+ *
+ * InitializedUpTo is updated after the buffer initialization. After the
+ * update, waiters get notified using InitializedUpToCondVar.
+ */
+ pg_atomic_uint64 InitializedUpTo;
+ ConditionVariable InitializedUpToCondVar;
/*
* These values do not change after startup, although the pointed-to pages
- * and xlblocks values certainly do. xlblocks values are protected by
- * WALBufMappingLock.
+ * and xlblocks values certainly do. xlblocks values are changed
+ * lock-free based on a check of the XLOG write position, and the changes
+ * are accompanied by changes of InitializeReserved and InitializedUpTo.
*/
char *pages; /* buffers for unwritten XLOG pages */
pg_atomic_uint64 *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
@@ -810,9 +816,9 @@ XLogInsertRecord(XLogRecData *rdata,
* fullPageWrites from changing until the insertion is finished.
*
* Step 2 can usually be done completely in parallel. If the required WAL
- * page is not initialized yet, you have to grab WALBufMappingLock to
- * initialize it, but the WAL writer tries to do that ahead of insertions
- * to avoid that from happening in the critical path.
+ * page is not initialized yet, you have to go through AdvanceXLInsertBuffer,
+ * which will ensure it is initialized. But the WAL writer tries to do that
+ * ahead of insertions to avoid that from happening in the critical path.
*
*----------
*/
@@ -1991,32 +1997,70 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
XLogRecPtr NewPageEndPtr = InvalidXLogRecPtr;
XLogRecPtr NewPageBeginPtr;
XLogPageHeader NewPage;
+ XLogRecPtr ReservedPtr;
int npages pg_attribute_unused() = 0;
- LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-
/*
- * Now that we have the lock, check if someone initialized the page
- * already.
+ * We must run the loop below inside the critical section as we expect
+ * XLogCtl->InitializedUpTo to eventually keep up. Most of the callers
+ * already run inside a critical section. The exception is the WAL
+ * writer, which passes 'opportunistic == true', and in that case we
+ * don't perform operations that could error out.
+ *
+ * Start an explicit critical section anyway though.
*/
- while (upto >= XLogCtl->InitializedUpTo || opportunistic)
+ Assert(CritSectionCount > 0 || opportunistic);
+ START_CRIT_SECTION();
+
+ /*--
+ * Loop till we get all the pages in WAL buffer before 'upto' reserved for
+ * initialization. Multiple processes can initialize different buffers with
+ * this loop in parallel as follows.
+ *
+ * 1. Reserve a page for initialization using XLogCtl->InitializeReserved.
+ * 2. Initialize the reserved page.
+ * 3. Attempt to advance XLogCtl->InitializedUpTo.
+ */
+ ReservedPtr = pg_atomic_read_u64(&XLogCtl->InitializeReserved);
+ while (upto >= ReservedPtr || opportunistic)
{
- nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
+ Assert(ReservedPtr % XLOG_BLCKSZ == 0);
/*
- * Get ending-offset of the buffer page we need to replace (this may
- * be zero if the buffer hasn't been used yet). Fall through if it's
- * already written out.
+ * Get ending-offset of the buffer page we need to replace.
+ *
+ * We don't look up xlblocks, but rather calculate the position we must
+ * wait to be written out. If it was written, xlblocks will hold this
+ * position (or be uninitialized).
*/
- OldPageRqstPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]);
- if (LogwrtResult.Write < OldPageRqstPtr)
+ if (ReservedPtr + XLOG_BLCKSZ > XLOG_BLCKSZ * XLOGbuffers)
+ OldPageRqstPtr = ReservedPtr + XLOG_BLCKSZ - XLOG_BLCKSZ * XLOGbuffers;
+ else
+ OldPageRqstPtr = InvalidXLogRecPtr;
+
+ if (LogwrtResult.Write < OldPageRqstPtr && opportunistic)
{
/*
- * Nope, got work to do. If we just want to pre-initialize as much
- * as we can without flushing, give up now.
+ * If we just want to pre-initialize as much as we can without
+ * flushing, give up now.
*/
- if (opportunistic)
- break;
+ upto = ReservedPtr - 1;
+ break;
+ }
+
+ /*
+ * Attempt to reserve the page for initialization. Failure means that
+ * this page got reserved by another process.
+ */
+ if (!pg_atomic_compare_exchange_u64(&XLogCtl->InitializeReserved,
+ &ReservedPtr,
+ ReservedPtr + XLOG_BLCKSZ))
+ continue;
+
+ /* Fall through if it's already written out. */
+ if (LogwrtResult.Write < OldPageRqstPtr)
+ {
+ /* Nope, got work to do. */
/* Advance shared memory write request position */
SpinLockAcquire(&XLogCtl->info_lck);
@@ -2031,14 +2075,6 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
RefreshXLogWriteResult(LogwrtResult);
if (LogwrtResult.Write < OldPageRqstPtr)
{
- /*
- * Must acquire write lock. Release WALBufMappingLock first,
- * to make sure that all insertions that we need to wait for
- * can finish (up to this same position). Otherwise we risk
- * deadlock.
- */
- LWLockRelease(WALBufMappingLock);
-
WaitXLogInsertionsToFinish(OldPageRqstPtr);
LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
@@ -2060,9 +2096,6 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
PendingWalStats.wal_buffers_full++;
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
}
- /* Re-acquire WALBufMappingLock and retry */
- LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
- continue;
}
}
@@ -2070,10 +2103,17 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
* Now the next buffer slot is free and we can set it up to be the
* next output page.
*/
- NewPageBeginPtr = XLogCtl->InitializedUpTo;
+ NewPageBeginPtr = ReservedPtr;
NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
+ nextidx = XLogRecPtrToBufIdx(ReservedPtr);
- Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
+#ifdef USE_ASSERT_CHECKING
+ {
+ XLogRecPtr storedBound = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]);
+
+ Assert(storedBound == OldPageRqstPtr || storedBound == InvalidXLogRecPtr);
+ }
+#endif
NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
@@ -2139,11 +2179,50 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
pg_write_barrier();
pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);
- XLogCtl->InitializedUpTo = NewPageEndPtr;
+
+ /*
+ * Try to advance XLogCtl->InitializedUpTo.
+ *
+ * If the CAS operation fails, then some of the previous pages are not
+ * initialized yet, and this backend gives up.
+ *
+ * Since the initializer of the next page might give up on advancing
+ * InitializedUpTo, this backend has to keep attempting to advance it
+ * until it finds a page "in the past" or a concurrent backend succeeds
+ * at advancing. When we finish advancing XLogCtl->InitializedUpTo, we
+ * notify all the waiters with XLogCtl->InitializedUpToCondVar.
+ */
+ while (pg_atomic_compare_exchange_u64(&XLogCtl->InitializedUpTo, &NewPageBeginPtr, NewPageEndPtr))
+ {
+ NewPageBeginPtr = NewPageEndPtr;
+ NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
+ nextidx = XLogRecPtrToBufIdx(NewPageBeginPtr);
+
+ if (pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]) != NewPageEndPtr)
+ {
+ /*
+ * Page at nextidx wasn't initialized yet, so we can't move
+ * InitializedUpTo further. It will be moved by the backend that
+ * initializes nextidx.
+ */
+ ConditionVariableBroadcast(&XLogCtl->InitializedUpToCondVar);
+ break;
+ }
+ }
npages++;
}
- LWLockRelease(WALBufMappingLock);
+
+ END_CRIT_SECTION();
+
+ /*
+ * All the pages in WAL buffer before 'upto' were reserved for
+ * initialization. However, some pages might have been reserved by
+ * concurrent processes. Wait till they finish initialization.
+ */
+ while (upto >= pg_atomic_read_u64(&XLogCtl->InitializedUpTo))
+ ConditionVariableSleep(&XLogCtl->InitializedUpToCondVar, WAIT_EVENT_WAL_BUFFER_INIT);
+ ConditionVariableCancelSleep();
#ifdef WAL_DEBUG
if (XLOG_DEBUG && npages > 0)
@@ -5044,6 +5123,10 @@ XLOGShmemInit(void)
pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+ pg_atomic_init_u64(&XLogCtl->InitializeReserved, InvalidXLogRecPtr);
+ pg_atomic_init_u64(&XLogCtl->InitializedUpTo, InvalidXLogRecPtr);
+ ConditionVariableInit(&XLogCtl->InitializedUpToCondVar);
}
/*
@@ -6063,7 +6146,7 @@ StartupXLOG(void)
memset(page + len, 0, XLOG_BLCKSZ - len);
pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
- XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
+ pg_atomic_write_u64(&XLogCtl->InitializedUpTo, endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
}
else
{
@@ -6072,8 +6155,9 @@ StartupXLOG(void)
* let the first attempt to insert a log record to initialize the next
* buffer.
*/
- XLogCtl->InitializedUpTo = EndOfLog;
+ pg_atomic_write_u64(&XLogCtl->InitializedUpTo, EndOfLog);
}
+ pg_atomic_write_u64(&XLogCtl->InitializeReserved, pg_atomic_read_u64(&XLogCtl->InitializedUpTo));
/*
* Update local and shared status. This is OK to do without any locks
diff --git a/src/backend/utils/activity/wait_event_names.txt b/src/backend/utils/activity/wait_event_names.txt
index e199f071628..ccf73781d81 100644
--- a/src/backend/utils/activity/wait_event_names.txt
+++ b/src/backend/utils/activity/wait_event_names.txt
@@ -155,6 +155,7 @@ REPLICATION_SLOT_DROP "Waiting for a replication slot to become inactive so it c
RESTORE_COMMAND "Waiting for <xref linkend="guc-restore-command"/> to complete."
SAFE_SNAPSHOT "Waiting to obtain a valid snapshot for a <literal>READ ONLY DEFERRABLE</literal> transaction."
SYNC_REP "Waiting for confirmation from a remote server during synchronous replication."
+WAL_BUFFER_INIT "Waiting on WAL buffer to be initialized."
WAL_RECEIVER_EXIT "Waiting for the WAL receiver to exit."
WAL_RECEIVER_WAIT_START "Waiting for startup process to send initial data for streaming replication."
WAL_SUMMARY_READY "Waiting for a new WAL summary to be generated."
@@ -310,7 +311,6 @@ XidGen "Waiting to allocate a new transaction ID."
ProcArray "Waiting to access the shared per-process data structures (typically, to get a snapshot or report a session's transaction ID)."
SInvalRead "Waiting to retrieve messages from the shared catalog invalidation queue."
SInvalWrite "Waiting to add a message to the shared catalog invalidation queue."
-WALBufMapping "Waiting to replace a page in WAL buffers."
WALWrite "Waiting for WAL buffers to be written to disk."
ControlFile "Waiting to read or update the <filename>pg_control</filename> file or create a new WAL file."
MultiXactGen "Waiting to read or update shared multixact state."
diff --git a/src/include/storage/lwlocklist.h b/src/include/storage/lwlocklist.h
index cf565452382..ff897515769 100644
--- a/src/include/storage/lwlocklist.h
+++ b/src/include/storage/lwlocklist.h
@@ -37,7 +37,7 @@ PG_LWLOCK(3, XidGen)
PG_LWLOCK(4, ProcArray)
PG_LWLOCK(5, SInvalRead)
PG_LWLOCK(6, SInvalWrite)
-PG_LWLOCK(7, WALBufMapping)
+/* 7 was WALBufMapping */
PG_LWLOCK(8, WALWrite)
PG_LWLOCK(9, ControlFile)
/* 10 was CheckpointLock */
--
2.43.0
From f0d9cc2140a10ca8d2b16a73e4be889f3665d89a Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Thu, 16 Jan 2025 15:30:57 +0300
Subject: [PATCH v9 2/2] Several attempts to lock WALInsertLocks.
The birthday paradox tells us we could get a collision with 34% probability
with just 3 concurrent processes, 58% with 4, and 79% with 5
(when NUM_XLOGINSERT_LOCKS == 8).
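For example, with 3 processes and 8 locks, the chance that all three pick
distinct locks is 8*7*6 / 8^3 = 336/512 ~ 66%, hence a ~34% collision
probability; the 4- and 5-process figures follow the same way.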
Trying several more times to lock conditionally, stepping linearly by a
random (odd) delta, greatly reduces the probability of blocking:
- with 1 conditional attempt - 4%, 14% and 33% respectively
- with 2 conditional attempts - 0%, 2.1% and 9.7%
- with 3 conditional attempts - 0%, 0% and 1.85%
The probabilities are calculated with a simple Ruby program:
  def try_add(arr, n)
    a, b = rand(8), rand(8) | 1
    n.times {
      (arr << a; break) unless arr.include?(a)
      a = (a + b) % 8
    }
  end

  def calc_prob(n, k)
    300000.times.
      map { ar = []; n.times { try_add(ar, k) }; ar }.
      count { |a| a.length < n } / 300000.0 * 100
  end

  (3..5).each { |n| (1..5).each { |k| p [n, k, calc_prob(n, k).round(2)] } }
Given that every attempt is a cache miss, 2 conditional attempts followed by
an unconditional lock acquisition look to be a good compromise.
---
src/backend/access/transam/xlog.c | 42 +++++++++++++++++++------------
1 file changed, 26 insertions(+), 16 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c232ba2d599..e9fce1c8d9c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -68,6 +68,7 @@
#include "catalog/pg_database.h"
#include "common/controldata_utils.h"
#include "common/file_utils.h"
+#include "common/pg_prng.h"
#include "executor/instrument.h"
#include "miscadmin.h"
#include "pg_trace.h"
@@ -1376,7 +1377,11 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
static void
WALInsertLockAcquire(void)
{
- bool immed;
+ /*
+ * Two conditional attempts look to be a good compromise between a good
+ * probability of acquiring the lock and cache misses on every attempt.
+ */
+ int attempts = 2;
/*
* It doesn't matter which of the WAL insertion locks we acquire, so try
@@ -1389,29 +1394,34 @@ WALInsertLockAcquire(void)
* (semi-)randomly. This allows the locks to be used evenly if you have a
* lot of very short connections.
*/
- static int lockToTry = -1;
+ static uint32 lockToTry = 0;
+ static uint32 lockDelta = 0;
- if (lockToTry == -1)
+ if (lockDelta == 0)
+ {
lockToTry = MyProcNumber % NUM_XLOGINSERT_LOCKS;
+ lockDelta = pg_prng_uint32(&pg_global_prng_state) % NUM_XLOGINSERT_LOCKS;
+ lockDelta |= 1; /* must be odd */
+ }
+
MyLockNo = lockToTry;
- /*
- * The insertingAt value is initially set to 0, as we don't know our
- * insert location yet.
- */
- immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);
- if (!immed)
+ while (attempts-- > 0)
{
+ if (LWLockConditionalAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE))
+ return;
+
/*
- * If we couldn't get the lock immediately, try another lock next
- * time. On a system with more insertion locks than concurrent
- * inserters, this causes all the inserters to eventually migrate to a
- * lock that no-one else is using. On a system with more inserters
- * than locks, it still helps to distribute the inserters evenly
- * across the locks.
+ * If we couldn't get the lock immediately, try another lock. On a
+ * system with more insertion locks than concurrent inserters, this
+ * causes all the inserters to eventually migrate to a lock that
+ * no-one else is using. On a system with more inserters than locks,
+ * it still helps to distribute the inserters evenly across the locks.
*/
- lockToTry = (lockToTry + 1) % NUM_XLOGINSERT_LOCKS;
+ lockToTry = (lockToTry + lockDelta) % NUM_XLOGINSERT_LOCKS;
+ MyLockNo = lockToTry;
}
+ LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);
}
/*
--
2.43.0