24.01.2025 12:07, Japin Li wrote:
On Thu, 23 Jan 2025 at 21:44, Japin Li <japi...@hotmail.com> wrote:
On Thu, 23 Jan 2025 at 15:03, Yura Sokolov <y.soko...@postgrespro.ru> wrote:
23.01.2025 11:46, Japin Li wrote:
On Wed, 22 Jan 2025 at 22:44, Japin Li <japi...@hotmail.com> wrote:
On Wed, 22 Jan 2025 at 17:02, Yura Sokolov <y.soko...@postgrespro.ru> wrote:
I believe I know why it happens: I was in a hurry making v2 by
cherry-picking the internal version. I reverted some changes in
CalcCuckooPositions manually and forgot to apply the modulo by
PREV_LINKS_HASH_CAPA.

Here's the fix:

          pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
-       pos->pos[1] = pos->pos[0] + 1;
+       pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
          pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
-       pos->pos[3] = pos->pos[2] + 2;
+       pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
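
For reference, here is a minimal self-contained sketch of the corrected helper
(the struct layout, the capacity value and the signature are my assumptions;
only the four assignments come from the actual patch):

    #include <stdint.h>

    #define PREV_LINKS_HASH_CAPA 128    /* assumed capacity; the patch defines the real one */

    typedef struct PrevLinksHashPos
    {
        uint32_t    pos[4];             /* four candidate slots for cuckoo insertion */
    } PrevLinksHashPos;

    static inline void
    CalcCuckooPositions(PrevLinksHashPos *pos, uint32_t hash)
    {
        /* every candidate slot must be reduced modulo the table capacity */
        pos->pos[0] = hash % PREV_LINKS_HASH_CAPA;
        pos->pos[1] = (pos->pos[0] + 1) % PREV_LINKS_HASH_CAPA;
        pos->pos[2] = (hash >> 16) % PREV_LINKS_HASH_CAPA;
        pos->pos[3] = (pos->pos[2] + 2) % PREV_LINKS_HASH_CAPA;
    }

Without the two added modulo reductions, pos[1] and pos[3] could point one or
two slots past the end of the table.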

Anyway, here's v3:
- the excess file "v0-0001-Increase..." is removed. I believe it was the
    source of the whitespace apply warnings.
- this mistake is fixed.
- clearer slot strategies, plus an "8 positions in two cache lines" strategy.

You may try switching PREV_LINKS_HASH_STRATEGY to 2 or 3 and see
whether it has a measurable effect.
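
(Assuming the strategy is selected with a plain #define in the patched file,
switching it is a one-line edit; the value shown here is only an example:

    #define PREV_LINKS_HASH_STRATEGY 2      /* 1, 2 or 3 */
)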

Thanks for the quick fix.  I will retest it tomorrow.
Hi, Yura Sokolov
Here are my test results for the v3 patch:
| case                          | min        | avg        | max        |
|-------------------------------+------------+------------+------------|
| master (44b61efb79)           | 865,743.55 | 871,237.40 | 874,492.59 |
| v3                            | 857,020.58 | 860,180.11 | 864,355.00 |
| v3 PREV_LINKS_HASH_STRATEGY=2 | 853,187.41 | 855,796.36 | 858,436.44 |
| v3 PREV_LINKS_HASH_STRATEGY=3 | 863,131.97 | 864,272.91 | 865,396.42 |
It seems there are some performance decreases :( Or is there something I
missed?


Hi, Japin.
(Excuse me for the duplicated message; I found I had sent it only to you
the first time.)

Never mind!

v3 (as well as v2) doesn't increase NUM_XLOGINSERT_LOCKS itself.
With only 8 in-progress inserters, a spinlock is certainly better than any
more complex solution.

You need to compare "master" vs "master + NUM_XLOGINSERT_LOCKS=64" vs
"master + NUM_XLOGINSERT_LOCKS=64 + v3".

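For reference, NUM_XLOGINSERT_LOCKS is a plain compile-time constant in
src/backend/access/transam/xlog.c, so the "lock 64" builds amount to a
one-line change like this (the current upstream default is 8):

    #define NUM_XLOGINSERT_LOCKS    64      /* default is 8 */
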
And even so, I don't claim that "lock-free reservation" gives any benefit.

That is why your benchmarking is very valuable! It could answer whether we
need such a not-so-small patch, or whether there is no real problem at all.


Hi, Yura Sokolov

Here are the test results comparing NUM_XLOGINSERT_LOCKS settings with and without the v3 patch.

| case                  | min          | avg          | max          | rate% |
|-----------------------+--------------+--------------+--------------+-------|
| master (4108440)      | 891,225.77   | 904,868.75   | 913,708.17   |       |
| lock 64               | 1,007,716.95 | 1,012,013.22 | 1,018,674.00 | 11.84 |
| lock 64 attempt 1     | 1,016,716.07 | 1,017,735.55 | 1,019,328.36 | 12.47 |
| lock 64 attempt 2     | 1,015,328.31 | 1,018,147.74 | 1,021,513.14 | 12.52 |
| lock 128              | 1,010,147.38 | 1,014,128.11 | 1,018,672.01 | 12.07 |
| lock 128 attempt 1    | 1,018,154.79 | 1,023,348.35 | 1,031,365.42 | 13.09 |
| lock 128 attempt 2    | 1,013,245.56 | 1,018,984.78 | 1,023,696.00 | 12.61 |
| lock 64 v3            | 1,010,893.30 | 1,022,787.25 | 1,029,200.26 | 13.03 |
| lock 64 attempt 1 v3  | 1,014,961.21 | 1,019,745.09 | 1,025,511.62 | 12.70 |
| lock 64 attempt 2 v3  | 1,015,690.73 | 1,018,365.46 | 1,020,200.57 | 12.54 |
| lock 128 v3           | 1,012,653.14 | 1,013,637.09 | 1,014,358.69 | 12.02 |
| lock 128 attempt 1 v3 | 1,008,027.57 | 1,016,849.87 | 1,024,597.15 | 12.38 |
| lock 128 attempt 2 v3 | 1,020,552.04 | 1,024,658.92 | 1,027,855.90 | 13.24 |

Sorry for the pause; it was my birthday, so I was on a short vacation.

So, in total:
- increasing NUM_XLOGINSERT_LOCKS to 64 certainly helps
- additional lock attempts seem to help a bit in this benchmark,
  but they help more in another (rather synthetic) benchmark [1]
- my version of lock-free reservation seems to help a bit when
  applied alone, but looks odd in conjunction with additional
  lock attempts.

I don't think the small improvement from my version of lock-free reservation
(~1.1% = 1023/1012) pays for its complexity at the moment.

Probably, once other places are optimized/improved, it will pay off
more.

Or perhaps Zhiguo Zhou's version will perform better.

I think we could measure the theoretical benefit by completely ignoring the
filling of xl_prev. I've attached the patch "Dumb-lock-free..." so you can
measure that. It passes almost all "recovery" tests, though it fails the two
that strictly depend on xl_prev.
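
To make the comparison point concrete: in the dumb variant the whole
spinlock-protected section of ReserveXLogInsertLocation collapses into a
single atomic fetch-add, and the prev-link is faked instead of tracked
(lifted straight from the attached patch):

    startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
    endbytepos = startbytepos + size;

    *StartPos = XLogBytePosToRecPtr(startbytepos);
    *EndPos = XLogBytePosToEndRecPtr(endbytepos);
    *PrevPtr = *StartPos - 1;   /* not a real prev-link; hence the xlogreader change */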

[1] https://postgr.es/m/3b11fdc2-9793-403d-b3d4-67ff9a00d447%40postgrespro.ru

------

regards
Yura
From d8b1e82bab1ee51b416956b241e824f0b1d125e8 Mon Sep 17 00:00:00 2001
From: Yura Sokolov <y.soko...@postgrespro.ru>
Date: Sun, 19 Jan 2025 17:40:28 +0300
Subject: [PATCH] Dumb lock-free XLog Reservation without xl_prev

---
 src/backend/access/transam/xlog.c       | 106 ++++++++++--------------
 src/backend/access/transam/xlogreader.c |   2 +-
 2 files changed, 47 insertions(+), 61 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index bf3dbda901d..d45e6408268 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -395,17 +395,18 @@ static SessionBackupState sessionBackupState = SESSION_BACKUP_NONE;
  */
 typedef struct XLogCtlInsert
 {
-	slock_t		insertpos_lck;	/* protects CurrBytePos and PrevBytePos */
-
 	/*
 	 * CurrBytePos is the end of reserved WAL. The next record will be
-	 * inserted at that position. PrevBytePos is the start position of the
-	 * previously inserted (or rather, reserved) record - it is copied to the
-	 * prev-link of the next record. These are stored as "usable byte
-	 * positions" rather than XLogRecPtrs (see XLogBytePosToRecPtr()).
+	 * inserted at that position.
+	 *
+	 * The start position of the previously inserted (or rather, reserved)
+	 * record (it is copied to the prev-link of the next record) will be
+	 * stored in PrevLinksHash.
+	 *
+	 * These are stored as "usable byte positions" rather than XLogRecPtrs
+	 * (see XLogBytePosToRecPtr()).
 	 */
-	uint64		CurrBytePos;
-	uint64		PrevBytePos;
+	pg_atomic_uint64 CurrBytePos;
 
 	/*
 	 * Make sure the above heavily-contended spinlock and byte positions are
@@ -715,6 +716,12 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+static inline XLogRecPtr
+ReadInsertCurrBytePos(void)
+{
+	return pg_atomic_read_u64(&XLogCtl->Insert.CurrBytePos);
+}
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -1111,36 +1118,18 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint64		startbytepos;
 	uint64		endbytepos;
-	uint64		prevbytepos;
 
 	size = MAXALIGN(size);
 
 	/* All (non xlog-switch) records should contain data. */
 	Assert(size > SizeOfXLogRecord);
 
-	/*
-	 * The duration the spinlock needs to be held is minimized by minimizing
-	 * the calculations that have to be done while holding the lock. The
-	 * current tip of reserved WAL is kept in CurrBytePos, as a byte position
-	 * that only counts "usable" bytes in WAL, that is, it excludes all WAL
-	 * page headers. The mapping between "usable" byte positions and physical
-	 * positions (XLogRecPtrs) can be done outside the locked region, and
-	 * because the usable byte position doesn't include any headers, reserving
-	 * X bytes from WAL is almost as simple as "CurrBytePos += X".
-	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
-
-	startbytepos = Insert->CurrBytePos;
+	startbytepos = pg_atomic_fetch_add_u64(&Insert->CurrBytePos, size);
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
-
-	SpinLockRelease(&Insert->insertpos_lck);
 
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
-	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
+	*PrevPtr = *StartPos - 1;
 
 	/*
 	 * Check that the conversions between "usable byte positions" and
@@ -1148,7 +1137,6 @@ ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 	 */
 	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
 	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
-	Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
 }
 
 /*
@@ -1166,32 +1154,29 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	uint64		startbytepos;
 	uint64		endbytepos;
-	uint64		prevbytepos;
 	uint32		size = MAXALIGN(SizeOfXLogRecord);
 	XLogRecPtr	ptr;
 	uint32		segleft;
 
 	/*
-	 * These calculations are a bit heavy-weight to be done while holding a
-	 * spinlock, but since we're holding all the WAL insertion locks, there
-	 * are no other inserters competing for it. GetXLogInsertRecPtr() does
-	 * compete for it, but that's not called very frequently.
+	 * Currently this function is called while holding all WAL insertion
+	 * locks exclusively, so there is no contention on CurrBytePos, but we
+	 * still do a CAS loop for uniformity.
+	 *
+	 * Probably we'll get rid of the exclusive lock in the future.
 	 */
-	SpinLockAcquire(&Insert->insertpos_lck);
 
-	startbytepos = Insert->CurrBytePos;
+repeat:
+	startbytepos = pg_atomic_read_u64(&Insert->CurrBytePos);
 
 	ptr = XLogBytePosToEndRecPtr(startbytepos);
 	if (XLogSegmentOffset(ptr, wal_segment_size) == 0)
 	{
-		SpinLockRelease(&Insert->insertpos_lck);
 		*EndPos = *StartPos = ptr;
 		return false;
 	}
 
 	endbytepos = startbytepos + size;
-	prevbytepos = Insert->PrevBytePos;
-
 	*StartPos = XLogBytePosToRecPtr(startbytepos);
 	*EndPos = XLogBytePosToEndRecPtr(endbytepos);
 
@@ -1202,17 +1187,24 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
 		*EndPos += segleft;
 		endbytepos = XLogRecPtrToBytePos(*EndPos);
 	}
-	Insert->CurrBytePos = endbytepos;
-	Insert->PrevBytePos = startbytepos;
 
-	SpinLockRelease(&Insert->insertpos_lck);
+	if (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
+										&startbytepos,
+										endbytepos))
+	{
+		/*
+		 * Don't use a spin delay here: perform_spin_delay is mainly meant to
+		 * resolve single-core contention, but on a single core we will
+		 * succeed on the next attempt anyway.
+		 */
+		goto repeat;
+	}
 
-	*PrevPtr = XLogBytePosToRecPtr(prevbytepos);
+	*PrevPtr = *StartPos - 1;
 
 	Assert(XLogSegmentOffset(*EndPos, wal_segment_size) == 0);
 	Assert(XLogRecPtrToBytePos(*EndPos) == endbytepos);
 	Assert(XLogRecPtrToBytePos(*StartPos) == startbytepos);
-	Assert(XLogRecPtrToBytePos(*PrevPtr) == prevbytepos);
 
 	return true;
 }
@@ -1507,7 +1499,6 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	XLogRecPtr	inserted;
 	XLogRecPtr	reservedUpto;
 	XLogRecPtr	finishedUpto;
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
 	int			i;
 
 	if (MyProc == NULL)
@@ -1522,9 +1513,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 		return inserted;
 
 	/* Read the current insert position */
-	SpinLockAcquire(&Insert->insertpos_lck);
-	bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
+	bytepos = ReadInsertCurrBytePos();
 	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
 
 	/*
@@ -5017,12 +5006,13 @@ XLOGShmemInit(void)
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
 
-	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
 	pg_atomic_init_u64(&XLogCtl->logInsertResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logWriteResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->logFlushResult, InvalidXLogRecPtr);
 	pg_atomic_init_u64(&XLogCtl->unloggedLSN, InvalidXLogRecPtr);
+
+	pg_atomic_init_u64(&XLogCtl->Insert.CurrBytePos, 0);
 }
 
 /*
@@ -6018,8 +6008,11 @@ StartupXLOG(void)
 	 * previous incarnation.
 	 */
 	Insert = &XLogCtl->Insert;
-	Insert->PrevBytePos = XLogRecPtrToBytePos(endOfRecoveryInfo->lastRec);
-	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+	{
+		uint64		endOfLogBytePos = XLogRecPtrToBytePos(EndOfLog);
+
+		pg_atomic_write_u64(&Insert->CurrBytePos, endOfLogBytePos);
+	}
 
 	/*
 	 * Tricky point here: lastPage contains the *last* block that the LastRec
@@ -7005,7 +6998,7 @@ CreateCheckPoint(int flags)
 
 	if (shutdown)
 	{
-		XLogRecPtr	curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
+		XLogRecPtr	curInsert = XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 
 		/*
 		 * Compute new REDO record ptr = location of next XLOG record.
@@ -9434,14 +9427,7 @@ register_persistent_abort_backup_handler(void)
 XLogRecPtr
 GetXLogInsertRecPtr(void)
 {
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	uint64		current_bytepos;
-
-	SpinLockAcquire(&Insert->insertpos_lck);
-	current_bytepos = Insert->CurrBytePos;
-	SpinLockRelease(&Insert->insertpos_lck);
-
-	return XLogBytePosToRecPtr(current_bytepos);
+	return XLogBytePosToRecPtr(ReadInsertCurrBytePos());
 }
 
 /*
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 3596af06172..0851b62af93 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1165,7 +1165,7 @@ ValidXLogRecordHeader(XLogReaderState *state, XLogRecPtr RecPtr,
 		 * check guards against torn WAL pages where a stale but valid-looking
 		 * WAL record starts on a sector boundary.
 		 */
-		if (record->xl_prev != PrevRecPtr)
+		if (false && record->xl_prev != PrevRecPtr)
 		{
 			report_invalid_record(state,
 								  "record with incorrect prev-link %X/%X at %X/%X",
-- 
2.43.0
