02.01.2025 10:05, Zhou, Zhiguo wrote:
Hi all,
I am reaching out to solicit your insights and comments on a recent
proposal regarding "Lock-free XLog Reservation from WAL." We have
identified a scalability challenge in the current WAL insertion path:
reserving space in the WAL buffer requires updating two shared-memory
fields in XLogCtlInsert, namely CurrBytePos (the start position of the
current XLog record) and PrevBytePos (the prev-link to the previous
record). Today the spinlock XLogCtlInsert.insertpos_lck keeps the two
fields consistent, but it introduces lock contention that hinders the
parallelism of XLog insertions.
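For context, the contended section boils down to the following
(simplified from ReserveXLogInsertLocation in xlog.c):

    uint64      startbytepos;
    uint64      endbytepos;
    uint64      prevbytepos;

    /* Both shared positions are read and advanced under one spinlock;
     * this serializes every WAL reservation in the system. */
    SpinLockAcquire(&Insert->insertpos_lck);
    startbytepos = Insert->CurrBytePos;
    endbytepos = startbytepos + size;
    prevbytepos = Insert->PrevBytePos;
    Insert->CurrBytePos = endbytepos;
    Insert->PrevBytePos = startbytepos;
    SpinLockRelease(&Insert->insertpos_lck);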
To address this issue, we propose the following changes:
1. Removal of PrevBytePos: with only CurrBytePos (a single uint64
field) left, the reservation can be implemented with an atomic
operation (fetch_add); see the sketch after this list.
2. Updating the prev-link of the next XLog record: since the prev-link
of the next record always points to the head of the current one, we
will write slightly past the reserved memory range of the current
record to fill in the prev-link of the next record, regardless of
which backend acquires that next range. The next inserter will wait
until its prev-link has been filled in (it is needed for the CRC
calculation) before starting to copy its own record into the WAL.
3. Breaking the sequential-write convention: each backend first
updates the prev-link of the next record, then returns to the header
position to insert the current record. This reduces the dependency of
an XLog write on the preceding ones (compared with strictly
sequential writes).
4. Revised GetXLogBuffer: to support #3, we need to update this
function to separate the LSN it intends to access from the LSN it
publishes in the insertingAt field.
5. Increased NUM_XLOGINSERT_LOCKS: with the above changes, raising
NUM_XLOGINSERT_LOCKS, for example to 128, can effectively enhance
parallelism.
The attached patch passes the regression tests (make check, make
check-world). In a performance test of this POC on SPR (480 vCPUs),
the optimization helps the TPC-C benchmark scale better with the core
count; with all cores enabled, performance improves by 2.04x.
Before we proceed with further validation and refinement of the
patch, we are eager to hear the community's thoughts and comments on
this optimization, so that we can confirm our current work aligns
with expectations.
Good day, Zhiguo.
The idea looks great.
Minor issue:
- You didn't remove the use of `insertpos_lck` from
`ReserveXLogSwitch`. I initially thought it became un-synchronized
against `ReserveXLogInsertLocation`, but looking closer I found it is
synchronized via `WALInsertLockAcquireExclusive`.
Since there are no other uses of `insertpos_lck` after your patch, I
don't see why it should exist or be used in `ReserveXLogSwitch`.
Still, I'd prefer to see a CAS loop in this place, to be consistent
with the other non-locking accesses. That would also allow getting
rid of `WALInsertLockAcquireExclusive` (though that is probably not a
big issue).
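For concreteness, this is the kind of loop I have in mind (schematic:
the real byte-position math must account for page headers, and
`segbytes` is a hypothetical stand-in for the usable bytes per
segment):

    uint64      oldpos = pg_atomic_read_u64(&Insert->CurrBytePos);
    uint64      newpos;

    do
    {
        /* round up to the next segment boundary (schematic) */
        newpos = oldpos - (oldpos % segbytes) + segbytes;
    } while (!pg_atomic_compare_exchange_u64(&Insert->CurrBytePos,
                                             &oldpos, newpos));
    /* on failure, the CAS refreshes oldpos and we simply retry */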
Major issue:
- `SetPrevRecPtr` and `GetPrevRecPtr` do a non-atomic write/read on
platforms where MAXALIGN != 8 or without native 64-bit loads/stores.
The branch with `memcpy` is obviously non-atomic, but even the plain
pointer dereference in the "lucky" aligned case is not safe either.
I have no idea how to fix it at the moment.
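To spell out the hazard (with `prevlink_ptr` standing in for the
prev-link's location inside the WAL buffer):

    /* Without a native 64-bit store, this memcpy compiles to two
     * 32-bit stores, so a concurrent reader can observe a torn value:
     * half of the old prev-link and half of the new one. */
    memcpy(prevlink_ptr, &prevbytepos, sizeof(uint64));    /* writer */

    /* ... meanwhile, in another backend ... */
    memcpy(&prevbytepos, prevlink_ptr, sizeof(uint64));    /* reader */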
Readability issues:
- It would be good to add `Assert(ptr >= upto)` to `GetXLogBuffer`.
I had a hard time recognizing that `upto` is never in the future.
- Certainly, the final version has to have fixed and improved
comments. Many of the patch's ideas are non-obvious; I had a hard
time recognizing that the patch is not a piece of ... (excuse me for
the near-swear). Indeed, the patch is much better than it looks at
first sight.
I came up with an alternative idea yesterday, but looking closer at
your patch today I see yours is superior to mine (provided the atomic
access is fixed).
----
regards,
Yura Sokolov aka funny-falcon